Quda work ndg force #612

Merged: 23 commits merged into master on May 8, 2024
Conversation

@Marcogarofalo (Contributor)

No description provided.

@kostrzewa (Member) commented Apr 19, 2024

Awesome, this is working great. Here's a comparison on JUWELS Booster on 4 nodes, 64c128 at the physical point, with consistent random numbers.

no force offloading (for reference)

00001054 0.545663519531 0.020930789411 9.792867e-01 81 14613 1688 261376 327 32074 36839 0 1658 10002 1160 54301 143 9753 4907 97531 1 1.469851e+04 3.131941e-01
00001055 0.545676315454 -0.208904463798 1.232327e+00 80 14626 1609 261254 323 32075 36858 0 1706 9976 1140 54258 140 9681 4871 97607 1 1.460724e+04 3.132264e-01
00001056 0.545682037202 0.138889044523 8.703246e-01 80 14594 1654 260446 324 31963 35986 0 1649 10182 1133 54098 148 9950 4767 96687 1 1.474566e+04 3.132276e-01
00001057 0.545682509451 0.226925754920 7.969800e-01 80 14507 1617 258577 320 31691 35898 0 1681 10081 1123 53740 140 9920 4733 96450 1 1.461649e+04 3.132220e-01

light force offloading only

00001401 0.545681973756 0.569591499865 5.657565e-01 80 14628 1603 259506 315 31451 35516 0 1624 10011 1109 53519 138 9568 4718 96323 1 1.242204e+04 3.131983e-01
00001402 0.545701067640 0.041082913056 9.597496e-01 81 14689 1610 261058 316 31719 35647 0 1627 10023 1111 53903 145 9794 4727 97032 1 1.396068e+04 3.132205e-01
00001403 0.545696849214 0.160429969430 8.517775e-01 79 14590 1584 259024 311 31397 35699 0 1634 10081 1099 53497 141 9808 4745 96752 1 1.282431e+04 3.132271e-01
00001404 0.545662972281 -0.045102979988 1.046136e+00 80 14479 1594 256674 314 31056 35580 0 1608 10038 1103 52961 143 9822 4731 95628 1 1.241667e+04 3.131858e-01

+ ND force offloading

(first trajectory includes tuning)

00001401 0.545681973797 0.569108584896 5.660298e-01 80 14628 1603 259486 315 31451 35526 0 1626 10007 1107 53504 137 9569 4719 96307 1 1.025544e+04 3.131983e-01
00001402 0.545701067693 0.040016509593 9.607736e-01 81 14690 1611 261098 316 31727 35650 0 1627 10010 1112 53930 136 9832 4725 97018 1 8.803802e+03 3.132205e-01
00001403 0.545696849273 0.159103434533 8.529081e-01 79 14586 1583 258986 312 31393 35698 0 1636 10063 1101 53510 139 9783 4743 96757 1 8.879981e+03 3.132271e-01
00001404 0.545662972387 -0.047294547781 1.048431e+00 80 14479 1596 256673 313 31054 35575 0 1606 10046 1102 52984 143 9825 4727 95639 1 9.201252e+03 3.131858e-01
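For reference, the second-to-last column of these output.data lines appears to be the wall-clock time per trajectory in seconds (roughly 1.46e4 without offloading, 1.24e4 to 1.40e4 with the light force offloaded, and 8.8e3 to 1.03e4 with the ND force offloaded as well). A minimal sketch for pulling that column out for a side-by-side comparison; the file name and the column position are assumptions here, not something fixed by this PR:

```cpp
// Hypothetical helper, not part of this PR: print the trajectory number and the
// assumed time-per-trajectory column (second-to-last field) of a tmLQCD
// output.data file, so runs with and without offloading can be diffed quickly.
#include <fstream>
#include <iostream>
#include <sstream>
#include <string>
#include <vector>

int main(int argc, char** argv) {
  std::ifstream in(argc > 1 ? argv[1] : "output.data");
  std::string line;
  while (std::getline(in, line)) {
    std::istringstream iss(line);
    std::vector<std::string> cols;
    for (std::string tok; iss >> tok;) cols.push_back(tok);
    if (cols.size() < 2) continue;
    // cols.front(): trajectory number; cols[cols.size()-2]: time in seconds (assumed)
    std::cout << cols.front() << "  " << cols[cols.size() - 2] << "\n";
  }
  return 0;
}
```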

@kostrzewa (Member)

The speed-up will be even greater on a machine like Leonardo or LUMI-G. I'll let Andrey know that he can run some initial tests for the finite-temperature runs. I'll put you in CC, @Marcogarofalo.

@urbach (Contributor) commented Apr 19, 2024 via email

@kostrzewa (Member)

14500 -> 12500 -> 9000 seconds per trajectory!

@kostrzewa (Member)

@Marcogarofalo There's an issue with the timing on the QUDA side. It seems like the time spent in computeTMCloverForceQuda is counted internally multiple times.

@kostrzewa (Member)

What I mean is the following:

   computeTMCloverForceQuda Total time =   382.651 secs
                 download     =    95.261 secs ( 24.895%),       with     2582 calls at 3.689e+04 us per call
                   upload     =    83.468 secs ( 21.813%),       with     1033 calls at 8.080e+04 us per call
                     init     =    15.232 secs (  3.981%),       with    26165 calls at 5.821e+02 us per call
                  compute     =  7920.459 secs (2069.889%),      with   292924 calls at 2.704e+04 us per call
                    comms     =    54.129 secs ( 14.146%),       with     6426 calls at 8.423e+03 us per call
                     free     =    20.674 secs (  5.403%),       with   236599 calls at 8.738e+01 us per call
        total accounted       =  8189.223 secs (2140.127%)
        total missing         = -7806.571 secs (-2040.127%)
WARNING: Accounted time  8189.223 secs in computeTMCloverForceQuda is greater than total time   382.651 secs

This doesn't affect anything on our side, but it does mess with the QUDA profile.
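For illustration only (this is a generic sketch of the failure mode, not QUDA's actual profiling code): accounted time can exceed the total when the same wall-clock interval is added to a profile bucket once per level of a nested call tree, e.g.:

```cpp
// Generic sketch of how a hierarchical profiler can over-count: if every nested
// call adds its own duration to the same "compute" bucket, overlapping intervals
// in the call tree are summed more than once, so the accounted time can exceed
// the wall-clock total of the enclosing region.
#include <chrono>
#include <iostream>
#include <thread>

struct Profile {
  double total = 0.0;    // wall-clock time of the enclosing region
  double compute = 0.0;  // sum of all nested "compute" intervals
};

static double seconds_since(std::chrono::steady_clock::time_point t0) {
  return std::chrono::duration<double>(std::chrono::steady_clock::now() - t0).count();
}

void inner_kernel(Profile& p) {
  auto t0 = std::chrono::steady_clock::now();
  std::this_thread::sleep_for(std::chrono::milliseconds(10));  // stand-in for real work
  p.compute += seconds_since(t0);
}

void outer_kernel(Profile& p) {
  auto t0 = std::chrono::steady_clock::now();
  for (int i = 0; i < 5; ++i) inner_kernel(p);  // each nested call adds its own time
  p.compute += seconds_since(t0);  // outer duration already contains the inner ones
}

int main() {
  Profile p;
  auto t0 = std::chrono::steady_clock::now();
  outer_kernel(p);
  p.total = seconds_since(t0);
  // compute is roughly 2x total: the same ~50 ms is counted at both nesting levels
  std::cout << "total " << p.total << " s, accounted compute " << p.compute << " s\n";
}
```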

@Marcogarofalo (Contributor, Author)

Here is a comparison of the data before and after the last commit; the speedup cannot be seen in such a small test:

debug level 1 rel precision + no strict checks, b415eb6

00000000 0.112190826944 8544.821712773226 0.000000e+00 56 182 128 246 209 338 0 1.132254e+00 5.069101e-02
00000001 0.112190826944 10141.057194254070 0.000000e+00 56 184 127 245 206 340 0 2.598577e-01 5.069101e-02
00000002 0.112190826944 8569.440461946775 0.000000e+00 55 182 125 241 206 335 0 2.536940e-01 5.069101e-02
00000003 0.112190826944 7382.444491649191 0.000000e+00 55 181 124 239 204 334 0 2.550959e-01 5.069101e-02

debug level 1 rel precision + no strict checks, e29573f

00000000 0.112190826944 8544.821712773226 0.000000e+00 56 182 128 246 209 338 0 4.350938e+00 5.069101e-02
00000001 0.112190826944 10141.057194254070 0.000000e+00 56 184 127 245 206 340 0 2.661940e-01 5.069101e-02
00000002 0.112190826944 8569.440461946775 0.000000e+00 55 182 125 241 206 335 0 2.616898e-01 5.069101e-02
00000003 0.112190826944 7382.444491649191 0.000000e+00 55 181 124 239 204 334 0 2.592305e-01 5.069101e-02

debug level 4 rel precision + no strict checks, b415eb6

00000000 0.112190826944 8544.821712773226 0.000000e+00 56 364 128 492 209 676 0 1.407556e+00 5.069101e-02
00000001 0.112190826944 10141.057194254070 0.000000e+00 56 368 127 490 206 680 0 5.201004e-01 5.069101e-02
00000002 0.112190826944 8569.440461946775 0.000000e+00 55 364 125 482 206 670 0 5.171427e-01 5.069101e-02
00000003 0.112190826944 7382.444491649191 0.000000e+00 55 362 124 478 204 668 0 5.416352e-01 5.069101e-02

debug level 4 rel precision + no strict checks, e29573f

00000000 0.112190826944 8544.821712773226 0.000000e+00 56 364 128 492 209 676 0 1.419756e+00 5.069101e-02
00000001 0.112190826944 10141.057194254070 0.000000e+00 56 368 127 490 206 680 0 4.816360e-01 5.069101e-02
00000002 0.112190826944 8569.440461946775 0.000000e+00 55 364 125 482 206 670 0 4.727600e-01 5.069101e-02
00000003 0.112190826944 7382.444491649191 0.000000e+00 55 362 124 478 204 668 0 4.695284e-01 5.069101e-02
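Reading the same (assumed) time-per-trajectory column as above, both commits sit at roughly 0.25 to 0.27 s per trajectory at debug level 1, and at debug level 4 the numbers are about 0.52 to 0.54 s versus 0.47 to 0.48 s, so any gain at this small volume is marginal at best.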

@kostrzewa kostrzewa marked this pull request as ready for review May 8, 2024 12:45
@kostrzewa kostrzewa self-requested a review May 8, 2024 12:46
@kostrzewa kostrzewa merged commit 950a3a1 into master May 8, 2024
3 checks passed