
Synchronize event in the CUDAProductBase destructor #391

Merged

Conversation

makortel

PR description:

This PR is an updated version of #334:

Otherwise there are possibilities for weird races (e.g. a combination of non-ExternalWork producers, consumed-but-not-read CUDAProducts, and CUDA streams executing work later than expected, i.e. on the next event).

The difference is that this PR synchronizes wrt. the CUDA event (if there is one) instead of the stream (hence the new PR); this way the destruction of a product A does not have to wait for the completion of work queued on the CUDA stream after product A was made.
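A minimal sketch of the behaviour described above, assuming a simplified stand-in for CUDAProductBase (the class and member names below are illustrative, not the actual CMSSW code):

```cpp
#include <cuda_runtime.h>

// Illustrative stand-in for CUDAProductBase: the destructor waits only on the
// CUDA event recorded when the product's work was queued, not on the whole
// CUDA stream, so work queued on the stream afterwards is not waited for.
class ProductBaseSketch {
public:
  explicit ProductBaseSketch(cudaEvent_t event) : event_(event) {}

  ~ProductBaseSketch() {
    if (event_ != nullptr) {
      // Block until all work captured by the event has completed; this is the
      // synchronization discussed in this PR (instead of cudaStreamSynchronize()).
      cudaEventSynchronize(event_);
      cudaEventDestroy(event_);
    }
  }

private:
  cudaEvent_t event_ = nullptr;  // recorded on the producing CUDA stream
};
```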

PR validation:

Profiling workflow runs, unit test run.

On felk40 with 8 EDM streams and threads I see a ~1% throughput decrease.

@makortel
Author

After realizing that in our current "profiling workflow" the last module in the Schedule is a normal EDProducer launching GPU work, I think this fix is necessary.


@fwyzard

fwyzard commented Oct 24, 2019

The throughput scan extracted from the log seems to indicate that introducing the extra synchronisation does have an impact on the overall throughput, and requires more threads or streams to recover the current performance:

(throughput scan plot; numbers below)

| threads | development | testing |
| --- | --- | --- |
| 4 | 905.1 ± 4.3 ev/s | 826.9 ± 3.7 ev/s |
| 5 | 932.9 ± 3.7 ev/s | 881.6 ± 5.7 ev/s |
| 6 | 944.6 ± 5.2 ev/s | 909.4 ± 8.9 ev/s |
| 7 | 951.3 ± 3.4 ev/s | 930.3 ± 2.3 ev/s |
| 8 | 953.6 ± 3.4 ev/s | 939.1 ± 4.4 ev/s |
| 9 | 950.4 ± 6.0 ev/s | 938.6 ± 2.4 ev/s |
| 10 | 949.0 ± 4.2 ev/s | 948.3 ± 6.5 ev/s |
| 11 | 921.0 ± 2.8 ev/s | 949.8 ± 1.4 ev/s |
| 12 | 874.3 ± 11.7 ev/s | 946.5 ± 4.3 ev/s |

@makortel
Author

@fwyzard Which variant is this? The one without output on CPU, or the SoA/legacy on CPU?

@fwyzard

fwyzard commented Oct 24, 2019

The benchmarks are done with the customizePixelTracksForProfilingGPUOnly customisation, so they should be without transfers and conversions.

@makortel
Author

Thanks. Then the performance degradation is expected (although I'm a bit surprised by how large the drop is at ≤ 6 streams/threads). The variants with transfers at the end should see a close-to-zero effect.

@fwyzard

fwyzard commented Oct 24, 2019

> Thanks. Then the performance degradation is expected (although I'm a bit surprised by how large the drop is at ≤ 6 streams/threads).

No way around it?

> The variants with transfers at the end should see a close-to-zero effect.

OK, I can try to benchmark that as well.

@makortel
Author

> Thanks. Then the performance degradation is expected (although I'm a bit surprised by how large the drop is at ≤ 6 streams/threads).

> No way around it?

I'd need something similar to edm::WaitingTaskWithArenaHolder to prevent the framework from proceeding past the "postEvent" transition before the CUDA stream has finished its work (and I constantly forget to talk with Chris about that). On the other hand, with customizePixelTracksForProfilingGPUOnly there wouldn't be anything else to do in the meantime anyway when running with the same number of EDM streams as threads, so I'm not sure this approach would actually gain anything.
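For illustration, here is a minimal, framework-agnostic sketch of that idea; the Notifier type, the function names, and the use of std::promise are assumptions for this example, standing in for whatever edm::WaitingTaskWithArenaHolder would do in CMSSW. A host callback enqueued on the CUDA stream signals a waiting task once all previously queued work has completed, so the framework could be released without a blocking wait:

```cpp
#include <cuda_runtime.h>
#include <future>

// Hypothetical notification handle; in CMSSW this role would be played by
// something like edm::WaitingTaskWithArenaHolder, not by a std::promise.
struct Notifier {
  std::promise<void> done;
};

// Host callback (cudaHostFn_t): runs after all work queued on the stream
// before cudaLaunchHostFunc() has completed. It must not call CUDA API functions.
static void notifyDone(void* userData) {
  static_cast<Notifier*>(userData)->done.set_value();
}

// Enqueue the notification at the end of the event's GPU work (requires CUDA >= 10).
std::future<void> enqueueCompletionNotification(cudaStream_t stream, Notifier& notifier) {
  cudaLaunchHostFunc(stream, notifyDone, &notifier);
  return notifier.done.get_future();
}
```

The framework would then hold on to the future (or a task holder) and schedule the next event on this EDM stream only once it becomes ready, instead of paying a blocking wait in a product destructor.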

@fwyzard

fwyzard commented Oct 27, 2019

Mhm, I don't want to delay this or ask for an alternative solution based on an unrealistic benchmark.

> On the other hand, with customizePixelTracksForProfilingGPUOnly there wouldn't be anything else to do in the meantime anyway when running with the same number of EDM streams as threads, so I'm not sure this approach would actually gain anything.

If that's the case, do you have any guess as to what causes the drop in throughput then?

@makortel
Author

> If that's the case, do you have any guess as to what causes the drop in throughput then?

I think what happens now is that the CUDA stream synchronizes wrt. the EDM stream in raw-to-cluster (the only ExternalWork module). The program is therefore able to overlap the last kernels of event N with:

  • end of event N CPU work (e.g. destruction of event products)
  • beginning of event N+1 CPU work (e.g. source)
  • offlineBeamSpot of event N+1
  • offlineBeamSpotCUDA of event N+1
  • CPU work of siPixelClustersCUDAPreSplitting

So yes, compared to the cudaEventSynchronize() of this PR something could be gained with a mechanism that prevents the framework from proceeding to the preEvent transition of the next event on the EDM stream before the CUDA stream has finished its work. But I believe GPU modules whose output gets fully ignored (not consumed by any other GPU module, nor transferred to the CPU) are not useful for any "realistic" job.

@fwyzard

fwyzard commented Oct 29, 2019

I see - thanks for the explanation.
Let me check what happens if we compare the throughput with copies or conversions...

@fwyzard

fwyzard commented Oct 29, 2019

Validation summary

Reference release CMSSW_11_0_0_pre7 at 411b633
Development branch CMSSW_11_0_X_Patatrack at 6bfe94f
Testing PRs:

Validation plots

/RelValTTbar_13/CMSSW_10_6_0-PU25ns_106X_upgrade2018_realistic_v4-v1/GEN-SIM-DIGI-RAW

  • tracking validation plots and summary for workflow 10824.5
  • tracking validation plots and summary for workflow 10824.51
  • tracking validation plots and summary for workflow 10824.52

/RelValZMM_13/CMSSW_10_6_0-PU25ns_106X_upgrade2018_realistic_v4-v1/GEN-SIM-DIGI-RAW

  • tracking validation plots and summary for workflow 10824.5
  • tracking validation plots and summary for workflow 10824.51
  • tracking validation plots and summary for workflow 10824.52

/RelValTTbar_13/CMSSW_10_6_0-PU25ns_106X_upgrade2018_design_v3-v1/GEN-SIM-DIGI-RAW

  • tracking validation plots and summary for workflow 10824.5
  • tracking validation plots and summary for workflow 10824.51
  • tracking validation plots and summary for workflow 10824.52

Throughput plots

/EphemeralHLTPhysics1/Run2018D-v1/RAW run=323775 lumi=53

scan-136.86452.png
zoom-136.86452.png

logs and nvprof/nvvp profiles

/RelValTTbar_13/CMSSW_10_6_0-PU25ns_106X_upgrade2018_realistic_v4-v1/GEN-SIM-DIGI-RAW

  • reference release, workflow 10824.5
  • development release, workflow 10824.5
  • development release, workflow 10824.51
  • development release, workflow 10824.52
    • ✔️ step3.py: log
    • ✔️ profile.py: log
    • ✔️ cuda-memcheck --tool initcheck (report, log) did not find any errors
    • ✔️ cuda-memcheck --tool memcheck --leak-check full --report-api-errors all (report, log) did not find any errors
    • ✔️ cuda-memcheck --tool synccheck (report, log) did not find any errors
  • development release, workflow 136.86452
  • testing release, workflow 10824.5
  • testing release, workflow 10824.51
  • testing release, workflow 10824.52
    • ✔️ step3.py: log
    • ✔️ profile.py: log
    • ✔️ cuda-memcheck --tool initcheck (report, log) did not find any errors
    • ✔️ cuda-memcheck --tool memcheck --leak-check full --report-api-errors all (report, log) did not find any errors
    • ✔️ cuda-memcheck --tool synccheck (report, log) did not find any errors
  • testing release, workflow 136.86452

/RelValZMM_13/CMSSW_10_6_0-PU25ns_106X_upgrade2018_realistic_v4-v1/GEN-SIM-DIGI-RAW

  • reference release, workflow 10824.5
  • development release, workflow 10824.5
  • development release, workflow 10824.51
  • development release, workflow 10824.52
    • ✔️ step3.py: log
    • ✔️ profile.py: log
    • ✔️ cuda-memcheck --tool initcheck (report, log) did not find any errors
    • ✔️ cuda-memcheck --tool memcheck --leak-check full --report-api-errors all (report, log) did not find any errors
    • ✔️ cuda-memcheck --tool synccheck (report, log) did not find any errors
  • development release, workflow 136.86452
  • testing release, workflow 10824.5
  • testing release, workflow 10824.51
  • testing release, workflow 10824.52
    • ✔️ step3.py: log
    • ✔️ profile.py: log
    • ✔️ cuda-memcheck --tool initcheck (report, log) did not find any errors
    • ✔️ cuda-memcheck --tool memcheck --leak-check full --report-api-errors all (report, log) did not find any errors
    • ✔️ cuda-memcheck --tool synccheck (report, log) did not find any errors
  • testing release, workflow 136.86452

/RelValTTbar_13/CMSSW_10_6_0-PU25ns_106X_upgrade2018_design_v3-v1/GEN-SIM-DIGI-RAW

  • reference release, workflow 10824.5
  • development release, workflow 10824.5
  • development release, workflow 10824.51
  • development release, workflow 10824.52
    • ✔️ step3.py: log
    • ✔️ profile.py: log
    • ✔️ cuda-memcheck --tool initcheck (report, log) did not find any errors
    • ✔️ cuda-memcheck --tool memcheck --leak-check full --report-api-errors all (report, log) did not find any errors
    • ✔️ cuda-memcheck --tool synccheck (report, log) did not find any errors
  • development release, workflow 136.86452
  • testing release, workflow 10824.5
  • testing release, workflow 10824.51
  • testing release, workflow 10824.52
    • ✔️ step3.py: log
    • ✔️ profile.py: log
    • ✔️ cuda-memcheck --tool initcheck (report, log) did not find any errors
    • ✔️ cuda-memcheck --tool memcheck --leak-check full --report-api-errors all (report, log) did not find any errors
    • ✔️ cuda-memcheck --tool synccheck (report, log) did not find any errors
  • testing release, workflow 136.86452

Logs

The full log is available at https://patatrack.web.cern.ch/patatrack/validation/pulls/7b16e3f0df4a11cf31250808502d749d57212707/log.

@fwyzard

fwyzard commented Oct 29, 2019

OK, here is the comparison with a wider X axis.

GPU only

zoom-gpuonly

with transfers

zoom-transfer

Indeed once we enable transfers the impact is almost negligible.

@fwyzard fwyzard merged commit 8177676 into cms-patatrack:CMSSW_11_0_X_Patatrack Oct 29, 2019
fwyzard pushed or added commits that referenced this pull request on Oct 8, Oct 20, Oct 23, Nov 6, Nov 16, Nov 27, Dec 26 and Dec 29, 2020, and on Jan 13, Jan 15, Mar 23 and Apr 1, 2021 (23 commits in total), each carrying the same message:

Otherwise there are possibilities for weird races, e.g. combination of
non-ExternalWork producers, consumed-but-not-read CUDAProducts, CUDA
streams executing work later than expected (= on the next event).