
Synchronize event in the CUDAProductBase destructor #391

Merged

Conversation

makortel

PR description:

This PR is an updated version of #334:

Otherwise there are possibilities for weird races (e.g. a combination of non-ExternalWork producers, consumed-but-not-read CUDAProducts, and CUDA streams executing work later than expected, i.e. on the next event).

The difference is that this PR synchronizes wrt. the CUDA event (if there is one) instead of the stream (hence the new PR); this way the destruction of a product A does not have to wait for the completion of work queued on the CUDA stream after product A was made.
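A minimal sketch of the behaviour described above, assuming a simplified stand-in for CUDAProductBase (the class and member names below are illustrative, not the actual CMSSW code):

```cpp
#include <cuda_runtime.h>

// Illustrative stand-in for CUDAProductBase: the destructor waits only on the
// CUDA event recorded when the product's work was queued, not on the whole
// CUDA stream, so work queued on the stream afterwards is not waited for.
class ProductBaseSketch {
public:
  explicit ProductBaseSketch(cudaEvent_t event) : event_(event) {}

  ~ProductBaseSketch() {
    if (event_ != nullptr) {
      // Block until all work captured by the event has completed; this is the
      // synchronization discussed in this PR (instead of cudaStreamSynchronize()).
      cudaEventSynchronize(event_);
      cudaEventDestroy(event_);
    }
  }

private:
  cudaEvent_t event_ = nullptr;  // recorded on the producing CUDA stream
};
```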

PR validation:

Profiling workflow runs, unit test run.

On felk40 with 8 EDM streams and threads I see a ~1% throughput decrease.

@makortel
Author

After realizing that in our current "profiling workflow" the last module in the Schedule is a normal EDProducer launching GPU work, I think this fix is necessary.


@fwyzard

fwyzard commented Oct 24, 2019

The throughput scan extracted from the log seems to indicate that introducing the extra synchronisation does have an impact on the overall throughput, and requires more threads or streams to recover the current performance:

(throughput scan plot; numbers below)

| threads | development | testing |
| --- | --- | --- |
| 4 | 905.1 ± 4.3 ev/s | 826.9 ± 3.7 ev/s |
| 5 | 932.9 ± 3.7 ev/s | 881.6 ± 5.7 ev/s |
| 6 | 944.6 ± 5.2 ev/s | 909.4 ± 8.9 ev/s |
| 7 | 951.3 ± 3.4 ev/s | 930.3 ± 2.3 ev/s |
| 8 | 953.6 ± 3.4 ev/s | 939.1 ± 4.4 ev/s |
| 9 | 950.4 ± 6.0 ev/s | 938.6 ± 2.4 ev/s |
| 10 | 949.0 ± 4.2 ev/s | 948.3 ± 6.5 ev/s |
| 11 | 921.0 ± 2.8 ev/s | 949.8 ± 1.4 ev/s |
| 12 | 874.3 ± 11.7 ev/s | 946.5 ± 4.3 ev/s |

@makortel
Author

@fwyzard Which variant is this? The one without output on CPU, or the SoA/legacy on CPU?

@fwyzard

fwyzard commented Oct 24, 2019

The benchmarks are done with the customizePixelTracksForProfilingGPUOnly customisation, so they should be without transfers and conversions.

@makortel
Author

Thanks. Then the performance degradation is expected (although I'm a bit surprised by how large the drop is at ≤ 6 streams/threads). The variants with transfers at the end should see a close-to-zero effect.

@fwyzard

fwyzard commented Oct 24, 2019

> Thanks. Then the performance degradation is expected (although I'm a bit surprised by how large the drop is at ≤ 6 streams/threads).

No way around it?

> The variants with transfers at the end should see a close-to-zero effect.

OK, I can try to benchmark that as well.

@makortel
Author

> Thanks. Then the performance degradation is expected (although I'm a bit surprised by how large the drop is at ≤ 6 streams/threads).

> No way around it?

I'd need something similar to edm::WaitingTaskWithArenaHolder to prevent the framework from proceeding past the "postEvent" transition before the CUDA stream has finished its work (and I constantly forget to talk with Chris about that). On the other hand, with customizePixelTracksForProfilingGPUOnly there wouldn't be anything else to do in the meantime anyway when running with the same number of EDM streams as threads, so I'm not sure this approach would actually gain anything.
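For illustration, here is a minimal, framework-agnostic sketch of that idea; the Notifier type, the function names, and the use of std::promise are assumptions for this example, standing in for whatever edm::WaitingTaskWithArenaHolder would do in CMSSW. A host callback enqueued on the CUDA stream signals a waiting task once all previously queued work has completed, so the framework could be released without a blocking wait:

```cpp
#include <cuda_runtime.h>
#include <future>

// Hypothetical notification handle; in CMSSW this role would be played by
// something like edm::WaitingTaskWithArenaHolder, not by a std::promise.
struct Notifier {
  std::promise<void> done;
};

// Host callback (cudaHostFn_t): runs after all work queued on the stream
// before cudaLaunchHostFunc() has completed. It must not call CUDA API functions.
static void notifyDone(void* userData) {
  static_cast<Notifier*>(userData)->done.set_value();
}

// Enqueue the notification at the end of the event's GPU work (requires CUDA >= 10).
std::future<void> enqueueCompletionNotification(cudaStream_t stream, Notifier& notifier) {
  cudaLaunchHostFunc(stream, notifyDone, &notifier);
  return notifier.done.get_future();
}
```

The framework would then hold on to the future (or a task holder) and schedule the next event on this EDM stream only once it becomes ready, instead of paying a blocking wait in a product destructor.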

@fwyzard

fwyzard commented Oct 27, 2019

Mhm, I don't want to delay this or ask for an alternative solution based on an unrealistic benchmark.

> On the other hand, with customizePixelTracksForProfilingGPUOnly there wouldn't be anything else to do in the meantime anyway when running with the same number of EDM streams as threads, so I'm not sure this approach would actually gain anything.

If that's the case, do you have any guess as to what causes the drop in throughput then?

@makortel
Author

> If that's the case, do you have any guess as to what causes the drop in throughput then?

I think what happens now is that the CUDA stream synchronizes wrt. the EDM stream in raw-to-cluster (the only ExternalWork module). The program is therefore able to overlap the last kernels of event N with:

  • end of event N CPU work (e.g. destruction of event products)
  • beginning of event N+1 CPU work (e.g. source)
  • offlineBeamSpot of event N+1
  • offlineBeamSpotCUDA of event N+1
  • CPU work of siPixelClustersCUDAPreSplitting

So yes, compared to the cudaEventSynchronize() of this PR something could be gained with a mechanism that prevents the framework from proceeding to the preEvent transition of the next event on the EDM stream before the CUDA stream has finished its work. But I believe GPU modules whose output gets fully ignored (not consumed by any other GPU module, nor transferred to the CPU) are not useful for any "realistic" job.

@fwyzard

fwyzard commented Oct 29, 2019

I see - thanks for the explanation.
Let me check what happens if we compare the throughput with copies or conversions...

@fwyzard

fwyzard commented Oct 29, 2019

Validation summary

Reference release CMSSW_11_0_0_pre7 at 411b633
Development branch CMSSW_11_0_X_Patatrack at 6bfe94f
Testing PRs:

Validation plots

/RelValTTbar_13/CMSSW_10_6_0-PU25ns_106X_upgrade2018_realistic_v4-v1/GEN-SIM-DIGI-RAW

  • tracking validation plots and summary for workflow 10824.5
  • tracking validation plots and summary for workflow 10824.51
  • tracking validation plots and summary for workflow 10824.52

/RelValZMM_13/CMSSW_10_6_0-PU25ns_106X_upgrade2018_realistic_v4-v1/GEN-SIM-DIGI-RAW

  • tracking validation plots and summary for workflow 10824.5
  • tracking validation plots and summary for workflow 10824.51
  • tracking validation plots and summary for workflow 10824.52

/RelValTTbar_13/CMSSW_10_6_0-PU25ns_106X_upgrade2018_design_v3-v1/GEN-SIM-DIGI-RAW

  • tracking validation plots and summary for workflow 10824.5
  • tracking validation plots and summary for workflow 10824.51
  • tracking validation plots and summary for workflow 10824.52

Throughput plots

/EphemeralHLTPhysics1/Run2018D-v1/RAW run=323775 lumi=53

scan-136.86452.png
zoom-136.86452.png

logs and nvprof/nvvp profiles

/RelValTTbar_13/CMSSW_10_6_0-PU25ns_106X_upgrade2018_realistic_v4-v1/GEN-SIM-DIGI-RAW

  • reference release, workflow 10824.5
  • development release, workflow 10824.5
  • development release, workflow 10824.51
  • development release, workflow 10824.52
    • ✔️ step3.py: log
    • ✔️ profile.py: log
    • ✔️ cuda-memcheck --tool initcheck (report, log) did not find any errors
    • ✔️ cuda-memcheck --tool memcheck --leak-check full --report-api-errors all (report, log) did not find any errors
    • ✔️ cuda-memcheck --tool synccheck (report, log) did not find any errors
  • development release, workflow 136.86452
  • testing release, workflow 10824.5
  • testing release, workflow 10824.51
  • testing release, workflow 10824.52
    • ✔️ step3.py: log
    • ✔️ profile.py: log
    • ✔️ cuda-memcheck --tool initcheck (report, log) did not find any errors
    • ✔️ cuda-memcheck --tool memcheck --leak-check full --report-api-errors all (report, log) did not find any errors
    • ✔️ cuda-memcheck --tool synccheck (report, log) did not find any errors
  • testing release, workflow 136.86452

/RelValZMM_13/CMSSW_10_6_0-PU25ns_106X_upgrade2018_realistic_v4-v1/GEN-SIM-DIGI-RAW

  • reference release, workflow 10824.5
  • development release, workflow 10824.5
  • development release, workflow 10824.51
  • development release, workflow 10824.52
    • ✔️ step3.py: log
    • ✔️ profile.py: log
    • ✔️ cuda-memcheck --tool initcheck (report, log) did not find any errors
    • ✔️ cuda-memcheck --tool memcheck --leak-check full --report-api-errors all (report, log) did not find any errors
    • ✔️ cuda-memcheck --tool synccheck (report, log) did not find any errors
  • development release, workflow 136.86452
  • testing release, workflow 10824.5
  • testing release, workflow 10824.51
  • testing release, workflow 10824.52
    • ✔️ step3.py: log
    • ✔️ profile.py: log
    • ✔️ cuda-memcheck --tool initcheck (report, log) did not find any errors
    • ✔️ cuda-memcheck --tool memcheck --leak-check full --report-api-errors all (report, log) did not find any errors
    • ✔️ cuda-memcheck --tool synccheck (report, log) did not find any errors
  • testing release, workflow 136.86452

/RelValTTbar_13/CMSSW_10_6_0-PU25ns_106X_upgrade2018_design_v3-v1/GEN-SIM-DIGI-RAW

  • reference release, workflow 10824.5
  • development release, workflow 10824.5
  • development release, workflow 10824.51
  • development release, workflow 10824.52
    • ✔️ step3.py: log
    • ✔️ profile.py: log
    • ✔️ cuda-memcheck --tool initcheck (report, log) did not find any errors
    • ✔️ cuda-memcheck --tool memcheck --leak-check full --report-api-errors all (report, log) did not find any errors
    • ✔️ cuda-memcheck --tool synccheck (report, log) did not find any errors
  • development release, workflow 136.86452
  • testing release, workflow 10824.5
  • testing release, workflow 10824.51
  • testing release, workflow 10824.52
    • ✔️ step3.py: log
    • ✔️ profile.py: log
    • ✔️ cuda-memcheck --tool initcheck (report, log) did not find any errors
    • ✔️ cuda-memcheck --tool memcheck --leak-check full --report-api-errors all (report, log) did not find any errors
    • ✔️ cuda-memcheck --tool synccheck (report, log) did not find any errors
  • testing release, workflow 136.86452

Logs

The full log is available at https://patatrack.web.cern.ch/patatrack/validation/pulls/7b16e3f0df4a11cf31250808502d749d57212707/log.

@fwyzard

fwyzard commented Oct 29, 2019

OK, here is the comparison with a wider X axis.

GPU only

zoom-gpuonly

with transfers

zoom-transfer

Indeed once we enable transfers the impact is almost negligible.

@fwyzard fwyzard merged commit 8177676 into cms-patatrack:CMSSW_11_0_X_Patatrack Oct 29, 2019
fwyzard pushed or added commits that referenced this pull request on Oct 8, Oct 20, Oct 23, Nov 6, Nov 16, Nov 27, Dec 26 and Dec 29, 2020, and on Jan 13, Jan 15, Mar 23 and Apr 1, 2021 (23 commits in total), each carrying the same message:

Otherwise there are possibilities for weird races, e.g. combination of
non-ExternalWork producers, consumed-but-not-read CUDAProducts, CUDA
streams executing work later than expected (= on the next event).