Synchronize stream in the CUDAProductBase destructor #334

Conversation

makortel commented Apr 23, 2019

PR description:

Otherwise there are possibilities for weird races (e.g. a combination of non-ExternalWork producers, consumed-but-not-read CUDAProducts, and CUDA streams executing work later than expected, i.e. on the next event).
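
For illustration only, a minimal sketch of the idea; the class and member names below are hypothetical stand-ins for CUDAProductBase and its stream, not the actual CMSSW code:

```cuda
#include <cuda_runtime.h>

// Hypothetical stand-in for CUDAProductBase: it remembers the CUDA stream on
// which the wrapped data product was produced.
class ProductBaseSketch {
public:
  explicit ProductBaseSketch(cudaStream_t stream) : stream_(stream) {}

  // The change in this PR, in sketch form: block in the destructor until all
  // work queued on the product's stream has finished, so nothing the product
  // refers to can be reused (e.g. on the next event) while the GPU is still
  // working on it.
  ~ProductBaseSketch() { cudaStreamSynchronize(stream_); }

private:
  cudaStream_t stream_;
};
```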

PR validation:

Profiling workflow runs, unit tests run.

VinInn commented Apr 24, 2019

Adding this to #329 does not help much on V100 (as expected, I suppose); as with any action that slows down execution, the crashes became rarer...

makortel (Author)

> Adding this to #329 does not help much on V100 (as expected, I suppose)

Right, I didn't expect it to fix the crashes, since (to my understanding) we already, in practice, synchronize each CUDA stream as the last action of each event. So this PR is more about correctness in certain corner cases (and it should really be done more efficiently).

fwyzard pushed a commit that referenced this pull request May 15, 2019

fwyzard commented Sep 10, 2019

@makortel, what shall we do with this?

makortel (Author) commented Sep 10, 2019

I think the problem itself should be addressed. As a reminder (to myself as well), weird behavior can happen if

  • non-ExternalWork EDProducer A makes a GPU product B
    • the producer itself does not synchronize with the CUDA stream
    • A has some device memory allocated as member data (per EDM stream), to be reused in the next event
  • EDModule C declares that it consumes B
    • so that producer A will also be run unscheduled
  • C does not actually read B
    • C does not synchronize with the CUDA stream either

In this case it may happen that

  • Nobody synchronizes the CUDA stream that produces B (ok, this "happens" in any case)
  • The EDM stream may proceed to the next event (2) before the CUDA stream producing B has finished
  • The producer A uses another CUDA stream to produce B' on the next event
  • Now there is a race condition between kernels running on events 1 and 2

I don't like synchronizing the stream, but that's much simpler than any alternative. It would be interesting to check the impact on performance, but for that I should rebase the PR.
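
For concreteness, a self-contained sketch of this race using plain CUDA runtime calls rather than the framework types; the `produce` kernel and `reusedBuffer` are hypothetical stand-ins for producer A's work and its per-EDM-stream member data. Without a synchronization on the first stream (which the destructor-side `cudaStreamSynchronize` in this PR would provide), the kernels queued for events 1 and 2 can be in flight concurrently on different streams while writing the same buffer:

```cuda
#include <cuda_runtime.h>

// Stand-in for the work producer A queues for its product on a given event.
__global__ void produce(float* buf, int n, float value) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n)
    buf[i] = value;
}

int main() {
  constexpr int n = 1 << 20;

  // Stand-in for A's device memory held as member data and reused every event.
  float* reusedBuffer;
  cudaMalloc(&reusedBuffer, n * sizeof(float));

  cudaStream_t stream1, stream2;
  cudaStreamCreate(&stream1);
  cudaStreamCreate(&stream2);

  // Event 1: A queues work for product B on stream1; nobody waits for it,
  // because B is consumed but never read and A is not an ExternalWork module.
  produce<<<(n + 255) / 256, 256, 0, stream1>>>(reusedBuffer, n, 1.f);

  // Event 2: A reuses the same member buffer on a different CUDA stream.
  // Without cudaStreamSynchronize(stream1) in between (which the destructor
  // of the event-1 product would now provide), the two kernels race.
  produce<<<(n + 255) / 256, 256, 0, stream2>>>(reusedBuffer, n, 2.f);

  cudaDeviceSynchronize();
  cudaFree(reusedBuffer);
  cudaStreamDestroy(stream1);
  cudaStreamDestroy(stream2);
  return 0;
}
```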

makortel (Author)

Superseded by #391.

makortel closed this Sep 20, 2019