Makes UR cuda backend compatible with MPI #2077

Open
wants to merge 3 commits into main
Conversation

JackAKirk
Contributor

@JackAKirk JackAKirk commented Sep 10, 2024

Fixes intel/llvm#15251, provided that MPI/SYCL codes are updated as in codeplaysoftware/SYCL-samples#33.

I will follow up with a corresponding fix for HIP, which is largely copy-paste; I could add it to this PR, but that would require #1830 to be merged first.

Background

The origin of the above issue is that, since we no longer have a single CUDA device per platform, we currently have every device initialize its primary CUcontext at platform instantiation, even when the runtime does not require it.
The MPI interface can work with CUDA/ROCm awareness because it is assumed that the user sets only a single HIP/CUDA device per process prior to each MPI call. The problem with having multiple CUDA devices set (i.e. their primary CUcontexts instantiated), as happens if you use e.g. the default sycl::context with multiple devices included, is that MPI calls will operate on every instantiated device whenever at least one active process has that device's primary CUcontext set.
This can lead to memory leaks, as described in intel/llvm#15251.

This PR fixes the issue by removing the platform-scope CUcontext instantiation and making sycl::context (and the UR context in the CUDA backend) responsible for instantiating and releasing the native CUcontexts associated with all devices included in that sycl::context. This works due to the universal usage of sycl::context and ur_context_handle_t_ in the SYCL/UR APIs. I think this is the most natural solution for SYCL compatibility; it leads to only negligible one-time overhead in platform instantiation, and it has the important benefit of supporting application-critical technologies, MPI and *CCL, with SYCL.
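
To make the ownership model concrete, here is a minimal sketch of the idea (not the actual adapter code; the structs are simplified stand-ins for the real UR handle types): the context retains each member device's primary CUcontext on construction and releases it on destruction.

```cpp
// Simplified sketch: the UR context owns the primary CUcontexts of its devices.
#include <cuda.h>
#include <utility>
#include <vector>

struct ur_device_handle_t_ {
  CUdevice CuDevice;
};

struct ur_context_handle_t_ {
  std::vector<ur_device_handle_t_ *> Devices;

  explicit ur_context_handle_t_(std::vector<ur_device_handle_t_ *> Devs)
      : Devices(std::move(Devs)) {
    // Instantiate the primary CUcontext of every device in this context.
    // (Error checking omitted for brevity.)
    for (auto *Dev : Devices) {
      CUcontext Ctx = nullptr;
      cuDevicePrimaryCtxRetain(&Ctx, Dev->CuDevice);
    }
  }

  ~ur_context_handle_t_() {
    // Release them again, so a device that is not part of any live
    // sycl::context keeps no primary CUcontext alive in this process,
    // which is what CUDA-aware MPI relies on.
    for (auto *Dev : Devices)
      cuDevicePrimaryCtxRelease(Dev->CuDevice);
  }
};
```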

Without considering MPI compatibility, sycl::context has been a free parameter (so to speak) that was not functionally used in the CUDA/HIP backends; it turns out to be fortunate that it exists, since otherwise there would not have been a straightforward way to make MPI work with SYCL, as far as I can see.

All of this is done without changing the UR/oneAPI specs.

Signed-off-by: JackAKirk <jack.kirk@codeplay.com>
@github-actions github-actions bot added the cuda CUDA adapter specific issues label Sep 10, 2024
JackAKirk added a commit to JackAKirk/llvm that referenced this pull request Sep 10, 2024
oneapi-src/unified-runtime#2077

Signed-off-by: JackAKirk <jack.kirk@codeplay.com>
@JackAKirk
Contributor Author

tested here:

intel/llvm#15349

@JackAKirk
Contributor Author

JackAKirk commented Sep 12, 2024

Here is some more information on why this design choice was made, along with a comparison with the alternatives.

Firstly, we start with the requirement that we fix intel/llvm#15251.
If we didn't, MPI wouldn't work with DPC++/SYCL, and I think this would also mean it wouldn't be possible to build a oneCCL (which is based on MPI interfaces) NVIDIA backend for DPC++/SYCL.

Then we identify that the fix is to not allow CUcontexts to remain initialized when they are associated with GPUs that the programmer is not using in the process that initialized them. This means we have to change our current way of working, which initializes CUcontexts at platform instantiation and only destroys them in the device destructors called at platform destruction.
At this point we encounter the only spec issue, which concerns timestamps (from SYCL spec revision 9, 4.6.6.1 "Event information and profiling descriptors"):

Timestamp Requirement

"Each profiling descriptor returns a 64-bit timestamp that represents the number of nanoseconds that have elapsed since some implementation-defined timebase. All events that share the same backend are guaranteed to share the same timebase, therefore the difference between two timestamps from the same backend yields the number of nanoseconds that have elapsed between those events."

This is simply not implementable word for word (unless I am mistaken) in the CUDA backend under the constraint that we are MPI compliant (for brevity I will assume this constraint holds in what follows), for the following reasons:

  • A CUevent needs an active CUcontext to be initialized.
  • A CUevent is invalid if the CUcontext that created it is destroyed.
  • urDeviceGetGlobalTimestamps is not implementable in the CUDA backend without a reference CUevent: see the interface of cuEventElapsedTime, which is the only way of returning an event timing difference and requires a reference CUevent.

Therefore we cannot create a "backend global" CUevent timestamp at backend/platform initialization.
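
To illustrate the constraint, here is a minimal standalone sketch (illustrative only, not adapter code) of how a device timestamp has to be obtained through the CUDA driver API: cuEventElapsedTime only ever reports the difference between two events, so a base CUevent recorded earlier in a live CUcontext is required, and it becomes invalid as soon as that CUcontext is destroyed.

```cpp
#include <cstdio>
#include <cuda.h>

int main() {
  cuInit(0);
  CUdevice Dev;
  cuDeviceGet(&Dev, 0);

  CUcontext Ctx;
  cuDevicePrimaryCtxRetain(&Ctx, Dev); // a live CUcontext is needed...
  cuCtxSetCurrent(Ctx);

  // ...both to create CUevents and to keep them valid.
  CUevent Base, Now;
  cuEventCreate(&Base, CU_EVENT_DEFAULT);
  cuEventCreate(&Now, CU_EVENT_DEFAULT);
  cuEventRecord(Base, 0); // the shared "timebase" reference point

  // ...later, when a timestamp is requested:
  cuEventRecord(Now, 0);
  cuEventSynchronize(Now);
  float Ms = 0.0f;
  // The only timing query available is the difference between two events.
  cuEventElapsedTime(&Ms, Base, Now);
  std::printf("elapsed: %llu ns\n",
              static_cast<unsigned long long>(Ms * 1e6f));

  cuEventDestroy(Base);
  cuEventDestroy(Now);
  cuDevicePrimaryCtxRelease(Dev); // Base/Now would be invalid from here on
  return 0;
}
```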

However, in practice I don't think this can break any SYCL code with the change in this PR, because profiling can only be performed once a sycl::queue has been created, and you cannot create a sycl::queue without a sycl::context. Therefore this change, which initializes the CUcontext on sycl::context construction and destroys it on sycl::context destruction, is in practice equivalent to the spec requirement. In fact, this change seems to actually fix the CUDA backend's urDeviceGetGlobalTimestamps UR conformance testing. I'm not entirely sure why, but I note that we previously had a separate CUevent timestamp reference for each device, in contradiction to the SYCL spec.

At this point it is worth considering an alternative: imagine that we instead tie CUcontext instantiation to sycl::queue creation.
If we were to do this, we would in practice break the Timestamp Requirement above. Say, for example, a user creates two sycl::queues and then performs profiling analysis using the sycl::events returned from submissions to those two queues. Since the two queues are using different reference timestamps, the profiling information will be wrong. We would have to give the implementation some means of knowing whether a queue had already been created for the requested device, and, if so, some means of passing the already-created CUevent to the new queue. This would certainly be possible, but it would be much more complex and would also add a small performance overhead to queue creation. Moreover, if the second queue was created after the first queue was destroyed (which, admittedly, would be unusual), then the initially created event would be invalid, in which case it would be impossible for the two queues to share the same base timestamp.

In addition to sycl::queue instantiating the CUcontext, we would also have to make sycl::kernel_bundle instantiate a CUcontext, since it is not tied to a queue. If you did this, then at least for the DPC++ implementation I think all core SYCL functionality would be valid, since core SYCL operations require either a sycl::queue or a sycl::kernel_bundle.
Even in this "core SYCL 2020" world, a downside of the approach is that it allows programs (perhaps unusual ones, the kernel_bundle case being the more realistic) in which queues/kernel_bundles are continuously created and destroyed. If such a program reaches the point where the last reference to the CUcontext is released, the CUcontext gets destroyed. There is a ~50 ms overhead to either instantiate or destroy a CUcontext, so this is not desirable. In theory, for a program that creates a kernel_bundle, uses it, deletes it, then creates a new one, ad infinitum, the overhead would be very substantial.

However, even ignoring the above issues, the oneAPI virtual memory extension does not take a queue argument, yet its implementation requires setting a CUcontext. This means we would have to add a failsafe CUcontext retention call, as we have already had to do for a single device query: https://github.com/oneapi-src/unified-runtime/pull/2077/files#diff-641b75ae8137280ac68523353cbb6eb8059f8581b35261d7a96d179a478229bcR810. We would then face a similar possibility to the one described above, whereby users use the virtual memory extension for a given device while no queue for that device is in scope, leading to CUcontext initialization/destruction costs.

Summary

So there are 3 main reasons I see to prefer instantiation at context scope rather than queue/kernel_bundle scope:

  1. It doesn't break profiling across multiple queues, and it avoids the more complex, slower implementation that would be needed to work around that.
  2. It doesn't slow down unusual usages of sycl::queue/sycl::kernel_bundle.
  3. It doesn't slow down the virtual_memory, bindless images, and usm_p2p extensions, or require associating them with a sycl::queue via a spec change.

A final, more general reason is that the queue/kernel_bundle approach would require more refactoring of the UR implementation compared to the suggested fix, beyond what I have already outlined for reasons 1-3.

In practice this doesn't change how users interact with SYCL outside of MPI (or outside any other reason they might have to keep devices from being touched). Single-process users can continue to use the default sycl::context and ignore sycl::context entirely. If they want to use MPI with only specific devices visible, they just need to manually create the context as described in codeplaysoftware/SYCL-samples#33; a sketch of that pattern is shown below.
I suppose the only downside is that this means sycl::context can't be deprecated, but that seems like it definitely isn't happening anyway. We need to get application-critical things like MPI working fluidly asap.
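
Here is a hedged sketch of that user-side pattern, following the idea in the SYCL-samples PR linked above (the rank-to-device mapping here is an assumption for illustration):

```cpp
#include <mpi.h>
#include <sycl/sycl.hpp>

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);
  int Rank = 0;
  MPI_Comm_rank(MPI_COMM_WORLD, &Rank);

  // Pick exactly one GPU for this rank and build a sycl::context containing
  // only that device, so no other device's primary CUcontext is instantiated
  // in this process. (Assumes at least one GPU is visible to the process.)
  auto Gpus = sycl::device::get_devices(sycl::info::device_type::gpu);
  sycl::device Dev = Gpus[Rank % Gpus.size()];
  sycl::context Ctx{Dev};
  sycl::queue Q{Ctx, Dev};

  // ...CUDA-aware MPI calls on memory allocated from Q are now safe with
  // respect to the issue described above...

  MPI_Finalize();
  return 0;
}
```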

Semantics wrt CUDA Runtime

Essentially, we swap the functionality of cudaSetDevice() for usage of sycl::context(Devices), which is then passed to the queue/kernel_bundle/virtual memory extension etc. Unlike in the CUDA Runtime, in SYCL the number of devices can be greater than one, since this information is used in some backends to implement sycl::buffer and to allow easier OpenCL interoperability. For better or for worse, SYCL chose this implicit device-setting idiom. Note that if it were not for the MPI constraint, we could use sycl::platform for this device-setting purpose (as the existing implementation does), since always setting all available devices by default would never cause problems. MPI requires that we "promote" sycl::context in this way for the CUDA/HIP backends. I imagine MPI also applies this constraint to the L0 backend (but I guess it is already satisfied there).
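
A small sketch of the equivalence being drawn (illustrative only): where the CUDA Runtime "sets" one device per thread via cudaSetDevice, in SYCL the set of active devices is whatever the sycl::context handed to queues, kernel_bundles, and extensions contains, and it may hold more than one device.

```cpp
#include <sycl/sycl.hpp>
#include <vector>

int main() {
  // The analogue of cudaSetDevice() is the device list of the sycl::context
  // that gets passed on to queues, kernel_bundles, and extension entry points.
  auto Gpus = sycl::device::get_devices(sycl::info::device_type::gpu);

  // Unlike cudaSetDevice(), a single context may "set" several devices.
  sycl::context Ctx{Gpus};

  // Every queue built from Ctx shares the same set of native CUcontexts.
  std::vector<sycl::queue> Queues;
  for (auto &D : Gpus)
    Queues.emplace_back(Ctx, D);
  return 0;
}
```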

This also all fits in with the description of sycl::context in the sycl specification. Effectively we are giving sycl::context the job it has been waiting for.

Take into account that CUevents are no longer valid
if the CUcontext that created them is destroyed.

- Fix format.

- Instantiate dummy vars with nullptr.

- Pass EvBase by ref.

Signed-off-by: JackAKirk <jack.kirk@codeplay.com>
@JackAKirk JackAKirk marked this pull request as ready for review September 13, 2024 13:19
@JackAKirk JackAKirk requested review from a team as code owners September 13, 2024 13:19
@JackAKirk
Contributor Author

JackAKirk commented Sep 13, 2024

I think that the device_num.cpp failure must be unrelated. I see it happening sometimes in other PRs. See:
#2089

for (auto &Dev : Devices) {
  urDeviceRetain(Dev);
  Dev->retainNativeContext();
@hdelan hdelan (Contributor) commented Sep 13, 2024
I would personally prefer to see all of this logic for contexts happen in ur_queue_handle_t_s. This avoids giving sycl::contexts extra semantics for the CUDA backend. Within urQueueCreate you could call something like ur_device_handle_t_::init_device() which would retain the primary ctx and then set the base event, which would then be cached in the device, so if another queue is created for the same device, it doesn't need to do the same base event getting, info querying, etc.

Let's see what @npmiller thinks.
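
For comparison, a rough sketch of what this alternative could look like, with hypothetical names (init_device, the cached members) that are not taken from the actual adapter:

```cpp
#include <cuda.h>
#include <mutex>

// Hypothetical: the queue, not the context, triggers per-device
// initialization, and the device caches the result so later queues on the
// same device skip the primary-context retain and base-event setup.
struct ur_device_handle_t_ {
  CUdevice CuDevice;
  CUcontext PrimaryCtx = nullptr;
  CUevent EvBase = nullptr; // shared timebase for this device
  std::once_flag InitFlag;

  void init_device() {
    std::call_once(InitFlag, [this] {
      cuDevicePrimaryCtxRetain(&PrimaryCtx, CuDevice);
      cuCtxSetCurrent(PrimaryCtx);
      cuEventCreate(&EvBase, CU_EVENT_DEFAULT);
      cuEventRecord(EvBase, 0);
    });
  }
};

// urQueueCreate (simplified) would then call hDevice->init_device() before
// creating its CUstreams, instead of relying on the ur_context to have done
// the CUcontext setup.
```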

Labels
conformance (Conformance test suite issues), cuda (CUDA adapter specific issues)
Development
Successfully merging this pull request may close these issues:
[CUDA][HIP] too many process spawned on multiple GPU systems
2 participants