
[🐛 bug report] Mitsuba 3 Crashes in a Multi-GPU Environment with Device Set to Non-Zero #808

Microno95 opened this issue Jul 17, 2023 · 5 comments


@Microno95
Contributor

Microno95 commented Jul 17, 2023

Summary

Running Mitsuba 3 with each process using a different GPU on a multi-GPU machine does not work: Dr.Jit fails with a CUDA_ERROR_ILLEGAL_ADDRESS error.

System configuration

System Information:

OS: Rocky Linux release 8.7 (Green Obsidian)
CPU: AMD EPYC 7763 64-Core Processor
GPU: NVIDIA A100-SXM4-80GB
     NVIDIA A100-SXM4-80GB
     NVIDIA A100-SXM4-80GB
     NVIDIA A100-SXM4-80GB
Python: 3.11.4 | packaged by conda-forge | (main, Jun 10 2023, 18:08:17) [GCC 12.2.0]
NVidia driver: 525.105.17
CUDA: 11.8.89
LLVM: 14.0.6
Dr.Jit: 0.4.2
Mitsuba: 3.3.0
    Is custom build? True
    Compiled with: GNU 9.3.0
    Variants:
        scalar_rgb
        scalar_spectral
        cuda_ad_rgb
        llvm_ad_rgb   

Description

Setting the device to anything other than device 0 for the CUDA variants leads to a critical Dr.Jit compiler failure with CUDA API Error 0700 (CUDA_ERROR_ILLEGAL_ADDRESS) in drjit-core/src/util.cpp:203.

I am trying to use Mitsuba in a multi-GPU, multi-node environment to generate a dataset of renders. To do so, I use a PyTorch setup where one process per node spawns one process per GPU. To give each process its own GPU, I call torch.cuda.set_device(rank), where rank is the local rank of the process on a given node, and similarly mi.util.dr.set_device(rank) to set the GPU for Mitsuba.
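For concreteness, here is a minimal sketch of that per-process setup. The LOCAL_RANK lookup is an assumption about the launcher (e.g. torchrun); it only illustrates where the two set_device calls sit:

```python
# Sketch of the per-process setup described above. How `local_rank` is
# obtained depends on the launcher; LOCAL_RANK is what torchrun exports.
import os

import torch
import mitsuba as mi

local_rank = int(os.environ.get("LOCAL_RANK", 0))  # one process per GPU

torch.cuda.set_device(local_rank)      # select the GPU for PyTorch
mi.set_variant("cuda_ad_rgb")
mi.util.dr.set_device(local_rank)      # select the same GPU for Dr.Jit

scene = mi.load_dict(mi.cornell_box())  # crashes when local_rank != 0
```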

Doing this before loading the scene triggers the above error and crashes the Python instance. If I instead set the Dr.Jit device after loading a scene, the scene remains loaded on the first GPU.

While using the environment variable CUDA_VISIBLE_DEVICES works as expected, it prevents using Mitsuba 3 in an environment where a single process may want to use multiple GPUs, and it rules out managing CUDA devices programmatically.
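For reference, that workaround looks roughly like this (a sketch; the key constraint is that the variable must be set before anything initializes CUDA in the process):

```python
# Workaround via CUDA_VISIBLE_DEVICES: must be set before any CUDA
# initialization in this process (hence before setting the variant).
import os

local_rank = int(os.environ.get("LOCAL_RANK", 0))  # illustrative
os.environ["CUDA_VISIBLE_DEVICES"] = str(local_rank)

import mitsuba as mi

mi.set_variant("cuda_ad_rgb")
# The selected GPU now appears as device 0 inside this process, so no
# set_device() call is needed -- but the process can no longer see the
# other GPUs, which is exactly the limitation described above.
scene = mi.load_dict(mi.cornell_box())
```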

Steps to reproduce

  1. Run import mitsuba as mi; mi.set_variant("cuda_ad_rgb"); mi.util.dr.set_device(1)
  2. Load the Cornell Box scene with mi.load_dict(mi.cornell_box()) (combined into a single script below)
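The same two steps as one runnable script:

```python
# Minimal reproducer, combining the two steps above.
import mitsuba as mi

mi.set_variant("cuda_ad_rgb")
mi.util.dr.set_device(1)  # any non-zero device index triggers the crash

# On a multi-GPU machine this fails with CUDA API Error 0700
# (CUDA_ERROR_ILLEGAL_ADDRESS) in drjit-core/src/util.cpp:203.
scene = mi.load_dict(mi.cornell_box())
```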
@njroussel
Member

Hi @Microno95

I thought we had two open issues regarding this, but I can only find this one, which is now closed: mitsuba-renderer/drjit#119.

I still believe that something is broken in Dr.Jit with regard to changing devices. Unfortunately, we don't have a multi-GPU machine at our disposal to debug this ourselves.
If anyone wants to look into this, here are three good starting points:

@Microno95
Contributor Author

Hi @njroussel

I see. I can look into it, provided Dr.Jit supports GTX 1080 GPUs. Hopefully I can provide some insight, even if not a solution.

Are there any existing tests in Dr.Jit that cover device setting/switching? I'll use those as a starting point to debug the issue.

@njroussel
Member

Yeah, that architecture should still be supported.

I don't think there are any tests for this. To be quite honest, I've always wondered how this was initially implemented 😅 I can't even guarantee you that this worked properly at any point in time.

@Microno95
Contributor Author

Hey @njroussel,

I got a GTX 1080 installed and started debugging. First, there is a bug in drjit-core in how contexts are used when setting attributes, here:
https://github.com/mitsuba-renderer/drjit-core/blob/25dd7a5cb96ee58d65cc1499f47de76f6140ff36/src/registry.cpp#L309
and here:
https://github.com/mitsuba-renderer/drjit-core/blob/25dd7a5cb96ee58d65cc1499f47de76f6140ff36/src/registry.cpp#L340
In both places, the statement should be scoped_set_context guard(ts->context);

It's quite a small issue, so I didn't want to open a PR just for it.

The broader problem lies in how the device is set on a per-thread basis. The fundamental issue is that calling jit_cuda_set_device only sets the device for the main thread, not for any of the worker threads. This leads to a CUDA problem where the main thread loads part of the scene onto one device while another thread loads it onto device 0.
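This matches the CUDA runtime's semantics: the active device is a per-host-thread property. A minimal sketch of that behaviour, assuming a machine with at least two visible GPUs (PyTorch is only used here as a convenient way to reach cudaSetDevice/cudaGetDevice):

```python
# The active CUDA device is thread-local: selecting a device on the main
# thread does not affect a newly spawned thread, which starts on the
# default device 0. Requires at least two GPUs.
import threading

import torch

torch.cuda.set_device(1)  # main thread now targets device 1
print("main thread:", torch.cuda.current_device())  # -> 1

def worker():
    # A fresh thread starts with the default device, not the one above.
    print("worker thread:", torch.cuda.current_device())  # -> 0

t = threading.Thread(target=worker)
t.start()
t.join()
```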

In fact, using dr.set_num_threads(0) alleviates the problem of setting the device, and the test suite runs as expected (with no worker threads, all work runs on the main thread, which is the only thread whose device was set). I'll look more into it, but that's what I've got so far.
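As a stop-gap, that workaround looks like this (a sketch based on the observation above, not an officially supported configuration):

```python
# Stop-gap: disable Dr.Jit's worker-thread pool so all work happens on
# the main thread, the only thread affected by set_device().
import drjit as dr
import mitsuba as mi

mi.set_variant("cuda_ad_rgb")
dr.set_num_threads(0)  # run everything on the main thread
dr.set_device(1)       # select a non-zero device

# With the pool disabled, the crash described above no longer occurs.
scene = mi.load_dict(mi.cornell_box())
```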

@Microno95
Contributor Author

Upon further investigation, it looks like setting the device once at script startup can be made to work, but switching devices at runtime causes major issues, because each variable relies on the device (i.e. the CUDA stream and CUDA context) being constant across operations.

I have a patch that gets this working: the device is set via global state in Dr.Jit, and each thread then has its device set appropriately, either upon construction or whenever jit_cuda_set_device is called. However, it is not stable, in that changing the device at runtime leads to horrific errors (e.g. accessing one GPU's memory while in another GPU's context).

Broadly, the solution is most likely to track the context of each device-side pointer and use that rather than a per-thread context. Not sure how that interacts with different compute streams...
