OptiX SBT is released on non-main thread #1091

Open
dvicini opened this issue Feb 28, 2024 · 2 comments

@dvicini
Member

dvicini commented Feb 28, 2024

Hi,

I am seeing an odd, sporadic issue that I haven't encountered before. I am running the cuda_ad_rgb variant in a relatively unconventional setup: rendering a simple scene in a Colab notebook (similar to a Jupyter notebook) on an NVIDIA V100 datacenter GPU.

Everything runs fine most of the time. However, every few runs I see the following issue: jit_optix_configure_sbt is initially invoked on the main thread when loading the scene, but the cleanup callback that releases the internal structures is invoked on another thread. I am debugging this by printing std::this_thread::get_id() both in the configure call and in the cleanup callback. Turning off parallel scene loading did not seem to make a difference.
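
For reference, the instrumentation looks roughly like this (a minimal sketch; the two function names below are placeholders standing in for jit_optix_configure_sbt and its cleanup callback, whose real signatures are elided):

#include <iostream>
#include <thread>

// Runs on the main thread when the scene is loaded.
void configure_sbt_instrumented() {
    std::cout << "configure: thread " << std::this_thread::get_id() << '\n';
    // ... jit_optix_configure_sbt(...) ...
}

// Registered as the cleanup callback; sporadically runs on another thread.
void cleanup_sbt_instrumented() {
    std::cout << "cleanup:   thread " << std::this_thread::get_id() << '\n';
    // ... jit_free(...) on the SBT's internal structures ...
}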

In practice, the problem is that the jit_free call in the cleanup callback internally refers to thread_state_cuda to free host-pinned memory. If the cleanup happens on a non-main thread, thread_state_cuda may never have been initialized, leading to a null-pointer dereference in jitc_free when it accesses thread_state_cuda->stream.

Two questions about this:

  1. Is it expected that the cleanup might happen on a non-main thread? I am using the mi.render function; do we expect any part of the custom op mechanism to potentially lead to another thread holding a reference to the Scene object?

  2. Can we simply replace the unprotected access to the CUDA thread state with thread_state(JitBackend::CUDA) in jit_free? (See the sketch after this list.)
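
To make question 2 concrete, here is a minimal sketch of the change, assuming thread_state_cuda is a thread-local pointer that is only initialized on threads that have used the CUDA backend, and that thread_state(JitBackend::CUDA) creates the per-thread state on demand (the declarations below are placeholders, not the actual Dr.Jit source):

// Placeholder declarations standing in for Dr.Jit internals (sketch only).
enum class JitBackend { CUDA };
struct ThreadState { void *stream; };
extern thread_local ThreadState *thread_state_cuda;
ThreadState *thread_state(JitBackend backend); // lazily initializes the state

// Current pattern: raw access to the thread-local pointer. If the cleanup
// callback runs on a thread that never touched the CUDA backend,
// thread_state_cuda is still null here.
void jitc_free_current(void *ptr) {
    void *stream = thread_state_cuda->stream; // null-pointer dereference
    // ... enqueue asynchronous free of 'ptr' on 'stream' ...
}

// Proposed pattern: thread_state() initializes the state on demand, so the
// access is safe regardless of which thread runs the cleanup.
void jitc_free_proposed(void *ptr) {
    ThreadState *ts = thread_state(JitBackend::CUDA);
    void *stream = ts->stream;
    // ... enqueue asynchronous free of 'ptr' on 'stream' ...
}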

It is possible that the non-main-thread cleanup is related to something in IPython/Colab somehow holding an extra reference to the render op or the Mitsuba scene.

This is the code I run in my Colab cell, though it is likely not of much help in understanding the issue:

import drjit as dr
import mitsuba as mi
import numpy as np

dr.set_thread_count(1)  # restrict Dr.Jit to a single worker thread
mi.set_variant('cuda_ad_rgb')
dr.set_log_level(dr.LogLevel.InfoSym)

def render():
  # Minimal scene: a cube lit by a constant emitter, viewed through a
  # perspective camera, rendered with the direct-illumination integrator.
  scene = mi.load_dict(
      {
          'type': 'scene',
          'integrator': {'type': 'direct'},
          'shape': {
              'type': 'cube',
          },
          'emitter': {'type': 'constant'},
          'sensor': {
              'type': 'perspective',
              'to_world': mi.ScalarTransform4f.look_at(
                  [4, 4, 4.5], [0, 0, 0], [0, 1, 0]
              ),
          },
      },
      parallel=False,  # sequential scene loading (did not change the behavior)
  )
  image = np.array(mi.render(scene, spp=256))
  dr.sync_thread()
  del scene  # drop the local reference; cleanup runs once the scene is freed
  return image

image = render()

@wjakob
Member

wjakob commented Feb 28, 2024

Are you combining this with another array programming framework? For example, PyTorch code involving (our) custom operations causes them to be called from other threads during differentiation.

@dvicini
Member Author

dvicini commented Feb 28, 2024

No, nothing of that sort.

We would not expect Dr.Jit to ever use another thread here, right? All references to the render op and the Scene object should be held on the main thread?

If that's the case, this might be something Colab-specific, e.g., it may use threads to provide additional information to the user for debugging or inspection.
