
Apparent locking issues when running across multiple GPUs #283

Open
gtebbutt opened this issue Dec 6, 2023 · 2 comments

Comments


gtebbutt commented Dec 6, 2023

I've noticed an interesting issue when running on multi-GPU machines: although selecting gpu(N) as the decoding context initially works as expected, the overall throughput when running multiple processes drops off very rapidly until there's only one process showing activity on a single GPU, sometimes with occasional very short bursts of processing from others.
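
For context, each process is doing something along these lines (a simplified sketch rather than the actual pipeline; the file path and GPU index obviously differ per process):

```python
from decord import VideoReader, gpu

def decode_every_nth(path, gpu_id, stride=10):
    # Pin this process's decoder to a single GPU via the decoding context.
    vr = VideoReader(path, ctx=gpu(gpu_id))
    frames = []
    for i in range(0, len(vr), stride):
        frames.append(vr[i].asnumpy())
    return frames

# e.g. one process runs decode_every_nth("/data/a.mp4", 0) while another,
# launched from a separate screen session, runs decode_every_nth("/data/b.mp4", 1)
```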

This happens even when the processes are totally independent (e.g. started separately from different screen sessions, operating on entirely different files, and using separate GPUs), which leads me to think there's probably a hardware- or system-level locking mechanism being used globally rather than per-process, since it occurs even between separate Python instances.

My working theory is that it could be falling through to a global lock of some kind due to setting decoder_info_.vidLock = nullptr;, but so far that hasn't brought us any closer to a fix. It would be very helpful to hear whether anyone else has (or hasn't!) run into similar issues.

Possibly related to #187 and/or #159?


johndpope commented Jun 5, 2024

I've been using Claude Opus to do AI Python coding on complex tasks, and I frequently just throw the entire codebase (as one file) in as context and ask the AI to fix or find problems. Are you using the GPU context because it's faster?

Are you on Intel? I'm looking to do some processing on 35,000 videos, and currently it's taking 1-5 minutes per video.
The GPU seems like the obvious choice, but I'm wondering if Intel codecs could give a good boost.
https://github.com/Intel-Media-SDK/MediaSDK - this is EOL

Claude Opus may have found your locking problem here:
#302


gtebbutt commented Jun 6, 2024

Oh, I've been meaning to do a proper writeup on video decoding for a few months now and just haven't had the time. Quick notes for now on what we learned from processing video at reasonably large scale (millions of files, billions of frames, a few hundred TB of data):

  • We ended up using VALI, and I'd strongly recommend it. It's the successor to NVIDIA's VPF (Video Processing Framework), continued by one of the original devs. Right now it's the fastest option available by a significant margin, but the learning curve is steep. The sample scripts in the old VPF repo are a reasonable starting point, since the API is close enough to broadly match up, but that repo is now unmaintained and the samples do have some bugs. Another thing on my long to-do list is publishing a decord-like wrapper class for VALI, but realistically it may be a while before I get to it.
  • The second-best option is torchaudio.StreamReader, and it's much easier to use than VALI. I last benchmarked on torch 2.1, but as far as I know this is still the case: torchvision uses a different configuration for whatever reason, but the torchaudio decoder supports video as well and is much faster. The main issue we ran into is that torch doesn't have a specific CUDA kernel for colour space conversion (whereas VALI does), and NVDEC/CUVID only return NV12 data - that one extra function to convert to RGB was enough to significantly reduce the overall speed and colour accuracy of the process (rough sketches of both the decoder setup and the conversion step are below, after this list). Even an extra 3ms per frame adds up to a month of extra compute time at the scale we had.
  • Decord is still up there in terms of both speed and colour accuracy, which is very impressive, but we ran into enough reliability issues to take it off the table. Alongside the locking problem, there were several difficult-to-debug situations where it'd occasionally return blank frames - they were never replicable and seemed to happen at random, so our best guess is a race condition somewhere in the pipeline between CUDA, decord, and torch. It ended up being quicker to convert our code to use a different library than it would have been to pin that down.
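
For the torchaudio route, the hardware-accelerated setup looks roughly like this - a minimal sketch rather than our production code, with a placeholder file path, and argument names as of roughly the torch 2.1 era:

```python
from torchaudio.io import StreamReader

reader = StreamReader("clip.mp4")  # placeholder path
reader.add_video_stream(
    frames_per_chunk=32,
    decoder="h264_cuvid",  # NVDEC-backed decoder (codec-specific)
    hw_accel="cuda:0",     # keep the decoded frames on the GPU
)

for (chunk,) in reader.stream():
    # chunk is a uint8 tensor on cuda:0, typically shaped
    # (frames, channels, height, width); depending on version and options the
    # pixel data may still be YUV/NV12 rather than RGB, which is exactly where
    # the extra conversion step comes in.
    pass  # rest of the pipeline goes here
```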
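
And this is the kind of extra conversion step I mean - a rough NV12-to-RGB function in plain torch ops, using BT.601 limited-range coefficients (the right matrix and range depend on the source material, so treat it as illustrative rather than something we shipped):

```python
import torch

def nv12_to_rgb(nv12: torch.Tensor, height: int, width: int) -> torch.Tensor:
    # nv12 is a (height * 3 // 2, width) uint8 tensor: a full-resolution Y plane
    # followed by interleaved half-resolution U/V samples.
    y = nv12[:height].float()
    uv = nv12[height:].reshape(height // 2, width // 2, 2).float()

    # Upsample chroma to full resolution (nearest-neighbour for brevity).
    u = uv[..., 0].repeat_interleave(2, dim=0).repeat_interleave(2, dim=1)
    v = uv[..., 1].repeat_interleave(2, dim=0).repeat_interleave(2, dim=1)

    y = 1.164 * (y - 16.0)
    u = u - 128.0
    v = v - 128.0

    r = y + 1.596 * v
    g = y - 0.392 * u - 0.813 * v
    b = y + 2.017 * u
    return torch.stack([r, g, b]).clamp(0, 255).to(torch.uint8)  # (3, H, W)
```

Even on the GPU that's several separate kernel launches plus a float round-trip per frame, which is where those extra milliseconds come from; a dedicated conversion kernel (like VALI's) avoids all of that.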

Throwing the whole repo into an LLM is a technique I hadn't thought of, so it's interesting to see what it came back with! Realistically I'm not likely to get the time to dive into the decord codebase and see how accurate the suggestions are, but if it is possible to get the threading and context locking sorted out, that'd be a big win for the library.
