
The custom CUDA extensions may be incompatible with PyTorch 1.10.0; possibly missing dependency ninja #9

Open
fxmarty opened this issue Jan 25, 2022 · 7 comments

Comments


fxmarty commented Jan 25, 2022

Hello,

I am running the training in a Google Colab instance.

My code is roughly the following (in different cells):

from google.colab import drive
drive.mount('/content/drive')

%cd "drive/My Drive/DirectVoxGO"

!pip install -r requirements.txt

import torch
assert(torch.__version__ == '1.10.0+cu111')

!pip install torch-scatter -f https://data.pyg.org/whl/torch-1.10.0+cu111.html

!python run.py --config configs/blendedmvs/Jade.py --render_test  # train

The first error that comes up (from the last line above) is that I need ninja:

Using /root/.cache/torch_extensions/py37_cu111 as PyTorch extensions root...
Creating extension directory /root/.cache/torch_extensions/py37_cu111/adam_upd_cuda...
Traceback (most recent call last):
  File "run.py", line 13, in <module>
    from lib import utils, dvgo
  File "/content/drive/MyDrive/MAREVA_project/DirectVoxGO/lib/utils.py", line 11, in <module>
    from .masked_adam import MaskedAdam
  File "/content/drive/MyDrive/MAREVA_project/DirectVoxGO/lib/masked_adam.py", line 10, in <module>
    verbose=True)
  File "/usr/local/lib/python3.7/dist-packages/torch/utils/cpp_extension.py", line 1136, in load
    keep_intermediates=keep_intermediates)
  File "/usr/local/lib/python3.7/dist-packages/torch/utils/cpp_extension.py", line 1347, in _jit_compile
    is_standalone=is_standalone)
  File "/usr/local/lib/python3.7/dist-packages/torch/utils/cpp_extension.py", line 1418, in _write_ninja_file_and_build_library
    verify_ninja_availability()
  File "/usr/local/lib/python3.7/dist-packages/torch/utils/cpp_extension.py", line 1474, in verify_ninja_availability
    raise RuntimeError("Ninja is required to load C++ extensions")
RuntimeError: Ninja is required to load C++ extensions

Then, I do

!pip install ninja
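
For anyone hitting the same error: a quick sanity check (a sketch, not part of the repository) to confirm that the freshly installed ninja binary is actually visible to PyTorch:

# Sketch: confirm that ninja is on PATH and usable by PyTorch's JIT builder.
import shutil
import torch.utils.cpp_extension as cpp_ext

print(shutil.which('ninja'))         # path to the ninja executable, or None
print(cpp_ext.is_ninja_available())  # True once PyTorch can invoke it
cpp_ext.verify_ninja_availability()  # raises the RuntimeError above otherwise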

Rerunning !python run.py --config configs/blendedmvs/Jade.py --render_test, I now get a very verbose output, which I put on pastebin because it is too long: https://pastebin.com/Um0cMJfg . Note that despite the warnings, the training does start at the end.

Is this output expected? If these warnings/errors are due to PyTorch 1.10.0 and are critical, I would suggest disabling the custom CUDA extensions by default so that training can run on several PyTorch versions, not only 1.8.1.
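
For what it's worth, here is a minimal sketch of such an opt-in/fallback pattern; the source file names and the pure-PyTorch fallback are assumptions for illustration, not the repository's actual code:

import torch
from torch.utils.cpp_extension import load

# Hypothetical source layout, for illustration only.
sources = ['lib/cuda/adam_upd.cpp', 'lib/cuda/adam_upd_kernel.cu']

adam_upd_cuda = None  # stays None -> use the plain PyTorch optimizer step
if torch.cuda.is_available():
    try:
        adam_upd_cuda = load(name='adam_upd_cuda', sources=sources, verbose=True)
    except (RuntimeError, OSError) as err:
        print('Could not build the CUDA extension, falling back to plain PyTorch:', err)

# Callers would then branch on `adam_upd_cuda is None` and run a non-fused
# optimizer update when the extension is unavailable.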

Thanks a lot!

@sunset1995
Owner

Thanks for reminding me of the dependency :)

Yes, this is expected the first time you run the code; I turn on the verbose mode of the PyTorch JIT compilation module for debugging purposes.
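
The long log only appears when the extension is (re)built; the compiled module is cached under the torch_extensions root shown in the first line of the log, so later runs skip the compilation. A tiny sketch (the path is just an example) of redirecting that cache:

# Sketch: point PyTorch's JIT extension cache somewhere explicit; deleting the
# directory forces a recompilation (and the verbose log) on the next run.
import os
os.environ['TORCH_EXTENSIONS_DIR'] = '/content/torch_extensions'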


fxmarty commented Jan 26, 2022

Great to know - I'll use the version with the custom kernels then. Thanks!

Edit: Maybe another question @sunset1995: I saw you increased what seems to be the batch size here https://github.com/sunset1995/DirectVoxGO/blob/main/run.py#L90 from 16 to 65536. Is it safe to reduce it, or is it actually leveraged by the rendering kernels? In the previous version of the library, even 16 was too high to run on my GPU, which has little memory. I have not yet tried to run inference with the new version, but I will let you know how it goes memory-wise. (Edit again: reducing it to 32768 is fine for my memory this time. Weird.)
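
In case it helps others with small GPUs, a rough sketch for probing a workable ray-batch size; render_step is a hypothetical stand-in for one training/rendering iteration, not an actual DirectVoxGO function:

# Sketch: try decreasing candidate batch sizes and report peak GPU memory.
import torch

def probe_batch_sizes(render_step, candidates=(65536, 32768, 16384, 8192)):
    for n_rand in candidates:
        torch.cuda.empty_cache()
        torch.cuda.reset_peak_memory_stats()
        try:
            render_step(n_rand)  # hypothetical: one iteration over n_rand rays
        except RuntimeError as err:
            if 'out of memory' in str(err):
                print(f'N_rand={n_rand}: out of memory')
                continue
            raise
        peak_mib = torch.cuda.max_memory_allocated() / 2**20
        print(f'N_rand={n_rand}: peak memory {peak_mib:.0f} MiB')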

Edit again: Just to let you know (you are probably already aware): the old trained models have missing keys with the new version; see e.g. this error during inference:

Traceback (most recent call last):
  File "/home/felix/Documents/Mines/3A/Option/Mini-projet/directvoxgo-mareva/DirectVoxGO/run.py", line 518, in <module>
    model = utils.load_model(dvgo.DirectVoxGO, ckpt_path).to(device)
  File "/home/felix/Documents/Mines/3A/Option/Mini-projet/directvoxgo-mareva/DirectVoxGO/lib/utils.py", line 62, in load_model
    model = model_class(**ckpt['model_kwargs'])
  File "/home/felix/Documents/Mines/3A/Option/Mini-projet/directvoxgo-mareva/DirectVoxGO/lib/dvgo.py", line 98, in __init__
    self.mask_cache = MaskCache(
  File "/home/felix/Documents/Mines/3A/Option/Mini-projet/directvoxgo-mareva/DirectVoxGO/lib/dvgo.py", line 395, in __init__
    alpha = 1 - torch.exp(-F.softplus(density + st['model_kwargs']['act_shift']) * st['model_kwargs']['voxel_size_ratio'])
KeyError: 'act_shift'
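
A small sketch (the checkpoint path is a placeholder) for inspecting which kwargs an old checkpoint actually stores, to see what the new code expects but is missing:

# Sketch: list the kwargs saved in an old checkpoint; e.g. 'act_shift' is
# absent there, which is what triggers the KeyError above.
import torch

ckpt = torch.load('path/to/old_checkpoint.tar', map_location='cpu')
print(sorted(ckpt['model_kwargs'].keys()))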

Newly trained models are fine (and run faster, which is cool!).


aarrushi commented Feb 2, 2022

Hi,
I tried "pip install ninja" but I am still getting the error "RuntimeError: Ninja is required to load C++ extensions".
Is there any other dependency to take care of?

Also, is there a way to disable the new optimization-related changes in the new repo if they are causing issues?

@sunset1995
Owner

It seems that Windows has some issues with this.
Maybe you can try this: zhanghang1989/PyTorch-Encoding#167.

@Learningm

I solved the "RuntimeError: Ninja is required to load C++ extensions" problem by:

  1. installing torch 1.8.1 with CUDA 11.1
  2. pip install ninja
  3. sudo apt-get install ninja-build

A quick environment check is sketched below.
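
# Sketch: verify the torch / CUDA versions and that a ninja binary is on PATH.
import shutil
import torch

print(torch.__version__)      # expecting something like 1.8.1+cu111
print(torch.version.cuda)     # expecting 11.1
print(shutil.which('ninja'))  # provided by pip's ninja or apt's ninja-build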

@aarrushi

@Learningm did you resolve it on Windows as well?

@Learningm

@aarrushi I resolved it on Linux.
