[bugfix] Link against pytorch library by name rather than path (attempting to fix linking issue). #3246

Merged Aug 19, 2021 (6 commits)

nv-dlasalle
Collaborator

nv-dlasalle commented Aug 11, 2021

Description

This is meant to fix #3240, which was caused by #3225. That PR was in turn trying to fix #3220, the problem of DGL being compiled against PyTorch builds with different CUDA versions (e.g., the pip default 1.9.0+cu102 vs. 1.9.0+cu111).

This PR explicitly links against torch by name rather than by path. It needs more scrutiny and testing, as I can't say I'm an expert on dynamic linking. In my testing below, it appears to achieve the desired result: PyTorch builds with different CUDA versions work against the same tensor adapter binary.
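
As a rough illustration of why linking by name helps, here is a toy model of name-based lookup in Python. It is only a sketch: the real dynamic loader also consults RPATH/RUNPATH, LD_LIBRARY_PATH and the ld.so cache, and `resolve_by_name`, `demo` and the `cu102`/`cu111` directory layout are hypothetical names made up for this example.

```python
import os
import tempfile


def resolve_by_name(lib_name, search_dirs):
    """Return the first existing <dir>/<lib_name>, or None if nothing matches."""
    for directory in search_dirs:
        candidate = os.path.join(directory, lib_name)
        if os.path.isfile(candidate):
            return candidate
    return None


def demo():
    """Resolve the same library name against two hypothetical installs."""
    with tempfile.TemporaryDirectory() as root:
        installs = {}
        for tag in ("cu102", "cu111"):
            lib_dir = os.path.join(root, tag, "torch", "lib")
            os.makedirs(lib_dir)
            # An empty placeholder file stands in for the real shared library.
            open(os.path.join(lib_dir, "libtorch.so"), "w").close()
            installs[tag] = lib_dir

        # A dependency recorded by name follows whichever install is searched.
        first = resolve_by_name("libtorch.so", [installs["cu102"]])
        second = resolve_by_name("libtorch.so", [installs["cu111"]])
        return os.path.relpath(first, root), os.path.relpath(second, root)


if __name__ == "__main__":
    print(demo())
```

A dependency recorded as an absolute path has no such lookup step, so it stays tied to the location that existed at build time.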

Checklist

Please feel free to remove inapplicable items for your PR.

  • The PR title starts with [$CATEGORY] (such as [NN], [Model], [Doc], [Feature])
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage
  • Code is well-documented
  • To the best of my knowledge, examples are either not affected by this change,
    or have been fixed to be compatible with this change
  • Related issue is referred in this PR

Changes

Prior to #3225, when compiling, cmake would output the following:

-- tensoradapter found PyTorch includes: /home/dominique/.local/lib/python3.6/site-packages/torch/include;/home/dominique/.local/lib/python3.6/site-packages/torch/include/torch/csrc/api/include
-- tensoradapter found PyTorch lib: torch;torch_library;/home/dominique/.local/lib/python3.6/site-packages/torch/lib/libc10.so;/usr/local/cuda/lib64/stubs/libcuda.so;/usr/local/cuda/lib64/libnvrtc.so;/usr/local/cuda/lib64/libnvToolsExt.so;/usr/local/cuda/lib64/libcudart.so;/home/dominique/.local/lib/python3.6/site-packages/torch/lib/libc10_cuda.so

After #3225, but without this PR, it would output:

-- tensoradapter found PyTorch includes: /home/dominique/.local/lib/python3.6/site-packages/torch/include;/home/dominique/.local/lib/python3.6/site-packages/torch/include/torch/csrc/api/include
-- tensoradapter found PyTorch lib: /home/dominique/.local/lib/python3.6/site-packages/torch/lib/libtorch.so

With this PR:

-- tensoradapter found PyTorch includes: /home/dominique/.local/lib/python3.6/site-packages/torch/include;/home/dominique/.local/lib/python3.6/site-packages/torch/include/torch/csrc/api/include
-- tensoradapter found PyTorch lib: torch
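
To make the difference between the three outputs easier to see, the sketch below splits a "found PyTorch lib" value into entries given by name and entries given by absolute path. `split_link_items` is a hypothetical helper, not part of DGL or CMake, and the pre-#3225 value is shortened to four of its entries.

```python
import posixpath


def split_link_items(cmake_list):
    """Split a semicolon-separated CMake list into (names, absolute_paths)."""
    items = [item for item in cmake_list.split(";") if item]
    names = [item for item in items if not posixpath.isabs(item)]
    paths = [item for item in items if posixpath.isabs(item)]
    return names, paths


# Values copied from the CMake output quoted above.
# BEFORE_3225_EXCERPT keeps only four entries of the original list.
BEFORE_3225_EXCERPT = (
    "torch;torch_library;"
    "/home/dominique/.local/lib/python3.6/site-packages/torch/lib/libc10.so;"
    "/usr/local/cuda/lib64/libcudart.so"
)
AFTER_3225 = "/home/dominique/.local/lib/python3.6/site-packages/torch/lib/libtorch.so"
WITH_PR = "torch"

if __name__ == "__main__":
    for label, value in [
        ("before #3225 (excerpt)", BEFORE_3225_EXCERPT),
        ("after #3225", AFTER_3225),
        ("with this PR", WITH_PR),
    ]:
        names, paths = split_link_items(value)
        print(label, "names:", names, "paths:", len(paths))
```

With this PR the list contains a single name and no absolute paths, whereas after #3225 it contained only an absolute path.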

When compiling DGL with this PR and torch==1.9.0+cu102:

$ python3
Python 3.6.9 (default, Jan 26 2021, 15:33:00)
[GCC 8.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import dgl
Using backend: pytorch
RDFLib Version: 5.0.0
>>> print(dgl._ffi.base.tensor_adapter_loaded)
True
$ ldd ~/.local/lib/python3.6/site-packages/dgl-0.8-py3.6-linux-x86_64.egg/dgl/tensoradapter/pytorch/libtensoradapter_pytorch_1.9.0.so
        linux-vdso.so.1 (0x00007ffcf5bc4000)
        libtorch.so => /home/dominique/.local/lib/python3.6/site-packages/torch/lib/libtorch.so (0x00007f302fa05000)
        libtorch_cpu.so => /home/dominique/.local/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so (0x00007f301c671000)
        libtorch_cuda.so => /home/dominique/.local/lib/python3.6/site-packages/torch/lib/libtorch_cuda.so (0x00007f2fda16d000)
        libc10.so => /home/dominique/.local/lib/python3.6/site-packages/torch/lib/libc10.so (0x00007f2fd9ec9000)
        libnvToolsExt.so.1 => /usr/local/cuda/lib64/libnvToolsExt.so.1 (0x00007f2fd9cc0000)
        libstdc++.so.6 => /usr/lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007f2fd98ec000)
        libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007f2fd96d4000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f2fd92e3000)
        libgomp-a34b3233.so.1 => /home/dominique/.local/lib/python3.6/site-packages/torch/lib/libgomp-a34b3233.so.1 (0x00007f2fd90b9000)
        libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f2fd8e9a000)
        librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007f2fd8c92000)
        libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007f2fd8a8e000)
        libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f2fd86f0000)
        libcudart-80664282.so.10.2 => /home/dominique/.local/lib/python3.6/site-packages/torch/lib/libcudart-80664282.so.10.2 (0x00007f2fd846f000)
        /lib64/ld-linux-x86-64.so.2 (0x00007f302fe46000)
        libc10_cuda.so => /home/dominique/.local/lib/python3.6/site-packages/torch/lib/libc10_cuda.so (0x00007f2fd8242000)
        libnvToolsExt-3965bdd0.so.1 => /home/dominique/.local/lib/python3.6/site-packages/torch/lib/libnvToolsExt-3965bdd0.so.1 (0x00007f2fd8038000)
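
The interactive check in the session above could also be scripted. This is a minimal sketch that assumes `dgl._ffi.base.tensor_adapter_loaded` is a module-level attribute, as the session suggests. `tensor_adapter_status` is a hypothetical helper name, and it returns None when DGL cannot be imported at all.

```python
import importlib


def tensor_adapter_status():
    """Return dgl._ffi.base.tensor_adapter_loaded, or None if DGL is unavailable."""
    try:
        base = importlib.import_module("dgl._ffi.base")
    except (ImportError, OSError):
        # DGL is not installed, or one of its shared libraries failed to load.
        return None
    return getattr(base, "tensor_adapter_loaded", None)


if __name__ == "__main__":
    print(tensor_adapter_status())
```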

Then, using the same DGL installation but switching PyTorch to 1.9.0+cu111:

$ python3
Python 3.6.9 (default, Jan 26 2021, 15:33:00)
[GCC 8.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import dgl
Using backend: pytorch
RDFLib Version: 5.0.0
>>> print(dgl._ffi.base.tensor_adapter_loaded)
True
$ ldd ~/.local/lib/python3.6/site-packages/dgl-0.8-py3.6-linux-x86_64.egg/dgl/tensoradapter/pytorch/libtensoradapter_pytorch_1.9.0.so
        linux-vdso.so.1 (0x00007fff7e3e3000)
        libtorch.so => /home/dominique/.local/lib/python3.6/site-packages/torch/lib/libtorch.so (0x00007f2eaaf9b000)
        libtorch_cpu.so => /home/dominique/.local/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so (0x00007f2e978d7000)
        libtorch_cuda.so => /home/dominique/.local/lib/python3.6/site-packages/torch/lib/libtorch_cuda.so (0x00007f2e976c3000)
        libc10.so => /home/dominique/.local/lib/python3.6/site-packages/torch/lib/libc10.so (0x00007f2e9741f000)
        libnvToolsExt.so.1 => /usr/local/cuda/lib64/libnvToolsExt.so.1 (0x00007f2e97216000)
        libstdc++.so.6 => /usr/lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007f2e96e42000)
        libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007f2e96c2a000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f2e96839000)
        libtorch_cuda_cpp.so => /home/dominique/.local/lib/python3.6/site-packages/torch/lib/libtorch_cuda_cpp.so (0x00007f2e065b9000)
        libtorch_cuda_cu.so => /home/dominique/.local/lib/python3.6/site-packages/torch/lib/libtorch_cuda_cu.so (0x00007f2dba9bc000)
        libgomp-7c85b1e2.so.1 => /home/dominique/.local/lib/python3.6/site-packages/torch/lib/libgomp-7c85b1e2.so.1 (0x00007f2dba792000)
        libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f2dba573000)
        librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007f2dba36b000)
        libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007f2dba167000)
        libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f2db9dc9000)
        libcudart-6d56b25a.so.11.0 => /home/dominique/.local/lib/python3.6/site-packages/torch/lib/libcudart-6d56b25a.so.11.0 (0x00007f2db9b40000)
        /lib64/ld-linux-x86-64.so.2 (0x00007f2eab3ca000)
        libc10_cuda.so => /home/dominique/.local/lib/python3.6/site-packages/torch/lib/libc10_cuda.so (0x00007f2db9912000)
        libnvToolsExt-24de1d56.so.1 => /home/dominique/.local/lib/python3.6/site-packages/torch/lib/libnvToolsExt-24de1d56.so.1 (0x00007f2db9708000)
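
The two ldd listings can be compared mechanically. The sketch below parses a few lines excerpted from each listing above and reports which library names appear only with one PyTorch build. `parse_ldd` is a hypothetical helper, and the excerpts are deliberately shortened.

```python
def parse_ldd(text):
    """Map each 'name => path (address)' line of ldd output to {name: path}."""
    resolved = {}
    for line in text.splitlines():
        if "=>" not in line:
            continue  # skips lines such as linux-vdso and the ld-linux loader
        name, _, rest = line.partition("=>")
        resolved[name.strip()] = rest.split("(")[0].strip()
    return resolved


TORCH_LIB = "/home/dominique/.local/lib/python3.6/site-packages/torch/lib"

# Shortened excerpts of the two ldd listings quoted above.
CU102 = f"""
        linux-vdso.so.1 (0x00007ffcf5bc4000)
        libtorch.so => {TORCH_LIB}/libtorch.so (0x00007f302fa05000)
        libtorch_cuda.so => {TORCH_LIB}/libtorch_cuda.so (0x00007f2fda16d000)
        libcudart-80664282.so.10.2 => {TORCH_LIB}/libcudart-80664282.so.10.2 (0x00007f2fd846f000)
"""
CU111 = f"""
        linux-vdso.so.1 (0x00007fff7e3e3000)
        libtorch.so => {TORCH_LIB}/libtorch.so (0x00007f2eaaf9b000)
        libtorch_cuda.so => {TORCH_LIB}/libtorch_cuda.so (0x00007f2e976c3000)
        libtorch_cuda_cpp.so => {TORCH_LIB}/libtorch_cuda_cpp.so (0x00007f2e065b9000)
        libtorch_cuda_cu.so => {TORCH_LIB}/libtorch_cuda_cu.so (0x00007f2dba9bc000)
        libcudart-6d56b25a.so.11.0 => {TORCH_LIB}/libcudart-6d56b25a.so.11.0 (0x00007f2db9b40000)
"""

old, new = parse_ldd(CU102), parse_ldd(CU111)
only_cu102 = sorted(set(old) - set(new))
only_cu111 = sorted(set(new) - set(old))

if __name__ == "__main__":
    print("only with cu102:", only_cu102)
    print("only with cu111:", only_cu111)
```

In both listings `libtorch.so` resolves to the same path inside the installed torch package, while the CUDA runtime libraries pulled in behind it differ between the two builds.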

@dgl-bot
Collaborator

dgl-bot commented Aug 11, 2021

To trigger regression tests:

  • @dgl-bot run [instance-type] [which tests] [compare-with-branch];
    For example: @dgl-bot run g4dn.4xlarge all dmlc/master or @dgl-bot run c5.9xlarge kernel,api dmlc/master

@BarclayII
Collaborator

@davidmin7 Could you see if this fix works?

@davidmin7
Contributor

Hi @BarclayII,

Yup, this did fix the problem that I was having in #3240. I'll close it as soon as this PR gets merged. Thanks!

BarclayII linked an issue Aug 19, 2021 that may be closed by this pull request
BarclayII merged commit fc6f0b9 into dmlc:master Aug 19, 2021
BarclayII added a commit that referenced this pull request Aug 26, 2021
…pting to fix linking issue). (#3246)

* Use library name

* fix for mac builds from source

Co-authored-by: Quan (Andy) Gan <coin2028@hotmail.com>