[bugfix] Link against pytorch library by name rather than path (attempting to fix linking issue). #3246

nv-dlasalle · 2021-08-11T21:26:02Z

Description

This is meant to fix #3240, which was caused by #3225, which in turn was trying to fix #3220: the issue of using DGL compiled against PyTorch builds with different CUDA versions (e.g., the pip default 1.9.0+cu102 vs. 1.9.0+cu111).

This explicitly links against torch by name rather than by path. It needs more scrutiny and testing, as I can't say I'm an expert on dynamic linking. In my testing below, it appears to achieve the desired result: PyTorch builds with different CUDA versions work against the same tensor adapter binary.

Checklist

Please feel free to remove inapplicable items for your PR.

The PR title starts with [$CATEGORY] (such as [NN], [Model], [Doc], [Feature])
Changes are complete (i.e. I finished coding on this PR)
All changes have test coverage
Code is well-documented
To the best of my knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change
Related issue is referred in this PR

Changes

Prior to #3225, when compiling, cmake would output the following:

-- tensoradapter found PyTorch includes: /home/dominique/.local/lib/python3.6/site-packages/torch/include;/home/dominique/.local/lib/python3.6/site-packages/torch/include/torch/csrc/api/include
-- tensoradapter found PyTorch lib: torch;torch_library;/home/dominique/.local/lib/python3.6/site-packages/torch/lib/libc10.so;/usr/local/cuda/lib64/stubs/libcuda.so;/usr/local/cuda/lib64/libnvrtc.so;/usr/local/cuda/lib64/libnvToolsExt.so;/usr/local/cuda/lib64/libcudart.so;/home/dominique/.local/lib/python3.6/site-packages/torch/lib/libc10_cuda.so

After #3225, but without this PR, it would output:

-- tensoradapter found PyTorch includes: /home/dominique/.local/lib/python3.6/site-packages/torch/include;/home/dominique/.local/lib/python3.6/site-packages/torch/include/torch/csrc/api/include
-- tensoradapter found PyTorch lib: /home/dominique/.local/lib/python3.6/site-packages/torch/lib/libtorch.so

With this PR:

-- tensoradapter found PyTorch includes: /home/dominique/.local/lib/python3.6/site-packages/torch/include;/home/dominique/.local/lib/python3.6/site-packages/torch/include/torch/csrc/api/include
-- tensoradapter found PyTorch lib: torch
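
The three CMake outputs above differ only in whether the torch library appears in the link list as a bare name or as an absolute path. As a rough illustration, the sketch below splits a `tensoradapter found PyTorch lib:` line into those two groups. `split_link_items` is a hypothetical helper written for this example (it is not part of DGL or of this PR), and it assumes POSIX-style absolute paths.

```python
def split_link_items(cmake_line):
    """Split a 'found PyTorch lib:' line into (names, paths).

    Hypothetical helper for illustration only; not part of DGL.
    """
    # Everything after the "PyTorch lib: " prefix is a
    # semicolon-separated CMake list of link items.
    _, _, items = cmake_line.partition("PyTorch lib: ")
    names, paths = [], []
    for item in items.strip().split(";"):
        if not item:
            continue
        # Assumption: absolute POSIX paths start with "/";
        # anything else is treated as a library or target name.
        if item.startswith("/"):
            paths.append(item)
        else:
            names.append(item)
    return names, paths


after_3225 = (
    "-- tensoradapter found PyTorch lib: "
    "/home/dominique/.local/lib/python3.6/site-packages/torch/lib/libtorch.so"
)
with_this_pr = "-- tensoradapter found PyTorch lib: torch"

print(split_link_items(after_3225))
print(split_link_items(with_this_pr))
```

For the post-#3225 line, the only link item is a path into one specific torch installation; with this PR, the only link item is the name `torch`.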

When compiling DGL with this PR and torch==1.9.0+cu102:

$ python3
Python 3.6.9 (default, Jan 26 2021, 15:33:00)
[GCC 8.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import dgl
Using backend: pytorch
RDFLib Version: 5.0.0
>>> print(dgl._ffi.base.tensor_adapter_loaded)
True

$ ldd ~/.local/lib/python3.6/site-packages/dgl-0.8-py3.6-linux-x86_64.egg/dgl/tensoradapter/pytorch/libtensoradapter_pytorch_1.9.0.so
        linux-vdso.so.1 (0x00007ffcf5bc4000)
        libtorch.so => /home/dominique/.local/lib/python3.6/site-packages/torch/lib/libtorch.so (0x00007f302fa05000)
        libtorch_cpu.so => /home/dominique/.local/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so (0x00007f301c671000)
        libtorch_cuda.so => /home/dominique/.local/lib/python3.6/site-packages/torch/lib/libtorch_cuda.so (0x00007f2fda16d000)
        libc10.so => /home/dominique/.local/lib/python3.6/site-packages/torch/lib/libc10.so (0x00007f2fd9ec9000)
        libnvToolsExt.so.1 => /usr/local/cuda/lib64/libnvToolsExt.so.1 (0x00007f2fd9cc0000)
        libstdc++.so.6 => /usr/lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007f2fd98ec000)
        libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007f2fd96d4000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f2fd92e3000)
        libgomp-a34b3233.so.1 => /home/dominique/.local/lib/python3.6/site-packages/torch/lib/libgomp-a34b3233.so.1 (0x00007f2fd90b9000)
        libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f2fd8e9a000)
        librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007f2fd8c92000)
        libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007f2fd8a8e000)
        libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f2fd86f0000)
        libcudart-80664282.so.10.2 => /home/dominique/.local/lib/python3.6/site-packages/torch/lib/libcudart-80664282.so.10.2 (0x00007f2fd846f000)
        /lib64/ld-linux-x86-64.so.2 (0x00007f302fe46000)
        libc10_cuda.so => /home/dominique/.local/lib/python3.6/site-packages/torch/lib/libc10_cuda.so (0x00007f2fd8242000)
        libnvToolsExt-3965bdd0.so.1 => /home/dominique/.local/lib/python3.6/site-packages/torch/lib/libnvToolsExt-3965bdd0.so.1 (0x00007f2fd8038000)

Then, using the same DGL installation and changing the PyTorch version to 1.9.0+cu111:

$ python3
Python 3.6.9 (default, Jan 26 2021, 15:33:00)
[GCC 8.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import dgl
Using backend: pytorch
RDFLib Version: 5.0.0
>>> print(dgl._ffi.base.tensor_adapter_loaded)
True

$ ldd ~/.local/lib/python3.6/site-packages/dgl-0.8-py3.6-linux-x86_64.egg/dgl/tensoradapter/pytorch/libtensoradapter_pytorch_1.9.0.so
        linux-vdso.so.1 (0x00007fff7e3e3000)
        libtorch.so => /home/dominique/.local/lib/python3.6/site-packages/torch/lib/libtorch.so (0x00007f2eaaf9b000)
        libtorch_cpu.so => /home/dominique/.local/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so (0x00007f2e978d7000)
        libtorch_cuda.so => /home/dominique/.local/lib/python3.6/site-packages/torch/lib/libtorch_cuda.so (0x00007f2e976c3000)
        libc10.so => /home/dominique/.local/lib/python3.6/site-packages/torch/lib/libc10.so (0x00007f2e9741f000)
        libnvToolsExt.so.1 => /usr/local/cuda/lib64/libnvToolsExt.so.1 (0x00007f2e97216000)
        libstdc++.so.6 => /usr/lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007f2e96e42000)
        libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007f2e96c2a000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f2e96839000)
        libtorch_cuda_cpp.so => /home/dominique/.local/lib/python3.6/site-packages/torch/lib/libtorch_cuda_cpp.so (0x00007f2e065b9000)
        libtorch_cuda_cu.so => /home/dominique/.local/lib/python3.6/site-packages/torch/lib/libtorch_cuda_cu.so (0x00007f2dba9bc000)
        libgomp-7c85b1e2.so.1 => /home/dominique/.local/lib/python3.6/site-packages/torch/lib/libgomp-7c85b1e2.so.1 (0x00007f2dba792000)
        libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f2dba573000)
        librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007f2dba36b000)
        libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007f2dba167000)
        libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f2db9dc9000)
        libcudart-6d56b25a.so.11.0 => /home/dominique/.local/lib/python3.6/site-packages/torch/lib/libcudart-6d56b25a.so.11.0 (0x00007f2db9b40000)
        /lib64/ld-linux-x86-64.so.2 (0x00007f2eab3ca000)
        libc10_cuda.so => /home/dominique/.local/lib/python3.6/site-packages/torch/lib/libc10_cuda.so (0x00007f2db9912000)
        libnvToolsExt-24de1d56.so.1 => /home/dominique/.local/lib/python3.6/site-packages/torch/lib/libnvToolsExt-24de1d56.so.1 (0x00007f2db9708000)
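
Comparing the two `ldd` listings by eye is error-prone, so here is a minimal sketch of how they could be compared programmatically. `parse_ldd` and `ldd_diff` are hypothetical helpers written for this example, not part of DGL, and the sample strings are short excerpts of the listings above rather than the full output.

```python
def parse_ldd(output):
    """Map each library name in ldd output to its resolved path (or None)."""
    resolved = {}
    for line in output.splitlines():
        line = line.strip()
        if " => " not in line:
            continue  # skips linux-vdso and the ld-linux loader line
        name, _, rest = line.partition(" => ")
        # Drop the trailing load address, e.g. "(0x00007f...)".
        path = rest.rsplit(" (", 1)[0].strip()
        resolved[name] = None if path in ("", "not found") else path
    return resolved


def ldd_diff(before, after):
    """Return (only_in_before, only_in_after) library names, sorted."""
    a, b = parse_ldd(before), parse_ldd(after)
    return sorted(set(a) - set(b)), sorted(set(b) - set(a))


# Excerpts from the two listings above (cu102 first, then cu111).
cu102 = """
    libtorch.so => /home/dominique/.local/lib/python3.6/site-packages/torch/lib/libtorch.so (0x00007f302fa05000)
    libcudart-80664282.so.10.2 => /home/dominique/.local/lib/python3.6/site-packages/torch/lib/libcudart-80664282.so.10.2 (0x00007f2fd846f000)
"""
cu111 = """
    libtorch.so => /home/dominique/.local/lib/python3.6/site-packages/torch/lib/libtorch.so (0x00007f2eaaf9b000)
    libtorch_cuda_cpp.so => /home/dominique/.local/lib/python3.6/site-packages/torch/lib/libtorch_cuda_cpp.so (0x00007f2e065b9000)
    libcudart-6d56b25a.so.11.0 => /home/dominique/.local/lib/python3.6/site-packages/torch/lib/libcudart-6d56b25a.so.11.0 (0x00007f2db9b40000)
"""

only_cu102, only_cu111 = ldd_diff(cu102, cu111)
print("only with cu102:", only_cu102)
print("only with cu111:", only_cu111)
```

In the full listings above, `libtorch.so` resolves to the same location in both cases, while the hashed `libgomp`, `libcudart`, and `libnvToolsExt` names change and `libtorch_cuda_cpp.so` and `libtorch_cuda_cu.so` appear only with cu111.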

dgl-bot · 2021-08-11T21:27:01Z

To trigger regression tests:

@dgl-bot run [instance-type] [which tests] [compare-with-branch];
For example: @dgl-bot run g4dn.4xlarge all dmlc/master or @dgl-bot run c5.9xlarge kernel,api dmlc/master

BarclayII · 2021-08-12T03:00:38Z

@davidmin7 Could you see if this fix works?

davidmin7 · 2021-08-12T13:03:33Z

Hi @BarclayII,

Yup, this did fix the problem that I was having in #3240. I'll close it as soon as this PR gets merged. Thanks!

…pting to fix linking issue). (#3246) * Use library name * fix for mac builds from source Co-authored-by: Quan (Andy) Gan <coin2028@hotmail.com>

Use library name

da55e20

BarclayII mentioned this pull request Aug 12, 2021

Compilation error with Mac M1 #3235

Closed

BarclayII added 3 commits August 16, 2021 09:23

Merge branch 'master' into dmlc#3240

421469e

Merge branch 'master' into dmlc#3240

1c300a0

Merge branch 'master' into dmlc#3240

bfe0585

BarclayII approved these changes Aug 19, 2021

BarclayII added 2 commits August 19, 2021 14:15

Merge branch 'master' into dmlc#3240

1e03dac

fix for mac builds from source

4ca7a6c

BarclayII linked an issue Aug 19, 2021 that may be closed by this pull request

Compilation error with Mac M1 #3235

Closed

BarclayII merged commit fc6f0b9 into dmlc:master Aug 19, 2021
