FS#65202 : [python-pytorch-opt-cuda] incompatible nccl

Attached to Project: Community Packages
Opened by Yuxin Wu (ppwwyyxx) - Sunday, 19 January 2020, 08:41 GMT
Last edited by Konstantin Gizdov (kgizdov) - Monday, 27 January 2020, 19:57 GMT

Task Type	Bug Report
Category	Packages: Testing
Status	Closed
Assigned To	Sven-Hendrik Haase (Svenstaro) Konstantin Gizdov (kgizdov)
Architecture	All
Severity	Low
Priority	Normal
Reported Version
Due in Version	Undecided
Due Date	Undecided
Percent Complete
Votes	15 Cat (lasercat) (2020-01-27) Christos Tzelepis (nullgeppetto) (2020-01-26) Philip Goto (flipflop97) (2020-01-25) Maxime Lewandowski (lywel) (2020-01-25) Quentin Bammey (qbammey) (2020-01-25) Øystein Schønning-Johansen (oysteijo) (2020-01-24) Hans Gaiser (hgaiser) (2020-01-24) Ilango Rajagopal (ilango100) (2020-01-24) Oliver Breitwieser (obreitwi) (2020-01-23) Adria Arrufat (swiftscythe) (2020-01-23) Alessio Elmi (elmuz) (2020-01-22) Noa-Emil Nissinen (4shadoww) (2020-01-22) yann (massendefekt) (2020-01-22) Oliver Weißbarth (oweissbarth) (2020-01-22) Cebtenzzre (cebtenzzre) (2020-01-22)
Private	No

Details

Description:
Cannot import torch.

Steps to reproduce:

Install python-pytorch-opt-cuda 1.4.0-1 from testing.
Run
```
$ python -c 'import torch'
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/usr/lib/python3.8/site-packages/torch/__init__.py", line 81, in <module>
from torch._C import *
ImportError: /usr/lib/python3.8/site-packages/torch/lib/libtorch_python.so: undefined symbol: _ZN5torch4cuda4nccl6detail16throw_nccl_errorE12ncclResult_t
```

Closed by Konstantin Gizdov (kgizdov)
Monday, 27 January 2020, 19:57 GMT
Reason for closing: Fixed
Additional comments about closing: python-pytorch 1.4.0-4

Comment by Alessio Elmi (elmuz) - Wednesday, 22 January 2020, 08:47 GMT

I can confirm the problem is still present with `python-pytorch-opt-cuda 1.4.0-2`.
```
~ » python
Python 3.8.1 (default, Jan 8 2020, 23:09:20)
[GCC 9.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python3.8/site-packages/torch/__init__.py", line 81, in <module>
from torch._C import *
ImportError: /usr/lib/python3.8/site-packages/torch/lib/libtorch_python.so: undefined symbol: _ZN5torch4cuda4nccl6detail16throw_nccl_errorE12ncclResult_t

```

Comment by yann (massendefekt) - Wednesday, 22 January 2020, 10:56 GMT

I have the same problem. Any solutions or workarounds?

Comment by Noa-Emil Nissinen (4shadoww) - Wednesday, 22 January 2020, 18:43 GMT

Same problem with python-pytorch-cuda 1.4.0-2. python-pytorch-cuda 1.3.1-7 still works like a charm.

Comment by Ota (otaj) - Thursday, 23 January 2020, 08:37 GMT

It seems like an upstream bug: I compiled the package myself and still get this error.

Comment by Ota (otaj) - Thursday, 23 January 2020, 09:28 GMT

A bit more info: even if I compile with the bundled NCCL (USE_SYSTEM_NCCL=OFF), the problem persists.

Comment by Kamran Melikov (kamranm) - Saturday, 25 January 2020, 01:33 GMT

I have the same problem. python-pytorch-opt-cuda 1.3.1-7 works fine.

Comment by Andrew (thelongdivider) - Sunday, 26 January 2020, 23:59 GMT

This problem persists in python-pytorch-opt-cuda-1.4.0-3. It literally won't import, forcing me to stick with 1.3.1-8.

It works fine on the non-cuda version, but this is pretty useless for most of our purposes.

Comment by Konstantin Gizdov (kgizdov) - Monday, 27 January 2020, 00:56 GMT

We are working to fix it as quickly as possible, but it looks like an [upstream bug](https://github.com/pytorch/pytorch/issues/32638). Could you guys try these patches and tell me if they help?

nccl_version.patch (1.9 KiB)

torch_cuda_api.patch (1.2 KiB)

Comment by Andrew (thelongdivider) - Monday, 27 January 2020, 01:26 GMT

I downloaded the PKGBUILD, but these patches don't apply (the file locations are unknown).

Comment by Hans Gaiser (hgaiser) - Monday, 27 January 2020, 06:51 GMT

Not sure if this helps, but for me the pip version (also 1.4.0) does work.

Comment by Ota (otaj) - Monday, 27 January 2020, 09:31 GMT

The patches do help: I just rebuilt the package for myself and they seem to work. However, the second patch (torch_cuda_api.patch) failed to apply at first, because the declaration "void throw_nccl_error(ncclResult_t status);" is not prefixed with TORCH_CUDA_API in the 1.4.0 source (as seen here: https://github.com/pytorch/pytorch/blob/v1.4.0/torch/csrc/cuda/nccl.h#L22). So I believe the error did not happen because the function was not inlined, but because it was not exported with TORCH_CUDA_API.
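
One rough way to test the "not exported" theory from Python is to ask the dynamic loader whether a symbol can be resolved through a given shared library. The sketch below uses `ctypes` for that; the function name `exports_symbol` is hypothetical, and the library path in the usage comment is an assumption, not something confirmed in this report. Note that the lookup also searches the library's dependencies, so a positive result means "resolvable through this library", not strictly "defined in it".

```python
import ctypes


def exports_symbol(library, symbol: str) -> bool:
    """Return True if `symbol` can be resolved through `library`.

    `library` is passed straight to ctypes.CDLL, so it may be a path to a
    shared object, or None for the symbols already loaded into the process.
    ctypes.CDLL raises OSError if the library itself cannot be loaded.
    """
    handle = ctypes.CDLL(library)
    try:
        # Attribute access performs the symbol lookup and raises
        # AttributeError when the loader cannot find the name.
        getattr(handle, symbol)
    except AttributeError:
        return False
    return True


# Hypothetical usage (the path is an assumption about where the library lives):
# exports_symbol(
#     "/usr/lib/python3.8/site-packages/torch/lib/libtorch.so",
#     "_ZN5torch4cuda4nccl6detail16throw_nccl_errorE12ncclResult_t",
# )
```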

Comment by Konstantin Gizdov (kgizdov) - Monday, 27 January 2020, 09:54 GMT

@thelongdivider, you need to apply the patches correctly using the `-p1` flag as explained in the Wiki.

@hgaiser, the pip version will work because it uses the built-in nccl, and thus it never needs to export the symbols correctly. This actually proves the bug.

@otaj, good catch - I was creating the patch from the wrong branch. However, this proves that the 1.4.0 release does indeed have a bug where the symbol is not exported properly. I have updated the patch (attached) and it will soon be in the repo.

torch_cuda_api.patch (0.4 KiB)
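
For readers unsure what the `-p1` flag mentioned above does: `patch -pN` strips the leading N path components from each file name in the patch before looking for the file. The sketch below is a simplified model of that rule, assuming the attached patches use git-style `a/` and `b/` prefixes (which the thread does not confirm); the function name is hypothetical and the example path is the header linked earlier in the discussion.

```python
def strip_components(path: str, count: int) -> str:
    """Simplified model of `patch -p<count>` for plain relative paths.

    Drops the first `count` slash-separated components. Real `patch` has
    extra rules (absolute paths, repeated slashes) that are ignored here.
    """
    parts = path.split("/")
    if count >= len(parts):
        raise ValueError("cannot strip every component of the path")
    return "/".join(parts[count:])


# A git-style diff header names the file with an extra leading component:
print(strip_components("a/torch/csrc/cuda/nccl.h", 1))
# torch/csrc/cuda/nccl.h
```

With `-p0` the name is used unchanged, which is why a patch can fail to find its target files when the strip level does not match how the patch was generated.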

Comment by Ota (otaj) - Monday, 27 January 2020, 09:57 GMT

Hold off on pushing to the repo - I am building again to test whether the new patch works. I also think nccl_version.patch is not bringing anything new to the table, so I am trying to build without it as well.

Comment by Ota (otaj) - Monday, 27 January 2020, 10:33 GMT

Yeah, it works without nccl_version.patch and with the smaller torch_cuda_api.patch.

Comment by Konstantin Gizdov (kgizdov) - Monday, 27 January 2020, 17:22 GMT

Please check out testing/python-pytorch 1.4.0-4 and see if it works.

Comment by Andrew (thelongdivider) - Monday, 27 January 2020, 17:32 GMT

Will do. Currently still showing 1.4.0-3 though even on archlinux.org.

Comment by Andrew (thelongdivider) - Monday, 27 January 2020, 18:03 GMT

It imports and seems to be working well with simple cuda based matrix multiply and NN code. Thanks!
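
A check along the lines described above can be scripted. This is only a minimal sketch, not the code the commenter ran: the function name `cuda_smoke_test` and the matrix size are assumptions. It reports an import failure (such as the undefined-symbol error from this report) as a message instead of raising.

```python
def cuda_smoke_test(size: int = 64) -> str:
    """Try to import torch and run one small matrix multiply on the GPU."""
    try:
        import torch
    except ImportError as exc:
        # The bug in this report surfaces here, as an ImportError.
        return f"import failed: {exc}"
    if not torch.cuda.is_available():
        return "torch imported, but CUDA is not available"
    a = torch.randn(size, size, device="cuda")
    b = torch.randn(size, size, device="cuda")
    product = a @ b  # matrix multiply executed on the GPU
    if product.shape != (size, size) or not torch.isfinite(product).all().item():
        return "CUDA matmul returned an unexpected result"
    return "ok"


print(cuda_smoke_test())
```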
