FS#65202 - [python-pytorch-opt-cuda] incompatible nccl

Attached to Project: Community Packages
Opened by Yuxin Wu (ppwwyyxx) - Sunday, 19 January 2020, 08:41 GMT
Last edited by Konstantin Gizdov (kgizdov) - Monday, 27 January 2020, 19:57 GMT
Task Type Bug Report
Category Packages: Testing
Status Closed
Assigned To Sven-Hendrik Haase (Svenstaro), Konstantin Gizdov (kgizdov)
Architecture All
Severity Low
Priority Normal
Reported Version
Due in Version Undecided
Due Date Undecided
Percent Complete 100%
Votes 15
Private No

Details

Description:
Cannot import torch.


Steps to reproduce:

Install python-pytorch-opt-cuda 1.4.0-1 from testing.
Run
```
$ python -c 'import torch'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/usr/lib/python3.8/site-packages/torch/__init__.py", line 81, in <module>
    from torch._C import *
ImportError: /usr/lib/python3.8/site-packages/torch/lib/libtorch_python.so: undefined symbol: _ZN5torch4cuda4nccl6detail16throw_nccl_errorE12ncclResult_t
```
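For readers unfamiliar with mangled C++ names, the missing symbol in the traceback can be decoded by hand. The following is a minimal sketch, not a general demangler: it assumes the simple Itanium C++ ABI form `_ZN<len><name>...E` followed by length-prefixed parameter types, which is all this particular symbol uses. The function name `demangle_nested_name` is made up for illustration.

```python
import re


def demangle_nested_name(symbol: str) -> str:
    """Decode a simple Itanium-ABI nested name such as _ZN3foo3barE5Thing.

    Only handles length-prefixed identifiers; templates, builtin types,
    qualifiers and substitutions are NOT supported (illustrative sketch).
    """
    if not symbol.startswith("_ZN"):
        raise ValueError("not a nested mangled name")

    def read_name(pos: int):
        # Each identifier is encoded as <decimal length><characters>.
        match = re.match(r"\d+", symbol[pos:])
        if match is None:
            raise ValueError(f"unsupported encoding at offset {pos}")
        start = pos + match.end()
        end = start + int(match.group())
        return symbol[start:end], end

    pos = 3  # skip the "_ZN" prefix
    scope = []
    while pos < len(symbol) and symbol[pos] != "E":
        name, pos = read_name(pos)
        scope.append(name)
    pos += 1  # skip the "E" that closes the nested name

    params = []
    while pos < len(symbol):
        name, pos = read_name(pos)
        params.append(name)

    return "::".join(scope) + "(" + ", ".join(params) + ")"


missing = "_ZN5torch4cuda4nccl6detail16throw_nccl_errorE12ncclResult_t"
print(demangle_nested_name(missing))
```

So the loader is looking for the function `throw_nccl_error` in the namespace `torch::cuda::nccl::detail`, taking an `ncclResult_t` argument.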

Closed by Konstantin Gizdov (kgizdov)
Monday, 27 January 2020, 19:57 GMT
Reason for closing: Fixed
Additional comments about closing: python-pytorch 1.4.0-4
Comment by Alessio Elmi (elmuz) - Wednesday, 22 January 2020, 08:47 GMT
I can confirm the problem is still present with `python-pytorch-opt-cuda 1.4.0-2`.
```
~ » python
Python 3.8.1 (default, Jan 8 2020, 23:09:20)
[GCC 9.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.8/site-packages/torch/__init__.py", line 81, in <module>
    from torch._C import *
ImportError: /usr/lib/python3.8/site-packages/torch/lib/libtorch_python.so: undefined symbol: _ZN5torch4cuda4nccl6detail16throw_nccl_errorE12ncclResult_t
```
Comment by yann (massendefekt) - Wednesday, 22 January 2020, 10:56 GMT
I have the same problem. Any solutions or workarounds?
Comment by Noa-Emil Nissinen (4shadoww) - Wednesday, 22 January 2020, 18:43 GMT
Same problem with python-pytorch-cuda 1.4.0-2. python-pytorch-cuda 1.3.1-7 still works like a charm.
Comment by Ota (otaj) - Thursday, 23 January 2020, 08:37 GMT
This looks like an upstream bug: I compiled the package myself and still get this error.
Comment by Ota (otaj) - Thursday, 23 January 2020, 09:28 GMT
A bit more info: even if I compile with the bundled NCCL (USE_SYSTEM_NCCL=OFF), the problem persists.
Comment by Kamran Melikov (kamranm) - Saturday, 25 January 2020, 01:33 GMT
I have the same problem. python-pytorch-opt-cuda 1.3.1-7 works fine.
Comment by Andrew (thelongdivider) - Sunday, 26 January 2020, 23:59 GMT
This problem persists in python-pytorch-opt-cuda-1.4.0-3. It literally won't import, forcing me to stick with 1.3.1-8.

It works fine on the non-cuda version, but this is pretty useless for most of our purposes.
Comment by Konstantin Gizdov (kgizdov) - Monday, 27 January 2020, 00:56 GMT
We are working to fix it as quickly as possible, but it looks like an [upstream bug](https://github.com/pytorch/pytorch/issues/32638). Could you guys try these patches and tell me if they help?
Comment by Andrew (thelongdivider) - Monday, 27 January 2020, 01:26 GMT
I downloaded the PKGBUILD, but these patches don't apply (the file locations are unknown).
Comment by Hans Gaiser (hgaiser) - Monday, 27 January 2020, 06:51 GMT
Not sure if this helps, but for me the pip version (also 1.4.0) does work.
Comment by Ota (otaj) - Monday, 27 January 2020, 09:31 GMT
The patches do help: I just rebuilt the package for myself and they seem to work. However, the second patch (torch_cuda_api.patch) failed at first, because the declaration "void throw_nccl_error(ncclResult_t status);" is not prefixed with TORCH_CUDA_API in the v1.4.0 source (see https://github.com/pytorch/pytorch/blob/v1.4.0/torch/csrc/cuda/nccl.h#L22). So I believe the error occurred not because the function was not inlined, but because it was not exported with TORCH_CUDA_API.
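One way to check an export diagnosis like this is to look at the dynamic symbol tables (`nm -D`) of `libtorch_python.so` and of the library expected to provide the symbol. The sketch below only classifies a symbol from `nm -D` text. The helper name `classify_symbol` and the sample listing are made up for illustration and are not output from the actual packages. It also simplifies by treating only type `U` as undefined, ignoring weak symbols and version suffixes.

```python
def classify_symbol(nm_output: str, symbol: str) -> str:
    """Return 'defined', 'undefined' or 'absent' for `symbol` in `nm -D` text.

    nm prints "<address> <type> <name>" for defined symbols and
    "<type> <name>" with type "U" for undefined references.
    """
    for line in nm_output.splitlines():
        fields = line.split()
        if len(fields) < 2 or fields[-1] != symbol:
            continue
        return "undefined" if fields[-2] == "U" else "defined"
    return "absent"


# Hypothetical listing, written by hand to show the two line shapes.
SAMPLE = """\
                 U _ZN5torch4cuda4nccl6detail16throw_nccl_errorE12ncclResult_t
0000000000001139 T some_defined_function
"""

wanted = "_ZN5torch4cuda4nccl6detail16throw_nccl_errorE12ncclResult_t"
print(classify_symbol(SAMPLE, wanted))
```

If the importing library lists the symbol as undefined and the providing library's dynamic table does not list it as defined, the import fails exactly as in the traceback in this report. To run it against a real file, the text could come from `subprocess.run(["nm", "-D", path], capture_output=True, text=True).stdout`.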
Comment by Konstantin Gizdov (kgizdov) - Monday, 27 January 2020, 09:54 GMT
@thelongdivider, you need to apply the patches correctly using the `-p1` flag as explained in the Wiki.

@hgaiser, the pip version works because it uses the built-in nccl and thus never needs to export the symbols correctly. This actually proves the bug.

@otaj, good catch - I was creating the patch from the wrong branch. However, this proves that the 1.4.0 release does indeed have a bug where the symbol is not exported properly. I have updated the patch (attached) and it will soon be in the repo.
Comment by Ota (otaj) - Monday, 27 January 2020, 09:57 GMT
Hold off on pushing to the repo - I am building again to test whether the new patch works. I also think nccl_version.patch is not adding anything, so I am trying to build without it as well.
Comment by Ota (otaj) - Monday, 27 January 2020, 10:33 GMT
Yeah, it works without nccl_version.patch and with the smaller torch_cuda_api patch.
Comment by Konstantin Gizdov (kgizdov) - Monday, 27 January 2020, 17:22 GMT
Please check out testing/python-pytorch 1.4.0-4 and see if it works.
Comment by Andrew (thelongdivider) - Monday, 27 January 2020, 17:32 GMT
Will do. Currently still showing 1.4.0-3 though even on archlinux.org.
Comment by Andrew (thelongdivider) - Monday, 27 January 2020, 18:03 GMT
It imports and seems to be working well with simple CUDA-based matrix multiply and NN code. Thanks!
