FS#65176 - [python-pytorch] "nn.DataParallel" causes "NCCL Error 4: invalid argument"

Attached to Project: Community Packages
Opened by Cat (lasercat) - Thursday, 16 January 2020, 05:49 GMT
Last edited by Sven-Hendrik Haase (Svenstaro) - Wednesday, 22 January 2020, 02:34 GMT
Task Type Bug Report
Category Packages
Status Closed
Assigned To Sven-Hendrik Haase (Svenstaro)
Architecture All
Severity Medium
Priority Normal
Reported Version
Due in Version Undecided
Due Date Undecided
Percent Complete 100%
Votes 0
Private No

Details

Description:
nn.DataParallel does not work with NCCL 2.5.6.
According to the upstream PR, this seems to be fixed in PyTorch 1.4, which was released 6 hours ago.

https://github.com/pytorch/pytorch/releases/tag/v1.4.0

Additional info:
* package version(s)
python-pytorch-cuda 1.3.1-7
* config and/or log files etc.
* link to upstream bug report, if any
Also reported in upstream pull request https://github.com/pytorch/pytorch/pull/29014

Steps to reproduce:
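No steps were given in the report. As a rough sketch of how the failure might be triggered, assuming a machine with at least two CUDA GPUs: the function name, layer sizes and batch size below are hypothetical, and the assumption that the NCCL error surfaces as a RuntimeError is not confirmed by the report.

```python
# Hypothetical reproduction sketch; not taken from the original report.
import importlib.util


def try_data_parallel():
    """Run one forward pass through nn.DataParallel and report the outcome."""
    if importlib.util.find_spec("torch") is None:
        return "skipped: torch is not installed"

    import torch
    from torch import nn

    # nn.DataParallel only replicates the model when several GPUs are visible.
    if not torch.cuda.is_available() or torch.cuda.device_count() < 2:
        return "skipped: needs at least two CUDA devices"

    print("torch version:", torch.__version__)

    # Arbitrary small model and batch (assumed sizes, chosen for illustration).
    model = nn.DataParallel(nn.Linear(8, 4)).cuda()
    batch = torch.randn(16, 8).cuda()

    try:
        out = model(batch)
    except RuntimeError as exc:
        # Assumption: the NCCL failure is raised as a RuntimeError.
        return f"failed: {exc}"
    return f"ok: output shape {tuple(out.shape)}"


if __name__ == "__main__":
    print(try_data_parallel())
```

On an affected setup this would be expected to print a line starting with "failed:"; on a fixed one, a line starting with "ok:".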

Closed by Sven-Hendrik Haase (Svenstaro)
Wednesday, 22 January 2020, 02:34 GMT
Reason for closing: Fixed
Comment by Cat (lasercat) - Thursday, 16 January 2020, 08:22 GMT
Sorry, I forgot to fill in the summary.
It should be: "nn.DataParallel" causes "NCCL Error 4: invalid argument"
Comment by Sven-Hendrik Haase (Svenstaro) - Thursday, 16 January 2020, 12:08 GMT
Too bad 1.4 doesn't compile at all. :(
Comment by Sven-Hendrik Haase (Svenstaro) - Friday, 17 January 2020, 05:18 GMT
Test 1.4 in [community-testing].