FS#62142 - [nvidia] System hangs when running CUDA p2pBandwidthLatencyTest with driver 418.56 on kernel 5.0

Attached to Project: Arch Linux
Opened by Alex (aletan) - Tuesday, 26 March 2019, 09:18 GMT
Last edited by Sven-Hendrik Haase (Svenstaro) - Saturday, 30 March 2019, 21:51 GMT
Task Type Bug Report
Category Packages: Extra
Status Closed
Assigned To Sven-Hendrik Haase (Svenstaro)
Architecture All
Severity Low
Priority Normal
Reported Version
Due in Version Undecided
Due Date Undecided
Percent Complete 100%
Votes 0
Private No

Details

Description:

I'm running two TITAN Xp GPUs with nvidia driver version 418.56-2 on Arch Linux x86_64, kernel 5.0.4-arch1-1-ARCH:

OS: Arch Linux x86_64
Kernel: 5.0.4-arch1-1-ARCH
Uptime: 13 mins
Packages: 421 (pacman)
Shell: bash 5.0.2
Theme: Arc [GTK2/3]
Icons: Adwaita [GTK2/3]
Terminal: urxvt
CPU: Intel i7-7800X (12) @ 4.000GHz
GPU: NVIDIA TITAN Xp
Memory: 1220MiB / 64098MiB


When I execute nvidia-smi, I get all the information about both GPUs:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.56 Driver Version: 418.56 CUDA Version: 10.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 TITAN Xp Off | 00000000:17:00.0 Off | N/A |
| 24% 42C P8 10W / 250W | 2MiB / 12196MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 TITAN Xp Off | 00000000:65:00.0 On | N/A |
| 27% 45C P8 17W / 250W | 247MiB / 12192MiB | 0% Default |
+-------------------------------+----------------------+----------------------+


But when I execute the P2P bandwidth test sample from the cuda package,

/opt/cuda/samples/1_Utilities/p2pBandwidthLatencyTest/p2pBandwidthLatencyTest

my system hangs.

This issue did not occur with nvidia driver 415 and kernel 4.20. (A stripped-down sketch of the peer-to-peer copy this sample exercises follows the reproduction steps below.)

Steps to reproduce:

1) cd /opt/cuda/samples/1_Utilities/p2pBandwidthLatencyTest/
2) make
3) ./p2pBandwidthLatencyTest
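
What step 3 exercises is, at bottom, an ordinary peer-to-peer copy between the two boards. Below is a stripped-down sketch of that path using the CUDA runtime C API; the device indices, buffer size, and lack of error checking are illustrative choices here, not taken from the sample itself:

#include <stdio.h>
#include <cuda_runtime.h>

/* Minimal peer-to-peer copy between GPU 0 and GPU 1 -- a stripped-down
   version of what p2pBandwidthLatencyTest exercises. Illustrative only. */
int main(void)
{
    int can01 = 0, can10 = 0;
    size_t bytes = 64 << 20;  /* 64 MiB scratch buffer */
    void *buf0 = NULL, *buf1 = NULL;

    /* Ask the runtime whether each GPU can address the other's memory. */
    cudaDeviceCanAccessPeer(&can01, 0, 1);
    cudaDeviceCanAccessPeer(&can10, 1, 0);
    printf("peer access 0->1: %d, 1->0: %d\n", can01, can10);

    cudaSetDevice(0);
    cudaMalloc(&buf0, bytes);
    if (can01)
        cudaDeviceEnablePeerAccess(1, 0);  /* flags argument must be 0 */

    cudaSetDevice(1);
    cudaMalloc(&buf1, bytes);
    if (can10)
        cudaDeviceEnablePeerAccess(0, 0);

    /* The copy below travels between the two boards; with the affected
       driver/kernel combination this is roughly where the hang is seen. */
    cudaMemcpyPeer(buf1, 1, buf0, 0, bytes);
    cudaDeviceSynchronize();

    cudaFree(buf1);
    cudaSetDevice(0);
    cudaFree(buf0);
    return 0;
}

A file like this should build against the Arch cuda package with nvcc, or with something like gcc -I/opt/cuda/include -L/opt/cuda/lib64 -lcudart (the exact library path is an assumption about the package layout).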


* package version(s)

nvidia:

Repository : extra
Name : nvidia
Version : 418.56-2
Description : NVIDIA drivers for linux
Architecture : x86_64
URL : http://www.nvidia.com/
Licenses : custom
Groups : None
Provides : None
Depends On : linux nvidia-utils=418.56 libglvnd
Optional Deps : None
Conflicts With : None
Replaces : None
Download Size : 11.56 MiB
Installed Size : 11.84 MiB
Packager : Jan Alexander Steffens (heftig) <jan.steffens@gmail.com>
Build Date : Sun 24 Mar 2019 01:22:23 AM MSK
Validated By : MD5 Sum SHA-256 Sum Signature

kernel:

Repository : core
Name : linux
Version : 5.0.4.arch1-1
Description : The Linux kernel and modules
Architecture : x86_64
URL : https://git.archlinux.org/linux.git/log/?h=v5.0.4-arch1
Licenses : GPL2
Groups : base
Provides : None
Depends On : coreutils linux-firmware kmod mkinitcpio
Optional Deps : crda: to set the correct wireless channels of your country
Conflicts With : None
Replaces : None
Download Size : 70.66 MiB
Installed Size : 75.40 MiB
Packager : Jan Alexander Steffens (heftig) <jan.steffens@gmail.com>
Build Date : Sat 23 Mar 2019 11:57:31 PM MSK
Validated By : MD5 Sum SHA-256 Sum Signature

cuda:

Repository : community
Name : cuda
Version : 10.0.130-2
Description : NVIDIA's GPU programming toolkit
Architecture : x86_64
URL : http://www.nvidia.com/object/cuda_home.html
Licenses : custom:NVIDIA
Groups : None
Provides : cuda-toolkit cuda-sdk
Depends On : gcc7-libs opencl-nvidia nvidia-utils gcc7
Optional Deps : gdb: for cuda-gdb
java-runtime: for nsight and nvvp
Conflicts With : None
Replaces : cuda-toolkit cuda-sdk
Download Size : 1316.93 MiB
Installed Size : 3023.75 MiB
Packager : Sven-Hendrik Haase <svenstaro@gmail.com>
Build Date : Mon 24 Sep 2018 09:14:18 AM MSK
Validated By : MD5 Sum SHA-256 Sum Signature

Closed by  Sven-Hendrik Haase (Svenstaro)
Saturday, 30 March 2019, 21:51 GMT
Reason for closing:  Fixed
Comment by loqs (loqs) - Tuesday, 26 March 2019, 11:30 GMT
FS#62110: did patching the nvidia driver resolve the issue?
Comment by Alex (aletan) - Tuesday, 26 March 2019, 12:02 GMT
I am going to look into that when I get a chance.
Comment by Sven-Hendrik Haase (Svenstaro) - Wednesday, 27 March 2019, 22:15 GMT
Not sure what to do here and there isn't even a description. So there is a patch? Does it do anything for you?
Comment by loqs (loqs) - Wednesday, 27 March 2019, 22:43 GMT
https://bbs.archlinux.org/viewtopic.php?id=244919 includes patches, dmesg backtraces and the possible causal commit.
Hopefully, once the patch is confirmed to work, someone affected can report the issue to Nvidia.
Edit:
test1.patch is the proposed fix
test2.patch assumes the cause is https://github.com/torvalds/linux/commit/356da6d0cde3323236977fce54c1f9612a742036, in which case the function can only return NV_FALSE under Linux 5.0.
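
For context, the linked commit ("dma-mapping: bypass indirect calls for dma-direct") made dma-direct devices carry no dma_map_ops table at all, so any probe that decides "is dma_map_resource implemented?" by looking for a map_resource callback starts answering no under 5.0, even though dma_map_resource() itself still works for those devices. A rough C sketch of a probe of that shape is below; it is illustrative only and is not the NVIDIA source or either patch:

#include <linux/device.h>
#include <linux/dma-mapping.h>

/* Illustrative only: a "does this device implement dma_map_resource?"
   probe that keys off the presence of a map_resource callback. Under
   Linux 5.0, get_dma_ops() returns NULL for dma-direct devices, so a
   check of this shape is always false there. */
static bool map_resource_callback_present(struct device *dev)
{
        const struct dma_map_ops *ops = get_dma_ops(dev);

        return ops && ops->map_resource;
}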
Comment by Sven-Hendrik Haase (Svenstaro) - Wednesday, 27 March 2019, 23:17 GMT
Well, I'm not going to randomly patch packages, so I'll need some help here from you guys to test some of this.
Comment by loqs (loqs) - Wednesday, 27 March 2019, 23:37 GMT
I do not possess an SLI system, and on a single-card system I cannot reach nv_dma_map_peer or any of the other callers of nv_dma_is_map_resource_implemented to test.
Comment by Alex (aletan) - Thursday, 28 March 2019, 09:42 GMT
I have applied test1.patch to nvidia 418.56-3 and it works for me!

Thanks a lot!
Comment by Sven-Hendrik Haase (Svenstaro) - Friday, 29 March 2019, 15:49 GMT
Ok, make sure to report this to nvidia. For the time being, I'll fix the package downstream.
Comment by Sven-Hendrik Haase (Svenstaro) - Friday, 29 March 2019, 15:55 GMT
Please test 418.56-4 and report results.
