FS#69005 : [cuda] 11.2 incompatible with driver 455.45

Attached to Project: Community Packages
Opened by Michael (ZeroBeat) - Wednesday, 16 December 2020, 16:42 GMT
Last edited by Sven-Hendrik Haase (Svenstaro) - Thursday, 07 January 2021, 17:24 GMT

Task Type	Bug Report
Category	Packages
Status	Closed
Assigned To	Sven-Hendrik Haase (Svenstaro)
Architecture	x86_64
Severity	High
Priority	Normal
Reported Version
Due in Version	Undecided
Due Date	Undecided
Percent Complete
Votes	5 Yuxin Wu (ppwwyyxx) (2020-12-24) Michał Walenciak (Kicer) (2020-12-22) JumaX9 (jumax9) (2020-12-17) Adria Arrufat (swiftscythe) (2020-12-17) Anton (sci-pirate) (2020-12-16)
Private	No

Details

CUDA 11.2 is incompatible with the current driver:

$ pacman -Q | grep cuda
cuda 11.2.0-1

$ pacman -Q | grep nvidia
nvidia 455.45.01-7
nvidia-settings 455.45.01-1
nvidia-utils 455.45.01-1
opencl-nvidia 455.45.01-1

$ hashcat -m 22000 --benchmark
hashcat (v6.1.1-120-g15bf8b730) starting in benchmark mode...

CUDA API (CUDA 11.1)
Device #1: GeForce GTX 970, 3887/4039 MB, 13MCU

OpenCL API (OpenCL 1.2 CUDA 11.1.114) - Platform #1 [NVIDIA Corporation]
Device #2: GeForce GTX 970, skipped

Hashmode: 22000 - WPA-PBKDF2-PMKID+EAPOL (Iterations: 4095)
cuLinkAddData(): the provided PTX was compiled with an unsupported toolchain.

Device #1: Kernel /usr/share/hashcat/OpenCL/shared.cl link failed. Error Log:

ptxas application ptx input, line 9; fatal : Unsupported .version 7.2; current version is '7.1'

Device #1: Kernel /usr/share/hashcat/OpenCL/shared.cl build failed.

Started: Wed Dec 16 16:16:38 2020
Stopped: Wed Dec 16 16:16:40 2020
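
The ptxas message above amounts to a version comparison: the PTX module declares ISA version 7.2 in its `.version` directive, while the installed driver only accepts up to 7.1. A minimal sketch of that check follows. The function names and the sample PTX header are hypothetical, since the actual PTX generated by hashcat is not shown in this report.

```python
import re

def ptx_version(ptx_text):
    """Return (major, minor) from the '.version' directive of a PTX module."""
    match = re.search(r"^\s*\.version\s+(\d+)\.(\d+)", ptx_text, re.MULTILINE)
    if match is None:
        raise ValueError("no .version directive found")
    return int(match.group(1)), int(match.group(2))

def driver_can_jit(ptx_text, driver_max_ptx):
    """True if the module's PTX ISA version is not newer than the driver's."""
    return ptx_version(ptx_text) <= driver_max_ptx

# Hypothetical PTX header for illustration only (sm_52 matches a GTX 970).
SAMPLE_PTX = """\
//
// Generated by NVIDIA NVVM Compiler
//
.version 7.2
.target sm_52
.address_size 64
"""

# Driver 455.45 reports 7.1 as its newest supported PTX version (see log above).
print(driver_can_jit(SAMPLE_PTX, (7, 1)))
```

With the 7.1 limit taken from the error log the check fails. A driver that accepts 7.2 would pass it.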

release notes:
https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html
CUDA 11.2.0 GA: driver >= 460.27.04 (Linux), >= 460.89 (Windows)

Solutions:
Inform users not to update to 11.2 until nvidia 460.27 is released
or
stop the update to cuda 11.2
or
provide the nvidia 460.27.04 (beta) driver

Stay healthy,
cheers
Mike

Closed by Sven-Hendrik Haase (Svenstaro)
Thursday, 07 January 2021, 17:24 GMT
Reason for closing: Fixed

Comment by Anton (sci-pirate) - Wednesday, 16 December 2020, 22:09 GMT

All CUDA-enabled programs are in a fault state. Another solution, in addition to those suggested, is to downgrade cuda to 11.1.1 until nvidia 460 is released.

Comment by Eli Schwartz (eschwartz) - Thursday, 17 December 2020, 02:34 GMT

Field changed: Summary (CUDA 11.2 incompatible with driver 455.45 → [cuda] 11.2 incompatible with driver 455.45)
Field changed: Status (Unconfirmed → Assigned)
Task assigned to Sven-Hendrik Haase (Svenstaro)

This seems like the kind of situation where we should revert with an epoch...

Comment by Sven-Hendrik Haase (Svenstaro) - Thursday, 17 December 2020, 03:28 GMT

Hang on. I specifically test cuda compatibility every time I upgrade cuda when there's only a beta driver out and I could run tensorflow and pytorch just fine. How do I reproduce your issues? You didn't mention anything specific.

Comment by Michael (ZeroBeat) - Thursday, 17 December 2020, 08:35 GMT

It looks like only some CUDA functions are affected
https://docs.nvidia.com/cuda/parallel-thread-execution/#changes-in-ptx-isa-version-7-2
and the "basic functions" (e.g. querying a device) are still working.

This is the output of a small CUDA program (only basic functions) that queries my device:
$ ./dp
CUDA Device Query...
There are 1 CUDA devices.

CUDA Device #0
Major revision number: 5
Minor revision number: 2
Name: GeForce GTX 970
Total global memory: 4236115968
Total shared memory per block: 49152
Total registers per block: 65536
Warp size: 32
Maximum memory pitch: 2147483647
Maximum threads per block: 1024
Maximum dimension 0 of block: 1024
Maximum dimension 1 of block: 1024
Maximum dimension 2 of block: 64
Maximum dimension 0 of grid: 2147483647
Maximum dimension 1 of grid: 65535
Maximum dimension 2 of grid: 65535
Clock rate: 1228000
Total constant memory: 65536
Texture alignment: 512
Concurrent copy and execution: Yes
Number of multiprocessors: 13
Kernel execution timeout: No

I observed the same problem when moving from CUDA 11.0 to 11.1 while running an older driver:
CUDA 11.1 GA: driver >= 455.23 (Linux), >= 456.38 (Windows)
CUDA 11.0.3 Update 1: driver >= 450.51.06 (Linux), >= 451.82 (Windows)
https://github.com/hashcat/hashcat/issues/2626
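
The minimum-driver figures quoted in this report can be turned into a quick check. This is only a sketch: it assumes the first figure on each release-notes line is the Linux minimum, and the table and function names are made up for illustration. A pacman package release suffix such as "-7" is stripped before comparing.

```python
# Minimum Linux driver per CUDA release, copied from the figures in this report.
MIN_LINUX_DRIVER = {
    "11.2.0": "460.27.04",
    "11.1": "455.23",
    "11.0.3": "450.51.06",
}

def parse_version(text):
    """Turn '455.45.01-7' into (455, 45, 1) so versions compare numerically."""
    upstream = text.split("-")[0]  # drop the pacman pkgrel, if present
    return tuple(int(part) for part in upstream.split("."))

def driver_is_new_enough(cuda_version, driver_version):
    """True if the driver meets the minimum listed for that CUDA release."""
    minimum = MIN_LINUX_DRIVER[cuda_version]
    return parse_version(driver_version) >= parse_version(minimum)

print(driver_is_new_enough("11.2.0", "455.45.01-7"))  # driver from this report
print(driver_is_new_enough("11.2.0", "460.32.03-1"))  # driver that closed it
```

Comparing integer tuples rather than raw strings avoids surprises with zero-padded components such as "04".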

Comment by Michael (ZeroBeat) - Thursday, 17 December 2020, 08:41 GMT

Soon we will see the 5.10 LTS kernel and nvidia 460.27. So the best solution (for me) is to add cuda to "IgnorePkg" until the new versions arrive.

Comment by JumaX9 (jumax9) - Thursday, 17 December 2020, 14:17 GMT

I'm seeing the same problems. In my case most of my Tensorflow code ran fine, but some tests failed, which I traced back to this problem. The suggested solution (--ignore cuda when updating) works. I don't like it that much, but I guess the Linux driver will be out of beta soonish.

Comment by Michael (ZeroBeat) - Thursday, 17 December 2020, 15:36 GMT

I wonder why NVIDIA released CUDA 11.2 before the release of driver 460.27.04 (with regard to the PTX ISA version change 7.1 -> 7.2).

Comment by Michael (ZeroBeat) - Thursday, 17 December 2020, 17:15 GMT

BTW: The issue is confirmed on other distros, too:
adds uvm kernel module support for Kernel >= 5.9; which is reenabled now, i.e. things like CUDA are working again with kernels >= 5.9
https://opensuse.pkgs.org/15.2/nvidia-x86_64/nvidia-computeG05-460.27.04-lp152.33.1.x86_64.rpm.html

That leads me to assume that we (Arch) are not the only ones running into this issue.

Stay healthy,
cheers
Mike

Comment by Sven-Hendrik Haase (Svenstaro) - Friday, 18 December 2020, 01:31 GMT

Can you try the beta driver and see whether it actually improves things?

Comment by Michael (ZeroBeat) - Friday, 18 December 2020, 08:22 GMT

For me it doesn't work. 460.27.04 boots into a black screen with a flashing cursor at the top left. Maybe I missed something while building the 460.27 packages, which ended in a "dependency hell". It looks like the driver needs more attention than expected.

Comment by Michael (ZeroBeat) - Friday, 18 December 2020, 08:33 GMT

Maybe my approach (modifying your PKGBUILDs for nvidia, nvidia-utils and opencl-nvidia) was too simple and was bound to fail here. It has worked in the past, but unfortunately not with 460.27.04.
I tried that on 5.9.14-arch1-1. Maybe 460.27.04 works better on 5.10.1 - but I'm not sure.

Comment by Sven-Hendrik Haase (Svenstaro) - Friday, 18 December 2020, 10:17 GMT

What you did should work, as that's pretty much what I do too. Anyway, now I'm not too keen on putting the beta drivers into the repos.

Comment by Michael (ZeroBeat) - Friday, 18 December 2020, 10:31 GMT

Yes, for sure, putting a beta driver into the repos is definitely not a good idea.
But I will continue to test the driver in combination with 5.10.1.
Maybe I'll be able to figure out what went wrong.
BTW:
Your PKGBUILDs are excellent. They worked like a charm before.

Comment by Michael (ZeroBeat) - Saturday, 19 December 2020, 10:43 GMT

@Svenstaro
It looks like the combination kernel 5.10.1 -> nvidia 460.27.04 is working much better than kernel 5.9.14 -> 460.27.04.
We can assume that, once the final driver is released, it will work fine with kernel 5.10, so you shouldn't waste your time trying to get it to work on 5.9.xx.

Unfortunately there are still some issues with the notebooks I have to deal with:
ASUS (TUF gaming) notebook: AMD integrated GPU + NVIDIA PCIe card (GTX 1650)
ASUS notebook: Intel integrated GPU + NVIDIA PCIe card (M940)

After turning on the notebook, it sometimes takes more than 5 reboots until the NVIDIA card is detected and I don't run into a black screen.
But I'm not sure if this issue is really related to the beta driver or to my xorg configs (attached; maybe I'm too stupid to generate a correct one and you have a better idea).

20-nvidia.conf.amd_nvidia:
Section "Device"
Identifier "nvidia"
Driver "nvidia"
BusID "PCI:1:0:0"
VendorName "NVIDIA Corporation"
Option "NoLogo" "1"
Option "Interactive" "0"
Option "Coolbits" "12"
Option "AllowEmptyInitialConfiguration"
EndSection

Section "Device"
Identifier "amd"
Driver "amdgpu"
BusID "PCI:5:0:0"
EndSection

Section "Screen"
Identifier "amd"
Device "amd"
EndSection

20-nvidia.conf.intel_nvidia:
Section "Device"
Identifier "nvidia"
Driver "nvidia"
BusID "PCI:1:0:0"
VendorName "NVIDIA Corporation"
Option "NoLogo" "1"
Option "Interactive" "0"
Option "Coolbits" "12"
Option "AllowEmptyInitialConfiguration"
EndSection

Section "Device"
Identifier "intel"
Driver "modesetting"
EndSection

Section "Screen"
Identifier "intel"
Device "intel"
EndSection

Stay healthy
cheers
Mike

Comment by Michael (ZeroBeat) - Sunday, 20 December 2020, 12:41 GMT

@Svenstaro
At last I found the issue. Due to fast SSDs my notebooks are booting too fast, and I have to "slow them down" during boot to prevent systemd from attempting to start the display manager before the NVIDIA driver has fully initialized.
After adding a udev rule, the combination of kernel 5.10.1 and nvidia 460.27.04 is working fine.
Now we can wait until the final nvidia 460.27 is released and 5.10 leaves testing.

Comment by Jakub Klinkovský (lahwaacz) - Sunday, 27 December 2020, 14:22 GMT

I was able to reproduce this issue only when I compiled my CUDA program for a GPU architecture that does not match the actual hardware, in which case the CUDA runtime takes the embedded PTX and JIT-compiles it. For example, if I compile with "-arch sm_61" instead of "-arch sm_75", I get an error, but compiling directly with "-arch sm_75" works. This explains why svenstaro's tensorflow and pytorch work fine - they are compiled explicitly for many (all?) GPU architectures, so JIT was not invoked for svenstaro's hardware.
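
The behaviour described in this comment can be sketched as a small decision model. This is a deliberate simplification, not the actual loader logic: the function name is hypothetical, and it ignores binary compatibility between related architectures. The PTX limit of 7.1 for driver 455.45 is taken from the error log in the report.

```python
def load_kernel(cubin_archs, ptx_isa, device_arch, driver_max_ptx):
    """Simplified model of how code is chosen for a device.

    cubin_archs    -- set of sm_XX numbers with precompiled machine code
    ptx_isa        -- (major, minor) PTX ISA version of the embedded PTX
    device_arch    -- sm_XX number of the installed GPU
    driver_max_ptx -- newest PTX ISA version the driver's JIT accepts
    """
    if device_arch in cubin_archs:
        return "native"  # matching machine code exists, no JIT needed
    if ptx_isa <= driver_max_ptx:
        return "jit"     # driver compiles the embedded PTX for this GPU
    return "error"       # PTX is newer than the driver's JIT understands

# Built with "-arch sm_61", run on an sm_75 GPU with driver 455.45 (PTX 7.1):
print(load_kernel({61}, (7, 2), 75, (7, 1)))
# Built with "-arch sm_75" on the same machine:
print(load_kernel({75}, (7, 2), 75, (7, 1)))
```

In this model only the JIT path depends on the driver's PTX version, which is consistent with binaries built for many architectures continuing to work.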

Comment by Michael (ZeroBeat) - Thursday, 07 January 2021, 17:07 GMT

Fixed by nvidia 460.32.03-1
We can close this report.
Thanks.
Happy new year,
cheers
Mike
