FS#74738 - [nvidia-dkms] 515.43.04-1 crashes

Attached to Project: Arch Linux
Opened by Frederick Zhang (FrederickZh) - Saturday, 14 May 2022, 10:37 GMT
Last edited by Sven-Hendrik Haase (Svenstaro) - Sunday, 05 June 2022, 23:29 GMT
Task Type Bug Report
Category Packages: Extra
Status Closed
Assigned To Sven-Hendrik Haase (Svenstaro)
Architecture All
Severity Low
Priority Normal
Reported Version
Due in Version Undecided
Due Date Undecided
Percent Complete 100%
Votes 3
Private No

Details

Description:

After upgrading to 515.43.04-1 from 510.68.02-1, my displays stopped lighting up.

I'm using a GTX980M with an i7 processor.

Additional info:
* kernel: linux 5.17.7.arch1-1 & linux-lts 5.15.39-1

```
May 14 19:28:37 FredArch kernel: divide error: 0000 [#1] SMP PTI
May 14 19:28:37 FredArch kernel: CPU: 4 PID: 190 Comm: nv_queue Tainted: P OE 5.15.39-1-lts #1 eb282472148793153a3a89b82d9c28dbb21a0873
May 14 19:28:37 FredArch kernel: Hardware name: Micro-Star International Co., Ltd. GT72S 6QE/MS-1782, BIOS E1782IMS.122 03/15/2018
May 14 19:28:37 FredArch kernel: RIP: 0010:_nv014304rm+0x4a/0x70 [nvidia]
May 14 19:28:37 FredArch kernel: Code: f7 ff 4c 89 e7 ba 11 00 00 00 44 89 fe 89 c3 e8 cc 4a f7 ff 85 c0 41 89 c4 74 25 89 d8 31 d2 41 f7 f4 31 d2 5b 41 5c 83 e8 01 <41> f7 f6 8d 44 00 02 41 89 45 00 31 c0 41 5d 41 5e 41 5f c3 66 90
May 14 19:28:37 FredArch kernel: RSP: 0018:ffffb8394071bce8 EFLAGS: 00010202
May 14 19:28:37 FredArch kernel: RAX: 0000000000002532 RBX: ffff8bb3c09d0008 RCX: 0000000000000000
May 14 19:28:37 FredArch kernel: RDX: 0000000000000000 RSI: 0000000000000004 RDI: ffff8bb3d2bc0008
May 14 19:28:37 FredArch kernel: RBP: ffff8bb3cf0dac90 R08: ffff8bb3d270d428 R09: ffff8bb3cf0dac78
May 14 19:28:37 FredArch kernel: R10: ffff8bb3c09d0008 R11: ffff8bb3cf848aa0 R12: ffff8bb3d2ba0008
May 14 19:28:37 FredArch kernel: R13: ffff8bb3d270d428 R14: 0000000000000000 R15: 0000000000000004
May 14 19:28:37 FredArch kernel: FS: 0000000000000000(0000) GS:ffff8bbb5ed00000(0000) knlGS:0000000000000000
May 14 19:28:37 FredArch kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
May 14 19:28:37 FredArch kernel: CR2: 000000c002a71000 CR3: 00000001ca210004 CR4: 00000000003706e0
May 14 19:28:37 FredArch kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
May 14 19:28:37 FredArch kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
May 14 19:28:37 FredArch kernel: Call Trace:
May 14 19:28:37 FredArch kernel: <TASK>
May 14 19:28:37 FredArch kernel: ? _nv014451rm+0x1d3/0x390 [nvidia 55525fc83019924e38a7929dab805883a43a02e1]
May 14 19:28:37 FredArch kernel: ? _nv017617rm+0x52c/0x810 [nvidia 55525fc83019924e38a7929dab805883a43a02e1]
May 14 19:28:37 FredArch kernel: audit: type=1130 audit(1652520517.390:67): pid=1 uid=0 auid=4294967295 ses=4294967295 msg='unit=user@1000 comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
May 14 19:28:37 FredArch kernel: ? _nv016892rm+0x3be/0x760 [nvidia 55525fc83019924e38a7929dab805883a43a02e1]
May 14 19:28:37 FredArch kernel: audit: type=1131 audit(1652520517.390:68): pid=1 uid=0 auid=4294967295 ses=4294967295 msg='unit=systemd-rfkill comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
May 14 19:28:37 FredArch kernel: ? _nv017104rm+0x2b/0x80 [nvidia 55525fc83019924e38a7929dab805883a43a02e1]
May 14 19:28:37 FredArch kernel: ? _nv009907rm+0xb5/0x190 [nvidia 55525fc83019924e38a7929dab805883a43a02e1]
May 14 19:28:37 FredArch kernel: ? rm_execute_work_item+0x108/0x120 [nvidia 55525fc83019924e38a7929dab805883a43a02e1]
May 14 19:28:37 FredArch kernel: ? os_execute_work_item+0x45/0x60 [nvidia 55525fc83019924e38a7929dab805883a43a02e1]
May 14 19:28:37 FredArch kernel: ? _main_loop+0x8f/0x150 [nvidia 55525fc83019924e38a7929dab805883a43a02e1]
May 14 19:28:37 FredArch kernel: ? nvidia_modeset_resume+0x20/0x20 [nvidia 55525fc83019924e38a7929dab805883a43a02e1]
May 14 19:28:37 FredArch kernel: ? kthread+0x117/0x140
May 14 19:28:37 FredArch kernel: ? set_kthread_struct+0x40/0x40
May 14 19:28:37 FredArch kernel: ? ret_from_fork+0x22/0x30
May 14 19:28:37 FredArch kernel: </TASK>
May 14 19:28:37 FredArch kernel: Modules linked in: cmac algif_hash algif_skcipher af_alg bnep intel_rapl_msr intel_rapl_common intel_tcc_cooling x86_pkg_temp_thermal intel_powerclamp joydev coretemp nls_iso8859_1 kvm_intel snd_hda_codec_realtek kvm ext4 iwlmvm snd_hda_codec_generic ledtrig_audio irqbypass iTCO_wdt snd_hda_codec_hdmi crc32c_generic rapl intel_pmc_bxt iTCO_vendor_support ee1004 msi_wmi mbcache mac80211 mei_hdcp snd_hda_intel intel_cstate intel_wmi_thunderbolt sparse_keymap wmi_bmof mxm_wmi libarc4 jbd2 btusb snd_intel_dspcfg intel_uncore snd_intel_sdw_acpi btrtl snd_hda_codec btbcm iwlwifi btintel snd_hda_core snd_hwdep pcspkr snd_pcm alx i2c_i801 psmouse thunderbolt cfg80211 i2c_smbus mdio bluetooth snd_timer mei_me ecdh_generic snd rfkill mousedev crc16 soundcore mei intel_pch_thermal wmi video acpi_pad mac_hid vfat fat dm_multipath ipheth sg crypto_user fuse bpf_preload ip_tables x_tables zfs(POE) zunicode(POE) zzstd(OE) zlua(OE) zavl(POE) icp(POE) zcommon(POE) znvpair(POE) spl(OE)
May 14 19:28:37 FredArch kernel: hid_logitech_hidpp hid_logitech_dj dm_crypt cbc encrypted_keys dm_mod trusted asn1_encoder tee tpm rng_core hid_gt683r usbhid rtsx_pci_sdmmc serio_raw mmc_core atkbd libps2 crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel aesni_intel crypto_simd cryptd sr_mod rtsx_pci xhci_pci cdrom xhci_pci_renesas i8042 serio nvidia_drm(POE) nvidia_uvm(POE) nvidia_modeset(POE) nvidia(POE)
May 14 19:28:37 FredArch kernel: ---[ end trace 1353d3a96393c288 ]---
May 14 19:28:37 FredArch kernel: RIP: 0010:_nv014304rm+0x4a/0x70 [nvidia]
May 14 19:28:37 FredArch kernel: Code: f7 ff 4c 89 e7 ba 11 00 00 00 44 89 fe 89 c3 e8 cc 4a f7 ff 85 c0 41 89 c4 74 25 89 d8 31 d2 41 f7 f4 31 d2 5b 41 5c 83 e8 01 <41> f7 f6 8d 44 00 02 41 89 45 00 31 c0 41 5d 41 5e 41 5f c3 66 90
May 14 19:28:37 FredArch kernel: RSP: 0018:ffffb8394071bce8 EFLAGS: 00010202
May 14 19:28:37 FredArch kernel: RAX: 0000000000002532 RBX: ffff8bb3c09d0008 RCX: 0000000000000000
May 14 19:28:37 FredArch kernel: RDX: 0000000000000000 RSI: 0000000000000004 RDI: ffff8bb3d2bc0008
May 14 19:28:37 FredArch kernel: RBP: ffff8bb3cf0dac90 R08: ffff8bb3d270d428 R09: ffff8bb3cf0dac78
May 14 19:28:37 FredArch kernel: R10: ffff8bb3c09d0008 R11: ffff8bb3cf848aa0 R12: ffff8bb3d2ba0008
May 14 19:28:37 FredArch kernel: R13: ffff8bb3d270d428 R14: 0000000000000000 R15: 0000000000000004
May 14 19:28:37 FredArch kernel: FS: 0000000000000000(0000) GS:ffff8bbb5ed00000(0000) knlGS:0000000000000000
May 14 19:28:37 FredArch kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
May 14 19:28:37 FredArch kernel: CR2: 000000c002a71000 CR3: 00000001ca210004 CR4: 00000000003706e0
May 14 19:28:37 FredArch kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
May 14 19:28:37 FredArch kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
```
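The faulting instruction can be read from the `Code:` line above: the bytes at the `<..>` marker are `41 f7 f6`, which decode to an unsigned 32-bit `div` by `r14d`, and the register dump shows R14 as zero. The following is a minimal sketch of that decoding, not a general disassembler: it only handles the register-operand `F7 /6` form with an optional REX prefix, and the function names are hypothetical.

```python
import re

# 32-bit register names indexed by the 4-bit register number (REX.B supplies the high bit).
REG32 = ["eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"] + [
    f"r{n}d" for n in range(8, 16)
]


def decode_faulting_div(code_line: str):
    """Return the divisor register of a 32-bit `div reg` at the <..> marker, else None."""
    tokens = code_line.split()
    # The kernel wraps the first byte of the faulting instruction in <..>.
    start = next((i for i, t in enumerate(tokens) if t.startswith("<")), None)
    if start is None:
        return None
    raw = [int(t.strip("<>"), 16) for t in tokens[start:start + 3]]
    rex_b = 0
    if raw and 0x40 <= raw[0] <= 0x4F:  # REX prefix
        if raw[0] & 0x08:               # REX.W set: 64-bit operand, not handled here
            return None
        rex_b = raw[0] & 0x01
        raw = raw[1:]
    if len(raw) < 2 or raw[0] != 0xF7:
        return None
    modrm = raw[1]
    mod, reg, rm = modrm >> 6, (modrm >> 3) & 7, modrm & 7
    if mod != 3 or reg != 6:            # only register-direct DIV (/6)
        return None
    return REG32[rm + 8 * rex_b]


def parse_registers(dump: str) -> dict:
    """Collect 64-bit register values such as 'R14: 0000000000000000' from an oops dump."""
    pairs = re.findall(r"\b(R[A-Z0-9]{2}): ([0-9a-f]{16})", dump)
    return {name.lower(): int(value, 16) for name, value in pairs}


# Excerpts copied from the trace above.
code = "83 e8 01 <41> f7 f6 8d 44 00 02"
regs = parse_registers("R13: ffff8bb3d270d428 R14: 0000000000000000 R15: 0000000000000004")
divisor = decode_faulting_div(code)
# For r8d..r15d, dropping the trailing 'd' gives the 64-bit register name.
print(divisor, hex(regs[divisor[:-1]]))
```

This only confirms that the divisor register was zero when the fault was raised; why the driver computed a zero there cannot be determined from the trace.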

Closed by Sven-Hendrik Haase (Svenstaro)
Sunday, 05 June 2022, 23:29 GMT
Reason for closing: Fixed
Additional comments about closing: New stable nvidia driver is in [extra].
Comment by Lalit Maganti (lalitm) - Saturday, 14 May 2022, 14:24 GMT
I've observed the same behavior with both nvidia and nvidia-dkms packages with a GTX 1050Ti. My RTX 3070Ti system however works correctly.
Comment by Frederick Zhang (FrederickZh) - Saturday, 14 May 2022, 16:11 GMT
Is it possible that they accidentally ported some sort of patch to the proprietary driver and removed support for Maxwell/Pascal architectures lol? (My 980M is Maxwell and 1050Ti is Pascal. The open-source driver doesn't support these architectures https://github.com/NVIDIA/open-gpu-kernel-modules/issues/19.)
Comment by Darrell (denns) - Saturday, 14 May 2022, 18:53 GMT
The 515 driver is the new alpha/beta open source driver that was just announced. I think it's a bad idea for Arch to switch to that driver immediately. Nvidia themselves have said "In this open-source release, support for GeForce and Workstation GPUs is alpha-quality." It also only supports Turing and Ampere - anything older (including the GeForce 10xx series) is not supported. The driver download page lists this driver as "beta". As it currently stands, this update will break graphics for a lot of users.

I think Arch's nvidia packages should be reverted back to the latest production driver (currently 510.68.02). The new open source 515 driver should be in testing or as a separate package entirely.
Comment by Sven-Hendrik Haase (Svenstaro) - Saturday, 14 May 2022, 20:13 GMT
I usually refuse to package beta drivers if at all possible but in this case, CUDA 11.7 only supports this new driver and nvidia-open also needs the new nvidia-utils to work well. It was a quagmire to be honest. The supported products on this page: https://www.nvidia.com/Download/driverResults.aspx/187826/en-us definitely list many older GPUs and so I don't think the code supporting older GPUs was actually excluded. In fact, I'm running a 1050 Ti in one machine with this driver right now and I don't see issues.
Comment by Darrell (denns) - Saturday, 14 May 2022, 20:53 GMT
I looked into it further, and they actually include both drivers. The open source one is only used if you build kernel-open and (for GeForce GPUs) set the NVreg_OpenRmEnableUnsupportedGpus=1 kernel module parameter. That explains why the older cards are still functioning. So this is indeed still the closed source driver - just a beta version of it. I have confirmed that it does seem to run OK with an old Quadro K2200 (Maxwell) I have.

From http://us.download.nvidia.com/XFree86/Linux-x86_64/515.43.04/README/kernel_open.html:
Most features of the Linux GPU driver are supported with the open flavor of kernel modules, including CUDA, Vulkan, OpenGL, OptiX, and X11. However, in the current release, some display and graphics features (notably: G-SYNC, Quadro Sync, SLI, Stereo, rotation in X11, and YUV 4:2:0 on Turing), as well as power management, and NVIDIA virtual GPU (vGPU), are not yet supported. These features will be added in upcoming driver releases.

Use of the open kernel modules on GeForce and Workstation GPUs should be considered alpha-quality in this release due to the missing features listed above. To enable use of the open kernel modules on GeForce and Workstation GPUs, set the "NVreg_OpenRmEnableUnsupportedGpus" nvidia.ko kernel module parameter to 1. E.g.,

modprobe nvidia NVreg_OpenRmEnableUnsupportedGpus=1

or, in an /etc/modprobe.d/ configuration file:

options nvidia NVreg_OpenRmEnableUnsupportedGpus=1

The need for this kernel module parameter will be removed in a future release once performance and functionality in the open kernel modules matures and meets or exceeds that of the proprietary kernel modules.

Though the kernel modules in the two flavors are different, they are based on the same underlying source code. The two flavors are mutually exclusive: they cannot be used within the kernel at the same time, and they should not be installed on the filesystem at the same time.
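Because both flavors install a module with the same name, a quick way to tell which one is present is the module's license field. The sketch below assumes that the proprietary module reports the license string `NVIDIA` and the open module reports `Dual MIT/GPL`; those strings and the helper names are assumptions, not something stated in the README quoted above.

```python
import subprocess


def classify_flavor(license_field: str) -> str:
    """Map a modinfo license string to a driver flavor (the strings are assumed, see above)."""
    text = license_field.strip()
    if text == "NVIDIA":
        return "proprietary"
    if "MIT" in text and "GPL" in text:
        return "open"
    return "unknown"


def installed_flavor(module: str = "nvidia") -> str:
    """Ask modinfo for the license field of the installed module."""
    try:
        result = subprocess.run(
            ["modinfo", "-F", "license", module],
            capture_output=True, text=True, check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        # modinfo is missing, or the module is not installed for the running kernel.
        return "unknown"
    return classify_flavor(result.stdout)


if __name__ == "__main__":
    print(installed_flavor())
```

Splitting the classification from the `modinfo` call keeps the string check usable on saved output from another machine.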
Comment by Frederick Zhang (FrederickZh) - Sunday, 15 May 2022, 12:13 GMT
@Svenstaro [1] says CUDA 11.7.x is compatible with Linux driver >=450.80.02, though CUDA has to run in 'compatibility mode'.

I don't use CUDA so I'm not entirely sure about what the implications here are (missing features? performance drawbacks?), but considering CUDA obviously has a smaller user base, can we prioritise stability for the wider crowd?

If CUDA users would like to experiment with beta drivers, could we have a separate package like nvidia-dkms-beta? Another point is that I guess some CUDA users are running headless machines, so even if something breaks, it won't hurt them as much as it hurts regular desktop users (SSH fix vs LiveUSB chroot).

[1] https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html
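The driver floor quoted from [1] can be checked mechanically. This is a minimal sketch that assumes driver versions are purely numeric dotted strings; the function names are hypothetical, and it says nothing about which features compatibility mode restricts.

```python
def parse_driver_version(version: str) -> tuple:
    """Turn '510.68.02' into (510, 68, 2) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


# Minimum Linux driver for CUDA 11.7.x as quoted in the comment above.
MIN_COMPAT_DRIVER = parse_driver_version("450.80.02")


def meets_compat_floor(driver_version: str) -> bool:
    """True if the driver is at or above the quoted compatibility floor."""
    return parse_driver_version(driver_version) >= MIN_COMPAT_DRIVER


for candidate in ("510.68.02", "515.43.04"):
    print(candidate, meets_compat_floor(candidate))
```

Both versions discussed in this task clear that floor, so the open question in the thread is what compatibility mode costs, not whether 510.68.02 can run CUDA 11.7 at all.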
Comment by Marcell Meszaros (MarsSeed) - Monday, 16 May 2022, 13:59 GMT
I'm an amateur on these topics, but I would also stick with cuda 11.6.2 if need be in order to keep the latest stable NVIDIA driver.

Though FWIW I've updated to cuda 11.7.0 from testing, and built AUR/ffmpeg-cuda for myself on NVIDIA driver 510.68.02 and it works fine with my old laptop GeForce GTX 950M (2015 model).

I've put Nvidia driver packages on the ignore list for now, because I want to avoid the beta driver.
Comment by Marcell Meszaros (MarsSeed) - Monday, 16 May 2022, 15:01 GMT
+1: [nvidia-lts] 515 is also already in [extra], which is something I especially wouldn't have recommended, given that the LTS kernel is preferred for more robust stability and not for bleeding edge.
Comment by Frederick Zhang (FrederickZh) - Tuesday, 17 May 2022, 12:56 GMT
@MarsSeed I think that's a slightly different topic. IIUC you were asking whether *-lts packages are meant to be the latest mainline versions that are compatible with the LTS kernel, or whether we should instead use LTS versions of the packages themselves if they exist.

For example I've developed a kernel module that somehow got picked up by a TU and now it's in Arch's official repo. At any time, I have 3 release lines for my kernel module:

- LTS (the one that seldom gets new features, mostly fixes and security patches)
- Mainline (considered stable enough for production, with new features)
- Beta ('bleeding edge', mainly for testing)

AFAIU you were asking whether, in this case, the TU should publish the mainline version of my kernel module for both linux and linux-lts, or mainline for linux and LTS for linux-lts instead. (In other words, should we intentionally publish different versions to improve the overall stability of *-lts packages?)

Now back to the NVIDIA issue. IMHO even if nvidia LTS driver existed, we should use Mainline/Production i.e. 510.68.02 atm for both linux and linux-lts, as Arch strives to deliver new features to users asap while maintaining a degree of stability.

Sorry this is a bit OT. I'm happy to continue this discussion in our mailing lists if needed.
Comment by Roy (df3yt) - Thursday, 19 May 2022, 13:27 GMT
I would like to add that using 515 gave me loads of issues: some games would not even open, and those that did displayed odd artifacts. As I had just bought a 3080, I had no idea it was driver related, since I assumed nvidia-dkms to be stable. I had to do a fresh install of PopOS, and when everything worked there I saw that it was on 510.

Please see this link

"Shouldn't we rename nvidia-dkms, nvidia-settings to beta?"
https://bbs.archlinux.org/viewtopic.php?id=276511

Comment by Frederick Zhang (FrederickZh) - Wednesday, 25 May 2022, 13:56 GMT
Sorry for the noise but I wonder if this is going anywhere?

Are we now just waiting for the next stable release and call it a day? Will we again ship beta drivers in the future?
Comment by Sven-Hendrik Haase (Svenstaro) - Thursday, 26 May 2022, 00:39 GMT
> Are we now just waiting for the next stable release and call it a day?

I waited a bit with this to be able to better gauge the impact range of the problem. At the moment it seems the problems aren't as widespread as previously thought so I think the best course of action is to indeed just wait for a new stable driver. The alternatives are:
1) Epoch the current drivers to an earlier version. This would work but it would cause some user confusion for sure and I'd have to remove the nvidia-open packages which would surely add to the confusion.
2) Add a set of nvidia-beta drivers or else name a new set of drivers "nvidia-mainline". This wouldn't do it I think as it wouldn't force pacman to install those drivers unless I also added a replaces line which is not warranted here.
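Option 1 works because pacman compares the epoch before anything else in a version string of the form `[epoch:]pkgver[-pkgrel]`, so an older driver with an epoch outranks a newer one without. The sketch below is a simplified illustration for purely numeric versions, not pacman's actual comparison algorithm; the helper name and the choice to treat a missing pkgrel as 0 are assumptions made for the example.

```python
def version_sort_key(full_version: str) -> tuple:
    """Split '[epoch:]pkgver[-pkgrel]' into a tuple that sorts epoch first."""
    if ":" in full_version:
        epoch, rest = full_version.split(":", 1)
    else:
        epoch, rest = "0", full_version  # no epoch means epoch 0
    if "-" in rest:
        pkgver, pkgrel = rest.rsplit("-", 1)
    else:
        pkgver, pkgrel = rest, "0"       # simplification for this sketch
    return (
        int(epoch),
        tuple(int(part) for part in pkgver.split(".")),
        tuple(int(part) for part in pkgrel.split(".")),
    )


current = "515.43.04-1"
downgrade_with_epoch = "1:510.68.02-1"

# With the epoch, the older driver counts as an upgrade over the beta.
print(version_sort_key(downgrade_with_epoch) > version_sort_key(current))
```

On an Arch system the `vercmp` tool shipped with pacman performs the real comparison and also handles non-numeric version parts.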

So in summary, I think the current strategy of "wait it out" is the least disruptive for the user base as a whole.

I'm sorry if you are currently distraught at the state of affairs but I think we need to go the least bad way here.

> Will we again ship beta drivers in the future?

As I have stated in multiple places now, I usually refuse to package beta drivers unless there are good arguments for packaging them anyway: when new hardware support for desktop-class GPUs is only available in the beta drivers, or when special things such as nvidia-open happen. This is quite rarely the case, and from this point forward I'm going to be even more hesitant to package nvidia beta stuff. I've also done it in the past so as to not block on kernel or Xorg updates.

Therefore, I think the answer is "Yes, but only very rarely when there's a good reason."
Comment by Jonathon (jonathon) - Thursday, 26 May 2022, 00:45 GMT
Also for reference: for those who need it, 510xx is in the AUR.
