FS#74738 - [nvidia-dkms] 515.43.04-1 crashes
Attached to Project:
Arch Linux
Opened by Frederick Zhang (FrederickZh) - Saturday, 14 May 2022, 10:37 GMT
Last edited by Sven-Hendrik Haase (Svenstaro) - Sunday, 05 June 2022, 23:29 GMT
Details
Description:
After upgrading to 515.43.04-1 from 510.68.02-1, my displays stopped lighting up. I'm using a GTX 980M with an i7 processor.

Additional info:
* kernel: linux 5.17.7.arch1-1 & linux-lts 5.15.39-1

```
May 14 19:28:37 FredArch kernel: divide error: 0000 [#1] SMP PTI
May 14 19:28:37 FredArch kernel: CPU: 4 PID: 190 Comm: nv_queue Tainted: P OE 5.15.39-1-lts #1 eb282472148793153a3a89b82d9c28dbb21a0873
May 14 19:28:37 FredArch kernel: Hardware name: Micro-Star International Co., Ltd. GT72S 6QE/MS-1782, BIOS E1782IMS.122 03/15/2018
May 14 19:28:37 FredArch kernel: RIP: 0010:_nv014304rm+0x4a/0x70 [nvidia]
May 14 19:28:37 FredArch kernel: Code: f7 ff 4c 89 e7 ba 11 00 00 00 44 89 fe 89 c3 e8 cc 4a f7 ff 85 c0 41 89 c4 74 25 89 d8 31 d2 41 f7 f4 31 d2 5b 41 5c 83 e8 01 <41> f7 f6 8d 44 00 02 41 89 45 00 31 c0 41 5d 41 5e 41 5f c3 66 90
May 14 19:28:37 FredArch kernel: RSP: 0018:ffffb8394071bce8 EFLAGS: 00010202
May 14 19:28:37 FredArch kernel: RAX: 0000000000002532 RBX: ffff8bb3c09d0008 RCX: 0000000000000000
May 14 19:28:37 FredArch kernel: RDX: 0000000000000000 RSI: 0000000000000004 RDI: ffff8bb3d2bc0008
May 14 19:28:37 FredArch kernel: RBP: ffff8bb3cf0dac90 R08: ffff8bb3d270d428 R09: ffff8bb3cf0dac78
May 14 19:28:37 FredArch kernel: R10: ffff8bb3c09d0008 R11: ffff8bb3cf848aa0 R12: ffff8bb3d2ba0008
May 14 19:28:37 FredArch kernel: R13: ffff8bb3d270d428 R14: 0000000000000000 R15: 0000000000000004
May 14 19:28:37 FredArch kernel: FS: 0000000000000000(0000) GS:ffff8bbb5ed00000(0000) knlGS:0000000000000000
May 14 19:28:37 FredArch kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
May 14 19:28:37 FredArch kernel: CR2: 000000c002a71000 CR3: 00000001ca210004 CR4: 00000000003706e0
May 14 19:28:37 FredArch kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
May 14 19:28:37 FredArch kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
May 14 19:28:37 FredArch kernel: Call Trace:
May 14 19:28:37 FredArch kernel: <TASK>
May 14 19:28:37 FredArch kernel: ? _nv014451rm+0x1d3/0x390 [nvidia 55525fc83019924e38a7929dab805883a43a02e1]
May 14 19:28:37 FredArch kernel: ? _nv017617rm+0x52c/0x810 [nvidia 55525fc83019924e38a7929dab805883a43a02e1]
May 14 19:28:37 FredArch kernel: audit: type=1130 audit(1652520517.390:67): pid=1 uid=0 auid=4294967295 ses=4294967295 msg='unit=user@1000 comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
May 14 19:28:37 FredArch kernel: ? _nv016892rm+0x3be/0x760 [nvidia 55525fc83019924e38a7929dab805883a43a02e1]
May 14 19:28:37 FredArch kernel: audit: type=1131 audit(1652520517.390:68): pid=1 uid=0 auid=4294967295 ses=4294967295 msg='unit=systemd-rfkill comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
May 14 19:28:37 FredArch kernel: ? _nv017104rm+0x2b/0x80 [nvidia 55525fc83019924e38a7929dab805883a43a02e1]
May 14 19:28:37 FredArch kernel: ? _nv009907rm+0xb5/0x190 [nvidia 55525fc83019924e38a7929dab805883a43a02e1]
May 14 19:28:37 FredArch kernel: ? rm_execute_work_item+0x108/0x120 [nvidia 55525fc83019924e38a7929dab805883a43a02e1]
May 14 19:28:37 FredArch kernel: ? os_execute_work_item+0x45/0x60 [nvidia 55525fc83019924e38a7929dab805883a43a02e1]
May 14 19:28:37 FredArch kernel: ? _main_loop+0x8f/0x150 [nvidia 55525fc83019924e38a7929dab805883a43a02e1]
May 14 19:28:37 FredArch kernel: ? nvidia_modeset_resume+0x20/0x20 [nvidia 55525fc83019924e38a7929dab805883a43a02e1]
May 14 19:28:37 FredArch kernel: ? kthread+0x117/0x140
May 14 19:28:37 FredArch kernel: ? set_kthread_struct+0x40/0x40
May 14 19:28:37 FredArch kernel: ? ret_from_fork+0x22/0x30
May 14 19:28:37 FredArch kernel: </TASK>
May 14 19:28:37 FredArch kernel: Modules linked in: cmac algif_hash algif_skcipher af_alg bnep intel_rapl_msr intel_rapl_common intel_tcc_cooling x86_pkg_temp_thermal intel_powerclamp joydev coretemp nls_iso8859_1 kvm_intel snd_hda_codec_realtek kvm ext4 iwlmvm snd_hda_codec_generic ledtrig_audio irqbypass iTCO_wdt snd_hda_codec_hdmi crc32c_generic rapl intel_pmc_bxt iTCO_vendor_support ee1004 msi_wmi mbcache mac80211 mei_hdcp snd_hda_intel intel_cstate intel_wmi_thunderbolt sparse_keymap wmi_bmof mxm_wmi libarc4 jbd2 btusb snd_intel_dspcfg intel_uncore snd_intel_sdw_acpi btrtl snd_hda_codec btbcm iwlwifi btintel snd_hda_core snd_hwdep pcspkr snd_pcm alx i2c_i801 psmouse thunderbolt cfg80211 i2c_smbus mdio bluetooth snd_timer mei_me ecdh_generic snd rfkill mousedev crc16 soundcore mei intel_pch_thermal wmi video acpi_pad mac_hid vfat fat dm_multipath ipheth sg crypto_user fuse bpf_preload ip_tables x_tables zfs(POE) zunicode(POE) zzstd(OE) zlua(OE) zavl(POE) icp(POE) zcommon(POE) znvpair(POE) spl(OE)
May 14 19:28:37 FredArch kernel: hid_logitech_hidpp hid_logitech_dj dm_crypt cbc encrypted_keys dm_mod trusted asn1_encoder tee tpm rng_core hid_gt683r usbhid rtsx_pci_sdmmc serio_raw mmc_core atkbd libps2 crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel aesni_intel crypto_simd cryptd sr_mod rtsx_pci xhci_pci cdrom xhci_pci_renesas i8042 serio nvidia_drm(POE) nvidia_uvm(POE) nvidia_modeset(POE) nvidia(POE)
May 14 19:28:37 FredArch kernel: ---[ end trace 1353d3a96393c288 ]---
May 14 19:28:37 FredArch kernel: RIP: 0010:_nv014304rm+0x4a/0x70 [nvidia]
May 14 19:28:37 FredArch kernel: Code: f7 ff 4c 89 e7 ba 11 00 00 00 44 89 fe 89 c3 e8 cc 4a f7 ff 85 c0 41 89 c4 74 25 89 d8 31 d2 41 f7 f4 31 d2 5b 41 5c 83 e8 01 <41> f7 f6 8d 44 00 02 41 89 45 00 31 c0 41 5d 41 5e 41 5f c3 66 90
May 14 19:28:37 FredArch kernel: RSP: 0018:ffffb8394071bce8 EFLAGS: 00010202
May 14 19:28:37 FredArch kernel: RAX: 0000000000002532 RBX: ffff8bb3c09d0008 RCX: 0000000000000000
May 14 19:28:37 FredArch kernel: RDX: 0000000000000000 RSI: 0000000000000004 RDI: ffff8bb3d2bc0008
May 14 19:28:37 FredArch kernel: RBP: ffff8bb3cf0dac90 R08: ffff8bb3d270d428 R09: ffff8bb3cf0dac78
May 14 19:28:37 FredArch kernel: R10: ffff8bb3c09d0008 R11: ffff8bb3cf848aa0 R12: ffff8bb3d2ba0008
May 14 19:28:37 FredArch kernel: R13: ffff8bb3d270d428 R14: 0000000000000000 R15: 0000000000000004
May 14 19:28:37 FredArch kernel: FS: 0000000000000000(0000) GS:ffff8bbb5ed00000(0000) knlGS:0000000000000000
May 14 19:28:37 FredArch kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
May 14 19:28:37 FredArch kernel: CR2: 000000c002a71000 CR3: 00000001ca210004 CR4: 00000000003706e0
May 14 19:28:37 FredArch kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
May 14 19:28:37 FredArch kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
```
This task depends upon
Closed by Sven-Hendrik Haase (Svenstaro)
Sunday, 05 June 2022, 23:29 GMT
Reason for closing: Fixed
Additional comments about closing: New stable nvidia driver is in [extra].
I think Arch's nvidia packages should be reverted to the latest production driver (currently 510.68.02). The new open-source 515 driver should go to [testing] or be a separate package entirely.
From http://us.download.nvidia.com/XFree86/Linux-x86_64/515.43.04/README/kernel_open.html:
Most features of the Linux GPU driver are supported with the open flavor of kernel modules, including CUDA, Vulkan, OpenGL, OptiX, and X11. However, in the current release, some display and graphics features (notably: G-SYNC, Quadro Sync, SLI, Stereo, rotation in X11, and YUV 4:2:0 on Turing), as well as power management, and NVIDIA virtual GPU (vGPU), are not yet supported. These features will be added in upcoming driver releases.
Use of the open kernel modules on GeForce and Workstation GPUs should be considered alpha-quality in this release due to the missing features listed above. To enable use of the open kernel modules on GeForce and Workstation GPUs, set the "NVreg_OpenRmEnableUnsupportedGpus" nvidia.ko kernel module parameter to 1. E.g.,
modprobe nvidia NVreg_OpenRmEnableUnsupportedGpus=1
or, in an /etc/modprobe.d/ configuration file:
options nvidia NVreg_OpenRmEnableUnsupportedGpus=1
The need for this kernel module parameter will be removed in a future release once performance and functionality in the open kernel modules matures and meets or exceeds that of the proprietary kernel modules.
Though the kernel modules in the two flavors are different, they are based on the same underlying source code. The two flavors are mutually exclusive: they cannot be used within the kernel at the same time, and they should not be installed on the filesystem at the same time.
I don't use CUDA, so I'm not entirely sure what the implications are here (missing features? performance drawbacks?), but considering CUDA obviously has a smaller user base, can we prioritise stability for the wider crowd?
If CUDA users would like to experiment with beta drivers, we could have a separate package like nvidia-dkms-beta. Another point is that I'd guess some CUDA users run headless machines, so even if something breaks, recovery is far less painful than for regular desktop users (an SSH fix versus a LiveUSB chroot).
[1] https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html
Though FWIW I've updated to cuda 11.7.0 from testing, and built AUR/ffmpeg-cuda for myself on NVIDIA driver 510.68.02 and it works fine with my old laptop GeForce GTX 950M (2015 model).
I've put Nvidia driver packages on the ignore list for now, because I want to avoid the beta driver.
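For anyone else holding back, pinning the driver packages is done with the IgnorePkg directive in /etc/pacman.conf; the package names below are for the dkms variant (adjust to whichever set you actually have installed):

```ini
# /etc/pacman.conf -- skip nvidia driver updates until a stable release lands
[options]
IgnorePkg = nvidia-dkms nvidia-utils nvidia-settings opencl-nvidia
```

pacman will then skip these during `pacman -Syu` and print a warning that they are being ignored.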
For example, I've developed a kernel module that got picked up by a TU and is now in Arch's official repos. At any given time, I have three release lines for my kernel module:
- LTS (the one that seldom gets new features, mostly fixes and security patches)
- Mainline (considered stable enough for production, with new features)
- Beta ('bleeding edge', mainly for testing)
AFAIU you were asking whether, in this case, the TU should publish the mainline version of my kernel module for both linux and linux-lts, or mainline for linux and LTS for linux-lts instead. (In other words, should we intentionally publish different versions to improve the overall stability of *-lts packages?)
Now back to the NVIDIA issue. IMHO, even if an nvidia LTS driver existed, we should currently use Mainline/Production (i.e. 510.68.02) for both linux and linux-lts, as Arch strives to deliver new features to users asap while maintaining a degree of stability.
Sorry this is a bit OT. I'm happy to continue this discussion in our mailing lists if needed.
Please see this link
"Shouldn't we rename nvidia-dkms, nvidia-settings to beta?"
https://bbs.archlinux.org/viewtopic.php?id=276511
Are we now just waiting for the next stable release and calling it a day? Will we ship beta drivers again in the future?
I waited a bit before replying to this so I could better gauge the impact of the problem. At the moment the problems don't seem as widespread as previously thought, so I think the best course of action is indeed to just wait for a new stable driver. The alternatives are:
1) Epoch the current drivers to an earlier version. This would work but it would cause some user confusion for sure and I'd have to remove the nvidia-open packages which would surely add to the confusion.
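Mechanically, option 1 works because pacman treats any explicit epoch as newer than no epoch at all, so a reverted 510 package with an epoch would outrank the already-shipped 515 package. A sketch of what the PKGBUILD change would look like (illustrative fragment, not an actual proposal):

```bash
# PKGBUILD fragment -- adding an epoch makes the reverted 510 package
# compare as newer than the 515 package already in the repos:
#   vercmp 1:510.68.02-1 515.43.04-1   ->  1  (left side wins)
epoch=1
pkgver=510.68.02
pkgrel=1
```

The catch, as noted above, is that the epoch is sticky: every future release must keep it, which is a permanent cost for a one-off revert.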
2) Add a set of nvidia-beta drivers, or name a new set of drivers "nvidia-mainline". I don't think this would do it, as it wouldn't force pacman to install those drivers unless I also added a replaces line, which isn't warranted here.
So in summary, I think the current strategy of "wait it out" is the least disruptive for the user base as a whole.
I'm sorry if you are currently distraught at the state of affairs but I think we need to go the least bad way here.
> Will we again ship beta drivers in the future?
As I have stated in multiple places now, I usually refuse to package beta drivers unless there are good arguments for packaging them anyway: when new hardware support for desktop-class GPUs is only available in the beta drivers, or when special things such as nvidia-open happen. This is quite rare, and from this point forward I'm going to be even more hesitant to package nvidia beta stuff. I've also done it in the past so as not to block on kernel or Xorg updates.
Therefore, I think the answer is "Yes, but only very rarely when there's a good reason."