FS#75986 - Nvidia driver freezing since linux 5.19.9

Attached to Project: Arch Linux
Opened by Michele (mikefender) - Friday, 23 September 2022, 11:11 GMT
Last edited by Toolybird (Toolybird) - Monday, 26 September 2022, 21:27 GMT
Task Type Bug Report
Category Kernel
Status Closed
Assigned To No-one
Architecture x86_64
Severity High
Priority Normal
Reported Version
Due in Version Undecided
Due Date Undecided
Percent Complete 100%
Votes 3
Private No

Details

Description:

Additional info:
* linux-5.19.9, nvidia-515.65.01-14 (and higher)
* reproduced on a PRIME system
* Dell XPS 9570
* NVIDIA GTX 1050 Ti
* Intel UHD Graphics 630
* Wayland compositor: Sway

Steps to reproduce: on a PRIME system, turn the dGPU on, then use "prime-run" to run "vkcube" or any other application that uses the GPU.

The issue is not present with linux-5.19.8 and nvidia-515.65.01-13 or lower versions (linux-lts/nvidia-lts works too).

Attached kernel log showing the nvidia driver exceptions.

Closed by  Toolybird (Toolybird)
Monday, 26 September 2022, 21:27 GMT
Reason for closing:  Fixed
Additional comments about closing:  linux 5.19.11.arch1-1
Comment by Michele (mikefender) - Friday, 23 September 2022, 20:53 GMT
I've just realized that I totally forgot to add the description, and it doesn't seem to be possible to edit the task, so I'll just post it here.

Basically, when running a 3D application on the discrete GPU, the application doesn't start; the process hangs forever and cannot even be killed with SIGKILL. When attempting to reboot the system, systemd waits forever for that process to end, preventing shutdown/reboot.
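
A process that ignores SIGKILL is usually blocked in uninterruptible sleep inside the kernel (state "D" in /proc/<pid>/stat). The report does not state the process state, so that is an assumption here. Under that assumption, a minimal sketch for listing such processes could look like this; the function names are made up for illustration.

```python
from __future__ import annotations

from pathlib import Path


def parse_state(stat_line: str) -> tuple[int, str, str]:
    """Return (pid, comm, state) from one /proc/<pid>/stat line."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the first '(' and the last ')'.
    head, _, rest = stat_line.partition("(")
    comm, _, tail = rest.rpartition(")")
    return int(head), comm, tail.split()[0]


def uninterruptible_processes(proc_root: str = "/proc") -> list[tuple[int, str]]:
    """List (pid, comm) for every process currently in state 'D'."""
    found = []
    for stat_file in Path(proc_root).glob("[0-9]*/stat"):
        try:
            pid, comm, state = parse_state(stat_file.read_text())
        except (OSError, ValueError, IndexError):
            continue  # the process exited while we were scanning
        if state == "D":
            found.append((pid, comm))
    return found
```

Calling uninterruptible_processes() while the application is hung should show whether it is stuck in state "D". Processes in that state can also appear briefly during normal I/O, so a single hit is not proof of a hang.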
Comment by Michele (mikefender) - Friday, 23 September 2022, 22:26 GMT
Even running "nvidia-smi" on a virtual terminal (no graphical session) is enough to trigger the problem. However, the issue is related to the scripts I use to turn the card on/off for power saving. If I use the card from a fresh boot without any workarounds to turn the card off/on, everything seems to work correctly.

I understand the scripts are not an official method, but they have been working fine for years, and the logic behind them is used in popular solutions like nvidia-xrun.
Comment by Michele (mikefender) - Friday, 23 September 2022, 22:26 GMT
Attaching script to turn card off
Comment by Michele (mikefender) - Friday, 23 September 2022, 22:27 GMT
Attaching script to turn card back on
Comment by Toolybird (Toolybird) - Friday, 23 September 2022, 23:17 GMT
This sounds like a kernel and/or Nvidia regression, which means this [1] is applicable. Hopefully you can use git bisect to find the offending commit and then report it upstream.

[1] https://wiki.archlinux.org/title/Kernel#Debugging_regressions
Comment by Michele (mikefender) - Saturday, 24 September 2022, 14:14 GMT
Git bisect pointed me to this commit https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=9cd4f1434479f1ac25c440c421fbf52069079914

I'll try to revert it to see if it fixes the issue. As for the reporting, where should I report it? Should it be considered a kernel bug or a nvidia driver bug?
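
For readers unfamiliar with bisection: it is a binary search over the commit history between a known-good and a known-bad version. The following is only a sketch of that idea on a linear list. Real git bisect works on the full commit graph, and the commit labels and the is_bad callback in the usage example are hypothetical.

```python
from typing import Callable, Sequence


def first_bad(commits: Sequence[str], is_bad: Callable[[str], bool]) -> str:
    """Return the first commit for which is_bad() is true.

    Assumes commits are ordered oldest to newest, commits[0] is good,
    commits[-1] is bad, and every commit after the first bad one is bad too.
    """
    good, bad = 0, len(commits) - 1
    while bad - good > 1:
        mid = (good + bad) // 2
        if is_bad(commits[mid]):
            bad = mid   # the regression is at mid or earlier
        else:
            good = mid  # the regression is after mid
    return commits[bad]


# Hypothetical history between the last good and the first bad release.
history = ["v5.19.8", "c1", "c2", "c3", "c4", "v5.19.9"]
culprit = first_bad(history, lambda c: history.index(c) >= history.index("c3"))
```

Each step halves the remaining range, so the number of kernels to build and test grows only logarithmically with the number of commits between the two releases.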
Comment by Michele (mikefender) - Saturday, 24 September 2022, 14:16 GMT
Comment by Michele (mikefender) - Saturday, 24 September 2022, 15:27 GMT
Comment by loqs (loqs) - Saturday, 24 September 2022, 17:33 GMT
If you can reproduce it with nvidia-open, you can report it on [1]. If it only affects the closed-source drivers, you can use [2] instead.

[1] https://github.com/NVIDIA/open-gpu-kernel-modules/issues
[2] https://forums.developer.nvidia.com/c/gpu-graphics/linux/148
Comment by Michele (mikefender) - Saturday, 24 September 2022, 18:27 GMT
I don't think my card is supported by nvidia-open (it's a GTX 1050 Ti), but I can try to reproduce using nouveau
Comment by Michele (mikefender) - Saturday, 24 September 2022, 18:34 GMT
I can sort of reproduce with nouveau, it's not that bad (no hangs) but I still get errors in the kernel log:

Sep 24 19:32:45 jason kernel: DMAR: DRHD: handling fault status reg 2
Sep 24 19:32:45 jason kernel: DMAR: [INTR-REMAP] Request device [01:00.0] fault index 0x8000 [fault reason 0x25] Blocked a compatibility format interrupt request
Sep 24 19:32:47 jason kernel: nouveau 0000:01:00.0: sec2: cmdq: timeout waiting for queue ready
Sep 24 19:32:47 jason kernel: nouveau 0000:01:00.0: gr: init failed, -110
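
To check a longer log for the same kind of fault, the DMAR line above can be picked apart with a regular expression. This is a rough sketch: the pattern is derived only from the single line quoted here, so other kernel versions or fault types may be formatted differently, and the function name is made up.

```python
import re

# Pattern based solely on the DMAR fault line quoted in this comment.
DMAR_FAULT = re.compile(
    r"DMAR: \[(?P<kind>[A-Z-]+)\] Request device \[(?P<device>[0-9a-f:.]+)\] "
    r"fault index (?P<index>0x[0-9a-f]+) "
    r"\[fault reason (?P<reason>0x[0-9a-f]+)\] (?P<text>.+)"
)


def parse_dmar_fault(line: str):
    """Return the fields of a DMAR fault line as a dict, or None if no match."""
    match = DMAR_FAULT.search(line)
    return match.groupdict() if match else None


sample = (
    "Sep 24 19:32:45 jason kernel: DMAR: [INTR-REMAP] Request device [01:00.0] "
    "fault index 0x8000 [fault reason 0x25] "
    "Blocked a compatibility format interrupt request"
)
fault = parse_dmar_fault(sample)
```

For the sample line, fault["device"] is "01:00.0", which is the same PCI address as the nouveau device 0000:01:00.0 in the following lines.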
Comment by Michele (mikefender) - Saturday, 24 September 2022, 18:45 GMT
the stack trace with nouveau looks similar to the one with nvidia
Comment by loqs (loqs) - Saturday, 24 September 2022, 18:58 GMT
On the kernel.org report, can you add the author and committer of the bisected commit, Lu Baolu <baolu.lu@linux.intel.com> and Joerg Roedel <jroedel@suse.de>, to the CC list?
Comment by Michele (mikefender) - Saturday, 24 September 2022, 19:19 GMT
Done!
Comment by loqs (loqs) - Sunday, 25 September 2022, 06:47 GMT
As pointed out by upstream, 9cd4f1434479f1ac25c440c421fbf52069079914 has already been reverted in 5.19.11 [1][2].

[1] https://cdn.kernel.org/pub/linux/kernel/v5.x/ChangeLog-5.19.11
[2] https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=4d8637f1d67242207410734844ca4b143ac5585e
Comment by Michele (mikefender) - Sunday, 25 September 2022, 12:39 GMT
I've tested the latest mainline kernel and the issue has been resolved (which was expected since they reverted the commit that caused the problem).
Comment by Michele (mikefender) - Monday, 26 September 2022, 17:35 GMT
Kernel from package linux-5.19.11.arch1-1 works fine. This issue can be closed.
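
Going by this thread, the regression arrived in 5.19.9 and the offending commit was reverted in 5.19.11. A small sketch for checking whether a kernel release string falls in that window follows. It only covers the 5.19.y stable series discussed here, says nothing about other branches, and its names are made up for illustration.

```python
import re

FIRST_BAD = (5, 19, 9)     # first release reported as broken in this task
FIRST_FIXED = (5, 19, 11)  # release that carries the revert


def parse_release(release: str) -> tuple:
    """Turn a string such as '5.19.11-arch1-1' into (5, 19, 11)."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", release)
    if match is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    major, minor, patch = match.groups()
    return int(major), int(minor), int(patch or 0)


def is_affected(release: str) -> bool:
    """True if the release lies in the 5.19.9 to 5.19.10 window."""
    return FIRST_BAD <= parse_release(release) < FIRST_FIXED
```

To check the running kernel, pass platform.release() from the standard library to is_affected().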
