FS#75995 - [nvidia] Black X11 Screen and partial lockup when upgraded to 515.76 and dual RTX3060
Attached to Project:
Arch Linux
Opened by Christian Pellegrin (chripell) - Saturday, 24 September 2022, 07:22 GMT
Last edited by Sven-Hendrik Haase (Svenstaro) - Thursday, 13 October 2022, 15:01 GMT
Details
Description:
After upgrading to 515.76 on my system (AMD CPU, Asus motherboard, 2 x RTX3060; see the nvidia-bug-report.log.gz for the detailed configuration), I get a blank screen when I run startx. I can log in remotely and I can take an nvidia-bug-report (although it takes a long time to finish), but reboot hangs, with the last message “kernel: nvidia-modeset: ERROR: GPU:0: Idling display engine timed out: 0x0000c67d:0:0:1119”, so I suspect a problem at kernel level.
Things I tried:
* Downgrading to 515.65.01: this DOES solve the problem.
* Disabling the AMD pstate driver: this does NOT solve the problem.
* Disabling IOMMU/PCI denylisting to get a normal dual-GPU configuration: this does NOT solve the problem.
* Downgrading to linux LTS 5.15.70: this does NOT solve the problem.
Let me know if you need more information. Thanks!
Additional info:
* package version: nvidia-dkms-515.76-1
* config and/or log files: see attached file
* link to upstream bug report: https://forums.developer.nvidia.com/t/bug-report-black-x11-screen-and-partial-lockup-when-upgraded-to-515-76-and-dual-rtx3060/228912
Steps to reproduce:
Just start X11 from the console (startx; I configure the WM via .xinitrc).
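You can check whether a machine is hitting the same failure by searching the kernel log of the previous (hung) boot for the `nvidia-modeset` timeout quoted above. This is a minimal sketch that assumes a persistent systemd journal. The function name `has_modeset_timeout` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds (exit status 0) if the given kernel log
# contains the "Idling display engine timed out" error reported in this bug.
has_modeset_timeout() {
  log_file="$1"
  grep -Eq 'nvidia-modeset: ERROR: GPU:[0-9]+: Idling display engine timed out' "$log_file"
}
```

For example, run `journalctl -k -b -1 > /tmp/prev-boot.log` to dump the kernel messages of the previous boot. Then run `has_modeset_timeout /tmp/prev-boot.log && echo "modeset timeout found"`. The error text is matched as a substring, so the journal's timestamp and hostname prefix do not matter.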
Closed by Sven-Hendrik Haase (Svenstaro)
Thursday, 13 October 2022, 15:01 GMT
Reason for closing: Fixed
I can confirm, same problem here. I had to downgrade to the 515.65 driver.
The last time this happened, a couple of years ago, the linux kernel package had not been compiled to support the new version of the nvidia driver.
Usually linux and nvidia packages are released synchronously.
Bye,
Stefan
OP, can you try the nvidia-open driver and see whether the problem is the same?
Linux eren 5.19.10-arch1-1 #1 SMP PREEMPT_DYNAMIC Tue, 20 Sep 2022 15:17:59 +0000 x86_64 GNU/Linux
Trying LTS was only an additional check.
Thanks for mentioning nvidia-open-dkms, I didn't know about it! Right now I have some computation running; I will test it this evening, and I think it is a good idea to move to it.
I also ran into the same problem on my Arch system (CPU: 12700K, motherboard: MSI Z690I Unify, discrete GPU: RTX3060).
I use DWM as the WM and start it with the command "startx". For the past several months everything worked well, until yesterday, when I upgraded the nvidia-dkms driver from 515.65 to 515.76.
I get a black screen when I run startx. I cannot even switch to tty2; Ctrl+Alt+Fn (F2~F7) does not work.
I have tested the linux, linux-xanmod and linux-lts kernels, and I also tried the nvidia-open driver. None of these helped; I still get a black screen.
After downgrading the driver to 515.65, everything works again.
Using nvidia-open-dkms didn't make any difference.
Black screen after bootloader text and no way to switch to TTY.
edit: tried with nvidia-open-dkms but it made no difference.
Downgrading the nvidia packages alone did not solve it; I had to downgrade linux as well.
Note that switching to a TTY doesn't work, so you will probably need to chroot in from the installation media to install the above packages and set up the boot entry.
* Boot from arch linux installation media
* lsblk to see the root partition
* mount /dev/[the partition] /mnt, where the partition is likely sd[char][number] or nvme[number]n[number]p[number]
* arch-chroot /mnt
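* pacman -S linux-zen linux-zen-headers nvidia-open-dkms (sudo is not needed inside the chroot, and DKMS needs the headers of the kernel it builds for)
* You may want to run mkinitcpio (e.g. `mkinitcpio -P`) or set up a pacman hook that runs it
* Add a boot entry for the new kernel
* Reboot and choose the linux-zen entry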
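The chroot steps listed in this comment can be sketched as a script. Several parts are assumptions that are not in the original comment:
* the root partition is passed as an argument;
* GRUB is the bootloader;
* /boot lives on the root partition;
* `DRY_RUN=1` only prints the commands.

The function names `run` and `recover_with_zen` are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch of the chroot recovery steps above, run as root from the Arch
# installation media. Assumptions: root partition passed as $1, GRUB as the
# bootloader, /boot on the root partition (a separate /boot or EFI partition
# must be mounted under /mnt first, not shown). DRY_RUN=1 only prints commands.

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "${DRY_RUN:-0}" = 1 ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

recover_with_zen() {
  root_part="$1"   # e.g. /dev/nvme0n1p2, as identified with lsblk
  run mount "$root_part" /mnt || return 1
  # arch-chroot runs the given command as root inside /mnt, so no sudo is needed.
  # DKMS builds nvidia-open-dkms against the kernel headers, hence linux-zen-headers.
  run arch-chroot /mnt pacman -S --needed linux-zen linux-zen-headers nvidia-open-dkms || return 1
  # Regenerate the initramfs for every installed kernel preset.
  run arch-chroot /mnt mkinitcpio -P || return 1
  # GRUB only (assumption): regenerate the menu so it gains a linux-zen entry.
  run arch-chroot /mnt grub-mkconfig -o /boot/grub/grub.cfg || return 1
  run umount /mnt
}
```

Running `DRY_RUN=1 recover_with_zen /dev/nvme0n1p2` first shows the commands without touching the disk. Other bootloaders (systemd-boot, rEFInd, ...) need their own linux-zen entry in place of the `grub-mkconfig` line.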
EDIT: It was a fluke. The workaround that actually worked was: start a Hyprland session, exit it, then start SDDM / the X11 DE.
My Hyperland.sh file includes this:
```
#!/usr/bin/env bash
export LIBVA_DRIVER_NAME=nvidia
export CLUTTER_BACKEND=wayland
export XDG_SESSION_TYPE=wayland
export QT_WAYLAND_DISABLE_WINDOWDECORATION=1
export MOZ_ENABLE_WAYLAND=1
export __GLX_VENDOR_LIBRARY_NAME=nvidia
export WLR_NO_HARDWARE_CURSORS=1
export GBM_BACKEND=nvidia-drm
export WLR_BACKEND=vulkan
export WLR_RENDERER=gles2
export QT_QPA_PLATFORM=wayland
export GDK_BACKEND=wayland
export XCURSOR_SIZE=24
Hyprland
```
Whatever it does, it helps SDDM start. Perhaps Hyprland itself is not needed and only some of the environment variable setup is.
It seems disabling nvidia-drm.modeset could be a temporary workaround. I suppose this potentially hits people with RTX 3000-series cards running X11 with `nvidia-drm.modeset=1`.
after the session is up and running should work.
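Kernel modesetting for nvidia-drm is usually enabled in one of two ways: with `nvidia-drm.modeset=1` on the kernel command line, or with a module option in /etc/modprobe.d. This is a minimal sketch of the temporary workaround for the modprobe.d case. The function and file names are hypothetical choices. A `nvidia-drm.modeset=1` on the kernel command line, or a `modeset=1` option in another modprobe.d file, would have to be removed separately rather than overridden.

```shell
#!/usr/bin/env bash
# Write a modprobe.d snippet that loads nvidia-drm with kernel modesetting off.
# The default directory is the standard /etc/modprobe.d (writing there needs
# root); the file name nvidia-drm-nomodeset.conf is a hypothetical choice.
disable_nvidia_drm_modeset() {
  conf_dir="${1:-/etc/modprobe.d}"
  printf 'options nvidia-drm modeset=0\n' > "$conf_dir/nvidia-drm-nomodeset.conf"
}
```

If nvidia-drm is loaded from the initramfs, rerun `mkinitcpio -P` afterwards; the `modconf` hook copies /etc/modprobe.d into the image. After a reboot, `cat /sys/module/nvidia_drm/parameters/modeset` should print `N`.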
I can almost confirm. In addition, I had to restart sddm.service.
So my current workaround:
1. unplug HDMI
2. boot
3. plug in HDMI
4. stop/start sddm.service (logged in remotely from another machine...)
1. I have a system with one RTX3060 connected to an HDMI monitor through a KVM switch (work monitor) and another RTX3060 connected directly to a DP monitor (calibrated for graphics work).
2. I switch the KVM to the other system, *not* the one with the RTX3060s.
3. I boot my system. Now the POST/Linux console is on the DP monitor (usually it is on the HDMI one). I log in and run `startx`.
4. I switch the KVM back to the RTX3060 system and I have my usual dual display / GPU correctly working.
So it looks like there is something in the console initialization code specific to HDMI.
Sadly I have no display port monitor available.
https://forums.developer.nvidia.com/t/515-76-nvidia-drivers/229132/15?u=vcdbvcxfasd
"We were able to duplicate issue locally and are currently debugging it.
Shall keep updated on the same."
More specifically, my screen goes blank after starting either SDDM or LightDM. Haven't tried others yet. I am also unable to switch between TTYs at that point.
I am still trying to find a workaround, will update once I find one.
I was able to resolve this by downgrading BOTH nvidia and linux, to 515.65.01-9 and 5.19.4.arch1-1 respectively. Downgrading nvidia alone didn't work. I also downgraded nvidia-utils, but I'm not sure that was necessary.
I wonder how this sort of issue could be prevented in the future; I know it's difficult when it comes to proprietary packages.
```
DOWNGRADE_FROM_ALA=1 downgrade nvidia-utils linux510-nvidia linux515-nvidia linux518-nvidia lib32-nvidia-utils
loading packages...
warning: downgrading package lib32-nvidia-utils (515.76-1 => 515.65.01-1)
warning: downgrading package linux510-nvidia (515.76-4 => 515.65.01-8)
warning: downgrading package linux515-nvidia (515.76-14 => 515.65.01-8)
warning: downgrading package linux518-nvidia (515.76-1 => 515.65.01-6)
warning: downgrading package nvidia-utils (515.76-1 => 515.65.01-3)
```
Same "kernel: nvidia-modeset: ERROR: GPU:0: Idling display engine timed out: 0x0000c67e:0:0:1128" in dmesg as reported.
I also noticed that xrandr, for example, did not work.
I downgraded all the packages with the affected version number and added them to the IgnorePkg list in /etc/pacman.conf:
IgnorePkg = nvidia-dkms nvidia-utils lib32-nvidia-utils libxnvctrl opencl-nvidia
Feeling a bit uncomfortable upgrading a now « unstable » system.
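Packages listed in IgnorePkg are skipped by every later `pacman -Syu`. The entries therefore have to come out again once a fixed driver is released. This is a minimal sketch for checking whether a package is still held back. The function name is hypothetical, and it only handles the simple space-separated `IgnorePkg = ...` form shown above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds if package $1 is listed on an IgnorePkg line
# of the pacman config $2 (default /etc/pacman.conf). Lines starting with '#'
# are not matched, so commented-out entries are ignored.
is_ignored_pkg() {
  pkg="$1"
  conf="${2:-/etc/pacman.conf}"
  awk -v pkg="$pkg" '
    /^[ \t]*IgnorePkg[ \t]*=/ {
      sub(/^[^=]*=/, "")          # drop the "IgnorePkg =" prefix; fields re-split
      for (i = 1; i <= NF; i++)   # remaining fields are package names
        if ($i == pkg) found = 1
    }
    END { exit !found }
  ' "$conf"
}
```

For example, `is_ignored_pkg nvidia-dkms && echo "nvidia-dkms is still held back"` is a reminder to edit /etc/pacman.conf after the fixed release lands.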
> Fixed a regression in 515.76 that caused blank screens and hangs when starting an X server on RTX 30 series GPUs in some configurations where the boot display is connected via HDMI.