FS#75995 - [nvidia] Black X11 Screen and partial lockup when upgraded to 515.76 and dual RTX3060

Attached to Project: Arch Linux
Opened by Christian Pellegrin (chripell) - Saturday, 24 September 2022, 07:22 GMT
Last edited by Sven-Hendrik Haase (Svenstaro) - Thursday, 13 October 2022, 15:01 GMT
Task Type Bug Report
Category Packages: Extra
Status Closed
Assigned To Sven-Hendrik Haase (Svenstaro)
Felix Yan (felixonmars)
Architecture All
Severity Low
Priority Normal
Reported Version
Due in Version Undecided
Due Date Undecided
Percent Complete 100%
Votes 13
Private No

Details

Description:

After upgrading to 515.76 on my system (AMD CPU, ASUS motherboard, 2x RTX3060, see the nvidia-bug-report.log.gz for detailed configuration) I get a blank screen when I run startx. I can log in remotely and I can generate an nvidia-bug-report (although it takes a long time to finish), but reboot hangs (with the last message “kernel: nvidia-modeset: ERROR: GPU:0: Idling display engine timed out: 0x0000c67d:0:0:1119”), so I suspect a problem at the kernel level.
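
For anyone checking whether they hit the same failure, here is a minimal sketch that looks for the message quoted above in kernel log text. The function name `has_modeset_timeout` is made up for illustration, and the pattern assumes the wording of the message stays as quoted; the hex code after it varies.

```shell
#!/bin/sh
# Hypothetical helper: reads kernel log text on stdin and succeeds if the
# nvidia-modeset timeout quoted in this report appears in it.
has_modeset_timeout() {
    grep -q 'nvidia-modeset: ERROR: GPU:[0-9]*: Idling display engine timed out'
}

# Example use from a remote login (kernel log of the current boot):
#   journalctl -k -b | has_modeset_timeout && echo "timeout found"
# For the boot that hung (requires a persistent journal):
#   journalctl -k -b -1 | has_modeset_timeout && echo "timeout found"
```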

Things I tried:

Downgrading to 515.65.01 DOES solve the problem.
Disabling the AMD pstate driver does NOT solve the problem.
Disabling the iommu/PCI denylisting to get a normal 2xGPU configuration does NOT solve the problem.
Downgrading to linux LTS 5.15.70 does NOT solve the problem.
Let me know if you need more information,

Thanks!

Additional info:
* package version: nvidia-dkms-515.76-1
* config and/or log files: see attached file
* link to upstream bug report: https://forums.developer.nvidia.com/t/bug-report-black-x11-screen-and-partial-lockup-when-upgraded-to-515-76-and-dual-rtx3060/228912

Steps to reproduce: Just start X11 from the console (startx, I configure the WM via .xinitrc)

Closed by  Sven-Hendrik Haase (Svenstaro)
Thursday, 13 October 2022, 15:01 GMT
Reason for closing:  Fixed
Comment by Stefan Kain (stkain) - Saturday, 24 September 2022, 15:43 GMT
Hello,

I can confirm, same problem here. I had to downgrade to the 515.65 driver.
The last time this happened, a couple of years ago, the linux kernel package had not been compiled to support the new version of the nvidia driver.
Usually linux and nvidia packages are released synchronously.

Bye,
Stefan
Comment by Sven-Hendrik Haase (Svenstaro) - Sunday, 25 September 2022, 07:15 GMT
Stefan, the OP is using a dkms package and as such the driver is compiled just in time for each kernel version.

OP, can you try the nvidia-open driver and see whether it behaves the same?
Comment by Sven-Hendrik Haase (Svenstaro) - Sunday, 25 September 2022, 07:50 GMT
OP, could you perhaps also test the current non-lts kernel?
Comment by Christian Pellegrin (chripell) - Sunday, 25 September 2022, 08:08 GMT
Sorry, I was unclear in my initial post. I usually use the latest non-LTS kernel:

Linux eren 5.19.10-arch1-1 #1 SMP PREEMPT_DYNAMIC Tue, 20 Sep 2022 15:17:59 +0000 x86_64 GNU/Linux

Trying LTS was only an additional check.

Thanks for mentioning nvidia-open-dkms, I didn't know about that! Right now I have some computation going on, so I will test it this evening. I think it is a good idea to move to it.
Comment by Chang Wei (Weich) - Sunday, 25 September 2022, 10:22 GMT
hi guys,

I also met the same problem on my Arch system (cpu: 12700K, motherboard: msi z690i unify, discrete gpu: RTX3060).

I use DWM as the wm and start it with the command "startx". For the past several months everything went well, until yesterday when I upgraded the nvidia-dkms driver from 515.65 to 515.76.

I get a black screen when I run startx. I cannot even switch to tty2. Ctrl+Alt+Fn (F2~F7) does not work.

I have tested linux kernel, linux-xanmod kernel and linux-lts, and I also tried to use the driver nvidia-open. However, these do not work and I still get a black screen.

After downgrading the driver to 515.65, everything goes well again.
Comment by Sven-Hendrik Haase (Svenstaro) - Sunday, 25 September 2022, 10:29 GMT
I don't have this problem on my 2080ti but I'd like someone to test the open drivers and report back.
Comment by Brian Gomes Bascoy (pera) - Sunday, 25 September 2022, 16:09 GMT
Same problem here with an RTX3060; my system gets completely unresponsive (e.g. caps lock led not turning on) so I had to downgrade.

Using nvidia-open-dkms didn't make any difference.
Comment by Tnuk E (fgosdisaj) - Sunday, 25 September 2022, 18:59 GMT
Same issue on a 3080ti with 515.76-1. Been like this since sept 23.
Black screen after bootloader text and no way to switch to TTY.
edit: tried with nvidia-open-dkms but it made no difference.
Comment by Hustin (turbochamp) - Sunday, 25 September 2022, 20:36 GMT
Same issues on a 3070 Ti. Using WM (XMonad) after logging in and running startx with xinit the screen is completely unresponsive. Cannot switch TTY or kill xserver. Downgrading linux, linux-headers, nvidia-dkms, nvidia-utils and lib32-nvidia-utils solved the issue.

Downgrading nvidia packages alone did not solve it, had to downgrade linux.
Comment by Christian Pellegrin (chripell) - Sunday, 25 September 2022, 20:46 GMT
I tried latest nvidia-open-dkms and it does NOT work, showing the same symptoms as described in the first bug report.
Comment by Sven-Hendrik Haase (Svenstaro) - Sunday, 25 September 2022, 22:55 GMT
Well that really sucks and it also sucks we can't do anything here. nvidia obviously didn't test thoroughly on many devices. I hope the thread quickly is seen and resolved by nvidia.
Comment by Quang Luong (quang) - Monday, 26 September 2022, 01:02 GMT
Workaround that worked for me to recover the working system (RTX 3080): Use linux-zen kernel with nvidia-open-dkms.

Note that switching to tty doesn't work so you will probably need to use chroot from installation media to install the above packages and setup the boot entry.

* Boot from arch linux installation media
* lsblk to see the root partition
* mount /dev/[the partition] /mnt, where the partition is likely sd[char][number] or nvme[number]n[number]p[number]
* arch-chroot /mnt
* sudo pacman -S linux-zen nvidia-open-dkms
* You may want to run/setup the hook for mkinitcpio
* Add new boot entry to the new kernel
* Restart and use linux-zen entry
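
The steps above can be sketched as a dry run. This only prints the commands for a root partition you pass in, it does not run anything. The function name `print_recovery_steps` is hypothetical and `/dev/sdX2` in the example is a placeholder, not a guess at anyone's layout. Adding the boot entry depends on the bootloader and is left out.

```shell
#!/bin/sh
# Prints (does not run) the recovery commands for a given root partition.
# Meant to be read from the installation media before typing the commands.
print_recovery_steps() {
    root_part=$1
    if [ -z "$root_part" ]; then
        echo "usage: print_recovery_steps /dev/<root-partition>" >&2
        return 1
    fi
    printf '%s\n' \
        "mount $root_part /mnt" \
        "arch-chroot /mnt pacman -S linux-zen nvidia-open-dkms" \
        "arch-chroot /mnt mkinitcpio -P"
}

# Example:
#   print_recovery_steps /dev/sdX2
```

Inside the chroot you are already root, so sudo is not needed there.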

EDIT: It was a fluke. The workaround was: starting a hyprland session, exit, start sddm / x11 DE.

My Hyperland.sh file includes this:
```
#!/usr/bin/env bash

export LIBVA_DRIVER_NAME=nvidia
export CLUTTER_BACKEND=wayland
export XDG_SESSION_TYPE=wayland
export QT_WAYLAND_DISABLE_WINDOWDECORATION=1
export MOZ_ENABLE_WAYLAND=1
export __GLX_VENDOR_LIBRARY_NAME=nvidia
export WLR_NO_HARDWARE_CURSORS=1
export GBM_BACKEND=nvidia-drm
export WLR_BACKEND=vulkan
export WLR_RENDERER=gles2
export QT_QPA_PLATFORM=wayland
export GDK_BACKEND=wayland
export XCURSOR_SIZE=24

Hyprland
```
Whatever it does, it helps sddm to start. Perhaps Hyprland is actually not needed but some env variable setup is.
Comment by q rty (q234rty) - Monday, 26 September 2022, 10:23 GMT
Since nvidia-open-dkms also doesn't work, it would make sense to report this in https://github.com/NVIDIA/open-gpu-kernel-modules/issues as well.
Comment by Sven-Hendrik Haase (Svenstaro) - Monday, 26 September 2022, 23:20 GMT
I'm really uncertain about this. I could epoch this to the previous version but then again this driver is confirmed to fix the excessive power draw issue that many people had and so that sucks either way. It also doesn't seem to hit every 3000 series card either as we got reports from many people saying it works fine for them. I really hope nvidia is quick to acknowledge the issue and release a fix for this.
Comment by Christian Pellegrin (chripell) - Tuesday, 27 September 2022, 06:03 GMT
Comment by Quang Luong (quang) - Tuesday, 27 September 2022, 09:21 GMT
> It also doesn't seem to hit every 3000 series card either as we got reports from many people saying it works fine for them.

It seems disabling nvidia-drm.modeset could be a temporary workaround. I suppose this is potentially hitting people with 3000 cards with `nvidia-drm.modeset=1` + X11.
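
To see whether a system is in that `nvidia-drm.modeset=1` group, a rough check of the kernel command line could look like the sketch below. The function name `cmdline_modeset` is made up, and it only covers the command-line form; an `options nvidia-drm modeset=1` line in a modprobe.d file would not be detected this way.

```shell
#!/bin/sh
# Prints "1" if the given kernel command line string enables nvidia-drm
# modesetting, "0" otherwise. Both spellings are accepted because the kernel
# treats '-' and '_' in module parameter names the same way.
cmdline_modeset() {
    case " $1 " in
        *" nvidia-drm.modeset=1 "*|*" nvidia_drm.modeset=1 "*) echo 1 ;;
        *) echo 0 ;;
    esac
}

# On a live system:
#   cmdline_modeset "$(cat /proc/cmdline)"
# Once the module is loaded, the effective value should also be readable from
#   /sys/module/nvidia_drm/parameters/modeset
```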
Comment by Tnuk E (fgosdisaj) - Tuesday, 27 September 2022, 14:47 GMT
Have this issue with or without the DRM modeset activated.
Comment by Jonathon (jonathon) - Tuesday, 27 September 2022, 20:24 GMT
To confirm impact isn't across all 3000-series cards, I have an RTX 3070 (Max-Q) and did not see this issue while the driver was in testing. Extra data point, linux-lqx and linux-mainline were fine.
Comment by moonsheep (moonsheep) - Friday, 30 September 2022, 00:15 GMT
I am experiencing the same behavior after upgrading with an RTX 3060Ti. Doing a full downgrade did the trick.
Comment by Stefan Kain (stkain) - Friday, 30 September 2022, 19:31 GMT
Someone in the NVIDIA-github-report figured out that starting with HDMI unplugged during boot and then plugging the monitor in
after the session is up and running should work.

I can almost confirm. In addition, I had to restart the sddm.service.
So my current workaround:
1. unplug HDMI
2. boot
3. plug in HDMI
4. stop/start sddm.service (logged in remotely from another machine...)


Comment by Luciano Lorenti (lucianolorenti) - Friday, 30 September 2022, 22:07 GMT
The workaround described by Stefan worked for me.
Comment by Joshua Patterson (Gippies) - Saturday, 01 October 2022, 00:33 GMT
Playing off of Stefan's workaround, I was using an HDMI cable so I tried switching to a DisplayPort cable and that also worked for me (without having to unplug and plug it back in or restart sddm).
Comment by Christian Pellegrin (chripell) - Saturday, 01 October 2022, 06:27 GMT
Thanks for the suggestion! This actually works for me on 515.76:

1. I have a system with an RTX3060 connected to an HDMI monitor through a KVM switch (work monitor) and an RTX3060 connected directly to a DP monitor (calibrated for graphics work).
2. I switch the KVM to the other system, *not* the one with the RTX3060.
3. I boot my system. Now the POST/linux console is on the DP monitor; usually it is on the HDMI one. I log in and run `startx`.
4. I switch the KVM back to the RTX3060 system and I have my usual dual display / GPU correctly working.

So it looks like there is something in the console initialization code specific to HDMI.
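
A rough way to check from a remote login whether an HDMI output is currently seen as connected is to read the DRM connector status files. This is only a sketch: the function names are made up, and whether the proprietary driver exposes its connectors under /sys/class/drm at all depends on the setup (I assume it needs nvidia-drm modesetting enabled).

```shell
#!/bin/sh
# Lists DRM connectors whose status file reads "connected".
# The directory argument defaults to /sys/class/drm; it is a parameter only so
# the function can be tried against a test tree.
connected_connectors() {
    drm_dir=${1:-/sys/class/drm}
    for status_file in "$drm_dir"/card*-*/status; do
        [ -r "$status_file" ] || continue
        if [ "$(cat "$status_file")" = "connected" ]; then
            basename "$(dirname "$status_file")"
        fi
    done
}

# Succeeds if any connected connector is HDMI (names look like card0-HDMI-A-1).
hdmi_connected() {
    connected_connectors "$1" | grep -q -- '-HDMI-'
}

# Example:
#   hdmi_connected && echo "an HDMI output is connected"
```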
Comment by Guillaume BINET (gbin) - Sunday, 02 October 2022, 01:30 GMT
For completeness the 3090 is also affected.
Comment by Chang Wei (Weich) - Sunday, 02 October 2022, 02:58 GMT
The workaround described by Stefan also worked for me. After I use a DisplayPort cable, the driver 515.76 works well. There is not much doubt that the issue is only related to HDMI.


Comment by Igor Moura (igormp) - Monday, 03 October 2022, 19:00 GMT
Can report this also happens on my 3090. Blindly logging in and running startx, then plugging in the HDMI cable did the trick for me too.

Sadly I have no display port monitor available.
Comment by Tnuk E (fgosdisaj) - Tuesday, 04 October 2022, 14:38 GMT
Official response:
https://forums.developer.nvidia.com/t/515-76-nvidia-drivers/229132/15?u=vcdbvcxfasd

"We were able to duplicate issue locally and are currently debugging it.
Shall keep updated on the same."
Comment by Tnuk E (fgosdisaj) - Tuesday, 04 October 2022, 14:43 GMT
**double post
Comment by Shashank Rajesh (SerpentEagle) - Friday, 07 October 2022, 04:34 GMT
I also have the same issue with my EVGA RTX 3050 XC using HDMI 2.1 (I use a TV as a monitor and thus cannot use DisplayPort)

More specifically, my screen goes blank after starting either SDDM or LightDM. Haven't tried others yet. I am also unable to switch between TTYs at that point.

I am still trying to find a workaround, will update once I find one.
Comment by Shashank Rajesh (SerpentEagle) - Friday, 07 October 2022, 04:44 GMT
Update:

I was able to resolve this by downgrading BOTH nvidia and linux to 515.65.01-9 and 5.19.4.arch1-1 respectively. Solely downgrading nvidia didn't work. I also downgraded nvidia-utils, but not sure if it was necessary.
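
Before or after a downgrade like this, it can help to confirm which driver release is actually installed. A minimal sketch, assuming the usual `pacman -Q` output of "name version-pkgrel" with no epoch prefix; the function name `is_regressed_driver` is made up for illustration.

```shell
#!/bin/sh
# Given one line in `pacman -Q` format ("name version-pkgrel"), succeed if the
# upstream driver version is 515.76, the release this report is about.
is_regressed_driver() {
    ver=${1#* }      # drop the package name
    ver=${ver%-*}    # drop the pkgrel suffix
    [ "$ver" = "515.76" ]
}

# Example on a live system:
#   is_regressed_driver "$(pacman -Q nvidia-utils)" && echo "still on 515.76"
```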

I wonder how this sort of issue could be prevented in the future; I know it's difficult when it comes to proprietary packages.
Comment by Simon Brännström (Sensu) - Saturday, 08 October 2022, 12:50 GMT
@SerpentEagle Personally, I decided to install nvidia-470xx-dkms and its dependencies from the AUR instead, using an AUR helper (yay in my case), which was quite convenient. Fortunately, I didn't have to downgrade the kernel as well to work around this issue.
Comment by Markus (Links2004) - Monday, 10 October 2022, 16:21 GMT
I ran into the same problem and "fixed" it by downgrading to 515.65.01.

```
DOWNGRADE_FROM_ALA=1 downgrade nvidia-utils linux510-nvidia linux515-nvidia linux518-nvidia lib32-nvidia-utils
loading packages...
warning: downgrading package lib32-nvidia-utils (515.76-1 => 515.65.01-1)
warning: downgrading package linux510-nvidia (515.76-4 => 515.65.01-8)
warning: downgrading package linux515-nvidia (515.76-14 => 515.65.01-8)
warning: downgrading package linux518-nvidia (515.76-1 => 515.65.01-6)
warning: downgrading package nvidia-utils (515.76-1 => 515.65.01-3)
```

I see the same "kernel: nvidia-modeset: ERROR: GPU:0: Idling display engine timed out: 0x0000c67e:0:0:1128" in dmesg as reported.
I also noticed that xrandr, for example, did not work.

Comment by LESNOFF Dimitri (dlesnoff) - Monday, 10 October 2022, 19:39 GMT
I had the same problem with an RTX3070 (so the bug seems to target only Ampere cards on HDMI). I was already using the dkms drivers, and the two workarounds (HDMI and downgrade) worked for me.
I downgraded all the packages with the impacted version number and added them to the IgnorePkg list:
IgnorePkg = nvidia-dkms nvidia-utils lib32-nvidia-utils libxnvctrl opencl-nvidia
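
To double-check that a package is really held back, the IgnorePkg line can be read back from the config. A minimal sketch with a made-up function name `is_ignored`; it only matches exact names on uncommented IgnorePkg lines and does not interpret glob patterns.

```shell
#!/bin/sh
# Succeeds if the package name appears on an uncommented IgnorePkg line of the
# given pacman.conf-style file (default /etc/pacman.conf).
is_ignored() {
    pkg=$1
    conf=${2:-/etc/pacman.conf}
    sed -n 's/^[[:space:]]*IgnorePkg[[:space:]]*=//p' "$conf" |
        tr -s ' \t' '\n\n' | grep -qxF -- "$pkg"
}

# Example:
#   is_ignored nvidia-dkms && echo "nvidia-dkms is held back"
```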

Feeling a bit uncomfortable upgrading a now « unstable » system.
Comment by Alexandros (fumantsu) - Wednesday, 12 October 2022, 07:22 GMT
I can confirm that the issue is not limited to the 3000 series. I have an RTX2070 and am getting the same behavior after rebooting following the latest upgrade in Manjaro. I still need to test the HDMI workaround, because I use DP.
Comment by Igor Moura (igormp) - Thursday, 13 October 2022, 01:32 GMT
Looks like this was fixed in the latest driver: https://www.nvidia.com/Download/driverResults.aspx/193764/en-us/

> Fixed a regression in 515.76 that caused blank screens and hangs when starting an X server on RTX 30 series GPUs in some configurations where the boot display is connected via HDMI.
