FS#73880 - [xorg-server] Crash when starting SteamVR (nvidia)

Attached to Project: Arch Linux
Opened by James Hogan (jhogan) - Saturday, 19 February 2022, 09:50 GMT
Last edited by Andreas Radke (AndyRTR) - Monday, 21 February 2022, 09:16 GMT
Task Type Bug Report
Category Packages: Extra
Status Closed
Assigned To Andreas Radke (AndyRTR)
Sven-Hendrik Haase (Svenstaro)
Laurent Carlier (lordheavy)
Architecture x86_64
Severity High
Priority Normal
Reported Version
Due in Version Undecided
Due Date Undecided
Percent Complete 100%
Votes 0
Private No

Details

Description:

Since an update yesterday, Xorg crashes when starting SteamVR with the NVIDIA drivers (and an HTC Vive connected).

package version(s)
* xorg-server 21.1.3-3
* nvidia-dkms 510.54-1
* linux 5.16.10.arch1-1
* linux-hardened 5.15.21.hardened1-3

config and/or log files etc.
* see attached Xorg.0.log.old

recent upgrades (all several days ago, so I'd be surprised if they made a difference, since I think I've restarted multiple times since then):
* [2022-02-09T14:31:24+0000] [ALPM] upgraded xorg-server (21.1.3-1 -> 21.1.3-2) (ages ago)
* [2022-02-15T16:58:22+0000] [ALPM] upgraded linux-hardened (5.15.21.hardened1-1 -> 5.15.21.hardened1-3)
* [2022-02-15T16:58:27+0000] [ALPM] upgraded nvidia-dkms (510.47.03-3 -> 510.54-1)

These happened after the crashes started and didn't help:
* [2022-02-19T07:45:35+0000] [ALPM] upgraded xorg-server (21.1.3-2 -> 21.1.3-3)
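
A list like the one above can be pulled out of the pacman log with a short script. This is only a sketch: the function name `upgrades` and the regular expression are illustrative assumptions, and the sample lines are the entries quoted in this report (on a real system they would normally be read from /var/log/pacman.log).

```python
import re
from datetime import datetime

# Matches lines such as:
# [2022-02-09T14:31:24+0000] [ALPM] upgraded xorg-server (21.1.3-1 -> 21.1.3-2)
UPGRADE_RE = re.compile(
    r"^\[(?P<ts>[^\]]+)\] \[ALPM\] upgraded (?P<pkg>\S+) "
    r"\((?P<old>\S+) -> (?P<new>\S+)\)"
)


def upgrades(lines, packages):
    """Return (timestamp, package, old, new) tuples for the given packages, oldest first."""
    found = []
    for line in lines:
        m = UPGRADE_RE.match(line.strip())
        if m and m["pkg"] in packages:
            ts = datetime.strptime(m["ts"], "%Y-%m-%dT%H:%M:%S%z")
            found.append((ts, m["pkg"], m["old"], m["new"]))
    return sorted(found)


# Sample input: the log entries quoted in this report.
SAMPLE = [
    "[2022-02-09T14:31:24+0000] [ALPM] upgraded xorg-server (21.1.3-1 -> 21.1.3-2)",
    "[2022-02-15T16:58:22+0000] [ALPM] upgraded linux-hardened (5.15.21.hardened1-1 -> 5.15.21.hardened1-3)",
    "[2022-02-15T16:58:27+0000] [ALPM] upgraded nvidia-dkms (510.47.03-3 -> 510.54-1)",
    "[2022-02-19T07:45:35+0000] [ALPM] upgraded xorg-server (21.1.3-2 -> 21.1.3-3)",
]

result = upgrades(SAMPLE, {"xorg-server", "nvidia-dkms"})
for ts, pkg, old, new in result:
    print(ts.isoformat(), pkg, old, "->", new)
```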

Steps to reproduce:
* update system
* start SteamVR with the HTC Vive plugged in
* Xorg crashes back to the login screen

Debugging attempts
* I attempted to attach gdb to Xorg from a different TTY, but it seemed to just hang the system (I couldn't ssh in at that point either).
* I tried downgrading linux-hardened to 5.15.21.hardened1-1 (it was upgraded a few days back, so it's a long shot), but it wouldn't boot: nvidia-dkms wouldn't rebuild due to different gcc plugin versions.
* I tried switching from linux-hardened to linux, and it still happened.

Happy to try debugging with guidance

Closed by Andreas Radke (AndyRTR)
Monday, 21 February 2022, 09:16 GMT
Reason for closing: Fixed
Comment by James Hogan (jhogan) - Saturday, 19 February 2022, 09:51 GMT
I'll also add that it started happening yesterday, after which it did work for a short time, but today I can't get it to work at all.
Comment by James Hogan (jhogan) - Saturday, 19 February 2022, 10:22 GMT
I found coredumpctl; see the attached log. The gdb backtrace doesn't look particularly helpful, and I'm unsure whether the null pointer is dereferenced from Xorg or nvidia code.
Comment by James Hogan (jhogan) - Saturday, 19 February 2022, 10:32 GMT
If I cat /proc/$(pidof Xorg)/maps before the crash:

55ae9d812000-55ae9d841000 r--p 00000000 fe:01 22372125 /usr/lib/Xorg
55ae9d841000-55ae9d9f8000 r-xp 0002f000 fe:01 22372125 /usr/lib/Xorg
55ae9d9f8000-55ae9da6b000 r--p 001e6000 fe:01 22372125 /usr/lib/Xorg
55ae9da6b000-55ae9da6f000 r--p 00258000 fe:01 22372125 /usr/lib/Xorg
55ae9da6f000-55ae9da75000 rw-p 0025c000 fe:01 22372125 /usr/lib/Xorg
55ae9da75000-55ae9dab3000 rw-p 00000000 00:00 0
55ae9f8c8000-55aea182f000 rw-p 00000000 00:00 0 [heap]

Thread 1 (Thread 0x7fd70c43e940 (LWP 48005)):
#0 0x00007fd70cd3134c in __pthread_kill_implementation () at /usr/lib/libc.so.6
#1 0x00007fd70cce44b8 in raise () at /usr/lib/libc.so.6
#2 0x00007fd70ccce534 in abort () at /usr/lib/libc.so.6
#3 0x000055ae9d9659f0 in System ()
#4 0x000055ae9d967535 in FatalError ()
#5 0x000055ae9d96cf0a in ()
#6 0x00007fd70cce4560 in <signal handler called> () at /usr/lib/libc.so.6
#7 0x0000000000000000 in ()
#8 0x000055ae9d8f09fe in ()
#9 0x000055ae9d8f3b52 in ()
#10 0x000055ae9d8f2cd6 in ()
#11 0x000055ae9d84f34a in ()
#12 0x00007fd70cccf310 in __libc_start_call_main () at /usr/lib/libc.so.6
#13 0x00007fd70cccf3c1 in __libc_start_main_impl () at /usr/lib/libc.so.6
#14 0x000055ae9d84f795 in _start ()

So it appears that frame 8 (0x000055ae9d8f09fe), which calls the null pointer, is in the 2nd map (the r-xp mapping of /usr/lib/Xorg), i.e. Xorg code:

(gdb) frame 8
#8 0x000055ae9d8f09fe in ?? ()
(gdb) x/16i $rip-0x10
0x55ae9d8f09ee: test %edx,%eax
0x55ae9d8f09f0: add (%rax),%eax
0x55ae9d8f09f2: add %cl,-0x7b(%rax)
0x55ae9d8f09f5: shlb $0x89,0x48(%rax,%rcx,2)
0x55ae9d8f09fa: (bad)
0x55ae9d8f09fb: call *0x78(%rax)
=> 0x55ae9d8f09fe: test %rax,%rax
0x55ae9d8f0a01: je 0x55ae9d8f0a40
0x55ae9d8f0a03: mov 0x1bbd9b(%rip),%esi # 0x55ae9daac7a4
0x55ae9d8f0a09: mov 0x8(%rax),%rdx
0x55ae9d8f0a0d: mov 0x1bbd95(%rip),%ecx # 0x55ae9daac7a8
0x55ae9d8f0a13: test %esi,%esi
0x55ae9d8f0a15: je 0x55ae9d8f0a70
0x55ae9d8f0a17: test %ecx,%ecx
0x55ae9d8f0a19: je 0x55ae9d8f0acd
0x55ae9d8f0a1f: movslq 0x1bbd7a(%rip),%rcx # 0x55ae9daac7a0

I don't know if that helps identify where it's coming from.
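
The lookup of frame 8 in the maps output can also be done mechanically. The following is a minimal Python sketch, not part of the original debugging session: the helper names `parse_maps` and `find_mapping` are hypothetical, and the sample input is the first three lines of the maps output quoted above. It finds the mapping that contains the return address and converts the address to an offset within the mapped file.

```python
from typing import List, NamedTuple, Optional


class Mapping(NamedTuple):
    start: int
    end: int
    perms: str
    offset: int
    path: str


def parse_maps(text: str) -> List[Mapping]:
    """Parse lines in /proc/<pid>/maps format into Mapping tuples."""
    mappings = []
    for line in text.splitlines():
        fields = line.split(None, 5)
        if len(fields) < 5:
            continue  # skip blank or malformed lines
        start, end = (int(x, 16) for x in fields[0].split("-"))
        path = fields[5].strip() if len(fields) > 5 else ""
        mappings.append(Mapping(start, end, fields[1], int(fields[2], 16), path))
    return mappings


def find_mapping(mappings: List[Mapping], addr: int) -> Optional[Mapping]:
    """Return the mapping whose [start, end) range contains addr, if any."""
    for m in mappings:
        if m.start <= addr < m.end:
            return m
    return None


# Sample input: the first three mappings quoted in this comment.
MAPS = """\
55ae9d812000-55ae9d841000 r--p 00000000 fe:01 22372125 /usr/lib/Xorg
55ae9d841000-55ae9d9f8000 r-xp 0002f000 fe:01 22372125 /usr/lib/Xorg
55ae9d9f8000-55ae9da6b000 r--p 001e6000 fe:01 22372125 /usr/lib/Xorg
"""

frame8 = 0x000055AE9D8F09FE  # return address of frame 8 in the backtrace
hit = find_mapping(parse_maps(MAPS), frame8)
file_offset = frame8 - hit.start + hit.offset
print(hit.perms, hit.path, hex(file_offset))  # r-xp /usr/lib/Xorg 0xde9fe
```

The printed offset is the position within /usr/lib/Xorg of the instruction following `call *0x78(%rax)`. Unlike the runtime address, it does not change between runs, so it is the value one would compare against a build with symbols.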
Comment by Andreas Radke (AndyRTR) - Saturday, 19 February 2022, 11:32 GMT
Please check whether this is a duplicate of FS#73875. If it keeps crashing with xorg-server 21.1.3-5, please report it upstream.
Comment by James Hogan (jhogan) - Saturday, 19 February 2022, 12:05 GMT
I built it from the PKGBUILD (I think it's the one in testing) and got a proper backtrace. I've reported it upstream here:
https://gitlab.freedesktop.org/xorg/xserver/-/issues/1315

I think it's linked to starting after a full power cycle (Storm Eunice meant power cuts yesterday and this morning).
Comment by James Hogan (jhogan) - Saturday, 19 February 2022, 12:33 GMT
I can confirm it is still reproducible on 21.1.3-5
Comment by James Hogan (jhogan) - Monday, 21 February 2022, 08:32 GMT
This is fixed in xorg-server 21.1.3-6 (well, SteamVR can't use direct mode, but at least it doesn't crash the X server, and I can power cycle the HTC Vive link box after boot to get it working again).

Thanks
James
