FS#46064 - [nvidia-libgl] segfault when using TSX (__lll_unlock_elision)
Attached to Project:
Arch Linux
Opened by Neal Oakey (neal) - Saturday, 22 August 2015, 08:02 GMT
Last edited by Sven-Hendrik Haase (Svenstaro) - Wednesday, 13 January 2016, 17:53 GMT
Details
Many programs (like: i3lock, pavucontrol, zathura,
gcr-prompter, wireshark, awesome, chromium) segfault on
exit. The crash seems to happen when libEGL is finalized and
attempts to unlock an elided lock.
Hardware is a Thinkpad T550 (i7-5600U) with NVIDIA graphics using the proprietary drivers. IIRC crashes started appearing with the latest glibc update. Building glibc without lock elision circumvents the problem.

Sample trace (identical for all programs):

[Thread debugging using libthread_db enabled]
Using host libthread_db library "/usr/lib/libthread_db.so.1".
Core was generated by `pavucontrol'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x00007f42c76687e0 in __lll_unlock_elision () from /usr/lib/libpthread.so.0
(gdb) bt
#0 0x00007f42c76687e0 in __lll_unlock_elision () from /usr/lib/libpthread.so.0
#1 0x00007f42c2c3a26c in ?? () from /usr/lib/libEGL.so.1
#2 0x00007f42c2bcaa22 in ?? () from /usr/lib/libEGL.so.1
#3 0x00007ffddf6677c0 in ?? ()
#4 0x00007f42c2c4eea1 in ?? () from /usr/lib/libEGL.so.1
#5 0x00007ffddf6677c0 in ?? ()
#6 0x00007f42ccd20885 in _dl_fini () from /lib64/ld-linux-x86-64.so.2
Backtrace stopped: previous frame inner to this frame (corrupt stack?)

glibc 2.22-1
nvidia-libgl 352.30-1
Closed by Sven-Hendrik Haase (Svenstaro)
Wednesday, 13 January 2016, 17:53 GMT
Reason for closing: Fixed
FS#45295 was closed with "Not a glibc issue - file bugs against the segfaulting packages."
FS#45295 is against glibc. This is against nvidia-libgl.
https://wiki.archlinux.org/index.php/Microcode
yes it is, I had to recompile it again
In my case there are no microcode updates either (Skylake CPU).
Installing a custom version of glibc with lock elision disabled makes the problems go away.
Even with lock elision disabled, running helgrind on, say, vim (from gvim) shows that libEGL is unlocking a lock that wasn't locked: http://pastie.org/pastes/10443914/text
I'm sure helgrind will show the same issue on systems that don't have TSX enabled.
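To illustrate the class of error helgrind reports here without involving the NVIDIA library, below is a minimal sketch in C. It assumes an error-checking mutex (PTHREAD_MUTEX_ERRORCHECK), for which POSIX specifies that unlocking a mutex the caller does not hold returns an error. With the default mutex type the same call is undefined behavior, which would explain why it can pass unnoticed when lock elision is not in use. The function names are made up for illustration; this is not the libEGL code.

```c
#define _XOPEN_SOURCE 700
#include <errno.h>
#include <pthread.h>

/* Hypothetical helper: create an error-checking mutex.
 * Returns 0 on success, -1 on failure. */
static int init_errorcheck_mutex(pthread_mutex_t *m)
{
    pthread_mutexattr_t attr;
    int rc;

    if (pthread_mutexattr_init(&attr) != 0)
        return -1;
    rc = pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
    if (rc == 0)
        rc = pthread_mutex_init(m, &attr);
    pthread_mutexattr_destroy(&attr);
    return rc == 0 ? 0 : -1;
}

/* Unlock a mutex that was never locked and return what
 * pthread_mutex_unlock reports. With an error-checking mutex
 * this is EPERM instead of undefined behavior. */
int unlock_without_lock(void)
{
    pthread_mutex_t m;
    int rc;

    if (init_errorcheck_mutex(&m) != 0)
        return -1;
    rc = pthread_mutex_unlock(&m);   /* the unbalanced unlock */
    pthread_mutex_destroy(&m);
    return rc;
}

/* The balanced sequence: lock, unlock, destroy.
 * Returns 0 if every step succeeded. */
int balanced_teardown(void)
{
    pthread_mutex_t m;
    int rc;

    if (init_errorcheck_mutex(&m) != 0)
        return -1;
    rc = pthread_mutex_lock(&m);
    if (rc == 0)
        rc = pthread_mutex_unlock(&m);
    if (rc == 0)
        rc = pthread_mutex_destroy(&m);
    return rc;
}
```

Calling unlock_without_lock() from a small main and comparing the result against EPERM is enough to see the difference from balanced_teardown(), which should return 0. Build with gcc -pthread.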
test.c:
int main() { return 0; }
compile with: gcc test.c -o test -lEGL
test with helgrind: valgrind --tool=helgrind ./test
==3673== Invalid read of size 8
==3673== at 0xAFA9BE1: __eglTeardownVendor (in /usr/lib/nvidia/libEGL.so.355.11)
==3673== by 0x400F884: _dl_fini (in /usr/lib/ld-2.22.so)
==3673== by 0x63F8F87: __run_exit_handlers (in /usr/lib/libc-2.22.so)
==3673== by 0x63F8FD4: exit (in /usr/lib/libc-2.22.so)
==3673== by 0x63E3616: (below main) (in /usr/lib/libc-2.22.so)
==3673== Address 0x8 is not stack'd, malloc'd or (recently) free'd
==3673==
==3673==
==3673== Process terminating with default action of signal 11 (SIGSEGV)
==3673== Access not within mapped region at address 0x8
==3673== at 0xAFA9BE1: __eglTeardownVendor (in /usr/lib/nvidia/libEGL.so.355.11)
==3673== by 0x400F884: _dl_fini (in /usr/lib/ld-2.22.so)
==3673== by 0x63F8F87: __run_exit_handlers (in /usr/lib/libc-2.22.so)
==3673== by 0x63F8FD4: exit (in /usr/lib/libc-2.22.so)
==3673== by 0x63E3616: (below main) (in /usr/lib/libc-2.22.so)
==3673== If you believe this happened as a result of a stack
==3673== overflow in your program's main thread (unlikely but
==3673== possible), you can try to increase the size of the
==3673== main thread stack using the --main-stacksize= flag.
==3673== The main thread stack size used in this run was 8388608.
"exo-open" with no arguments is a nice example of this behavior.
Still get lock elision crashes when that's built in.
driver 358.09 (beta) https://devtalk.nvidia.com/default/topic/884727
http://git.uplinklabs.net/snoonan/projects/archlinux/ec2/ec2-packages.git/tree/glibc/glibc-2.22-lock-elision-crash-nvidia.patch
I hadn't heard Aaron Plattner's name until now. I haven't received many updates on the NVIDIA bug report other than simple status changes (no comments yet). I believe it's currently marked "in progress" (was "pending review" a couple weeks ago).
Aaron Plattner said: NOTE: I'm on paternity leave until early 2016.
I've solved it by patching glibc with the TSX blacklist code from Debian (http://sources.debian.net/data/main/g/glibc/2.22-0experimental1/debian/patches/amd64/local-blacklist-for-Intel-TSX.diff); I just added model 94 with stepping <= 3 as well.
Maybe this would be an approach to solve this issue? I've tried the latest microcode from Intel, as intel-ucode is outdated, but even with that it doesn't work properly.
Of course this could also be a flawed CPU, but I don't really have a way to figure that out.
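For readers wondering what a model/stepping check like the one described above looks like, here is a rough sketch in C. It is not the Debian patch. It only decodes family, model and stepping from the EAX value of CPUID leaf 1 and tests the single entry mentioned in this comment (model 94, stepping <= 3). The function names are hypothetical, the vendor check is omitted, and restricting the match to family 6 is an assumption.

```c
#include <stdint.h>

/* Decode the display family from CPUID leaf 1 EAX. */
unsigned cpu_family(uint32_t eax)
{
    unsigned family = (eax >> 8) & 0xF;
    if (family == 0xF)
        family += (eax >> 20) & 0xFF;   /* add extended family */
    return family;
}

/* Decode the display model from CPUID leaf 1 EAX. */
unsigned cpu_model(uint32_t eax)
{
    unsigned base_family = (eax >> 8) & 0xF;
    unsigned model = (eax >> 4) & 0xF;
    if (base_family == 0x6 || base_family == 0xF)
        model |= ((eax >> 16) & 0xF) << 4;   /* extended model */
    return model;
}

/* Decode the stepping from CPUID leaf 1 EAX. */
unsigned cpu_stepping(uint32_t eax)
{
    return eax & 0xF;
}

/* Hypothetical blacklist test covering only the entry named in
 * the comment above: model 94 with stepping <= 3.
 * Assumption: family 6; no vendor string check is done. */
int tsx_blacklisted(uint32_t eax)
{
    return cpu_family(eax) == 6 &&
           cpu_model(eax) == 94 &&
           cpu_stepping(eax) <= 3;
}
```

As an example, the value 0x506E3 decodes to family 6, model 94, stepping 3 and would therefore match, while the same value with stepping 4 (0x506E4) would not.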
That's not a viable approach, as it would require blacklisting every single CPU with TSX support, including future CPUs.
It's exactly the same as compiling glibc with lock elision disabled.
The problem is not a CPU issue. The TSX feature is finally working correctly, which is why there is no microcode
update for the newer CPUs. It's an issue with the Nvidia package, which is acting outside the spec. Only an update
to the Nvidia package will fix it. Everything else (including microcode updates which disable TSX) is a workaround.
In libEGL_nvidia.so.0 (md5sum 36a6edacefcd2893b3bc5c6c282943b6), replace bytes "e8 64 d3 f8 ff" @ offset 0x95987 with "90 90 90 90 90".
It replaces an unnecessary call to pthread_mutex_unlock just before pthread_mutex_destroy with a sequence of NOPs. This problem is silently ignored in glibc built without lock elision or on CPUs with no TSX.
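The byte replacement described above can be sketched as a small C helper. This is only an illustration working on an in-memory copy of the file, not a tool anyone in this thread used. The helper name is hypothetical, and the offset and byte values are only meaningful for the exact file with the md5sum given above. It refuses to patch unless the expected bytes are found at the offset, which guards against applying the change to a different driver version.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: if the n bytes at image[offset] equal
 * `expected`, overwrite them with x86 NOPs (0x90).
 * Returns 0 on success, -1 if the range is out of bounds or
 * the bytes found there do not match. */
int nop_out(unsigned char *image, size_t image_len, size_t offset,
            const unsigned char *expected, size_t n)
{
    if (offset > image_len || n > image_len - offset)
        return -1;                      /* range outside the image */
    if (memcmp(image + offset, expected, n) != 0)
        return -1;                      /* wrong file or already patched */
    memset(image + offset, 0x90, n);    /* 0x90 is the one-byte NOP */
    return 0;
}
```

For the patch in this comment the arguments would be offset 0x95987 and the five bytes e8 64 d3 f8 ff (a call with a 32-bit relative target). Running the helper a second time on the same buffer fails, since the call bytes are no longer there.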
In other words: this issue SHOULD be fixed in the beta.
From the description I see that the "pthread_mutex_unlock" issue is mentioned again, same as for 361.16:
"Fixed a bug in the EGL driver where a mutex was unlocked more than once. This triggers undefined behavior, and in particular, if lock elision is enabled in glibc, may result in a segmentation fault."
Apart from that, I don't see any new fixes, especially nothing that differs from 361.16.
Does it make sense to test 361.18, too, with respect to the "plasmashell segfault" issues?
The "plasmashell segfault issues" should be fixed in 361.18.
I'll give it a try now...
Edit: Can confirm both the sddm/plasmashell segfault and the original "glibc lock elision segfaults" are gone with 361.18.