FS#46064 - [nvidia-libgl] segfault when using TSX (__lll_unlock_elision)

Attached to Project: Arch Linux
Opened by Neal Oakey (neal) - Saturday, 22 August 2015, 08:02 GMT
Last edited by Sven-Hendrik Haase (Svenstaro) - Wednesday, 13 January 2016, 17:53 GMT
Task Type Bug Report
Category Packages: Extra
Status Closed
Assigned To Ionut Biru (wonder)
Sven-Hendrik Haase (Svenstaro)
Felix Yan (felixonmars)
Architecture All
Severity High
Priority Normal
Reported Version
Due in Version Undecided
Due Date Undecided
Percent Complete 100%
Votes 16
Private No

Details

Many programs (like: i3lock, pavucontrol, zathura, gcr-prompter, wireshark, awesome, chromium) segfault on exit. The crash seems to happen when libEGL is finalized and attempts to unlock an elided lock.

Hardware is a Thinkpad T550 (i7-5600U) with NVIDIA graphics using the proprietary drivers. IIRC crashes started appearing with the latest glibc update. Building glibc without lock elision circumvents the problem.

Sample trace (identical for all programs):

[Thread debugging using libthread_db enabled]
Using host libthread_db library "/usr/lib/libthread_db.so.1".
Core was generated by `pavucontrol'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x00007f42c76687e0 in __lll_unlock_elision () from /usr/lib/libpthread.so.0
(gdb) bt
#0 0x00007f42c76687e0 in __lll_unlock_elision () from /usr/lib/libpthread.so.0
#1 0x00007f42c2c3a26c in ?? () from /usr/lib/libEGL.so.1
#2 0x00007f42c2bcaa22 in ?? () from /usr/lib/libEGL.so.1
#3 0x00007ffddf6677c0 in ?? ()
#4 0x00007f42c2c4eea1 in ?? () from /usr/lib/libEGL.so.1
#5 0x00007ffddf6677c0 in ?? ()
#6 0x00007f42ccd20885 in _dl_fini () from /lib64/ld-linux-x86-64.so.2
Backtrace stopped: previous frame inner to this frame (corrupt stack?)

glibc 2.22-1
nvidia-libgl 352.30-1

Closed by  Sven-Hendrik Haase (Svenstaro)
Wednesday, 13 January 2016, 17:53 GMT
Reason for closing:  Fixed
Comment by Jan Alexander Steffens (heftig) - Saturday, 22 August 2015, 15:10 GMT
Reopening.  FS#45295  was closed with "Not a glibc issue - file bugs against the segfaulting packages."
Comment by Doug Newgard (Scimmia) - Saturday, 22 August 2015, 15:12 GMT
Right, which is not what this is. This is a complete duplicate unless I'm reading it completely wrong?
Comment by Jan Alexander Steffens (heftig) - Saturday, 22 August 2015, 15:14 GMT
 FS#45295  is against glibc. This is against nvidia-libgl.
Comment by Alad Wenter (Alad) - Saturday, 29 August 2015, 08:30 GMT
Try updating the microcode if you haven't already:

https://wiki.archlinux.org/index.php/Microcode
Comment by Jan Alexander Steffens (heftig) - Saturday, 29 August 2015, 08:36 GMT
Since there are no microcode updates available for this processor, I assume it has a non-buggy implementation.
Comment by Sven-Hendrik Haase (Svenstaro) - Saturday, 29 August 2015, 15:58 GMT
I don't think there is any way I can do anything here?
Comment by Jan Alexander Steffens (heftig) - Saturday, 29 August 2015, 15:59 GMT
I fear nVidia will just point back to glibc, and Allan is adamant about keeping lock elision enabled.
Comment by Sven-Hendrik Haase (Svenstaro) - Saturday, 29 August 2015, 16:19 GMT
I'm pulling in Allan and see.
Comment by Allan McRae (Allan) - Saturday, 29 August 2015, 19:31 GMT
This is a nvidia issue. They are unlocking something that does not exist.
Comment by Sven-Hendrik Haase (Svenstaro) - Wednesday, 09 September 2015, 07:14 GMT
Is this still an issue in the current version?
Comment by Neal Oakey (neal) - Wednesday, 09 September 2015, 10:02 GMT
you mean the current glibc version? (I think there was no nvidia update)
yes it is, I had to recompile it again
Comment by OldShatterhand (OldShatterhand) - Friday, 25 September 2015, 17:14 GMT
Hi, I can confirm the bug. Same behaviour on my system (just other programs crashing: vlc, sddm, kde-lockscreen). As soon as I recompile glibc with lock elision disabled, or manually link libEGL.so.1 to the mesa lib instead of the nvidia lib (ln -s /usr/lib/mesa/libEGL.so.1 /usr/lib), everything appears to work correctly.
In my case there are no microcode updates either (Skylake CPU).
Comment by Scott Mansell (phire) - Friday, 25 September 2015, 19:45 GMT
Can confirm. I'm also on Skylake and running into the same problems: as soon as I install nvidia-libgl, gnome-shell doesn't launch and vim (from the gvim package) segfaults on quit.
Installing a custom version of glibc with lock elision disabled makes the problems go away.

Even with lock elision disabled, running helgrind on, say, vim (from gvim) shows that libEGL is unlocking a lock that wasn't locked: http://pastie.org/pastes/10443914/text

I'm sure helgrind will show the same issue on systems that don't have TSX enabled.
Comment by Scott Mansell (phire) - Saturday, 26 September 2015, 15:09 GMT
Minimal test case.

test.c:
int main() { return 0; }

compile with: gcc test.c -o test -lEGL

test with helgrind: valgrind --tool=helgrind ./test
Comment by Steven Noonan (neunon) - Tuesday, 29 September 2015, 09:22 GMT
I have a glibc built without lock elision (literally everything was crashing on my Skylake box, so I killed the lock elision feature). I see a different crash-on-exit with anything using libEGL.so:

==3673== Invalid read of size 8
==3673== at 0xAFA9BE1: __eglTeardownVendor (in /usr/lib/nvidia/libEGL.so.355.11)
==3673== by 0x400F884: _dl_fini (in /usr/lib/ld-2.22.so)
==3673== by 0x63F8F87: __run_exit_handlers (in /usr/lib/libc-2.22.so)
==3673== by 0x63F8FD4: exit (in /usr/lib/libc-2.22.so)
==3673== by 0x63E3616: (below main) (in /usr/lib/libc-2.22.so)
==3673== Address 0x8 is not stack'd, malloc'd or (recently) free'd
==3673==
==3673==
==3673== Process terminating with default action of signal 11 (SIGSEGV)
==3673== Access not within mapped region at address 0x8
==3673== at 0xAFA9BE1: __eglTeardownVendor (in /usr/lib/nvidia/libEGL.so.355.11)
==3673== by 0x400F884: _dl_fini (in /usr/lib/ld-2.22.so)
==3673== by 0x63F8F87: __run_exit_handlers (in /usr/lib/libc-2.22.so)
==3673== by 0x63F8FD4: exit (in /usr/lib/libc-2.22.so)
==3673== by 0x63E3616: (below main) (in /usr/lib/libc-2.22.so)
==3673== If you believe this happened as a result of a stack
==3673== overflow in your program's main thread (unlikely but
==3673== possible), you can try to increase the size of the
==3673== main thread stack using the --main-stacksize= flag.
==3673== The main thread stack size used in this run was 8388608.

"exo-open" with no arguments is a nice example of this behavior.
Comment by Steven Noonan (neunon) - Tuesday, 29 September 2015, 10:01 GMT
The __eglTeardownVendor crash doesn't happen if I downgrade the NVIDIA packages from 355.11 to 352.41.

Still get lock elision crashes when that's built in.
Comment by Jan Alexander Steffens (heftig) - Tuesday, 13 October 2015, 08:58 GMT
Attaching phire's helgrind output in case it vanishes from pastie.
   hel.txt (3.4 KiB)
Comment by Darek (blablo) - Wednesday, 14 October 2015, 15:30 GMT
Maybe it will help?
driver 358.09 (beta) https://devtalk.nvidia.com/default/topic/884727
Comment by Gerhard Bogner (slashME) - Friday, 16 October 2015, 13:39 GMT
With a Skylake CPU (sig=0x506e3, pf=0x2, revision=0x39) this still happens using the 358.09 beta driver if lock elision in glibc is enabled. (It doesn't with nvidia < 352 regardless of lock elision.)
Comment by Mister Ypsilon (mrypsilon) - Tuesday, 03 November 2015, 16:46 GMT
Does NVIDIA already know about this? They should really fix it, having to compile glibc every new release could get pretty annoying...
Comment by Steven Noonan (neunon) - Tuesday, 03 November 2015, 19:42 GMT
I've reported the bug using the NVIDIA partners site. That should get it the appropriate engineering attention.
Comment by Pieter Lexis (lieter) - Monday, 09 November 2015, 20:01 GMT
Confirmed this issue still exists with the same skylake as Gerhard (sig=0x506e3, pf=0x2, revision=0x39) and the latest NVIDIA driver (358.09). Also confirmed that disabling lock elision fixes this issue.
Comment by Jörg Stettner (jost5367) - Monday, 09 November 2015, 20:10 GMT
I can confirm this issue still exists on an Intel Core i5-6500 Skylake system (sig=0x506e3, pf=0x2, revision=0x49) and the NVIDIA driver 352.55. Again, using glibc without --enable-lock-elision fixes it.
Comment by Gerhard Bogner (slashME) - Sunday, 22 November 2015, 21:22 GMT
Still exists with nvidia driver 358.16-1.
Comment by Sven-Hendrik Haase (Svenstaro) - Tuesday, 24 November 2015, 07:53 GMT
You guys with reproducible problems, please post a bug report on the nvidia dev forums and link it here. Nvidia appears to be unaware of it.
Comment by Steven Noonan (neunon) - Tuesday, 24 November 2015, 08:00 GMT
As I said before, I reported it through the NVIDIA partners site (bug #1701106 if you're able to request access somehow). They're aware of it.
Comment by Sven-Hendrik Haase (Svenstaro) - Tuesday, 24 November 2015, 08:23 GMT
Alright, so basically there's nothing to do here for us since glibc is going to be kept the way it is in Arch? Is Aaron Plattner aware of this? He's an Arch user himself IIRC and works at nvidia.
Comment by Steven Noonan (neunon) - Tuesday, 24 November 2015, 08:35 GMT
Until it's fixed by NVIDIA, there is an awful patch that could be applied to glibc (I don't like the patch *at all*, but it has made my systems usable in the meantime):

http://git.uplinklabs.net/snoonan/projects/archlinux/ec2/ec2-packages.git/tree/glibc/glibc-2.22-lock-elision-crash-nvidia.patch

I hadn't heard Aaron Plattner's name until now. I haven't received many updates on the NVIDIA bug report other than simple status changes (no comments yet). I believe it's currently marked "in progress" (was "pending review" a couple weeks ago).
Comment by Darek (blablo) - Tuesday, 24 November 2015, 08:57 GMT
@Svenstaro
Aaron Plattner said: NOTE: I'm on paternity leave until early 2016.
Comment by Allan McRae (Allan) - Tuesday, 24 November 2015, 11:13 GMT
That glibc patch is awful and will not be applied. Someone could just create a "glibc-for-stupid-nvidia" package that disables lock elision and provides/replaces glibc.
Comment by Steven Noonan (neunon) - Wednesday, 02 December 2015, 02:48 GMT
Good news. NVIDIA closed the bug today and has marked it as "fixed". An upcoming driver release should have the fix (352.67 or later, 361.10 or later).
Comment by Matthew (Freaksta) - Monday, 04 January 2016, 22:54 GMT
Any idea on the release of 352.67 or later?
Comment by Michael Schäfer (Tarr3128) - Tuesday, 05 January 2016, 09:46 GMT
I do have a similar issue with Intel(R) Xeon(R) CPU E3-1535M v5 (sig=0x506e3, pf=0x20, revision=0x49), nvidia driver 358.16, lots of stuff segfaulting all over the place.

I've solved it by patching glibc with the TSX blacklist code from Debian (http://sources.debian.net/data/main/g/glibc/2.22-0experimental1/debian/patches/amd64/local-blacklist-for-Intel-TSX.diff), with model 94 and stepping <= 3 added as well.

Maybe this would be an approach to solve this issue? I've tried to use the latest microcode from intel, as intel-ucode is outdated, but even with that it doesn't work properly.

Of course this could also be a flawed CPU, but I don't really have a way to figure that one out.
Comment by Darek (blablo) - Tuesday, 05 January 2016, 20:21 GMT
Comment by Jörg Stettner (jost5367) - Tuesday, 05 January 2016, 20:29 GMT
Great news! I'll wait for the corresponding Arch Package, and test it with the original glibc libraries - should be working without core dumps, according to the last bullet point mentioned in Aaron Plattner's dev note.
Comment by Matthew (Freaksta) - Tuesday, 05 January 2016, 23:17 GMT
Will the corresponding arch package wait until the BETA is complete, or will there be a BETA package made available?
Comment by Scott Mansell (phire) - Wednesday, 06 January 2016, 01:55 GMT
@Tarr3128

That's not a viable approach, as it would require blacklisting every single CPU with TSX support, including future CPUs.
It's exactly the same as compiling glibc with lock elision disabled.

The problem is not a CPU issue. The TSX feature is finally working correctly, which is why there is no microcode
update for the newer CPUs. It's an issue with the Nvidia package, which is acting outside the spec. Only an update
to the Nvidia package will fix it. Everything else (including microcode updates which disable TSX) is a workaround.
Comment by Michael Schäfer (Tarr3128) - Wednesday, 06 January 2016, 08:41 GMT
I see, thank you @phire. The reason I posted that solution is that there is a difference between disabling lock elision and the blacklisting patch: I still had segfaults, though not as many, which might be some code issue with glibc. Once the driver is available I'll do some testing.
Comment by Riku Salminen (rikusalminen) - Wednesday, 06 January 2016, 13:05 GMT
I have a workaround for this issue with a 5 byte binary patch:

In libEGL_nvidia.so.0 (md5sum 36a6edacefcd2893b3bc5c6c282943b6), replace bytes "e8 64 d3 f8 ff" @ offset 0x95987 with "90 90 90 90 90".

It replaces an unnecessary call to pthread_mutex_unlock just before pthread_mutex_destroy with a sequence of NOPs. This problem is silently ignored in glibc built without lock elision or on CPUs with no TSX.
Comment by Sven-Hendrik Haase (Svenstaro) - Thursday, 07 January 2016, 10:20 GMT
Please test the new nvidia beta driver. You can use the AUR package.
Comment by Riku Salminen (rikusalminen) - Thursday, 07 January 2016, 11:00 GMT
I have not tested the beta driver but I did extract the package and analyze libEGL_nvidia.so disassembly with objdump and the offending call to pthread_mutex_unlock is gone.

In other words: this issue SHOULD be fixed in the beta.
Comment by Sven-Hendrik Haase (Svenstaro) - Thursday, 07 January 2016, 13:11 GMT
I stuck the new driver into [testing]. Please test.
Comment by Michael Schäfer (Tarr3128) - Thursday, 07 January 2016, 13:38 GMT
There is a thread over at nvidia (https://devtalk.nvidia.com/default/topic/908506/several-essential-kde-applications-sddm-krunner-plasmashell-segfault-on-startup-with-361-16/) that suggests that the new 361.16 driver has a rather serious bug.
Comment by Gerhard Bogner (slashME) - Thursday, 07 January 2016, 23:29 GMT
The original crash (in __lll_unlock_elision) seems to be fixed in 361-16 from testing. (Gnome works, I didn't try KDE.)
Comment by Jörg Stettner (jost5367) - Wednesday, 13 January 2016, 15:58 GMT
Today Nvidia released another beta driver, 361.18.
From the description I see that the "pthread_mutex_unlock" issue is mentioned again, same as for 361.16:
"Fixed a bug in the EGL driver where a mutex was unlocked more than once. This triggers undefined behavior, and in particular, if lock elision is enabled in glibc, may result in a segmentation fault."
Apart from that, I don't find any new fixes, especially nothing that differs from 361.16.
Does it make sense to test 361.18, too, with respect to the "plasmashell segfault" issues?
Comment by Mister Ypsilon (mrypsilon) - Wednesday, 13 January 2016, 16:14 GMT
According to this: https://devtalk.nvidia.com/default/topic/908506/several-essential-kde-applications-sddm-krunner-plasmashell-segfault-on-startup-with-361-16/#4776281
the "plasmashell segfault issues" should be fixed in 361.18.
I'll give it a try now...

Edit: Can confirm both the sddm/plasmashell segfault and the original "glibc lock elision segfaults" are gone with 361.18.
