FS#39631 - [glibc] --enable-lock-elision breaks applications on Haswell

Attached to Project: Arch Linux
Opened by Thomas Bächler (brain0) - Wednesday, 26 March 2014, 15:10 GMT
Last edited by Allan McRae (Allan) - Friday, 05 September 2014, 11:25 GMT
Task Type Bug Report
Category Packages: Core
Status Closed
Assigned To Allan McRae (Allan)
Architecture x86_64
Severity High
Priority Normal
Reported Version
Due in Version Undecided
Due Date Undecided
Percent Complete 100%
Votes 1
Private No

Details

With current glibc 2.19-3 on a Haswell i7-4600U, some applications show subtle, hard-to-reproduce failures.

In particular, running Maple 17 and performing certain computations causes Maple to abort with
GC Thread signalAbort 0x7fb49679f700 Execution stopped: Stack limit reached.
after some time. Due to lack of sources and debug symbols, the backtrace is rather useless.

I built glibc with --enable-lock-elision=no, put libpthread.so.0 into its own directory and started the application with LD_LIBRARY_PATH set accordingly. This fixes the problem.
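The workaround above amounts to starting the application with a private library directory placed first in LD_LIBRARY_PATH. A minimal sketch of that step follows, assuming the rebuilt libpthread.so.0 has already been copied into some directory; the helper name and both paths are hypothetical and do not come from this report.

```python
def env_with_library_dir(lib_dir, base_env=None):
    """Return a copy of the environment with lib_dir first in LD_LIBRARY_PATH."""
    import os

    env = dict(os.environ if base_env is None else base_env)
    existing = env.get("LD_LIBRARY_PATH", "")
    # The dynamic loader searches LD_LIBRARY_PATH entries left to right,
    # so the rebuilt library must come before any existing entries.
    env["LD_LIBRARY_PATH"] = lib_dir + (":" + existing if existing else "")
    return env


# Hypothetical usage; both paths are placeholders, not taken from the report:
#   import subprocess
#   env = env_with_library_dir("/opt/glibc-no-elision/lib")
#   subprocess.run(["/path/to/application"], env=env, check=True)
```

Only the launched process and its children see the modified environment, so the system-wide glibc stays untouched.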

Closed by  Allan McRae (Allan)
Friday, 05 September 2014, 11:25 GMT
Reason for closing:  Not a bug
Additional comments about closing:  Not a glibc issue.
Comment by Dave Reisner (falconindy) - Wednesday, 26 March 2014, 15:17 GMT
Comment by Thomas Bächler (brain0) - Wednesday, 26 March 2014, 15:21 GMT
Backtrace looks different here. But as I said, it is a rather useless backtrace.

Core was generated by `/opt/maple17/bin.X86_64_LINUX/mserver -kport 51580 -O C --env-setup'.
Program terminated with signal SIGABRT, Aborted.
#0 0x00007eff57382389 in raise () from /usr/lib/libc.so.6
(gdb) bt
#0 0x00007eff57382389 in raise () from /usr/lib/libc.so.6
#1 0x00007eff57383788 in abort () from /usr/lib/libc.so.6
#2 0x00007eff57a23333 in ?? () from /opt/maple17/bin.X86_64_LINUX/libmaple.so
#3 0x00007eff57a2201f in ?? () from /opt/maple17/bin.X86_64_LINUX/libmaple.so
#4 0x00007eff56c2f0a2 in start_thread () from /usr/lib/libpthread.so.0
#5 0x00007eff57432d1d in clone () from /usr/lib/libc.so.6
Comment by Allan McRae (Allan) - Wednesday, 26 March 2014, 22:50 GMT
Every report about issues with lock-elision has been fixed in the upstream software. Not sure how you are going to get that done for maple...
Comment by Thomas Bächler (brain0) - Wednesday, 26 March 2014, 23:35 GMT
I won't. Every report I make will simply be answered with a link to the list of "supported operating systems" (this always includes terribly outdated redhat and ubuntu versions). Last time I had a problem however (different one), it eventually got fixed in glibc.

I'd at least like to determine who behaves incorrectly here.
Comment by Allan McRae (Allan) - Thursday, 27 March 2014, 01:48 GMT
Looking at that backtrace, there is no way to tell. The abort call comes from somewhere within libmaple.so. Without information on what causes the abort, there is nothing we can do.
Comment by Allan McRae (Allan) - Thursday, 01 May 2014, 08:15 GMT
I'm going to close this as I am not disabling lock-elision without actual evidence it is at fault (and then I would fix it).

I suggest getting an old glibc package, or rebuilding the current one, and using LD_PRELOAD to load it.
Comment by Cedric BAIL (bluebugs) - Thursday, 04 September 2014, 10:26 GMT
  • Field changed: Percent Complete (100% → 0%)
Intel is turning off lock elision support in Haswell hardware due to a serious bug. See the following links:
http://anandtech.com/show/8376/intel-disables-tsx-instructions-erratum-found-in-haswell-haswelleep-broadwell
http://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/xeon-e3-1200v3-spec-update.pdf

It is quite rare to find a bug in hardware, and I am also impressed that someone on the Arch Linux bug tracker was able to find it months before the news became public.
Comment by Allan McRae (Allan) - Thursday, 04 September 2014, 10:35 GMT
Closing as a hardware bug not a software one. Update your microcode if affected.
Comment by Cedric BAIL (bluebugs) - Thursday, 04 September 2014, 18:06 GMT
  • Field changed: Percent Complete (100% → 0%)
Most people won't be able to work out that the crashes and instability of their platform are due to a hardware bug. I would still advise turning that feature off to improve the stability of every program for everyone.
As for myself, I don't really care, I can work around.
Comment by Jan de Groot (JGC) - Thursday, 04 September 2014, 18:09 GMT
Reopening:

The only hardware that supports this extension at the moment is hardware that is buggy. Updating the microcode (through the BIOS or through the microcode interface in Linux) will just disable the extension, meaning that no CPU provides support for this feature.

I would suggest turning this off: why would you use an extension that isn't available on any hardware? Also note that it isn't enabled by default, but specifically enabled with an --enable flag, so we should turn it off.
Comment by Jan Alexander Steffens (heftig) - Friday, 05 September 2014, 08:00 GMT
The current microcode we have does not disable the feature, at least on celestia (i7-4770 stepping 3 microcode 0x1a).
My haswells (i7-4770R stepping 1 microcode 0xe and i7-4750HQ stepping 1 microcode 0x10) both have it disabled.

I'm also for rebuilding glibc without lock elision.
Comment by Jan Alexander Steffens (heftig) - Friday, 05 September 2014, 08:11 GMT
Oh, the 4770R didn't have the intel-ucode update, which also brings it to 0x10 (still without hle and rtm).
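For anyone wanting to repeat this check, here is a minimal sketch that reads the microcode revision and the hle/rtm flags from /proc/cpuinfo. It assumes a Linux x86 system where the kernel exposes the "microcode" and "flags" fields; the function name is illustrative.

```python
def parse_cpuinfo(text):
    """Return one dict per processor block with its microcode revision and TSX flags."""
    cpus = []
    # /proc/cpuinfo separates the per-processor blocks with a blank line.
    for block in text.strip().split("\n\n"):
        fields = {}
        for line in block.splitlines():
            key, sep, value = line.partition(":")
            if sep:
                fields[key.strip()] = value.strip()
        if "flags" not in fields:
            continue  # not an x86-style block
        flags = set(fields["flags"].split())
        cpus.append({
            "model": fields.get("model name", "unknown"),
            "microcode": fields.get("microcode", "unknown"),
            "hle": "hle" in flags,
            "rtm": "rtm" in flags,
        })
    return cpus


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as f:
            report = parse_cpuinfo(f.read())
    except OSError:
        report = []  # not Linux, or /proc is unavailable
    for cpu in report:
        print(cpu)
```

A CPU whose entry shows both hle and rtm as False is one where the kernel does not see the feature, which matches the "disabled" state described above.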
Comment by Daniel Micay (thestinger) - Friday, 05 September 2014, 08:12 GMT
At some point we're going to want this enabled again, so perhaps putting intel-ucode in base after the next upgrade would be a good idea. For now, disabling the feature would be fine too.
Comment by Allan McRae (Allan) - Friday, 05 September 2014, 08:34 GMT
intel-ucode should not be in base.
Comment by Allan McRae (Allan) - Friday, 05 September 2014, 08:48 GMT
I am finding many references to this being disabled where needed. E.g.
http://techreport.com/news/26911/errata-prompts-intel-to-disable-tsx-in-haswell-early-broadwell-cpus

An Intel spokesperson has provided TR with a brief statement on the TSX erratum, confirming that Intel has "addressed the issue" and "disabled the TSX feature on affected products."
Comment by Jan Alexander Steffens (heftig) - Friday, 05 September 2014, 09:29 GMT
It's still a problematic feature that causes issues not only in non-free applications but also causes libgc to segfault (in its own testsuite as well as applications using it).
Comment by Allan McRae (Allan) - Friday, 05 September 2014, 10:15 GMT
Every replicable software problem has been demonstrated to be an issue in the software and not glibc. They need to be fixed at the source.
Comment by Cedric BAIL (bluebugs) - Friday, 05 September 2014, 10:52 GMT
So Intel does turn off the feature for everyone getting a new Haswell, but previous hardware with faulty microcode was in fact fine. Maybe we should tell them to use the older microcode if every replicable software problem has been demonstrated to be an issue in the software...

Seriously, there is no hardware with working microcode for that instruction as far as I can tell. It is going to create issues randomly that people won't be able to debug, and when they finally pin it down (if they have the time to dig that deep), it will be to find out that Arch Linux is providing a faulty glibc.

I guess I had better go to glibc and try to figure out a patch there.
Comment by Daniel Micay (thestinger) - Friday, 05 September 2014, 10:54 GMT
I don't think there's anything for glibc to do differently, lock elision was intentionally designed to be a no-op on older processors to avoid needing runtime detection.
Comment by Cedric BAIL (bluebugs) - Friday, 05 September 2014, 11:02 GMT
And when it is not a no-op, but a faulty instruction?
Comment by Daniel Micay (thestinger) - Friday, 05 September 2014, 11:13 GMT
It's never a faulty instruction if the microcode is up-to-date and it's easy for a distribution to take care of that. If Arch isn't going to ship intel-ucode in base like it does with linux-firmware via the linux dependency, then I don't think enabling lock elision makes sense. There are lots of already shipped Haswell CPUs where it's now known to be broken.
Comment by Cedric BAIL (bluebugs) - Friday, 05 September 2014, 11:17 GMT
I am not a fan of updating microcode, but it would indeed be a solution to the problem.
Comment by Allan McRae (Allan) - Friday, 05 September 2014, 11:25 GMT
@thestinger: linux-firmware is not in base.

Microcode updates appear to have been released by Intel.

Lock elision is enabled by Arch, Debian (jessie), Fedora, openSUSE, ...

I see no reason to disable it. glibc is not the issue.
