FS#43010 - [glibc] Enabling lock elision in glibc causes illegal instruction crashes on non-Haswell Intel CPUs

Attached to Project: Arch Linux
Opened by David Anderson (danderson) - Thursday, 04 December 2014, 21:40 GMT
Last edited by Doug Newgard (Scimmia) - Friday, 05 December 2014, 00:36 GMT
Task Type Bug Report
Category Packages: Core
Status Closed
Assigned To No-one
Architecture x86_64
Severity High
Priority Normal
Reported Version
Due in Version Undecided
Due Date Undecided
Percent Complete 100%
Votes 1
Private No

Details

Description:

When compiled with --enable-lock-elision, glibc 2.20 unconditionally issues the 'xend' instruction in pthread_mutex_unlock. This causes programs to crash with SIGILL on non-Haswell Intel CPUs, because they don't implement the TSX instruction set extension that defines 'xend'.
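For anyone who wants to check whether a given machine advertises TSX at all, here is a minimal sketch in C. It assumes GCC or Clang on x86 (for `<cpuid.h>`); the helper names `cpuid7_ebx_bit`, `cpu_has_hle` and `cpu_has_rtm` are hypothetical and not part of glibc. 'xend' belongs to the RTM half of TSX, which CPUID reports in leaf 7, subleaf 0, EBX bit 11; HLE is EBX bit 4.

```c
#include <stdbool.h>
#include <stddef.h>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>
#endif

/* Hypothetical helper (not part of glibc): reports whether CPUID
   advertises the given bit in leaf 7, subleaf 0, register EBX.
   Returns false on non-x86 targets or when leaf 7 is unavailable. */
static bool cpuid7_ebx_bit(unsigned bit)
{
#if defined(__x86_64__) || defined(__i386__)
    unsigned eax, ebx, ecx, edx;

    if (__get_cpuid_max(0, NULL) < 7)
        return false;               /* CPU too old to have leaf 7 */

    __cpuid_count(7, 0, eax, ebx, ecx, edx);
    return ((ebx >> bit) & 1u) != 0;
#else
    (void)bit;
    return false;
#endif
}

/* HLE: the xacquire/xrelease prefixes. */
bool cpu_has_hle(void) { return cpuid7_ebx_bit(4); }

/* RTM: xbegin/xend/xabort; this is the bit that matters for 'xend'. */
bool cpu_has_rtm(void) { return cpuid7_ebx_bit(11); }
```

Calling `cpu_has_rtm()` from a small `main` and printing the result should be enough to tell whether 'xend' is a legal instruction on the machine in question. On Linux, the `rtm` and `hle` flags in /proc/cpuinfo carry the same information.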

Obviously, the fix for glibc itself should be done upstream (I don't see any relevant bugs in their tracker, so I'm going to go file one after this). In the meantime, Arch could remove --enable-lock-elision from the glibc PKGBUILD to work around the issue, at the cost of degraded performance on Haswell CPUs.

Fedora is also tracking this bug in their tracker, though they don't seem to be working on an upstream fix - they just disabled lock elision. See https://bugzilla.redhat.com/show_bug.cgi?id=1146967 and https://bugzilla.redhat.com/show_bug.cgi?id=1144794

Steps to reproduce:

The annoying reproduction I have involves building Ceph using my PKGBUILD here: https://github.com/danderson/packages-archlinux/tree/master/aur/ceph , then running `ceph -s`. I'm working on a short and sweet C reproduction, and I'll post it when I have it.

Closed by  Doug Newgard (Scimmia)
Friday, 05 December 2014, 00:36 GMT
Reason for closing:  Not a bug
Additional comments about closing:  User requested: Invalid: Ceph is invoking undefined pthread behavior which glibc devs have decided to not make less crashy.
Comment by David Anderson (danderson) - Thursday, 04 December 2014, 21:46 GMT
Additionally, Intel published an erratum recommending that TSX (aka hardware lock elision) be disabled due to unpredictable behavior on certain Haswell platforms: http://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/xeon-e3-1200v3-spec-update.pdf . It sounds like --enable-lock-elision should just be removed from Arch's glibc build altogether for the time being.
Comment by David Anderson (danderson) - Thursday, 04 December 2014, 21:54 GMT
Bah, and found the upstream bug, closed as invalid. It seems that Ceph is invoking undefined behavior by unlocking an already unlocked rwlock.

And indeed, I'm unable to reproduce the SIGILL with a minimal reproduction case (attached) which correctly sequences the unlock. Incorrect sequencing (also attached) does trigger the SIGILL, but as explained in https://sourceware.org/bugzilla/show_bug.cgi?id=17561 , glibc devs decided it was undefined behavior not worth correcting.

And the Intel erratum for TSX was "fixed" by a microcode update that turned off TSX at the source, so it's probably fine to keep lock elision enabled for glibc, unless you feel that it's unfair to require users to keep their microcode up to date to avoid crashes.
