FS#42689 - [glibc] SIGILL after microcode update

Attached to Project: Arch Linux
Opened by Marius Kleber (MK13) - Wednesday, 05 November 2014, 14:30 GMT
Last edited by Thomas Bächler (brain0) - Sunday, 07 December 2014, 09:16 GMT
Task Type Bug Report
Category Packages: Core
Status Closed
Assigned To Thomas Bächler (brain0)
Allan McRae (Allan)
Architecture All
Severity High
Priority Normal
Reported Version
Due in Version Undecided
Due Date Undecided
Percent Complete 100%
Votes 6
Private No

Details

Description:
After installing the intel-ucode package processes are killed with SIGILL. This seems to be caused by libpthread using TLX instructions that have been disabled by the intel-ucode update. Please refer to https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=762195 for a detailed explanation and a patch that have been confirmed to fix the issue.

Steps to reproduce:
1. Install and activate the intel-ucode package
2. Suspend
3. Resume
4. Open a program (e.g. terminal emulator)
Steps 2. and 3. may not be necessary, but according to the forum users it only occurs after a suspend/resume cycle

Additional info:
* package version(s)
glibc 2.20-2
linux 3.17.2-1

* config and/or log files etc.
Oct 24 18:59:52 brutebox kernel: traps: VirtualBox[31137] trap invalid opcode ip:7faaaecbcf02 sp:7ffffac2d7c8 error:0 in libpthread-2.20.so[7faaaecb1000+17000]
Oct 24 18:59:57 brutebox kernel: traps: VirtualBox[31181] trap invalid opcode ip:7ff9461c3aea sp:7fffd673de58 error:0 in libpthread-2.20.so[7ff9461b8000+17000]
Oct 24 19:00:02 brutebox kernel: traps: VirtualBox[31225] trap invalid opcode ip:7f92ba2f0f02 sp:7fff1f1e0118 error:0 in libpthread-2.20.so[7f92ba2e5000+17000]
Oct 24 19:00:26 brutebox kernel: traps: chromium[31392] trap invalid opcode ip:7f006aea5b63 sp:7f005ed2bca8 error:0 in libpthread-2.20.so[7f006ae94000+17000]
Oct 24 19:00:29 brutebox systemd-coredump[31393]: Process 31390 (chromium) of user 1000 dumped core.
Oct 24 19:01:01 brutebox kernel: traps: steam[31784] trap invalid opcode ip:f71ff18b sp:ffadc080 error:0 in libpthread-2.20.so[f71ee000+18000]
Oct 24 19:01:01 brutebox systemd-coredump[31785]: Process 31784 (steam) of user 1000 dumped core.
Oct 24 19:04:27 brutebox kernel: traps: coredumpctl[29427] trap invalid opcode ip:7fad4105bb63 sp:7fff96409408 error:0 in libpthread-2.20.so[7fad4104a000+17000]
Oct 24 19:04:28 brutebox systemd-coredump[885]: Process 29427 (coredumpctl) of user 1000 dumped core.
Oct 24 19:16:45 brutebox kernel: traps: pacman[5914] trap invalid opcode ip:7fab73665aea sp:7fff5a5f2d18 error:0 in libpthread-2.20.so[7fab7365a000+17000]
Oct 24 19:16:45 brutebox systemd-coredump[5915]: Process 5914 (pacman) of user 1000 dumped core.

* arch forum thread regarding this issue
https://bbs.archlinux.org/viewtopic.php?pid=1472766

This task depends upon

Closed by  Thomas Bächler (brain0)
Sunday, 07 December 2014, 09:16 GMT
Reason for closing:  Fixed
Comment by t-ask (tAsk) - Wednesday, 05 November 2014, 15:02 GMT
I just did the test (with microcode update):

1. I rebootet my system.
2a. opened terminal and executed "gnome-control-center" at least 20 times -> No, problems
2b. opened applications -> No problems.
3. I suspended my System and waited 20 seconds - and wake it up again
4a. I opened terminal and executed "gnome-control-center" once -> Error "Illegal instruction (core dumped)"
4b. I tried to start several applications by using the mouse -> applications don't start randomly.
4c. GDM and Gnome background images are totally broken. Possibly, because the related applications crashed after suspend, too.

Other issues:
5a. The printer only works randomly
5b. Keymap settings reset to default

This bug should be fixed soon, because of it's randomness and because it's effecting many programs which use glibc. Not, that we get too many bug reports on "random effects" which are related to this issue, too.

[edit]
I just removed the microcode from syslinux.cfg. The problems 4a and 4b are gone, while 4c is still there.
Comment by Allan McRae (Allan) - Wednesday, 05 November 2014, 20:26 GMT
@Thomas: what is causing this? Seems the ucode update is being dropped or something...
Comment by Jan de Groot (JGC) - Wednesday, 05 November 2014, 21:02 GMT
Looks like kernel re-applies the ucode update on resume. If some application using libpthread starts in the meantime it will crash and burn as soon as TSX is disabled by microcode.
Comment by Thomas Bächler (brain0) - Wednesday, 05 November 2014, 21:35 GMT
I am not sure, but from the log it seems that the update is not applied on CPU0 on resume. I'll have to test on my laptop.
Comment by Thomas Bächler (brain0) - Wednesday, 05 November 2014, 21:52 GMT
Okay, there IS a problem, but you are lucky if you have hyperthreading.

My laptop updates CPU0 and CPU2 on boot. Somehow, CPU0 is never updated on resume. Instead, my laptop updates CPU1 and CPU2 on boot - but since CPU1 is just the second thread of CPU0, everything works. If CPU1 is a separate core, then CPU0 will run on the old microcode and the others on the updates one.
Comment by Thomas Bächler (brain0) - Wednesday, 05 November 2014, 22:22 GMT
Okay, I cannot reproduce the problem on my non-HT machine either:

# dmesg | grep microcode
[ 0.000000] CPU0 microcode updated early to revision 0x1c, date = 2014-07-03
[ 0.093040] CPU1 microcode updated early to revision 0x1c, date = 2014-07-03
[ 0.113650] CPU2 microcode updated early to revision 0x1c, date = 2014-07-03
[ 0.134293] CPU3 microcode updated early to revision 0x1c, date = 2014-07-03
[ 0.345610] microcode: CPU0 sig=0x306c3, pf=0x2, revision=0x1c
[ 0.345615] microcode: CPU1 sig=0x306c3, pf=0x2, revision=0x1c
[ 0.345620] microcode: CPU2 sig=0x306c3, pf=0x2, revision=0x1c
[ 0.345623] microcode: CPU3 sig=0x306c3, pf=0x2, revision=0x1c
[ 0.345654] microcode: Microcode Update Driver: v2.00 <tigran@aivazian.fsnet.co.uk>, Peter Oruba
<SUSPEND>
[ 79.131245] CPU1 microcode updated early to revision 0x1c, date = 2014-07-03
[ 79.145367] CPU2 microcode updated early to revision 0x1c, date = 2014-07-03
[ 79.159411] CPU3 microcode updated early to revision 0x1c, date = 2014-07-03

But:

# cat /sys/devices/system/cpu/cpu*/microcode/version
0x1c
0x1c
0x1c
0x1c
Comment by Garry (InvalidInterrupt) - Thursday, 06 November 2014, 04:40 GMT
The document at https://www.kernel.org/doc/Documentation/power/suspend-and-cpuhotplug.txt makes it seem like re-applying microcode updates after resuming from RAM isn't strictly necessary.


I'd like to provide more details from my experience with this issue, hoping that they are helpful. Processes begun before the first suspend/resume after a reboot remain stable even after repeated suspends. After the first suspend, new proceses I start will sometimes die (always within 10 seconds of starting). The probability that a program will fail when I call it seems to increase with multiple suspends (I'd need to do more testing before being sure). I left some logs in the forum thread MK13 linked. If there's anything more I can do to help, I'd be happy to try.
Comment by Alexander Schnaidt (Namarrgon) - Thursday, 06 November 2014, 10:27 GMT
Even though, as Thomas shows, /sys/devices/system/cpu/cpu*/microcode/version reports the expected version, the microcode for cpu0 is either reverted or not reapplied after suspend.

Unfortuantely the kernel and /proc/cpuinfo aren't reliable here and don't report that HLE/RTM is available again after suspend and so if a process happens to be started on cpu0 and hits the codepath in libpthread SIGILL will be triggered because glibc doesn't rely on the kernel for cpu feature-detection.

Running a recent-ish vesion of cpuid before and after suspend:
~% grep -i hle before-suspend.cpuid
HLE hardware lock elision = false
HLE hardware lock elision = false
HLE hardware lock elision = false
HLE hardware lock elision = false
~% grep -i hle after-suspend.cpuid
HLE hardware lock elision = true
HLE hardware lock elision = false
HLE hardware lock elision = false
HLE hardware lock elision = false

The same for RTM:
--- before-suspend.cpuid 2014-11-06 10:48:11.538239973 +0100
+++ after-suspend.cpuid 2014-11-06 10:52:47.646053883 +0100
@@ -172,3 +172,3 @@
BMI instruction = true
- HLE hardware lock elision = false
+ HLE hardware lock elision = true
AVX2: advanced vector extensions 2 = true
@@ -178,3 +178,3 @@
INVPCID instruction = true
- RTM: restricted transactional memory = false
+ RTM: restricted transactional memory = true
QM: quality of service monitoring = false


I haven't looked at how debian handles it but fedore patched glibc and disabled these on their x86 platforms so they should be safe even if the microcode update isn't working as advertised:

http://pkgs.fedoraproject.org/cgit/glibc.git/commit/?h=f21&id=f4db47775a019670fb3fcfa7b949d4ea190dacca
http://pkgs.fedoraproject.org/cgit/glibc.git/commit/?h=f21&id=f60aa2d62981e1e19947d491189b4a2377e347a1
Comment by Marius Kleber (MK13) - Thursday, 06 November 2014, 11:42 GMT
As I understand we have actually two issues:
1) The kernel does not handle microcode updates properly after a suspend
2) glibc's cpu feature-detection is broken

And the workaround done by debian is to disable TLX in glibc by default.

Is this somehow correct?
Comment by Allan McRae (Allan) - Thursday, 06 November 2014, 12:13 GMT
So... the fact that crashes are happening so quickly indicates it is not the TLX issue that resulted in these instructions being disabled by the microcode update that is the issue as those crashes were very, very, very rare. Here, the CPU recognizes the illegal instruction.

This tells me that glibc is detecting TLX is available (based on the once CPU?) and using it, but that will fail on 3/4 of the CPU. This is sort of a bug in glibc as it should disable instructions if any CPU does not support them. I'll look into this upstream.

But the main issues is still the resume from suspend losing the microcode update partially.

Comment by Thomas Bächler (brain0) - Thursday, 06 November 2014, 17:49 GMT
Alexander, how do you get this cpuid output? I'd like to test this on my machine as well.

It seems to me that it may be a firmware-related problem: Your firmware could incorrectly reset the microcode to its built-in version on resume and Linux wouldn't expect it to. I guess the appropriate kernel developers should be informed of this, maybe a microcode update should be forced after resume?
Comment by Alexander Schnaidt (Namarrgon) - Thursday, 06 November 2014, 18:08 GMT
@Thomas
It's from the aur/cpuid package. Here's a PKGBUILD for an updated version that supports these extensions:

https://gist.github.com/namarrgon/49d30b0f4ed65c6212c5

As for reloading the microcode update after resume, I'm not sure how that would work. /sys/devices/system/cpu/microcode/reload didn't seem to affect anything and even if it would work we'd probably run into the same issue as with loading the microcode late in the boot-process, e.g. after the root-switch, which resulted in a kernel-oops iirc.
Comment by Allan McRae (Allan) - Thursday, 06 November 2014, 23:15 GMT
I queried upstream glibc. /proc/cpuinfo has cached information, so is reporting the wrong information after suspend. glibc/cpuid is reporting the correct state.
Comment by Allan McRae (Allan) - Sunday, 09 November 2014, 21:59 GMT
Can someone get this information?

https://sourceware.org/ml/libc-alpha/2014-11/msg00165.html

"That's quite unusual, the kernel goes to some lengths to ensure that
microcode is reprogrammed on resume from suspend (and also from
hibernate... see syscore_resume() in kernel/power/*.c)

Dropping the default loglevel to debug should get some messages printed
about which cpus got reprogrammed on resume, which would be useful
(also, to find out if it's unique to Intel or AMD, since they're
separate codepaths.)"
Comment by Alexander Schnaidt (Namarrgon) - Monday, 10 November 2014, 14:29 GMT
@Allan
I attached the dmesg output for a full boot/suspend/resume cycle with loglevel=8.

Cpu0, being the boot-cpu, is special-cased during suspend/resume and the microcode update reset without the kernel noticing. This becomes obvious when manually offline-ing cpu0. Our kernel does not set 'CONFIG_BOOTPARAM_HOTPLUG_CPU0' but booting with 'cpu0_hotplug' still allows it to be hotplugged in certain cases.

a) going through the usual suspend/resume that results in cpu0 with TSX enabled and remaining cpus with it disabled.
b) bringing cpu0 off- and back online:
'echo 0 > /sys/devices/system/cpu/cpu0/online'
'echo 1 > /sys/devices/system/cpu/cpu0/online'
c) This results in dmseg:

[ 1337.545570] intel_pstate CPU 0 exiting
[ 1337.586204] Broke affinity for irq 24
[ 1337.587231] kvm: disabling virtualization on CPU0
[ 1337.689184] smpboot: CPU 0 is now offline
[ 1341.550302] smpboot: Booting Node 0 Processor 0 APIC 0x0
[ 1341.551039] CPU0 microcode updated early to revision 0x1c, date = 2014-07-03
[ 1341.552361] kvm: enabling virtualization on CPU0
[ 1341.579435] NMI watchdog: enabled on all CPUs, permanently consumes one hw-PMU counter.

d) cpuid reports HLE/RTM disabled on all cpus and none of the SIGILL issues occur after that.

However cpu0 is handled, it doesn't go through the hotplug-routine as the others and the microcode update is never applied on resume.
Comment by Thomas Bächler (brain0) - Monday, 10 November 2014, 18:41 GMT
Namarrgon, can you please bring this up on the kernel bugzilla? You already have lots of very useful information so we can probably get a fix quickly.
Comment by Alexander Schnaidt (Namarrgon) - Monday, 10 November 2014, 20:45 GMT Comment by Alexander Schnaidt (Namarrgon) - Monday, 10 November 2014, 21:03 GMT
Sorry, the browser crashed and double-posted...
Comment by t-ask (tAsk) - Wednesday, 12 November 2014, 10:57 GMT
Testing microcode with another Haswell system (Intel 3337U CPU) with Hyper Threading everything works well after suspend. On the other system without Hyper Threading (HT) I'm experiencing the described (missing CPU0) bug as written in my first comment.
Comment by Thomas Bächler (brain0) - Friday, 14 November 2014, 21:34 GMT
So, we have a patch for the issue here attached to the upstream report.
Comment by Thomas Bächler (brain0) - Friday, 14 November 2014, 22:26 GMT
Please test with linux 3.17.3-1 from testing. Thanks.
Comment by Marius Kleber (MK13) - Saturday, 15 November 2014, 00:50 GMT
# sudo dmesg | grep microcode
[ 0.000000] CPU0 microcode updated early to revision 0x1c, date = 2014-07-03
[ 0.091790] CPU1 microcode updated early to revision 0x1c, date = 2014-07-03
[ 0.112423] CPU2 microcode updated early to revision 0x1c, date = 2014-07-03
[ 0.133075] CPU3 microcode updated early to revision 0x1c, date = 2014-07-03
[ 0.310637] microcode: CPU0 sig=0x306c3, pf=0x2, revision=0x1c
[ 0.310642] microcode: CPU1 sig=0x306c3, pf=0x2, revision=0x1c
[ 0.310647] microcode: CPU2 sig=0x306c3, pf=0x2, revision=0x1c
[ 0.310650] microcode: CPU3 sig=0x306c3, pf=0x2, revision=0x1c
[ 0.310679] microcode: Microcode Update Driver: v2.00 <tigran@aivazian.fsnet.co.uk>, Peter Oruba
<SUSPEND>
[ 36.279044] CPU0 microcode updated early to revision 0x1c, date = 2014-07-03
[ 36.290196] CPU1 microcode updated early to revision 0x1c, date = 2014-07-03
[ 36.304266] CPU2 microcode updated early to revision 0x1c, date = 2014-07-03
[ 36.318325] CPU3 microcode updated early to revision 0x1c, date = 2014-07-03

Works for me
Comment by t-ask (tAsk) - Sunday, 16 November 2014, 17:37 GMT
# cpuid | egrep '(HLE|RTM)'
HLE hardware lock elision = false
RTM: restricted transactional memory = false
HLE hardware lock elision = false
RTM: restricted transactional memory = false
HLE hardware lock elision = false
RTM: restricted transactional memory = false
HLE hardware lock elision = false
RTM: restricted transactional memory = false

# sudo dmesg | grep microcode
[ 0.000000] CPU0 microcode updated early to revision 0x1c, date = 2014-07-03
[ 0.090279] CPU1 microcode updated early to revision 0x1c, date = 2014-07-03
[ 0.110847] CPU2 microcode updated early to revision 0x1c, date = 2014-07-03
[ 0.131408] CPU3 microcode updated early to revision 0x1c, date = 2014-07-03
[ 0.405957] microcode: CPU0 sig=0x306c3, pf=0x2, revision=0x1c
[ 0.405961] microcode: CPU1 sig=0x306c3, pf=0x2, revision=0x1c
[ 0.405966] microcode: CPU2 sig=0x306c3, pf=0x2, revision=0x1c
[ 0.405970] microcode: CPU3 sig=0x306c3, pf=0x2, revision=0x1c
[ 0.405996] microcode: Microcode Update Driver: v2.00 <tigran@aivazian.fsnet.co.uk>, Peter Oruba
<SUSPEND>
# dmesg | grep microcode
[ 0.000000] CPU0 microcode updated early to revision 0x1c, date = 2014-07-03
[ 0.090279] CPU1 microcode updated early to revision 0x1c, date = 2014-07-03
[ 0.110847] CPU2 microcode updated early to revision 0x1c, date = 2014-07-03
[ 0.131408] CPU3 microcode updated early to revision 0x1c, date = 2014-07-03
[ 0.405957] microcode: CPU0 sig=0x306c3, pf=0x2, revision=0x1c
[ 0.405961] microcode: CPU1 sig=0x306c3, pf=0x2, revision=0x1c
[ 0.405966] microcode: CPU2 sig=0x306c3, pf=0x2, revision=0x1c
[ 0.405970] microcode: CPU3 sig=0x306c3, pf=0x2, revision=0x1c
[ 0.405996] microcode: Microcode Update Driver: v2.00 <tigran@aivazian.fsnet.co.uk>, Peter Oruba
[ 193.434172] CPU0 microcode updated early to revision 0x1c, date = 2014-07-03
[ 193.445316] CPU1 microcode updated early to revision 0x1c, date = 2014-07-03
[ 193.459395] CPU2 microcode updated early to revision 0x1c, date = 2014-07-03
[ 193.473468] CPU3 microcode updated early to revision 0x1c, date = 2014-07-03

Works for me, too :)
Comment by Natrio (natrio) - Friday, 05 December 2014, 23:26 GMT
  • Field changed: Percent Complete (100% → 0%)
The patch
https://projects.archlinux.org/svntogit/packages.git/tree/trunk/fix_CPU0_microcode_on_resume.patch?h=packages/linux
is a dirty test version, it brakes 32-bit kernel, it isn't in kernel mainline, and must be replaced by this:
https://bugzilla.kernel.org/show_bug.cgi?id=88391#c20
for 3.17 kernels
Comment by Natrio (natrio) - Sunday, 07 December 2014, 06:07 GMT
  • Field changed: Percent Complete (100% → 0%)
This patch
https://projects.archlinux.org/svntogit/packages.git/tree/trunk/fix_CPU0_microcode_on_resume.patch?h=packages/linux
is outdated (18 Nov 2014) and wrong again, because it still NOT contain
+#ifdef CONFIG_X86_64
and still brakes 32-bit kernels.
See https://bugzilla.kernel.org/show_bug.cgi?id=88391#c20 since 2014-11-26 from author of all versions of this patch.
Comment by Natrio (natrio) - Sunday, 07 December 2014, 07:08 GMT
See also
https://bugzilla.kernel.org/show_bug.cgi?id=88391#c25
https://bugs.archlinux.org/task/42820#comment130307
>> Borislav Petkov 2014-11-27 12:09:20 UTC
...
> Just to clarify for the arch-linux kernel: the patch in there needs the
> #ifdef CONFIG_X86_64 around it as in comment #20...
Comment by Thomas Bächler (brain0) - Sunday, 07 December 2014, 09:16 GMT
The upstream fix that disables this on 32 bit is 02ecc41abcea4ff9291d548f6f846b29b354ddd2 [1].

[1] https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=02ecc41abcea4ff9291d548f6f846b29b354ddd2

Loading...