FS#42689 - [glibc] SIGILL after microcode update
Attached to Project:
Arch Linux
Opened by Marius Kleber (MK13) - Wednesday, 05 November 2014, 14:30 GMT
Last edited by Thomas Bächler (brain0) - Sunday, 07 December 2014, 09:16 GMT
Opened by Marius Kleber (MK13) - Wednesday, 05 November 2014, 14:30 GMT
Last edited by Thomas Bächler (brain0) - Sunday, 07 December 2014, 09:16 GMT
|
Details
Description:
After installing the intel-ucode package processes are killed with SIGILL. This seems to be caused by libpthread using TLX instructions that have been disabled by the intel-ucode update. Please refer to https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=762195 for a detailed explanation and a patch that have been confirmed to fix the issue. Steps to reproduce: 1. Install and activate the intel-ucode package 2. Suspend 3. Resume 4. Open a program (e.g. terminal emulator) Steps 2. and 3. may not be necessary, but according to the forum users it only occurs after a suspend/resume cycle Additional info: * package version(s) glibc 2.20-2 linux 3.17.2-1 * config and/or log files etc. Oct 24 18:59:52 brutebox kernel: traps: VirtualBox[31137] trap invalid opcode ip:7faaaecbcf02 sp:7ffffac2d7c8 error:0 in libpthread-2.20.so[7faaaecb1000+17000] Oct 24 18:59:57 brutebox kernel: traps: VirtualBox[31181] trap invalid opcode ip:7ff9461c3aea sp:7fffd673de58 error:0 in libpthread-2.20.so[7ff9461b8000+17000] Oct 24 19:00:02 brutebox kernel: traps: VirtualBox[31225] trap invalid opcode ip:7f92ba2f0f02 sp:7fff1f1e0118 error:0 in libpthread-2.20.so[7f92ba2e5000+17000] Oct 24 19:00:26 brutebox kernel: traps: chromium[31392] trap invalid opcode ip:7f006aea5b63 sp:7f005ed2bca8 error:0 in libpthread-2.20.so[7f006ae94000+17000] Oct 24 19:00:29 brutebox systemd-coredump[31393]: Process 31390 (chromium) of user 1000 dumped core. Oct 24 19:01:01 brutebox kernel: traps: steam[31784] trap invalid opcode ip:f71ff18b sp:ffadc080 error:0 in libpthread-2.20.so[f71ee000+18000] Oct 24 19:01:01 brutebox systemd-coredump[31785]: Process 31784 (steam) of user 1000 dumped core. Oct 24 19:04:27 brutebox kernel: traps: coredumpctl[29427] trap invalid opcode ip:7fad4105bb63 sp:7fff96409408 error:0 in libpthread-2.20.so[7fad4104a000+17000] Oct 24 19:04:28 brutebox systemd-coredump[885]: Process 29427 (coredumpctl) of user 1000 dumped core. Oct 24 19:16:45 brutebox kernel: traps: pacman[5914] trap invalid opcode ip:7fab73665aea sp:7fff5a5f2d18 error:0 in libpthread-2.20.so[7fab7365a000+17000] Oct 24 19:16:45 brutebox systemd-coredump[5915]: Process 5914 (pacman) of user 1000 dumped core. * arch forum thread regarding this issue https://bbs.archlinux.org/viewtopic.php?pid=1472766 |
This task depends upon
1. I rebootet my system.
2a. opened terminal and executed "gnome-control-center" at least 20 times -> No, problems
2b. opened applications -> No problems.
3. I suspended my System and waited 20 seconds - and wake it up again
4a. I opened terminal and executed "gnome-control-center" once -> Error "Illegal instruction (core dumped)"
4b. I tried to start several applications by using the mouse -> applications don't start randomly.
4c. GDM and Gnome background images are totally broken. Possibly, because the related applications crashed after suspend, too.
Other issues:
5a. The printer only works randomly
5b. Keymap settings reset to default
This bug should be fixed soon, because of it's randomness and because it's effecting many programs which use glibc. Not, that we get too many bug reports on "random effects" which are related to this issue, too.
[edit]
I just removed the microcode from syslinux.cfg. The problems 4a and 4b are gone, while 4c is still there.
My laptop updates CPU0 and CPU2 on boot. Somehow, CPU0 is never updated on resume. Instead, my laptop updates CPU1 and CPU2 on boot - but since CPU1 is just the second thread of CPU0, everything works. If CPU1 is a separate core, then CPU0 will run on the old microcode and the others on the updates one.
# dmesg | grep microcode
[ 0.000000] CPU0 microcode updated early to revision 0x1c, date = 2014-07-03
[ 0.093040] CPU1 microcode updated early to revision 0x1c, date = 2014-07-03
[ 0.113650] CPU2 microcode updated early to revision 0x1c, date = 2014-07-03
[ 0.134293] CPU3 microcode updated early to revision 0x1c, date = 2014-07-03
[ 0.345610] microcode: CPU0 sig=0x306c3, pf=0x2, revision=0x1c
[ 0.345615] microcode: CPU1 sig=0x306c3, pf=0x2, revision=0x1c
[ 0.345620] microcode: CPU2 sig=0x306c3, pf=0x2, revision=0x1c
[ 0.345623] microcode: CPU3 sig=0x306c3, pf=0x2, revision=0x1c
[ 0.345654] microcode: Microcode Update Driver: v2.00 <tigran@aivazian.fsnet.co.uk>, Peter Oruba
<SUSPEND>
[ 79.131245] CPU1 microcode updated early to revision 0x1c, date = 2014-07-03
[ 79.145367] CPU2 microcode updated early to revision 0x1c, date = 2014-07-03
[ 79.159411] CPU3 microcode updated early to revision 0x1c, date = 2014-07-03
But:
# cat /sys/devices/system/cpu/cpu*/microcode/version
0x1c
0x1c
0x1c
0x1c
I'd like to provide more details from my experience with this issue, hoping that they are helpful. Processes begun before the first suspend/resume after a reboot remain stable even after repeated suspends. After the first suspend, new proceses I start will sometimes die (always within 10 seconds of starting). The probability that a program will fail when I call it seems to increase with multiple suspends (I'd need to do more testing before being sure). I left some logs in the forum thread MK13 linked. If there's anything more I can do to help, I'd be happy to try.
Unfortuantely the kernel and /proc/cpuinfo aren't reliable here and don't report that HLE/RTM is available again after suspend and so if a process happens to be started on cpu0 and hits the codepath in libpthread SIGILL will be triggered because glibc doesn't rely on the kernel for cpu feature-detection.
Running a recent-ish vesion of cpuid before and after suspend:
~% grep -i hle before-suspend.cpuid
HLE hardware lock elision = false
HLE hardware lock elision = false
HLE hardware lock elision = false
HLE hardware lock elision = false
~% grep -i hle after-suspend.cpuid
HLE hardware lock elision = true
HLE hardware lock elision = false
HLE hardware lock elision = false
HLE hardware lock elision = false
The same for RTM:
--- before-suspend.cpuid 2014-11-06 10:48:11.538239973 +0100
+++ after-suspend.cpuid 2014-11-06 10:52:47.646053883 +0100
@@ -172,3 +172,3 @@
BMI instruction = true
- HLE hardware lock elision = false
+ HLE hardware lock elision = true
AVX2: advanced vector extensions 2 = true
@@ -178,3 +178,3 @@
INVPCID instruction = true
- RTM: restricted transactional memory = false
+ RTM: restricted transactional memory = true
QM: quality of service monitoring = false
I haven't looked at how debian handles it but fedore patched glibc and disabled these on their x86 platforms so they should be safe even if the microcode update isn't working as advertised:
http://pkgs.fedoraproject.org/cgit/glibc.git/commit/?h=f21&id=f4db47775a019670fb3fcfa7b949d4ea190dacca
http://pkgs.fedoraproject.org/cgit/glibc.git/commit/?h=f21&id=f60aa2d62981e1e19947d491189b4a2377e347a1
1) The kernel does not handle microcode updates properly after a suspend
2) glibc's cpu feature-detection is broken
And the workaround done by debian is to disable TLX in glibc by default.
Is this somehow correct?
This tells me that glibc is detecting TLX is available (based on the once CPU?) and using it, but that will fail on 3/4 of the CPU. This is sort of a bug in glibc as it should disable instructions if any CPU does not support them. I'll look into this upstream.
But the main issues is still the resume from suspend losing the microcode update partially.
It seems to me that it may be a firmware-related problem: Your firmware could incorrectly reset the microcode to its built-in version on resume and Linux wouldn't expect it to. I guess the appropriate kernel developers should be informed of this, maybe a microcode update should be forced after resume?
It's from the aur/cpuid package. Here's a PKGBUILD for an updated version that supports these extensions:
https://gist.github.com/namarrgon/49d30b0f4ed65c6212c5
As for reloading the microcode update after resume, I'm not sure how that would work. /sys/devices/system/cpu/microcode/reload didn't seem to affect anything and even if it would work we'd probably run into the same issue as with loading the microcode late in the boot-process, e.g. after the root-switch, which resulted in a kernel-oops iirc.
https://sourceware.org/ml/libc-alpha/2014-11/msg00165.html
"That's quite unusual, the kernel goes to some lengths to ensure that
microcode is reprogrammed on resume from suspend (and also from
hibernate... see syscore_resume() in kernel/power/*.c)
Dropping the default loglevel to debug should get some messages printed
about which cpus got reprogrammed on resume, which would be useful
(also, to find out if it's unique to Intel or AMD, since they're
separate codepaths.)"
I attached the dmesg output for a full boot/suspend/resume cycle with loglevel=8.
Cpu0, being the boot-cpu, is special-cased during suspend/resume and the microcode update reset without the kernel noticing. This becomes obvious when manually offline-ing cpu0. Our kernel does not set 'CONFIG_BOOTPARAM_HOTPLUG_CPU0' but booting with 'cpu0_hotplug' still allows it to be hotplugged in certain cases.
a) going through the usual suspend/resume that results in cpu0 with TSX enabled and remaining cpus with it disabled.
b) bringing cpu0 off- and back online:
'echo 0 > /sys/devices/system/cpu/cpu0/online'
'echo 1 > /sys/devices/system/cpu/cpu0/online'
c) This results in dmseg:
[ 1337.545570] intel_pstate CPU 0 exiting
[ 1337.586204] Broke affinity for irq 24
[ 1337.587231] kvm: disabling virtualization on CPU0
[ 1337.689184] smpboot: CPU 0 is now offline
[ 1341.550302] smpboot: Booting Node 0 Processor 0 APIC 0x0
[ 1341.551039] CPU0 microcode updated early to revision 0x1c, date = 2014-07-03
[ 1341.552361] kvm: enabling virtualization on CPU0
[ 1341.579435] NMI watchdog: enabled on all CPUs, permanently consumes one hw-PMU counter.
d) cpuid reports HLE/RTM disabled on all cpus and none of the SIGILL issues occur after that.
However cpu0 is handled, it doesn't go through the hotplug-routine as the others and the microcode update is never applied on resume.
[ 0.000000] CPU0 microcode updated early to revision 0x1c, date = 2014-07-03
[ 0.091790] CPU1 microcode updated early to revision 0x1c, date = 2014-07-03
[ 0.112423] CPU2 microcode updated early to revision 0x1c, date = 2014-07-03
[ 0.133075] CPU3 microcode updated early to revision 0x1c, date = 2014-07-03
[ 0.310637] microcode: CPU0 sig=0x306c3, pf=0x2, revision=0x1c
[ 0.310642] microcode: CPU1 sig=0x306c3, pf=0x2, revision=0x1c
[ 0.310647] microcode: CPU2 sig=0x306c3, pf=0x2, revision=0x1c
[ 0.310650] microcode: CPU3 sig=0x306c3, pf=0x2, revision=0x1c
[ 0.310679] microcode: Microcode Update Driver: v2.00 <tigran@aivazian.fsnet.co.uk>, Peter Oruba
<SUSPEND>
[ 36.279044] CPU0 microcode updated early to revision 0x1c, date = 2014-07-03
[ 36.290196] CPU1 microcode updated early to revision 0x1c, date = 2014-07-03
[ 36.304266] CPU2 microcode updated early to revision 0x1c, date = 2014-07-03
[ 36.318325] CPU3 microcode updated early to revision 0x1c, date = 2014-07-03
Works for me
HLE hardware lock elision = false
RTM: restricted transactional memory = false
HLE hardware lock elision = false
RTM: restricted transactional memory = false
HLE hardware lock elision = false
RTM: restricted transactional memory = false
HLE hardware lock elision = false
RTM: restricted transactional memory = false
# sudo dmesg | grep microcode
[ 0.000000] CPU0 microcode updated early to revision 0x1c, date = 2014-07-03
[ 0.090279] CPU1 microcode updated early to revision 0x1c, date = 2014-07-03
[ 0.110847] CPU2 microcode updated early to revision 0x1c, date = 2014-07-03
[ 0.131408] CPU3 microcode updated early to revision 0x1c, date = 2014-07-03
[ 0.405957] microcode: CPU0 sig=0x306c3, pf=0x2, revision=0x1c
[ 0.405961] microcode: CPU1 sig=0x306c3, pf=0x2, revision=0x1c
[ 0.405966] microcode: CPU2 sig=0x306c3, pf=0x2, revision=0x1c
[ 0.405970] microcode: CPU3 sig=0x306c3, pf=0x2, revision=0x1c
[ 0.405996] microcode: Microcode Update Driver: v2.00 <tigran@aivazian.fsnet.co.uk>, Peter Oruba
<SUSPEND>
# dmesg | grep microcode
[ 0.000000] CPU0 microcode updated early to revision 0x1c, date = 2014-07-03
[ 0.090279] CPU1 microcode updated early to revision 0x1c, date = 2014-07-03
[ 0.110847] CPU2 microcode updated early to revision 0x1c, date = 2014-07-03
[ 0.131408] CPU3 microcode updated early to revision 0x1c, date = 2014-07-03
[ 0.405957] microcode: CPU0 sig=0x306c3, pf=0x2, revision=0x1c
[ 0.405961] microcode: CPU1 sig=0x306c3, pf=0x2, revision=0x1c
[ 0.405966] microcode: CPU2 sig=0x306c3, pf=0x2, revision=0x1c
[ 0.405970] microcode: CPU3 sig=0x306c3, pf=0x2, revision=0x1c
[ 0.405996] microcode: Microcode Update Driver: v2.00 <tigran@aivazian.fsnet.co.uk>, Peter Oruba
[ 193.434172] CPU0 microcode updated early to revision 0x1c, date = 2014-07-03
[ 193.445316] CPU1 microcode updated early to revision 0x1c, date = 2014-07-03
[ 193.459395] CPU2 microcode updated early to revision 0x1c, date = 2014-07-03
[ 193.473468] CPU3 microcode updated early to revision 0x1c, date = 2014-07-03
Works for me, too :)
https://projects.archlinux.org/svntogit/packages.git/tree/trunk/fix_CPU0_microcode_on_resume.patch?h=packages/linux
is a dirty test version, it brakes 32-bit kernel, it isn't in kernel mainline, and must be replaced by this:
https://bugzilla.kernel.org/show_bug.cgi?id=88391#c20
for 3.17 kernels
https://projects.archlinux.org/svntogit/packages.git/tree/trunk/fix_CPU0_microcode_on_resume.patch?h=packages/linux
is outdated (18 Nov 2014) and wrong again, because it still NOT contain
+#ifdef CONFIG_X86_64
and still brakes 32-bit kernels.
See https://bugzilla.kernel.org/show_bug.cgi?id=88391#c20 since 2014-11-26 from author of all versions of this patch.
https://bugzilla.kernel.org/show_bug.cgi?id=88391#c25
https://bugs.archlinux.org/task/42820#comment130307
>> Borislav Petkov 2014-11-27 12:09:20 UTC
...
> Just to clarify for the arch-linux kernel: the patch in there needs the
> #ifdef CONFIG_X86_64 around it as in comment #20...
[1] https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=02ecc41abcea4ff9291d548f6f846b29b354ddd2