FS#27828 - kernel panic after upgrading from glibc 2.14.1-4 to 2.15-3

Attached to Project: Arch Linux
Opened by Si Feng (danielfeng) - Wednesday, 04 January 2012, 20:10 GMT
Last edited by Allan McRae (Allan) - Thursday, 08 March 2012, 22:00 GMT
Task Type Bug Report
Category Packages: Core
Status Closed
Assigned To Allan McRae (Allan)
Architecture All
Severity Critical
Priority Normal
Reported Version
Due in Version Undecided
Due Date Undecided
Percent Complete 100%
Votes 11
Private No

Details

Description:

Kernel panic after upgrading from glibc 2.14.1-4 to 2.15-3 on x86_64 XenServer PV guest.
Tested multiple times on both kernel26-lts 2.6.32.51-1 and linux 3.1.6-1.
Didn't observe this issue on i686.

Log:

:: Starting full system upgrade...
resolving dependencies...
looking for inter-conflicts...

Targets (1): glibc-2.15-3

Total Download Size: 7.36 MB
Total Installed Size: 36.47 MB

Proceed with installation? [Y/n]
:: Retrieving packages from core...
downloading glibc-2.15-3-x86_64.pkg.tar.xz...
warning: /etc/locale.gen installed as /etc/locale.gen.pacnew
[ 46.703507] ldconfig[487] trap invalid opcode ip:42c775 sp:7fff188ffbf8 error:0 in ldconfig[400000+de000]
/tmp/alpm_nzzWYK/.INSTALL: line 4: 487 Illegal instruction sbin/ldconfig -r .
[ 46.706486] Not activating Mandatory Access Control now since /sbin/tomoyo-init doesn't exist.
INIT: version 2.88 reloading
Generating locales...
en_US.UTF-8... done
en_US.ISO-8859-1... done
Generation complete.
[ 50.326026] ldconfig[588] trap invalid opcode ip:42c775 sp:7fff8e362278 error:0 in ldconfig[400000+de000]

And when rebooting:

[ 0.097463] blkfront: xvda: barriers enabled
[ 0.097681] xvda: xvda1
[ 0.188474] Initialising Xen virtual ethernet driver.
:: Running Hook [udev]
:: Triggering uevents...done.
[ 0.636870] EXT4-fs (xvda1): mounted filesystem with ordered data mode
[ 0.927621] Not activating Mandatory Access Control now since /sbin/tomoyo-init doesn't exist.
INIT: version 2.88 booting
[ 1.001250] init[1] trap invalid opcode ip:7f72a3e9ba3f sp:7fffd6a70578 error:0 in libc-2.15.so[7f72a3d7b000+199000]
[ 1.001553] Kernel panic - not syncing: Attempted to kill init!
[ 1.001569] Pid: 1, comm: init Not tainted 2.6.32.51-1-lts #1
[ 1.001579] Call Trace:
[ 1.001595] [<ffffffff8138ed98>] panic+0x78/0x131
[ 1.001610] [<ffffffff81063fdb>] do_exit+0x71b/0x840
[ 1.001624] [<ffffffff81064465>] do_group_exit+0x45/0xb0
[ 1.001639] [<ffffffff81076d9f>] get_signal_to_deliver+0x1bf/0x390
[ 1.001654] [<ffffffff8100f26f>] ? xen_restore_fl_direct_end+0x0/0x1
[ 1.001669] [<ffffffff8101121f>] do_signal+0x6f/0x7c0
[ 1.001682] [<ffffffff81013885>] ? do_invalid_op+0x95/0xb0
[ 1.001696] [<ffffffff810119e5>] do_notify_resume+0x55/0x70
[ 1.001708] [<ffffffff81012adc>] retint_signal+0x48/0x8c
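
To see where inside the binary the invalid opcode sits, the `ip` value from a "trap invalid opcode" line can be turned into an offset from the mapping base printed in square brackets. This is a minimal Python sketch that assumes the log format shown above; the name `parse_trap_line` is made up for this illustration.

```python
import re

# Matches kernel lines such as:
#   init[1] trap invalid opcode ip:... sp:... error:0 in libc-2.15.so[base+size]
TRAP_RE = re.compile(
    r"(?P<comm>\S+)\[(?P<pid>\d+)\] trap invalid opcode "
    r"ip:(?P<ip>[0-9a-f]+) sp:(?P<sp>[0-9a-f]+) error:(?P<error>[0-9a-f]+) "
    r"in (?P<obj>[^\[\s]+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)


def parse_trap_line(line):
    """Return (object name, offset of ip from the mapping base), or None."""
    m = TRAP_RE.search(line)
    if m is None:
        return None
    ip = int(m.group("ip"), 16)
    base = int(m.group("base"), 16)
    return m.group("obj"), ip - base


# The init crash from the reboot log above.
line = ("[ 1.001250] init[1] trap invalid opcode ip:7f72a3e9ba3f "
        "sp:7fffd6a70578 error:0 in libc-2.15.so[7f72a3d7b000+199000]")
obj, offset = parse_trap_line(line)
print(obj, hex(offset))
```

The resulting offset is only a starting point for a disassembler: whether it equals the address inside the file depends on how the object was mapped, so it should be checked against the actual libc build.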

Closed by  Allan McRae (Allan)
Thursday, 08 March 2012, 22:00 GMT
Reason for closing:  Fixed
Comment by Si Feng (danielfeng) - Wednesday, 04 January 2012, 22:07 GMT
Update: downgrading to 2.14.1-4 restores normal operation. Not sure whether it also happens on a regular installation.
Comment by Si Feng (danielfeng) - Thursday, 05 January 2012, 06:10 GMT
Tested on Arch x86_64 KVM guest, no problem observed.
Comment by Iso (Iso) - Thursday, 05 January 2012, 08:41 GMT
Happened on my VM as well.
Comment by einar (esjurso) - Thursday, 05 January 2012, 11:16 GMT
Does it work with xsave=1?
Comment by Jason William Walton (jasonww) - Thursday, 05 January 2012, 11:51 GMT
Not a bug in glibc as far as I'm concerned, I could be wrong though.

Are these AVX-enabled CPUs? Try a newer Xen (4.1.0+) and play around with the settings, or build your own glibc with multiarch disabled if you can't upgrade.
Comment by Michael Werner (Xaseron) - Thursday, 05 January 2012, 14:52 GMT
I get a kernel panic when I try to boot my dom0.
Comment by Si Feng (danielfeng) - Thursday, 05 January 2012, 17:04 GMT
E3-1230/E3-1240 CPU
XenServer 6.0 (Xen 4.1.1?)
It happens on x86_64 DomU. i686 is fine.
Comment by Gavin (Glinx) - Thursday, 05 January 2012, 20:54 GMT
...may be relevant to glibc-2.15-3 – it stopped the UAE (Amiga) emulator from running. Downgrading to glibc 2.14.1-4 solved the problem.
Comment by brad barden (iamb) - Monday, 09 January 2012, 23:59 GMT
I am running into the same issue on a couple of boxes. It trashes both dom0s and domUs. After installing 2.15-3, lots of stuff (including ldconfig, as in the output above, which the package runs on installation) dies with Illegal instruction. Pacman fails to run too, so it's a bit of a pain to downgrade again, but downgrading to 2.14.1-4 fixes it.

It's not so bad now and not updating glibc is an option, but as packages are being built against it now it means not updating those packages as well. There's already a bug against openssl because the latest package requires glibc 2.15 (and depends doesn't say so).

I'm willing to help if I'm able.
Comment by Allan McRae (Allan) - Tuesday, 10 January 2012, 04:11 GMT
It would be really helpful if someone can post a gdb backtrace of the ldconfig call.
Comment by Si Feng (danielfeng) - Tuesday, 10 January 2012, 04:53 GMT
@brad I am using 2.14.1 and also rebuilt openssl against it. But now I realize that almost all new packages are being built against 2.15, so I cannot upgrade any of them either. Rebuilding every new package from now on is not a real solution.

@Allan How do I get that gdb backtrace?
Comment by Bartłomiej Piotrowski (Barthalion) - Tuesday, 10 January 2012, 16:58 GMT
Same here (x86_64 domU).

Additionally, when I try to chroot into the rootfs, I get an "Illegal instruction" error.
   cpuinfo (2.7 KiB)
Comment by Arvid Hällen (Arvid) - Tuesday, 10 January 2012, 16:58 GMT
I get warnings about pacnew files and broken system, but the system boots (i686).



Comment by Si Feng (danielfeng) - Tuesday, 10 January 2012, 17:23 GMT
@Arvid it's not related to this bug. The warnings are just for newly installed config files, and the "WARNING: /boot appears to be a separate partition but is not mounted. You probably just broke your system. Congratulations." means you probably did not mount your boot partition at /boot before upgrading the kernel. In that case your old kernel is still on the boot partition and the new one has been installed under /boot on the root partition, where it will not be used at boot. I noticed the sentence "You probably just broke your system. Congratulations." has just been removed from the source. You may have put "noauto" in /etc/fstab, but that's not recommended on Arch, as Arch does not mount /boot automatically when upgrading the kernel.
Comment by brad barden (iamb) - Wednesday, 11 January 2012, 03:20 GMT
Allan, no can do I'm afraid:

# gdb
Illegal instruction
Comment by Pierre Bourdon (delroth) - Wednesday, 11 January 2012, 12:35 GMT
@Allan:

Starting program: /usr/sbin/chroot /mnt/install
Executing new program: /mnt/install/bin/bash

Program received signal SIGILL, Illegal instruction.
0x00007ffff74b2a3f in ?? ()
(gdb) bt
#0 0x00007ffff74b2a3f in ?? ()
#1 0x00007ffff7bb7085 in ?? ()
#2 0x0000000000000073 in ?? ()
#3 0x706e692f6374652f in ?? ()
#4 0x00000000006fe3b6 in ?? ()
#5 0x00000000006fe370 in ?? ()
#6 0x0000000000000010 in ?? ()
#7 0x00000000006fe3a6 in ?? ()
#8 0x00007ffff7dda4b0 in ?? ()
#9 0x0000000000000000 in ?? ()
(gdb) x/i $rip
0x7ffff74b2a3f: vmovdqa 0x46979(%rip),%xmm4 # 0x7ffff74f93c0
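
Without symbols, gdb has to guess at the stack layout, and some of the "frames" above look like stack data rather than return addresses. As an aside, frame #3 turns into printable ASCII when read as a little-endian 64-bit word (the x86_64 byte order). A small Python sketch follows; `decode_word` is a hypothetical helper name used only for this illustration.

```python
def decode_word(value, size=8):
    """Read an integer as little-endian bytes and return them as ASCII text,
    or None if any byte is not printable ASCII."""
    raw = value.to_bytes(size, "little")
    if all(0x20 <= b < 0x7F for b in raw):
        return raw.decode("ascii")
    return None


# Frame #3 from the backtrace above.
print(decode_word(0x706e692f6374652f))  # prints "/etc/inp"
```

That looks like the beginning of a file path, which would fit a crash while bash is still reading its startup configuration. This is an inference from the bytes alone, not something the backtrace itself states.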

This happens on Sandy Bridge CPUs (or anything that has AVX support actually) using Xen (PV, not HVM, dom0 or domU). I only tried using Xen 4.0, not 4.1.

From the cpuinfo:

flags : fpu de tsc msr pae mce cx8 apic sep mtrr mca cmov pat clflush acpi mmx fxsr sse sse2 ss ht syscall nx lm constant_tsc rep_good nonstop_tsc aperfmperf pni pclmulqdq est ssse3 cx16 sse4_1 sse4_2 x2apic popcnt aes avx hypervisor lahf_lm ida arat

avx is supported but not xsave, so according to the Intel manuals that means that AVX should not be used. The glibc code from Git seems correct too, but it has changed a lot recently, so maybe this release contains buggy code. If you look at the last commits on sysdeps/x86_64/dl-trampoline.S, the number of times the AVX detection has changed because it was invalid is kind of scary...

commit 08a300c956feeca7ccb9081f88701229da8e25c5
Author: H.J. Lu <hongjiu.lu@intel.com>
Date: Wed Sep 7 21:38:23 2011 -0400

Simplify AVX check

commit 0276a718c0fa58916a6e7c54bad22b4e58bb39b4
Author: Ulrich Drepper <drepper@gmail.com>
Date: Sat Aug 20 08:58:44 2011 -0400

Fix minor CFI problem in regular x86-64 trampoline

commit c88f17668b67d22fe470933ab81119de587ee175
Author: Ulrich Drepper <drepper@gmail.com>
Date: Sat Aug 20 08:56:30 2011 -0400

Fix CFI info in x86-64 trampolines for non-AVX code

commit bba33c289b1b24e1bb3075b7fce5b56c9d01ce2f
Author: Ulrich Drepper <drepper@gmail.com>
Date: Sat Jul 23 15:18:13 2011 -0400

One more typo in AVX test

commit 1aae088a8aa2a4e4211bfe6c0e18100d85f106ae
Author: Ulrich Drepper <drepper@gmail.com>
Date: Fri Jul 22 23:33:22 2011 -0400

One more change to XSAVE patch

commit 1d002f25399c0a0ed2cc276d4ee18db869152384
Author: Andreas Schwab <schwab@redhat.com>
Date: Fri Jul 22 14:33:47 2011 -0400

Fix AVX check

I haven't checked this exact release code to see if it is correct. The current Git version seems fine to me though, maybe backporting this file would fix the problem?
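
The "avx without xsave" condition from the cpuinfo flags above can be checked mechanically. The following Python sketch is only a proxy that works on a `flags` line from /proc/cpuinfo. A real runtime check, as described in the Intel manuals, tests the CPUID OSXSAVE and AVX bits and then uses XGETBV to confirm the OS has enabled XMM and YMM state. The name `avx_usable` is an assumption made for this example.

```python
def avx_usable(flags_line):
    """Return True only if the flags advertise both AVX and XSAVE.

    Simplification: this reads /proc/cpuinfo-style text instead of executing
    CPUID and XGETBV, so it only approximates the real detection sequence.
    """
    value = flags_line.split(":", 1)[-1]
    flags = set(value.split())
    return "avx" in flags and "xsave" in flags


# Flags line quoted in this comment (Xen PV guest on a Sandy Bridge CPU).
XEN_PV_FLAGS = (
    "flags : fpu de tsc msr pae mce cx8 apic sep mtrr mca cmov pat clflush "
    "acpi mmx fxsr sse sse2 ss ht syscall nx lm constant_tsc rep_good "
    "nonstop_tsc aperfmperf pni pclmulqdq est ssse3 cx16 sse4_1 sse4_2 "
    "x2apic popcnt aes avx hypervisor lahf_lm ida arat"
)

print(avx_usable(XEN_PV_FLAGS))  # prints False: avx is listed, xsave is not
```

On a machine where this returns False, any code path that executes VEX-encoded instructions such as vmovdqa can be expected to raise SIGILL, which matches the disassembly at the faulting address.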
Comment by Allan McRae (Allan) - Wednesday, 11 January 2012, 12:48 GMT
The 2.15 release is newer than all those changes. So if the git version is fine, the fix is limited to one of ~40 commits... Could you do the bisect?
Comment by Pierre Bourdon (delroth) - Wednesday, 11 January 2012, 14:14 GMT
I found the bug and made a bug report to the glibc maintainers: http://sourceware.org/bugzilla/show_bug.cgi?id=13583
Comment by Pierre Bourdon (delroth) - Thursday, 12 January 2012, 10:08 GMT
In the meantime, why not compile the Arch Linux glibc without AVX support? This *might* also remove the need for one of the patches that are currently applied (the math64 stuff).

This bug is really critical for a lot of people: upgrade your dom0/domU and your system can't be used anymore, and the upgrade can't be skipped because new packages are compiled for this new glibc version.
Comment by Allan McRae (Allan) - Thursday, 12 January 2012, 10:24 GMT
This should work:
http://dev.archlinux.org/~allan/glibc-2.15-3.1-x86_64.pkg.tar.xz

It is built with "--disable-multi-arch".
Comment by Pierre Bourdon (delroth) - Thursday, 12 January 2012, 10:26 GMT
It does fix the problem, thanks.
Comment by Si Feng (danielfeng) - Thursday, 12 January 2012, 15:16 GMT
It works for now. Thanks.
Comment by Allan McRae (Allan) - Tuesday, 24 January 2012, 05:42 GMT
Can people please test:
http://dev.archlinux.org/~allan/glibc-2.15-3.1-x86_64.pkg.tar.xz

It contains a much more minimal workaround to the issue which would be suitable to put in the repos once I have confirmation it works.
Comment by brad barden (iamb) - Wednesday, 25 January 2012, 16:18 GMT
I was using my own build with --disable-multi-arch, which worked fine. I tested this package and it works as well. Thanks!
Comment by Bartłomiej Piotrowski (Barthalion) - Tuesday, 31 January 2012, 16:05 GMT
glibc-2.15-4 broke Arch again.

I get an "Illegal instruction" error when trying to run any command, and the same kernel panic after a restart.
Comment by Bartłomiej Piotrowski (Barthalion) - Saturday, 04 February 2012, 07:31 GMT
And the same with glibc 2.15-5.
Comment by Allan McRae (Allan) - Saturday, 04 February 2012, 08:16 GMT
I am going to need a backtrace to follow this up. Upstream thinks all AVX issues are fixed...
Comment by Bartłomiej Piotrowski (Barthalion) - Saturday, 04 February 2012, 08:47 GMT
[ 0.511089] i8042: No controller found
[ 0.551798] drivers/rtc/hctosys.c: unable to open rtc device (rtc0)
:: Starting udevd...
done.
:: Running Hook [udev]
:: Triggering uevents...done.
INIT: version 2.88 booting
[ 4.317916] Kernel panic - not syncing: Attempted to kill init!
[ 4.317925] Pid: 1, comm: init Not tainted 3.0.17-1-lts #1
[ 4.317930] Call Trace:
[ 4.317939] [<ffffffff813ed9e7>] panic+0xa0/0x1ad
[ 4.317947] [<ffffffff81007cf9>] ? xen_irq_enable_direct_reloc+0x4/0x4
[ 4.317953] [<ffffffff81060eb3>] do_exit+0x8e3/0x8f0
[ 4.317958] [<ffffffff81061214>] do_group_exit+0x44/0xa0
[ 4.317964] [<ffffffff81072080>] get_signal_to_deliver+0x340/0x510
[ 4.317970] [<ffffffff8100b1af>] do_signal+0x6f/0x780
[ 4.317975] [<ffffffff8100c035>] ? do_invalid_op+0x95/0xb0
[ 4.317980] [<ffffffff8100b945>] do_notify_resume+0x65/0x80
[ 4.317986] [<ffffffff813f6f5c>] retint_signal+0x48/0x8c

And the upgrade output is identical to the one in the bug report.
Comment by Allan McRae (Allan) - Saturday, 04 February 2012, 09:25 GMT
I need a gdb backtrace. To do that:
1) create a chroot with the latest glibc
2) sudo gdb chroot
3) run /path/to/chroot (should crash in bash)
4) bt full
5) disassemble
...

Might need to rebuild glibc without the strip commands at the end of the PKGBUILD for this to be useful.
Comment by Si Feng (danielfeng) - Tuesday, 07 February 2012, 18:26 GMT
glibc-2.15-4 works well on XenServer 6.0 Arch PV guests but glibc-2.15-5 causes kernel panic again.
Comment by Allan McRae (Allan) - Tuesday, 07 February 2012, 19:23 GMT
Please follow the instructions in my comment above: https://bugs.archlinux.org/task/27828#comment88876
Comment by Pierre Bourdon (delroth) - Monday, 13 February 2012, 14:44 GMT
Unfortunately I can confirm that the bug still happens in 2.15-5 :(

Core was generated by `/bin/bash -i'.
Program terminated with signal 4, Illegal instruction.
#0 0x00007f1d125de0ff in __strcasecmp_l_avx () from /mnt/install/lib/libc.so.6
(gdb) bt
#0 0x00007f1d125de0ff in __strcasecmp_l_avx () from /mnt/install/lib/libc.so.6
#1 0x00007f1d12ce2085 in rl_parse_and_bind () from /mnt/install/lib/libreadline.so.6
#2 0x00007f1d12ce2950 in _rl_read_init_file () from /mnt/install/lib/libreadline.so.6
#3 0x00007f1d12cd767a in rl_initialize () from /mnt/install/lib/libreadline.so.6
#4 0x000000000045d305 in initialize_readline ()
#5 0x000000000041957d in ?? ()
#6 0x000000000041b409 in ?? ()
#7 0x000000000041dda6 in ?? ()
#8 0x00000000004207d0 in yyparse ()
#9 0x0000000000418d8a in parse_command ()
#10 0x0000000000418e56 in read_command ()
#11 0x000000000041908f in reader_loop ()
#12 0x00000000004178fb in main ()
(gdb) x/i $rip
0x7f1d125de0ff <__strcasecmp_l_avx+31>: vmovdqa 0x46979(%rip),%xmm4 # 0x7f1d12624a80

I mailed you (Allan) root access to a Xen domU so you can test whether the bug is fixed before releasing the next glibc version, and experiment with it if you have time to do so.
Comment by Allan McRae (Allan) - Tuesday, 14 February 2012, 03:59 GMT
So there is still an issue with strcasecmp despite the closed upstream bug report...

Can someone please start an instance of xen with xsave=1 on the command line so I can have additional evidence on this issue? (Warning, this probably prevents migration...)
Comment by Allan McRae (Allan) - Tuesday, 14 February 2012, 06:56 GMT
Should be fixed with glibc-2.15-6... at least bash no longer crashes.

I would still appreciate someone testing if adding xsave=1 fixes the issue too.
Comment by Pierre Bourdon (delroth) - Tuesday, 14 February 2012, 07:03 GMT
If you mean adding "xsave=1" to the guest config file, it does not fix the problem (still SIGILL).
Comment by Allan McRae (Allan) - Tuesday, 14 February 2012, 07:22 GMT
Hmmm... not entirely sure what I mean as I do not have much to do with Xen... :P

I think I mean either on the "xm" command or adding it to the kernel line in grub.conf.
Comment by Michael Werner (Xaseron) - Tuesday, 14 February 2012, 17:04 GMT
When I boot my dom0 with glibc 2.15-5 and xsave=1, I get a kernel panic.

When I boot my dom0 with glibc 2.15-6, I get "illegal hardware instruction" from some applications, e.g. xend.

Comment by Allan McRae (Allan) - Tuesday, 14 February 2012, 20:48 GMT
Can you give me an example of "some applications" so I can track down what is giving the illegal instruction?
Comment by Allan McRae (Allan) - Tuesday, 21 February 2012, 12:52 GMT
@Xaseron: any chance of an example package that is still broken?

Anyone else still having issues?
Comment by Si Feng (danielfeng) - Thursday, 23 February 2012, 10:38 GMT
2.15-6 works well on domU.
Comment by Michael Werner (Xaseron) - Friday, 24 February 2012, 16:54 GMT
Sorry for the late reply, I was on vacation.
Most graphical applications crash:
[1] 5500 illegal hardware instruction (core dumped) firefox
The same goes for chromium and wesnoth.
After a reboot, X crashes when I try to start it.

vim crashes with:
Vim: Caught deadly signal ILL
Vim: Finished.

Some start without problems:
bash, zsh, xterm, roxterm, skype, pidgin
Comment by Allan McRae (Allan) - Saturday, 25 February 2012, 01:24 GMT
@Xaseron: I'm going to need a gdb backtrace to look into this. Here are instructions:

1) Extract current glibc to /some/path
2) run "gdb --args /some/path/lib/ld-linux-x86-64.so.2 --library-path /some/path/lib /usr/bin/firefox"
3) type the following commands:
run
bt full
disassemble
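
The steps above can also be collected into a single non-interactive gdb invocation. This is a sketch only: it uses gdb's `-batch`, `-ex` and `--args` options, and the helper name `gdb_batch_argv` and the full path `/usr/bin/firefox` are assumptions made for illustration. The script builds and prints the command line without running anything.

```python
import shlex


def gdb_batch_argv(program_argv, commands=("run", "bt full", "disassemble")):
    """Build a gdb command line that runs program_argv, then the given commands.

    Nothing is executed here; the caller decides whether to run the result.
    """
    argv = ["gdb", "-batch"]
    for command in commands:
        argv += ["-ex", command]
    # Everything after --args is the program to debug plus its own arguments.
    return argv + ["--args"] + list(program_argv)


# /some/path is the placeholder used in the instructions above.
loader = "/some/path/lib/ld-linux-x86-64.so.2"
argv = gdb_batch_argv([loader, "--library-path", "/some/path/lib", "/usr/bin/firefox"])
print(" ".join(shlex.quote(arg) for arg in argv))
```

Passing the program through `--args` keeps gdb from trying to interpret `--library-path` as one of its own options.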
Comment by Michael Werner (Xaseron) - Saturday, 25 February 2012, 22:48 GMT
gdb /tmp/debug/lib/ld-linux-x86-64.so.2
GNU gdb (GDB) 7.4
Copyright (C) 2012 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-unknown-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /tmp/debug/lib/ld-linux-x86-64.so.2...(no debugging symbols found)...done.
(gdb) run --library-path /tmp/debug/lib
lib/ lib64/
(gdb) run --library-path /tmp/debug/lib /usr/bin/firefox
Starting program: /tmp/debug/lib/ld-linux-x86-64.so.2 --library-path /tmp/debug/lib /usr/bin/firefox
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/libthread_db.so.1".

Program received signal SIGILL, Illegal instruction.
0x00007ffff7261e82 in ?? () from /tmp/debug/lib/libm.so.6
(gdb) bt full
#0 0x00007ffff7261e82 in ?? () from /tmp/debug/lib/libm.so.6
No symbol table info available.
#1 0x00007ffff5e06d43 in ?? () from /usr/lib/firefox/libxul.so
No symbol table info available.
#2 0x00007ffff5e08d90 in ?? () from /usr/lib/firefox/libxul.so
No symbol table info available.
#3 0x00007ffff5bd6739 in ?? () from /usr/lib/firefox/libxul.so
No symbol table info available.
#4 0x00007ffff5bd682a in ?? () from /usr/lib/firefox/libxul.so
No symbol table info available.
#5 0x00007ffff5bd7281 in ?? () from /usr/lib/firefox/libxul.so
No symbol table info available.
#6 0x0000000000401cda in ?? ()
No symbol table info available.
#7 0x0000000000000000 in ?? ()
No symbol table info available.
(gdb) disassemble
No function contains program counter for selected frame.

I hope it will help.
Comment by Michael Werner (Xaseron) - Thursday, 01 March 2012, 13:14 GMT
I tested it again with glibc-2.15-7 and it works with xsave=1.
I'm able to start my dom0, and firefox etc. work.
But I am unable to start a virtual machine.
It always asks whether xend is running and then aborts.
After that I have an inaccessible, unnamed machine in xm list.
I also tried adding xsave=1 to my domU config, with no success.

When I boot dom0 without xsave=1, I again get "illegal hardware instruction".
Comment by Michael Werner (Xaseron) - Thursday, 08 March 2012, 15:05 GMT
After applying the 3 patches mentioned by aaronfitz (https://aur.archlinux.org/packages.php?ID=14640)
and using xsave=1, everything is working fine :-)
