FS#27828 - kernel panic after upgrading from glibc 2.14.1-4 to 2.15-3
Attached to Project:
Arch Linux
Opened by Si Feng (danielfeng) - Wednesday, 04 January 2012, 20:10 GMT
Last edited by Allan McRae (Allan) - Thursday, 08 March 2012, 22:00 GMT
Details
Description:
Kernel panic after upgrading from glibc 2.14.1-4 to 2.15-3 on an x86_64 XenServer PV guest. Tested multiple times on both kernel26-lts 2.6.32.51-1 and linux 3.1.6-1. I did not observe this issue on i686.

Log:
:: Starting full system upgrade...
resolving dependencies...
looking for inter-conflicts...
Targets (1): glibc-2.15-3
Total Download Size: 7.36 MB
Total Installed Size: 36.47 MB
Proceed with installation? [Y/n]
:: Retrieving packages from core...
downloading glibc-2.15-3-x86_64.pkg.tar.xz...
warning: /etc/locale.gen installed as /etc/locale.gen.pacnew
[ 46.703507] ldconfig[487] trap invalid opcode ip:42c775 sp:7fff188ffbf8 error:0 in ldconfig[400000+de000]
/tmp/alpm_nzzWYK/.INSTALL: line 4: 487 Illegal instruction sbin/ldconfig -r .
[ 46.706486] Not activating Mandatory Access Control now since /sbin/tomoyo-init doesn't exist.
INIT: version 2.88 reloading
Generating locales...
en_US.UTF-8... done
en_US.ISO-8859-1... done
Generation complete.
[ 50.326026] ldconfig[588] trap invalid opcode ip:42c775 sp:7fff8e362278 error:0 in ldconfig[400000+de000]

And when rebooting:
[ 0.097463] blkfront: xvda: barriers enabled
[ 0.097681] xvda: xvda1
[ 0.188474] Initialising Xen virtual ethernet driver.
:: Running Hook [udev]
:: Triggering uevents...done.
[ 0.636870] EXT4-fs (xvda1): mounted filesystem with ordered data mode
[ 0.927621] Not activating Mandatory Access Control now since /sbin/tomoyo-init doesn't exist.
INIT: version 2.88 booting
[ 1.001250] init[1] trap invalid opcode ip:7f72a3e9ba3f sp:7fffd6a70578 error:0 in libc-2.15.so[7f72a3d7b000+199000]
[ 1.001553] Kernel panic - not syncing: Attempted to kill init!
[ 1.001569] Pid: 1, comm: init Not tainted 2.6.32.51-1-lts #1
[ 1.001579] Call Trace:
[ 1.001595] [<ffffffff8138ed98>] panic+0x78/0x131
[ 1.001610] [<ffffffff81063fdb>] do_exit+0x71b/0x840
[ 1.001624] [<ffffffff81064465>] do_group_exit+0x45/0xb0
[ 1.001639] [<ffffffff81076d9f>] get_signal_to_deliver+0x1bf/0x390
[ 1.001654] [<ffffffff8100f26f>] ? xen_restore_fl_direct_end+0x0/0x1
[ 1.001669] [<ffffffff8101121f>] do_signal+0x6f/0x7c0
[ 1.001682] [<ffffffff81013885>] ? do_invalid_op+0x95/0xb0
[ 1.001696] [<ffffffff810119e5>] do_notify_resume+0x55/0x70
[ 1.001708] [<ffffffff81012adc>] retint_signal+0x48/0x8c
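For anyone who wants to locate the faulting instruction inside the library, the "trap invalid opcode" lines in the log carry enough information to compute an offset. This is a rough sketch, not an official tool: the regular expression and the function name `fault_offset` are my own, and it assumes the reported mapping starts at file offset 0 of the object, which may not hold for every mapping.

```python
import re

# Pattern for the kernel's "trap invalid opcode" message as it appears in the
# log above. The pattern itself is an assumption derived from those lines.
TRAP_RE = re.compile(
    r"(?P<comm>\S+)\[(?P<pid>\d+)\] trap invalid opcode "
    r"ip:(?P<ip>[0-9a-f]+) sp:(?P<sp>[0-9a-f]+) error:\d+ "
    r"in (?P<obj>[^\[\s]+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)


def fault_offset(line):
    """Return (object name, faulting ip minus the mapping base address)."""
    m = TRAP_RE.search(line)
    if m is None:
        raise ValueError("not a 'trap invalid opcode' line")
    ip = int(m.group("ip"), 16)
    base = int(m.group("base"), 16)
    size = int(m.group("size"), 16)
    # Sanity check: the instruction pointer must lie inside the mapping.
    if not base <= ip < base + size:
        raise ValueError("ip lies outside the reported mapping")
    return m.group("obj"), ip - base


if __name__ == "__main__":
    line = ("[ 1.001250] init[1] trap invalid opcode ip:7f72a3e9ba3f "
            "sp:7fffd6a70578 error:0 in libc-2.15.so[7f72a3d7b000+199000]")
    obj, off = fault_offset(line)
    print(obj, hex(off))
```

The resulting offset can then be looked up in a disassembly of the named object from the same package build.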
Are these AVX-enabled CPUs? Try a newer Xen (4.1.0+) and play around with the settings, or build your own glibc with multi-arch disabled if you can't upgrade.
XenServer 6.0 (Xen 4.1.1?)
It happens on x86_64 DomU. i686 is fine.
It's not so bad for now, and not updating glibc is an option, but since packages are now being built against it, that means not updating those packages either. There's already a bug against openssl because the latest package requires glibc 2.15 (and its depends doesn't say so).
I'm willing to help if I'm able.
@Allan How do I get that gdb backtrace?
Additionally, when I try to chroot into the rootfs, I get an "Illegal instruction" error.
# gdb
Illegal instruction
Starting program: /usr/sbin/chroot /mnt/install
Executing new program: /mnt/install/bin/bash
Program received signal SIGILL, Illegal instruction.
0x00007ffff74b2a3f in ?? ()
(gdb) bt
#0 0x00007ffff74b2a3f in ?? ()
#1 0x00007ffff7bb7085 in ?? ()
#2 0x0000000000000073 in ?? ()
#3 0x706e692f6374652f in ?? ()
#4 0x00000000006fe3b6 in ?? ()
#5 0x00000000006fe370 in ?? ()
#6 0x0000000000000010 in ?? ()
#7 0x00000000006fe3a6 in ?? ()
#8 0x00007ffff7dda4b0 in ?? ()
#9 0x0000000000000000 in ?? ()
(gdb) x/i $rip
0x7ffff74b2a3f: vmovdqa 0x46979(%rip),%xmm4 # 0x7ffff74f93c0
This happens on Sandy Bridge CPUs (or anything that has AVX support actually) using Xen (PV, not HVM, dom0 or domU). I only tried using Xen 4.0, not 4.1.
From the cpuinfo:
flags : fpu de tsc msr pae mce cx8 apic sep mtrr mca cmov pat clflush acpi mmx fxsr sse sse2 ss ht syscall nx lm constant_tsc rep_good nonstop_tsc aperfmperf pni pclmulqdq est ssse3 cx16 sse4_1 sse4_2 x2apic popcnt aes avx hypervisor lahf_lm ida arat
avx is supported but xsave is not, so according to the Intel manuals AVX should not be used. The glibc code in Git seems correct too, but it has changed a lot recently, so maybe this release contains buggy code. If you look at the last commits on sysdeps/x86_64/dl-trampoline.S, the number of times the AVX detection has been changed because it was invalid is kind of scary...
commit 08a300c956feeca7ccb9081f88701229da8e25c5
Author: H.J. Lu <hongjiu.lu@intel.com>
Date: Wed Sep 7 21:38:23 2011 -0400
Simplify AVX check
commit 0276a718c0fa58916a6e7c54bad22b4e58bb39b4
Author: Ulrich Drepper <drepper@gmail.com>
Date: Sat Aug 20 08:58:44 2011 -0400
Fix minor CFI problem in regular x86-64 trampoline
commit c88f17668b67d22fe470933ab81119de587ee175
Author: Ulrich Drepper <drepper@gmail.com>
Date: Sat Aug 20 08:56:30 2011 -0400
Fix CFI info in x86-64 trampolines for non-AVX code
commit bba33c289b1b24e1bb3075b7fce5b56c9d01ce2f
Author: Ulrich Drepper <drepper@gmail.com>
Date: Sat Jul 23 15:18:13 2011 -0400
One more typo in AVX test
commit 1aae088a8aa2a4e4211bfe6c0e18100d85f106ae
Author: Ulrich Drepper <drepper@gmail.com>
Date: Fri Jul 22 23:33:22 2011 -0400
One more change to XSAVE patch
commit 1d002f25399c0a0ed2cc276d4ee18db869152384
Author: Andreas Schwab <schwab@redhat.com>
Date: Fri Jul 22 14:33:47 2011 -0400
Fix AVX check
I haven't checked the code of this exact release to see if it is correct. The current Git version seems fine to me, though, so maybe backporting this file would fix the problem?
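To make the rule quoted from the Intel manuals concrete, here is a small sketch that applies it to a /proc/cpuinfo flags line. It is only a coarse userland approximation: as far as I understand the manuals, the authoritative check reads CPUID.1:ECX.OSXSAVE and then uses XGETBV to confirm the OS has enabled XMM and YMM state in XCR0. The function name `avx_usable` is made up for illustration.

```python
def avx_usable(flags_line):
    """Coarse check: treat AVX as usable only if xsave is advertised too.

    Accepts either a bare flag list or a full "flags : ..." line as printed
    in /proc/cpuinfo. This does not replace the CPUID/XGETBV check.
    """
    # Drop an optional "flags :" prefix; flag names never contain a colon.
    _, _, rest = flags_line.rpartition(":")
    flags = set(rest.split())
    return "avx" in flags and "xsave" in flags


if __name__ == "__main__":
    # Flags line as quoted in the cpuinfo output above (avx present, no xsave).
    xen_pv = ("flags : fpu de tsc msr pae mce cx8 apic sep mtrr mca cmov pat "
              "clflush acpi mmx fxsr sse sse2 ss ht syscall nx lm constant_tsc "
              "rep_good nonstop_tsc aperfmperf pni pclmulqdq est ssse3 cx16 "
              "sse4_1 sse4_2 x2apic popcnt aes avx hypervisor lahf_lm ida arat")
    print(avx_usable(xen_pv))
```

On the flags quoted above the check rejects AVX, which matches the observation that executing `vmovdqa` with an AVX encoding traps on these guests.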
This bug is really critical for a lot of people: upgrade your dom0/domU and your system can't be used anymore, and the upgrade can't be skipped because new packages are compiled for this new glibc version.
http://dev.archlinux.org/~allan/glibc-2.15-3.1-x86_64.pkg.tar.xz
It is built with "--disable-multi-arch".
http://dev.archlinux.org/~allan/glibc-2.15-3.1-x86_64.pkg.tar.xz
It contains a much more minimal workaround to the issue which would be suitable to put in the repos once I have confirmation it works.
I get an "Illegal instruction" error when trying to run any command, and the same kernel panic after a restart.
[ 0.551798] drivers/rtc/hctosys.c: unable to open rtc device (rtc0)
:: Starting udevd...
done.
:: Running Hook [udev]
:: Triggering uevents...done.
INIT: version 2.88 booting
[ 4.317916] Kernel panic - not syncing: Attempted to kill init!
[ 4.317925] Pid: 1, comm: init Not tainted 3.0.17-1-lts #1
[ 4.317930] Call Trace:
[ 4.317939] [<ffffffff813ed9e7>] panic+0xa0/0x1ad
[ 4.317947] [<ffffffff81007cf9>] ? xen_irq_enable_direct_reloc+0x4/0x4
[ 4.317953] [<ffffffff81060eb3>] do_exit+0x8e3/0x8f0
[ 4.317958] [<ffffffff81061214>] do_group_exit+0x44/0xa0
[ 4.317964] [<ffffffff81072080>] get_signal_to_deliver+0x340/0x510
[ 4.317970] [<ffffffff8100b1af>] do_signal+0x6f/0x780
[ 4.317975] [<ffffffff8100c035>] ? do_invalid_op+0x95/0xb0
[ 4.317980] [<ffffffff8100b945>] do_notify_resume+0x65/0x80
[ 4.317986] [<ffffffff813f6f5c>] retint_signal+0x48/0x8c
The upgrade output is identical to the one in the bug report.
1) create a chroot with the latest glibc
2) sudo gdb chroot
3) run /path/to/chroot (should crash in bash)
4) bt full
5) disassemble
...
Might need to rebuild glibc without the strip commands at the end of the PKGBUILD for this to be useful.
Core was generated by `/bin/bash -i'.
Program terminated with signal 4, Illegal instruction.
#0 0x00007f1d125de0ff in __strcasecmp_l_avx () from /mnt/install/lib/libc.so.6
(gdb) bt
#0 0x00007f1d125de0ff in __strcasecmp_l_avx () from /mnt/install/lib/libc.so.6
#1 0x00007f1d12ce2085 in rl_parse_and_bind () from /mnt/install/lib/libreadline.so.6
#2 0x00007f1d12ce2950 in _rl_read_init_file () from /mnt/install/lib/libreadline.so.6
#3 0x00007f1d12cd767a in rl_initialize () from /mnt/install/lib/libreadline.so.6
#4 0x000000000045d305 in initialize_readline ()
#5 0x000000000041957d in ?? ()
#6 0x000000000041b409 in ?? ()
#7 0x000000000041dda6 in ?? ()
#8 0x00000000004207d0 in yyparse ()
#9 0x0000000000418d8a in parse_command ()
#10 0x0000000000418e56 in read_command ()
#11 0x000000000041908f in reader_loop ()
#12 0x00000000004178fb in main ()
(gdb) x/i $rip
0x7f1d125de0ff <__strcasecmp_l_avx+31>: vmovdqa 0x46979(%rip),%xmm4 # 0x7f1d12624a80
I mailed you (Allan) root access to a Xen domU so you can test if the bug is fixed before releasing next glibc version and experiment with it if you have time to do so.
Can someone please start an instance of xen with xsave=1 on the command line so I can have additional evidence on this issue? (Warning, this probably prevents migration...)
I would still appreciate someone testing if adding xsave=1 fixes the issue too.
I think I mean either on the "xm" command or by adding it to the kernel line in grub.conf.
When I boot my dom0 with glibc 2.15-6, I get "illegal hardware instruction" in some applications, e.g. xend.
Anyone else still having issues?
Most graphical applications crash:
[1] 5500 illegal hardware instruction (core dumped) firefox
The same happens with chromium and wesnoth.
After a reboot, X crashes when I try to start it.
vim crashes with:
Vim: Caught deadly signal ILL
Vim: Finished.
Some start without problems:
bash, zsh, xterm, roxterm, skype, pidgin
1) Extract current glibc to /some/path
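2) run "gdb --args /some/path/lib/ld-linux-x86-64.so.2 --library-path /some/path/lib /usr/bin/firefox" (without --args, gdb tries to parse --library-path itself; give the program by its full path)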
3) type the following commands:
run
bt full
disassemble
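The steps above can also be run non-interactively. The following sketch builds a gdb command line using gdb's `-batch`, `-ex` and `--args` options; the helper name `gdb_batch_argv` and the example paths are placeholders I picked, and the glibc directory layout is assumed to be the one from step 1.

```python
import shlex


def gdb_batch_argv(glibc_root, program,
                   gdb_cmds=("run", "bt full", "disassemble")):
    """Build an argv that runs `program` under the extracted glibc in gdb."""
    loader = f"{glibc_root}/lib/ld-linux-x86-64.so.2"
    argv = ["gdb", "-batch"]
    # Each -ex adds one gdb command, executed in order after startup.
    for cmd in gdb_cmds:
        argv += ["-ex", cmd]
    # Everything after --args is the debugged program and its arguments,
    # so --library-path goes to the loader instead of being parsed by gdb.
    argv += ["--args", loader, "--library-path", f"{glibc_root}/lib", program]
    return argv


if __name__ == "__main__":
    # Print a shell-quoted command line; pass the list to subprocess.run
    # instead if you want to execute it directly.
    print(shlex.join(gdb_batch_argv("/some/path", "/usr/bin/firefox")))
```

Redirecting the output of the resulting command to a file gives something that can be attached to this report as is.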
GNU gdb (GDB) 7.4
Copyright (C) 2012 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-unknown-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /tmp/debug/lib/ld-linux-x86-64.so.2...(no debugging symbols found)...done.
(gdb) run --library-path /tmp/debug/lib
lib/ lib64/
(gdb) run --library-path /tmp/debug/lib /usr/bin/firefox
Starting program: /tmp/debug/lib/ld-linux-x86-64.so.2 --library-path /tmp/debug/lib /usr/bin/firefox
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/libthread_db.so.1".
Program received signal SIGILL, Illegal instruction.
0x00007ffff7261e82 in ?? () from /tmp/debug/lib/libm.so.6
(gdb) bt full
#0 0x00007ffff7261e82 in ?? () from /tmp/debug/lib/libm.so.6
No symbol table info available.
#1 0x00007ffff5e06d43 in ?? () from /usr/lib/firefox/libxul.so
No symbol table info available.
#2 0x00007ffff5e08d90 in ?? () from /usr/lib/firefox/libxul.so
No symbol table info available.
#3 0x00007ffff5bd6739 in ?? () from /usr/lib/firefox/libxul.so
No symbol table info available.
#4 0x00007ffff5bd682a in ?? () from /usr/lib/firefox/libxul.so
No symbol table info available.
#5 0x00007ffff5bd7281 in ?? () from /usr/lib/firefox/libxul.so
No symbol table info available.
#6 0x0000000000401cda in ?? ()
No symbol table info available.
#7 0x0000000000000000 in ?? ()
No symbol table info available.
(gdb) disassemble
No function contains program counter for selected frame.
I hope it will help.
I'm able to start my dom0, and firefox etc. work.
But I am unable to start a virtual machine.
It always asks whether xend is running and then aborts.
After that I have an inaccessible, unnamed machine in xm list.
I also tried adding xsave=1 to my domU config ... no success.
When I boot dom0 without xsave=1, I get "illegal hardware instruction" again,
and with xsave=1 everything works fine :-)