FS#22343 - Random kernel panics after udev upgrade
Attached to Project: Arch Linux
Opened by H.pferd (stosch) - Thursday, 06 January 2011, 20:32 GMT
Last edited by Tobias Powalowski (tpowa) - Monday, 24 January 2011, 18:46 GMT
Details
Description: For the past few days (probably since the udev
upgrade to 165) I have been getting random kernel panics on several
machines (laptop and desktop). They occur in two
situations:
- when I insert a DVD (no automount)
- rarely, when I boot up.
There might be a connection to this forum thread: https://bbs.archlinux.org/viewtopic.php?id=111197 even though the display output is slightly different (see attached file). The problem seems to be solved by downgrading to the older udev 164-3.
Additional info: udev 165-1; picture with display output.
Steps to reproduce: none found; random but pretty frequent.
Closed by Tobias Powalowski (tpowa)
Monday, 24 January 2011, 18:46 GMT
Reason for closing: Fixed
Additional comments about closing: patch added to all kernels
Run udevadm info --convert-db
and rebuild your initramfs by running mkinitcpio -p kernel26.
I ran udevadm info --convert-db and updated the initramfs for both kernel26 2.6.36 and 2.6.37 from [testing],
logged in to my desktop, put a DVD in the drive, and it panicked again. Sometimes it panics on boot.
Downgrading to udev 164-3 is the only workaround that helps for now.
I tried with the stock kernel. The problem continues.
Same error here: random kernel panics during the udev autodetect process; the fallback image works normally.
Following Tobias's advice, here is what I did to solve the problem:
- upgrade to udev 165-1
- udevadm info --convert-db
- boot many times. No kernel panic
- rebuild initramfs
- boot many times. No kernel panic
So for me the problem is gone. I do not use my Arch system to play DVDs, so I haven't checked that case.
@Emmanuel : are you 100% sure that the problem is gone ? It can be sporadic.
Basically, it doesn't get stuck when I use udev 164-3 with the stock Arch kernel, but it still gets stuck while trying to boot a BFS-patched kernel.
@stosch: Using bfs by any chance?
Update: The system definitely doesn't get stuck when trying to boot kernel without bfs.
So the problem still occurs.
I've cleaned my package cache. Does anyone know where I can download the udev 164 package? Thanks!
i686: http://schlunix.org/archlinux/core/os/i686/udev-164-3-i686.pkg.tar.xz
x86_64: http://schlunix.org/archlinux/core/os/x86_64/udev-164-3-x86_64.pkg.tar.xz
But I think it would be better if the Arch devs moved 165-1 back to [testing] until we know what caused this critical bug for many users.
Right after I updated this evening, I rebooted, and came up to a string of code, all of it being greek to me, and a panic right there. I rebooted, tried the fallback image, and it is working fine so far. I am afraid to do anything further with my system, as I'm afraid I'll totally bork it - I'm going to keep my system up solidly for the next couple of days, and see if there is a kernel fix that comes in - if not, then I'll downgrade udev and the kernel both to be on the safe side.
Thanks for hearing me out.
My system:
lsmod
Module Size Used by
usb_storage 34044 1
fuse 56816 3
lm85 13885 0
hwmon_vid 2008 1 lm85
ext2 55656 1
mbcache 4298 1 ext2
nvidia 9234898 28
snd_seq_dummy 1079 0
snd_seq_oss 25040 0
snd_seq_midi_event 4528 1 snd_seq_oss
snd_seq 41688 5 snd_seq_dummy,snd_seq_oss,snd_seq_midi_event
snd_pcm_oss 33694 0
snd_emu10k1 124455 1
snd_intel8x0 22230 2
snd_rawmidi 15288 1 snd_emu10k1
snd_mixer_oss 14654 1 snd_pcm_oss
snd_ac97_codec 87943 2 snd_emu10k1,snd_intel8x0
ac97_bus 762 1 snd_ac97_codec
snd_pcm 59136 4 snd_pcm_oss,snd_emu10k1,snd_intel8x0,snd_ac97_codec
firewire_ohci 23548 0
snd_seq_device 4369 5 snd_seq_dummy,snd_seq_oss,snd_seq,snd_emu10k1,snd_rawmidi
e100 27295 0
mii 3198 1 e100
firewire_core 42849 1 firewire_ohci
snd_timer 15583 3 snd_seq,snd_emu10k1,snd_pcm
snd_util_mem 1820 1 snd_emu10k1
agpgart 22816 1 nvidia
crc_itu_t 1053 1 firewire_core
snd_hwdep 4764 1 snd_emu10k1
uhci_hcd 19091 0
snd 43219 18 snd_seq_oss,snd_seq,snd_pcm_oss,snd_emu10k1,snd_intel8x0,snd_rawmidi,snd_mixer_oss,snd_ac97_codec,snd_pcm,snd_seq_device,snd_timer,snd_hwdep
soundcore 4929 1 snd
iTCO_wdt 8677 0
i2c_i801 6946 0
snd_page_alloc 5981 3 snd_emu10k1,snd_intel8x0,snd_pcm
ehci_hcd 32908 0
ppdev 4862 0
parport_pc 27832 1
shpchp 23037 0
psmouse 49765 0
lp 6652 0
iTCO_vendor_support 1433 1 iTCO_wdt
pci_hotplug 21523 1 shpchp
usbcore 115866 4 usb_storage,uhci_hcd,ehci_hcd
i2c_core 15762 3 lm85,nvidia,i2c_i801
parport 25499 3 ppdev,parport_pc,lp
pcspkr 1359 0
serio_raw 3566 0
thermal 9690 0
evdev 6692 7
processor 22776 0
button 3746 0
reiserfs 225719 3
sg 21028 0
sd_mod 24384 7
sr_mod 13217 0
cdrom 31378 1 sr_mod
pata_acpi 2308 0
ata_generic 2215 0
ata_piix 17935 4
libata 140308 3 pata_acpi,ata_generic,ata_piix
scsi_mod 106955 5 usb_storage,sg,sr_mod,sd_mod,libata
uname -a
Linux dedanna.rocks.net 2.6.36-ARCH #1 SMP PREEMPT Sat Jan 8 13:16:43 UTC 2011 i686 Intel(R) Pentium(R) 4 CPU 2.40GHz GenuineIntel GNU/Linux
Oh, btw, meant to mention also that my FreeAgent external usb drive was NOT hotplugged when I updated this evening. Normally I do leave it hotplugged, and it does fine through kernel updates, etc., but I had received a heads up from our site admin at our forum, so I did unmount it this time before I updated - start reading here: http://bjoernvold.com/forum/viewtopic.php?p=6826#p6826
Thanks.
This needs a news post. I'm glad I'm working from home today or I'd be totally hosed. People shouldn't have to check the bug tracker to know if they will have a usable system after upgrading.
Downgrading to udev 164 resolves the issue
Well, apparently the issue is elusive... On my 2 i686 and 2 x86_64 systems, I haven't experienced any crashes since Jan 5 (and two of them run 24/7). No problems with DVDs under KDE x64 either. Apparently I wasn't "lucky" enough with my hardware :)
And I experienced nothing after the udev update (which I got on Dec. 16th) until the kernel update last night. I still suspect it's actually the kernel's fault, that it isn't able to work with current udev. But I guess that depends on how you look at it.
Also, as the link I left earlier in this thread shows, this is obviously NOT Arch's fault; it is upstream. Given Arch's track record, I think a comment like that is totally unwarranted. In my own experience bugs in Arch are very rare, so I am patient when things arise.
I suspect in Kim's case, udev was updated on Dec. 16th but the initcpio image still had the previous version of udev.
After she updated the kernel, the initcpio image was rebuilt and udev in the initcpio image was updated to the latest version (which crashes on boot).
https://bbs.archlinux.org/viewtopic.php?id=111197
"Keep in mind that this is not a udev bug. It's a kernel bug that udev165 tickles." https://bbs.archlinux.org/viewtopic.php?pid=878331#p878331 - just like I thought.
"I just realized that running the kernel with 'edd=off' seems to allow me to boot now. Only tried it a few times but I was getting 90% boot failure prior to this and I haven't not been able to boot yet. Might help with some?!?!?!" - https://bbs.archlinux.org/viewtopic.php?pid=878334#p878334 - this last suggestion here does NOT work for me.
Thanks.
I have this problem occurring on at least three archlinux systems. And adding edd=off on the kernel line does not help.
The problem occurs at boot time; the fallback image does not make any difference. The messages differ from system to system and sometimes from boot to boot. My Samsung NC10 wouldn't boot even after 5 attempts. I could not reproduce the problem on an x64 system; thus far it has only occurred on i686 systems. It also occurs on a system booted from an external USB HDD.
Files added:
img_9972.jpg: samsung NC10 (only system using grub 2, others use grub 1, always exactly the same messages)
img_9974.jpg: P4P800-X first boot after upgrade to kernel 2.6.36.3-1 (but it happened before)
img_9975.jpg: P4P800-X second boot (third time it booted correctly)
img_9976.jpg: PX845PE
img_9977.jpg: PX845PE with edd=off
img_9979.jpg: PX845PE using fallback image
img_9980.jpg: PX845PE booted from external usb hdd
IMG_9974.JPG (112.6 KiB)
IMG_9975.JPG (85.3 KiB)
IMG_9976.JPG (107.3 KiB)
IMG_9977.JPG (105.7 KiB)
IMG_9979.JPG (105.9 KiB)
IMG_9980.JPG (93.4 KiB)
Jan 15 07:38:33 localhost kernel: ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Jan 15 07:38:33 localhost kernel: ata3.00: failed command: IDENTIFY PACKET DEVICE
Jan 15 07:38:33 localhost kernel: ata3.00: cmd a1/00:01:00:00:00/00:00:00:00:00/00 tag 0 pio 512 in
Jan 15 07:38:33 localhost kernel: res 40/00:03:00:00:00/00:00:00:00:00/a0 Emask 0x4 (timeout)
Jan 15 07:38:33 localhost kernel: ata3.00: status: { DRDY }
Jan 15 07:38:33 localhost kernel: ata3: hard resetting link
Jan 15 07:38:33 localhost kernel: ata3: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
Jan 15 07:38:33 localhost kernel: ata3.00: configured for UDMA/100
Jan 15 07:38:33 localhost kernel: ata3: EH complete
First workaround that worked for me was the panic.patch posted here by DaNiMoTh. I've now gone the Slackware -current route and rebuilt udev with the commit that's causing all the problems reverted.
a picture of the kernel panic http://imgur.com/26tFP
http://bjoernvold.com/forum/viewtopic.php?p=6938#p6938
I think I'm going to downgrade udev and the kernel both. The problem with that being, that I'll probably screw it up, not knowing much about mkinitcpio, and always afraid to try it even when I get a walkthrough.
On my two amd64 machines, running 64-bit, this does not happen.
"udevadm info --convert-db
and rebuild your initramfs by running mkinitcpio -p kernel26"
I've noted from the Arch forum, that patches, etc. have been messing up Fallback as well as the main kernel for a good fair few. I'm afraid to do anything at this point, as Fallback is the only way I can get into my system. How safe would this be to run?
Edit: Just tried it anyway, and guess what? It worked. This is without my external drives hotplugged, and with no cd/dvd in my dvd drive. I also have several more reboots to go to confirm it's a keeper.
I should say, this is exactly what I did. It's not like I just whipped out a root terminal while in the desktop, and did it. I didn't. There was a bit more to it. https://bbs.archlinux.org/viewtopic.php?pid=880526#p880526
You are a god. Thank you.
Unmount all external USB drives, make sure no other hard drives are mounted, and make sure there is no CD or DVD in the CD/DVD drives. Then:
Log out
Hit Ctrl+Alt+F1
Log in as root
Then, do:
init 3
udevadm info --convert-db
mkinitcpio -p kernel26
Then, reboot.
It's simple to do, and worked for me where nothing else did.
Would it be fixed by installing the latest kernel tree? I wonder if there is a bug report on it. Also, I cannot find Tejun Heo's latest commit in the kernel tree, so I'm not sure the bug has been fixed.
http://comments.gmane.org/gmane.linux.hotplug.devel/16438
http://www.linuxquestions.org/questions/slackware-14/current-randomly-timed-kernel-oops-on-bootup-of-two-test-boxen-852843/
I gather from Tejun's comments that he doesn't actually agree that he broke it. Doesn't seem to me they've identified the issue yet so I wouldn't be optimistic about seeing a fix yet. I'm hobbling along with a simple patch to udev to remove the offending call but I'm hoping for a real fix soon as it's time to update the system again, haven't done it in over a week now (eek).
I haven't updated either since I got the bug.
uname -r
2.6.36-ARCH
How do I do the other part of this, "give the new kernel a different name, and keep the current Arch kernel and initrd in /boot and Grub" - when a new kernel's installed, the old one uninstalls in regular updates. I guess I don't have the hang of keeping the old kernel in Arch yet.
Thanks so much.
There's lots of info on the net on building kernels. O'Reilly has an online book by Greg Kroah-Hartman (Linux Kernel in a Nutshell) that has a lot of good info in it.
It's still happening sometimes on boot, although rarely, even with udev 164-3. I'm using the latest LTS kernel; I haven't tried the normal kernel yet.
In case it adds anything, my setup includes...
1xSATA HDD
2xPATA DVD drives (Liteon & Samsung)
http://permalink.gmane.org/gmane.linux.hotplug.devel/16492
From: Tejun Heo <htejun@xxxxxxxx>
Subject:[PATCH #upstream-fixes] libata: set queue DMA alignment to sector size for ATAPI too
Date: Thu, 20 Jan 2011 13:59:06 +0100 (01/20/2011 05:59:06 AM)
ata_pio_sectors() expects buffer for each sector to be contained in a
single page; otherwise, it ends up overrunning the first page. This
is achieved by setting queue DMA alignment. If sector_size is smaller
than PAGE_SIZE and all buffers are sector_size aligned, buffer for
each sector is always contained in a single page.
This wasn't applied to ATAPI devices but IDENTIFY_PACKET is executed
as ATA_PROT_PIO and thus uses ata_pio_sectors(). Newer versions of
udev issue IDENTIFY_PACKET with unaligned buffer triggering the
problem and causing oops.
This patch fixes the problem by setting sdev->sector_size to
ATA_SECT_SIZE on ATATPI devices and always setting DMA alignment to
sector_size. While at it, add a warning for the unlikely but still
possible scenario where sector_size is larger than PAGE_SIZE, in which
case the alignment wouldn't be enough.
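The alignment argument in the patch description can be checked with simple arithmetic. The following is a user-space C sketch, not kernel code: it assumes 4 KiB pages and 512-byte sectors, and the names DEMO_PAGE_SIZE, DEMO_SECT_SIZE, range_crosses_page, all_sectors_fit_in_pages and alignment_is_sufficient are hypothetical. It only models the condition that ata_pio_sectors() relies on; it does not reproduce that function.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative constants (assumptions): 4 KiB pages, 512-byte ATA sectors. */
#define DEMO_PAGE_SIZE 4096u
#define DEMO_SECT_SIZE 512u

/* True if the byte range [addr, addr + len) spans more than one page.
 * Requires len > 0 and page_size to be a power of two. */
bool range_crosses_page(uintptr_t addr, size_t len, size_t page_size)
{
    assert(page_size != 0 && (page_size & (page_size - 1)) == 0);
    assert(len > 0);
    size_t offset_in_page = (size_t)(addr & (page_size - 1));
    return offset_in_page + len > page_size;
}

/* True if every sector of an n-sector transfer starting at buf lies
 * inside a single page, i.e. a per-sector loop that touches only one
 * page per sector would not run past the end of that page. */
bool all_sectors_fit_in_pages(uintptr_t buf, size_t n_sectors,
                              size_t sector_size, size_t page_size)
{
    for (size_t i = 0; i < n_sectors; i++) {
        if (range_crosses_page(buf + i * sector_size, sector_size, page_size))
            return false;
    }
    return true;
}

/* For power-of-two sizes, sector_size-aligned buffers keep each sector
 * within one page only if sector_size <= page_size; otherwise the
 * alignment "wouldn't be enough", which is the case the patch warns about. */
bool alignment_is_sufficient(size_t sector_size, size_t page_size)
{
    return sector_size <= page_size;
}
```

With a page-aligned buffer, all_sectors_fit_in_pages(0x10000, 16, 512, 4096) holds, while shifting the same buffer by 8 bytes makes the eighth sector straddle a page boundary. An IDENTIFY PACKET DEVICE reply is a single 512-byte block (the log above shows `pio 512 in`), so a single unaligned buffer that starts within the last 512 bytes of a page is enough to produce the overrun the patch describes.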
Name : kernel26-lts
Version : 2.6.32.28-1
Name : udev
Version : 165-1
I'm neither a developer nor a packager myself, nor am I running Arch testing, and I really do not want to attempt the above. I'm only someone who helps with bugs, and does what the devels tell her to do in order to see what works and what doesn't. I'd much rather get what I can from updates. I took a simple route to get working again for now, and it worked; another user has reported the same. Can we move on with a fix for the repos? As you'll notice, right now I'm only on this kernel:
pacman -Q kernel26
kernel26 2.6.36.3-1
In all honesty, I am nowhere near ready to try something this deep into the unknown. I've been running Arch solidly, yes, but for just under a year, and I think I am still in the learning stages with Arch. I have general knowledge of Linux, yes: I ran Cooker for Mandriva for a couple of years and contributed to many bugs there. That's the extent of my knowledge, having run Linux in general for over 10 years now. I have installed and set up Arch for many, many people, including myself and some businesses, but after that I hand them the wiki and let them go for it.
Thanks.
It should also apply cleanly to Linux 2.6.36.
Does not compile yet.
libata-alignment-2.6.32.patch (1.6 KiB)
Note: I have modified the patch so that it does not require commit http://git.kernel.org/?p=linux/kernel/git/stable/linux-2.6.37.y.git;a=commitdiff;h=295124dce4ddfd40b1f12d3ffd2779673e87c701 (which adds support for >512 byte sector sizes in Linux 2.6.37 and later).
I've also attached the original patch for Linux 2.6.37.
libata-alignment.patch (2.7 KiB)
The kernel patch fixes the issue. Would you be able to add them to the Arch Linux kernel packages?
libata-alignment-2.6.32.patch for core/kernel26 2.6.36.3, core/kernel26-lts 2.6.32.28
libata-alignment.patch for testing/kernel26 2.6.37