FS#25474 - [linux] Loading aacraid takes more than 60 seconds

Attached to Project: Arch Linux
Opened by Joker-jar (Joker-jar) - Tuesday, 09 August 2011, 08:27 GMT
Last edited by Tom Gundersen (tomegun) - Friday, 18 November 2011, 13:44 GMT
Task Type Bug Report
Category Kernel
Status Closed
Assigned To Tobias Powalowski (tpowa)
Thomas Bächler (brain0)
Dave Reisner (falconindy)
Tom Gundersen (tomegun)
Architecture x86_64
Severity Medium
Priority Normal
Reported Version
Due in Version Undecided
Due Date Undecided
Percent Complete 100%
Votes 2
Private No

Details

Ramdisk hangs during the boot. After ~1 minute displays message: "udevd[108]: timeout: killing: ..." (see attached image). This problem appears after upgrade to 3.0 kernel. Fallback image has this problem too.
This task depends upon

Closed by  Tom Gundersen (tomegun)
Friday, 18 November 2011, 13:44 GMT
Reason for closing:  None
Additional comments about closing:  Work's for user, patch exists.
Comment by Jelle van der Waa (jelly) - Wednesday, 10 August 2011, 09:44 GMT
Not sure, which package could cause this issue, assigned to tpowa -> kernel, tom -> udev, initscripts.

Btw is there anything special in your MODULES or modprobe.conf
Comment by Tom Gundersen (tomegun) - Wednesday, 10 August 2011, 10:00 GMT
The problem appears to happen before initscripts sets in, so assigned to mkinitcpio instead of initscripts and added Thomas. Also added the linux as this might be a broken module.

The module in question has alias pci:v00009005d00000285sv00009005sd000002D5bc01sc04i00 and the module name is aacraid.

It appears the system can boot without this module being loaded. Maybe it would be worth adding --exit-if-exists=<whatever node we need to continue> to udevadm settle, so we don't wait for unnecessary stuff?

@Joker-jar:
Please post your /etc/mkinitcpio.conf. Also: are you up to date with the latest packages from [core], or do you have some custom packages (in particular custom kernel)?
Comment by Thomas Bächler (brain0) - Wednesday, 10 August 2011, 11:08 GMT
Tom, the problem is that we don't know what exactly we need to continue booting.

Anyway, this is likely a kernel bug, dmesg will tell.
Comment by Joker-jar (Joker-jar) - Wednesday, 10 August 2011, 13:17 GMT
Kernel is default (from core). System is freshly. Manually installed packages are only userspace. My mkinitcpio.conf (there is not any wrong, imho): http://pastebin.com/aeaytS30 . Also, i have hardware raid controller Adaptec 2405:

$ lspci | grep -i adaptec
01:00.0 RAID bus controller: Adaptec AAC-RAID (rev 09)

P.S. Sorry for my english :)
Comment by Tom Gundersen (tomegun) - Wednesday, 10 August 2011, 13:19 GMT
As Thomas said, dmseg would probably tell us what is up. Could you please attach it?
Comment by Joker-jar (Joker-jar) - Wednesday, 10 August 2011, 13:26 GMT Comment by Tom Gundersen (tomegun) - Wednesday, 10 August 2011, 13:43 GMT
I don't see anything odd (except for the obvious delay). Are you able to rmmod aacraid and run "modprobe -bv pci:v00009005d00000285sv00009005sd000002D5bc01sc04i00" manually? Does this take a long time, or finish immediately?

(In order to rmmod you probably have to mount from a livecd, or get mount into sh in your initramfs, so that your hdd will not be mounted).
Comment by Tom Gundersen (tomegun) - Wednesday, 10 August 2011, 13:44 GMT
(For what it's worth: When I "modprobe -bv pci:v00009005d00000285sv00009005sd000002D5bc01sc04i00" the module loads successfully immediately. However, I don't have the hardware in question, so it is not surprising that the behavior is different.)
Comment by Dave Reisner (falconindy) - Wednesday, 10 August 2011, 13:50 GMT
/lib/initcpio/busybox provides rmmod. It may not be symlinked, but its an included applet.

$ /lib/initcpio/busybox rmmod
BusyBox v1.18.4 (2011-05-14 12:38:46 CEST) multi-call binary.

Usage: rmmod [-wfa] [MODULE]..
Comment by Joker-jar (Joker-jar) - Wednesday, 10 August 2011, 14:08 GMT
I have no way to use live system now (may be, tomorrow). But i think that live CD has 2.6 Kernel, and i had no problems with it.

in addition:

$ lsmod | grep aacraid
aacraid 79005 4
scsi_mod 131482 7 usb_storage,uas,sg,sd_mod,sr_mod,libata,aacraid
Comment by Thomas Bächler (brain0) - Wednesday, 10 August 2011, 14:37 GMT
Can't see anything wrong either, despite the RAID controller taking >60s to initialize. This is most definitely a bug in the aacraid module.
Comment by Joker-jar (Joker-jar) - Wednesday, 10 August 2011, 14:46 GMT
I see RAID controller initialization ~60 sec before starting Grub :) What does module so long? I knew nothing about it before 3.0 (whether it was used up to 3.0?)
Comment by Tom Gundersen (tomegun) - Wednesday, 10 August 2011, 15:21 GMT
Could you verify that 2.6.39 does not cause this issue? There were no real changes to the module between 2.6.39 and 3.0.

I guess livecd is no good. So booting into the initrd is probably the best option. Not sure what is the best way, but I'd do something like this: add rdinit=/bin/sh to your kernel command line. When booted mount /proc and /sys and then call modprobe to see what happens.
Comment by Thomas Bächler (brain0) - Wednesday, 10 August 2011, 15:47 GMT
@Tom: 'rdinit=/bin/busybox sh' to the kernel command line, followed by 'busybox --install -s' once the shell appears.
Comment by Joker-jar (Joker-jar) - Wednesday, 10 August 2011, 20:38 GMT
With rdinit=/bin/sh kernel is panic:

Kernel panic - not syncing: VFS: unable to mount root fs on unknown_block(1,0)
Comment by Dave Reisner (falconindy) - Wednesday, 10 August 2011, 20:43 GMT
Yes. That's expected... which is why thomas suggested using:

rdinit=/bin/busybox sh

there is no /bin/sh on the initramfs prior to running /bin/busybox --install -s (which is the first thing that's called in /init).
Comment by Tom Gundersen (tomegun) - Wednesday, 10 August 2011, 20:43 GMT
Sorry, I was wrong. Do as Thomas described.
Comment by Joker-jar (Joker-jar) - Wednesday, 10 August 2011, 21:09 GMT
hm, usb keyboard doesn't work in this shell, although i added usbinput hook and recreated ramdisk.

Probleam appears after this update (it is 100%): http://pastebin.com/tKwTadHv
Comment by Joker-jar (Joker-jar) - Thursday, 11 August 2011, 08:31 GMT
I just try to boot with archlive 2010.05 (2.6.xx kernel). I coudln't rmmod aacraid ("module aacraid is in use"), but there is no any delays during the boot. Live session dmesg: http://pastebin.com/YVSyn3eY
Comment by Tom Gundersen (tomegun) - Thursday, 11 August 2011, 08:54 GMT
@Joker-jar: you might have to rmmod other modules before aacraid can be rmmod'ed. You'll find them if you do "lsmod | grep aacraid".

To reproduce, maybe the best thing is to boot with the most recent arch iso, as it should have linux 3.0: http://releng.archlinux.org/isos/2011.08.11/archlinux-2011.08.11-netinstall-x86_64.iso .
Comment by Joker-jar (Joker-jar) - Thursday, 11 August 2011, 10:36 GMT
This image has same problem (hang with message on archboot.jpg). dmesg: http://pastebin.com/PsuQn7TE

Than i tried to boot with param "modprobe.blacklist=aacraid". System loads quickly but raid partitions, of course, were not available.

# modprobe -bv pci:v00009005d00000285sv00009005sd000002D5bc01sc04i00
Module pci:v00009005d00000285sv00009005sd000002D5bc01sc04i00 not found.

# time modprobe aacraid
real 1m1.673s
user 0m0.100s
sys 0m0.003s
Comment by Thomas Bächler (brain0) - Thursday, 11 August 2011, 10:58 GMT
You should report this to the kernel.org bugzilla, we definitely can't fix this ourselves.
Comment by Joker-jar (Joker-jar) - Thursday, 11 August 2011, 11:56 GMT
I think so too. Thank you very much for your cooperation. So, i have only one trouble: how explain it all :) because my english is really bad.
Comment by Thomas Bächler (brain0) - Thursday, 11 August 2011, 13:04 GMT
We were able to understand you pretty well, so you shouldn't have trouble. You can also reference this bug report. Your last post where you used 'time' to measure how long it takes to modprobe the module says a lot already - I don't think such a delay should ever occur.
Comment by Tom Gundersen (tomegun) - Thursday, 11 August 2011, 13:50 GMT
I agree with Thomas, "# time modprobe aacraid" basically says it all. I'd also suggest posting the result of "# time modprobe aacraid" with the previosu (working) kernel.

Your timing is now a bit more than one second over the timeout (60 sec), so it would be good to know how big the difference was (if it was 1 sec or 59 secs before the upgrade).
Comment by Joker-jar (Joker-jar) - Thursday, 11 August 2011, 14:26 GMT
By the way, system says that killed problem module (see screenshot), but in fact this module successfully loaded, despite the delay. Otherwise the system would not boot up without RAID controller driver.
Comment by Joker-jar (Joker-jar) - Thursday, 11 August 2011, 14:29 GMT
Just in case i have updated motherboard and raid controller bios today
Comment by Tom Gundersen (tomegun) - Thursday, 11 August 2011, 14:36 GMT
I don't know exactly what modprobe does, but I assume it has already inserted the module into the kernel, just waiting for it to initialize before it returns (so when modprobe is killed, it actually does not have anything important left to do, and 1.6 seconds later the kernel is anyway finished with successfully loading the module, so it is there when needed later).

Would be nice if you could reproduce the timing numbers (with good and bad kernel) now that your BIOS is updated, and post to the kernel bugzilla.
Comment by Joker-jar (Joker-jar) - Sunday, 14 August 2011, 08:01 GMT
I wrote to bugzilla.kernel.org three days ago, but there is no any answer yet. Is there anybody know any live system with Kernel 3.0 to test this problem?
Comment by Tom Gundersen (tomegun) - Monday, 15 August 2011, 23:22 GMT
@Joker-jar: <http://releng.archlinux.org/isos/2011.08.15/>

Could you post a link to your bug report?
Comment by Joker-jar (Joker-jar) - Tuesday, 16 August 2011, 06:34 GMT
@tomegun: I mean other than Archlinux.

https://bugzilla.kernel.org/show_bug.cgi?id=40932
Comment by Joker-jar (Joker-jar) - Sunday, 28 August 2011, 20:21 GMT
How to decrease this udev timeout?
Comment by Tom Gundersen (tomegun) - Sunday, 28 August 2011, 21:37 GMT
You want to change the event timeout, which should be set in the rule that modprobes the aacraid module.

I think replacing
DRIVER!="?*", ENV{MODALIAS}=="?*", RUN+="/sbin/modprobe -bv $env{MODALIAS}"
in /lib/udev/rules.d/80-drivers.rules, with
DRIVER!="?*", ENV{MODALIAS}=="?*", RUN+="/sbin/modprobe -bv $env{MODALIAS}", OPTIONS+="event_timeout=60"
(where 60 is replaced with the number of seconds you want).

I have not tested this so maybe it is not exactly correct, have a look in "man udev" for more info.

Out of curiosity, I don't think I have seen the time measurement on a working kernel. If it is not too much effort, could you paste the output of "# time modprobe aacraid" after booting a working kernel with "modprobe.blacklist=aacraid"?
Comment by Joker-jar (Joker-jar) - Sunday, 28 August 2011, 22:50 GMT
Thanks, i'll try it. "working kernel" you mean 2.6.xx?
Comment by Tom Gundersen (tomegun) - Sunday, 28 August 2011, 23:09 GMT
@Joker-jar: yeah, whatever was the last version you tried that didn't have this problem.
Comment by Joker-jar (Joker-jar) - Friday, 16 September 2011, 12:45 GMT
Hello, i just back from vacation. Today i tried to boot Kubuntu beta 11.10 (live). It has 3.0 kernel too. It seems it has similar problem: http://pastebin.com/uE1Pa5nz

I'll try "# time modprobe aacraid" with working kernel as soon as i'll buy blank CD :)
Comment by Tom Gundersen (tomegun) - Friday, 16 September 2011, 14:52 GMT
Thanks for reporting back.
Comment by Joker-jar (Joker-jar) - Monday, 19 September 2011, 15:23 GMT
Hm, some info:

i run kexec daemon and try to reboot. System hangs with this messages:

Triggering uevents...
irq 19: nobody cared (...)

handlers:

usb_hcd_ing
usb_hcd_ing

disabling IRQ 19

And after that (~1 min) message from screenshot (timeout killing modprobe). IRQ 19? May be this is conflict?

Comment by Tom Gundersen (tomegun) - Friday, 18 November 2011, 05:12 GMT
Just to note that this has been fixed in mainline, and probably will be in the next stable release:

commit cf16123c9c8e346ed1dd171295a678d77648d7f8
Author: Vasily Averin <vvs@parallels.com>
Date: Fri Nov 11 13:42:16 2011 +0400

[SCSI] aacraid: controller hangs if kernel uses non-default ASPM policy

Aacraid controller can hang on some nodes if kernel uses non-default
(powersave) ASPM policy. Controller hangs shortly after successful load and
hardware detection. Scsi error handler detects this hang and tries to restart
hardware but it does not help.

Initially it was noticed on RHEL6-based openVZ kernel after backporting
aacraid driver from mainline (RHEL6 kernel with original driver works well)
http://bugzilla.openvz.org/show_bug.cgi?id=2043

This issue happens because default ASPM policy was changed in Red Hat
kernels. Therefore guys from Red Hat have noticed this problem long time ago:
on Fedora 12
https://bugzilla.redhat.com/show_bug.cgi?id=540478
on Fedora 14
https://bugzilla.redhat.com/show_bug.cgi?id=679385

In RHEL6 kernel this issue was fixed, ASPM was disabled in aacraid driver. In
kernel changelog I've found that seems it was done by Matthew Garrett: -
[scsi] aacraid: Disable ASPM by default (Matthew Garrett) [599735]

However seems this patch was not submitted to mainline. I've reproduced this
issue on vanilla 3.1.0 kernel booted with "pcie_aspm.policy=powersave" option,
So I believe it makes sense to do it now.

Signed-off-by: Vasily Averin <vvs@sw.ru>
[mjg: Checking the Windows drivers indicates that they disable ASPM under all
circumstances, so:]
Acked-by: Matthew Garrett <mjg@redhat.com>
Acked-by: Achim Leubner <Achim_Leubner@pmc-sierra.com>
Cc: stable@kernel.org
Signed-off-by: James Bottomley <JBottomley@Parallels.com>
Comment by Joker-jar (Joker-jar) - Friday, 18 November 2011, 07:02 GMT
Yes, i noticed that problem is disappeared in 3.1 kernel on my ArchLinux

Loading...