FS#37961 - Race between mdadm(_udev) and lvm2 hook

Attached to Project: Arch Linux
Opened by Arno (ihad) - Saturday, 30 November 2013, 18:30 GMT
Last edited by Gerardo Exequiel Pozzi (djgera) - Sunday, 01 December 2013, 06:39 GMT
Task Type Bug Report
Category Packages: Core
Status Closed
Assigned To No-one
Architecture All
Severity Low
Priority Normal
Reported Version
Due in Version Undecided
Due Date Undecided
Percent Complete 100%
Votes 0
Private No

Details

Description:

Today I rebooted my system after some time and got dropped to the rescue shell, because the mdadm, or mdadm_udev hook didn't assmble the RAID arrays in time for the lmv2 hook.

1. The setup:
~~
$ cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md1 : active raid5 sde[0] sdh[4] sdg[2] sdf[1]
8790405120 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/4][UUUU]

md127 : active raid5 sdb2[0] sdd2[3] sda2[2] sdc2[1]
1464862656 blocks level 5, 64k chunk, algorithm 2 [4/4] [UUUU]

unused devices: <none>
~~

md127 contains the rootfs:
~~
$ mount
...
/dev/mapper/raid-root on / type xfs (rw,relatime,attr2,inode64,noquota)
...
~~

2. The Problem:
Both hooks, mdadm and mdadm_udev, don't wait until the kernel has assembled the raids, so lvm vgscan + lvm vgchange -ay fail. Starting lvmetad doesn't help, either. In my case assembling the raids takes about 20 seconds, give or take:
~~
[snip]
[ 5.296915] ata6: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[ 5.297120] ata6.15: Port Multiplier 1.2, 0x197b:0x0325 r0, 15 ports, feat 0x5/0xf
[ 5.299579] ata6.00: hard resetting link
[ 5.783729] ata6.00: SATA link up 3.0 Gbps (SStatus 123 SControl 320)
[ 5.783755] ata6.01: hard resetting link
[ 6.270389] ata6.01: SATA link up 3.0 Gbps (SStatus 123 SControl 320)
[ 6.270414] ata6.02: hard resetting link
[ 6.757050] ata6.02: SATA link up 3.0 Gbps (SStatus 123 SControl 320)
[ 6.757075] ata6.03: hard resetting link
[ 7.243709] ata6.03: SATA link up 3.0 Gbps (SStatus 123 SControl 320)
[ 7.243734] ata6.04: hard resetting link
[ 8.266888] ata6.04: failed to resume link (SControl 0)
[ 8.267090] ata6.04: SATA link down (SStatus 0 SControl 0)
[ 8.267134] ata6.05: hard resetting link
[ 9.290205] ata6.05: failed to resume link (SControl 0)
[ 9.290406] ata6.05: SATA link down (SStatus 0 SControl 0)
[ 9.290450] ata6.06: hard resetting link
[ 10.313527] ata6.06: failed to resume link (SControl 0)
[ 10.313727] ata6.06: SATA link down (SStatus 0 SControl 0)
[ 10.313771] ata6.07: hard resetting link
[ 11.336849] ata6.07: failed to resume link (SControl 0)
[ 11.337049] ata6.07: SATA link down (SStatus 0 SControl 0)
[ 11.337094] ata6.08: hard resetting link
[ 12.360171] ata6.08: failed to resume link (SControl 0)
[ 12.360373] ata6.08: SATA link down (SStatus 0 SControl 0)
[ 12.360418] ata6.09: hard resetting link
[ 13.383480] ata6.09: failed to resume link (SControl 0)
[ 13.383679] ata6.09: SATA link down (SStatus 0 SControl 0)
[ 13.383724] ata6.10: hard resetting link
[ 14.406818] ata6.10: failed to resume link (SControl 0)
[ 14.407017] ata6.10: SATA link down (SStatus 0 SControl 0)
[ 14.407062] ata6.11: hard resetting link
[ 15.430135] ata6.11: failed to resume link (SControl 0)
[ 15.430335] ata6.11: SATA link down (SStatus 0 SControl 0)
[ 15.430380] ata6.12: hard resetting link
[ 16.453456] ata6.12: failed to resume link (SControl 0)
[ 16.453655] ata6.12: SATA link down (SStatus 0 SControl 0)
[ 16.453700] ata6.13: hard resetting link
[ 17.476777] ata6.13: failed to resume link (SControl 0)
[ 17.476977] ata6.13: SATA link down (SStatus 0 SControl 0)
[ 17.477022] ata6.14: hard resetting link
[ 18.500092] ata6.14: failed to resume link (SControl 0)
[ 18.500291] ata6.14: SATA link down (SStatus 0 SControl 0)
[ 18.500888] ata6.00: ATA-8: ST3000DM001-9YN166, CC4H, max UDMA/133
[ 18.500890] ata6.00: 5860533168 sectors, multi 0: LBA48 NCQ (depth 31/32), AA
[ 18.501413] ata6.00: configured for UDMA/133
[ 18.501972] ata6.01: ATA-8: ST3000DM001-9YN166, CC4H, max UDMA/133
[ 18.501973] ata6.01: 5860533168 sectors, multi 0: LBA48 NCQ (depth 31/32), AA
[ 18.502496] ata6.01: configured for UDMA/133
[ 18.503050] ata6.02: ATA-8: ST3000DM001-9YN166, CC4H, max UDMA/133
[ 18.503051] ata6.02: 5860533168 sectors, multi 0: LBA48 NCQ (depth 31/32), AA
[ 18.503587] ata6.02: configured for UDMA/133
[ 18.504151] ata6.03: ATA-8: ST3000DM001-9YN166, CC4H, max UDMA/133
[ 18.504153] ata6.03: 5860533168 sectors, multi 0: LBA48 NCQ (depth 31/32), AA
[ 18.504680] ata6.03: configured for UDMA/133
[ 18.504725] ata6: EH complete
[ 18.504855] scsi 5:0:0:0: Direct-Access ATA ST3000DM001-9YN1 CC4H PQ: 0 ANSI: 5
[ 18.504977] sd 5:0:0:0: [sde] 5860533168 512-byte logical blocks: (3.00 TB/2.72 TiB)
[ 18.504980] sd 5:0:0:0: [sde] 4096-byte physical blocks
[ 18.505064] sd 5:0:0:0: [sde] Write Protect is off
[ 18.505067] sd 5:0:0:0: [sde] Mode Sense: 00 3a 00 00
[ 18.505077] scsi 5:1:0:0: Direct-Access ATA ST3000DM001-9YN1 CC4H PQ: 0 ANSI: 5
[ 18.505095] sd 5:0:0:0: [sde] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[ 18.505265] sd 5:1:0:0: [sdf] 5860533168 512-byte logical blocks: (3.00 TB/2.72 TiB)
[ 18.505267] sd 5:1:0:0: [sdf] 4096-byte physical blocks
[ 18.505334] sd 5:1:0:0: [sdf] Write Protect is off
[ 18.505337] sd 5:1:0:0: [sdf] Mode Sense: 00 3a 00 00
[ 18.505357] sd 5:1:0:0: [sdf] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[ 18.505445] scsi 5:2:0:0: Direct-Access ATA ST3000DM001-9YN1 CC4H PQ: 0 ANSI: 5
[ 18.505789] sd 5:2:0:0: [sdg] 5860533168 512-byte logical blocks: (3.00 TB/2.72 TiB)
[ 18.505791] sd 5:2:0:0: [sdg] 4096-byte physical blocks
[ 18.738022] sde: unknown partition table
[ 18.738055] sd 5:2:0:0: [sdg] Write Protect is off
[ 18.738059] sd 5:2:0:0: [sdg] Mode Sense: 00 3a 00 00
[ 18.738066] scsi 5:3:0:0: Direct-Access ATA ST3000DM001-9YN1 CC4H PQ: 0 ANSI: 5
[ 18.738115] sd 5:2:0:0: [sdg] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[ 18.738170] sd 5:3:0:0: [sdh] 5860533168 512-byte logical blocks: (3.00 TB/2.72 TiB)
[ 18.738172] sd 5:3:0:0: [sdh] 4096-byte physical blocks
[ 18.738222] sd 5:0:0:0: [sde] Attached SCSI disk
[ 18.738225] sd 5:3:0:0: [sdh] Write Protect is off
[ 18.738229] sd 5:3:0:0: [sdh] Mode Sense: 00 3a 00 00
[ 18.738252] sd 5:3:0:0: [sdh] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[ 18.753173] sdf: unknown partition table
[ 18.769158] sdg: unknown partition table
[ 18.784408] sdh: unknown partition table
[ 18.793227] sd 5:1:0:0: [sdf] Attached SCSI disk
[ 18.793259] sd 5:2:0:0: [sdg] Attached SCSI disk
[ 18.892777] sd 5:3:0:0: [sdh] Attached SCSI disk
[ 32.050091] md: md127 stopped.
[ 32.051031] md: bind<sdc2>
[ 32.051204] md: bind<sda2>
[ 32.051428] md: bind<sdd2>
[ 32.051567] md: bind<sdb2>
[ 32.106573] raid6: sse2x1 4654 MB/s
[ 32.163234] raid6: sse2x2 7013 MB/s
[ 32.219897] raid6: sse2x4 8279 MB/s
[ 32.219900] raid6: using algorithm sse2x4 (8279 MB/s)
[ 32.219901] raid6: using intx1 recovery algorithm
[ 32.220039] async_tx: api initialized (async)
[ 32.220125] xor: measuring software checksum speed
[ 32.253210] prefetch64-sse: 14538.000 MB/sec
[ 32.286554] generic_sse: 13742.400 MB/sec
[ 32.286556] xor: using function: prefetch64-sse (14538.000 MB/sec)
[ 32.287385] md: raid6 personality registered for level 6
[ 32.287387] md: raid5 personality registered for level 5
[ 32.287389] md: raid4 personality registered for level 4
[ 32.287563] md/raid:md127: device sdb2 operational as raid disk 0
[ 32.287566] md/raid:md127: device sdd2 operational as raid disk 3
[ 32.287567] md/raid:md127: device sda2 operational as raid disk 2
[ 32.287568] md/raid:md127: device sdc2 operational as raid disk 1
[ 32.287817] md/raid:md127: allocated 4332kB
[ 32.287839] md/raid:md127: raid level 5 active with 4 out of 4 devices, algorithm 2
[ 32.287840] RAID conf printout:
[ 32.287841] --- level:5 rd:4 wd:4
[ 32.287842] disk 0, o:1, dev:sdb2
[ 32.287843] disk 1, o:1, dev:sdc2
[ 32.287844] disk 2, o:1, dev:sda2
[ 32.287845] disk 3, o:1, dev:sdd2
[ 32.287861] md127: detected capacity change from 0 to 1500019359744
[ 32.288412] md127: unknown partition table
[ 36.903361] md: md1 stopped.
[ 36.904109] md: bind<sdf>
[ 36.904234] md: bind<sdg>
[ 36.904397] md: bind<sdh>
[ 36.904566] md: bind<sde>
[ 36.905817] md/raid:md1: device sde operational as raid disk 0
[ 36.905820] md/raid:md1: device sdh operational as raid disk 3
[ 36.905821] md/raid:md1: device sdg operational as raid disk 2
[ 36.905822] md/raid:md1: device sdf operational as raid disk 1
[ 36.906038] md/raid:md1: allocated 4332kB
[ 36.906058] md/raid:md1: raid level 5 active with 4 out of 4 devices, algorithm 2
[ 36.906059] RAID conf printout:
[ 36.906060] --- level:5 rd:4 wd:4
[ 36.906061] disk 0, o:1, dev:sde
[ 36.906062] disk 1, o:1, dev:sdf
[ 36.906063] disk 2, o:1, dev:sdg
[ 36.906063] disk 3, o:1, dev:sdh
[ 36.906082] md1: detected capacity change from 0 to 9001374842880
[ 36.931384] md1: unknown partition table
[snip]
~~

After much trial and error I came up with the following solution:
~~
$ /usr/lib/initcpio/hooks # diff -u mdadm mdadm_custom
--- mdadm 2013-11-30 18:49:12.662707310 +0100
+++ mdadm_custom 2013-11-30 18:49:48.755836776 +0100
@@ -5,6 +5,8 @@

# for partitionable raid, we need to load md_mod first!
modprobe md_mod 2>/dev/null
+ echo "Waiting ${rootdelay} seconds for RAID to assemble!"
+ sleep ${rootdelay}

if [ -n "$md" ]; then
echo 'DEVICE partitions' >"$mdconfig"
~~

I'm pretty sure that there is a better way to wait for RAID assembly, but I came up empty, unfortunately.

Additional info:
* package version(s)
linux 3.12.1-3
mkinitcpio 0.15.0-1

* config and/or log files etc.
~~
[root@hadante] /etc # grep -v '^#' mkinitcpio.conf | grep -v '^$'
MODULES=""
BINARIES=""
FILES=""
HOOKS="base udev autodetect modconf block mdadm_custom lvm2_manual filesystems keyboard fsck"
~~
[root@hadante] /lib/initcpio/hooks # cat lvm2_manual
#!/usr/bin/ash

run_hook() {
udevadm settle
lvm vgscan
lvm vgchange -ay
}

# vim: set ft=sh ts=4 sw=4 et:
~~

Steps to reproduce:
[root@hadante] # systemctl reboot

(with 8 SATA-HD + 1 DVD-RW-Writer)

Full dmesg attached.
   dmesg (61.8 KiB)
This task depends upon

Closed by  Gerardo Exequiel Pozzi (djgera)
Sunday, 01 December 2013, 06:39 GMT
Reason for closing:  Not a bug
Comment by Dave Reisner (falconindy) - Saturday, 30 November 2013, 19:56 GMT
Your setup is entirely broken. You appear to be using udev rules *and* mdassemble in your initramfs, which leads to the behavior you see. Notice this line here:

[ 32.050091] md: md127 stopped.

This is after udev had already assembled the device and mdassemble runs. Presumably, since lvm assembly is based on lvmetad and udev events, lvm assembly tries to take place in parallel with mdassemble breaking your already assembled array.

If you can reproduce this with the *actual* mdadm_udev and lvm2 hooks from the Arch Linux packages, then I'm interested.
Comment by Arno (ihad) - Saturday, 30 November 2013, 21:58 GMT
Hmm, on the first reboot my hooks were (no customization):
~~
HOOKS="base udev autodetect modconf block mdadm_udev lvm2 filesystems keyboard fsck"
~~
with an explicit entry for md127 in /etc/mdadm.conf. No entry for md1, but md1 is not needed for booting, just hosting Virtual machines.
I got a rescue shell after ${rootdelay} seconds.
After
~~
$ lvm vgscan
$ lvm vgchange -ay
$ mount /dev/mapper/raid-root /new_root
$ exit
(some error about job control and that booting would most likely fail)
$ exit
~~
the system booted.

mdadm is a run_hook, lvm2 is a run_earlyhook, which starts lvmetad. If lvmetad is supposed to "vgscan" everything, does it wait until the RAIDS are assembled? Please note that it takes 20+ seconds...
If not, it's the wrong way around (in my case): First assemble the RAIDS, then lvm vgscan.

Long story short:
On the first reboot I got a rescue shell with the actual, original archlinux hooks (mdadm_udev before lvm2) with an explicit /etc/mdadm.conf generated by mdadm --detail --scan for _only_ md127.
Comment by Dave Reisner (falconindy) - Sunday, 01 December 2013, 05:20 GMT
> HOOKS="base udev autodetect modconf block mdadm_udev lvm2 filesystems keyboard fsck"
But this doesn't line up with what I point out to be problematic... Please attach your whole /etc/mkinitcpio.conf and the dmesg output from a failed boot with it.
Comment by Arno (ihad) - Sunday, 01 December 2013, 06:19 GMT
Hmm, I guess you were right saying that my setup was broken.
I reinstalled mkinitcpio, regenerated the initrd and now it just works with this:
~~
HOOKS="base udev autodetect modconf block mdadm_udev lvm2 filesystems keyboard fsck"
~~

No idea what went wrong in the first place. Anyhow, thanks for your help, and sorry for bothering you!

Loading...