FS#41833 - [lvm2] lvmetad in initrd hangs, blocking further lvm calls
Attached to Project:
Arch Linux
Opened by Bastien Durel (bastiendurel) - Friday, 05 September 2014, 10:08 GMT
Last edited by Christian Hesse (eworm) - Wednesday, 10 February 2021, 09:34 GMT
Details
Description: lvmetad in the initrd hangs, preventing
lvm2-lvmetad.service from starting and therefore causing most
LVM-related tasks (such as pvscan on boot) to hang forever
(pvscan at boot is backgrounded after 90s).
I first described the problem here: https://unix.stackexchange.com/questions/153705/lvmetad-hangs-on-startup
In short: lvmetad is not killed by the run_cleanuphook (kill $(cat /run/lvmetad.pid)), and it then prevents systemd from starting its own instance. If I later kill this early process (with -9, or it won't work), lvm2-lvmetad.service starts, the pvscans finish, etc.
Additional info:
* package version(s): 2.02.111-1
mkinitcpio.conf: HOOKS="base udev autodetect modconf block lvm2 filesystems keyboard fsck"
Here is the gdb trace: http://pastebin.com/4mpNBL8t
fd 6 is the socket:
lvmetad 60 root 6u unix 0xffff880212251f80 0t0 8269 /run/lvm/lvmetad.socket
Steps to reproduce: I can't say; I only have this problem on one machine, which has an SSD and an HDD. The other machines work, but they don't have the same disk layout.
This task depends upon
Closed by Christian Hesse (eworm)
Wednesday, 10 February 2021, 09:34 GMT
Reason for closing: None
Additional comments about closing: lvm2 2.03.11-3
Out of curiosity, why do you run lvmetad in the initrd?
Phrased differently, where does this pvscan call come from? It isn't called from any of the hooks you mention in your config.
The pvscan processes I'm talking about are run by systemd after the pivot (actually, I don't know about the pvscan with pid 117).
The pvscans that hang on boot are launched by lvm2-pvscan@.service.
lvmetad in the initrd is launched by /usr/lib/initcpio/hooks/lvm2.
initramfs: lvmetad starts
initramfs: udevadm trigger, hardware setup, root is mounted, etc
initramfs: lvmetad is told to quit, and it doesn't do so immediately
--- switch_root ---
systemd: lvm-ish stuff happens (all according to the upstream units)
If this is correct, then it's upstream which is enforcing usage of lvmetad *and* pvscan at the same time. A udev rule (69-dm-lvm-metad.rules) generates a SYSTEMD_WANTS clause for devices, and these run in later userspace (nothing to do with early userspace since systemd isn't present in this user's configuration). This rule looks pretty wrong in several ways...
root 117 0.0 0.0 25924 5064 ? Ss 09:26 0:00 (lvm2) pvscan --background --cache --activate ay --major 8 --minor 21
It must have been launched in the initrd:
- the lowest PID I know of otherwise is systemd-modules-load, at 255.
- lsof on process 117 shows all opened libraries as deleted: http://pastebin.com/uNCuHnzY
- the process started at 09:26:16, 2 seconds before systemd.
@Thomas: why did we add this? Are we just trying to replicate the upstream systemd-run-based equivalent?
To clear this up, this is how things happen:
* lvm2 early hook starts lvmetad.
* udev hook triggers uevents.
* udev launches all those pvscan commands (compare /usr/lib/initcpio/udev/69-dm-lvm-metad.rules to /usr/lib/udev/rules.d/69-dm-lvm-metad.rules).
* /init finds the root device and mounts it.
* lvm2 late hook sends SIGTERM to lvmetad (lvmetad does not exit according to this bug report).
The thing is, this is all solved perfectly with systemd, since everything is guaranteed to be killed properly before it switches root - only the old style sh-based initramfs shows these symptoms. The only possible solution I see is making the lvm2 late hook go on a killing spree and nuke all pvscan and lvmetad processes with SIGKILL in a loop until none remains. This is overly complex compared to the clean systemd-based solution.
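For illustration only, a rough sketch of what that killing spree could look like as a cleanup hook (the loop bound, the 0.1 s delay, and the availability of pgrep/pkill in the initramfs are all assumptions, not a tested fix):

run_cleanuphook() {
    # Sketch: repeatedly SIGKILL leftover pvscan and lvmetad processes until
    # none remain, so nothing survives into the real root. Bounded so a stuck
    # process cannot make the hook itself loop forever (50 x 0.1 s = 5 s).
    local i=0
    while [ "$i" -lt 50 ]; do
        if [ -z "$(pgrep -f pvscan)" ] && [ -z "$(pgrep -f lvmetad)" ]; then
            break
        fi
        pkill -9 -f pvscan 2>/dev/null
        pkill -9 -f lvmetad 2>/dev/null
        sleep 0.1
        i=$((i + 1))
    done
}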
The following is rough, but it seems to me lvm2 currently doesn't have a late hook, only a cleanup hook.
What if we do something like this:
run_latehook() {
    # ask lvmetad to exit once the root device is mounted
    kill $(cat /run/lvmetad.pid)
}

run_cleanuphook() {
    # give it a moment to shut down, then force-kill it if it is still around
    sleep 0.2
    if [ -e "/run/lvmetad.pid" ]; then
        kill -9 $(cat /run/lvmetad.pid)
    fi
}
I looked at the lvmetad source: the parent process waits in a select call with a timeout of 0.25 seconds, so while the above is ugly
and would add 0.2 s to everyone's boot time, I think it should work most of the time.
PS: Sorry, it seems that refreshing a Flyspray page happily re-submits old comments. What the hell?
Today I had to wait 1m30 during boot because the early lvmetad wasn't shut down before switching root. It was eventually killed by systemd and the service was started, so LVM works in the running session, but the swap partition that lives on LVM was not mounted (timeout).
sept. 09 09:09:58 localhost systemd[1]: Starting LVM2 metadata daemon socket.
sept. 09 09:09:58 localhost systemd[1]: Listening on LVM2 metadata daemon socket.
sept. 09 09:09:58 localhost systemd[1]: Starting LVM2 metadata daemon...
sept. 09 09:09:58 localhost systemd[1]: Started LVM2 metadata daemon.
[...]
sept. 09 09:09:58 localhost kernel: sd 0:0:0:0: [sda] Attached SCSI disk
sept. 09 09:09:58 localhost kernel: sdb: sdb1 sdb2 < sdb5 >
sept. 09 09:09:58 localhost kernel: sd 1:0:0:0: [sdb] Attached SCSI disk
sept. 09 09:09:58 localhost systemd[1]: Starting system-lvm2\x2dpvscan.slice.
sept. 09 09:09:58 localhost systemd[1]: Created slice system-lvm2\x2dpvscan.slice.
sept. 09 09:09:58 localhost systemd[1]: Starting LVM2 PV scan on device 8:5...
sept. 09 09:09:59 localhost systemd[1]: Starting LVM2 PV scan on device 8:21...
[...]
sept. 09 09:09:59 localhost lvm[112]: 2 logical volume(s) in volume group "ubuntu" now active
sept. 09 09:09:59 localhost systemd[1]: Found device /dev/mapper/ubuntu-archroot.
sept. 09 09:09:59 localhost systemd[1]: Started LVM2 PV scan on device 8:5.
sept. 09 09:09:59 localhost systemd[1]: Starting File System Check on /dev/mapper/ubuntu-archroot...
sept. 09 09:09:59 localhost systemd-fsck[131]: /dev/mapper/ubuntu-archroot: clean, 511127/2949120 files, 6309850/11796480 blocks
sept. 09 09:09:59 localhost lvm[116]: 20 logical volume(s) in volume group "space" now active
sept. 09 09:09:59 localhost systemd[1]: Started File System Check on /dev/mapper/ubuntu-archroot.
sept. 09 09:09:59 localhost systemd[1]: Mounting /sysroot...
sept. 09 09:09:59 localhost systemd[1]: Mounted /sysroot.
sept. 09 09:09:59 localhost systemd[1]: Starting Initrd Root File System.
sept. 09 09:09:59 localhost kernel: EXT4-fs (dm-0): mounting ext3 file system using the ext4 subsystem
sept. 09 09:09:59 localhost systemd[1]: Reached target Initrd Root File System.
[...]
sept. 09 09:09:59 localhost systemd[1]: Stopping LVM2 PV scan on device 8:21...
sept. 09 09:09:59 localhost systemd[1]: Stopping LVM2 metadata daemon...
sept. 09 09:09:59 localhost systemd[1]: Stopping LVM2 PV scan on device 8:5...
[...]
sept. 09 09:10:02 localhost systemd[1]: Starting Switch Root...
sept. 09 09:10:02 localhost systemd[1]: Switching root.
[...]
sept. 09 09:11:32 data-dev4 systemd[1]: lvm2-lvmetad.service stop-sigterm timed out. Killing.
sept. 09 09:11:32 data-dev4 systemd[1]: lvm2-pvscan@8:21.service stop-sigterm timed out. Killing.
sept. 09 09:11:32 data-dev4 systemd[1]: lvm2-lvmetad.service: main process exited, code=killed, status=9/KILL
sept. 09 09:11:32 data-dev4 systemd[1]: Stopped LVM2 metadata daemon.
sept. 09 09:11:32 data-dev4 systemd[1]: Unit lvm2-lvmetad.service entered failed state.
sept. 09 09:11:32 data-dev4 systemd[1]: lvm2-pvscan@8:21.service: main process exited, code=killed, status=9/KILL
sept. 09 09:11:32 data-dev4 systemd[1]: Stopped LVM2 PV scan on device 8:21.
sept. 09 09:11:32 data-dev4 systemd[1]: Unit lvm2-pvscan@8:21.service entered failed state.
sept. 09 09:11:32 data-dev4 systemd[1]: Job dev-disk-by\x2duuid-162f9ac5\x2d33c6\x2d4d2d\x2d8e76\x2d1e2873d6a75a.device/start timed out.
sept. 09 09:11:32 data-dev4 systemd[1]: Timed out waiting for device dev-disk-by\x2duuid-162f9ac5\x2d33c6\x2d4d2d\x2d8e76\x2d1e2873d6a75a.device.
sept. 09 09:11:32 data-dev4 systemd[1]: Dependency failed for /dev/disk/by-uuid/162f9ac5-33c6-4d2d-8e76-1e2873d6a75a.
sept. 09 09:11:32 data-dev4 systemd[1]: Dependency failed for Swap.
[...]
sept. 09 09:11:32 data-dev4 systemd[1]: Started Login Service.
sept. 09 09:11:32 data-dev4 systemd[1]: Starting LVM2 metadata daemon...
sept. 09 09:11:32 data-dev4 systemd[1]: Started LVM2 metadata daemon.
sdb is a 1 TB hard disk
sept. 10 08:58:05 localhost kernel: scsi 1:0:0:0: Direct-Access ATA WDC WD10EZRX-00A 1A01 PQ: 0 ANSI: 5
sept. 10 08:58:05 localhost kernel: sd 1:0:0:0: [sdb] 1953525168 512-byte logical blocks: (1.00 TB/931 GiB)
sept. 10 08:58:05 localhost kernel: sd 1:0:0:0: [sdb] 4096-byte physical blocks
sept. 10 08:58:05 localhost kernel: sd 1:0:0:0: [sdb] Write Protect is off
sept. 10 08:58:05 localhost kernel: sd 1:0:0:0: [sdb] Mode Sense: 00 3a 00 00
sept. 10 08:58:05 localhost kernel: sd 1:0:0:0: [sdb] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
sept. 10 08:58:05 localhost kernel: sdb: sdb1 sdb2 < sdb5 >
sept. 10 08:58:05 localhost kernel: sd 1:0:0:0: [sdb] Attached SCSI disk
$ cat /proc/cmdline
BOOT_IMAGE=/arch/vmlinuz-linux root=/dev/mapper/ubuntu-archroot rw break=premount
Is there another way?
The only way the "break" param is honored is without systemd.
If you want to apply specific settings to LVM volumes, you must use something like this:
SUBSYSTEM=="block", ACTION=="add|change", ENV{DM_UDEV_RULES_VSN}=="?*", ENV{DM_UDEV_DISABLE_OTHER_RULES_FLAG}!="1", ENV{DM_NAME}=="space-vm*", GROUP="bastien"
The match on DM_UDEV_RULES_VSN and DM_UDEV_DISABLE_OTHER_RULES_FLAG is important.
Try the following: Revert back to base+udev+lvm2 initramfs, but add BINARIES="strace". Then use break=postmount and look for any "pvscan" processes that did not exit. Attach to them using strace -p $PID and see if you get something useful.
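In concrete terms, the suggested debugging session would look roughly like this (the break point and the BINARIES line are as described above; <PID> stands for whatever ps reports):

# /etc/mkinitcpio.conf: include strace in the image, then rebuild the initramfs
BINARIES="strace"

# boot with break=postmount on the kernel command line, then in the break shell:
ps | grep '[p]vscan'     # list any pvscan processes that did not exit
strace -p <PID>          # attach and see which syscall it is blocked in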
When breaking at postmount, I did not find any pvscan process in the ps list (I tried twice).
Editing my rules as brain0 suggested leads to another hang.
After a lot of poking around I'm pretty sure I can see what is going on. lvmetad is started, then udev is started and udev trigger is run. This kicks off the discovery of the disks, the MD arrays, and ultimately the LVM VGs. This is done by the following command from /lib/initcpio/udev/69-dm-lvm-metad.rules:
/usr/bin/lvm pvscan --background --cache --activate ay --major $major --minor $minor
Note the "--background" switch. This means they are started and udev goes on with its life. Udev trigger returns, and udev settle is called. If this happens to be called and finish before one of the pvscans gets to actually registering lvs, the udev hook can actually cause udev to exit WHILE it is processing lvs.
This is bad enough, however the way pvscan works, it creates and holds a semaphore across the LV activation. This semaphore is normally freed by 95-dm-notify.rules however udev is gone, this rule never fires to pvscan hangs waiting for the semaphore to be released.
Then lvmetad hangs because its waiting for pvscan to finish. The other aggravating thing about this bug is when this happens, half my LVs (I have a lot of them, the machine is a VM server) are missing.
Possible fix is just to remove "--background" from this pvscan invocation, I'm not really sure why its there, I'd think we WANT to wait for pvscan to finish.
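If you want to confirm the stuck-semaphore theory on a hung system, device-mapper ships tooling for inspecting and releasing those udev cookies; a diagnostic sketch, not a fix (releasing cookies by hand defeats the synchronisation they provide):

# list outstanding device-mapper udev cookies (the semaphores pvscan waits on)
dmsetup udevcookies

# the underlying SysV semaphores are also visible here
ipcs -s

# release all outstanding cookies; the waiting pvscan should then return
dmsetup udevcomplete_all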
Removing --background is certainly not what the lvm guys want; they even delegate pvscan to a systemd service and return instantly when systemd is running. In the systemd case, we could likely solve this problem by delaying the root mount until all pvscan@.service instances have finished. For the shell case, we can wait for all pvscan processes to finish in a late hook.
systemd won't have the issue because it kills everything before switching to the real root, including the hung pvscans and the hung lvmetad. It will still have the issue of missing LVs, though (one of which could, at least theoretically, be the root).
And yes, waiting for the pvscan processes (or systemd units) to finish before stopping udev would also solve the issue. However, isn't that more or less the same as removing the background flag? Six of one, half a dozen of the other.
It tries to stop everything, but as pvscan hangs, it waits 90s before killing it.
while [ "$(pgrep -f pvscan)" ]; do
sleep 0.1
done
This seems to fix the issue, as it waits for all the LVM processes to finish before going on to stop udev. I had to stick it in the udev hook instead of the lvm hook because initcpio doesn't seem to have a hook phase that runs after udev trigger but before the root mount. Late hooks and cleanup hooks are too late, and the main hook is possibly too early, since I'm not sure how to ensure it runs after the udev hook.
I don't know what the "official" fix for this would look like (sticking lvm stuff in the udev hook is definitely ugly), but this shows the general idea and proves that it solves the issue. For the shell-script case, anyway; solving the issue for the systemd case is another matter.
The late hook does seem fine for the udev interactions, yes.
"Note that when two units with an ordering dependency between them are shut down, the inverse of the start-up order
is applied. i.e. if a unit is configured with After= on another unit, the former is stopped before the latter if both are shut down."
EDIT: However, it may also be necessary to use Requires=systemd-udevd.service or BindsTo=systemd-udevd.service, so that
the pvscan instance is actually made to stop in the same transaction as udevd.
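On the unit side, that could be expressed with a drop-in for the pvscan template; a hedged sketch for the units in the real root (the drop-in file name is just an example, and getting the same drop-in into the initramfs image is a separate exercise):

# bind every pvscan instance to udevd so both stop in the same transaction
mkdir -p /etc/systemd/system/lvm2-pvscan@.service.d
cat > /etc/systemd/system/lvm2-pvscan@.service.d/udev-bind.conf <<'EOF'
[Unit]
After=systemd-udevd.service
BindsTo=systemd-udevd.service
EOF
systemctl daemon-reload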
Today's upgrade removed the workaround I put in /usr/lib/initcpio/hooks/udev, and I ran into the emergency shell on reboot.
I'm having the same problem; my config is / on an SSD and /home on an HDD.
I put the fix in /usr/lib/initcpio/hooks/udev and it seems to have solved the problem.
Thanks for this (temporary) solution.
I did my first install of Arch last night and it went ok.
With my second install on different hardware, I bumped into this problem.
I see that there's a workaround available, but it's for the non-systemd case.
I can't seem to find how to switch init systems, since systemd is the only officially supported init system. Can someone please tell me which init system to use to employ the workaround, and how to activate it / disable systemd?
Thank you.
In /etc/mkinitcpio.conf, you must change your HOOKS list:
HOOKS="base udev autodetect modconf block lvm2 filesystems keyboard fsck" is a hook list for script
HOOKS="base systemd autodetect modconf block sd-lvm2 filesystems keyboard fsck" is a hook list for systemd
see https://wiki.archlinux.org/index.php/mkinitcpio#HOOKS
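Whichever hook list you choose, regenerate the initramfs afterwards, e.g. (assuming the stock linux preset):

# rebuild the initramfs after editing /etc/mkinitcpio.conf
mkinitcpio -p linux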
I don't know why, but lately I've had this timeout issue even with the workaround... Sometimes it won't even work after a cold boot. It really is pretty random.
Is there any other workaround to try? Even if it slows down the entire boot process, it will be better than waiting 90 seconds.
Edit: removing the --background flag seems to work here too, but looks sub-optimal, as a system update will overwrite it.
RUN+="/usr/bin/lvm pvscan --cache --background --activate ay --major $major --minor $minor", ENV{LVM_SCANNED}="1"
to
RUN+="/usr/bin/lvm pvscan --cache --activate ay --major $major --minor $minor", ENV{LVM_SCANNED}="1"
I finally found the right combination of search keywords, which led me to this. I have an SSD with two partitions: one for the boot v32, the other an LVM partition with /root and /home. I also have an HDD with one big 3 TB partition on LVM with /var and /data. It only hangs on the HDD. At first I thought it was the size of the HDD, but then I saw this.
I'm on 3.19 as well and I can confirm that the workaround mentioned above does not work either.
I can confirm that the fix you provided works. Yes, hacky, but it fixes the core issue. Thank you for that.
I'm going back to @kressb's patch, which worked before.
Maybe I'm being too literal, but did you replace "base udev" with "systemd" or did you mean "base udev" with "base systemd"?
I replaced "base udev" with "systemd", as the comment suggested.
Did you also change the "lvm2" hook to "sd-lvm2" when using "systemd"?
Personally I modified the udev hook with
while [ "$(pgrep -f pvscan)" ]; do
sleep 0.1
done
And it's been working great.
No, I haven't tried that. I rebuilt it with the udev hook patch (without systemd) and it's working.
In /etc/mkinitcpio.conf
replace in HOOKS=...
udev -> systemd
lvm2 -> sd-lvm2
so that for example
HOOKS="base udev autodetect modconf block lvm2 filesystems keyboard fsck"
becomes
HOOKS="base systemd autodetect modconf block sd-lvm2 filesystems keyboard fsck"
Both have to be replaced!
my packages:
linux 3.19.3-3
systemd 219-5
core/linux 4.0.1-1
core/systemd 219-6
Fortunately switching to systemd/sd-lvm2 hooks seems to have solved it.
On the LVM partition declared in your /etc/fstab, you can add nofail,x-systemd.device-timeout=XX
The boot will no longer wait for the nofail partition to be mounted, but when it fails you will end up with an unmounted LVM partition that you can then activate with vgchange -ay.
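For example, a swap entry along these lines (the LV name is only an illustration):

# /etc/fstab: don't block boot on this LV; give up waiting after 10 seconds
/dev/mapper/space-swap  none  swap  defaults,nofail,x-systemd.device-timeout=10s  0  0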
So in my eyes there are two issues here. The first one is the "wait for pvscan", which I solved as suggested with a late hook in lvm2 (not in the udev hook):
run_latehook() {
    while [ "$(pgrep -f pvscan)" ]; do
        sleep 0.2
    done
}
The second one is, in my eyes, unexpected behavior of lvmetad.
I looked into this: if lvmetad isn't responding on the socket, all reads
(lvs, pvs, etc.) wait indefinitely.
I looked at this with strace and could see a connect to the socket, a write of "hello", and a read on the socket... it never returns.
I think there should be a timeout on the read from the socket, so maybe I should open a bug upstream?
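For reference, the hang is easy to observe from any LVM read command while lvmetad is wedged; a minimal example (nothing here is specific to this setup):

# pvs connects to /run/lvm/lvmetad.socket, writes its hello handshake,
# then blocks forever in read() because the daemon never answers
strace -f -e trace=connect,write,read pvs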
@Snowman / @brain0
As this ticket has been open for such a long time, I'd like to ask: is the suggested wait loop here a solution, and could it be implemented in the lvm2 late hook?
Do these packages work for you? Both are signed, so you can just install with `pacman -U`.
http://pkgbuild.com/~eworm/lvm2/device-mapper-2.02.160-1.1-x86_64.pkg.tar.xz
http://pkgbuild.com/~eworm/lvm2/lvm2-2.02.160-1.1-x86_64.pkg.tar.xz
Does anybody have objections adding something like this to the hook?
Maybe the timeout could be reduced from the current 30s (150) to 10s (50)? (analogous to the encrypt hook's drive-detect timeout?)
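For clarity, a sketch of what that bounded wait might look like, assuming the loop sleeps 0.2 s per iteration (so 150 iterations is roughly 30 s and 50 is roughly 10 s):

run_latehook() {
    # wait for backgrounded pvscan processes, but give up after ~10 s
    local tries=0
    while [ "$(pgrep -f pvscan)" ] && [ "$tries" -lt 50 ]; do
        sleep 0.2
        tries=$((tries + 1))
    done
}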
Updated to lvm2 2.02.160-2 in [testing], which includes the late hook with loop.
A "fix" could be setting the wait count in mkinitcpio.conf or something like that, but I'll fully admit to this case looking really edgy.