FS#32403 - [lvm2] lvm-monitoring.service seems to start too early (sometimes fails, group not ready)

Attached to Project: Arch Linux
Opened by Andrzej Giniewicz (Giniu) - Friday, 02 November 2012, 18:05 GMT
Last edited by Gerardo Exequiel Pozzi (djgera) - Wednesday, 12 February 2014, 01:59 GMT
Task Type: Bug Report
Category: Packages: Core
Status: Closed
Assigned To: Thomas Bächler (brain0)
Architecture: All
Severity: Medium
Priority: Normal
Reported Version:
Due in Version: Undecided
Due Date: Undecided
Percent Complete: 100%
Votes: 0
Private: No

Details

Description:

It seems that (in lvm2 from testing) the lvm-monitoring.service sometimes starts too early:

http://pastebin.com/9mgp67eX (I kept only the parts I believe are interesting; I can submit the full log if needed),

but if I start it manually after boot, it works:

Nov 02 18:57:49 gauss systemd[1]: Starting Monitoring of LVM2 mirrors, snapshots etc. using dmeventd or progress polling...
Nov 02 18:57:49 gauss lvm[672]: 4 logical volume(s) in volume group "lvm" monitored
Nov 02 18:57:49 gauss systemd[1]: Started Monitoring of LVM2 mirrors, snapshots etc. using dmeventd or progress polling.

Closed by Gerardo Exequiel Pozzi (djgera)
Wednesday, 12 February 2014, 01:59 GMT
Reason for closing: No response
Additional comments about closing: No response in more than one year.
Comment by Andrzej Giniewicz (Giniu) - Friday, 02 November 2012, 19:07 GMT
Btw, so far it has happened only once (the machine has had 6 reboots for testing), so for now it belongs in the "random features" group :)
Comment by Thomas Bächler (brain0) - Friday, 02 November 2012, 19:10 GMT
I am unsure what exactly to wait for; I need to compare with the broken upstream unit files again.
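(For illustration only: if a suitable unit to wait for were identified, the ordering could be expressed by copying the shipped unit and adding an After= line. The target named below is an assumption; which unit to order against is exactly the open question here.)

cp /usr/lib/systemd/system/lvm-monitoring.service /etc/systemd/system/
# then edit /etc/systemd/system/lvm-monitoring.service and, under [Unit], add e.g.:
#   After=lvmetad.service    (hypothetical target; the correct one still needs to be determined)
# and reload the unit files:
systemctl daemon-reload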
Comment by Andrzej Giniewicz (Giniu) - Friday, 02 November 2012, 19:25 GMT
I still don't quite get it either. Right now it booted cleanly (as I said, it's random), and in the log I see that this time the service started even earlier but worked. Maybe the reason was different back then? I will upload the whole log of one working and one failed case a little later (I hope to get to it today); maybe it will help.
Comment by Andrzej Giniewicz (Giniu) - Thursday, 08 November 2012, 10:21 GMT
Ok, I think I have something. I noticed that not all volumes are always mounted. My setup is as follows: /dev/md1 is a mirroring RAID, and inside it there is an LVM group with a set of volumes - the 4 starting with dom0-* are for the host, the others starting with <name>-* are for virtual machines. I had two boots where one of the volumes for a virtual machine was not ready. This is what started to appear in the log after a recent update:

Nov 08 11:10:25 gauss systemd-udevd[257]: timeout: killing '/usr/sbin/lvm pvscan --cache --activate ay --major 9 --minor 1' [380]
Nov 08 11:10:26 gauss systemd-udevd[192]: worker [257] /devices/virtual/block/md1 timeout; kill it
Nov 08 11:10:26 gauss systemd-udevd[192]: seq 2026 '/devices/virtual/block/md1' killed
Nov 08 11:10:26 gauss systemd-udevd[192]: worker [257] terminated by signal 9 (Killed)

Now, I had boots with only the 4 dom0-* devices, boots with 5 devices (dom0-* plus one other), and boots with all 6. It might be that lvm-monitoring is not at fault here. Any idea how to proceed with this one? I cannot pinpoint the exact moment when this message started to show up, but it wasn't long ago, maybe a few days.
Comment by Thomas Bächler (brain0) - Thursday, 08 November 2012, 12:12 GMT
The udev timeout is very bad. It means that either pvscan or lvmetad locks up. For clarity, can I see your 'lsblk' output? Can you also run 'journalctl --no-pager -a --since="2012-11-08" _SYSTEMD_UNIT=lvmetad.service + _SYSTEMD_UNIT=lvmetad.socket + _SYSTEMD_UNIT=systemd-udevd.service'? I hope there will be some output.
Comment by Andrzej Giniewicz (Giniu) - Thursday, 08 November 2012, 13:05 GMT
So, this is the journalctl output, but with a larger date range - not just today, but since January (the whole history): http://pastebin.com/VddeTwJG
And here is the lsblk output: http://pastebin.com/VuHbNpkw

Those extra, non-dom0-* LVs were created by libvirt. This output is from after I ran vgchange -ay from the console.

When I boot and the detection hangs, lsblk looks like this: http://pastebin.com/u07Q4327
Comment by Thomas Bächler (brain0) - Thursday, 08 November 2012, 13:20 GMT
Okay, there is stuff going on that I don't understand.

First of all, the journal output suggests two different types of failures.

Second, are there any other udev rules in play besides the one from the LVM package? Are these simply LVM snapshots, or is there more going on? (The lsblk output is really confusing to me.) Maybe there is a problem with monitoring after all: can you disable lvm-monitoring.service, start it manually once the system has booted, and see if the problem disappears?
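(A minimal sketch of the test being requested, using standard systemctl commands:)

systemctl disable lvm-monitoring.service
# reboot, then start monitoring by hand and check whether the timeout still occurs:
systemctl start lvm-monitoring.service
systemctl status lvm-monitoring.service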
Comment by Andrzej Giniewicz (Giniu) - Thursday, 08 November 2012, 14:01 GMT
Disabling monitoring did not help. Maybe the output of lvdisplay will help: http://pastebin.com/u83xRU5y - it seems it is a snapshot:

LV snapshot status active destination for sage-fs_vorigin

and it has a COW table enabled:

LV Size 100,00 GiB
Current LE 25600
COW-table size 20,00 GiB
COW-table LE 5120

It probably looks like this because the virtual disks in libvirt were created with the option to preallocate 20 GB but reserve up to 100 GB.
Comment by Thomas Bächler (brain0) - Thursday, 08 November 2012, 15:31 GMT
So, this is all set up by LVM internally, which means it should work fine.

Can you determine what triggers this failure? You say this works most of the time. Also, these volumes should be activated in initramfs already (otherwise you couldn't boot from /), so why do these failures appear so late in the boot process?

Another idea: can you get more debug output? Copy /usr/lib/systemd/system/lvmetad.service to /etc/systemd/system/ and add '-l all' to the arguments of lvmetad. Also, copy /usr/lib/udev/rules.d/69-dm-lvm-metad.rules to /etc/udev/rules.d/ and add the '-d' option to the pvscan command. This should generate much more output.
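(A sketch of those steps as shell commands; the edits themselves still have to be made by hand in the copied files:)

cp /usr/lib/systemd/system/lvmetad.service /etc/systemd/system/
# edit /etc/systemd/system/lvmetad.service: append '-l all' to the lvmetad ExecStart line
cp /usr/lib/udev/rules.d/69-dm-lvm-metad.rules /etc/udev/rules.d/
# edit /etc/udev/rules.d/69-dm-lvm-metad.rules: add '-d' to the pvscan invocation in the rule
systemctl daemon-reload
udevadm control --reload-rules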
Comment by Andrzej Giniewicz (Giniu) - Thursday, 08 November 2012, 20:07 GMT
I added -d and -l all, but there is no change in the information there. I also enabled activation.checks and activation.verify_udev_operations - but no luck. When I set auto_activation_volume_list to "lvm/dom0-fs" "lvm/dom0-root" "lvm/dom0-swap" "lvm/dom0-home", the message disappears, but the side effect is that the virtual machines no longer see their disks.
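(For reference, that setting lives in the activation section of /etc/lvm/lvm.conf; roughly:)

activation {
    # auto-activate only the host volumes; anything not listed (the libvirt LVs)
    # is skipped, which is why the virtual machines stop seeing their disks
    auto_activation_volume_list = [ "lvm/dom0-fs", "lvm/dom0-root", "lvm/dom0-swap", "lvm/dom0-home" ]
}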

Doing vgchange -ay results in
6 logical volume(s) in volume group "lvm" now active

but takes quite a long time:

time vgchange -ay
6 logical volume(s) in volume group "lvm" now active

real 0m26.103s
user 0m0.017s
sys 0m0.297s

This timing does not look good; it seems related.
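(One way to narrow this down might be to deactivate and re-activate the volumes one at a time and see which activation accounts for the delay; a sketch, with the LV name taken for illustration from the lvdisplay excerpt above:)

lvchange -an lvm/sage-fs         # deactivate one of the (unmounted) libvirt volumes
time lvchange -ay lvm/sage-fs    # time its re-activation; repeat per LV to find the slow one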
Comment by Thomas Bächler (brain0) - Thursday, 08 November 2012, 22:13 GMT
Alright, it seems related to activating those volumes. I think it may be related to a problem with dmeventd. What do systemctl status dmeventd.socket and systemctl status dmeventd.service say? What does journalctl say about dmeventd.service and dmeventd.socket?
Comment by Andrzej Giniewicz (Giniu) - Friday, 09 November 2012, 07:58 GMT
Output of status commands: http://pastebin.com/XHKyjegR
Those messages about monitoring/not being able to close are from when I disabled lvm-monitoring.service and rebooted. Anyway, checking the dmeventd info for a single boot:

journalctl -b --no-pager _SYSTEMD_UNIT=dmeventd.service + _SYSTEMD_UNIT=dmeventd.socket

gives absolutely no output (other than "logs begin at ...")
Comment by Andrzej Giniewicz (Giniu) - Thursday, 29 November 2012, 09:33 GMT
I decided to try another approach: I got rid of the volumes created by libvirt (and created one regular volume instead, to hold disk images - I have to bring this machine up and had no idea what else I could try). I also hoped that some other errors would show up then, and that identifying them (if any) could help with the original issue. I no longer have trouble starting the service, but I do see this (fresh boot):

[root@gauss ~]# journalctl -b --no-pager _SYSTEMD_UNIT=systemd-udevd.service
-- Logs begin at czw, 2012-09-20 11:03:36 CEST, end at czw, 2012-11-29 10:18:15 CET. --
lis 29 10:17:55 gauss systemd-udevd[245]: conflicting device node '/dev/mapper/lvm-dom0--fs' found, link to '/dev/dm-0' will not be created
lis 29 10:17:55 gauss systemd-udevd[256]: conflicting device node '/dev/mapper/lvm-dom0--root' found, link to '/dev/dm-2' will not be created
lis 29 10:17:55 gauss systemd-udevd[257]: conflicting device node '/dev/mapper/lvm-dom0--swap' found, link to '/dev/dm-3' will not be created
lis 29 10:17:55 gauss systemd-udevd[258]: conflicting device node '/dev/mapper/lvm-dom0--virtuals' found, link to '/dev/dm-4' will not be created
lis 29 10:17:56 gauss systemd-udevd[254]: conflicting device node '/dev/mapper/lvm-dom0--home' found, link to '/dev/dm-1' will not be created

I noticed that those /dev/dm-* nodes are present in the system; it looks like they are created earlier:

[root@gauss ~]# journalctl -b --no-pager | grep dm-0
lis 29 10:17:54 gauss kernel: EXT4-fs (dm-0): mounting ext3 file system using the ext4 subsystem
lis 29 10:17:54 gauss kernel: EXT4-fs (dm-0): mounted filesystem with ordered data mode. Opts: (null)
lis 29 10:17:54 gauss kernel: EXT4-fs (dm-0): re-mounted. Opts: data=ordered
lis 29 10:17:55 gauss systemd-udevd[245]: conflicting device node '/dev/mapper/lvm-dom0--fs' found, link to '/dev/dm-0' will not be created

Anyway, what do you think - could this somehow cause issues when, instead of only mounting a device, LVM also has to set up a snapshot and its origin, as with the devices created by libvirt?
Comment by Thomas Bächler (brain0) - Thursday, 29 November 2012, 09:43 GMT
Sorry that I haven't been able to investigate this further recently.

The errors you are seeing should not happen, and I haven't seen them with non-broken LVM builds for years. Other than a broken or very outdated initramfs (especially outdated lvm/device-mapper tools), I have no idea where these errors would come from.
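(If an outdated initramfs were the cause, regenerating it on Arch is straightforward; a minimal sketch, assuming the default 'linux' preset:)

mkinitcpio -p linux    # rebuild the initramfs so it includes the currently installed lvm2/device-mapper tools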
