FS#56438 - [linux] LVM fails on reboot, works fine on cold start
Attached to Project:
Arch Linux
Opened by Martin Dratva (raqua) - Friday, 24 November 2017, 22:13 GMT
Last edited by freswa (frederik) - Thursday, 10 September 2020, 13:03 GMT
Details

Description:
Recently my machine started to have a strange issue. Everything works fine on a cold boot, but on reboot I get this:

A start job is running for dev-disk-by ...
A stop job is running for LVM2 PV scan on device ....

See the pictures: https://ibb.co/huh386 https://ibb.co/cYyX1R

Those two messages alternate, and after the timeout (I have set it to 30 sec. instead of the default 1:30) I get this screen: https://ibb.co/fNxQMR

Again, if I log in and run "reboot" I get back to the same situation. If I run "poweroff" and then start the PC again, there is no issue at all; it boots perfectly fine. I searched for these messages, but other people seem to hit the problem consistently, regardless of whether it is a cold start or a reboot.

My fstab:

# <file system> <dir> <type> <options> <dump> <pass>
shm /dev/shm tmpfs nodev,nosuid 0 0
tmpfs /tmp tmpfs nodev,nosuid,size=4G 0 0
/dev/sr0 /media/cd auto ro,user,noauto,unhide 0 0
/dev/dvd /media/dvd auto ro,user,noauto,unhide 0 0
#/dev/fd0 /media/fl auto user,noauto 0 0
UUID=d36ad497-1614-4cc5-8cc9-da2d5e5a417c / ext4 rw,noatime 0 1
#UUID=c5b000c2-c921-41ca-99a8-329ae3b61366 swap swap defaults 0 0
UUID=02b484bb-b331-43ce-a9c2-8a9658f0c8c1 /media/arch_legacy ext4 rw,noatime,noauto 0 0
UUID=9b8e3c44-b667-4bc3-ba2b-a67acaec81d0 /media/arch_exp ext4 rw,noatime,,noauto 0 0
#UUID=d36ad497-1614-4cc5-8cc9-da2d5e5a417c /media/arch ext4 rw,noatime,noauto 0 0
UUID=568618cd-749c-4c31-8e3d-aee8018a4d0d /mnt/data ext4 rw,noatime,data=ordered 0 2

I have 2 other installations on the same machine - Linux Mint 18 and a few-years-old, non-updated Arch (kernel 4.0.x). Neither of those has this issue. It started a couple of months ago; I am not sure exactly when.
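For context, a minimal sketch of how the 30-second timeout mentioned above is typically configured and how the stuck units can be inspected after a failed reboot; the UUID is the /mnt/data entry from the fstab, and the exact escaped unit name on a given system may differ:

# /etc/systemd/system.conf (takes effect after a reboot or "systemctl daemon-reexec")
DefaultTimeoutStartSec=30s
DefaultTimeoutStopSec=30s

# translate the /mnt/data device path into its systemd device unit name
systemd-escape -p --suffix=device /dev/disk/by-uuid/568618cd-749c-4c31-8e3d-aee8018a4d0d

# after the next failed reboot, check the state of that device unit and of the mount
systemctl status 'dev-disk-by\x2duuid-568618cd\x2d749c\x2d4c31\x2d8e3d\x2daee8018a4d0d.device'
systemctl status mnt-data.mount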
This task depends upon
4.14 kernel -> yes, the issue occurs there as well
Please test linux-4.13-1 and linux-4.12-2 to see whether the issue was introduced by the 4.13.y stable series or during the 4.13 merge window.
This should hopefully reduce the number of further steps you need to take to find the source.
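A minimal sketch of installing those specific builds from the Arch Linux Archive for testing; the exact package file names and extensions are assumptions and may differ from what the archive actually serves:

# install an archived kernel build, regenerate the boot entry if needed, then reboot into it
pacman -U https://archive.archlinux.org/packages/l/linux/linux-4.13-1-x86_64.pkg.tar.xz
# ...test the reboot behaviour, then repeat with the other version
pacman -U https://archive.archlinux.org/packages/l/linux/linux-4.12-2-x86_64.pkg.tar.xz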
4.13-1 -> does NOT work (the issue still occurs)
See https://bbs.archlinux.org/viewtopic.php?pid=1747625#p1747625 for how to bisect between 4.12 and 4.13 and hopefully find the commit that caused this, so it can be reported upstream.
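A rough outline of such a kernel bisect, assuming a clone of the stable kernel tree and the running Arch kernel's config; the build and install steps are simplified, and on Arch you would still need to generate an initramfs and a boot entry for each test kernel:

git clone https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git
cd linux
git bisect start
git bisect bad v4.13      # version that shows the problem almost every reboot
git bisect good v4.12     # version where the problem is rare
# at each step: build, install, reboot-test, then report the result to git
zcat /proc/config.gz > .config && make olddefconfig
make -j"$(nproc)" && sudo make modules_install install
# after testing the reboot behaviour:
git bisect good    # or: git bisect bad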
I have also reproduced this issue in:
4.8.0
4.9.0
4.10.0
4.12.0
However, it seems that the older the kernel, the harder it is to reproduce - it takes more reboots for the issue to manifest itself.
4.8.0 - about 4-5 successful reboots before getting this error again
4.9.0 - about 4-5 successful reboots before getting this error again
4.10.0 - about 2-4 successful reboots before getting this error again
4.12.0 - about 1-3 successful reboots before getting this error again
4.13.x - about 0-1 successful reboots before getting this error again
Interestingly, I was not able to reproduce it with the 4.9.66 LTS kernel; I gave up after 10 successful reboots.
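A small sketch of how these per-kernel reboot counts can be tracked from the journal instead of by hand, assuming persistent journald storage; the grep pattern for the failing unit is an assumption based on the messages shown in the screenshots:

# which kernel is currently running
uname -r
# list all boots recorded by journald
journalctl --list-boots
# for a given boot (e.g. the previous one, -1), check whether the device timed out
journalctl -b -1 | grep -iE 'timed out|pvscan|mnt-data'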
I guess I should report this directly to the kernel developers.
Passing it upstream to the kernel bug tracker without being able to pinpoint a subsystem or version does not give upstream much to work with.
This would suggest possibly failing HW to me, but something in the newer kernels seems to trigger it that is not in the LTS kernel.
https://bugzilla.redhat.com/show_bug.cgi?id=1520659
https://bugzilla.kernel.org/show_bug.cgi?id=198065
The LVM team responded that it does not look like a bug in LVM:
"Well your bug surely isn't lvm2 related - I'd expect some configuration issue with systemd (or possibly using some old version of systemd).
There are observable number of failing rules for systemd in your journalctl.
You kernel is possibly missing certain mandatory features for proper usage of systemd.
I'm closing bug for lvm2 - your problem is not lvm2 related."
I have done a separate installation of Arch from scratch on the same machine and I cannot reproduce the bug there. It is a rather simple installation and I do not have everything set up as on my standard installation, because that is complex and takes a lot of time, but I would expect that additional services (nothing low-level like LVM) should not interfere with disk availability. I cannot be 100% sure of that, of course, but for now I am working with that theory.
As the LVM team suggested, there might be issues with my systemd installation/configuration. If I compare the boot messages of the fresh install and my normal install, they surely differ. Especially at the beginning of boot, even though I am using the same systemd version, there are messages like "Systemd 235 starting" and something regarding udev. They are there only for a split second, so I do not know exactly, but my standard installation does not show those messages. So something is different. Could there be something messed up due to rolling releases and the age of my installation (5+ years)? Something that was not correctly upgraded? I attempt to maintain my system, merge all pacnews and follow the Arch website, but I might have missed something.
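One hedged guess about the early "Systemd 235 starting" message appearing only on the fresh install: messages that early usually come from the initramfs, so the two installations may simply have been built with different mkinitcpio hooks (the systemd hook runs systemd itself inside the initramfs, while the classic base/udev hooks use a busybox init). A quick way to compare the two setups, assuming the default image path:

# compare the hook lists used to build each initramfs
grep HOOKS /etc/mkinitcpio.conf

# inspect what actually ended up inside the image on each installation;
# an lvm2 hook (or sd-lvm2 when using the systemd hook) should show up here
lsinitcpio --analyze /boot/initramfs-linux.img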
I have no idea how to find and solve the "failing systemd rules" mentioned by the LVM team. There is only one systemd unit failing, and that is the one seen in the screenshots in this report.
What would be the best way to track down this issue? Please advise.
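A minimal sketch of commands that might surface the "failing rules" the LVM team referred to; nothing here is specific to this report beyond the devices already mentioned above:

# units that failed during the current boot
systemctl --failed

# errors and warnings from the current boot, and udev's own messages
journalctl -p err -b
journalctl -b -u systemd-udevd

# watch udev block-device events while reproducing the problem, and inspect the LVM device
udevadm monitor --udev --subsystem-match=block
udevadm info /dev/dm-0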
The second thing I noticed is that when I disable mounting of the failing /mnt/data filesystem, my system boots just fine and I can mount it manually afterwards. However, these symlinks are missing:
/dev/mapper/volgroup -> ../../dm-0
/dev/disk/by-uuid/<uuid of my disk> -> ../../dm-0
I can mount dm-0 manually and it works fine, but the symlinks are not there. When I changed my mount to use /dev/dm-0 instead of the UUID, it still failed. Probably dm-0 is simply not ready at mount time. Which component creates these symlinks? Would there be a way to defer the mount a bit, so that if the device is slow it might still succeed later? Could this be an issue with my systemd, as mentioned above, where I attempt to mount it too soon?
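On the question of which component creates those symlinks: for device-mapper devices they normally come from udev rules shipped with device-mapper/lvm2 (13-dm-disk.rules), so their absence suggests udev did not fully process the "change" event for dm-0. A hedged sketch of re-triggering the rules by hand and of giving this one fstab entry more time, reusing the UUID from the fstab above:

# re-run the block-device udev rules and wait for the event queue to drain;
# if the symlinks appear afterwards, the rules are fine and only the boot-time ordering is off
udevadm trigger --subsystem-match=block --action=change
udevadm settle
ls -l /dev/mapper/ /dev/disk/by-uuid/

# per-mount timeout override in /etc/fstab (waits up to 3 minutes for the device)
UUID=568618cd-749c-4c31-8e3d-aee8018a4d0d /mnt/data ext4 rw,noatime,data=ordered,x-systemd.device-timeout=3min 0 2

# or take the mount out of the boot critical path and mount it on first access instead
UUID=568618cd-749c-4c31-8e3d-aee8018a4d0d /mnt/data ext4 rw,noatime,data=ordered,noauto,x-systemd.automount 0 2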