
FS#56438 - [linux] LVM fails on reboot, works fine on cold start

Attached to Project: Arch Linux
Opened by Martin Dratva (raqua) - Friday, 24 November 2017, 22:13 GMT
Last edited by freswa (frederik) - Thursday, 10 September 2020, 13:03 GMT
Task Type: Bug Report
Category: Kernel
Status: Closed
Assigned To: Tobias Powalowski (tpowa), Jan Alexander Steffens (heftig)
Architecture: x86_64
Severity: Medium
Priority: Normal
Reported Version:
Due in Version: Undecided
Due Date: Undecided
Percent Complete: 100%
Votes: 1
Private: No

Details

Description:

Recently my machine started showing a strange issue. Everything works fine on a cold boot, but on reboot I get:
A start job is running for dev-disk-by ...
A stop job is running for LVM2 PV scan on device ...
See the pictures:
https://ibb.co/huh386
https://ibb.co/cYyX1R

Those two messages alternate, and after the timeout (I have set it to 30 sec. instead of the default 1:30) I get this screen:
https://ibb.co/fNxQMR
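
(For reference, the timeout change was made roughly like this - a sketch of the relevant /etc/systemd/system.conf lines, assuming the stock file:)

# /etc/systemd/system.conf - shorten the default 1:30 job timeouts to 30 s
[Manager]
DefaultTimeoutStartSec=30s
DefaultTimeoutStopSec=30s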

Again, if I log in and run "reboot" I end up in the same situation. If I run "poweroff" and then start the PC again, there is no issue at all; it boots perfectly fine. I searched for the messages, but others who hit this seem to hit it consistently, regardless of whether it is a cold start or a reboot.
My fstab:

# <file system> <dir> <type> <options> <dump> <pass>
shm /dev/shm tmpfs nodev,nosuid 0 0
tmpfs /tmp tmpfs nodev,nosuid,size=4G 0 0

/dev/sr0 /media/cd auto ro,user,noauto,unhide 0 0
/dev/dvd /media/dvd auto ro,user,noauto,unhide 0 0
#/dev/fd0 /media/fl auto user,noauto 0 0

UUID=d36ad497-1614-4cc5-8cc9-da2d5e5a417c / ext4 rw,noatime 0 1
#UUID=c5b000c2-c921-41ca-99a8-329ae3b61366 swap swap defaults 0 0
UUID=02b484bb-b331-43ce-a9c2-8a9658f0c8c1 /media/arch_legacy ext4 rw,noatime,noauto 0 0
UUID=9b8e3c44-b667-4bc3-ba2b-a67acaec81d0 /media/arch_exp ext4 rw,noatime,noauto 0 0
#UUID=d36ad497-1614-4cc5-8cc9-da2d5e5a417c /media/arch ext4 rw,noatime,noauto 0 0
UUID=568618cd-749c-4c31-8e3d-aee8018a4d0d /mnt/data ext4 rw,noatime,data=ordered 0 2


I have 2 other installations on the same machine: Linux Mint 18 and a few-years-old, non-updated Arch (kernel 4.0.x). Neither of them has this issue. It started a couple of months ago; I am not sure exactly when.

Closed by freswa (frederik)
Thursday, 10 September 2020, 13:03 GMT
Reason for closing: Fixed
Comment by Andrew Crerar (andrewSC) - Friday, 01 December 2017, 19:07 GMT
What's the version of your kernel?
Comment by Martin Dratva (raqua) - Friday, 01 December 2017, 19:18 GMT
4.13.12-1
Comment by loqs (loqs) - Friday, 01 December 2017, 20:48 GMT
Does the same issue occur under linux-lts or linux 4.14.3 from testing?
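For completeness, a rough sketch of the commands involved (assuming GRUB; [testing] has to be enabled in /etc/pacman.conf and the databases refreshed first):

# install the LTS kernel alongside the default one
pacman -S linux-lts
# install linux 4.14.3 from [testing] once the repo is enabled (pacman -Sy)
pacman -S testing/linux
# regenerate the boot menu if using GRUB
grub-mkconfig -o /boot/grub/grub.cfg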
Comment by Martin Dratva (raqua) - Friday, 01 December 2017, 22:32 GMT
LTS kernel -> no
4.14 kernel -> yes
Comment by loqs (loqs) - Friday, 01 December 2017, 22:47 GMT
Can you try linux-4.13-1 and linux-4.12-2 from https://archive.archlinux.org/packages/l/linux/ as a sort of quick bisection,
to see whether the issue was introduced by the 4.13.y stable series or during the 4.13 merge window?
This should hopefully reduce the further steps you need to take to find the source.
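A sketch of how that would go (the exact filenames should be checked against the archive listing):

# pacman can install straight from an archive URL
pacman -U https://archive.archlinux.org/packages/l/linux/linux-4.12-2-x86_64.pkg.tar.xz
# reboot and test, then repeat with linux-4.13-1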
Comment by Martin Dratva (raqua) - Friday, 01 December 2017, 23:03 GMT
4.12-2 -> works
4.13-1 -> does NOT work
Comment by loqs (loqs) - Friday, 01 December 2017, 23:08 GMT
So the issue appears to have been caused by a kernel change during the 4.13 merge window.
See https://bbs.archlinux.org/viewtopic.php?pid=1747625#p1747625 - bisect between 4.12 and 4.13 and hopefully find the commit that caused it, so it can be reported upstream.
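The bisection itself would look roughly like this (a sketch; the linked forum post covers the Arch-specific build and install steps):

# fetch the stable tree and bisect between the known-good and known-bad tags
git clone https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git
cd linux-stable
git bisect start
git bisect bad v4.13     # first release known to show the issue
git bisect good v4.12    # last release known to work
# build, install and boot the checked-out revision, then tell git the result:
git bisect good          # or: git bisect bad
# repeat until git names the first bad commit, then: git bisect reset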
Comment by Martin Dratva (raqua) - Saturday, 02 December 2017, 13:18 GMT
Ok, I have tested a bit more and it is a bit more complicated.

I have reproduced this issue also in:
4.8.0
4.9.0
4.10.0
4.12.0

However, it seems that the older the kernel, the harder it is to reproduce - it takes more reboots for the issue to manifest itself.
4.8.0 - about 4-5 successful reboots before getting this error again
4.9.0 - about 4-5 successful reboots before getting this error again
4.10.0 - about 2-4 successful reboots before getting this error again
4.12.0 - about 1-3 successful reboots before getting this error again
4.13.x - about 0-1 successful reboots before getting this error again

Interestingly, I was not able to reproduce it with the 4.9.66 LTS kernel; I gave up after 10 successful reboots.

I guess I should report this directly to the kernel developers.
Comment by Andrew Crerar (andrewSC) - Saturday, 02 December 2017, 13:47 GMT
At this point I think that makes the most sense ;) It definitely doesn't seem like an Arch-specific issue.
Comment by loqs (loqs) - Saturday, 02 December 2017, 19:36 GMT
You might also try lvm2 upstream, to see if they can provide more diagnostics to try or point at which kernel subsystems to look at.
Passing it to the kernel bug tracker without being able to pinpoint a subsystem or version does not give upstream much to work with.
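A sketch of the kind of diagnostics that could be collected for such a report (unit names are taken from the screenshots; persistent journaling is assumed, so the failed boot is still readable):

# errors from the previous (failed) boot
journalctl -b -1 -p err
# output of the LVM2 PV scan units seen on screen
journalctl -b -1 -u 'lvm2-pvscan@*'
# what udev knows about the logical volume's device node
udevadm info /dev/dm-0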
Comment by Martin Dratva (raqua) - Monday, 04 December 2017, 21:44 GMT
It seems that the issue is getting worse: I can no longer cold boot with the default 4.13.x kernel - it now fails on cold boot as well. With 4.9.66 it works without issue.
This would suggest failing hardware to me, but something in the newer kernels that is not in the LTS kernel seems to be triggering it.
Comment by Martin Dratva (raqua) - Monday, 04 December 2017, 21:45 GMT
Comment by Martin Dratva (raqua) - Sunday, 17 December 2017, 10:54 GMT
The kernel guys have not responded to the report yet.

LVM team responded that it looks like it is not a bug in LVM:
"Well your bug surely isn't lvm2 related - I'd expect some configuration issue with systemd (or possibly using some old version of systemd).
There are observable number of failing rules for systemd in your journalctl.
You kernel is possibly missing certain mandatory features for proper usage of systemd.
I'm closing bug for lvm2 - your problem is not lvm2 related."

I have done a separate installation of Arch from scratch on the same machine and I cannot reproduce the bug there. It is a rather simple installation and I do not have everything set up as on my standard install, because that is complex and takes a lot of time. But I would expect that additional services (nothing low-level like LVM) should not interfere with disk availability. I cannot be 100% sure of that, of course, but for now I am working with that theory.
As the LVM team suggested, there might be an issue with my systemd installation/configuration. If I compare the boot messages of the fresh install and my normal install, they surely differ. Especially at the beginning of boot, even though both use the same systemd version, the fresh install shows messages like "systemd 235 starting" and something regarding udev. They are only there for a split second, so I cannot read them exactly, but my standard installation does not show them. So something is different. Could something have been messed up by rolling releases and the age of my installation (5+ years), something that was not correctly upgraded? I try to maintain my system, merge all pacnews and follow the Arch website, but I might have missed something.
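One way to double-check for unmerged configuration files would be something like this (a sketch; pacdiff ships with pacman's contrib scripts):

# list any .pacnew/.pacsave files without merging them
pacdiff --output
# or search /etc directly
find /etc -name '*.pacnew' -o -name '*.pacsave'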
I have no idea how to find and solve the "failing systemd rules" mentioned by the LVM team. There is only one systemd unit failing, and that is the one seen in the screenshots in this report.
What would be the best way to find this issue? Please advise.
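(A couple of generic starting points, as a sketch - I may well be missing the right tool:)

# list units currently in the failed state
systemctl --failed
# warnings and errors from udev rule processing during this boot
journalctl -b -p warning -u systemd-udevd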


The second thing I noticed is that when I disable mounting of the failing /mnt/data filesystem, my system boots just fine and I can mount it manually. However, these symlinks are missing:
/dev/mapper/volgroup -> ../../dm-0
/dev/disk/by-uuid/<uuid of my disk> -> ../../dm-0

I can mount dm-0 manually and it works fine, but the symlinks are not there. When I changed my mount entry to use /dev/dm-0 instead of the UUID, it still failed; probably dm-0 is simply not ready at mount time. Which component creates the symlinks? Would there be a way to defer the mount a bit, so that if the device is slow the mount might still succeed later? Could this be an issue with my systemd, as mentioned above, in that it attempts the mount too soon?
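
(For reference, the by-uuid and mapper symlinks are created by udev rules shipped with device-mapper/lvm2, e.g. 13-dm-disk.rules. A sketch of two things that might apply here, assuming systemd's fstab generator handles the mount:)

# fstab: don't fail the boot if the LV shows up late, and give it 30 s
UUID=568618cd-749c-4c31-8e3d-aee8018a4d0d /mnt/data ext4 rw,noatime,nofail,x-systemd.device-timeout=30s 0 2

# after boot, ask udev to re-process block devices and recreate the symlinks
udevadm trigger --subsystem-match=block
udevadm settle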

