
FS#56438 - [linux] LVM fails on reboot, works fine on cold start

Attached to Project: Arch Linux
Opened by Martin Dratva (raqua) - Friday, 24 November 2017, 22:13 GMT
Last edited by freswa (frederik) - Thursday, 10 September 2020, 13:03 GMT
Task Type: Bug Report
Category: Kernel
Status: Closed
Assigned To: Tobias Powalowski (tpowa), Jan Alexander Steffens (heftig)
Architecture: x86_64
Severity: Medium
Priority: Normal
Reported Version:
Due in Version: Undecided
Due Date: Undecided
Percent Complete: 100%
Votes: 1
Private: No

Details

Description:

Recently my machine started showing a strange issue. Everything works fine on a cold boot, but on reboot I get:
A start job is running for dev-disk-by ...
A stop job is running for LVM2 PV scan on device ...
See the pictures:
https://ibb.co/huh386
https://ibb.co/cYyX1R

Those two messages alternate, and after the timeout (I have set it to 30 sec. instead of the default 1:30) I get this screen:
https://ibb.co/fNxQMR
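
(For reference, the timeout change was made roughly like this - a sketch of the relevant /etc/systemd/system.conf lines, assuming the stock file:)

# /etc/systemd/system.conf - shorten the default 1:30 job timeouts to 30 s
[Manager]
DefaultTimeoutStartSec=30s
DefaultTimeoutStopSec=30s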

Again, if I log in and run "reboot" I end up in the same situation. If I run "poweroff" and then start the PC again, there is no issue at all; it boots perfectly fine. I searched for the messages, but others who hit this seem to hit it consistently, regardless of whether it is a cold start or a reboot.
My fstab:

# <file system> <dir> <type> <options> <dump> <pass>
shm /dev/shm tmpfs nodev,nosuid 0 0
tmpfs /tmp tmpfs nodev,nosuid,size=4G 0 0

/dev/sr0 /media/cd auto ro,user,noauto,unhide 0 0
/dev/dvd /media/dvd auto ro,user,noauto,unhide 0 0
#/dev/fd0 /media/fl auto user,noauto 0 0

UUID=d36ad497-1614-4cc5-8cc9-da2d5e5a417c / ext4 rw,noatime 0 1
#UUID=c5b000c2-c921-41ca-99a8-329ae3b61366 swap swap defaults 0 0
UUID=02b484bb-b331-43ce-a9c2-8a9658f0c8c1 /media/arch_legacy ext4 rw,noatime,noauto 0 0
UUID=9b8e3c44-b667-4bc3-ba2b-a67acaec81d0 /media/arch_exp ext4 rw,noatime,noauto 0 0
#UUID=d36ad497-1614-4cc5-8cc9-da2d5e5a417c /media/arch ext4 rw,noatime,noauto 0 0
UUID=568618cd-749c-4c31-8e3d-aee8018a4d0d /mnt/data ext4 rw,noatime,data=ordered 0 2


I have 2 other installations on the same machine: Linux Mint 18 and a few-years-old, non-updated Arch (kernel 4.0.x). Neither of them has this issue. It started a couple of months ago; I am not sure exactly when.

Closed by freswa (frederik)
Thursday, 10 September 2020, 13:03 GMT
Reason for closing: Fixed
Comment by Andrew Crerar (andrewSC) - Friday, 01 December 2017, 19:07 GMT
What's the version of your kernel?
Comment by Martin Dratva (raqua) - Friday, 01 December 2017, 19:18 GMT
4.13.12-1
Comment by loqs (loqs) - Friday, 01 December 2017, 20:48 GMT
Does the same issue occur under linux-lts or linux 4.14.3 from testing?
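For completeness, a rough sketch of the commands involved (assuming GRUB; [testing] has to be enabled in /etc/pacman.conf and the databases refreshed first):

# install the LTS kernel alongside the default one
pacman -S linux-lts
# install linux 4.14.3 from [testing] once the repo is enabled (pacman -Sy)
pacman -S testing/linux
# regenerate the boot menu if using GRUB
grub-mkconfig -o /boot/grub/grub.cfg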
Comment by Martin Dratva (raqua) - Friday, 01 December 2017, 22:32 GMT
LTS kernel -> no
4.14 kernel -> yes
Comment by loqs (loqs) - Friday, 01 December 2017, 22:47 GMT
Can you try linux-4.13-1 and linux-4.12-2 from https://archive.archlinux.org/packages/l/linux/ as a sort of quick bisection,
to see whether the issue was introduced by the 4.13.y stable series or during the 4.13 merge window?
This should hopefully reduce the further steps you need to take to find the source.
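A sketch of how that would go (the exact filenames should be checked against the archive listing):

# pacman can install straight from an archive URL
pacman -U https://archive.archlinux.org/packages/l/linux/linux-4.12-2-x86_64.pkg.tar.xz
# reboot and test, then repeat with linux-4.13-1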
Comment by Martin Dratva (raqua) - Friday, 01 December 2017, 23:03 GMT
4.12-2 -> works
4.13-1 -> does NOT work
Comment by loqs (loqs) - Friday, 01 December 2017, 23:08 GMT
So the issue appears to have been caused by a kernel change during the 4.13 merge window.
See https://bbs.archlinux.org/viewtopic.php?pid=1747625#p1747625 - bisect between 4.12 and 4.13 and hopefully find the commit that caused it, so it can be reported upstream.
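The bisection itself would look roughly like this (a sketch; the linked forum post covers the Arch-specific build and install steps):

# fetch the stable tree and bisect between the known-good and known-bad tags
git clone https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git
cd linux-stable
git bisect start
git bisect bad v4.13     # first release known to show the issue
git bisect good v4.12    # last release known to work
# build, install and boot the checked-out revision, then tell git the result:
git bisect good          # or: git bisect bad
# repeat until git names the first bad commit, then: git bisect reset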
Comment by Martin Dratva (raqua) - Saturday, 02 December 2017, 13:18 GMT
Ok, I have tested a bit more and it is a bit more complicated.

I have reproduced this issue also in:
4.8.0
4.9.0
4.10.0
4.12.0

However, it seems that the older the kernel, the harder it is to reproduce - it takes more reboots for the issue to manifest itself.
4.8.0 - about 4-5 successful reboots before getting this error again
4.9.0 - about 4-5 successful reboots before getting this error again
4.10.0 - about 2-4 successful reboots before getting this error again
4.12.0 - about 1-3 successful reboots before getting this error again
4.13.x - about 0-1 successful reboots before getting this error again

Interestingly, I was not able to reproduce it with the 4.9.66 LTS kernel; I gave up after 10 successful reboots.

I guess I should report this directly to the kernel developers.
Comment by Andrew Crerar (andrewSC) - Saturday, 02 December 2017, 13:47 GMT
At this point I think that makes the most sense ;) It definitely doesn't seem like an Arch-specific issue.
Comment by loqs (loqs) - Saturday, 02 December 2017, 19:36 GMT
You might also try lvm2 upstream, to see if they can provide more diagnostics to try or point at which kernel subsystems to look at.
Passing it to the kernel bug tracker without being able to pinpoint a subsystem or version does not give upstream much to work with.
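A sketch of the kind of diagnostics that could be collected for such a report (unit names are taken from the screenshots; persistent journaling is assumed, so the failed boot is still readable):

# errors from the previous (failed) boot
journalctl -b -1 -p err
# output of the LVM2 PV scan units seen on screen
journalctl -b -1 -u 'lvm2-pvscan@*'
# what udev knows about the logical volume's device node
udevadm info /dev/dm-0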
Comment by Martin Dratva (raqua) - Monday, 04 December 2017, 21:44 GMT
It seems that the issue is getting worse: I can no longer cold boot with the default 4.13.x kernel - it now fails on cold boot as well. With 4.9.66 it works without issue.
This would suggest failing hardware to me, but something in the newer kernels that is not in the LTS kernel seems to be triggering it.
Comment by Martin Dratva (raqua) - Monday, 04 December 2017, 21:45 GMT
Comment by Martin Dratva (raqua) - Sunday, 17 December 2017, 10:54 GMT
The kernel guys have not responded to the report yet.

LVM team responded that it looks like it is not a bug in LVM:
"Well your bug surely isn't lvm2 related - I'd expect some configuration issue with systemd (or possibly using some old version of systemd).
There are observable number of failing rules for systemd in your journalctl.
You kernel is possibly missing certain mandatory features for proper usage of systemd.
I'm closing bug for lvm2 - your problem is not lvm2 related."

I have done a separate installation of Arch from scratch on the same machine and I cannot reproduce the bug there. It is a rather simple installation and I do not have everything set up as on my standard install, because that is complex and takes a lot of time. But I would expect that additional services (nothing low-level like LVM) should not interfere with disk availability. I cannot be 100% sure of that, of course, but for now I am working with that theory.
As the LVM team suggested, there might be an issue with my systemd installation/configuration. If I compare the boot messages of the fresh install and my normal install, they surely differ. Especially at the beginning of boot, even though both use the same systemd version, the fresh install shows messages like "systemd 235 starting" and something regarding udev. They are only there for a split second, so I cannot read them exactly, but my standard installation does not show them. So something is different. Could something have been messed up by rolling releases and the age of my installation (5+ years), something that was not correctly upgraded? I try to maintain my system, merge all pacnews and follow the Arch website, but I might have missed something.
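One way to double-check for unmerged configuration files would be something like this (a sketch; pacdiff ships with pacman's contrib scripts):

# list any .pacnew/.pacsave files without merging them
pacdiff --output
# or search /etc directly
find /etc -name '*.pacnew' -o -name '*.pacsave'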
I have no idea how to find and solve the "failing systemd rules" mentioned by the LVM team. There is only one systemd unit failing, and that is the one seen in the screenshots in this report.
What would be the best way to find this issue? Please advise.
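(A couple of generic starting points, as a sketch - I may well be missing the right tool:)

# list units currently in the failed state
systemctl --failed
# warnings and errors from udev rule processing during this boot
journalctl -b -p warning -u systemd-udevd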


The second thing I noticed is that when I disable mounting of the failing /mnt/data filesystem, my system boots just fine and I can mount it manually. However, these symlinks are missing:
/dev/mapper/volgroup -> ../../dm-0
/dev/disk/by-uuid/<uuid of my disk> -> ../../dm-0

I can mount dm-0 manually and it works fine, but the symlinks are not there. When I changed my mount entry to use /dev/dm-0 instead of the UUID, it still failed; probably dm-0 is simply not ready at mount time. Which component creates the symlinks? Would there be a way to defer the mount a bit, so that if the device is slow the mount might still succeed later? Could this be an issue with my systemd, as mentioned above, in that it attempts the mount too soon?
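
(For reference, the by-uuid and mapper symlinks are created by udev rules shipped with device-mapper/lvm2, e.g. 13-dm-disk.rules. A sketch of two things that might apply here, assuming systemd's fstab generator handles the mount:)

# fstab: don't fail the boot if the LV shows up late, and give it 30 s
UUID=568618cd-749c-4c31-8e3d-aee8018a4d0d /mnt/data ext4 rw,noatime,nofail,x-systemd.device-timeout=30s 0 2

# after boot, ask udev to re-process block devices and recreate the symlinks
udevadm trigger --subsystem-match=block
udevadm settle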

