FS#63984 - [linux] 5.3 kernel not loading scsi devices in bus order

Attached to Project: Arch Linux
Opened by L. Bradley LaBoon (lb.laboon) - Tuesday, 01 October 2019, 23:22 GMT
Last edited by freswa (frederik) - Friday, 21 February 2020, 21:54 GMT
Task Type Bug Report
Category Packages: Core
Status Closed
Assigned To No-one
Architecture All
Severity High
Priority Normal
Reported Version
Due in Version Undecided
Due Date Undecided
Percent Complete 100%
Votes 2
Private No

Details

Description:

Starting with kernel 5.3, SCSI devices are no longer loaded/processed in bus order, and instead appear to be processed randomly.
In my case, I have an Arch Linux QEMU guest with two SCSI block devices (one ext4 main device and one swap device), and I explicitly set the bus ordering via QEMU command line arguments so that within the guest I can refer to them by their device names (sda, sdb, etc.).

Prior to 5.3, the device names would always be related to their bus order. For example, if my SCSI bus IDs are 0:0:0:0 and 0:0:1:2, then 0:0:0:0 would always be sda and 0:0:1:2 would always be sdb. Starting with 5.3, this is no longer the case and the result is different with each reboot. ~50% of the time the guest boots into an emergency shell due to the devices being in the unexpected order.
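
For illustration only, here is a minimal Python sketch of how such a QEMU command line might be assembled so that each disk gets a fixed SCSI address. The report does not show the actual arguments, so the virtio-scsi controller, the image names (root.img, swap.img), the raw image format, and the helper names below are all assumptions; only the two addresses (target 0 LUN 0, and target 1 LUN 2) come from the example above.

```python
from typing import List, NamedTuple


class ScsiDisk(NamedTuple):
    image: str   # path to the disk image (hypothetical name)
    target: int  # SCSI target ID, the "T" in H:C:T:L
    lun: int     # logical unit number, the "L" in H:C:T:L


def qemu_scsi_args(disks: List[ScsiDisk], controller: str = "scsi0") -> List[str]:
    """Build QEMU arguments that pin each disk to a fixed SCSI address.

    Assumes a single virtio-scsi controller and raw images; adjust both
    to match the real deployment.
    """
    args = ["-device", f"virtio-scsi-pci,id={controller}"]
    for index, disk in enumerate(disks):
        drive_id = f"drive{index}"
        # Define the backing image without attaching it to any bus yet.
        args += ["-drive", f"if=none,id={drive_id},file={disk.image},format=raw"]
        # Attach it as a SCSI disk at an explicit channel/target/LUN.
        args += [
            "-device",
            f"scsi-hd,bus={controller}.0,channel=0,"
            f"scsi-id={disk.target},lun={disk.lun},drive={drive_id}",
        ]
    return args


if __name__ == "__main__":
    disks = [ScsiDisk("root.img", 0, 0), ScsiDisk("swap.img", 1, 2)]
    print(" ".join(qemu_scsi_args(disks)))
```

The addresses on the bus stay fixed with this kind of setup; what changed in 5.3 is only which sdX name each address receives inside the guest.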

I have attached a snippet of the kernel boot log during an unsuccessful boot, where it can be seen that device ID 0:0:1:2 gets processed/named before 0:0:0:0.
   boot.log (1.9 KiB)

Closed by  freswa (frederik)
Friday, 21 February 2020, 21:54 GMT
Reason for closing:  None
Additional comments about closing:  This seems pretty stalled to me. If it's still an issue, please file a re-open request. Thank you :)
Comment by loqs (loqs) - Wednesday, 02 October 2019, 00:18 GMT
Do the upstream kernel developers agree this is a bug / regression rather than undefined behavior?
Persistent block device naming [1] is recommended to avoid exactly this kind of issue.

[1] https://wiki.archlinux.org/index.php/Persistent_block_device_naming
Comment by L. Bradley LaBoon (lb.laboon) - Wednesday, 02 October 2019, 17:53 GMT
I will check upstream.

Unfortunately, using UUID (et al) as identifiers isn't an option in our case since multiple VMs are deployed from a common base image, and we don't know what the partition UUIDs are going to be at the time the image is made. This is why we explicitly order devices on the bus so we can rely on the device names being consistent.
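
As a possible middle ground that keys on the bus address rather than on a UUID or on the sdX name, the following Python sketch looks up whichever kernel name a given SCSI address received on the current boot. It assumes the usual sysfs layout of /sys/class/scsi_disk/<H:C:T:L>/device/block/<name>, which should be verified on the target kernel; the function name and the sysfs_root parameter are hypothetical and exist only to make the sketch testable. It is an illustration, not something tested against the setup in this report.

```python
import os
from typing import Optional


def disk_name_for_address(address: str, sysfs_root: str = "/sys") -> Optional[str]:
    """Return the kernel name (e.g. "sda") of the disk at a SCSI H:C:T:L address.

    Returns None if no disk is registered at that address.
    """
    block_dir = os.path.join(
        sysfs_root, "class", "scsi_disk", address, "device", "block"
    )
    try:
        names = sorted(os.listdir(block_dir))
    except FileNotFoundError:
        return None
    # A SCSI disk normally exposes exactly one block device here.
    return names[0] if names else None


if __name__ == "__main__":
    for address in ("0:0:0:0", "0:0:1:2"):
        print(address, "->", disk_name_for_address(address))
```

The /dev/disk/by-path symlinks created by udev, covered on the wiki page linked above, serve a similar purpose without any scripting.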
Comment by loqs (loqs) - Wednesday, 02 October 2019, 18:16 GMT
You could try reverting 82a54da641f3cacfa31db36fc58a5e903f804c22 and f049cf1a7b6737c75884247c3f6383ef104d255a.
Comment by L. Bradley LaBoon (lb.laboon) - Thursday, 03 October 2019, 16:03 GMT
Thanks for the pointer! Looks like it's a combination of f049cf1a7b6737c75884247c3f6383ef104d255a and 82a54da641f3cacfa31db36fc58a5e903f804c22.

I just built a kernel with those two commits reverted and it seems to be working perfectly.
So it sounds like the new asynchronous device probing is what's biting us, and from the way the commits are worded it sounds like that's the intended functionality.

I'll see if I can get a thread started on the LKML. This change really hurts those of us who deploy VMs from base images: as I mentioned above, we can't bake UUIDs into the image because we don't know what they're going to be ahead of time.
Comment by L. Bradley LaBoon (lb.laboon) - Thursday, 03 October 2019, 22:20 GMT
LKML thread for reference: https://lkml.org/lkml/2019/10/3/2108
Comment by loqs (loqs) - Thursday, 03 October 2019, 22:39 GMT
As it breaks your workflow [1], if the subsystem developers cannot offer a solution you could consider CC'ing Linus and asking for a revert for 5.4.

[1] https://lore.kernel.org/lkml/CAHk-=wiP4K8DRJWsCo=20hn_6054xBamGKF2kPgUzpB5aMaofA%40mail.gmail.com/
