FS#79439 - [linux] 6.4.11 rtsx driver bug prevents booting in some cases

Attached to Project: Arch Linux
Opened by Gene (GeneC) - Tuesday, 22 August 2023, 12:03 GMT
Last edited by Jan Alexander Steffens (heftig) - Thursday, 21 September 2023, 19:52 GMT
Task Type Bug Report
Category Kernel
Status Closed
Assigned To Jan Alexander Steffens (heftig)
Levente Polyak (anthraxx)
Architecture x86_64
Severity Low
Priority Normal
Reported Version
Due in Version Undecided
Due Date Undecided
Percent Complete 100%
Votes 11
Private No

Details

linux kernel 6.4.11

There is a bug in the rtsx driver in 6.4.11 that can cause boot to fail on machines with certain hardware that uses the rtsx driver.
In my case it presented as an NVMe failure and thus prevented the machine from booting.

6.4.10 is fine.

The hardware that triggers this is:
03:00.0 Unassigned class [ff00]: Realtek Semiconductor Co., Ltd. RTS525A PCI Express Card Reader (rev 01)

Workaround:
Blacklist the driver (it's only a card reader),
i.e. add
blacklist rtsx_pci
blacklist rtsx_pci_sdmmc
to /etc/modprobe.d/blacklist_rtsx.conf and rebuild the initramfs.
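
The workaround above could be scripted roughly as follows. This is only a sketch: the function name write_rtsx_blacklist and its optional directory argument are made up for illustration (the argument exists so the function can be tried in a scratch directory without root), and the rebuild step in the comment assumes mkinitcpio is the initramfs generator. Other generators (dracut, booster) use different commands.

```shell
#!/bin/sh
# Write the two blacklist entries from the report to blacklist_rtsx.conf.
# $1 is the target directory (hypothetical parameter, default /etc/modprobe.d).
write_rtsx_blacklist() {
    conf_dir=${1:-/etc/modprobe.d}
    printf '%s\n' \
        'blacklist rtsx_pci' \
        'blacklist rtsx_pci_sdmmc' \
        > "$conf_dir/blacklist_rtsx.conf"
}

# Typical use, as root, on a system that uses mkinitcpio (assumption):
#   write_rtsx_blacklist
#   mkinitcpio -P
```

To undo the workaround once a fixed kernel is installed, delete the file and rebuild the initramfs again.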

More details are available on lkml including the git bisect:
https://lkml.org/lkml/2023/8/16/1183

As of now there is no upstream fix or revert that I am aware of.

Should we revert commit 69304c8d285b77c9a56d68f5ddb2558f27abf406
until this is fixed upstream?

This task depends upon

Closed by  Jan Alexander Steffens (heftig)
Thursday, 21 September 2023, 19:52 GMT
Reason for closing:  Fixed
Additional comments about closing:  linux 6.5.4.arch2-1
Comment by Al Audet (alaudet) - Tuesday, 22 August 2023, 19:39 GMT
I was also affected with linux-lts-6.1.46-1. Downgrading to 6.1.45 for now.

Downgrading both kernels clears it for now for me.
Comment by Toolybird (Toolybird) - Tuesday, 22 August 2023, 20:26 GMT
Forum thread [1]. It's also noted on the kernel regression tracker [2]

[1] https://bbs.archlinux.org/viewtopic.php?id=288095
[2] https://linux-regtracking.leemhuis.info/regzbot/mainline/
Comment by Ronan Pigott (Brocellous) - Tuesday, 22 August 2023, 21:19 GMT
I can't reproduce? Same card:

$ uname -r
6.4.11-arch2-1
$ lspci -kd::ff00
03:00.0 Unassigned class [ff00]: Realtek Semiconductor Co., Ltd. RTS525A PCI Express Card Reader (rev 01)
Subsystem: Dell RTS525A PCI Express Card Reader
Kernel driver in use: rtsx_pci
Kernel modules: rtsx_pci
$ journalctl -b --no-hostname -g rtsx
Aug 22 14:07:51 kernel: rtsx_pci 0000:03:00.0: enabling device (0000 -> 0002)
Comment by Gene (GeneC) - Tuesday, 22 August 2023, 21:35 GMT
Ronan, that's interesting - perhaps the interaction with other drivers makes a difference somehow. In my case it led to an nvme problem; perhaps your case is different.
Does the machine use nvme for root?
Comment by Ronan Pigott (Brocellous) - Tuesday, 22 August 2023, 23:44 GMT
Yes it does.

$ findmnt -rvno source /
/dev/nvme0n1p5
$ lspci -kd::108
04:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller SM981/PM981/PM983
Subsystem: Samsung Electronics Co Ltd SSD 970 EVO
Kernel driver in use: nvme
Kernel modules: nvme
Comment by Gene (GeneC) - Wednesday, 23 August 2023, 09:41 GMT
Perhaps the chipset plays a role. The laptop with the problem is intel based (Intel(R) Core(TM) i7-7820HQ).

Glad you're not affected :)
Comment by Ronan Pigott (Brocellous) - Wednesday, 23 August 2023, 18:57 GMT
My laptop is also KBL...

$ grep model.name /proc/cpuinfo| uniq
model name : Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz

It might have to do with the nvme model.
Comment by Gene (GeneC) - Wednesday, 23 August 2023, 20:18 GMT
The nvme card in this machine is:

:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller SM961/PM961/SM963
Comment by Chris T (christarazi) - Wednesday, 23 August 2023, 23:35 GMT
Hit this on

Lenovo ThinkPad T14 Gen 3, model 21CF000KUS

---

$ findmnt -rvno source /
/dev/nvme0n1p3


$ lspci -kd::108
03:00.0 Non-Volatile memory controller: SK hynix Platinum P41/PC801 NVMe Solid State Drive
Subsystem: SK hynix Platinum P41/PC801 NVMe Solid State Drive
Kernel driver in use: nvme
Kernel modules: nvme
Comment by loqs (loqs) - Thursday, 24 August 2023, 11:08 GMT
Comment by Gene (GeneC) - Thursday, 24 August 2023, 11:32 GMT
Thank you @loqs, I've shared this info with lkml as well.
Comment by Al Audet (alaudet) - Thursday, 24 August 2023, 13:14 GMT
Thanks, I have followed the thread on lkml, interesting stuff. I saw that reverting a commit fixed the issue. Is this revert something we might see in a future kernel? I saw some chatter about it being fairly low priority. I'm asking because I use the card reader quite a bit, so I am staying on 6.4.10 for now. But I expect I may have to blacklist the drivers if I want new kernels on this system.

Appreciate all the work on this.
Comment by Ronan Pigott (Brocellous) - Thursday, 24 August 2023, 18:39 GMT
Man, several of those reports are for the 2017 Dell XPS 15 9560, which is indeed my laptop. Why is mine not affected? Firmware differences?

$ grep -H $ /sys/devices/virtual/dmi/id/{product_name,board_{name,version},bios_{date,version}}
/sys/devices/virtual/dmi/id/product_name:XPS 15 9560
/sys/devices/virtual/dmi/id/board_name:05FFDN
/sys/devices/virtual/dmi/id/board_version:A00
/sys/devices/virtual/dmi/id/bios_date:11/10/2022
/sys/devices/virtual/dmi/id/bios_version:1.31.0
$ grep -H $ /sys/class/block/nvme0n1/device/{model,firmware_rev}
/sys/class/block/nvme0n1/device/model:Samsung SSD 970 EVO 1TB
/sys/class/block/nvme0n1/device/firmware_rev:2B2QEXE7
Comment by Ronan Pigott (Brocellous) - Thursday, 24 August 2023, 19:25 GMT
[accidental double post]
Comment by Ronan Pigott (Brocellous) - Thursday, 24 August 2023, 19:27 GMT
Now that I think about it, I did upgrade the SSD at some point. Good chance that might be why.
Comment by Al Audet (alaudet) - Thursday, 24 August 2023, 21:00 GMT
For info, I have the stock drive from my XPS15 9560 and was affected

Model PC401 NVMe SK hynix 512GB
512.11 GB
FW Rev - 80002E00
Comment by loqs (loqs) - Sunday, 27 August 2023, 22:08 GMT
Can everyone affected who has not done so already try adding the kernel parameters below [1][2]:

nvme_core.default_ps_max_latency_us=0 pcie_aspm=off

[1] https://lore.kernel.org/all/fa82d9dcbe83403abc644c20922b47f9%40realtek.com/
[2] https://bugzilla.kernel.org/show_bug.cgi?id=217802#c4
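
After rebooting with those parameters, it may be worth confirming that they actually reached the kernel before reporting a result. A minimal sketch, assuming a POSIX shell; the helper name has_kernel_param and its optional second argument are hypothetical and not part of any tool:

```shell
#!/bin/sh
# Succeed if the exact parameter $1 appears as a whole word on the kernel
# command line. $2 optionally supplies the command line as a string
# (default: the contents of /proc/cmdline).
has_kernel_param() {
    param=$1
    cmdline=${2-$(cat /proc/cmdline)}
    case " $cmdline " in
        *" $param "*) return 0 ;;
        *) return 1 ;;
    esac
}

# Report on the two parameters suggested above.
for p in nvme_core.default_ps_max_latency_us=0 pcie_aspm=off; do
    if has_kernel_param "$p"; then
        echo "$p: set"
    else
        echo "$p: missing"
    fi
done
```

How the parameters get onto the command line depends on the boot loader, so that step is left out here.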
Comment by Demurgos (demurgos) - Sunday, 27 August 2023, 23:12 GMT
I had this bug with my XPS 9560. loqs, I tried setting these params as they were mentioned in dmesg, but it did not help. I first thought I might have a dying SSD, so I bought a new one, but it errored too. What helped was the workaround at the top with the blacklisted card reader driver.
Comment by Al Audet (alaudet) - Monday, 28 August 2023, 02:12 GMT
I tried the kernel parameters on lts 6.1.47 and it did not work.
Hangs at running hooks [udev] and then does not recognize storage devices. After a few seconds drops me to an emergency shell.
thanks
Comment by Augusto Zanellato (auguzanellato) - Monday, 28 August 2023, 14:36 GMT
I can confirm the issue also affects my XPS 15 9560 with the following NVMe drives:
- stock Toshiba (Dell OEM) XG4 1TB
- Crucial P3 1TB
- Sabrent Rocket 4 (SB-ROCKET-NVMe4) 1TB
pcie_aspm=off doesn't seem to do anything ftr.
Comment by Ronan Pigott (Brocellous) - Tuesday, 29 August 2023, 17:37 GMT
If you have an affected XPS 9560, what is your bios version? Can you reproduce with the latest bios [1][2]?

Also, fwiw, my (functional) xps 9560 is using the systemd hooks instead of busybox:
$ grep ^HOOKS /etc/mkinitcpio.conf
HOOKS=(base systemd autodetect modconf kms keyboard block filesystems resume fsck)

[1] https://lore.kernel.org/lkml/fa82d9dcbe83403abc644c20922b47f9%40realtek.com/
[2] https://fwupd.org/lvfs/devices/com.dell.uefi34578c72.firmware
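
For anyone answering, the fields being compared in this thread can be collected in one go. This is a sketch only: the function name rtsx_report and its optional sysfs-root argument are hypothetical, and it assumes the NVMe drive is nvme0n1, as in the outputs posted earlier.

```shell
#!/bin/sh
# Print the firmware and drive fields compared in this thread, one
# "path: value" line per file. $1 is an optional sysfs root (default /sys),
# there only so the function can be pointed at a copy or a mock tree.
rtsx_report() {
    sys=${1:-/sys}
    for f in \
        "$sys/devices/virtual/dmi/id/product_name" \
        "$sys/devices/virtual/dmi/id/bios_version" \
        "$sys/devices/virtual/dmi/id/bios_date" \
        "$sys/class/block/nvme0n1/device/model" \
        "$sys/class/block/nvme0n1/device/firmware_rev"
    do
        # Skip files that do not exist on this machine.
        if [ -r "$f" ]; then
            printf '%s: %s\n' "${f#"$sys"/}" "$(cat "$f")"
        fi
    done
    return 0
}

# Usage: rtsx_report
```

Posting the output of rtsx_report together with `uname -r` should make it easier to compare working and failing machines.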
Comment by Augusto Zanellato (auguzanellato) - Tuesday, 29 August 2023, 18:12 GMT
> If you have an affected XPS 9560, what is your bios version? Can you reproduce with the latest bios [1]?
I initially experienced the issue with bios 1.29.0, but I upgraded to 1.31.0 and the issue is still there.

For reference I tested 6.4.12.zen1-1 which should be the latest linux-zen version; the known good version I use is 6.4.10.zen2-1.

Regarding systemd hooks: I'm probably also using those, but I'm using booster instead of mkinitcpio.
Comment by Ronan Pigott (Brocellous) - Tuesday, 29 August 2023, 18:32 GMT
Alright well thanks for testing it.

booster provides an alternative init for the initrd stage, so it doesn't run systemd. However I really don't expect the initramfs to make any difference here so I wouldn't worry too much about it. Just thought I'd record that info in case, since my 9560 is apparently the only working one.

I only noticed this bug while checking the tracker because I was affected by https://bugs.archlinux.org/task/79366 also in 6.4.11, but not on this laptop.
Comment by Toolybird (Toolybird) - Saturday, 09 September 2023, 01:09 GMT
Dupe  FS#79427 
Comment by famar (famar) - Sunday, 10 September 2023, 13:01 GMT
Not a duplicate of  FS#79427 . The issue persists with linux 6.5.2.arch1-1. Consider increasing severity and/or priority.
Comment by loqs (loqs) - Sunday, 10 September 2023, 13:24 GMT
@famar the severity and/or priority are not really relevant to upstream issues.
If you want to hasten the process, please pursue it upstream. There has been no response to this request for a status update [1]. You could also submit the revert upstream yourself: although the commit may be technically correct, it breaks existing behavior.

In regards to  FS#79427  the boot failure was caused by this issue. There was a separate tpm issue that was visible on the console which has been resolved.

[1] https://lore.kernel.org/lkml/5d38cf11-114a-4997-a0fc-4627402468f8%40sapience.com/
Comment by Gene (GeneC) - Monday, 11 September 2023, 10:31 GMT
Kernel regression tracker is requesting confirmation of bug in 6.5.x and later. Please add any helpful info to

https://bugzilla.kernel.org/show_bug.cgi?id=217802

or to the lkml mailing list as per [1] of @loqs' previous comment.

Thanks.
Comment by loqs (loqs) - Monday, 11 September 2023, 11:13 GMT
6.6-rc1 is available as linux-mainline from the AUR, and prebuilt from miffe [1]. I do not expect it to fix the issue, but upstream would like confirmation.

[1] https://wiki.archlinux.org/title/Unofficial_user_repositories#miffe
Comment by Al Audet (alaudet) - Tuesday, 12 September 2023, 12:03 GMT
Comment by loqs (loqs) - Wednesday, 20 September 2023, 09:52 GMT
