FS#74556 - Kernel Oops in 5.17.4 with AMD GPU Reset workaround

Attached to Project: Arch Linux
Opened by James King (Randomized) - Monday, 25 April 2022, 15:40 GMT
Last edited by Toolybird (Toolybird) - Sunday, 16 October 2022, 20:58 GMT
Task Type: Bug Report
Category: Kernel
Status: Closed
Assigned To: Tobias Powalowski (tpowa), Jan Alexander Steffens (heftig), David Runge (dvzrv)
Architecture: All
Severity: Low
Priority: Normal
Reported Version:
Due in Version: Undecided
Due Date: Undecided
Percent Complete: 100%
Votes: 1
Private: No

Details

Description:
I pass through an AMD graphics card that requires a workaround to reset it (IOMMU Group 38 45:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Vega 10 XL/XT [Radeon RX Vega 56/64] [1002:687f] (rev c1)).

After upgrading from the last 5.16 kernel to the latest 5.17 kernel (5.17.4), the script I use to initiate the workaround by putting the machine to sleep generated a kernel oops. I subsequently installed linux-lts and was able to run the script successfully, so the problem appears to be specific to the 5.17.x series of kernels.

The script is:

echo "1" | sudo tee -a /sys/bus/pci/devices/0000\:45\:00.0/remove
echo "1" | sudo tee -a /sys/bus/pci/devices/0000\:45\:00.1/remove

systemctl suspend
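
For anyone reproducing this, a more defensive variant of the removal step might look like the sketch below. It is not the reporter's script: the function name `remove_pci_functions` and the `SYSFS_ROOT` override are hypothetical, added only so the logic can be tried against a scratch directory. It has to run as root, since it writes to sysfs directly instead of going through `sudo tee`.

```shell
# Sketch: detach PCI functions through sysfs, skipping any that are absent.
# SYSFS_ROOT is a hypothetical override (not in the original script); it
# defaults to /sys and exists only to make dry runs possible.
remove_pci_functions() {
    local root="${SYSFS_ROOT:-/sys}"
    local addr node
    for addr in "$@"; do
        node="$root/bus/pci/devices/$addr/remove"
        if [ -e "$node" ]; then
            # Writing 1 to the "remove" attribute asks the kernel to detach the device.
            echo 1 > "$node" || return 1
            echo "removed $addr"
        else
            echo "skipped $addr (not present)" >&2
        fi
    done
}
```

Usage, mirroring the script above: `remove_pci_functions 0000:45:00.0 0000:45:00.1 && systemctl suspend`. On an affected kernel the oops would presumably still occur during the write to `remove`; the sketch only avoids writing to a device node that no longer exists.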

The oops follows:

Apr 25 08:53:19 tux-master kernel: BUG: kernel NULL pointer dereference, address: 000000000000006c
Apr 25 08:53:19 tux-master kernel: #PF: supervisor read access in kernel mode
Apr 25 08:53:19 tux-master kernel: #PF: error_code(0x0000) - not-present page
Apr 25 08:53:19 tux-master kernel: PGD 0 P4D 0
Apr 25 08:53:19 tux-master kernel: Oops: 0000 [#1] PREEMPT SMP NOPTI
Apr 25 08:53:19 tux-master kernel: CPU: 13 PID: 992 Comm: tee Tainted: G W 5.17.4-arch1-1 #1 bba05afeab01638bf5119bbe9f3f1f1452c88ff1
Apr 25 08:53:19 tux-master kernel: Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./X399 Taichi, BIOS P3.30 08/14/2018
Apr 25 08:53:19 tux-master kernel: RIP: 0010:pcie_capability_read_dword+0x1c/0xb0
Apr 25 08:53:19 tux-master kernel: Code: eb a9 41 be 86 00 00 00 eb e3 0f 1f 40 00 0f 1f 44 00 00 41 56 41 89 f6 41 55 41 54 55 53 c7 02 00 00 00 00 41 83 e6 03 75 3e <44> 0f b6 6f 6c 48 89 fd 45 84 ed 74 25 89 f3 49 89 d4 e8 5d fe ff
Apr 25 08:53:19 tux-master kernel: RSP: 0018:ffffbfde92dc3c10 EFLAGS: 00010246
Apr 25 08:53:19 tux-master kernel: RAX: 0000000000000000 RBX: 0000000000001000 RCX: 0000000000000064
Apr 25 08:53:19 tux-master kernel: RDX: ffffbfde92dc3c4c RSI: 000000000000000c RDI: 0000000000000000
Apr 25 08:53:19 tux-master kernel: RBP: ffffa02044335d80 R08: 0000000000000004 R09: ffffbfde92dc3bf4
Apr 25 08:53:19 tux-master kernel: R10: 0000000000000000 R11: 0000000000000044 R12: 0000000000000000
Apr 25 08:53:19 tux-master kernel: R13: 0000000000000040 R14: 0000000000000000 R15: 0000000000000000
Apr 25 08:53:19 tux-master kernel: FS: 00007f9b55145740(0000) GS:ffffa02ffef40000(0000) knlGS:0000000000000000
Apr 25 08:53:19 tux-master kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Apr 25 08:53:19 tux-master kernel: CR2: 000000000000006c CR3: 0000002072658000 CR4: 00000000003506e0
Apr 25 08:53:19 tux-master kernel: Call Trace:
Apr 25 08:53:19 tux-master kernel: <TASK>
Apr 25 08:53:19 tux-master kernel: pcie_aspm_check_latency.isra.0+0x104/0x210
Apr 25 08:53:19 tux-master kernel: pcie_update_aspm_capable+0xb0/0xe0
Apr 25 08:53:19 tux-master kernel: pcie_aspm_pm_state_change+0x3d/0xa0
Apr 25 08:53:19 tux-master kernel: pci_raw_set_power_state+0x169/0x210
Apr 25 08:53:19 tux-master kernel: pci_set_power_state+0xf8/0x1a0
Apr 25 08:53:19 tux-master kernel: vfio_pci_remove+0x15/0x30 [vfio_pci 4504ca667961aa5b56c0d2e5ce76a10c76fa6bc6]
Apr 25 08:53:19 tux-master kernel: pci_device_remove+0x36/0xa0
Apr 25 08:53:19 tux-master kernel: __device_release_driver+0x17a/0x250
Apr 25 08:53:19 tux-master kernel: device_release_driver+0x24/0x30
Apr 25 08:53:19 tux-master kernel: pci_stop_bus_device+0x68/0x90
Apr 25 08:53:19 tux-master kernel: pci_stop_and_remove_bus_device_locked+0x16/0x30
Apr 25 08:53:19 tux-master kernel: remove_store+0x7d/0x90
Apr 25 08:53:19 tux-master kernel: kernfs_fop_write_iter+0x11c/0x1b0
Apr 25 08:53:19 tux-master kernel: new_sync_write+0x15c/0x1f0
Apr 25 08:53:19 tux-master kernel: vfs_write+0x1eb/0x280
Apr 25 08:53:19 tux-master kernel: ksys_write+0x67/0xe0
Apr 25 08:53:19 tux-master kernel: do_syscall_64+0x5c/0x80
Apr 25 08:53:19 tux-master kernel: entry_SYSCALL_64_after_hwframe+0x44/0xae
Apr 25 08:53:19 tux-master kernel: RIP: 0033:0x7f9b5524a257
Apr 25 08:53:19 tux-master kernel: Code: 0f 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24
Apr 25 08:53:19 tux-master kernel: RSP: 002b:00007ffd0ca5a778 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
Apr 25 08:53:19 tux-master kernel: RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007f9b5524a257
Apr 25 08:53:19 tux-master kernel: RDX: 0000000000000002 RSI: 00007ffd0ca5a8d0 RDI: 0000000000000003
Apr 25 08:53:19 tux-master kernel: RBP: 00007ffd0ca5a8d0 R08: 0000000000001004 R09: 0000000000000001
Apr 25 08:53:19 tux-master kernel: R10: 00000000000001b6 R11: 0000000000000246 R12: 0000000000000002
Apr 25 08:53:19 tux-master kernel: R13: 00005570dc0684a0 R14: 0000000000000002 R15: 00007f9b553437a0
Apr 25 08:53:19 tux-master kernel: </TASK>
Apr 25 08:53:19 tux-master kernel: Modules linked in: hid_microsoft ff_memless mousedev joydev dm_mod nct6775 hwmon_vid iwlmvm snd_usb_audio snd_usbmidi_lib snd_rawmidi mac80211 snd_seq_device mc intel_rapl_msr mxm_wmi wmi_bmof snd_hda_codec_realtek snd>
Apr 25 08:53:19 tux-master kernel: xhci_pci_renesas vfio_pci vfio_pci_core irqbypass vfio_virqfd vfio_iommu_type1 vfio
Apr 25 08:53:19 tux-master kernel: CR2: 000000000000006c
Apr 25 08:53:19 tux-master kernel: ---[ end trace 0000000000000000 ]---
Apr 25 08:53:19 tux-master kernel: RIP: 0010:pcie_capability_read_dword+0x1c/0xb0
Apr 25 08:53:19 tux-master kernel: Code: eb a9 41 be 86 00 00 00 eb e3 0f 1f 40 00 0f 1f 44 00 00 41 56 41 89 f6 41 55 41 54 55 53 c7 02 00 00 00 00 41 83 e6 03 75 3e <44> 0f b6 6f 6c 48 89 fd 45 84 ed 74 25 89 f3 49 89 d4 e8 5d fe ff
Apr 25 08:53:19 tux-master kernel: RSP: 0018:ffffbfde92dc3c10 EFLAGS: 00010246
Apr 25 08:53:19 tux-master kernel: RAX: 0000000000000000 RBX: 0000000000001000 RCX: 0000000000000064
Apr 25 08:53:19 tux-master kernel: RDX: ffffbfde92dc3c4c RSI: 000000000000000c RDI: 0000000000000000
Apr 25 08:53:19 tux-master kernel: RBP: ffffa02044335d80 R08: 0000000000000004 R09: ffffbfde92dc3bf4
Apr 25 08:53:19 tux-master kernel: R10: 0000000000000000 R11: 0000000000000044 R12: 0000000000000000
Apr 25 08:53:19 tux-master kernel: R13: 0000000000000040 R14: 0000000000000000 R15: 0000000000000000
Apr 25 08:53:19 tux-master kernel: FS: 00007f9b55145740(0000) GS:ffffa02ffef40000(0000) knlGS:0000000000000000
Apr 25 08:53:19 tux-master kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Apr 25 08:53:19 tux-master kernel: CR2: 000000000000006c CR3: 0000002072658000 CR4: 00000000003506e0
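
When comparing oopses across kernel versions, it can help to reduce each one to the fault address, the faulting kernel function, and the RDI register. The sketch below does that for log text in the format pasted above; `oops_summary` is a hypothetical helper name, and it only looks at the first kernel-mode (`0010:`) RIP line and the first RDI value.

```shell
# Sketch: pull the key fields out of an oops pasted from the journal.
# oops_summary is a hypothetical helper; it reads the log text on stdin.
oops_summary() {
    awk '
        # Fault address reported on the "BUG:" line.
        /NULL pointer dereference, address:/ && !seen_addr { addr = $NF; seen_addr = 1 }
        # First kernel-mode RIP (code segment 0010); user-mode RIP uses 0033.
        /RIP: 0010:/ && !seen_rip { sub(/.*RIP: 0010:/, ""); rip = $0; seen_rip = 1 }
        # First RDI value, which belongs to the kernel register dump.
        / RDI: / && !seen_rdi { rdi = $NF; seen_rdi = 1 }
        END {
            print "fault address: " addr
            print "faulting function: " rip
            print "RDI: " rdi
        }
    '
}
```

Usage: `journalctl -k -b -1 | oops_summary`. For the trace above this reports a fault address of 000000000000006c with RDI equal to zero. The marked bytes in the Code line (`<44> 0f b6 6f 6c`) appear to decode to a byte load from offset 0x6c relative to RDI, which would be consistent with a field being read through a NULL structure pointer.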
Closed by Toolybird (Toolybird) - Sunday, 16 October 2022, 20:58 GMT
Reason for closing: Fixed
Additional comments about closing: linux 6.0.2.arch1-1
Comment by James King (Randomized) - Friday, 06 May 2022, 21:45 GMT
Tested this with 5.17.5-arch1-1 and the issue persists.
Comment by Alexey Ryzhov (xue) - Wednesday, 11 May 2022, 12:51 GMT
Experiencing a very similar issue with an AMD GPU without any reset workarounds on kernel 5.17.5-arch1-1. Kernel 5.15.38-1-lts from the linux-lts package works fine.
Device: 0a:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Navi 21 [Radeon RX 6800/6800 XT / 6900 XT] (rev c0)

Oops: 0000 [#1] PREEMPT SMP NOPTI
Hardware name: Gigabyte Technology Co., Ltd. X570 AORUS PRO/X570 AORUS PRO, BIOS F36a 02/16/2022
RIP: 0010:vfio_pci_core_unregister_device+0xd/0xa0 [vfio_pci_core]
Comment by James King (Randomized) - Monday, 13 June 2022, 14:53 GMT
Tested this with 5.18.3-arch1-1 but now the stack trace has changed:

BUG: kernel NULL pointer dereference, address: 000000000000006c
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 0 P4D 0
Oops: 0000 [#1] PREEMPT SMP NOPTI
CPU: 21 PID: 957 Comm: tee Tainted: G W 5.18.3-arch1-1 #1 2090c6f1d9d20f39bd14c0acb6fa89ddb994d43f
Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./X399 Taichi, BIOS P3.30 08/14/2018
RIP: 0010:pcie_capability_reg_implemented+0x7/0xd0
Code: 03 00 00 00 48 c7 c7 70 8b d4 b1 5b e9 22 2e b1 ff 0f 0b eb d3 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 0f 1f 44 00 00 31 d2 <80> 7f 6c 00 89 f1 74 3e>
RSP: 0018:ffffbb3d926dfbe8 EFLAGS: 00010246
RAX: 0000000000000000 RBX: 000000000000000c RCX: 0000000000000064
RDX: 0000000000000000 RSI: 000000000000000c RDI: 0000000000000000
RBP: 0000000000000000 R08: 0000000000000004 R09: ffffbb3d926dfbd4
R10: 0000000000000044 R11: ffffffffb0b35990 R12: ffffbb3d926dfc24
R13: 0000000000000000 R14: 0000000000001388 R15: 0000000000000000
FS: 00007f7c0b225740(0000) GS:ffff9f8dfe540000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000000000000006c CR3: 00000010795ba000 CR4: 00000000003506e0
Call Trace:
<TASK>
pcie_capability_read_dword+0x2b/0xb0
pcie_aspm_check_latency.isra.0+0x10b/0x210
pcie_update_aspm_capable+0xb1/0xe0
pcie_aspm_pm_state_change+0x41/0xa0
pci_raw_set_power_state+0x137/0x210
vfio_pci_remove+0x19/0x30 [vfio_pci 71a74ce0c543b84b41207595c4fc0aba2b32864c]
pci_device_remove+0x3a/0xa0
device_release_driver_internal+0x1b3/0x210
pci_stop_bus_device+0x69/0x90
pci_stop_and_remove_bus_device_locked+0x1a/0x30
remove_store+0x82/0xa0
kernfs_fop_write_iter+0x11f/0x1f0
new_sync_write+0x13d/0x1c0
vfs_write+0x1ec/0x270
ksys_write+0x6f/0xf0
do_syscall_64+0x5f/0x90
? syscall_exit_to_user_mode+0x26/0x50
? do_syscall_64+0x6b/0x90
? do_syscall_64+0x6b/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xae
RIP: 0033:0x7f7c0b101c27
Code: 0f 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51>
RSP: 002b:00007ffdbd3677b8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007f7c0b101c27
RDX: 0000000000000002 RSI: 00007ffdbd367910 RDI: 0000000000000003
RBP: 00007ffdbd367910 R08: 0000000000001004 R09: 0000000000000001
R10: 00000000000001b6 R11: 0000000000000246 R12: 0000000000000002
R13: 000055eb94cd64a0 R14: 0000000000000002 R15: 00007f7c0b1f9940
</TASK>
Modules linked in: snd_usb_audio snd_usbmidi_lib snd_rawmidi hid_microsoft snd_seq_device mc ff_memless mousedev joydev intel_rapl_msr intel_rapl_common amd6>
nvme_core aacraid xhci_pci_renesas vfio_pci vfio_pci_core irqbypass vfio_virqfd vfio_iommu_type1 vfio
CR2: 000000000000006c
---[ end trace 0000000000000000 ]---
RIP: 0010:pcie_capability_reg_implemented+0x7/0xd0
Code: 03 00 00 00 48 c7 c7 70 8b d4 b1 5b e9 22 2e b1 ff 0f 0b eb d3 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 0f 1f 44 00 00 31 d2 <80> 7f 6c 00 89 f1 74 3e>
RSP: 0018:ffffbb3d926dfbe8 EFLAGS: 00010246
RAX: 0000000000000000 RBX: 000000000000000c RCX: 0000000000000064
RDX: 0000000000000000 RSI: 000000000000000c RDI: 0000000000000000
RBP: 0000000000000000 R08: 0000000000000004 R09: ffffbb3d926dfbd4
R10: 0000000000000044 R11: ffffffffb0b35990 R12: ffffbb3d926dfc24
R13: 0000000000000000 R14: 0000000000001388 R15: 0000000000000000
FS: 00007f7c0b225740(0000) GS:ffff9f8dfe540000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000000000000006c CR3: 00000010795ba000 CR4: 00000000003506e0
Comment by James King (Randomized) - Friday, 29 July 2022, 19:05 GMT
Updating to note that the same behavior still occurs with 5.18.14-arch1-1.
Comment by James King (Randomized) - Sunday, 16 October 2022, 18:11 GMT
Tested this again with 6.0.1-arch2-1 and it seems there has been a fix, as this now works as expected again.
Comment by Toolybird (Toolybird) - Sunday, 16 October 2022, 20:58 GMT
Thanks for letting us know.
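
To check where a given kernel falls relative to the versions tested in this task, something like the following sketch could be used. The boundaries are assumptions taken only from the comments above (5.17.4 through 5.18.14 reported as affected, 6.0.1 reported as working, kernels before 5.17 reported as working); the function names are hypothetical, and the comparison relies on `sort -V` as provided by GNU coreutils.

```shell
# Sketch: classify a kernel version against the reports in this task.
# The boundaries come only from versions the commenters tested; they are
# not an authoritative statement of which releases carry the bug.
version_le() {
    # True when $1 <= $2 in version order (relies on sort -V).
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}

classify_kernel() {
    local v="${1%%-*}"   # drop a suffix such as "-arch1-1"
    if version_le 5.17.4 "$v" && version_le "$v" 5.18.14; then
        echo "within the range reported as affected"
    elif version_le 6.0.1 "$v"; then
        echo "at or after 6.0.1, which was reported as working"
    elif ! version_le 5.17 "$v"; then
        echo "before 5.17, where tested kernels were reported as working"
    else
        echo "untested in this task"
    fi
}
```

Usage: `classify_kernel "$(uname -r)"`.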