FS#74346 - [linux] 5.17.1 NULL pointer dereference in amdgpu

Attached to Project: Arch Linux
Opened by Lars Beckers (extmind) - Monday, 04 April 2022, 13:55 GMT
Last edited by Jelle van der Waa (jelly) - Thursday, 14 September 2023, 17:55 GMT
Task Type Bug Report
Category Kernel
Status Closed
Assigned To Tobias Powalowski (tpowa)
Jan Alexander Steffens (heftig)
David Runge (dvzrv)
Architecture x86_64
Severity High
Priority Normal
Reported Version
Due in Version Undecided
Due Date Undecided
Percent Complete 100%
Votes 2
Private No


The system resumed from suspend successfully. I then changed the display configuration, and the changes were applied, but shortly afterwards the system stopped responding to any input. The log shows a kernel trace reporting a NULL pointer dereference.

Additional info:
* linux 5.17.1-arch1-1
* Hardware: Thinkpad T14s with "AMD Ryzen 7 PRO 4750U with Radeon Graphics" (iGPU)
* attached kernel log, retrieved after a forced reboot

Steps to reproduce:
Did not happen previously when changing displays during the same boot.

Closed by  Jelle van der Waa (jelly)
Thursday, 14 September 2023, 17:55 GMT
Reason for closing:  Deferred
Additional comments about closing:  Old kernel, please retry with the latest
Comment by Lahfa Samy (AkechiShiro) - Friday, 08 April 2022, 10:13 GMT
Hey, I'm running Arch on an AMD Ryzen 7 3700U (a Thinkpad T495), and I haven't encountered this bug yet on linux 5.17.1-arch1-1. However, I'm encountering another bug involving the AMD IOMMU that seems to cause a kernel oops after resuming from suspend. If an X server has been launched with startx, the only thing I can do is force a reboot.
(The logs I've attached were captured with initcall_debug, no_console_suspend, and ignore_loglevel to get as much debugging output as possible.) I guess my issue is a completely different bug and not related to yours, though?
Comment by Lars Beckers (extmind) - Friday, 08 April 2022, 12:37 GMT
I'm no expert on this, but as the call trace looks completely different and my system does not exhibit the problem during resumption itself, I think this bug is unrelated to yours.
Comment by Lahfa Samy (AkechiShiro) - Friday, 08 April 2022, 15:07 GMT
Thanks for your advice. I've opened a separate bug report: https://bugs.archlinux.org/task/74405?project=1&opened=27395
Comment by David C. Rankin (drankinatty) - Monday, 27 June 2022, 04:11 GMT
I have this same issue using the default kernel amdgpu driver and an old [AMD/ATI] RV370 [Radeon X300] card (in a server). This started in 5.18 (or maybe in the last 5.17 release or two; I can't tell exactly, as all updates are done remotely). lspci for the video card shows:

01:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] RV370 [Radeon X300] (prog-if 00 [VGA controller])

The dmesg output for the NULL pointer dereference is:

[ 9.660937] [drm] amdgpu kernel modesetting enabled.
[ 9.661025] amdgpu: CRAT table not found
[ 9.661028] amdgpu: Virtual CRAT table created for CPU
[ 9.661040] amdgpu: Topology: Add CPU node
[ 9.661296] [drm] initializing kernel modesetting (IP DISCOVERY 0x1002:0x5B70 0x1002:0x0F03 0x00).
[ 9.661302] amdgpu 0000:01:00.1: amdgpu: Trusted Memory Zone (TMZ) feature disabled as experimental (default)
[ 9.661305] amdgpu 0000:01:00.1: amdgpu: Fatal error during GPU init
[ 9.661318] amdgpu: probe of 0000:01:00.1 failed with error -12
[ 9.661338] BUG: kernel NULL pointer dereference, address: 0000000000000000
[ 9.661384] #PF: supervisor write access in kernel mode
[ 9.661411] #PF: error_code(0x0002) - not-present page
[ 9.661440] PGD 0 P4D 0
[ 9.661454] Oops: 0002 [#1] PREEMPT SMP NOPTI

(full snippet with backtrace included as attachment)
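
As a reading aid for the oops header above, here is a minimal Python sketch that decodes the low bits of an x86 page-fault error code such as the 0x0002 in the log. The bit meanings (bit 0 present, bit 1 write, bit 2 user mode, bit 3 reserved bit, bit 4 instruction fetch) follow the x86 architecture definition; the function name and dictionary keys are hypothetical and not taken from the kernel source.

```python
# Sketch only: decode the low five bits of an x86 page-fault error code.
# The function name and the dictionary keys are illustrative assumptions.

def decode_pf_error_code(code):
    """Return the low page-fault error code bits as named booleans."""
    return {
        "present": bool(code & 0x01),            # 0 = not-present page, 1 = protection violation
        "write": bool(code & 0x02),              # 1 = write access, 0 = read access
        "user": bool(code & 0x04),               # 1 = user mode, 0 = supervisor (kernel) mode
        "reserved_bit": bool(code & 0x08),       # 1 = reserved bit set in a page-table entry
        "instruction_fetch": bool(code & 0x10),  # 1 = fault on an instruction fetch
    }


if __name__ == "__main__":
    # 0x0002 is the error code reported in the dmesg excerpt above.
    for name, value in decode_pf_error_code(0x0002).items():
        print(name, value)
```

For 0x0002 only the write flag is set, which agrees with the "supervisor write access" and "not-present page" lines in the log: the kernel tried to write through a pointer at address 0.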

This has progressively gotten worse. Let me know what else to send and I'm happy to do it.
Comment by David C. Rankin (drankinatty) - Monday, 27 June 2022, 04:40 GMT
Not sure if it is needed, but here is an attachment with hardware and system details related to the comment above. The attachment was produced with:

# inxi -c0 -C --gpu --memory --machine --sensors --system
Comment by David C. Rankin (drankinatty) - Monday, 27 June 2022, 14:58 GMT
This is definitely a kernel BUG. I just happened to have two servers with this same video card in them (low-power, fanless, fine for server). Checking the second server, the exact same NULL pointer issue is present with the amdgpu module. The dmesg output from the second server is attached. Exact same [AMD/ATI] RV370 [Radeon X300] card. Let me know if you need further hardware details on the second box.
Comment by loqs (loqs) - Monday, 27 June 2022, 15:38 GMT
Is the issue still present in linux 5.19-rc4? Is there an upstream bug report on https://gitlab.freedesktop.org/drm/amd/-/issues for the issue?
Comment by David C. Rankin (drankinatty) - Tuesday, 28 June 2022, 23:36 GMT
https://gitlab.freedesktop.org/drm/amd/-/issues/2070 appears to be the issue. I added a link there referring back here.
Comment by David C. Rankin (drankinatty) - Sunday, 03 July 2022, 07:20 GMT
The priority of this bug needs to be raised. The kernel NULL pointer dereference on reboot causes my system to hard-lock on shutdown, necessitating a hard restart. (Thankfully it doesn't hard-lock until after the drives are unmounted and in a safe state.) When your system can no longer be shut down cleanly due to this amdgpu bug, that needs to get elevated priority.
Comment by David C. Rankin (drankinatty) - Monday, 03 October 2022, 00:56 GMT
Thank God this looks like it is finally going to be fixed. This was out Friday:

Fixes: cfbb6b004744 ("drm/amdgpu: Rework reset domain to be refcounted.")

Signed-off-by: Zhang Boyang <zhangboyang.id@gmail.com>
Link: https://lore.kernel.org/lkml/a8bce489-8ccc-aa95-3de6-f854e03ad557@suddenlinkmail.com/
Link: https://lore.kernel.org/lkml/AT9WHR.3Z1T3VI9A2AQ3@att.net/
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)

Not sure when we will get it, but it can't happen quickly enough. Every kernel update locks the box on reboot.
Comment by loqs (loqs) - Monday, 03 October 2022, 02:25 GMT
If you built 6.0 with the suggested fix applied you could confirm if it resolves the issue and then report that back to upstream.
6.0 with change from [1] applied:
https://drive.google.com/file/d/1ZKZVSs4tlwVlpNQuq7cwcawAbKTURZQI/view?usp=sharing linux-6.0-1-x86_64.pkg.tar.zst
https://drive.google.com/file/d/1O8Blidk_8tCMigf759aoB2-olI_XTfNx/view?usp=sharing linux-headers-6.0-1-x86_64.pkg.tar.zst

[1] https://lore.kernel.org/all/20220930214110.1074226-2-zhangboyang.id%40gmail.com/