FS#75237 - [linux][linux-hardened] Unrecoverable crash with amdgpu after DPMS on with >= 5.18

Attached to Project: Arch Linux
Opened by David Runge (dvzrv) - Monday, 04 July 2022, 12:39 GMT
Last edited by Toolybird (Toolybird) - Sunday, 26 March 2023, 00:09 GMT
Task Type Bug Report
Category Packages: Core
Status Closed
Assigned To Jan Alexander Steffens (heftig)
Levente Polyak (anthraxx)
Architecture All
Severity Critical
Priority Normal
Reported Version
Due in Version Undecided
Due Date Undecided
Percent Complete 100%
Votes 1
Private No

Details

Description:

When my machine wakes up from DPMS, the screen turns on briefly and then goes back to sleep. It eventually wakes up again after more than a minute, and I can log back into my session.
If the screen is turned off (manually) during this time frame, the machine crashes about 1 or 2 minutes after logging in and needs a hardware reset.

```
Jul 03 10:47:18 hmbx kernel: ACPI: bus type drm_connector registered
Jul 03 10:47:18 hmbx kernel: [drm] amdgpu kernel modesetting enabled.
Jul 03 10:47:18 hmbx kernel: amdgpu: Ignoring ACPI CRAT on non-APU system
Jul 03 10:47:18 hmbx kernel: amdgpu: Virtual CRAT table created for CPU
Jul 03 10:47:18 hmbx kernel: amdgpu: Topology: Add CPU node
Jul 03 10:47:18 hmbx kernel: fb0: switching to amdgpu from EFI VGA
Jul 03 10:47:18 hmbx kernel: amdgpu 0000:0d:00.0: vgaarb: deactivate vga console
Jul 03 10:47:18 hmbx kernel: amdgpu 0000:0d:00.0: enabling device (0006 -> 0007)
Jul 03 10:47:18 hmbx kernel: [drm] initializing kernel modesetting (NAVI10 0x1002:0x731F 0x1043:0x0583 0xC4).
Jul 03 10:47:18 hmbx kernel: amdgpu 0000:0d:00.0: amdgpu: Trusted Memory Zone (TMZ) feature disabled as experimental (default)
Jul 03 10:47:18 hmbx kernel: [drm] register mmio base: 0xFCD00000
Jul 03 10:47:18 hmbx kernel: [drm] register mmio size: 524288
Jul 03 10:47:18 hmbx kernel: [drm] add ip block number 0 <nv_common>
Jul 03 10:47:18 hmbx kernel: [drm] add ip block number 1 <gmc_v10_0>
Jul 03 10:47:18 hmbx kernel: [drm] add ip block number 2 <navi10_ih>
Jul 03 10:47:18 hmbx kernel: [drm] add ip block number 3 <psp>
Jul 03 10:47:18 hmbx kernel: [drm] add ip block number 4 <smu>
Jul 03 10:47:18 hmbx kernel: [drm] add ip block number 5 <dm>
Jul 03 10:47:18 hmbx kernel: [drm] add ip block number 6 <gfx_v10_0>
Jul 03 10:47:18 hmbx kernel: [drm] add ip block number 7 <sdma_v5_0>
Jul 03 10:47:18 hmbx kernel: [drm] add ip block number 8 <vcn_v2_0>
Jul 03 10:47:18 hmbx kernel: [drm] add ip block number 9 <jpeg_v2_0>
Jul 03 10:47:18 hmbx kernel: amdgpu 0000:0d:00.0: amdgpu: Fetched VBIOS from VFCT
Jul 03 10:47:18 hmbx kernel: amdgpu: ATOM BIOS: 115-D199PI0-101
Jul 03 10:47:18 hmbx kernel: [drm] VCN decode is enabled in VM mode
Jul 03 10:47:18 hmbx kernel: [drm] VCN encode is enabled in VM mode
Jul 03 10:47:18 hmbx kernel: [drm] JPEG decode is enabled in VM mode
Jul 03 10:47:18 hmbx kernel: [drm] vm size is 262144 GB, 4 levels, block size is 9-bit, fragment size is 9-bit
Jul 03 10:47:18 hmbx kernel: amdgpu 0000:0d:00.0: amdgpu: VRAM: 8176M 0x0000008000000000 - 0x00000081FEFFFFFF (8176M used)
Jul 03 10:47:18 hmbx kernel: amdgpu 0000:0d:00.0: amdgpu: GART: 512M 0x0000000000000000 - 0x000000001FFFFFFF
Jul 03 10:47:18 hmbx kernel: amdgpu 0000:0d:00.0: amdgpu: AGP: 267894784M 0x0000008400000000 - 0x0000FFFFFFFFFFFF
Jul 03 10:47:18 hmbx kernel: [drm] Detected VRAM RAM=8176M, BAR=256M
Jul 03 10:47:18 hmbx kernel: [drm] RAM width 256bits GDDR6
Jul 03 10:47:18 hmbx kernel: [drm] amdgpu: 8176M of VRAM memory ready
Jul 03 10:47:18 hmbx kernel: [drm] amdgpu: 8176M of GTT memory ready.
Jul 03 10:47:18 hmbx kernel: [drm] GART: num cpu pages 131072, num gpu pages 131072
Jul 03 10:47:18 hmbx kernel: [drm] PCIE GART of 512M enabled (table at 0x0000008000300000).
Jul 03 10:47:18 hmbx kernel: amdgpu 0000:0d:00.0: amdgpu: PSP runtime database doesn't exist
Jul 03 10:47:18 hmbx kernel: [drm] Found VCN firmware Version ENC: 1.17 DEC: 5 VEP: 0 Revision: 2
Jul 03 10:47:18 hmbx kernel: amdgpu 0000:0d:00.0: amdgpu: Will use PSP to load VCN firmware
Jul 03 10:47:18 hmbx kernel: [drm] reserve 0x900000 from 0x81fe400000 for PSP TMR
Jul 03 10:47:18 hmbx kernel: amdgpu 0000:0d:00.0: amdgpu: RAS: optional ras ta ucode is not available
Jul 03 10:47:18 hmbx kernel: amdgpu 0000:0d:00.0: amdgpu: RAP: optional rap ta ucode is not available
Jul 03 10:47:18 hmbx kernel: amdgpu 0000:0d:00.0: amdgpu: SECUREDISPLAY: securedisplay ta ucode is not available
Jul 03 10:47:18 hmbx kernel: amdgpu 0000:0d:00.0: amdgpu: use vbios provided pptable
Jul 03 10:47:18 hmbx kernel: amdgpu 0000:0d:00.0: amdgpu: smc_dpm_info table revision(format.content): 4.5
Jul 03 10:47:18 hmbx kernel: amdgpu 0000:0d:00.0: amdgpu: SMU is initialized successfully!
Jul 03 10:47:18 hmbx kernel: [drm] Display Core initialized with v3.2.177!
Jul 03 10:47:18 hmbx kernel: [drm] kiq ring mec 2 pipe 1 q 0
Jul 03 10:47:18 hmbx kernel: [drm] VCN decode and encode initialized successfully(under DPG Mode).
Jul 03 10:47:18 hmbx kernel: [drm] JPEG decode initialized successfully.
Jul 03 10:47:18 hmbx kernel: kfd kfd: amdgpu: Allocated 3969056 bytes on gart
Jul 03 10:47:18 hmbx kernel: amdgpu: HMM registered 8176MB device memory
Jul 03 10:47:18 hmbx kernel: amdgpu: SRAT table not found
Jul 03 10:47:18 hmbx kernel: amdgpu: Virtual CRAT table created for GPU
Jul 03 10:47:18 hmbx kernel: amdgpu: Topology: Add dGPU node [0x731f:0x1002]
Jul 03 10:47:18 hmbx kernel: kfd kfd: amdgpu: added device 1002:731f
Jul 03 10:47:18 hmbx kernel: amdgpu 0000:0d:00.0: amdgpu: SE 2, SH per SE 2, CU per SH 10, active_cu_number 36
Jul 03 10:47:18 hmbx kernel: amdgpu 0000:0d:00.0: amdgpu: ring gfx_0.0.0 uses VM inv eng 0 on hub 0
Jul 03 10:47:18 hmbx kernel: amdgpu 0000:0d:00.0: amdgpu: ring comp_1.0.0 uses VM inv eng 1 on hub 0
Jul 03 10:47:18 hmbx kernel: amdgpu 0000:0d:00.0: amdgpu: ring comp_1.1.0 uses VM inv eng 4 on hub 0
Jul 03 10:47:18 hmbx kernel: amdgpu 0000:0d:00.0: amdgpu: ring comp_1.2.0 uses VM inv eng 5 on hub 0
Jul 03 10:47:18 hmbx kernel: amdgpu 0000:0d:00.0: amdgpu: ring comp_1.3.0 uses VM inv eng 6 on hub 0
Jul 03 10:47:18 hmbx kernel: amdgpu 0000:0d:00.0: amdgpu: ring comp_1.0.1 uses VM inv eng 7 on hub 0
Jul 03 10:47:18 hmbx kernel: amdgpu 0000:0d:00.0: amdgpu: ring comp_1.1.1 uses VM inv eng 8 on hub 0
Jul 03 10:47:18 hmbx kernel: amdgpu 0000:0d:00.0: amdgpu: ring comp_1.2.1 uses VM inv eng 9 on hub 0
Jul 03 10:47:18 hmbx kernel: amdgpu 0000:0d:00.0: amdgpu: ring comp_1.3.1 uses VM inv eng 10 on hub 0
Jul 03 10:47:18 hmbx kernel: amdgpu 0000:0d:00.0: amdgpu: ring kiq_2.1.0 uses VM inv eng 11 on hub 0
Jul 03 10:47:18 hmbx kernel: amdgpu 0000:0d:00.0: amdgpu: ring sdma0 uses VM inv eng 12 on hub 0
Jul 03 10:47:18 hmbx kernel: amdgpu 0000:0d:00.0: amdgpu: ring sdma1 uses VM inv eng 13 on hub 0
Jul 03 10:47:18 hmbx kernel: amdgpu 0000:0d:00.0: amdgpu: ring vcn_dec uses VM inv eng 0 on hub 1
Jul 03 10:47:18 hmbx kernel: amdgpu 0000:0d:00.0: amdgpu: ring vcn_enc0 uses VM inv eng 1 on hub 1
Jul 03 10:47:18 hmbx kernel: amdgpu 0000:0d:00.0: amdgpu: ring vcn_enc1 uses VM inv eng 4 on hub 1
Jul 03 10:47:18 hmbx kernel: amdgpu 0000:0d:00.0: amdgpu: ring jpeg_dec uses VM inv eng 5 on hub 1
Jul 03 10:47:18 hmbx kernel: [drm] Initialized amdgpu 3.46.0 20150101 for 0000:0d:00.0 on minor 0
Jul 03 10:47:18 hmbx kernel: fbcon: amdgpudrmfb (fb0) is primary device
Jul 03 10:47:18 hmbx kernel: [drm] DSC precompute is not needed.
Jul 03 10:47:18 hmbx kernel: amdgpu 0000:0d:00.0: [drm] fb0: amdgpudrmfb frame buffer device
Jul 03 11:04:00 hmbx systemd[1]: Starting Load Kernel Module drm...
Jul 03 11:04:00 hmbx kernel: snd_hda_intel 0000:0d:00.1: bound 0000:0d:00.0 (ops amdgpu_dm_audio_component_bind_ops [amdgpu])
Jul 03 12:40:31 hmbx kernel: [drm] perform_link_training_with_retries: Link training attempt 1 of 4 failed
Jul 03 12:40:32 hmbx kernel: [drm] perform_link_training_with_retries: Link training attempt 2 of 4 failed
Jul 03 14:10:16 hmbx kernel: [drm] perform_link_training_with_retries: Link training attempt 1 of 4 failed
Jul 03 14:10:17 hmbx kernel: [drm] perform_link_training_with_retries: Link training attempt 2 of 4 failed
Jul 03 16:55:06 hmbx kernel: [drm] perform_link_training_with_retries: Link training attempt 1 of 4 failed
Jul 03 16:55:07 hmbx kernel: [drm] perform_link_training_with_retries: Link training attempt 2 of 4 failed
Jul 03 16:55:08 hmbx kernel: [drm] REG_WAIT timeout 1us * 100000 tries - optc1_wait_for_state line:819
Jul 03 20:12:06 hmbx kernel: [drm] perform_link_training_with_retries: Link training attempt 1 of 4 failed
Jul 03 20:12:07 hmbx kernel: [drm] perform_link_training_with_retries: Link training attempt 2 of 4 failed
Jul 03 20:51:52 hmbx kernel: [drm] perform_link_training_with_retries: Link training attempt 1 of 4 failed
Jul 03 20:51:53 hmbx kernel: [drm] perform_link_training_with_retries: Link training attempt 2 of 4 failed
Jul 03 20:51:54 hmbx kernel: [drm] REG_WAIT timeout 1us * 100000 tries - optc1_wait_for_state line:819
Jul 03 22:24:10 hmbx kernel: [drm] perform_link_training_with_retries: Link training attempt 1 of 4 failed
Jul 03 22:24:11 hmbx kernel: [drm] perform_link_training_with_retries: Link training attempt 2 of 4 failed
Jul 03 22:24:11 hmbx kernel: [drm] REG_WAIT timeout 1us * 100000 tries - optc1_wait_for_state line:819
```
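The log above shows a recurring pattern: each wake-up produces two failed link training attempts, sometimes followed by a `REG_WAIT` timeout. To see that pattern in a longer journal export, the messages can be grouped into wake-up episodes. The following is only a sketch: the function name `group_episodes`, the 5 second grouping gap, and the `SAMPLE` list are assumptions made for illustration, and the timestamp parsing assumes the short journal format shown above with English month names.

```python
import re
from datetime import datetime, timedelta

# Message fragments copied from the log excerpt in this report.
LINK_FAIL = re.compile(r"Link training attempt (\d+) of (\d+) failed")
REG_WAIT = re.compile(r"REG_WAIT timeout")
# Short journal timestamp, e.g. "Jul 03 12:40:31 " (no year).
STAMP = re.compile(r"^(\w{3} \d{2} \d{2}:\d{2}:\d{2}) ")


def group_episodes(lines, gap=timedelta(seconds=5)):
    """Group amdgpu wake-up errors that occur close together in time.

    Returns a list of dicts with the start timestamp, the number of
    failed link training attempts, and whether a REG_WAIT timeout was
    logged. The 5 second gap is an arbitrary choice for this sketch.
    """
    episodes = []
    last = None
    for line in lines:
        is_fail = LINK_FAIL.search(line) is not None
        is_wait = REG_WAIT.search(line) is not None
        stamp = STAMP.match(line)
        if stamp is None or not (is_fail or is_wait):
            continue
        # The log carries no year, so strptime falls back to 1900; that
        # is good enough for measuring gaps within a single excerpt.
        when = datetime.strptime(stamp.group(1), "%b %d %H:%M:%S")
        if last is None or when - last > gap:
            episodes.append(
                {"start": stamp.group(1), "link_failures": 0, "reg_wait": False}
            )
        last = when
        episodes[-1]["link_failures"] += int(is_fail)
        episodes[-1]["reg_wait"] = episodes[-1]["reg_wait"] or is_wait
    return episodes


# Five lines taken verbatim from the excerpt above.
SAMPLE = [
    "Jul 03 12:40:31 hmbx kernel: [drm] perform_link_training_with_retries: Link training attempt 1 of 4 failed",
    "Jul 03 12:40:32 hmbx kernel: [drm] perform_link_training_with_retries: Link training attempt 2 of 4 failed",
    "Jul 03 16:55:06 hmbx kernel: [drm] perform_link_training_with_retries: Link training attempt 1 of 4 failed",
    "Jul 03 16:55:07 hmbx kernel: [drm] perform_link_training_with_retries: Link training attempt 2 of 4 failed",
    "Jul 03 16:55:08 hmbx kernel: [drm] REG_WAIT timeout 1us * 100000 tries - optc1_wait_for_state line:819",
]

if __name__ == "__main__":
    for episode in group_episodes(SAMPLE):
        print(episode)
```

Feeding the function the lines of a saved journal export (one message per line) gives one entry per wake-up attempt, which makes it easier to see how often the `REG_WAIT` timeout accompanies the link training failures.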

With kernel >= 5.18.8, the screen no longer wakes up from DPMS the second time: after the screen is woken up briefly for the first time, the machine crashes unrecoverably (without logs).


Additional info:

```
0d:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Navi 10 [Radeon RX 5600 OEM/5600 XT / 5700/5700 XT] (rev c4)
```

* package version(s): linux, linux-hardened >= 5.18
* there are two drm/amd bugs that may describe what I am seeing (https://gitlab.freedesktop.org/drm/amd/-/issues/2073 and https://gitlab.freedesktop.org/drm/amd/-/issues/2044)

I have tried reverting the patch mentioned in the first upstream ticket, but that changed nothing in my case (with 5.18.8, waking up the screen still leads to an unrecoverable crash).


Steps to reproduce:

Boot into linux or linux-hardened (>= 5.18). Let the screen go to sleep, then wake it up (e.g. by mouse movement or a button press).
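Before attempting to reproduce, it can help to confirm that the running kernel falls into the range named in this report. The sketch below assumes the range is 5.18 up to (but not including) 6.0, which combines the ">= 5.18" from this report with the closing note that the issue is gone with 6.0. This is the range as reported, not a verified regression window, and the helper names `parse_release` and `in_reported_range` are made up for illustration.

```python
import platform
import re


def parse_release(release):
    """Extract (major, minor) from a kernel release string.

    Works on strings such as '5.18.8-arch1-1'; anything after the
    first two numeric components is ignored.
    """
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return int(match.group(1)), int(match.group(2))


def in_reported_range(release):
    """Return True if the release lies in the range named in this report.

    Assumption: affected from 5.18 (reporter) until 6.0 (closing note).
    """
    return (5, 18) <= parse_release(release) < (6, 0)


if __name__ == "__main__":
    running = platform.release()
    print(running, "in reported range:", in_reported_range(running))
```

For example, `in_reported_range("5.18.8-arch1-1")` is true, while a 6.0 release string is outside the range.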

Closed by  Toolybird (Toolybird)
Sunday, 26 March 2023, 00:09 GMT
Reason for closing:  Fixed
Additional comments about closing:  linux 6.0.arch1-1
Comment by David Runge (dvzrv) - Monday, 04 July 2022, 12:47 GMT
Maybe related and reported even earlier: https://gitlab.freedesktop.org/drm/amd/-/issues/2029
Comment by David Runge (dvzrv) - Wednesday, 06 July 2022, 11:50 GMT
I have created a dedicated upstream issue: https://gitlab.freedesktop.org/drm/amd/-/issues/2079
Comment by Bráulio Barros de Oliveira (brauliobo) - Sunday, 07 August 2022, 22:52 GMT
Comment by loqs (loqs) - Sunday, 26 March 2023, 00:02 GMT
The upstream bug report was closed [1] with:
This issue is gone with kernel >= 6.0.0

[1] https://gitlab.freedesktop.org/drm/amd/-/issues/2079#note_1651658
