FS#75189 - radeon GPU driver dies, system becomes unusable

Attached to Project: Arch Linux
Opened by Marius Kleber (MK13) - Wednesday, 29 June 2022, 17:49 GMT
Last edited by Toolybird (Toolybird) - Friday, 02 September 2022, 05:23 GMT
Task Type Bug Report
Category Packages: Core
Status Closed
Assigned To No-one
Architecture x86_64
Severity Critical
Priority Normal
Reported Version
Due in Version Undecided
Due Date Undecided
Percent Complete 100%
Votes 0
Private No

Details

Description:
After some time the radeon GPU driver dies:

Jun 29 19:04:02 brutebox kernel: radeon 0000:01:00.0: ring 4 stalled for more than 10350msec
Jun 29 19:04:02 brutebox kernel: radeon 0000:01:00.0: GPU lockup (current fence id 0x0000000000101671 last fence id 0x0000000000101672 on ring 4)
Jun 29 19:04:02 brutebox kernel: radeon 0000:01:00.0: failed to get a new IB (-35)
Jun 29 19:04:02 brutebox kernel: radeon 0000:01:00.0: failed to get a new IB (-35)
Jun 29 19:04:02 brutebox kernel: [drm:radeon_cs_ioctl [radeon]] *ERROR* Failed to get ib !
Jun 29 19:04:02 brutebox kernel: [drm:radeon_cs_ioctl [radeon]] *ERROR* Failed to get ib !
Jun 29 19:04:02 brutebox kernel: BUG: unable to handle page fault for address: ffffb135c08f1ffc
Jun 29 19:04:02 brutebox kernel: #PF: supervisor read access in kernel mode
Jun 29 19:04:02 brutebox kernel: #PF: error_code(0x0000) - not-present page
Jun 29 19:04:02 brutebox kernel: PGD 100000067 P4D 100000067 PUD 0
Jun 29 19:04:02 brutebox kernel: Oops: 0000 [#1] PREEMPT SMP NOPTI
Jun 29 19:04:02 brutebox kernel: CPU: 1 PID: 714 Comm: Xorg:rcs0 Tainted: G OE 5.18.7-arch1-1 #1 b361f845a00a4369e3079c139378bcbc5b131d49
Jun 29 19:04:02 brutebox kernel: Hardware name: Gigabyte Technology Co., Ltd. B85M-DS3H/B85M-DS3H, BIOS F1 09/06/2013
Jun 29 19:04:02 brutebox kernel: RIP: 0010:radeon_ring_backup+0xc2/0x160 [radeon]
Jun 29 19:04:02 brutebox kernel: Code: 49 c1 e6 02 4c 89 f7 e8 7c 16 dd c1 49 89 45 00 48 89 c2 48 85 c0 74 5c 48 8b 4b 10 41 8d 47 01 45 89 ff 23 43 5c 4a 8d 34 b9 <8b> 36 89 32 41 83 fc 01 74 29 ba 04 00 00 00 eb 04 48 8b 4b 10 8d
Jun 29 19:04:02 brutebox kernel: RSP: 0018:ffffb131c0d93a30 EFLAGS: 00010246
Jun 29 19:04:02 brutebox kernel: RAX: 0000000000000000 RBX: ffff93d3c0a09620 RCX: ffffb131c08f2000
Jun 29 19:04:02 brutebox kernel: RDX: ffff93d66b500000 RSI: ffffb135c08f1ffc RDI: 000000000003968f
Jun 29 19:04:02 brutebox kernel: RBP: ffff93d3c0a09600 R08: 0000000000039688 R09: 0000000000000006
Jun 29 19:04:02 brutebox kernel: R10: 0000000000000000 R11: 0000000000000000 R12: 000000000000dd41
Jun 29 19:04:02 brutebox kernel: R13: ffffb131c0d93a98 R14: 0000000000037504 R15: 00000000ffffffff
Jun 29 19:04:02 brutebox kernel: FS: 00007fc1e3dff640(0000) GS:ffff93d6e0080000(0000) knlGS:0000000000000000
Jun 29 19:04:02 brutebox kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jun 29 19:04:02 brutebox kernel: CR2: ffffb135c08f1ffc CR3: 000000011835c004 CR4: 00000000001706e0
Jun 29 19:04:02 brutebox kernel: Call Trace:
Jun 29 19:04:02 brutebox kernel: <TASK>
Jun 29 19:04:02 brutebox kernel: radeon_gpu_reset+0xee/0x330 [radeon ff27255649b437dabb0c7fdfb7440867ccf6b58e]
Jun 29 19:04:02 brutebox kernel: radeon_cs_ioctl+0x32a/0x770 [radeon ff27255649b437dabb0c7fdfb7440867ccf6b58e]
Jun 29 19:04:02 brutebox kernel: ? radeon_cs_parser_init+0x4a0/0x4a0 [radeon ff27255649b437dabb0c7fdfb7440867ccf6b58e]
Jun 29 19:04:02 brutebox kernel: drm_ioctl_kernel+0xc7/0x170
Jun 29 19:04:02 brutebox kernel: drm_ioctl+0x22e/0x410
Jun 29 19:04:02 brutebox kernel: ? radeon_cs_parser_init+0x4a0/0x4a0 [radeon ff27255649b437dabb0c7fdfb7440867ccf6b58e]
Jun 29 19:04:02 brutebox kernel: radeon_drm_ioctl+0x4d/0x80 [radeon ff27255649b437dabb0c7fdfb7440867ccf6b58e]
Jun 29 19:04:02 brutebox kernel: __x64_sys_ioctl+0x8e/0xc0
Jun 29 19:04:02 brutebox kernel: do_syscall_64+0x5c/0x90
Jun 29 19:04:02 brutebox kernel: ? syscall_exit_to_user_mode+0x26/0x50
Jun 29 19:04:02 brutebox kernel: ? do_syscall_64+0x6b/0x90
Jun 29 19:04:02 brutebox kernel: ? do_syscall_64+0x6b/0x90
Jun 29 19:04:02 brutebox kernel: ? asm_sysvec_apic_timer_interrupt+0xe/0x20
Jun 29 19:04:02 brutebox kernel: entry_SYSCALL_64_after_hwframe+0x44/0xae
Jun 29 19:04:02 brutebox kernel: RIP: 0033:0x7fc1ef5077af
Jun 29 19:04:02 brutebox kernel: Code: 00 48 89 44 24 18 31 c0 48 8d 44 24 60 c7 04 24 10 00 00 00 48 89 44 24 08 48 8d 44 24 20 48 89 44 24 10 b8 10 00 00 00 0f 05 <89> c2 3d 00 f0 ff ff 77 18 48 8b 44 24 18 64 48 2b 04 25 28 00 00
Jun 29 19:04:02 brutebox kernel: RSP: 002b:00007fc1e3dfe9c0 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
Jun 29 19:04:02 brutebox kernel: RAX: ffffffffffffffda RBX: 00007fc1e5bec0c0 RCX: 00007fc1ef5077af
Jun 29 19:04:02 brutebox kernel: RDX: 00007fc1e5bfc0c8 RSI: 00000000c0206466 RDI: 0000000000000010
Jun 29 19:04:02 brutebox kernel: RBP: 00007fc1e5bfc0c8 R08: 0000000000000001 R09: 00000000ffffffff
Jun 29 19:04:02 brutebox kernel: R10: 0000000000000000 R11: 0000000000000246 R12: 00000000c0206466
Jun 29 19:04:02 brutebox kernel: R13: 0000000000000010 R14: 00007fc1e5c00190 R15: 00005579442b8de8
Jun 29 19:04:02 brutebox kernel: </TASK>
Jun 29 19:04:02 brutebox kernel: Modules linked in: amdgpu gpu_sched 8021q garp mrp stp llc intel_rapl_msr intel_rapl_common x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel spi_nor kvm mtd crct10dif_pclmul mousedev joydev crc32_pclmul snd_usb_audio ghash_clmulni_intel snd_usbmidi_lib spi_intel_platform snd_hda_codec_realtek iTCO_wdt intel_pmc_bxt at24 snd_hda_codec_generic ledtrig_audio spi_intel ppdev iTCO_vendor_support aesni_intel snd_hda_codec_hdmi snd_rawmidi crypto_simd snd_hda_intel cryptd snd_intel_dspcfg snd_seq_device r8169 snd_intel_sdw_acpi rapl xone_dongle(OE) realtek xone_gip_bus(OE) i2c_i801 snd_hda_codec intel_cstate cfg80211 mdio_devres snd_hda_core vfat fat intel_uncore mc i2c_smbus pcspkr snd_hwdep libphy lpc_ich radeon rfkill snd_pcm snd_timer snd drm_ttm_helper soundcore parport_pc parport mac_hid wireguard curve25519_x86_64 libchacha20poly1305 chacha_x86_64 poly1305_x86_64 libcurve25519_generic libchacha ip6_udp_tunnel udp_tunnel dm_multipath dm_mod sg fuse bpf_preload
Jun 29 19:04:02 brutebox kernel: ip_tables x_tables ext4 crc32c_generic crc16 mbcache jbd2 usbhid crc32c_intel xhci_pci xhci_pci_renesas i915 intel_gtt drm_buddy video drm_dp_helper ttm vfio_pci vfio_pci_core irqbypass vfio_virqfd vfio_iommu_type1 vfio
Jun 29 19:04:02 brutebox kernel: CR2: ffffb135c08f1ffc
Jun 29 19:04:02 brutebox kernel: ---[ end trace 0000000000000000 ]---
Jun 29 19:04:02 brutebox kernel: RIP: 0010:radeon_ring_backup+0xc2/0x160 [radeon]
Jun 29 19:04:02 brutebox kernel: Code: 49 c1 e6 02 4c 89 f7 e8 7c 16 dd c1 49 89 45 00 48 89 c2 48 85 c0 74 5c 48 8b 4b 10 41 8d 47 01 45 89 ff 23 43 5c 4a 8d 34 b9 <8b> 36 89 32 41 83 fc 01 74 29 ba 04 00 00 00 eb 04 48 8b 4b 10 8d
Jun 29 19:04:02 brutebox kernel: RSP: 0018:ffffb131c0d93a30 EFLAGS: 00010246
Jun 29 19:04:02 brutebox kernel: RAX: 0000000000000000 RBX: ffff93d3c0a09620 RCX: ffffb131c08f2000
Jun 29 19:04:02 brutebox kernel: RDX: ffff93d66b500000 RSI: ffffb135c08f1ffc RDI: 000000000003968f
Jun 29 19:04:02 brutebox kernel: RBP: ffff93d3c0a09600 R08: 0000000000039688 R09: 0000000000000006
Jun 29 19:04:02 brutebox kernel: R10: 0000000000000000 R11: 0000000000000000 R12: 000000000000dd41
Jun 29 19:04:02 brutebox kernel: R13: ffffb131c0d93a98 R14: 0000000000037504 R15: 00000000ffffffff
Jun 29 19:04:02 brutebox kernel: FS: 00007fc1e3dff640(0000) GS:ffff93d6e0080000(0000) knlGS:0000000000000000
Jun 29 19:04:02 brutebox kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jun 29 19:04:02 brutebox kernel: CR2: ffffb135c08f1ffc CR3: 000000011835c004 CR4: 00000000001706e0
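
One possible reading of the register dump above, offered as a sketch rather than a confirmed diagnosis: the bytes `4a 8d 34 b9` just before the faulting `<8b> 36` in the Code line decode as `lea rsi,[rcx+r15*4]`, and the faulting instruction is `mov esi,[rsi]`. Assuming RCX holds the base of the ring buffer mapping and R15 holds a ring index in dwords (both are assumptions, since the report does not say what the registers hold), an index of 0xffffffff reproduces the faulting address in CR2 exactly. The short check below redoes that arithmetic with the values from the log; the helper name `effective_address` is made up for illustration.

```python
# Recompute the faulting address from the register dump in the oops.
# Interpretation of the registers is an assumption, not stated in the report.

MASK64 = (1 << 64) - 1  # x86_64 addresses wrap at 64 bits

RCX = 0xFFFFB131C08F2000  # assumed: base address of the ring buffer mapping
R15 = 0x00000000FFFFFFFF  # assumed: ring index (in 4-byte dwords)
CR2 = 0xFFFFB135C08F1FFC  # faulting address reported by the kernel


def effective_address(base: int, index: int, scale: int = 4) -> int:
    """Address computed by `lea rsi,[base + index*scale]`."""
    return (base + index * scale) & MASK64


fault = effective_address(RCX, R15)
print(hex(fault))    # 0xffffb135c08f1ffc
print(fault == CR2)  # True
```

If that reading is right, the index was all ones when radeon_ring_backup() tried to copy the ring contents, which would be consistent with a register read from a hung GPU returning 0xffffffff. That last step is speculation and would need confirmation upstream.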

Additional info:
* package version(s)
linux: 5.18.7-arch1-1
* config and/or log files etc.
* link to upstream bug report, if any

Steps to reproduce:
I was not able to identify any particular steps; the lockups appear to be random.

Closed by Toolybird (Toolybird)
Friday, 02 September 2022, 05:23 GMT
Reason for closing: None
Additional comments about closing: Reporter says "I switched my GPU to Nvidia" so there's no point in keeping this open.
Comment by Marius Kleber (MK13) - Wednesday, 29 June 2022, 17:56 GMT
Not sure what other information is required. If you need anything else, I will of course add it.
Comment by Curtis (foxcm2000) - Monday, 18 July 2022, 16:56 GMT
Hardware?
Comment by Marius Kleber (MK13) - Monday, 18 July 2022, 17:27 GMT
01:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Cayman PRO [Radeon HD 6950] (prog-if 00 [VGA controller])
Subsystem: PC Partner Limited / Sapphire Technology Device e186
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Latency: 0, Cache Line Size: 64 bytes
Interrupt: pin A routed to IRQ 30
Region 0: Memory at e0000000 (64-bit, prefetchable) [size=256M]
Region 2: Memory at f0020000 (64-bit, non-prefetchable) [size=128K]
Region 4: I/O ports at e000 [size=256]
Expansion ROM at 000c0000 [disabled] [size=128K]
Capabilities: <access denied>
Kernel driver in use: radeon
Kernel modules: radeon, amdgpu

If you suspect the condition of the hardware: it looks OK to me. The temperature is within spec and there are no rendering issues or anything like that.
Comment by Curtis (foxcm2000) - Monday, 18 July 2022, 18:17 GMT
I was just wondering which GPU model (in your case a 6950) was involved because often driver bugs only affect some types of hardware.
Comment by Toolybird (Toolybird) - Tuesday, 02 August 2022, 07:42 GMT
Is this still happening? You might have to report it upstream. Have you fully read [1]? It's a long shot but someone here [2] said they needed to add radeon.dpm=0 for stability.

[1] https://wiki.archlinux.org/title/ATI
[2] https://old.reddit.com/r/linuxquestions/comments/pa6y3w/which_driver_for_amd_ati_radeon_hd/
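
For anyone trying the radeon.dpm=0 suggestion, a quick way to confirm the parameter actually reached the running kernel is to look at /proc/cmdline after rebooting. The following is a minimal sketch under that assumption; the function names are hypothetical, the command lines used in testing would be made-up examples, and the simple whitespace split does not handle quoted parameter values that contain spaces.

```python
from pathlib import Path


def has_kernel_param(cmdline: str, name: str, value: str) -> bool:
    """Return True if `name=value` appears as a whole token in the command line."""
    return f"{name}={value}" in cmdline.split()


def read_cmdline(path: str = "/proc/cmdline") -> str:
    """Read the command line the running kernel was booted with (Linux only)."""
    return Path(path).read_text()


if __name__ == "__main__":
    try:
        active = has_kernel_param(read_cmdline(), "radeon.dpm", "0")
        print("radeon.dpm=0 active:", active)
    except OSError as exc:
        # /proc/cmdline is missing or unreadable, e.g. on a non-Linux system
        print("could not read kernel command line:", exc)
```

How the parameter gets onto the command line depends on the boot loader in use, so that part is left out here.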
Comment by Marius Kleber (MK13) - Tuesday, 02 August 2022, 07:50 GMT
I switched my GPU to Nvidia on Sunday (https://bbs.archlinux.org/viewtopic.php?id=278559) so I can't verify anything anymore, sorry :( The last time it happened was Thursday or Friday of last week IIRC, so it was still happening on 5.18.12.

Should I still report upstream, to at least have it documented there as well?
