FS#67131 - [linux] 5.7.6 Hard system lockup with no journal information

Attached to Project: Arch Linux
Opened by LaserEyess (LaserEyess) - Saturday, 27 June 2020, 13:13 GMT
Last edited by freswa (frederik) - Sunday, 26 July 2020, 14:35 GMT
Task Type Bug Report
Category Packages: Core
Status Closed
Assigned To Tobias Powalowski (tpowa), Jan Alexander Steffens (heftig), Levente Polyak (anthraxx)
Architecture All
Severity High
Priority Normal
Reported Version
Due in Version Undecided
Due Date Undecided
Percent Complete 100%
Votes 1
Private No

Details

Description: Hard system lockup after upgrading to 5.7.6. There is no information in any logs I can find, and the machine becomes completely unresponsive: even pinging it doesn't work. I suspect it is related to amdgpu or drm because it happens when I start my window manager. Downgrading to 5.7.5 fixes this completely. Interestingly, this bug seems to be present in 5.4.49 as well, so potentially a backported fix gone wrong?


Additional info:
* linux 5.7.6
* mesa 20.1.2-1
* sway version 1.5-rc1-c8224270

Steps to reproduce:
1. Reboot
2. Start sway
3. Use computer as normal

I have captured this log with drm.debug=1 and debug=1 in my kernel cmdline: https://0x0.st/iJaj.txt (way larger than 2 MB). The end of the log is where the freeze occurs, but there is nothing interesting there.

My normal kernel cmdline is attached (cmdline.txt).
   cmdline (0.1 KiB)

Closed by  freswa (frederik)
Sunday, 26 July 2020, 14:35 GMT
Reason for closing:  Fixed
Additional comments about closing:  linux 5.7.8
Comment by loqs (loqs) - Saturday, 27 June 2020, 16:31 GMT
5.7.6 [1] and 5.4.49 [2] share many backports. Can you bisect either of the affected stable branches and locate the causal commit?

[1] https://cdn.kernel.org/pub/linux/kernel/v5.x/ChangeLog-5.7.6
[2] https://cdn.kernel.org/pub/linux/kernel/v5.x/ChangeLog-5.4.49

Edit:
Possibly related https://bbs.archlinux.org/viewtopic.php?id=256929

The same commit was backported to 5.4.49
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=972f961c5930ffa5de5472f7ced6e9b12bfbbf07
Comment by LaserEyess (LaserEyess) - Sunday, 28 June 2020, 00:05 GMT
I can attempt to bisect, but unfortunately I don't have time this weekend. I tried booting into 5.7.6 again and did not experience the crash for an hour. I'm going to do some more debugging during the week when I have time.

Upstream bug report for amdgpu: https://gitlab.freedesktop.org/drm/amd/-/issues/1191
Comment by LaserEyess (LaserEyess) - Monday, 29 June 2020, 21:57 GMT
Patch from AMD https://gitlab.freedesktop.org/drm/amd/uploads/70a8bb21134e484c776500208cf3c775/0001-drm-amd-display-Only-revalidate-bandwidth-on-medium-.patch

I've been using it for about 30 minutes now with no crashes whatsoever. There's a second confirmation in that thread as well, so I think this patch fixes the issue.
Comment by LaserEyess (LaserEyess) - Wednesday, 01 July 2020, 00:54 GMT
Another crash after ~24 hours. Unsure if it's related, but this is a paste of `journalctl -b-1 -k -e`. The actual crash happened between 20:15 and 20:30; I wasn't at the computer at the time. https://0x0.st/iJGb.txt

This is with the patch in the previous comment applied.
Comment by J. Andrew Lanz-O'Brien (jlanzobr) - Wednesday, 01 July 2020, 12:24 GMT
I am affected by this bug as well. Ryzen 3800X and Radeon 5700XT. Downgrading to 5.7.5 completely resolves the issue.
Comment by LaserEyess (LaserEyess) - Friday, 24 July 2020, 17:04 GMT
As of 5.7.8, which includes https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=b5232e2ee8df85891514c73472cac09921e5d51d, I think this bug is fixed. I am still getting some crashes, but I believe they're fundamentally unrelated to this particular issue.
