FS#63359 - Can't boot after upgrading to Linux 5.2.5. Kernel BUG.

Attached to Project: Arch Linux
Opened by Omar (OmarSquircleArt) - Sunday, 04 August 2019, 12:17 GMT
Last edited by Antonio Rojas (arojas) - Sunday, 05 January 2020, 21:03 GMT
Task Type Bug Report
Category Kernel
Status Closed
Assigned To No-one
Architecture x86_64
Severity Critical
Priority Normal
Reported Version
Due in Version Undecided
Due Date Undecided
Percent Complete 100%
Votes 0
Private No

Details

Description:

After upgrading linux and linux-headers from 5.1.7 to 5.2.5, linux hangs at boot. I found the following kernel messages at the very end of the systemd journal:

```
Aug 04 11:47:38 omar kernel: BUG: unable to handle page fault for address: ffff91fe41bdf5f8
Aug 04 11:47:38 omar kernel: #PF: supervisor read access in kernel mode
Aug 04 11:47:38 omar kernel: #PF: error_code(0x0000) - not-present page
```

I attached the full log below. I downgraded linux and linux-headers to 5.1.7 and everything worked as expected. This led me to believe that the kernel is the issue.

Additional info:
* Systemd : 242.84.
* Using systemd-boot.
* Ryzen 7 1700. Latest ASUS BIOS and AMD microcode.

Steps to reproduce:

Upgrade to linux 5.2.5 and reboot. Linux will hang before entering login or tty.

Attachment: log.log (91.4 KiB)

Closed by  Antonio Rojas (arojas)
Sunday, 05 January 2020, 21:03 GMT
Reason for closing:  Fixed
Comment by loqs (loqs) - Sunday, 04 August 2019, 15:57 GMT
Comment by Omar (OmarSquircleArt) - Sunday, 04 August 2019, 16:44 GMT
@loqs Isn't this related to the amdgpu xorg driver? In my case, I never started the X server; Linux hung even before login.
Comment by loqs (loqs) - Sunday, 04 August 2019, 16:54 GMT
Can you rule out the amdgpu kernel module by blacklisting it [1] on the command line?
Is a backtrace printed on the console after the BUG output?

[1] https://wiki.archlinux.org/index.php/Kernel_module#Using_kernel_command_line_2
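
For reference, a minimal sketch of making that blacklist persistent on a systemd-boot setup, assuming GNU sed and a loader entry whose kernel parameters sit on a single `options` line. The function name `blacklist_module` and the entry path in the comments are hypothetical; `module_blacklist=` is the kernel parameter that prevents the listed modules from loading. For a one-off test it is simpler to edit the entry from the boot menu instead.

```shell
#!/bin/sh
# Append module_blacklist=<module> to the "options" line of a
# systemd-boot loader entry. The entry path is an assumption;
# check /boot/loader/entries/ for the real file name.
blacklist_module() {
    entry="$1"   # e.g. /boot/loader/entries/arch.conf (hypothetical name)
    module="$2"  # e.g. amdgpu

    # Leave the file alone if the parameter is already present.
    grep -q "module_blacklist=$module" "$entry" && return 0

    # Add the parameter to the end of the options line (GNU sed, in place).
    sed -i "/^options[[:space:]]/ s/\$/ module_blacklist=$module/" "$entry"
}
```

Calling `blacklist_module /boot/loader/entries/arch.conf amdgpu` as root and rebooting would then boot without amdgpu; removing the parameter from the `options` line undoes it.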
Comment by Omar (OmarSquircleArt) - Sunday, 04 August 2019, 18:58 GMT
@loqs No, unfortunately there is no backtrace.
I will try blacklisting and get back to you later today.
Comment by Omar (OmarSquircleArt) - Sunday, 04 August 2019, 20:42 GMT
@loqs I just blacklisted the amdgpu module and everything worked as expected, so it appears amdgpu is the root of the issue after all.

It should be noted that I had Dynamic Power Management (DPM) disabled by passing the parameter `amdgpu.dpm=0` to the kernel. I just tried enabling DPM by removing the parameter, and the system booted successfully. However, with DPM enabled my system is completely unusable: the screen goes black after 2 minutes of use, and nothing I tried fixed that, so I have always kept DPM disabled. Nothing meaningful is logged, so I can't make a proper bug report.

How should we take it from here? Do you have any suggestions?
Thanks!
Comment by Omar (OmarSquircleArt) - Sunday, 04 August 2019, 21:32 GMT
Comment by loqs (loqs) - Monday, 05 August 2019, 16:41 GMT
Does adding the boot option iommu=pt or iommu=off have any effect?
Comment by Omar (OmarSquircleArt) - Monday, 05 August 2019, 21:43 GMT
Neither `iommu=pt` nor `iommu=off` had any effect; the system still didn't boot. This was tested with `amdgpu.dpm=0`.
Comment by loqs (loqs) - Monday, 05 August 2019, 22:15 GMT
You could try [1] or [2] to see if the issue has been fixed upstream already.
Alternatively, you could bisect between 5.1 and 5.2 to find the cause and report the result upstream; see [3] for instructions.

[1] https://aur.archlinux.org/packages/linux-mainline/
[2] https://aur.archlinux.org/packages/linux-amd-staging-drm-next-git/
[3] https://bbs.archlinux.org/viewtopic.php?pid=1855912#p1855912
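
As a rough sketch of how such a bisect is started, assuming an existing clone of the kernel source that contains the `v5.1` and `v5.2` tags: the helper name `start_bisect` is hypothetical, and building and booting each candidate kernel is left to the instructions in [3].

```shell
#!/bin/sh
# Start a bisect in an existing kernel checkout.
# Arguments: repository path, known-good revision, known-bad revision.
start_bisect() {
    repo="$1"
    good="$2"   # e.g. v5.1, which boots
    bad="$3"    # e.g. v5.2, which hangs

    # "git bisect start <bad> <good>" marks both ends and checks out
    # a commit roughly halfway between them for testing.
    git -C "$repo" bisect start "$bad" "$good"
}
```

After `start_bisect ~/linux v5.1 v5.2` (path assumed), build and boot the checked-out commit, then run `git bisect good` or `git bisect bad` in the repository depending on the result. Repeat until git names the first bad commit, and finish with `git bisect reset`.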
Comment by Omar (OmarSquircleArt) - Monday, 05 August 2019, 22:50 GMT
Alright. I will try to test mainline and amd-staging-drm-next-git.
Not sure if I will have the time to bisect right now. I will report back as soon as possible.

Thanks!
Comment by Omar (OmarSquircleArt) - Wednesday, 07 August 2019, 08:12 GMT
`linux-amd-staging-drm-next-git-5.3.841008.c5942cbe0164` doesn't boot. So the issue is still there upstream. I guess our only option now is bisecting.
Comment by Omar (OmarSquircleArt) - Wednesday, 18 September 2019, 20:13 GMT
Bug report in the AMDGPU bug tracker:
https://bugs.freedesktop.org/show_bug.cgi?id=111685
