Arch Linux

FS#63359 - Can't boot after upgrading to Linux 5.2.5. Kernel BUG.

Attached to Project: Arch Linux
Opened by Omar (OmarSquircleArt) - Sunday, 04 August 2019, 12:17 GMT
Task Type: Bug Report
Category: Kernel
Status: Unconfirmed
Assigned To: No-one
Architecture: x86_64
Severity: Critical
Priority: Normal
Reported Version:
Due in Version: Undecided
Due Date: Undecided
Percent Complete: 0%
Votes: 0
Private: No



After upgrading linux and linux-headers from 5.1.7 to 5.2.5, linux hangs at boot. I found the following kernel messages at the very end of the systemd journal:

Aug 04 11:47:38 omar kernel: BUG: unable to handle page fault for address: ffff91fe41bdf5f8
Aug 04 11:47:38 omar kernel: #PF: supervisor read access in kernel mode
Aug 04 11:47:38 omar kernel: #PF: error_code(0x0000) - not-present page
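For readers unfamiliar with the `#PF` lines: the text the kernel prints is derived from the bits of the x86 page-fault error code. The following is a minimal sketch of that decoding, not the kernel's own code; the constant and function names are chosen for illustration, and only the lower five bits are handled.

```python
# Bit meanings of the x86 page-fault error code (lower five bits only).
PF_PROT = 1 << 0   # 0: page not present, 1: protection violation
PF_WRITE = 1 << 1  # 0: read access, 1: write access
PF_USER = 1 << 2   # 0: supervisor access, 1: user access
PF_RSVD = 1 << 3   # 1: reserved bit set in a page-table entry
PF_INSTR = 1 << 4  # 1: the fault was caused by an instruction fetch


def decode_pf_error_code(code):
    """Return a list of human-readable descriptions for an error code."""
    if code & PF_INSTR:
        access = "instruction fetch"
    elif code & PF_WRITE:
        access = "write access"
    else:
        access = "read access"
    parts = [
        "protection violation" if code & PF_PROT else "not-present page",
        access,
        "user mode" if code & PF_USER else "supervisor mode",
    ]
    if code & PF_RSVD:
        parts.append("reserved bit violation")
    return parts


# The error code from the journal excerpt above.
print(decode_pf_error_code(0x0000))
# → ['not-present page', 'read access', 'supervisor mode']
```

An error code of 0 therefore matches the two messages in the journal: a supervisor read of a page that was not mapped.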

I attached the full log below. I downgraded linux and linux-headers to 5.1.7 and everything worked as expected. This led me to believe that the kernel is the issue.

Additional info:
* systemd: 242.84.
* Using systemd-boot.
* Ryzen 7 1700. Latest ASUS BIOS and AMD microcode.

Steps to reproduce:

Upgrade to linux 5.2.5 and reboot. The system hangs before reaching a login prompt or TTY.
Comment by loqs (loqs) - Sunday, 04 August 2019, 15:57 GMT
Comment by Omar (OmarSquircleArt) - Sunday, 04 August 2019, 16:44 GMT
@loqs Isn't this related to the amdgpu xorg driver? In my case, I never started the X server; Linux hung even before login.
Comment by loqs (loqs) - Sunday, 04 August 2019, 16:54 GMT
Can you rule out the amdgpu kernel module by blacklisting it [1] on the command line?
Is a backtrace printed on the console after the BUG output?
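
Since the report says systemd-boot is in use, kernel parameters live on the `options` line of a loader entry file. Below is a rough sketch of adding or removing a parameter on that line. The entry contents are hypothetical (the real file is not shown in this report), `update_options` is a helper invented for this example, and `module_blacklist=amdgpu` is assumed to be the kind of command-line blacklisting meant by [1].

```python
def update_options(entry_text, add=(), remove=()):
    """Return entry_text with kernel parameters added to or removed from
    its 'options' line. All other lines are passed through unchanged."""
    out = []
    for line in entry_text.splitlines():
        parts = line.split(None, 1)
        if parts and parts[0] == "options":
            rest = parts[1] if len(parts) > 1 else ""
            params = [p for p in rest.split() if p not in remove]
            params += [p for p in add if p not in params]
            line = "options " + " ".join(params)
        out.append(line)
    return "\n".join(out) + "\n"


# Hypothetical loader entry; paths and root device are placeholders.
entry = """title Arch Linux
linux /vmlinuz-linux
initrd /initramfs-linux.img
options root=/dev/sda2 rw amdgpu.dpm=0
"""

# Keep the amdgpu module from loading on the next boot.
print(update_options(entry, add=["module_blacklist=amdgpu"]))
```

The same helper covers the other experiments in this thread, for example `update_options(entry, remove=["amdgpu.dpm=0"])` or `update_options(entry, add=["iommu=pt"])`.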

Comment by Omar (OmarSquircleArt) - Sunday, 04 August 2019, 18:58 GMT
@loqs No, unfortunately there is no backtrace.
I will try blacklisting and get back to you later today.
Comment by Omar (OmarSquircleArt) - Sunday, 04 August 2019, 20:42 GMT
@loqs I just blacklisted the amdgpu module and everything worked as expected, so it appears amdgpu is the root of the issue after all.

It should be noted that I had Dynamic Power Management (DPM) disabled by passing the parameter `amdgpu.dpm=0` to the kernel. I just tried enabling DPM by removing the parameter, and the system booted successfully. However, with DPM enabled my system is completely unusable: the screen goes black after 2 minutes of use, and nothing I tried fixed that, which is why I always had DPM disabled. Nothing meaningful is logged, so I can't make a proper bug report.

How should we take it from here? Do you have any suggestions?
Comment by Omar (OmarSquircleArt) - Sunday, 04 August 2019, 21:32 GMT
Comment by loqs (loqs) - Monday, 05 August 2019, 16:41 GMT
Does adding the boot option iommu=pt or iommu=off have any effect?
Comment by Omar (OmarSquircleArt) - Monday, 05 August 2019, 21:43 GMT
Neither `iommu=pt` nor `iommu=off` had any effect; the system still didn't boot. This was tested with `amdgpu.dpm=0`.
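
When testing combinations like these, it can help to confirm which parameters a given boot actually received. A minimal sketch, assuming the standard `/proc/cmdline` file is readable; the sample command line is hypothetical, and quoted values containing spaces are not handled.

```python
def parse_cmdline(cmdline):
    """Split a kernel command line into a {name: value} dict.
    Flags without '=' (such as 'rw') map to None."""
    params = {}
    for token in cmdline.split():
        name, sep, value = token.partition("=")
        params[name] = value if sep else None
    return params


def read_cmdline(path="/proc/cmdline"):
    """Parse the command line of the currently running kernel."""
    with open(path) as f:
        return parse_cmdline(f.read())


# Hypothetical command line for one of the test boots.
sample = "root=/dev/sda2 rw amdgpu.dpm=0 iommu=pt"
params = parse_cmdline(sample)
print(params.get("iommu"), params.get("amdgpu.dpm"))
# → pt 0
```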
Comment by loqs (loqs) - Monday, 05 August 2019, 22:15 GMT
You could try [1] or [2] to see if the issue has been fixed upstream already.
Alternatively, you could bisect between 5.1 and 5.2 to find the cause and report the result upstream; see [3] for instructions.
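
Bisecting is a binary search over the commit history: `git bisect` repeatedly checks out a commit halfway between the last known good and the first known bad one, and each build-and-boot test halves the remaining range. The sketch below illustrates the idea only; the commit list and the `is_bad` predicate are hypothetical stand-ins for building a kernel and checking whether it boots.

```python
def first_bad(commits, is_bad):
    """Return (index of the first bad commit, number of tests run).

    Assumes commits[0] is known good, commits[-1] is known bad, and
    every commit after the first bad one is also bad.
    """
    good, bad = 0, len(commits) - 1
    tests = 0
    while bad - good > 1:
        mid = (good + bad) // 2
        tests += 1
        if is_bad(commits[mid]):
            bad = mid   # the regression is at mid or earlier
        else:
            good = mid  # the regression is after mid
    return bad, tests


# Hypothetical history of 1000 commits in which commit 617 broke booting.
commits = list(range(1000))
index, tests = first_bad(commits, lambda c: c >= 617)
print(index, tests)
```

The number of tests grows with the logarithm of the range, which is why bisecting a full release cycle is tedious but feasible.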

Comment by Omar (OmarSquircleArt) - Monday, 05 August 2019, 22:50 GMT
Alright. I will try to test mainline and amd-staging-drm-next-git.
Not sure if I will have the time to bisect right now. I will report back as soon as possible.

Comment by Omar (OmarSquircleArt) - Wednesday, 07 August 2019, 08:12 GMT
`linux-amd-staging-drm-next-git-5.3.841008.c5942cbe0164` doesn't boot. So the issue is still there upstream. I guess our only option now is bisecting.
Comment by Omar (OmarSquircleArt) - Wednesday, 18 September 2019, 20:13 GMT
Bug report in the AMDGPU bug tracker: