FS#76844 - [linux] Irregular but persistent Kernel Panic since 6.x
Attached to Project:
Arch Linux
Opened by wtf (oi_wtf) - Friday, 16 December 2022, 11:24 GMT
Last edited by Toolybird (Toolybird) - Saturday, 17 December 2022, 20:44 GMT
Details
Description:
When using any 6.x kernel I get a kernel panic quite consistently, but at very irregular intervals. Sometimes the machine runs fine for 5 days before it hits the panic, sometimes it only takes 10 minutes.

I've had this issue since the first 6.0 release I tried, but I needed that PC for daily work, and the linux-lts kernel worked and still works fine right now (5.15.82-1-lts), so I switched to that at first. I also did not have any stack traces from a non-tainted kernel, since I needed some out-of-tree modules. So it took a while until I could test and play around a bit. I set up a serial connection after a few panics and collected some of the panic messages.

I have now tried the 6.1 kernel, and the problem still persists. But since I've got a little more leeway and time to mess with the PC, I could collect some traces in a non-tainted state. I was thinking about bisecting, but that could take months since the panic occurs so infrequently. Every step could take a week or more, so that's not really feasible for me.

But maybe someone can deduce something from the stack traces. They fail fairly consistently somewhere near the cpuidle_enter or cpuidle_enter_state function, but with very different errors printed as the cause: "unable to handle page fault", "scheduling while atomic", "kernel NULL pointer dereference", "general protection fault", and "Kernel stack is corrupted" (with 6.0.5). Sounds like maybe the stack got corrupted, but in a slightly different way each time? And I think it should not be defective hardware like RAM, since linux-lts kernels work perfectly fine.

Weirdly, only that single machine has this problem. Four other machines of mine (2 laptops, 2 custom-built PCs, one of them running 24/7) are perfectly fine with 6.x kernels. One of them is Intel-based, but most have Ryzen processors, too: an R7 4800H, R5 1600, R5 5600G, and an i7-3630QM.
Additional info:
* package version(s): 6.0.0 - 6.1.0.arch1
* config and/or log files etc.: see attachments
* kernel cmdline: root=UUID=a...7 rw loglevel=6 audit=0 no_console_suspend console=tty0 console=ttyS0,115200n8 resume=UUID=2...4 amdgpu.ppfeaturemask=0xffffffff luks.name=4... (and a few more luks partitions)
* no LVM, just luks with btrfs or ext4 directly on top

Hardware info:
* CPU: R9 3900X
* GPU: RX 6800XT
* MB: TUF GAMING X570-PLUS, BIOS 4403 04/28/2022

Steps to reproduce:
- none, I was not able to find a way to explicitly trigger this
Closed by Toolybird (Toolybird)
Saturday, 17 December 2022, 20:44 GMT
Reason for closing: None
Additional comments about closing: Reporter says "problem solved by resetting bios settings to defaults and only re-applying sane settings"
kernel-panic_6.0.3-arch3-1.txt
Overclock much? Despite it being stable on LTS, it still smells very much like a hardware issue..
> Overclock much
Nope, not at all (to my knowledge).
I originally added that amdgpu.ppfeaturemask for my RX480, which I overclocked a bit towards the end.
When RX6000 cards became affordable and I replaced that RX480, I removed all overclocks, but kept the parameter since I thought it would not matter much
unless an overclock actually gets applied using sysfs or wherever that knob was?
I'll remove that, though, I don't think I'm going to overclock the new card soon anyway.
> Using taskset stress your CPU 10 (Core 5) and CPU 22 (Core 11) with mprime and you'll probably find errors.
oh, I did not notice it always was the same two CPUs/cores/threads. Thanks for pointing that out.
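For anyone wanting to check their own traces for the same pattern, here is a minimal sketch that tallies which logical CPU each trace names. It assumes the traces contain the usual `CPU: <n> PID: <m>` header line; the function name `count_panic_cpus` is made up for illustration.

```python
import re
from collections import Counter

# Assumption: each oops/panic trace has a header like "CPU: <n> PID: <m> Comm: ...".
CPU_HEADER = re.compile(r"\bCPU:\s*(\d+)\s+PID:\s*\d+")

def count_panic_cpus(text):
    """Return a Counter mapping logical CPU number -> number of trace headers."""
    return Counter(int(m.group(1)) for m in CPU_HEADER.finditer(text))
```

Feeding it the contents of each collected log (for example the serial console captures) and summing the counters should make a repeat offender stand out quickly.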
Using mprime I was actually able to trigger the issue on the LTS kernel, so that seems to confirm this is actually a hardware issue.
(Though mprime seems to ignore taskset, so I went and stressed all cores... but CPU 10 panicked again this time (not in cpuidle though), so it does seem like something *is* up with that one... mprime even complained about a rounding error before the panic, so it really points toward a hardware issue, I guess...)
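Since taskset did not seem to stick here, a rough sketch of pinning from inside the process instead, using Python's Linux-only `os.sched_setaffinity`. The function name, the SHA-256 workload and the default sizes are assumptions for illustration only. This is far lighter than mprime, so a clean run proves nothing; it only shows how to keep a repeatable workload on one logical CPU and compare results.

```python
import hashlib
import os

def pinned_consistency_check(cpu, rounds=200, size=1 << 20):
    """Pin this process to one logical CPU and repeat a deterministic workload.

    Returns the number of rounds whose result differed from the first one.
    """
    previous = os.sched_getaffinity(0)      # remember the current CPU mask
    os.sched_setaffinity(0, {cpu})          # raises OSError if the CPU is not usable
    try:
        data = bytes(range(256)) * (size // 256)
        reference = hashlib.sha256(data).digest()
        mismatches = 0
        for _ in range(rounds):
            # On healthy hardware the same input always gives the same digest.
            if hashlib.sha256(data).digest() != reference:
                mismatches += 1
        return mismatches
    finally:
        os.sched_setaffinity(0, previous)   # restore the original CPU mask
```

Calling `pinned_consistency_check(10)` and `pinned_consistency_check(22)` would target the two suspect CPUs from this report; any non-zero return value would be a strong hint of a hardware problem.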
Weirdly, compiling stuff like PHP and other multithreaded workloads that somewhat stressed all cores
did not trigger this on LTS or 6.x kernels. As far as I remember, every time it panicked the PC was only doing lightweight stuff like
browsing the web, SSH, ... never during gaming, compiling, or converting video, which somewhat threw me off testing in that direction.
Now that I seem to be able to reliably trigger this, I think I'll also try updating and resetting bios/firmware, to compare, maybe something went wrong there...
I didn't do that yet, since I thought if LTS and pre-6.x kernels worked fine, it wouldn't be that.
But seems like that assumption was wrong.
Anyway, thanks for pointing me in the right direction.
I must have either accidentally set something weird, or the BIOS settings profile restore after an update went wrong...
That and 6.x must have made it a lot more likely to hit the problem in day to day use, so with 5.x kernels it went unnoticed for a while.
(I hadn't touched bios for at least half a year)
Sorry for the noise.
I feel like that's something I should have tried before reporting a bug.