FS#59945 - [linux] 4.18.* rcu_preempt detected stalls on CPUs / tasks (kernel panic?)

Attached to Project: Arch Linux
Opened by yo (yo_arch) - Wednesday, 05 September 2018, 15:00 GMT
Last edited by Doug Newgard (Scimmia) - Thursday, 20 September 2018, 14:31 GMT
Task Type Bug Report
Category Kernel
Status Closed
Assigned To No-one
Architecture x86_64
Severity Critical
Priority Normal
Reported Version
Due in Version Undecided
Due Date Undecided
Percent Complete 100%
Votes 2
Private No

Details

Hello everyone

Description:
After updating the kernel from 4.17.9 to 4.18.[1-5], the system can no longer boot (a kernel panic, I guess).

At boot I get in order:
loading kernel
loading initramfs

then, new screen with:
starting version 239

and then nothing for a few minutes (it doesn't ask me for my LUKS partition password as it normally does)

then I get these messages (this is approximate; I cannot copy/paste and there are no logs on the system):
info rcu_preempt detected stalls on CPUs / tasks
nonlazy_posted
rcu_preempt kthread starved for jiffies
RCU grace-period kthread stack dump

And a few minutes later, more or less the same messages, and again, and again.


However, about one boot in 15 works and asks me for my password directly after "starting version 239".

I tried 4.18.1 and 4.18.5 with the same issue and have once again downgraded to kernel 4.17.9.


Additional info:
cat /proc/version
Linux version 4.18.5-arch1-1-ARCH (builduser@heftig-12250) (gcc version 8.2.0 (GCC)) #1 SMP PREEMPT Fri Aug 24 12:48:58 UTC 2018
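
For anyone wanting to check a machine quickly, here is a minimal sketch that parses a string like the one above and tells whether the kernel falls in the range discussed in this task. The range itself is an assumption taken from this report only: 4.18.0 up to, but not including, 4.18.9 (the version named in the closing comment). The function and constant names are hypothetical, and the check says nothing about other kernel series or about whether a given CPU is actually affected.

```python
import re

# Assumption: range taken from this bug report, not from upstream documentation.
FIRST_AFFECTED = (4, 18, 0)
FIRST_FIXED = (4, 18, 9)


def parse_kernel_version(text):
    """Return (major, minor, patch) from a /proc/version or `uname -r` style string."""
    match = re.search(r"(\d+)\.(\d+)(?:\.(\d+))?", text)
    if match is None:
        raise ValueError("no kernel version found in: %r" % text)
    major, minor, patch = match.groups()
    # A missing patch level (e.g. "4.18") is treated as 0.
    return int(major), int(minor), int(patch or 0)


def in_reported_range(text):
    """True if the version lies in the 4.18.x range reported as stalling here."""
    return FIRST_AFFECTED <= parse_kernel_version(text) < FIRST_FIXED
```

Passing the contents of /proc/version (or the output of `uname -r`) to in_reported_range() is enough; the first dotted number in the string is taken as the kernel version.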

My computer is an Acer with the following characteristics (see attachment)

Steps to reproduce:
Turn on the laptop and let it boot.

Closed by  Doug Newgard (Scimmia)
Thursday, 20 September 2018, 14:31 GMT
Reason for closing:  Fixed
Additional comments about closing:  linux 4.18.9.arch1-1
Comment by loqs (loqs) - Wednesday, 05 September 2018, 17:33 GMT
Should be fixed by the patch from
https://lore.kernel.org/lkml/20180905084158.GR24124%40hirez.programming.kicks-ass.net/
Edit:
Fixed up the link; Flyspray parses it as a URL and an email address.
Comment by Gicu Gorodenco (medved) - Wednesday, 05 September 2018, 21:08 GMT
I have the same problem.
I bet it's a CPU-linked kernel bug, as I also have an Intel Core 2 Duo (a slightly older version though).
UPDATE: worked around by turning "Off" the "Intel SpeedStep" mode in BIOS options.
On my Dell XPS M1530 it's under "Performance -> SpeedStep Enable".
Comment by yo (yo_arch) - Friday, 07 September 2018, 10:10 GMT
Thanks for the replies.

I checked the Intel SpeedStep but I don't have this option in my BIOS.

What am I supposed to do with this patch? I mean, I am going to apply it to the 4.18.6 Linux kernel, compile it and then install it.
But will this patch be applied to the next official kernels, or will I have to download the source code of every new kernel, then compile and install it manually?
Comment by loqs (loqs) - Friday, 07 September 2018, 10:33 GMT
The patch has been accepted into https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git/commit/?h=timers/urgent&id=e2c631ba75a7e727e8db0a9d30a06bfd434adb3a
It should then be pulled into mainline; as it is marked for stable 4.18+, it will then be queued for a future 4.18 stable release.
The package maintainer could apply it sooner. This does of course assume this fixes the issue for you.
Comment by Siegfried Metz (NiceGuy) - Saturday, 08 September 2018, 14:09 GMT
Also, try the workarounds, in the form of additional kernel boot parameters, that we found at https://bbs.archlinux.org/viewtopic.php?id=239672
Either "clocksource=hpet" or "tsc=unstable" should equally do the trick to avoid the early boot stalling.
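
To confirm that one of these parameters actually reached the running kernel, a small sketch like the following can help. It only inspects a kernel command line string for the two parameters named above; the function names are hypothetical. The paths /proc/cmdline and /sys/devices/system/clocksource/clocksource0/current_clocksource are the standard Linux locations, but whether they are readable on a given system is an assumption.

```python
# The two workaround parameters mentioned in this comment.
WORKAROUNDS = {"clocksource=hpet", "tsc=unstable"}

CLOCKSOURCE_PATH = "/sys/devices/system/clocksource/clocksource0/current_clocksource"


def active_workarounds(cmdline):
    """Return the workaround parameters present in a kernel command line string."""
    return sorted(set(cmdline.split()) & WORKAROUNDS)


def read_text(path):
    """Read a small text file such as /proc/cmdline and strip the trailing newline."""
    with open(path) as handle:
        return handle.read().strip()
```

On a booted system, active_workarounds(read_text("/proc/cmdline")) lists which of the two parameters are set, and read_text(CLOCKSOURCE_PATH) shows the clocksource the kernel is currently using.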

Some of the Intel Core 2 {Duo,Quad} are affected, but apparently not all of them.

The upcoming kernel 4.18.7, due for release tomorrow evening, will not contain the fix.
Comment by Siegfried Metz (NiceGuy) - Sunday, 09 September 2018, 17:51 GMT
The fix is already in Linus' tree: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=3567994a05ba6490f6055650fbb892c926ae7fca
Mainline kernel 4.19-rc3 is going to have it included.
Comment by yo (yo_arch) - Friday, 14 September 2018, 10:52 GMT
Thanks for the replies. I have been trying to apply the patch to the 4.18.6 kernel and compile it to check whether it fixes my issue.
However, I ran into a problem when generating the initramfs with mkinitcpio: https://bbs.archlinux.org/viewtopic.php?id=240379
Do you have any idea?
Comment by Siegfried Metz (NiceGuy) - Monday, 17 September 2018, 18:30 GMT
The patch to fix this issue for Intel Core 2 processors has landed in the 4.18 stable queue.
The 4.18.9 stable release is going to include it.

Here is the URL to it:
https://git.kernel.org/pub/scm/linux/kernel/git/stable/stable-queue.git/tree/queue-4.18/clocksource-revert-remove-kthread.patch
Comment by yo (yo_arch) - Monday, 17 September 2018, 20:14 GMT
Thanks a lot for your help.

I tried all the 4.18.* Linux kernels, all with the same boot issue. I applied the patch
https://lore.kernel.org/lkml/20180905084158.GR24124%40hirez.programming.kicks-ass.net/
to the 4.18.6 Linux kernel, installed it, and my boot issue is SOLVED!!
So this patch seems to be the solution to my kernel panic problem.

Could you tell me:
Who did code this patch and why? Was-it because of my bug opening?
How did you guess that it will fix my issue?
I dug into https://git.kernel.org but I don't understand the logic. How do you follow a patch and know when it will be committed/added to the mainline?

Once again thank you for your time.
Comment by yo (yo_arch) - Tuesday, 18 September 2018, 13:55 GMT
Comment by Siegfried Metz (NiceGuy) - Wednesday, 19 September 2018, 10:00 GMT
It might not be the proper place to answer your questions, but for once I will do it.

*) "Who did code this patch and why?"
The kernel developers/maintainers of the subsystem "timers" and source file clocksource.c, in this case Peter Zijlstra.
The patch to fix our boot stalling issue is just a revert patch, as the development for kernel 4.18 added source code
that stalled the boot process.
Since Intel Core 2 hardware is quite old by now (1 decade), it is important to report issues like this
upstream and raise awareness, as most developers tend to use more up-to-date hardware by now.

*) "Was-it because of my bug opening?"
No.

*) "How did you guess that it will fix my issue?"
No guessing at all involved!
I took the initiative and reported it to the upstream kernel mailing list.
Also, viktorj and I reported it upstream to the kernel bugzilla first and were told to take it to the related mailing list.
Several fellow Arch Linux users, including myself, already compiled a custom kernel with the patch applied,
booted successfully and through collaboration let the others know.
It was the hard work of Arch Linux users, who used git bisect - otherwise you're looking for the needle in the haystack.

*) "How do you follow a patch and know when it will be committed/added to the mainline?"
If you reported it upstream, then directly via e-mail; otherwise follow LKML or similar kernel mailing lists online (or subscribe to them).
Using git.kernel.org is also fine for commit logs, if you know where to look.


If you want more information about this issue and how we got to the point where stable kernel 4.18.9 includes the fix, read the corresponding Arch forum thread:
https://bbs.archlinux.org/viewtopic.php?id=239672

Please, if you have more questions, I kindly suggest that you ask them in the Arch forum instead.
Comment by Doug Newgard (Scimmia) - Wednesday, 19 September 2018, 23:43 GMT
linux 4.18.9.arch1-1 is in Testing, fixed?
Comment by Siegfried Metz (NiceGuy) - Thursday, 20 September 2018, 07:56 GMT
I tested 4.18.9 and it worked fine with at least 5/5 successful non-stalling boots.

[Edited to add]:
The issue is fixed. Had an additional "reboot session" and kernel 4.18.9 is rock solid.
