FS#70663 - [linux] 5.12.0-arch1-1 - fails to boot - watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [systemd-

Attached to Project: Arch Linux
Opened by James (thx1138) - Friday, 30 April 2021, 15:45 GMT
Last edited by Andreas Radke (AndyRTR) - Monday, 14 June 2021, 16:36 GMT
Task Type Bug Report
Category Packages: Testing
Status Closed
Assigned To Jan Alexander Steffens (heftig)
Architecture All
Severity Critical
Priority Normal
Reported Version
Due in Version Undecided
Due Date Undecided
Percent Complete 100%
Votes 0
Private No

Details

Upgraded to linux 5.12.arch1-1; the system now fails to boot.

System log throws:

...
watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [systemd-udevd: 241]
...
RIP: 0010:smp_call_function_single+0xf7/0x140
...
Call Trace:
? __flush_tlb_all+0x30/0x30
? __flush_tlb_all+0x30/0x30
on_each_cpu+0x39/0x90
...

and repeats indefinitely.

smp_call_function_single is defined in kernel/smp.c

For now, reverting to 5.11 or lts.

Closed by Andreas Radke (AndyRTR)
Monday, 14 June 2021, 16:36 GMT
Reason for closing: Fixed
Additional comments about closing: 5.12.10.arch1-1
Comment by James (thx1138) - Friday, 30 April 2021, 15:53 GMT
Intel Core2 T7200
Mobile Intel 945PM Express Chipset
ICH7-M
Comment by James (thx1138) - Friday, 30 April 2021, 17:05 GMT
Bug posted to linux-smp
Comment by env (ENV25) - Monday, 03 May 2021, 08:31 GMT
Comment by James (thx1138) - Monday, 03 May 2021, 09:43 GMT
$ git bisect bad
7c70f3a7488d2fa62d32849d138bf2b8420fe788 is the first bad commit
commit 7c70f3a7488d2fa62d32849d138bf2b8420fe788
Merge: 20bf195e9391 4d12b7275386
Author: Linus Torvalds <torvalds@linux-foundation.org>
Date: Mon Feb 22 13:29:55 2021 -0800

    Merge tag 'nfsd-5.12-1' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux

    Pull more nfsd updates from Chuck Lever:
     "Here are a few additional NFSD commits for the merge window:

      Optimization:
       - Cork the socket while there are queued replies

      Fixes:
       - DRC shutdown ordering
       - svc_rdma_accept() lockdep splat"

    * tag 'nfsd-5.12-1' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux:
      SUNRPC: Further clean up svc_tcp_sendmsg()
      SUNRPC: Remove redundant socket flags from svc_tcp_sendmsg()
      SUNRPC: Use TCP_CORK to optimise send performance on the server
      svcrdma: Hold private mutex while invoking rdma_accept()
      nfsd: register pernet ops last, unregister first

 fs/nfsd/nfsctl.c                         | 14 ++++++-------
 include/linux/sunrpc/svcsock.h           |  2 ++
 net/sunrpc/svcsock.c                     | 35 ++++++++++++++++----------------
 net/sunrpc/xprtrdma/svc_rdma_transport.c |  6 +++---
 4 files changed, 29 insertions(+), 28 deletions(-)

--------------

There is a small chance that this bisect is not precise, because sometimes the system can boot to a temporarily working state, then lock-up after a short time. I did not test every successful initial boot extensively.

This particular commit does not produce the same "watchdog: BUG: soft lockup" log message. Instead, after sometimes booting to an Xorg display, the system just completely freezes, with not so much as the system log still working.
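
A rough way to see why a single successful boot is weak evidence during a bisect of an intermittent lockup: assuming (hypothetically) that each boot of a bad kernel locks up independently with some fixed probability, the chance of wrongly marking that kernel "good" shrinks only geometrically with the number of clean boots. The sketch below is illustrative only; the function names and the probabilities passed to them are assumptions, not measurements from this report.

```python
def false_good_probability(p_lockup: float, boots: int) -> float:
    """Chance that a bad kernel survives `boots` independent test boots.

    p_lockup is an assumed per-boot lockup probability, not a measured one.
    """
    if not 0.0 < p_lockup <= 1.0:
        raise ValueError("p_lockup must be in (0, 1]")
    if boots < 1:
        raise ValueError("boots must be at least 1")
    return (1.0 - p_lockup) ** boots


def boots_needed(p_lockup: float, max_false_good: float) -> int:
    """Smallest number of clean boots before calling a commit 'good'."""
    if not 0.0 < max_false_good < 1.0:
        raise ValueError("max_false_good must be in (0, 1)")
    boots = 1
    # Keep adding test boots until the misclassification risk is low enough.
    while false_good_probability(p_lockup, boots) > max_false_good:
        boots += 1
    return boots
```

For example, with an assumed 50% lockup rate per boot, boots_needed(0.5, 0.05) asks for five clean boots before a commit is marked good. A single wrong "good" mark is enough to steer the bisect to an unrelated commit.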
Comment by Markus Großer (MarkusGrosser) - Friday, 14 May 2021, 15:44 GMT
I am getting an issue with 5.12 that matches James' description. The system boots just fine, but when running Xorg, it progressively deteriorates, with some processes eventually not starting, then all graphics except for the mouse cursor freezing, and finally a complete lockup that can only be "resolved" with a hard shutdown. The most reliable sign of a nonworking kernel was firefox not starting (the process being "stuck", with nothing from ^C to sudo kill -9 being able to kill it); other programs seemingly work fine at first.

Trying to bisect, I arrived at a different set of commits though.
7a800a20ae6329e803c5c646b20811a6ae9ca136 showed the issue described, where a seemingly working kernel will lock up rather quickly.
f007a3d66c5480c8dae3fa20a89a06861ef1f5db worked flawlessly, without any hiccups doing random internet browsing while I was compiling the next bisect step.
However, the six commits between those did not boot and left me stuck with a black screen right after the bootloader (so no systemd startup message or similar). The system did not react to any input (Alt+SysRq) or to a short press of the PC's power button, so a hard shutdown was necessary.
Attached is the git log for the offending commits (including the good and bad ones), so as not to fill up the comments with long logs.
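
Commits that cannot be tested at all, such as ones that hang at a black screen for a possibly unrelated reason, do not have to be guessed as good or bad: git bisect has a skip subcommand for that case. A minimal sketch of mapping an observed boot outcome to the matching bisect command follows; the Outcome names and the bisect_command helper are hypothetical and only build the command line, they do not run it.

```python
from enum import Enum
from typing import List


class Outcome(Enum):
    """Hypothetical labels for what was observed when booting a test kernel."""
    BOOTS_AND_STAYS_UP = "boots_and_stays_up"
    LOCKS_UP = "locks_up"
    NEVER_BOOTS = "never_boots"


# good/bad/skip are the standard git bisect subcommands.
_VERDICT = {
    Outcome.BOOTS_AND_STAYS_UP: "good",
    Outcome.LOCKS_UP: "bad",
    Outcome.NEVER_BOOTS: "skip",  # untestable: let bisect pick another commit
}


def bisect_command(outcome: Outcome) -> List[str]:
    """Return the git bisect command line matching an observed outcome."""
    return ["git", "bisect", _VERDICT[outcome]]
```

When skipped commits sit right next to the first bad one, git bisect can only narrow the result to a range of candidates rather than a single commit.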

In case it helps narrow down the issue, the hardware in use is an Intel i7-6700K (non-overclocked) CPU, 32GB of RAM (at the lowest XMP profile, 2133 or whatever the relevant numbers are), and an AMD Radeon RX 480 GPU. Storage is a bcache setup using a 3TB HDD and half of a 256GB M.2 SSD, which might be relevant since the offending commits concern the block subsystem.

I will try to get the kernel log from as close as possible to the lockup when I find the time for it.
Comment by Markus Großer (MarkusGrosser) - Saturday, 15 May 2021, 14:56 GMT
Update on my part:

Well, turns out I should've googled (or at least looked at the bcache wiki entry) first, which points to a known bug involving bcache and 5.12: https://www.spinics.net/lists/linux-bcache/msg10077.html

I still find it interesting that I get the same symptoms that James describes, but other than that the issues don't seem to be related.
Comment by James (thx1138) - Wednesday, 19 May 2021, 15:55 GMT
I had to re-run my bisect, with more thorough testing. The result changed, and we are currently investigating the final commit, at 4f432e8bb15b x86/mce: Get rid of mcheck_intel_therm_init(). There are posts going to linux-smp and lkml.
Comment by James (thx1138) - Sunday, 23 May 2021, 23:19 GMT
Finally have a fix. The problem was the placement of a System Management function in the boot sequence. The sequence itself is being reverted.
Comment by James (thx1138) - Monday, 31 May 2021, 22:51 GMT
Patch merge in process.
Comment by James (thx1138) - Sunday, 13 June 2021, 03:17 GMT
Patched in 5.12.10.arch1-1 - "Fix LVT thermal setup for SMI delivery mode"
