FS#58542 - [linux] kernels 4.16.6 through 4.16.8 - 140 second boot hang and multiple call traces in dmesg
Attached to Project:
Arch Linux
Opened by David C. Rankin (drankinatty) - Friday, 11 May 2018, 04:33 GMT
Last edited by Doug Newgard (Scimmia) - Sunday, 30 September 2018, 04:14 GMT
Details
Description:
Beginning with kernel 4.16.6 and continuing through the current 4.16.8, there are multiple init failures and call traces in dmesg, triggering a 140-second delay in booting as the boot process loops over each core validating that everything is fine. LTS boots without a problem. This is a Supermicro H8DM8-2 server with dual quad-core Opteron CPUs.
Additional info:
dmesg output attached, including kernel call trace info (filename: dmesg_0510)
Steps to reproduce:
Simply boot any kernel since 4.16.6 and this problem occurs. All LTS kernels and kernels prior to 4.16.6 booted fine.
perhaps related?
[ 0.405313] PCI Interrupt Link [LUB0] enabled at IRQ 23
[ 10.377946] INFO: NMI handler (nmi_cpu_backtrace_handler) took too long to run: 1.690 msecs
[ 10.377946] INFO: NMI handler (perf_event_nmi_handler) took too long to run: 3.829 msecs
However the PCI Interrupt Link is behaving, it is injecting a 10-second hang that is giving the NMI handlers fits. The CPU stuck messages:
watchdog: BUG: soft lockup - CPU#6 stuck for 22s! [watchdog/6:55]
seem to be a generic kernel issue that is hit-or-miss depending on some combination of chipsets. Just 2 days ago the same "CPU stuck for 22s!" message was reported on new MSI GE63 7RD hardware on the openSUSE list.
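For anyone comparing boxes, the soft lockup messages can be tallied per CPU from a saved dmesg file. This is only a rough sketch: the regular expression is modelled on the single message quoted above, and the function name `lockups_per_cpu` is made up for illustration.

```python
import re
from collections import Counter

# Pattern modelled on the one message quoted in this report (an assumption):
#   watchdog: BUG: soft lockup - CPU#6 stuck for 22s! [watchdog/6:55]
LOCKUP = re.compile(r"soft lockup - CPU#(\d+) stuck for (\d+)s!")

def lockups_per_cpu(lines):
    """Count soft lockup reports per CPU number."""
    counts = Counter()
    for line in lines:
        m = LOCKUP.search(line)
        if m:
            counts[int(m.group(1))] += 1
    return counts

sample = ["watchdog: BUG: soft lockup - CPU#6 stuck for 22s! [watchdog/6:55]"]
print(lockups_per_cpu(sample))
```

Feeding it the lines of a full dmesg dump (for example `open("dmesg_0510")`) should show whether every core reports a lockup or only some of them.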
I'll keep trying the kernels as they come along and reading to see if I can find some commonality in the boxes this is affecting. I can't explain it. I have 2 Supermicro servers. The H8QM8-2 (Q = quad quad-core) is fine with all kernels; the H8DM8-2 (D = dual quad-core) has had problems since I reported this bug.
Thanks for all the effort.
The dmesg output has a bit more detail on the PCI bus startup, but the same issue is present. Once "PCI Interrupt Link [LUB0] enabled at IRQ 23" occurs, the "NMI handler ... took too long to run" messages follow, and it's all downhill from there...
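The stall after the PCI Interrupt Link line shows up as a jump in the dmesg timestamps, so it can be located mechanically. Below is a minimal sketch, assuming the default `[seconds.microseconds]` prefix seen in the lines quoted earlier; the helper name `find_gaps` and the 5-second threshold are arbitrary choices for illustration, not anything from the kernel tooling.

```python
import re

# Leading "[   seconds.microseconds]" stamp as printed in the quoted dmesg lines.
STAMP = re.compile(r"^\[\s*(\d+\.\d+)\]\s?(.*)$")

def find_gaps(lines, threshold=5.0):
    """Return (gap_seconds, previous_message, next_message) for each jump >= threshold."""
    gaps = []
    prev_time = prev_msg = None
    for line in lines:
        m = STAMP.match(line.strip())
        if not m:
            continue  # skip lines without a timestamp
        t, msg = float(m.group(1)), m.group(2)
        if prev_time is not None and t - prev_time >= threshold:
            gaps.append((t - prev_time, prev_msg, msg))
        prev_time, prev_msg = t, msg
    return gaps

sample = [
    "[ 0.405313] PCI Interrupt Link [LUB0] enabled at IRQ 23",
    "[ 10.377946] INFO: NMI handler (nmi_cpu_backtrace_handler) took too long to run: 1.690 msecs",
    "[ 10.377946] INFO: NMI handler (perf_event_nmi_handler) took too long to run: 3.829 msecs",
]
for gap, before, after in find_gaps(sample):
    print(f"{gap:.3f}s stall after: {before}")
```

Running the same function over the LTS dmesg and the 4.16.x dmesg would make it easy to confirm that the gap only appears on the affected kernels.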
The irony is everything works perfectly once it does finish the boot. No problems with any VM running on it, all server functionality is fine. So this is some spurious boot issue.
LTS continues to boot in 12 sec. without any issue.
I need help. 4.18 is the first Arch kernel that will not boot on this SuperMicro motherboard. 4.16 - 4.17 would boot even if it took 3-5 minutes. 4.18 hangs on boot indefinitely.
I have attached the LTS dmesg output here and will attach the 4.18.3 dmesg on reboot from LTS as the next comment.
It appears the folks working on the kernel found the problem and have fixed it. Beginning with 4.18.8, the "NMI handler (perf_event_nmi_handler) took too long to run: ..." issue is gone. (Thank God) Boots now complete in 17 seconds, which is fine. A little longer than 4.14, but given all the additions and the lack of errors in the dmesg output, that's good.
I am attaching the dmesg output for 4.18.8-10 for posterity. Nothing on this SuperMicro based machine was changed throughout the entire life of this bug -- aside from 'pacman -Syu' to update the software. This bug is now closed as the problem is gone.