FS#58542 - [linux] kernels 4.16.6 through 4.16.8 - 140 second boot hang and multiiple call traces in dmesg

Attached to Project: Arch Linux
Opened by David C. Rankin (drankinatty) - Friday, 11 May 2018, 04:33 GMT
Last edited by Doug Newgard (Scimmia) - Sunday, 30 September 2018, 04:14 GMT
Task Type Bug Report
Category Kernel
Status Closed
Assigned To Tobias Powalowski (tpowa)
Jan Alexander Steffens (heftig)
Architecture All
Severity Medium
Priority Normal
Reported Version
Due in Version Undecided
Due Date Undecided
Percent Complete 100%
Votes 0
Private No



Beginning with kernel 4.16.6 through current 4.16.8 there are multiple init failures and call traces in dmesg triggering 140 sec. delay in booting as boot process loop over each core validating everything is fine. LTS boots without a problem. This is a Supermicro H8DM8-2 server with dual quad-core Opteron CPUs.

Additional info:

dmesg output attached including kernel call trace info (filename: dmesg_0510)

Steps to reproduce:

Simply boot any kernel since 4.16.6 and this problem occurs. All LTS kernels and kernels prior to 4.16.6 booted fine.
   dmesg_0510 (120.9 KiB)
This task depends upon

Closed by  Doug Newgard (Scimmia)
Sunday, 30 September 2018, 04:14 GMT
Reason for closing:  Fixed
Comment by Jan Alexander Steffens (heftig) - Friday, 11 May 2018, 14:45 GMT
Can you bisect the kernel between v4.16.5 and v4.16.6 to get at the problematic commit?
Comment by David C. Rankin (drankinatty) - Saturday, 12 May 2018, 00:16 GMT
I'll give it my best shot, but I've not had a lot of luck with big bisect projects in the past. I'll have time this weekend to try and set it up.
Comment by loqs (loqs) - Saturday, 12 May 2018, 01:11 GMT
4.16.5 - 4.16.6 will need a cherry-pick of 854e55ad289ef8888e7991f0ada85d5846f5afb9 or to be built with gcc7
Comment by David C. Rankin (drankinatty) - Thursday, 17 May 2018, 07:05 GMT
Ok, thank you, I ran out of time this weekend. I updated to linux-firmware 20180507.8fc2d4e-1 in hopes that it may have been related. No help. A new dmesg output was attached along with this comment to provide output based on boot with the new firmware. Same problem continues.
Comment by David C. Rankin (drankinatty) - Monday, 18 June 2018, 23:54 GMT
Here is a comparison between LTS 4.14.50 and 4.17.2 (where the problem has gotten much much worse). LTS boots in roughly 40s which is fine, but 4.17.2 now pushes boot time out past 5 minutes. Here is the LTS dmesg. I'll attach the current dmesg for 4.17.2 in the next comment. I've posted to LKML, but no responses.
Comment by David C. Rankin (drankinatty) - Monday, 18 June 2018, 23:57 GMT
Here is the dmesg for 4.17.2. Bizarrely, there are entries out of order in the listing. I have a second 4.17.2 where the entries are closer in order, but this one shows there is something really messed up with the way the new kernels are responding to this Supermicro hardware.
Comment by Tobias Powalowski (tpowa) - Wednesday, 27 June 2018, 04:51 GMT Comment by David C. Rankin (drankinatty) - Sunday, 08 July 2018, 00:29 GMT
4.17.4 is back down to the 3 minute boot mark, but the dmesg output still looks like scrambled eggs. (it looks better if you run it through `sort` so at least the times are sequential. At least this dmesg is a bit more complete, containing the memory reporting for the computer, etc.. The triggering even in all logs seems to be:

[ 0.405313] PCI Interrupt Link [LUB0] enabled at IRQ 23
[ 10.377946] INFO: NMI handler (nmi_cpu_backtrace_handler) took too long to run: 1.690 msecs
[ 10.377946] INFO: NMI handler (perf_event_nmi_handler) took too long to run: 3.829 msecs

However the PCI Interrupt link is behaving, it is injecting a 10 second hang that is causing the NMI handle fits. The CPU stuck messages:

watchdog: BUG: soft lockup - CPU#6 stuck for 22s! [watchdog/6:55]

seem to be a generic kernel issue that are hit-or-miss based on some combination of chipsets. Just 2 days ago there was reporting of the same CPU stuck for 22s! reported on new MSI GE63 7RD hardware on the openSuSE list.

I'll keep trying the kernels as they come along and reading to see if I can find some commonality in boxes this is effecting. I can't explain it. I have 2 Supermicro servers. The H8QM8-2 (Q-quad quad-core) is fine with all kernels, the H8DM8-2 (D-dual quad core) has had problems since reporting this bug.

Thanks for all the effort.
Comment by David C. Rankin (drankinatty) - Wednesday, 11 July 2018, 19:29 GMT
Well -- we are going in the wrong direction with 4.17.5. We have a new record -- a 6 minute boot! (legend -- less is better:)

The dmesg output has a bit more detail on the PCI bus startup, but the same issue is present. Once "PCI Interrupt Link [LUB0] enabled at IRQ 23" occurs, then the NMI handler took too long occurs and it's all downhill from there...

The irony is everything works perfectly once it does finish the boot. No problems with any VM running on it, all server functionality is fine. So this is some spurious boot issue.
Comment by David C. Rankin (drankinatty) - Tuesday, 17 July 2018, 14:10 GMT
Wohoo! Things are going in the right direction. 4.17.6 booted in less than 2:40, that is less than 1/2 the time than 4.17.5. Another plus is that the dmesg output is now properly ordered and you no longer have to sort to get the time sequential. The same problem is still triggered, but the self-tests triggered as a result are much faster to figure out there is nothing wrong and allow the boot to continue. Thanks for all the help. Hopefully we are narrowing in on a solution.
Comment by David C. Rankin (drankinatty) - Friday, 20 July 2018, 20:22 GMT
Well, we are at 3:00 minutes on 4.17.8. The dmesg output is getting much better, it is back in chronological order. Attached latest dmesg.
Comment by David C. Rankin (drankinatty) - Wednesday, 25 July 2018, 20:33 GMT
I was crossing my fingers that 4.17.9 would be an improvement, no joy. On cold start 4.17.8 was a 256 second boot. On reboot after update to 4.17.9 it was a 238 second boot. (both dmesg attached in tar.xz). Contrast with the other Supermicro H8QM8/H8QM8 Quad-Quad Core box which has a 13 second boot on 4.17.9. Oh well.
Comment by David C. Rankin (drankinatty) - Monday, 06 August 2018, 18:23 GMT
I can't tell if this is progress or just good luck, but boot times for 4.17.11-12 are down to as low as 70s and average 110s which is much much better than the 300+ seconds before. Cold boot of 4.17.11 booted in 72 sec. The "NMI handler (perf_event_nmi_handler) took too long to run: 3.424 msecs" and then the "watchdog: BUG: soft lockup - CPU#5 stuck for 23s! [watchdog/5:48]" and register dumps look clean in dmesg and the entire funny business completed in 40s. On reboot with 4.17.12, the same register dump clean up took 82s. I've attached the xz dmesg logs for all boots 4.17.10-4.17.12
Comment by David C. Rankin (drankinatty) - Tuesday, 21 August 2018, 02:24 GMT
Mayday! 4.18 will no longer boot from cold-start. Will boot when rebooting from LTS (after 2.5 minutes with 8 minute audit)

LTS continues to boot in 12 sec. without any issue.

I need help. 4.18 is the first Arch kernel that will not boot on this SuperMicro motherboard. 4.16 - 4.17 would boot even if it took 3-5 minutes. 4.18 hangs on boot indefinitely.

I have attached the LTS dmesg output here and will attach the 4.18.3 dmesg on reboot from LTS as the next comment.
Comment by David C. Rankin (drankinatty) - Tuesday, 21 August 2018, 02:24 GMT
Linux 4.18.3 dmesg on reboot from LTS
Comment by David C. Rankin (drankinatty) - Saturday, 29 September 2018, 21:52 GMT

It appears the folks at working on the kernel found the problem and have fixed it. Beginning with 4.18.8, the "NMI handler (perf_event_nmi_handler) took too long to run: ..." issue is gone. (Thank God) Boots now complete in 17 seconds. Which is fine. A little longer than 4.14, but given all the additions, and lack of errors in the dmesg output, that's good.

I am attaching the dmesg output for 4.18.8-10 for posterity. Nothing on this SuperMicro based machine was changed throughout the entire life of this bug -- aside from 'pacman -Syu' to update the software. This bug is now closed as the problem is gone.