FS#57511 - [linux] 4.15 hangs during boot with intel-ucode

Attached to Project: Arch Linux
Opened by Andrea Amorosi (AndreaA) - Wednesday, 14 February 2018, 20:33 GMT
Last edited by Jan Alexander Steffens (heftig) - Thursday, 19 April 2018, 18:04 GMT
Task Type Bug Report
Category Kernel
Status Closed
Assigned To Tobias Powalowski (tpowa)
Jan Alexander Steffens (heftig)
Christian Hesse (eworm)
Architecture All
Severity Critical
Priority Normal
Reported Version
Due in Version Undecided
Due Date Undecided
Percent Complete 100%
Votes 7
Private No

Details

Description:
All 4.15 kernels released so far (currently 4.15.2-2) hang during boot on my Asus 2752vx (Intel i7-6700HQ with NVIDIA 950M and Optimus PRIME) if intel-ucode.img is passed as a parameter to initrd.

These are the lines in grub.cfg:
linux /vmlinuz-linux root=UUID=74296c4e-84df-4eda-87a1-09be9d8e114b rw pci=noaer nvidia-drm.modeset=1 initcall_debug no_console_suspend ignore_loglevel dyndbg="file suspend.c +p"
echo 'Caricamento ramdisk iniziale...'
initrd /intel-ucode.img /initramfs-linux.img

At boot the system prints 'Caricamento ramdisk iniziale...' (Italian for 'Loading initial ramdisk...') and that line stays on screen, but the system is completely unresponsive.
By adding earlyprintk=efi,keep to the kernel command line I discovered that 4.15.x kernels hang at the following line:

... x86: Booting SMP configuration ...

Adding acpi=off lets the boot process proceed a bit further and a lot of messages are displayed, but it then hangs again when X and sddm should start.

With the original kernel parameters (without acpi=off) and the initrd line modified in this way:
initrd /initramfs-linux.img
the system works perfectly.

Reverting to the previous intel-ucode version does not solve the issue.

Linux has been installed on this PC since March 2016 and has worked correctly until now.
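
The two initrd configurations compared above differ only in whether intel-ucode.img is listed. A minimal Python sketch for checking which variant a given grub.cfg uses could look like this; the function names are made up for illustration, and the only thing assumed about the file is that initrd lines look like the ones quoted in this report.

```python
def initrd_images(grub_cfg_text):
    """Return the list of image paths for every 'initrd' line in a grub.cfg text."""
    images = []
    for line in grub_cfg_text.splitlines():
        words = line.split()
        if words and words[0] == "initrd":
            images.append(words[1:])
    return images


def loads_microcode(grub_cfg_text, image_name="intel-ucode.img"):
    """Return True if any initrd line lists the microcode image."""
    return any(
        path.endswith(image_name)
        for paths in initrd_images(grub_cfg_text)
        for path in paths
    )


# The two variants from this report:
hanging = "initrd /intel-ucode.img /initramfs-linux.img"
working = "initrd /initramfs-linux.img"
print(loads_microcode(hanging), loads_microcode(working))
```

To check a real system, pass the contents of the grub.cfg in use (its location depends on the installation) to loads_microcode().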

Closed by  Jan Alexander Steffens (heftig)
Thursday, 19 April 2018, 18:04 GMT
Reason for closing:  Fixed
Comment by Andrea Amorosi (AndreaA) - Wednesday, 14 February 2018, 22:55 GMT
The same happens with the newer Linux n752vx 4.15.3-1-ARCH #1 SMP PREEMPT Mon Feb 12 23:01:17 UTC 2018 x86_64 GNU/Linux
Comment by Denis Mayborodin (Dzen_Python) - Saturday, 17 February 2018, 11:39 GMT
Bug confirmed on a fresh x86_64 Linux 4.15.3-2-ARCH (Aspire E5-571g-36mp laptop): the system boots without hanging only after I removed /boot/intel-ucode.img from my grub.cfg.
Comment by Andrea Amorosi (AndreaA) - Saturday, 17 February 2018, 17:43 GMT
I confirm it still happens with Linux n752vx 4.15.3-2-ARCH #1 SMP PREEMPT Thu Feb 15 00:13:49 UTC 2018 x86_64 GNU/Linux on my laptop
Comment by Jan Alexander Steffens (heftig) - Saturday, 17 February 2018, 18:57 GMT
Wasn't the 20180108 microcode update retracted because of issues? Maybe we should downgrade back to 20171117.
Comment by Andrea Amorosi (AndreaA) - Saturday, 17 February 2018, 19:20 GMT
A week ago I already tried reverting to intel-ucode-20171117-1-any.pkg.tar.xz (with 4.15.1), but without success (it was a quick test, but after rebooting the issue was still there, so I upgraded to the latest intel-ucode again).
Please let me know if you want me to try reverting intel-ucode again with 4.15.3, or to revert to an even older intel-ucode package.
Comment by loqs (loqs) - Saturday, 17 February 2018, 19:33 GMT
4.15.2+ and 4.14.18+ have backports of a5b2966364538a0e68c9fa29bc0a3a1651799035 guarding against the issue with the 20180108 ucode
Edit:
4.14.18+ not 4.14.8+
Comment by loqs (loqs) - Saturday, 17 February 2018, 20:58 GMT
Does it also happen with linux-lts?
Comment by Andrea Amorosi (AndreaA) - Saturday, 17 February 2018, 23:16 GMT
No. linux-lts 4.14.19-1 works perfectly.
Comment by loqs (loqs) - Monday, 19 February 2018, 15:50 GMT
Would suggest bisecting the kernel between 4.14 and 4.15. If you need help with the bisection please start a forum thread.
Comment by Denis Mayborodin (Dzen_Python) - Monday, 19 February 2018, 16:50 GMT
4.15.1 works fine with intel-ucode. I first saw this bug after upgrading to 4.15.2.
Comment by loqs (loqs) - Monday, 19 February 2018, 18:02 GMT
Was it linux 4.15.1-3/4.15.1-4 that worked and 4.15.2-1 that had the bug?
As there were three or four versions of 4.15.1, all with config changes, it will narrow things down a lot if you can find exactly which package update introduced the bug.
If you do not have the versions in your package cache, https://archive.archlinux.org/packages/l/linux/ has the packages, except 4.15.1-4; I am not sure that one was ever actually released.
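
Finding the first package release that shows the bug is a binary search over the ordered package versions. The Python sketch below is only an illustration: archive_url() assumes the archive uses the same file naming as the package cache (linux-VERSION-x86_64.pkg.tar.xz), first_bad() is a hypothetical helper, and the is_bad callback stands in for "install that package, reboot with intel-ucode, and see whether it hangs", which has to be done by hand.

```python
ARCHIVE = "https://archive.archlinux.org/packages/l/linux/"


def archive_url(version, arch="x86_64"):
    """Build the archive URL for one linux package version (assumed naming scheme)."""
    return f"{ARCHIVE}linux-{version}-{arch}.pkg.tar.xz"


def first_bad(versions, is_bad):
    """Return the first bad version, or None if no version is bad.

    versions must be ordered oldest to newest, and every version after
    the first bad one is assumed to be bad as well.
    """
    lo, hi = 0, len(versions)
    while lo < hi:
        mid = (lo + hi) // 2
        if is_bad(versions[mid]):
            hi = mid          # first bad version is at mid or earlier
        else:
            lo = mid + 1      # first bad version is after mid
    return versions[lo] if lo < len(versions) else None


# Candidate versions mentioned in this task, oldest first.
candidates = ["4.15.1-2", "4.15.1-3", "4.15.2-1", "4.15.2-2", "4.15.3-1", "4.15.3-2"]
for version in candidates:
    print(archive_url(version))
```

The search only gives a meaningful answer if a given package hangs reliably; a hang that comes and goes between reboots needs several test boots per version.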
Comment by Andrea Amorosi (AndreaA) - Monday, 19 February 2018, 18:26 GMT
In my case all the following kernels:
linux-4.15.1-2-x86_64.pkg.tar.xz
linux-4.15.2-2-x86_64.pkg.tar.xz
linux-4.15.2-2-x86_64.pkg.tar.xz
linux-4.15.3-2-x86_64.pkg.tar.xz
do not boot if intel-ucode is used.
The last linux package (excluding -lts ones) that works correctly is linux-4.14.15-1-x86_64.pkg.tar.xz
Comment by Nicola (drakkan) - Tuesday, 20 February 2018, 07:52 GMT
On my ThinkPad T540p, linux 4.15.3-1 boots fine, while 4.15.3-2 often hangs on boot and only sometimes boots correctly.
Comment by loqs (loqs) - Tuesday, 20 February 2018, 19:07 GMT
@drakkan And what happens if you boot the system with 4.15.3-2 without using intel-ucode.img as an initrd?
Comment by Andrea Amorosi (AndreaA) - Tuesday, 20 February 2018, 19:52 GMT
I've noticed something strange.
If I reboot from a working kernel (lts with intel-ucode, or non-lts without intel-ucode) into one of these buggy kernels with intel-ucode, it works correctly the first time only; if I then reboot (or power off), it no longer works and hangs at boot.
If I then try to boot a working kernel, it fails to boot the first time, but after that (after forcing a poweroff) it starts working again.
It seems to me (though I don't know whether this is possible on such complex PCs) as if booting 4.15.x with intel-ucode leaves some stale state in the BIOS or EFI, and that two boots of a working kernel are needed to clear it.
Maybe the CPU tries to update the ucode and something goes wrong?
Comment by loqs (loqs) - Tuesday, 20 February 2018, 21:23 GMT
It would be much easier for the kernel developers if you could bisect and find the first bad commit so you can contact the right subsystem team from the start.
If you open a general bug report upstream saying only that the issue started between 4.14 and 4.15, I would expect either no response or a request to perform a bisection anyway.
I can help with the bisection but it will probably clutter this bug report which is why I suggested opening a forum thread for it.
Dzen_Python also needs to do a separate bisection unless it turns out just a config change in the different 4.15.1 releases triggered the issue on that system.
Comment by Nicola (drakkan) - Wednesday, 21 February 2018, 22:09 GMT
@loqs If I boot my system with 4.15.3-2 without using intel-ucode.img as an initrd the problem is not solved, so maybe this is another issue related to the patch added in 4.15.3-2.

During boot, 4.15.3-2 shows the warning in the attachment, which does not appear with 4.15.3-1.
Comment by Nicola (drakkan) - Wednesday, 21 February 2018, 22:25 GMT
My problem is solved by blacklisting nvidiafb, so the correct bug is this one:

https://bugs.archlinux.org/task/57578

sorry for the noise
Comment by Denis Mayborodin (Dzen_Python) - Friday, 23 February 2018, 07:53 GMT
I did the bisection: 4.15.1-4 works fine, 4.15.1-2 works fine, 4.15.2-1 hangs.
Comment by Denis Mayborodin (Dzen_Python) - Friday, 23 February 2018, 07:58 GMT
Problem solved by installing 4.15.5-1 from testing, without the FB drivers. This patch from Ubuntu...
Comment by Andrea Amorosi (AndreaA) - Friday, 23 February 2018, 23:13 GMT
I still have the same problem with Linux n752vx 4.15.5-1-ARCH #1 SMP PREEMPT Thu Feb 22 22:15:20 UTC 2018 x86_64 GNU/Linux.
It keeps hanging very early if booted with intel-ucode.
Comment by Andrea Amorosi (AndreaA) - Saturday, 24 February 2018, 11:24 GMT
In order to bisect the kernel I've opened this thread with some questions: https://bbs.archlinux.org/viewtopic.php?pid=1770079#p1770079
Comment by Andrea Amorosi (AndreaA) - Monday, 26 February 2018, 23:20 GMT
linux-lts has had the same issue since the upgrade to 4.14.22-1.
Comment by Federico Cuello (fedux) - Wednesday, 14 March 2018, 11:37 GMT
There is a new version: https://downloadcenter.intel.com/download/27591/Linux-Processor-Microcode-Data-File

Intel Processor Microcode Package for Linux
20180312 Release
Comment by loqs (loqs) - Wednesday, 14 March 2018, 22:35 GMT
4.16-rc5 has some commits for more robust microcode update handling.
Comment by Andrea Amorosi (AndreaA) - Friday, 16 March 2018, 22:30 GMT
The newer microcode does not solve the issue (which is still present in 4.15.9-1-ARCH #1 SMP PREEMPT Sun Mar 11 17:54:33 UTC 2018 x86_64 GNU/Linux)
Comment by loqs (loqs) - Monday, 19 March 2018, 19:37 GMT
Were you unable to complete the bisection between 4.14.21 and 4.14.22? It should have been quicker than the 4.14 to 4.15 bisection.
Comment by Daenney (daenney) - Sunday, 25 March 2018, 10:41 GMT
I can confirm this still happens on 4.15.11 and 4.15.12. Booting the LTS kernel consistently works for me, currently at 4.14.29-1-lts.
Comment by loqs (loqs) - Sunday, 25 March 2018, 11:26 GMT
@daenney With which kernel version did the issue start happening on your system? The issue occurring on 4.14.29 contradicts AndreaA's report, where the lts kernel started being affected with 4.14.22.
Comment by Andrea Amorosi (AndreaA) - Sunday, 25 March 2018, 13:42 GMT
At the moment I can't confirm that the issue is present with the lts (it appears and disappears with different lts versions).
All the non-lts 4.15.x versions show the issue.
Comment by Daenney (daenney) - Sunday, 25 March 2018, 22:37 GMT
@loqs I've noticed the issue since 4.15.2 if I recall correctly. The whole 4.15 series seems completely messed up for me (on an XPS 9360). There's been a few kernels in the 4.15 series I've managed to boot, but it seems more dumb luck than that it was actually supposed to work.

What I meant to say is that the issue doesn't occur for me on the LTS kernel, which is what I now boot, until 4.16 comes out and I give that a go. Currently at 4.14.29-1-lts, no issues so far. I've booted a whole bunch of the 4.14 LTS series as 4.15 has been wonky since day one, haven't had a single issue with the LTS builds.
Comment by loqs (loqs) - Monday, 26 March 2018, 21:52 GMT
@daenney you could try 4.16-rc7 now with either the config from linux or linux-lts and see if that works.
Until someone affected locates the cause and reports it to the relevant upstream for resolution the issue will remain unresolved.
Comment by Andrea Amorosi (AndreaA) - Saturday, 07 April 2018, 09:15 GMT
I have found that the issue is related to the kernel configuration and not (directly) to kernel code and that changing these lines in the 4.15.1-4 kernel configuration (the one used in almost all the 4.15.x variants)

# CONFIG_HZ_300 is not set
CONFIG_HZ_1000=y
CONFIG_HZ=1000

to what was previously used

CONFIG_HZ_300=y
# CONFIG_HZ_1000 is not set
CONFIG_HZ=300

solves the issue.
I don't know why the issue appears only when intel-ucode is loaded; maybe it is caused by using the 1000 Hz setting on an i7-6700hq with the updated intel-ucode.
Comment by Andrea Amorosi (AndreaA) - Saturday, 07 April 2018, 12:06 GMT
I can confirm that kernel 4.14 also hangs on boot with intel-ucode if the config is switched to 1000 Hz, changing from the following lines

CONFIG_HZ_300=y
# CONFIG_HZ_1000 is not set
CONFIG_HZ=300

to these ones

# CONFIG_HZ_300 is not set
CONFIG_HZ_1000=y
CONFIG_HZ=1000

If intel-ucode is not used, this "test" 4.14 kernel also works correctly.
So the problem is not related to a specific kernel version but to CONFIG_HZ_1000=y and its interaction with intel-ucode and maybe my hardware.
Can someone else confirm this?
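
For anyone who wants to check which timer frequency their kernel was built with, here is a minimal Python sketch. timer_hz() and running_kernel_hz() are names made up for this example, and reading /proc/config.gz assumes the kernel exposes its configuration there (CONFIG_IKCONFIG_PROC), which may not be true for every build.

```python
import gzip
import re


def timer_hz(config_text):
    """Return the CONFIG_HZ value from a kernel config text, or None if absent."""
    match = re.search(r"^CONFIG_HZ=(\d+)$", config_text, flags=re.MULTILINE)
    return int(match.group(1)) if match else None


def running_kernel_hz(path="/proc/config.gz"):
    """Read CONFIG_HZ of the running kernel (assumes /proc/config.gz exists)."""
    with gzip.open(path, "rt") as config:
        return timer_hz(config.read())


# The two config fragments compared in this task:
previous = "CONFIG_HZ_300=y\n# CONFIG_HZ_1000 is not set\nCONFIG_HZ=300\n"
current = "# CONFIG_HZ_300 is not set\nCONFIG_HZ_1000=y\nCONFIG_HZ=1000\n"
print(timer_hz(previous), timer_hz(current))
```

Calling running_kernel_hz() on an affected machine and on an unaffected one would show whether the two actually differ in this setting.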
Comment by Daenney (daenney) - Monday, 16 April 2018, 08:32 GMT
Ya, I seem to be able to work just fine with the CONFIG_HZ_300=y option.
Comment by Daenney (daenney) - Monday, 16 April 2018, 08:54 GMT
This happened here: https://git.archlinux.org/svntogit/packages.git/commit/trunk/config?h=packages/linux&id=9998d4fe8026c686abe8db9d9c5941d3936af3de for 4.15.0-1 which seems to coincide with most people's observations here that 4.15 is wonky for them.

From what I've been able to find about CONFIG_HZ, which isn't much, it seems that HZ 1000 or so is what desktop systems are using but for mobile devices most things I've found recommend or use HZ 300.

> The timer sets the frequency that an interrupt wakes the kernel up so it can see if it has to do anything. 100Hz (every 10 ms) is traditional. Recently higher rates have been introduced. The more often the kernel wakes up, the lower the latency when it needs to do something. Thats the plus side. The down side is that there is more wasted time when there is nothing to do.

What I'm failing to understand is how this could possibly interact with Intel microcode updates. It also doesn't seem to affect that many people; otherwise I would expect this thread to be a lot busier.

What CPUs do the people observing this issue have? I'm on an Intel(R) Core(TM) i7-7500U CPU @ 2.70GHz (according to /proc/cpuinfo).
Comment by Daenney (daenney) - Monday, 16 April 2018, 09:54 GMT
Alright, this is getting weirder by the second for me. I took a long look at the difference between a 4.14 kernel boot and a 4.15 kernel boot. Fairly early in the boot process on 4.15 this error gets logged, which doesn't happen on 4.14:

Apr 16 11:14:02 archlinux kernel: DMAR: [INTR-REMAP] Request device [f0:1f.0] fault index 0 [fault reason 37] Blocked a compatibility format interrupt request

After some searching, I found people suggesting this was related to the Intel IOMMU being turned on by default while not being quite safe to enable by default. Following those suggestions I appended intel_iommu=off to my kernel command line and now everything works. I've rebooted more than 5 times now; the error doesn't show in the logs and the system boots, is fully responsive, etc.

I'm not sure if this is the same bug everyone else is seeing, or if it just happened to manifest in a similar enough way. Either way, have a look at your journalctl log for a 4.15 boot, see if this error shows up and/or try booting with intel_iommu=off, and share the results.
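
To make that log check easier, here is a rough Python sketch that scans kernel log lines for the interrupt-remapping fault quoted above. The regular expression is modeled only on that single line, so other kernels may word the message differently, and dmar_faults() is a name invented for this example.

```python
import re

# Pattern modeled on the one fault line quoted in this comment.
DMAR_FAULT = re.compile(
    r"DMAR: \[INTR-REMAP\] Request device \[(?P<device>[0-9a-f:.]+)\] "
    r"fault index (?P<index>\d+) \[fault reason (?P<reason>\d+)\]"
)


def dmar_faults(log_lines):
    """Return (device, fault reason) for each interrupt-remapping fault found."""
    faults = []
    for line in log_lines:
        match = DMAR_FAULT.search(line)
        if match:
            faults.append((match.group("device"), int(match.group("reason"))))
    return faults


sample = [
    "Apr 16 11:14:02 archlinux kernel: DMAR: [INTR-REMAP] Request device "
    "[f0:1f.0] fault index 0 [fault reason 37] Blocked a compatibility "
    "format interrupt request",
]
print(dmar_faults(sample))
```

The input can be the lines of the kernel log for the boot in question, for example the output of journalctl -k -b split into lines.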
Comment by Jan Alexander Steffens (heftig) - Monday, 16 April 2018, 10:30 GMT
It's not turned on by default.
Comment by Daenney (daenney) - Tuesday, 17 April 2018, 11:15 GMT
That might be, but it is somehow having an effect in my case. Booting without that option I get the error logs and my system can't switch to GUI mode without freezing. With it, the error is gone and the system behaves correctly. I've managed to narrow it down to intel_iommu=igfx_off. Even if it is off by default, explicitly setting the flag seems to have a side effect somewhere in the system.
Comment by Andrea Amorosi (AndreaA) - Thursday, 19 April 2018, 16:32 GMT
4.16.2-2 solves the issue for me.
There are no hangs when booting with intel-ucode, and this is the output of dmesg | grep microcode:

[ 0.464853] calling save_microcode_in_initrd+0x0/0xa4 @ 1
[ 0.464853] initcall save_microcode_in_initrd+0x0/0xa4 returned 0 after 0 usecs
[ 0.810947] calling microcode_init+0x0/0x1fb @ 1
[ 0.811554] microcode: sig=0x506e3, pf=0x20, revision=0xc2
[ 0.812624] microcode: Microcode Update Driver: v2.2.
[ 0.812626] initcall microcode_init+0x0/0x1fb returned 0 after 1064 usecs
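
To compare the loaded microcode revision between boots or machines, the sig/pf/revision line above can be parsed with a short Python sketch like the following. The pattern is based only on the line format shown here, and microcode_info() is a name made up for illustration.

```python
import re

# Matches lines such as: microcode: sig=0x506e3, pf=0x20, revision=0xc2
MICROCODE_LINE = re.compile(
    r"microcode: sig=(0x[0-9a-f]+), pf=(0x[0-9a-f]+), revision=(0x[0-9a-f]+)"
)


def microcode_info(dmesg_text):
    """Return sig, pf and revision as integers, or None if no such line exists."""
    match = MICROCODE_LINE.search(dmesg_text)
    if match is None:
        return None
    sig, pf, revision = (int(value, 16) for value in match.groups())
    return {"sig": sig, "pf": pf, "revision": revision}


info = microcode_info("[ 0.811554] microcode: sig=0x506e3, pf=0x20, revision=0xc2")
print(hex(info["revision"]))
```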
