FS#39811 - [linux] Boot hangs after decompressing kernel 3.14 on Haswell chips

Attached to Project: Arch Linux
Opened by Spyros Stathopoulos (Foucault) - Thursday, 10 April 2014, 20:55 GMT
Last edited by Tobias Powalowski (tpowa) - Sunday, 27 April 2014, 13:17 GMT
Task Type Bug Report
Category Kernel
Status Closed
Assigned To Tobias Powalowski (tpowa)
Thomas Bächler (brain0)
Architecture x86_64
Severity High
Priority Normal
Reported Version
Due in Version Undecided
Due Date Undecided
Percent Complete 100%
Votes 11
Private No

Details

Description:
After updating to kernel 3.14-4, system boot halts after decompressing the kernel, right after displaying "Booting the kernel". There are other reports from users affected by the same problem on both syslinux and grub [1]. At the moment it is unknown whether this problem is specific to the CPU or the chipset. It is also impossible to debug, since the kernel does not boot at all. The system boots normally with kernel 3.13 or LTS. I have also tried disabling compression (using cat in mkinitcpio.conf) and rebuilding the initramfs, to no avail. Currently I am unsure whether this is a kernel regression or an initramfs problem.

Attached (pretty much stock) mkinitcpio.conf
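
For reference, the "cat" setting mentioned above corresponds to the COMPRESSION option in /etc/mkinitcpio.conf. The sketch below edits a scratch copy rather than the real file; the two sample lines in that copy are an assumption standing in for a stock config, not the attached file.

```shell
# Work on a scratch copy; the sample contents are an assumed stand-in for a stock config.
conf=$(mktemp)
printf '%s\n' \
    'HOOKS="base udev autodetect modconf block filesystems keyboard fsck"' \
    '#COMPRESSION="gzip"' > "$conf"

# Set COMPRESSION to "cat" (no compression), whether or not the line was commented out.
sed -i 's/^#\{0,1\}COMPRESSION=.*/COMPRESSION="cat"/' "$conf"

compression_line=$(grep '^COMPRESSION=' "$conf")
echo "$compression_line"
```

On a real system the same change would be made in /etc/mkinitcpio.conf, and the image would then have to be regenerated (typically with mkinitcpio -p linux as root).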

Additional info:
linux: 3.14-4

Steps to reproduce:
1) Have a Haswell CPU and kernel 3.14
2) Boot the system
3) Boot procedure stops after kernel decompression

[1] https://bbs.archlinux.org/viewtopic.php?id=179818

Closed by  Tobias Powalowski (tpowa)
Sunday, 27 April 2014, 13:17 GMT
Reason for closing:  Fixed
Additional comments about closing:  3.14.2-1
Comment by Adam (adam900710) - Friday, 11 April 2014, 01:11 GMT
Confirmed with my i5 4570 CPU with B85 chipset.

This does not look related to mkinitcpio, but 0007-Fix-the-use-of-code32_start-in-the-EFI-boot-stub.patch looks rather suspicious.

I am unable to test right now (my working PC is not affected), so any testing is welcome.
Comment by Thomas Bächler (brain0) - Friday, 11 April 2014, 06:09 GMT
0007-Fix-the-use-of-code32_start-in-the-EFI-boot-stub.patch only affects EFI boot and corrects incorrect memory alignment. The forum thread suggests that everyone affected uses legacy boot, so I don't see anything suspicious here.
Comment by Spyros Stathopoulos (Foucault) - Friday, 11 April 2014, 08:28 GMT
Yes, it can't be related to 0007 since the bug affects both EFI and legacy users. In any case I recompiled the kernel without the patch and it still doesn't boot.
Comment by Thomas Bächler (brain0) - Friday, 11 April 2014, 08:37 GMT
Btw, 0007 has now been merged here https://git.kernel.org/cgit/linux/kernel/git/mfleming/efi.git/commit/?h=urgent&id=7e8213c1f3acc064aef37813a39f13cbfe7c3ce7 with a proper commit message.

It does not apply to 3.14, so I use a slightly different version (also provided by the same author). Anyway, as Spyros confirmed, this is unrelated to the boot problems of this bug report.
Comment by Florian Ehmke (imbaer) - Friday, 11 April 2014, 10:55 GMT
Confirmed with i5 4670k and H87 chipset.
Comment by Maarten Vanden Branden (firefixmaarten) - Friday, 11 April 2014, 12:27 GMT
Doesn't seem to affect mobile Haswell CPUs, at least not mine:

i5-4200M - Kernel 3.14.0-4-ARCH
Comment by Thomas Bächler (brain0) - Friday, 11 April 2014, 12:36 GMT
If this problem affected Haswell CPUs, then the kernel wouldn't even have made it to the repository.

All I can read in this report is "OMG OMG my system won't boot", "OMG, me too". There's nothing that anyone can do about it with this kind of information. All I can say is that neither the Haswell CPU nor the chipset have anything to do with it.
Comment by Spyros Stathopoulos (Foucault) - Friday, 11 April 2014, 14:49 GMT
Still, it's very strange that it affects this particular class of very similar cases. I've had unbootable kernels in the past, but what troubles me is that this one is impossible to debug. In any case, since this is probably a dead end, a maintainer can close the task.
Comment by Thomas Bächler (brain0) - Friday, 11 April 2014, 14:52 GMT
Closing is a bad idea, too. It's necessary to find out what all the affected systems have in common. Since I have tested this kernel myself on an H87 chipset with an i5-4670, I doubt there is anything chipset or CPU related to it.
Comment by t-ask (tAsk) - Friday, 11 April 2014, 16:14 GMT
Same problem here. The system stops at the boot-time message "Booting the kernel":

Probing EDD (edd=off to disable)... ok
early console in decompress_kernel
Decompressing Linux... Parsing ELF... done.
Booting the kernel.

System specs:
* Kernel 3.14.0-4-Arch
* CPU: Intel i5 4570
* 24GB RAM
* Grub
* RAID0 with mdadm on /dev/md/0
* Mainboard: Gigabyte Z87 HD3 (Intel® Z87 Express Chipset)
* Hooks: "base udev autodetect modconf block mdadm_udev filesystems keyboard fsck"
* Syslinux: "APPEND root=/dev/md/0 ro nomodeset ipv6.disable=1"

In contrast, my Ultrabook with an Intel i5-3337U CPU boots fine with the new kernel.

@brain0: Could you please provide the syslinux.cfg and mkinitcpio.conf files of your working setup? Then I can try it with your settings. It looks like we have a similar setup, though.
Comment by Mladen Milinkovic (maxrd2) - Friday, 11 April 2014, 17:00 GMT
Same problem here.
Kernel: 3.14-4-ARCH x86_64
CPU: Intel i3-4330
8GB RAM
RAID1 on boot/root partition
Gigabyte GA-B85M-D3H - Intel B85 Express Chipset

/etc/mkinitcpio.conf:
MODULES=""
BINARIES=""
FILES=""
HOOKS="base udev autodetect modconf block mdadm_udev filesystems resume keyboard fsck"

What can I do to give you guys more information?
Comment by t-ask (tAsk) - Friday, 11 April 2014, 17:21 GMT
@maxrd2: Do you have a RAID setup with mdadm, too?
Comment by Mladen Milinkovic (maxrd2) - Friday, 11 April 2014, 17:23 GMT
yes.. RAID1 with mdadm on boot/root and am using GRUB2 (grub 1:2.02.beta2-2)
Comment by Mladen Milinkovic (maxrd2) - Friday, 11 April 2014, 17:26 GMT
I see two messages when I start the computer: Decompressing the kernel and Booting the kernel.
I have tried adding verbose to the kernel line to get some more info, but that doesn't give any more info (probably because it hangs before the kernel does anything?)
Is there any way to increase verbosity in grub, or something similar that could give more details?
Comment by Mladen Milinkovic (maxrd2) - Friday, 11 April 2014, 17:31 GMT
Forgot to paste my kernel line... nothing particular there except that IPv6 is disabled (ipv6.disable=1):
linux /boot/vmlinuz-linux root=UUID=4dd972d0-4a16-43ce-b451-670df2a99460 rw ipv6.disable=1 resume=/dev/sda3
Comment by Thomas Bächler (brain0) - Friday, 11 April 2014, 17:40 GMT
You guys can try the 'loglevel=7' option, but this looks like it's not even getting as far as parsing the command line. I would mostly be interested in whether you guys boot with BIOS or EFI and what your bootloader is.

As mentioned above, I have two machines with Haswell CPUs that boot fine (using EFI on both).
Comment by Mladen Milinkovic (maxrd2) - Friday, 11 April 2014, 18:05 GMT
grub2... I'm using BIOS.
Over the weekend I will try converting the disk to a GPT partition table and booting via EFI to see whether that fixes anything.
Comment by t-ask (tAsk) - Friday, 11 April 2014, 18:17 GMT
@brain0: I'm using BIOS with Grub2.
@maxrd2: Are you booting your /boot from an external/other disk than the one of / ? I have an external USB flash drive with a small /boot partition on it.

I unplugged all my internal drives and used only my USB /boot stick (with MBR) to start the system. I get exactly the same problem (this time on purpose :) ): the system again stops at "Booting the kernel".

I assume the MBR doesn't find my other drives even with all drives plugged in as usual. Maybe it's related to syslinux + RAID?
Comment by Mladen Milinkovic (maxrd2) - Friday, 11 April 2014, 18:20 GMT
@tAsk: my /boot is on / (RAID1) partition
Comment by Mladen Milinkovic (maxrd2) - Friday, 11 April 2014, 18:23 GMT
usually when grub doesn't find root partition/kernel/initrd it gives some error message... to me this doesn't look like it
Comment by t-ask (tAsk) - Friday, 11 April 2014, 18:38 GMT
@maxrd2: So, if I get to the same error message just by using the data of /boot, then it's probably related to the data within /boot. As your system successfully gets to the data of your RAID /boot folder, I assume, it's not directly RAID related.

Maybe, the syslinux-install_update or mkinitcpio script does something wrong here?
Comment by Mladen Milinkovic (maxrd2) - Saturday, 12 April 2014, 00:20 GMT
@tAsk: I'm using grub not syslinux
@brain0: I have tried loglevel=7, debug, etc... the message doesn't change. After decompressing the kernel, everything hangs and I can only turn off the PC.

I have done a fresh Arch install to an external USB drive with EFI+MBR (not GPT) and it's the same: 3.13.8-1 boots just fine, 3.14-4 hangs after decompressing the kernel.
I could try converting to a GPT partition table, but don't think that will change anything.
So it hangs both on UEFI and BIOS.

If it was some issue with initrd.. There would be some (error) message not just a hang... right?
I mean kernel gets loaded, then scripts in initrd start executing... With loglevel=7 i would at least see some messages from the kernel if this was initrd issue?
Comment by Mladen Milinkovic (maxrd2) - Saturday, 12 April 2014, 02:42 GMT
Just finished upgrading Motherboard BIOS... now 3.14 boots with both EFI and BIOS just fine :D
Comment by Brian Hasselbeck (bhassel) - Saturday, 12 April 2014, 03:44 GMT
I hit this issue with hanging after decompressing the 3.14 kernel. After maxrd2's comment I upgraded the motherboard BIOS, and the 3.14 kernel now boots fine for me as well.

CPU is Intel i5 4670, motherboard is Gigabyte Z87MX-D3H. The BIOS was updated from version F2 (May 2013) to F6 (Jan 2014).
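
Since the firmware revision turns out to matter here, it may help to include it in reports. Below is a small sketch that reads it from sysfs, assuming the running kernel exposes DMI data under /sys/class/dmi/id. The helper name dmi_field is made up for this example.

```shell
# Print one DMI field from sysfs, or "unknown" when the entry is missing or unreadable.
dmi_field() {
    f="/sys/class/dmi/id/$1"
    if [ -r "$f" ]; then
        cat "$f"
    else
        echo "unknown"
    fi
}

echo "Board:        $(dmi_field board_name)"
echo "BIOS version: $(dmi_field bios_version)"
echo "BIOS date:    $(dmi_field bios_date)"
```

Running dmidecode -s bios-version as root should report the same version string.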
Comment by Thomas Bächler (brain0) - Saturday, 12 April 2014, 08:03 GMT
I would appreciate if anyone affected by this could bisect the problem. Even if this is a firmware problem, there are people that can't install newer firmware and they shouldn't be stuck on old kernels.
Comment by Mladen Milinkovic (maxrd2) - Saturday, 12 April 2014, 09:26 GMT
This is the link to the BIOS I downloaded, along with a rather useless changelog. Not sure if that can help to pinpoint the problem. I think I had F7 before I upgraded to F9.
http://www.gigabyte.com/products/product-page.aspx?pid=4567&m=n#bios
Comment by t-ask (tAsk) - Saturday, 12 April 2014, 12:41 GMT
That's interesting. I checked my BIOS and it is a quite old F2 version of 04/17/2013 :(

As upgrading the BIOS looks promising, I want to upgrade my BIOS, too. Before I do it, I would like to provide you all the information of my current BIOS. I'm thinking of 'dmidecode' and flashrom output pre and post BIOS update.

I already did the following steps:
$ dmidecode
$ dmidecode > pre_bios_dmi
$ dmidecode --dump > pre_bios_dmi_bin
$ flashrom --programmer internal -r pre_bios_rom
$ flashrom --programmer internal -c "MX25L6405(D)" -r pre_bios_rom_MX25L6405-D
$ flashrom --programmer internal -c "MX25L6406E/MX25L6436E" -r pre_bios_rom_MX25L6406E-MX25L6436E
$ flashrom --programmer internal -c "MX25L6445E" -r pre_bios_rom_MX25L6445E

Which additional information do you need, before I start my BIOS upgrade?
Comment by Mladen Milinkovic (maxrd2) - Saturday, 12 April 2014, 14:14 GMT
@tAsk: do you get "This chipset is marked as untested" message from flashrom? And you have dual bios on your motherboard?
Multiple flash chip definitions match the detected chip(s): "MX25L6405(D)", "MX25L6406E/MX25L6436E", "MX25L6445E"?

Because my board has "ITE IT8728F superio chip", I had to use latest flashrom from trunk which has support for this chip, and instead of "--programmer internal" had to use "--programmer internal:dualbiosindex=0" for master bios, or "--programmer internal:dualbiosindex=1" for backup:
$ yaourt -S flashrom-svn
$ flashrom -p internal:dualbiosindex=0 -c MX25L6445E -V -w tmp/mb_bios_ga-b85m-d3h_f9/B85MD3H.F9
Comment by Florian Ehmke (imbaer) - Saturday, 12 April 2014, 16:48 GMT
I updated my Gigabyte H87-HD3 Bios from F2 to F6.
As expected kernel 3.14 boots now.
Comment by GutsBlack (GutsBlack) - Saturday, 12 April 2014, 18:46 GMT
Same problem with Gigabyte H87N-WIFI Rev 1.0, BIOS F2. Upgrading to version F5 fixed the problem ;)
Comment by t-ask (tAsk) - Saturday, 12 April 2014, 23:18 GMT
@maxrd2: Yes, I get that message here. And this is a dual BIOS Gigabyte mainboard (http://www.gigabyte.com/products/product-page.aspx?pid=4491#bios)
This is my flashrom output:

Calibrating delay loop... OK.
Found chipset "Intel Z87".
This chipset is marked as untested. If you are using an up-to-date version
of flashrom *and* were (not) able to successfully update your firmware with it,
then please email a report to flashrom@flashrom.org including a verbose (-V) log.
Thank you!
Enabling flash write... Warning: SPI Configuration Lockdown activated.
OK.
Found Macronix flash chip "MX25L6405(D)" (8192 kB, SPI) at physical address 0xff800000.
Found Macronix flash chip "MX25L6406E/MX25L6436E" (8192 kB, SPI) at physical address 0xff800000.
Found Macronix flash chip "MX25L6445E" (8192 kB, SPI) at physical address 0xff800000.
Multiple flash chip definitions match the detected chip(s): "MX25L6405(D)", "MX25L6406E/MX25L6436E", "MX25L6445E"


Maybe I have to use "--programmer internal:dualbiosindex=0", too. Seeing three different chips makes me unsure how to flash my setup the right way. Furthermore, visiting my mainboard's download page just offers me *.exe files. Do I have to unpack this file first? And on which chip do I install it?

@brain0: Do you need information to fix that problem? If yes, which infos should I provide before I flash my BIOS? I would like to help, yet I need help to help you :)
Comment by Thomas Bächler (brain0) - Sunday, 13 April 2014, 06:35 GMT
The only *useful* information is the commit id of the commit that causes this problem. As mentioned above, someone who is affected must bisect the kernel (google for 'git bisect' for an explanation if you haven't heard of it yet).
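
For anyone who has not used it before, the bisect workflow can be sketched on a throwaway repository. Everything in this demo (the file names, the ten commits, the simulated regression in commit 7) is invented for illustration. On a real kernel tree the endpoints would be the last good and first bad releases (here v3.13 and v3.14), and each step means building and booting that revision by hand, then marking it with git bisect good or git bisect bad.

```shell
set -eu

# Throwaway repository standing in for the kernel tree.
repo=$(mktemp -d)
cd "$repo"
git init -q .
git config user.email "bisect-demo@example.invalid"
git config user.name "Bisect Demo"

# Ten commits; the simulated regression is introduced in commit 7.
i=1
while [ "$i" -le 10 ]; do
    if [ "$i" -ge 7 ]; then
        echo "hangs" > boot_result
    else
        echo "boots" > boot_result
    fi
    echo "$i" > revision
    git add boot_result revision
    git commit -q -m "commit $i"
    i=$((i + 1))
done

# Mark the newest commit bad and the oldest good, then let git drive the search.
# The test command must exit 0 for a good revision and 1 for a bad one.
git bisect start HEAD HEAD~9 >/dev/null
git bisect run grep -q boots boot_result >/dev/null

# After the run, refs/bisect/bad points at the first bad commit.
first_bad=$(git log -1 --format=%s refs/bisect/bad)
git bisect reset >/dev/null 2>&1
echo "first bad: $first_bad"
```

The output of git bisect log, taken before the reset, is what is worth attaching to a report.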
Comment by t-ask (tAsk) - Sunday, 13 April 2014, 14:45 GMT
Someone with a similar mainboard could downgrade the BIOS to bisect the kernel. I'm going to upgrade my BIOS then. Thanks for your help.
edit: the upgrade fixed the problem.
Comment by GutsBlack (GutsBlack) - Sunday, 13 April 2014, 18:43 GMT
Oups :)
Comment by Spyros Stathopoulos (Foucault) - Sunday, 13 April 2014, 19:48 GMT
I will try checking out the RCs first. Hopefully one of them should exhibit the problem.
Comment by Spyros Stathopoulos (Foucault) - Monday, 14 April 2014, 13:59 GMT
The problem is found between 3.13 and 3.14-rc1. According to the git bisect (log attached) the first bad commit is 671cc68 [1]. If I followed the tree correctly the commit in question is indeed merged between 3.13 and 3.14-rc1.

[1] https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=671cc68dc61f029d44b43a681356078e02d8dab8
Comment by Thomas Bächler (brain0) - Monday, 14 April 2014, 14:02 GMT
Now, can you apply that patch in reverse to the Arch kernel PKGBUILD and verify that it helps? If so, an upstream bug report should be resolved relatively quickly.
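
Applying a commit "in reverse" can be sketched as follows, again on a throwaway repository. The file name and both commits are made up for this example. In a PKGBUILD the equivalent step would more likely be a patch -Rp1 -i call on the exported commit, which is an assumption about how one might wire it in, not what the Arch kernel package does.

```shell
set -eu

# Throwaway repository with a baseline commit and one suspect commit on top.
work=$(mktemp -d)
cd "$work"
git init -q .
git config user.email "revert-demo@example.invalid"
git config user.name "Revert Demo"

echo "old behaviour" > acpi_table.c
git add acpi_table.c
git commit -q -m "baseline"
echo "new behaviour" > acpi_table.c
git commit -q -am "suspect change"

# Export the suspect commit as a patch file, then apply that patch in reverse.
patchfile=$(mktemp)
git format-patch -1 --stdout HEAD > "$patchfile"
git apply -R "$patchfile"

reverted=$(cat acpi_table.c)
echo "$reverted"
```

If the reverse patch does not apply cleanly to the packaged source, the conflicting hunks have to be fixed up by hand before building.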
Comment by Thomas Bächler (brain0) - Monday, 14 April 2014, 14:03 GMT
On another note, this looks fishy to me, since this commit is unrelated to very early boot. Anyway, try to revert it and see what happens.
Comment by Łukasz Twarduś (Ang3lus) - Monday, 14 April 2014, 21:10 GMT
I was messing with this issue for 2 days before i hit this page :D

My 3.14+ kernel boots only with acpi=off or acpi=rsdt, otherwise i get black screen of death after selecting entry from gummiboot:

i7-4770
Gigabyte Z87-HD3
Comment by Spyros Stathopoulos (Foucault) - Tuesday, 15 April 2014, 06:44 GMT
Indeed, it is ACPI related. Strangely enough, my PC does not boot even with acpi=off. In any case, fishy as it may seem, using the attached patch on 3.14 boots the kernel as expected. It basically reverts commits 671cc68-c14ced0. Since the patch does not apply cleanly to v3.14 I had to edit around a bit, so I don't know if anything else breaks. The kernel boots properly and the system seems to work OK so far, though. The patch is a bit chatty and could use some rebasing, but I don't have enough time atm to do that. Maybe later. It would be nice if someone could test it against the [core] kernel and verify it works.
Comment by Thomas Bächler (brain0) - Tuesday, 15 April 2014, 07:13 GMT
That is quite some revert. Spyros, please open a bug report upstream with this information ASAP. We need the maintainers of this code to come up with a proper solution.
Comment by Spyros Stathopoulos (Foucault) - Tuesday, 15 April 2014, 09:47 GMT
Comment by Whoever (blk_caesar) - Tuesday, 15 April 2014, 16:10 GMT
Same...

i7-4770
GA-Z87X-UD4H
Comment by abc (Xiflite) - Tuesday, 15 April 2014, 19:46 GMT
The last days I thought I am the only one with this problem. Happy to read this :D

3 Computers: 1 is affected

Specs:
MB: Supermicro X9SAE
CPU: Xeon E3 1275v2
Booting with:
Grub (EFI, latest beta from testing), mdadm RAID 1 (lvm hook also activated)

I've tried to debug. Kernel debug does nothing at all. Grub debug seems to load vmlinuz, the image and then freezes after running some free command. Very strange imho.
Comment by Spyros Stathopoulos (Foucault) - Tuesday, 15 April 2014, 21:35 GMT
Try rebuilding the kernel with the patch attached right above and see if it helps. If it does, it is quite possibly the same issue.
Comment by Thomas Bächler (brain0) - Wednesday, 16 April 2014, 06:33 GMT
Please look at bug 73911, especially at the tests requested here:

https://bugzilla.kernel.org/show_bug.cgi?id=73911#c8
https://bugzilla.kernel.org/show_bug.cgi?id=73911#c13

You should be able to see a panic message at boot using the option earlyprintk=vga,keep.
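
As a sketch of how the debugging options suggested in this report could be added to an existing kernel line: the helper add_param is invented for this example, and the base command line is the one posted in an earlier comment (minus resume=). The resulting string still has to be entered by hand at the bootloader prompt or written into its config.

```shell
# Append a parameter to a kernel command line unless it is already present.
add_param() {
    case " $1 " in
        *" $2 "*) printf '%s\n' "$1" ;;
        *)        printf '%s\n' "$1 $2" ;;
    esac
}

cmdline='root=UUID=4dd972d0-4a16-43ce-b451-670df2a99460 rw ipv6.disable=1'
cmdline=$(add_param "$cmdline" 'loglevel=7')
cmdline=$(add_param "$cmdline" 'earlyprintk=vga,keep')

echo "linux /boot/vmlinuz-linux $cmdline"
```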
Comment by Egorov Sergej (Cetronix) - Wednesday, 16 April 2014, 16:33 GMT
Confirmed with i3-4130 and B85 chipset.
Comment by Łukasz Twarduś (Ang3lus) - Wednesday, 16 April 2014, 16:40 GMT
In my case earlyprintk=vga,keep does not work, I've even tried insane debugging from arch wiki and still blank screen.
Comment by Ian McNee (geekmcbean) - Wednesday, 16 April 2014, 21:52 GMT
Same problem:
Gigabyte Z87-D3HP MB
BIOS F2
Core i5 4440
Kernel 3.14.0-5-ARCH on x86_64

Update from 3.13 to 3.14 led to hang loading initramfs.

FIXED - update BIOS from F2 to F6, all fine.
Comment by Whoever (blk_caesar) - Thursday, 17 April 2014, 12:45 GMT
i7-4770
GA-Z87X-UD4H

Bios update from F5 to F7/F8 did resolve the issue.
Comment by Łukasz Twarduś (Ang3lus) - Thursday, 17 April 2014, 16:11 GMT
i7-4770
Gigabyte Z87-HD3
Update from F4 to F7 solved the issue.
Comment by Thomas Bächler (brain0) - Thursday, 17 April 2014, 19:32 GMT
Just FYI, it seems there is a patch in the upstream bug report that solves the problem.
Comment by Spyros Stathopoulos (Foucault) - Thursday, 17 April 2014, 22:12 GMT
According to upstream, this patch solves the issue. You can try rebuilding the kernel with it. It would be nice if it could be applied to the repo kernel as well.
Comment by Thomas Bächler (brain0) - Friday, 18 April 2014, 09:27 GMT
I'll wait for a few confirmations first. Then we can get a new repo kernel quickly.
Comment by abc (Xiflite) - Friday, 18 April 2014, 19:57 GMT
Patch works for me (Supermicro X9SAE).

Thanks!
Comment by Abelardo Ricart (aricart) - Tuesday, 22 April 2014, 19:21 GMT
Had this same issue with my Toshiba laptop (AMD Bobcat Architecture). Did a git bisect and ended up coming to the same conclusions shown here and on the kernel bugzilla.

Applying the above patch (lv-xsdt2-3.14.patch) allows me to boot once again. Thanks.
