FS#57067 - [intel-ucode] Problems on Haswell, Broadwell and KabyLake

Attached to Project: Arch Linux
Opened by Peter Weber (hoschi) - Friday, 12 January 2018, 10:56 GMT
Last edited by Christian Hesse (eworm) - Wednesday, 14 March 2018, 21:41 GMT
Task Type Bug Report
Category Upstream Bugs
Status Closed
Assigned To Christian Hesse (eworm)
Architecture All
Severity Medium
Priority Normal
Reported Version
Due in Version Undecided
Due Date Undecided
Percent Complete 100%
Votes 10
Private No

Details

Description:
Hello!

The new microcode for several CPUs seems to cause undesired reboots, page faults and system hangs:
https://newsroom.intel.com/news/intel-security-issue-update-addressing-reboot-issues/
https://support.lenovo.com/de/de/solutions/len-18282

Additional info:
* package version(s): 20180108-1

We cannot fix that, and not providing the updates is also a problem.
Maybe we should print a warning message? If not, at least this bug can tell users about the problems, and they can decide on their own whether to apply the updated microcode.

Thanks

Closed by  Christian Hesse (eworm)
Wednesday, 14 March 2018, 21:41 GMT
Reason for closing:  Fixed
Additional comments about closing:  intel-ucode 20180312-1
Comment by Peter Weber (hoschi) - Monday, 22 January 2018, 16:15 GMT
Red Hat has withdrawn the microcode updates:
https://access.redhat.com/solutions/3315431?sc_cid=701f2000000tsLNAAY&

No congratulations to Intel for this achievement.
Comment by loqs (loqs) - Wednesday, 24 January 2018, 17:46 GMT
http://lkml.org/lkml/2018/1/24/553
As the Arch kernels do not use any feature exposed by X86_FEATURE_SPEC_CTRL, have any Arch systems experienced this issue?
Comment by Anthony Ruhier (Anthony25) - Wednesday, 24 January 2018, 17:51 GMT
@loqs: If, by issue, you are talking about freezes, yes, I did. I also had a lot of data corruption on btrfs (my system did not boot after one day of running on this microcode), which is now solved after a rollback to the old microcode (20171117). I have an i5 4670k (Haswell).
Comment by loqs (loqs) - Thursday, 25 January 2018, 18:50 GMT
@Anthony25 can you please report your findings to the thread I linked on the Linux kernel mailing list? Please include your kernel version, cpuid and the bad microcode revision.
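For anyone preparing such a report, most of the requested details can be read from /proc/cpuinfo. A minimal Python sketch follows, assuming the usual x86 field names (`cpu family`, `model`, `stepping`, `microcode`); the helper names `cpu_report` and `read_cpu_report` are made up for illustration.

```python
def cpu_report(cpuinfo_text):
    """Return identifying fields of the first CPU listed in cpuinfo text."""
    wanted = ("cpu family", "model", "model name", "stepping", "microcode")
    report = {}
    for line in cpuinfo_text.splitlines():
        if not line.strip():
            if report:
                break  # a blank line ends the first processor block
            continue
        key, _, value = line.partition(":")
        key = key.strip()
        if key in wanted:
            report[key] = value.strip()
    return report


def read_cpu_report(path="/proc/cpuinfo"):
    """Read the given file (Linux only by default) and summarise it."""
    with open(path) as f:
        return cpu_report(f.read())
```

Calling `read_cpu_report()` on the affected machine and pasting the result next to the output of `uname -r` should cover what is asked for above.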
Comment by Anthony Ruhier (Anthony25) - Thursday, 25 January 2018, 19:18 GMT
@loqs: I don't mind doing it, but is this really helpful? Intel seems to be aware of the issue, as the Haswell architecture is mentioned in Intel's announcement. Manufacturers like Lenovo and Dell are also removing firmware updates that include this version of Intel's microcode from their websites.

In my opinion, this version should be moved to [testing], and [extra] should provide the 20171117 release, as everyone else does.
Comment by loqs (loqs) - Thursday, 25 January 2018, 21:12 GMT
In the link I posted, the kernel developers were proposing not to blacklist those microcode updates but just to disable SPEC_CTRL.
As the Arch kernel does not use any features exposed by SPEC_CTRL, I would not have expected your system to experience such issues.
The kernel developers seem to be working on the same basis, rather than assuming that the affected microcode should never be used.
Are you dual booting with another OS? The only other possibility I can think of is that something in userspace was using those features.
Arch could release a new version using the epoch feature to force a downgrade.
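The epoch idea works because pacman compares the epoch before the rest of the version string, and a missing epoch counts as 0. The Python sketch below illustrates that ordering only. It assumes purely numeric, dot-separated `pkgver` fields and is not pacman's actual `vercmp` algorithm; the version `1:20171117-1` is a hypothetical example, not a released package.

```python
def parse_version(version):
    """Split 'epoch:pkgver-pkgrel' into a tuple that sorts epoch first.

    Simplification: pkgver (dots allowed) and pkgrel must be numeric.
    """
    epoch, colon, rest = version.partition(":")
    if not colon:  # no explicit epoch means epoch 0
        epoch, rest = "0", version
    pkgver, dash, pkgrel = rest.rpartition("-")
    if not dash:  # no pkgrel given
        pkgver, pkgrel = rest, "0"
    return (int(epoch), tuple(int(p) for p in pkgver.split(".")), int(pkgrel))


def is_upgrade(installed, candidate):
    """True if candidate sorts strictly above installed."""
    return parse_version(candidate) > parse_version(installed)
```

With this ordering, `is_upgrade("20180108-1", "1:20171117-1")` is true while `is_upgrade("20180108-1", "20171117-1")` is false. The flip side is that every later release would have to carry the epoch as well, otherwise it would sort below the downgraded package.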
Comment by Anthony Ruhier (Anthony25) - Thursday, 25 January 2018, 21:20 GMT
I am dual booting with Windows 10, for gaming only, and I did not have any issue on it (yet?).

I am using the linux-zen kernel, but if I remember correctly, I had the same issues with vanilla (4.14.12 and 4.14.13).

Edit: FYI, after multiple reboots from a live CD to recover my filesystem, which kept corrupting itself during my attempts to understand what was going on, all I did was install the old ucode. Everything is OK now.
I can try again with the latest microcode and the vanilla kernel to be sure it's not related to Zen (after forcing a backup :p).

Edit 2: OK, tested on vanilla 4.14.15 with the latest ucode: I got a lot of kernel errors. Some apps were freezing, and sometimes they could not use the network (followed by errors in a kernel thread related to the network stack), etc.
Comment by loqs (loqs) - Friday, 26 January 2018, 04:08 GMT
Can you try building kernel 4.15-rc9 with the patches to disable SPEC_CTRL? If you use any out-of-tree drivers such as nvidia or broadcom, you will need a patch for those for 4.15 as well.
If the issue persists, upstream would seem to be mistaken about their proposed fix being effective.
Comment by Anthony Ruhier (Anthony25) - Friday, 26 January 2018, 08:43 GMT
Thanks, I will test it!
Comment by Anthony Ruhier (Anthony25) - Friday, 26 January 2018, 21:52 GMT
After some hours running on 4.15-rc9 (with the patches disabling SPEC_CTRL), I haven't seen any issues with the latest microcode. I tried some intensive work, using a bit of swap and forcing some random writes, and this time btrfs scrub doesn't report any corruption.

I don't know if this patch specifically fixes the issue for me, but something between 4.14 and this commit did.

Thanks loqs!
Comment by Ike Rippin (Janick.Hauck92) - Thursday, 01 February 2018, 11:21 GMT
Intel pulled the 20180108 firmware; it's not supported anymore: https://downloadcenter.intel.com/search?keyword=processor+microcode+data+file

Arch should downgrade to 20171117, most distros already did.
Comment by Anthony Ruhier (Anthony25) - Thursday, 01 February 2018, 11:24 GMT
@Janick.Hauck92: In my opinion, it depends. Linux 4.15 seems to blacklist what is broken with this microcode (on my desktop, it does), so if it's pushed quite early in [core], the current microcode can be kept.
Comment by Ike Rippin (Janick.Hauck92) - Thursday, 01 February 2018, 12:37 GMT
@Anthony25: The vendor doesn't support it. Almost no one in the Linux world is using it. What's the point? What do you get from that? Something being fixed on your machine doesn't mean it's fixed on everyone else's.

Arch is supposed to follow upstream, not roll its own hacks. There is the linux-lts package, which will stay on 4.14 for a long time. Intel will release new microcode when it's ready.
Comment by Anthony Ruhier (Anthony25) - Thursday, 01 February 2018, 12:43 GMT
@Janick.Hauck92: yep, you're right, I agree. I don't know if this patch will be backported anyway. I don't think, but can't affirm, that it's only on my machine. Their blacklist was to deal with this shitty microcode. But it doesn't change the situation a lot, and you made a good point with LTS.
Comment by loqs (loqs) - Thursday, 01 February 2018, 16:51 GMT
backported 1df37383a8aeabb9b418698f0bcdffea01f4b1b2 1a29b5b7f347a1a9230c1e0af5b37e3e571588ab c940a3fb1e2e9b7d03228ab28f375fb5a47ff699 caf7501a1b4ec964190f31f9c3f163de252273b8 95ca0ee8636059ea2800dfbac9ecac6212d6b38f 95ca0ee8636059ea2800dfbac9ecac6212d6b38f 5d10cbc91d9eb5537998b65608441b592eec65e7 5d10cbc91d9eb5537998b65608441b592eec65e7 fec9434a12f38d3aeafeb75711b71d8a1fdef621 a5b2966364538a0e68c9fa29bc0a3a1651799035
Build tested. The set could probably be reduced, but for a simple test I just kept all patches from the PTI merge up to the patch required to disable IBRS on the affected microcodes.
Comment by Ike Rippin (Janick.Hauck92) - Thursday, 01 February 2018, 17:05 GMT
@loqs the question is: why do we need patches when we can simply downgrade this package?
Comment by loqs (loqs) - Thursday, 01 February 2018, 17:24 GMT
This is an alternative for the maintainers if the downgrade is not acceptable to them. The patch set could also be extended to add Spectre V1 mitigation and improved V2 mitigation.
As a minimal demonstration, I stopped the backports short of those additional patches.
Comment by loqs (loqs) - Friday, 09 February 2018, 23:21 GMT
4.15.2 and 4.14.18 both contain mitigations for the affected microcode. Is anyone still experiencing issues with those kernels?
Comment by Andrea Amorosi (AndreaA) - Sunday, 11 February 2018, 12:29 GMT
My ASUS N752VX with an Intel i7-6700HQ CPU hangs during boot at the dmesg line "Booting SMP configuration" with the 4.15 kernels (both 4.15.1 and 4.15.2) if intel-ucode.img is passed to initrd in grub.cfg:
initrd /intel-ucode.img /initramfs-linux.img
With the following initrd line, 4.15.2 works perfectly:
initrd /initramfs-linux.img
Reverting to the previous intel-ucode package version does NOT solve the issue, so I do not know whether the problem lies in intel-ucode itself or in how kernel 4.15 and later handle it.
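The two boot configurations described above differ only in the `initrd` line. As a rough aid for comparing them, the Python sketch below lists the images named on the `initrd` lines of a GRUB configuration and reports whether a microcode image comes first. The function names and the `-ucode.img` suffix check are assumptions made for illustration, not part of GRUB or of the intel-ucode package.

```python
def initrd_images(grub_cfg_text):
    """Return one list of image paths per 'initrd' line in the given text."""
    entries = []
    for line in grub_cfg_text.splitlines():
        words = line.split()
        if words and words[0] == "initrd":
            entries.append(words[1:])
    return entries


def loads_microcode_first(images):
    """True if the first image on an initrd line looks like a microcode image."""
    return bool(images) and images[0].endswith("-ucode.img")
```

Running `initrd_images()` over the contents of grub.cfg and passing each entry to `loads_microcode_first()` shows which menu entries load the microcode early and which skip it.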
Comment by Josip Ponjavic (metak) - Wednesday, 14 March 2018, 01:12 GMT