FS#53227 - [linux-firmware] Random crashes/reboots when using Intel GPU firmware on kernel >= 4.8.6
Attached to Project:
Arch Linux
Opened by Ralf Barth (Haggy) - Thursday, 09 March 2017, 15:16 GMT
Last edited by Sven-Hendrik Haase (Svenstaro) - Thursday, 03 March 2022, 12:20 GMT
Details
Description:
When using the Intel i915 firmware for integrated GPUs on Skylake (which is part of this package), I experience random reboots and kernel crashes. Other people are affected as well; similar reports can be found on the web. The error message is more or less always the same: "[i915]] *ERROR* DC state mismatch (0x0 -> 0x2)".

My current workaround is to rename or remove /lib/firmware/i915 and rebuild the initrd. The rebuild complains about the missing i915 firmware, but the machine then runs stable for days. When the firmware is loaded, I see the mentioned kernel crashes in dmesg and the machine reboots at least once a day. Oddly, the reboot happens some minutes *after* the kernel crash is logged.

Additional info:
* package version(s)
Intel firmware as part of linux-firmware, in combination with kernel 4.8.6 and later.
* config and/or log files etc.
Kernel crash attached. Additional reports (in German, on Debian) are available here: https://github.com/Bananian/ct-server-2016-jessie/issues/1

Steps to reproduce:
- Use kernel 4.8.6 or later on a Skylake system
- Make sure i915 is loaded
- Wait for the crash and/or reboot.

Maybe this is worth reporting upstream, as I don't see any direct relationship to Arch other than packaging the firmware. Also note that making linux-firmware an optional package (currently the kernel pulls it in) would work around the problem until it gets fixed upstream.
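For anyone who wants to check whether their machine logs the same error, here is a minimal sketch. The function name count_dc_mismatch is made up for illustration and is not part of any package; it only counts lines containing the "DC state mismatch" text quoted above and does not tell you anything about the cause.

```shell
# Count kernel log lines that contain the i915 "DC state mismatch" error.
# Reads log text on stdin, so it can be fed from dmesg or journalctl.
# NOTE: count_dc_mismatch is a hypothetical helper name, not an existing tool.
count_dc_mismatch() {
    # -F matches the text literally, -c prints the number of matching lines.
    # grep exits non-zero when nothing matches (it still prints 0), so
    # "|| true" keeps the function usable in scripts that run with "set -e".
    grep -c -F 'DC state mismatch' || true
}
```

It could be used as `dmesg | count_dc_mismatch` for the running system, or as `journalctl -k -b -1 | count_dc_mismatch` to look at the previous boot, assuming a persistent journal is kept.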
Closed by Sven-Hendrik Haase (Svenstaro)
Thursday, 03 March 2022, 12:20 GMT
Reason for closing: Fixed
Additional comments about closing: 2022-02-27: A task closure has been requested. Reason for request: No more reproducible. Assuming fixed upstream.
1) The problem doesn't seem to be related to the i915 firmware: it still crashes and reboots randomly after I removed the i915 firmware and rebuilt the initramfs (mkinitcpio -c /etc/mkinitcpio.conf -g /boot/initramfs-4.14-x86_64.img).
2) The problem seems to be triggered by a spike in CPU/GPU load. Two kinds of actions in particular trigger it: disk partitioning and OpenGL-accelerated applications. Sometimes a change in the GNOME system settings also causes a random reboot.
3) The problem does not occur when a realtime Linux kernel is used.
I strongly suspect this is a bug in Linux power management for Skylake.
Processor: Intel® Core™ i7-6567U
Graphics: Intel® Iris Graphics 550 (Skylake GT3e)
Linux kernel: 4.11.12-1-rt16-MANJARO
TLP modified settings (/etc/default/tlp):
SCHED_POWERSAVE_ON_AC=0
ENERGY_PERF_POLICY_ON_AC=normal
Note: this combination also fixed the random Bluetooth disconnects and reconnection failures of the Intel 8265 on this machine.
SCHED_POWERSAVE_ON_BAT=0
I reverted everything to its default state, with only the following changes to the TLP settings:
/etc/default/tlp
# Set Intel P-state performance: 0..100 (%)
# Limit the max/min P-state to control the power dissipation of the CPU.
# Values are stated as a percentage of the available performance.
# Requires an Intel Core i processor with intel_pstate driver.
CPU_MIN_PERF_ON_AC=100
CPU_MAX_PERF_ON_AC=100
CPU_MIN_PERF_ON_BAT=100
CPU_MAX_PERF_ON_BAT=100
Now the machine never crashes, no matter what tricks I play on it. This workaround seems to work under both a normal Linux kernel and a realtime kernel.
I guess the crash was triggered by P-state switching to save power. In the previous report I tested with WebGL Aquarium, which produces a fairly steady load. Later, when I was using Krita, whose CPU load comes in spikes, the old workaround stopped working. With an external monitor connected it did not crash either, but I guess the machine was running at full power the whole time because the dual-monitor setup requires that much power.
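To verify that limits like the ones above actually took effect, the current values can be read back from sysfs. The following is only a sketch: it assumes the intel_pstate driver is in use and exposes min_perf_pct and max_perf_pct under /sys/devices/system/cpu/intel_pstate, and the function name pstate_pinned is made up for illustration.

```shell
# Report whether intel_pstate is pinned to 100% performance.
# $1: directory holding min_perf_pct and max_perf_pct
#     (assumption: defaults to the intel_pstate sysfs directory).
# NOTE: pstate_pinned is a hypothetical helper name, not part of TLP.
pstate_pinned() {
    dir=${1:-/sys/devices/system/cpu/intel_pstate}
    # Bail out if the driver (or the given directory) is not there.
    [ -r "$dir/min_perf_pct" ] && [ -r "$dir/max_perf_pct" ] || {
        echo "intel_pstate not available"
        return 2
    }
    min=$(cat "$dir/min_perf_pct")
    max=$(cat "$dir/max_perf_pct")
    echo "min=${min}% max=${max}%"
    # Exit status 0 only when both limits are at 100%.
    [ "$min" -eq 100 ] && [ "$max" -eq 100 ]
}
```

Called without arguments it prints the current limits and returns success only if both are 100, so it can be used in a condition such as `pstate_pinned && echo pinned`. This only shows what the driver reports; it does not confirm that P-state switching is the cause of the crashes.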
I don't possess the device from my initial report anymore. I'm running Arch Linux on machines with CPUs from later generations, and I have never encountered a crash like that again. Note, however, that I'm not using TLP for power management now, so there are a lot of variables here.
I certainly cannot assist in testing this bug anymore with my current rigs; please close the bug if you feel like it.