FS#53227 - [linux-firmware] Random crashes/reboots when using Intel GPU firmware on kernel >= 4.8.6

Attached to Project: Arch Linux
Opened by Ralf Barth (Haggy) - Thursday, 09 March 2017, 15:16 GMT
Last edited by Sven-Hendrik Haase (Svenstaro) - Thursday, 03 March 2022, 12:20 GMT
Task Type Bug Report
Category Upstream Bugs
Status Closed
Assigned To Laurent Carlier (lordheavy)
Architecture x86_64
Severity High
Priority Normal
Reported Version
Due in Version Undecided
Due Date Undecided
Percent Complete 100%
Votes 0
Private No

Details

Description:
When using the Intel i915 firmware for integrated GPUs on Skylake (which is part of this package), I experience random reboots and kernel crashes. This also affects other people, as there are similar reports on the web. The error message is more or less always the same: "[i915]] *ERROR* DC state mismatch (0x0 -> 0x2)". My current workaround is to rename or remove /lib/firmware/i915 and rebuild the initrd, which complains about the missing i915 firmware but lets the machine run stable for days. When the firmware is loaded, I see the mentioned kernel crashes in dmesg and the machine reboots at least once a day. Oddly, the reboot happens some minutes *after* the kernel crash is logged.
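
The workaround described above could be scripted roughly as follows. This is only a sketch: the helper name `disable_i915_firmware` and the `i915.disabled` target name are made up for illustration, and the rebuild step assumes mkinitcpio with its standard presets. A later linux-firmware upgrade would presumably bring the directory back.

```shell
#!/bin/sh
# Hypothetical helper: move the i915 firmware directory aside so it is no
# longer found under its usual name when the initramfs is rebuilt.
# The firmware root defaults to /lib/firmware and can be overridden.
disable_i915_firmware() {
    fw_root="${1:-/lib/firmware}"
    if [ ! -d "$fw_root/i915" ]; then
        echo "no i915 firmware directory under $fw_root" >&2
        return 1
    fi
    # "i915.disabled" is an arbitrary name chosen for this sketch.
    mv "$fw_root/i915" "$fw_root/i915.disabled"
}

# Usage (as root), followed by an initramfs rebuild for all presets:
#   disable_i915_firmware
#   mkinitcpio -P
```

Renaming rather than deleting keeps the change easy to undo by moving the directory back and rebuilding the initramfs again.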

Additional info:
* package version(s)
Intel firmware as part of linux-firmware in combination with Kernel 4.8.6 and later.

* config and/or log files etc.
Kernel crash attached. Additional reports (though in German, on Debian) are available here: https://github.com/Bananian/ct-server-2016-jessie/issues/1

Steps to reproduce:
- Use Kernel 4.8.6 or later on a Skylake system
- Make sure i915 is loaded
- Wait for the crash and/or reboot.
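
Because the reboot follows the logged crash by some minutes, it can help to search the kernel log of the previous boot for the error message. A minimal sketch, assuming a persistent systemd journal is available; the function name `count_dc_mismatch` is hypothetical:

```shell
#!/bin/sh
# Count kernel log lines read from stdin that contain the i915
# "DC state mismatch" error quoted in this report.
count_dc_mismatch() {
    # grep -c prints the number of matching lines; it exits non-zero when
    # there are none, so "|| true" keeps the function's status at zero.
    grep -c 'DC state mismatch' || true
}

# Usage:
#   dmesg | count_dc_mismatch                  # current boot
#   journalctl -k -b -1 | count_dc_mismatch    # previous boot (needs a persistent journal)
```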

Maybe this is worth reporting upstream, as I don't see any direct relationship to Arch other than packaging the firmware. Also note that making linux-firmware an optional package (currently the kernel pulls it in) would also work around the problem until it gets fixed upstream.

Closed by  Sven-Hendrik Haase (Svenstaro)
Thursday, 03 March 2022, 12:20 GMT
Reason for closing:  Fixed
Additional comments about closing:  2022-02-27: A task closure has been requested. Reason for request: No more reproducible. Assuming fixed upstream.
Comment by Tyson Tan (tysontan) - Tuesday, 21 November 2017, 15:33 GMT
I'm running Manjaro on a Wacom MobileStudio Pro 13, which happens to have a Skylake CPU. I can confirm this is happening on my machine. I'm not particularly familiar with the Linux kernel and its firmware, but if there is anything I can do to help resolve this problem, please let me know!
Comment by Tyson Tan (tysontan) - Wednesday, 22 November 2017, 01:28 GMT
On my system:
1) the problem doesn't seem to be related to the i915 firmware: it still crashes and reboots randomly after I removed the i915 firmware and ran (mkinitcpio -c /etc/mkinitcpio.conf -g /boot/initramfs-4.14-x86_64.img)
2) the problem seems to be triggered by a spike in CPU/GPU load. Two kinds of actions in particular trigger it: disk partitioning and OpenGL-accelerated applications. Sometimes a change in the GNOME system settings also causes a random reboot.
3) the problem does not occur when a realtime Linux kernel is being used.

I highly suspect this to be a bug in Linux power management for Skylake.
Comment by Tyson Tan (tysontan) - Wednesday, 22 November 2017, 07:43 GMT
I did a few more tests. On my system, the combination of a realtime (lowlatency) kernel and certain TLP settings fixed the random crash, although as a side effect the machine can get very hot because CPU threads cannot be turned off.

Processor: Intel® Core™ i7-6567U
Graphics: Intel® Iris Graphics 550 (Skylake GT3e)
Linux kernel: 4.11.12-1-rt16-MANJARO
TLP modified settings (/etc/default/tlp):
SCHED_POWERSAVE_ON_AC=0
ENERGY_PERF_POLICY_ON_AC=normal

Note: this combination also solved the random Bluetooth disconnections and reconnection failures of the Intel 8265 on this machine.
Comment by Tyson Tan (tysontan) - Wednesday, 22 November 2017, 07:45 GMT
If you are testing on a laptop without the power supply connected, one additional TLP setting needs to be changed:
SCHED_POWERSAVE_ON_BAT=0
Comment by Tyson Tan (tysontan) - Thursday, 23 November 2017, 05:36 GMT
The new test results I got today suggest this is an Intel P-state related problem.
I reverted everything to its default state, with only the following changes in the TLP settings:

/etc/default/tlp

# Set Intel P-state performance: 0..100 (%)
# Limit the max/min P-state to control the power dissipation of the CPU.
# Values are stated as a percentage of the available performance.
# Requires an Intel Core i processor with intel_pstate driver.
CPU_MIN_PERF_ON_AC=100
CPU_MAX_PERF_ON_AC=100
CPU_MIN_PERF_ON_BAT=100
CPU_MAX_PERF_ON_BAT=100
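
To check whether these limits are actually in effect, the intel_pstate driver's sysfs files can be read back. A minimal sketch, assuming the intel_pstate driver is active and that TLP applies the settings above to `min_perf_pct` and `max_perf_pct`; the function name `show_pstate_limits` is hypothetical:

```shell
#!/bin/sh
# Print the current intel_pstate performance limits (in percent).
# The directory defaults to the driver's sysfs location and can be overridden.
show_pstate_limits() {
    pstate_dir="${1:-/sys/devices/system/cpu/intel_pstate}"
    if [ ! -r "$pstate_dir/min_perf_pct" ] || [ ! -r "$pstate_dir/max_perf_pct" ]; then
        echo "intel_pstate limits not available in $pstate_dir" >&2
        return 1
    fi
    echo "min=$(cat "$pstate_dir/min_perf_pct") max=$(cat "$pstate_dir/max_perf_pct")"
}

# Usage:
#   show_pstate_limits
```

With the settings above applied, both values would be expected to read 100.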

Now the machine never crashed, no matter what tricks I played on it. This workaround seemed to work under both a normal Linux kernel and a realtime kernel.

I guess the crash was triggered by P-state switching to save power. In the previous report I was using WebGL Aquarium for testing, which produces a fairly steady load. But later, when I was using Krita, whose CPU load comes in spikes, the old workaround failed. With an external monitor connected it also did not crash, but I guess the machine was running at full power the whole time because the dual-display setup requires that much power.
Comment by mattia (nTia89) - Sunday, 27 February 2022, 08:55 GMT
I cannot reproduce the issue. Is it still valid for you?
Comment by Tyson Tan (tysontan) - Sunday, 27 February 2022, 10:00 GMT
I don't think so.

I no longer possess the device from my initial report. I'm running Arch Linux on machines with CPUs from later generations, and I have never encountered a crash like that again. Note, though, that I'm not using TLP for power management now, so there are a lot of variables here.

I certainly cannot assist in testing this bug anymore with my current rigs. Please close the bug if you feel like it.
