FS#64725 - i915, linux: Resetting rcs0 for hang on rcs0

Attached to Project: Arch Linux
Opened by Robert (fuero) - Wednesday, 04 December 2019, 09:51 GMT
Last edited by freswa (frederik) - Thursday, 30 April 2020, 11:38 GMT
Task Type Bug Report
Category Kernel
Status Closed
Assigned To Jan Alexander Steffens (heftig)
Architecture x86_64
Severity High
Priority Normal
Reported Version
Due in Version Undecided
Due Date Undecided
Percent Complete 100%
Votes 29
Private No

Details

Description:

I'm experiencing hangs several times a day, producing this in dmesg:

[ 1696.869719] i915 0000:00:02.0: GPU HANG: ecode 9:0:0x00000000, hang on rcs0
[ 1696.869736] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[ 1696.869736] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[ 1696.869736] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[ 1696.869737] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
[ 1696.869737] [drm] GPU crash dump saved to /sys/class/drm/card0/error
[ 1696.870744] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
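
For anyone triaging similar logs, here is a minimal Python sketch that pulls the PCI address, error code, and engine name out of `GPU HANG` lines like the ones above. The regular expression and the function name `parse_gpu_hangs` are illustrative assumptions derived only from the message format quoted in this report; they are not part of any i915 tooling, and other kernel versions may word the message differently.

```python
import re

# Pattern assumed from the dmesg lines quoted in this report.
HANG_RE = re.compile(
    r"i915 (?P<pci>[0-9a-f:.]+): GPU HANG: "
    r"ecode (?P<ecode>[^,\s]+), hang on (?P<engine>\w+)"
)

def parse_gpu_hangs(log_text):
    """Return one dict (pci, ecode, engine) per GPU HANG line in log_text."""
    return [m.groupdict() for m in HANG_RE.finditer(log_text)]

if __name__ == "__main__":
    sample = (
        "[ 1696.869719] i915 0000:00:02.0: GPU HANG: ecode 9:0:0x00000000, hang on rcs0\n"
        "[ 1696.870744] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0\n"
    )
    for hang in parse_gpu_hangs(sample):
        print(hang)
```

Only the first sample line matches; the "Resetting" line is deliberately ignored because it carries no error code.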

Messing with i915.* settings didn't help

Additional info:
* Hardware: HP EliteDesk 800 G4 SFF (2US83AV)
GPU as reported by lshw:
*-display
description: VGA compatible controller
product: UHD Graphics 630 (Desktop)
vendor: Intel Corporation
physical id: 2
bus info: pci@0000:00:02.0
logical name: /dev/fb0
version: 00
width: 64 bits
clock: 33MHz
capabilities: pciexpress msi pm vga_controller bus_master cap_list rom fb
configuration: depth=32 driver=i915 latency=0 mode=1920x1080 visual=truecolor xres=1920 yres=1080
resources: iomemory:400-3ff irq:140 memory:4000000000-4000ffffff memory:d0000000-dfffffff ioport:3000(size=64) memory:c0000-dffff

* package version(s): linux-hardened-5.3.13.a-1, linux-firmware-20191022.2b016af-3

* config and/or log files etc.:

# cat /proc/cmdline
pti=on page_alloc.shuffle=1 BOOT_IMAGE=/vmlinuz-linux-hardened root=UUID=8697eb87-c34f-4f5f-bbfa-bb738086dbee rw quiet apparmor=1 security=apparmor audit=1 intel_iommu=igfx_off i915.modeset=1 i915.enable_rc6=1 i915.enable_fbc=1 i915.enable_guc_loading=1 i915.enable_guc_submission=1 i915.enable_huc=1 i915.enable_psr=1 i915.disable_power_well=0 i915.semaphores=1

see gpu-error.txt for the contents of /sys/class/drm/card0/error

Closed by  freswa (frederik)
Thursday, 30 April 2020, 11:38 GMT
Reason for closing:  Fixed
Comment by Robert (fuero) - Wednesday, 04 December 2019, 09:53 GMT
In case it matters, I have 2 monitors attached to the box with Displayport to DVI adapters.
Comment by Michel Koss (MichelKoss1) - Wednesday, 04 December 2019, 15:44 GMT
You have really A LOT of custom i915 settings added in cmdline and it's very plausible your issues are related to those, not linux-hardened. It's better to report them upstream https://bugs.freedesktop.org instead.
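
To see at a glance which i915 options a command line sets, something like the following Python sketch may help. The function name `i915_options` is hypothetical, and the sample string is an excerpt of the command line posted in the report; on a running system the same text could be read from /proc/cmdline.

```python
def i915_options(cmdline):
    """Return {name: value} for every i915.* option on a kernel command line."""
    options = {}
    for token in cmdline.split():
        if token.startswith("i915."):
            # Split "i915.enable_psr=1" into name "enable_psr" and value "1".
            name, _, value = token[len("i915."):].partition("=")
            options[name] = value
    return options

if __name__ == "__main__":
    # Excerpt of the command line from the report, used as sample input.
    sample = (
        "rw quiet intel_iommu=igfx_off i915.modeset=1 i915.enable_rc6=1 "
        "i915.enable_fbc=1 i915.enable_psr=1 i915.disable_power_well=0"
    )
    for name, value in sorted(i915_options(sample).items()):
        print(f"i915.{name}={value}")
```

Removing the listed options one at a time is one way to check whether any of them contributes to the hang.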
Comment by loqs (loqs) - Wednesday, 04 December 2019, 23:38 GMT
Can you reproduce the issue on the non-hardened kernel?
Comment by Ivan (Nekroman) - Thursday, 05 December 2019, 14:01 GMT
I can confirm this error on 2 laptops with 2 monitors connected through dock. It happens without monitors as well but not that often. I am using classic linux kernel and tried without any i915 cmd.
Comment by loqs (loqs) - Thursday, 05 December 2019, 15:45 GMT
Comment by hexchain (hexchain) - Friday, 06 December 2019, 22:21 GMT
I filed this bug several days ago: https://gitlab.freedesktop.org/drm/intel/issues/674
Comment by loqs (loqs) - Wednesday, 11 December 2019, 21:46 GMT
https://gitlab.freedesktop.org/drm/intel/issues/673#note_359912
The archive contains the backport applied to 5.4.2; please test.
Comment by Jan Alexander Steffens (heftig) - Thursday, 12 December 2019, 11:56 GMT
That's going to break horribly because you ignored that all the Reg State Context IDs were changed in an earlier commit.

I'll wait for a proper backport to 5.4.
Comment by Matthias Lisin (matthias.lisin) - Thursday, 12 December 2019, 20:32 GMT
can reproduce with Dell XPS 13 (9350)
Comment by Rian Quinn (rianquinn) - Saturday, 14 December 2019, 21:50 GMT
I can confirm on a Dell XPS 15 7590.
Comment by Kieran Colford (kieranc) - Wednesday, 18 December 2019, 16:20 GMT
I can confirm on System76 Galago Pro (galp3)
Comment by loqs (loqs) - Wednesday, 18 December 2019, 17:04 GMT
Is the issue resolved by 5.4.4.arch1-1 currently in testing?
Comment by Matthias Lisin (matthias.lisin) - Wednesday, 18 December 2019, 19:27 GMT
Comment by loqs (loqs) - Wednesday, 18 December 2019, 19:38 GMT
@matthias.lisin you could ask Chris Wilson for a version for 5.4.
Comment by Matthias Lisin (matthias.lisin) - Wednesday, 18 December 2019, 19:39 GMT
@loqs, sure could. But -lts kernel works and I'm in no rush.
Comment by Juan Simón (j1simon) - Thursday, 19 December 2019, 21:22 GMT
Same problem with linux-zen:  FS#64895 
Comment by Christian Hesse (eworm) - Monday, 30 December 2019, 14:19 GMT
Chris Wilson just sent a backported patch to stable mailing list.
Comment by Michel Koss (MichelKoss1) - Monday, 30 December 2019, 15:01 GMT
Link to patch:

"https://lore.kernel.org/stable/20191230111530.3750048-1-chris@chris-wilson.co.uk/"
Comment by Juan Simón (j1simon) - Monday, 30 December 2019, 15:04 GMT
The issue title is wrong. Someone should remove "linux-hardened". This problem is common to all kernels.
I have installed the linux-mainline package from AUR to test this and it works well with version 5.5rc3 onwards.
Comment by loqs (loqs) - Tuesday, 31 December 2019, 19:04 GMT
Is the issue resolved by linux 5.4.7.arch1-1 currently in testing?
Comment by Michel Koss (MichelKoss1) - Wednesday, 01 January 2020, 14:16 GMT
Comment by Christian Hesse (eworm) - Wednesday, 01 January 2020, 14:18 GMT
The patch made it into 5.4.7.arch1; heftig added it.
I have not yet verified that it's fixed, though.
Comment by loqs (loqs) - Wednesday, 01 January 2020, 17:46 GMT
https://gitlab.freedesktop.org/drm/intel/issues/673#note_373802 reports issue is not fixed with 5.4.7.arch1-1
Comment by Søren Holm (sgh) - Tuesday, 14 January 2020, 09:41 GMT
are you running any virtual machines during this?
Comment by Laurențiu Nicola (lnicola) - Tuesday, 14 January 2020, 09:43 GMT
I see this less often since the patch, but it still happens. I'm not running any VMs. It usually hangs when I'm using VS Code or when I receive a notification in Gnome.
Comment by Daniel Bershatsky (daskol) - Tuesday, 14 January 2020, 09:47 GMT
The issue occurs for me as well. It usually happens when an external monitor is plugged in. The issue has persisted for a long time, and in particular with kernel 5.4.11.
Comment by Junnan Zhang (zhjn921224) - Wednesday, 15 January 2020, 15:19 GMT
Contrary to the previous comment, I usually have this issue when the external monitor is *unplugged*.
Comment by Alex Forencich (alex.forencich) - Friday, 17 January 2020, 00:51 GMT
I think I am getting the same crash. I think this has happened twice so far. Currently running 5.4.8-arch1-1. Attached is /sys/class/drm/card0/error.
Comment by Elias Haddad (eliasy) - Friday, 24 January 2020, 01:34 GMT
I can confirm the issue in Arch 5.4.13.

As can be seen on the related tasks, this issue appears to be general, and is reported here:
https://gitlab.freedesktop.org/drm/intel/issues/673#login-pane

At this time, apparently it is fixed in kernel 5.5, but not yet backported to versions 5.4.XX.
Comment by loqs (loqs) - Friday, 24 January 2020, 02:19 GMT
@eliasy the backport [1] was applied by arch from 5.4.7-arch1 [2] to 5.4.11-arch1 [3]
The issue continued to be reported as occurring [4] [5] [6] [7] [8] [9] [10]
The first upstream response I can find to a post that the backport does not work is [11]
but the issue can not be reproduced on drm-tip. The commits that add offline error capture
[12] do not apply cleanly to 5.4.Y. Possibly building the kernel with DRM_I915_PREEMPT_TIMEOUT=0
to disable the forced preemption might provide a trace.

[1] https://lore.kernel.org/stable/20191230111530.3750048-1-chris%40chris-wilson.co.uk/
[2] https://git.archlinux.org/linux.git/log/?h=v5.4.7-arch1
[3] https://git.archlinux.org/linux.git/log/?h=v5.4.11-arch1
[4] https://gitlab.freedesktop.org/drm/intel/issues/673#note_373802
[5] https://gitlab.freedesktop.org/drm/intel/issues/673#note_374650
[6] https://gitlab.freedesktop.org/drm/intel/issues/673#note_378360
[7] https://gitlab.freedesktop.org/drm/intel/issues/673#note_381214
[8] https://gitlab.freedesktop.org/drm/intel/issues/673#note_381568
[9] https://gitlab.freedesktop.org/drm/intel/issues/673#note_381639
[10] https://gitlab.freedesktop.org/drm/intel/issues/673#note_382044
[11] https://gitlab.freedesktop.org/drm/intel/issues/1003#note_391081
[12] https://cgit.freedesktop.org/drm-tip/commit/?id=672c368f9398042b629740cc9816e8e939eff2db
[12] https://cgit.freedesktop.org/drm-tip/commit/?id=32ff621fd74496f0c33644125fb69ff175859b1f
[12] https://cgit.freedesktop.org/drm-tip/commit/?id=748317386afb235e11616098d2c7772e49776b58
Comment by Stephan Munsch (forest_bear59) - Sunday, 26 January 2020, 19:36 GMT
I can confirm this for kernel 5.4.15-arch1-1. Hopefully 5.5 arrives soon ...
Comment by Jan Alexander Steffens (heftig) - Tuesday, 04 February 2020, 18:59 GMT
I am no longer following issue 673 due to noise. If there's a post-5.5.2 commit that fixes this (preferably in mainline), please link.
Comment by Matthias Lisin (matthias.lisin) - Tuesday, 04 February 2020, 19:47 GMT
still this I guess: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=v5.5.2&id=f26a9e959a7b1588c59f7a919b41b67175b211d8

Been using linux-zen 5.5 since testing, the issue does not occur anymore.
Comment by Lukas (luman) - Tuesday, 04 February 2020, 19:54 GMT
Tried zen two days ago and had to switch back to LTS because of lockups. :(
Comment by loqs (loqs) - Saturday, 08 February 2020, 01:22 GMT
Comment by Lukas (luman) - Saturday, 08 February 2020, 01:26 GMT
Can do, yes :)

This is a full kernel, right? Will all my dkms modules work with that or rather not?

Also, what do I need to report if there are any issues?

Comment by loqs (loqs) - Saturday, 08 February 2020, 01:58 GMT
Yes, it is a full kernel. The DKMS modules may work; the tree is 5.5-rc2 plus updates for the drm subsystem.
Yes, report whether the issue is still present when running that kernel. If it is, follow https://gitlab.freedesktop.org/drm/intel/wikis/How-to-file-file-i915-bugs
Comment by Lukas (luman) - Saturday, 08 February 2020, 03:01 GMT
Nice, thanks for the link. Installing right now. Will get back here after the weekend I suppose!

Comment by Lukas (luman) - Saturday, 08 February 2020, 14:28 GMT
Unfortunately, this kernel does not boot at all. I'll attach a "screenshot" of the error message.
Is there any way to get this log/errors in a proper way?

[le@y730]: ~>$ journalctl -k -b -1
Data from the specified boot (-1) is not available: No data available
   1.jpg (128.7 KiB)
Comment by loqs (loqs) - Saturday, 08 February 2020, 16:11 GMT
Try adding the boot options ignore_loglevel earlyprintk=efi to get more output
To see if the i915 module is the cause module_blacklist=i915
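
As a rough illustration of combining these options with an existing command line, here is a small Python sketch. The helper name `add_boot_options` and the base command line in the example are hypothetical; how the resulting string is handed to the bootloader depends on the setup and is not covered here.

```python
def add_boot_options(cmdline, extra):
    """Append each option in `extra` unless its key is already on the command line."""
    tokens = cmdline.split()
    # Options are compared by key, i.e. the part before "=" (or the whole flag).
    present = {token.partition("=")[0] for token in tokens}
    for option in extra:
        key = option.partition("=")[0]
        if key not in present:
            tokens.append(option)
            present.add(key)
    return " ".join(tokens)

if __name__ == "__main__":
    base = "root=/dev/sda1 rw quiet"  # hypothetical existing command line
    # First test: more verbose early boot output.
    print(add_boot_options(base, ["ignore_loglevel", "earlyprintk=efi"]))
    # Second test: keep the i915 module from loading.
    print(add_boot_options(base, ["module_blacklist=i915"]))
```

Running the two tests as separate boots keeps the extra log output apart from the effect of blacklisting the module.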
Comment by Lukas (luman) - Sunday, 09 February 2020, 03:12 GMT
I have the following findings (screenshots from failed boots attached):

Boot with i915 blacklisted fails quite early, on loading the initramfs. It just gets stuck there (no additional log output).

I have a TB3 dock from lenovo connected to my computer and the external screen is connected via HDMI on that dock. In this configuration I encounter the lockups as well as the inability to boot the drm-tip kernel.
As soon as this is disconnected (or the TB3 dock is connected w/o HDMI), the device boots. Connecting HDMI directly to the computer also works (with or without the TB3 dock additionally connected).

If I connect TB3 with HDMI attached after booting I have an instant freeze.


---> The only non-working configuration seems to be HDMI attached to TB3. This is also the configuration where I encountered random lockups before. (stable kernel)

I hope this helps. Let me know if I can provide more detailed information.
   2.jpg (371 KiB)
   3.jpg (539.6 KiB)
Comment by Lukas (luman) - Sunday, 09 February 2020, 03:13 GMT
Comment by loqs (loqs) - Sunday, 09 February 2020, 03:19 GMT
https://gitlab.freedesktop.org/drm/intel/-/wikis/How-to-file-i915-bugs

is the correct link. I copy-pasted it, so I am not sure how it got changed.

Edit:
Could be you encountered a more severe version of https://gitlab.freedesktop.org/drm/intel/issues/1141
Comment by Lukas (luman) - Monday, 10 February 2020, 12:33 GMT
To be honest, it is a bit confusing over there. It seems like the issues we are discussing are already fixed (tickets are closed).
Then, there are people creating tickets specifying they have the issue only in TB3 mode and they are being closed as a duplicate of "the big ticket"
My impression is that this is definitely not fixed and thus probably not a duplicate of '673'....

So what's the best thing to do now? I would like to help as well as get my problems fixed, but I also don't feel like wasting my own and anyone else's time by creating another duplicate ticket....
Comment by Lukas (luman) - Friday, 14 February 2020, 02:16 GMT
LTS is now 5.4.18-1-lts and I suffer the same problem there as well. :(

Seems like it's time for an upstream Ticket.
Comment by Laurențiu Nicola (lnicola) - Friday, 14 February 2020, 10:39 GMT
Comment by Gima (gima) - Saturday, 15 February 2020, 18:57 GMT
Linux sm 5.4.19-1-lts

X.Org X Server 1.20.6
[306118.822] (II) modesetting: Driver for Modesetting Kernel Drivers: kms
[306118.822] (II) modeset(0): using drv /dev/dri/card0

Feb 15 20:28:37 blep kernel: i915 0000:00:02.0: GPU HANG: ecode 9:1:0x00000000, hang on rcs0
Feb 15 20:28:37 blep kernel: GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
Feb 15 20:28:37 blep kernel: Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
Feb 15 20:28:37 blep kernel: drm/i915 developers can then reassign to the right component if it's not a kernel issue.
Feb 15 20:28:37 blep kernel: The GPU crash dump is required to analyze GPU hangs, so please always attach it.
Feb 15 20:28:37 blep kernel: GPU crash dump saved to /sys/class/drm/card0/error
Feb 15 20:28:37 blep kernel: i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
Feb 15 20:28:37 blep kernel: [drm:gen8_reset_engines [i915]] *ERROR* rcs0 reset request timed out: {request: 00000001, RESET_CTL: 00000001}
Feb 15 20:28:37 blep kernel: i915 0000:00:02.0: Resetting chip for hang on rcs0
Feb 15 20:28:37 blep kernel: [drm:gen8_reset_engines [i915]] *ERROR* rcs0 reset request timed out: {request: 00000001, RESET_CTL: 00000001}
Feb 15 20:28:37 blep kernel: [drm:gen8_reset_engines [i915]] *ERROR* rcs0 reset request timed out: {request: 00000001, RESET_CTL: 00000001}
Feb 15 20:28:44 blep kernel: i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
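
The excerpt above shows an engine reset, then reset-timeout errors, then a chip reset. A minimal Python sketch for counting these message kinds in a longer journal follows. The patterns and the name `summarize_resets` are assumptions based solely on the lines quoted here; the sketch only counts messages and does not interpret them.

```python
import re
from collections import Counter

# Message shapes assumed from the journal excerpt above.
PATTERNS = {
    "hang": re.compile(r"GPU HANG: ecode"),
    # "Resetting <engine> for hang on ..." but not "Resetting chip ...".
    "engine_reset": re.compile(r"Resetting (?!chip\b)\w+ for hang on \w+"),
    "chip_reset": re.compile(r"Resetting chip for hang on \w+"),
    "reset_timeout": re.compile(r"reset request timed out"),
}

def summarize_resets(log_text):
    """Count hang, engine-reset, chip-reset, and reset-timeout lines."""
    counts = Counter()
    for line in log_text.splitlines():
        for kind, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[kind] += 1
    return counts

if __name__ == "__main__":
    sample = "\n".join([
        "kernel: i915 0000:00:02.0: GPU HANG: ecode 9:1:0x00000000, hang on rcs0",
        "kernel: i915 0000:00:02.0: Resetting rcs0 for hang on rcs0",
        "kernel: [drm:gen8_reset_engines [i915]] *ERROR* rcs0 reset request timed out: "
        "{request: 00000001, RESET_CTL: 00000001}",
        "kernel: i915 0000:00:02.0: Resetting chip for hang on rcs0",
    ])
    print(dict(summarize_resets(sample)))
```

A non-zero `chip_reset` or `reset_timeout` count may be worth mentioning in an upstream report alongside the crash dump.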


Edit/Added information (2020-02-16):
Updated kernel on 2020-02-12: linux-lts (4.19.101-1 -> 5.4.18-1). Before updating, I had never experienced this kind of lockup. After updating the kernel, this lockup happened on 2020-02-15. I was watching video at the time, and audio kept playing, but everything on the screen froze. The power button did nothing. VT-switching did nothing. If this happens again, I will try adjusting the volume with the keyboard AND/OR Magic SysRq AND/OR blindly switching console, logging in, and powering off with the keyboard.

DP-1 connected primary 2560x1440+0+0 (normal left inverted right x axis y axis) 725mm x 428mm
2560x1440 59.95*+ 74.97
HDMI-2 connected 1920x1080+2560+0 (normal left inverted right x axis y axis) 531mm x 299mm
1920x1080 60.00*+ 50.00 59.94

CPU: i5-8400
Motherboard: ROG STRIX Z370-G GAMING, BIOS Version: 2401
Comment by Lukas (luman) - Saturday, 15 February 2020, 19:17 GMT
@gima
can you provide more information. external screen? docking station? etc?
Comment by Paul Kerry (paulkerry) - Saturday, 15 February 2020, 21:43 GMT
In case you aren't aware, it appears that there are several issues with 5.4 and 5.5 kernels regarding the i915 kernel module which can cause GPU "hang on rcs0" errors.

Reading the latest comments in https://bugs.archlinux.org/task/65392 some i915 patches have been pushed into the 5.5.4.arch1-1 linux package which is available in "testing" today (2020-02-15), so you could try that particular version and see if it works for you.

Other online sources show that these i915 patches cannot as yet be incorporated into earlier kernel releases, which is affecting linux-lts which is currently at version 5.4.19-1 so this effectively appears to be making the linux-lts 5.4 kernel releases useless if you are using any Intel graphics.

An alternative to the testing linux package is rolling back to the 4.19 series which was linux-lts until fairly recently - see https://wiki.archlinux.org/index.php/Arch_Linux_Archive if you don't have a locally saved copy of the last 4.19.* version.

Cheers
Paul.
Comment by Lukas (luman) - Saturday, 15 February 2020, 22:11 GMT
Hi Paul
thanks for this quite accurate summary. This will help people jumping in here at this point. Yes, the new LTS is very frustrating. However, the old LTS is quite old and unfortunately brings other problems for me.
I tested the drm-tip some days ago and it made things even worse. HDMI on my dock was 100% unusable.

Is there an 'easy' way to install the 5.5.4-arch1-1? The only package I can find is linux-pds, which is probably not what I want right?

Furthermore, I still see a big difference between using the internal HDMI port and the one on my TB3 dock, and I am quite unsure how related that is and what the best way to contribute here is.

Cheers
Lukas
Comment by Paul Kerry (paulkerry) - Saturday, 15 February 2020, 22:25 GMT
linux 5.5.4.arch1-1 is currently on...
https://www.archlinux.org/packages/testing/x86_64/linux/

and select "Download From Mirror" from the RHS box.

Good luck!
Comment by Lukas (luman) - Sunday, 16 February 2020, 02:19 GMT
Cool, thank you! Just booted it with HDMI connected to the TB3 dock. It definitely works better than drm-tip, but how stable it is we'll see in a couple of days...
Comment by Gima (gima) - Sunday, 16 February 2020, 11:06 GMT
@luman: Added information to my post.
@paulkerry: I installed "linux 5.5.4.arch1-1" and am running it now. Booted correctly and seems to run without problems. No errors in logs during boot. I'll report in a few days how things've been going. (Though I had the previous buggy kernel running for three days straight and only encountered the bug once, so..my input might be useless (delayed)).
Comment by Paul Kerry (paulkerry) - Sunday, 16 February 2020, 15:27 GMT
Just as a follow-up, linux 5.5.4.arch1-1 has moved from testing to core today (2020-02-16).
Comment by Lukas (luman) - Tuesday, 18 February 2020, 13:10 GMT
So far looks good here. Will those patches also be integrated into zen and lts?
Comment by loqs (loqs) - Tuesday, 18 February 2020, 13:43 GMT
@luman they are in 5.5.4-zen1 [1]. When linux-lts moved to 5.4 [2], it did not pick up any of the patches the linux package was applying [3];
some had been applied upstream, but not all of them.

[1] https://github.com/zen-kernel/zen-kernel/commits/v5.5.4-zen1
[2] https://git.archlinux.org/svntogit/packages.git/commit/trunk?h=packages/linux-lts&id=8903c370bc711fb61b65f6e3b870672fc32487f1
[3] https://git.archlinux.org/linux.git/log/?h=v5.4.15-arch1
Comment by Lukas (luman) - Tuesday, 18 February 2020, 13:52 GMT
Aaaaw, how unfortunate.

5.4, cannot HDMI

5.5 cannot Flutter
https://github.com/flutter/flutter/issues/49185

Thanks for the Links btw :)
Comment by red solja (redsolja) - Wednesday, 19 February 2020, 08:45 GMT
I am experiencing the same issue (with external monitor at HDMI) with linux-lts:

Linux hostfd 5.4.20-1-lts #1 SMP Sat, 15 Feb 2020 00:19:19 +0000 x86_64 GNU/Linux

Grepping CMDLINE in /etc/default/grub:
GRUB_CMDLINE_LINUX="cryptdevice=/dev/sda2:hostfdmain transparent_hugepage=never"

Using a Dell Latitude E5470
Comment by Paul Kerry (paulkerry) - Wednesday, 19 February 2020, 09:00 GMT
@redsolja - you must have missed the comments above about the lts kernel not being patched yet: see the comments above "Saturday, 15 February 2020, 21:43" and "Tuesday, 18 February 2020, 13:43".
Comment by red solja (redsolja) - Wednesday, 19 February 2020, 13:01 GMT
@Paulkerry - Yeah, I thought it might prove to be helpful to send relevant info about my setup as well, regarding the bug.
Comment by Gima (gima) - Thursday, 20 February 2020, 10:09 GMT
Reporting. Running "linux 5.5.4.arch1-1" for 4 days and haven't had any GPU problems.

Now waiting for the proper fixes (whatever they are) to be backported to linux-lts. There's a lot going on with some recent bug(fixe)s at the i915 issues list [1].
[1] https://gitlab.freedesktop.org/drm/intel/issues?scope=all&sort=updated_desc&state=all&utf8=%E2%9C%93
Comment by Paul Kerry (paulkerry) - Thursday, 20 February 2020, 13:18 GMT
For those still having issues with either current linux or current linux-lts and looking for a possible alternative: I notice the older 4.19.* kernel, which was linux-lts until recently, is now available in AUR at https://aur.archlinux.org/packages/linux-lts419/ - current version as of writing is 4.19.105-1, which matches the upstream version.
Looking at the Arch Linux Archive, the last 4.19 build as linux-lts was linux-lts-4.19.101-2-x86_64.pkg.tar.zst

Cheers
Paul.
Comment by Lukas (luman) - Thursday, 20 February 2020, 14:12 GMT
@gima can confirm what you are saying. Seems pretty stable now.
@paul cool, that's helpful. thanks for the notification
