FS#55629 - [linux] Intel i915 driver issue in kernel 4.13 requiring restart.
Attached to Project:
Arch Linux
Opened by John Bennett (Lindows) - Thursday, 14 September 2017, 03:10 GMT
Last edited by Doug Newgard (Scimmia) - Sunday, 08 October 2017, 23:25 GMT
Details
On boot, the error message "CPU pipe A FIFO underrun" appears, due to an
issue in the Intel i915 driver. When the X server starts, the
entire screen freezes and the machine locks up. Switching
virtual terminals does not work and the machine has to be
powered off.
This seems to happen on the 4.13.x series of kernels. I have not seen this bug in the 4.12.x series or the 4.9 series kernels.
Thinkpad T410
Architecture: x86_64
Model name: Intel(R) Core(TM) i5-540 M CPU @ 2.53GHz
Graphics: Intel Ironlake Mobile
Steps to reproduce:
-Boot machine and wait for error message which will be displayed as part of the dmesg on boot.
-Start X server and wait 30-40 seconds. Laptop will freeze and require a restart.
The following error message is displayed in dmesg and journal:
kernel:[drm:intel_cpu_fifo_underrun_irq_handler[i915]]*ERROR* CPU pipe A FIFO underrun.
I have reverted back to kernel 4.9 LTS to avoid this problem.
Closed by Doug Newgard (Scimmia)
Sunday, 08 October 2017, 23:25 GMT
Reason for closing: Fixed
Additional comments about closing: linux 4.13.5-1
One line from dmesg is not very useful out of context; see https://01.org/linuxgraphics/documentation/how-report-bugs
The GUI I am using is Cinnamon if that is of any help
Sep 15 08:18:48 Marvin kernel: rtc_cmos: probe of 00:01 failed with error -16
Sep 15 08:18:48 Marvin kernel: pci 0000:00:02.0: Video device with shadowed ROM at [mem 0x000c0000-0x000dffff]
Sep 15 08:18:48 Marvin kernel: pci 0000:00:14.0: can't derive routing for PCI INT A
Sep 15 08:18:48 Marvin kernel: pci 0000:00:14.0: PCI INT A: not connected
I have a Broadwell processor with HD Graphics 5500 and an Optimus setup with an Nvidia 840M, so I got all of those graphics driver problems xD
Problem is caused by this line:
CONFIG_INTEL_IOMMU_DEFAULT_ON=y
in config.x86_64. Until 4.13 kernels it was:
# CONFIG_INTEL_IOMMU_DEFAULT_ON is not set
At the moment only the linux-hardened package still uses this old config.
I don't know why IOMMU is enabled by default now. It always caused me trouble.
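To tell which of the two settings a given kernel was built with, the config text can be scanned for the option. This is a minimal sketch, assuming the config is available as plain text (for example the package's config.x86_64, or /proc/config.gz if the running kernel exposes it); the function name `iommu_default_state` is made up for illustration.

```python
def iommu_default_state(config_text):
    """Return 'on', 'off' or 'unknown' for CONFIG_INTEL_IOMMU_DEFAULT_ON."""
    option = "CONFIG_INTEL_IOMMU_DEFAULT_ON"
    for line in config_text.splitlines():
        line = line.strip()
        if line == option + "=y":
            return "on"    # IOMMU is enabled unless turned off on the command line
        if line == "# " + option + " is not set":
            return "off"   # IOMMU stays off unless intel_iommu=on is passed
    return "unknown"       # option not mentioned in this config at all


if __name__ == "__main__":
    import gzip
    # Assumption: /proc/config.gz only exists if the kernel was built to expose it.
    with gzip.open("/proc/config.gz", "rt") as fh:
        print(iommu_default_state(fh.read()))
```

A result of "on" corresponds to the 4.13 packaging described above, "off" to the earlier config.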
completely does not show the symptoms. Been running Xorg for 30minutes and Wayland
for 2hours. And there was no FIFO underrun error in dmesg when booting right when
the KMS switch happens during systemd-init. Video playback and everything else
and no error dmesg so far.
or finding and linking the relevant upstream bug report first. But if IOMMU is
generally unstable on Intel then I guess it can be closed with a plan to disable
IOMMU in 4.12.3-2, although I find that hard to believe.
Possibly because the kernel versions mentioned may not point at the same regression
but the errors do and at least one recent comment saw this happening with 4.13
and not earlier.
enabled and it makes sense others have been seeing this on different
kernels before.
Found upstream bug https://bugs.freedesktop.org/show_bug.cgi?id=100219
I boot a kernel with IOMMU enabled. Worth a test.
is fixed by disabling IOMMU it could be related.
but it's on AMD: https://bugs.archlinux.org/task/53609
no change in BIOS and a custom kernel that completely disables IOMMU
in the device drivers section. Can you test that?
error in dmesg. I'd say 4.13.2 without IOMMU is stable on my machine.
to https://bugs.freedesktop.org/show_bug.cgi?id=100219 so upstream can confirm it is the same issue.
Update: it finally worked, it restarted and I was finally able to log in again :)
even though I think there have been sync object patches in 4.13.
[33898.274495] drm/i915: Resetting chip after gpu hang
[33900.194189] asynchronous wait on fence i915:[global]:13057f timed out
I thought many people use VT-d to dedicate a GPU or NIC to a VM and it's not
some experimental/buggy feature, if the mainboard+BIOS is fine.
I mean I don't mind editing the wiki, but I'm surprised.
[drm:intel_pipe_update_end [i915]] *ERROR* Atomic update failure on pipe A (start=46836 end=46837) time 447 us, min 763, max 767, scanline start 762, end 779
[10327.815484] [drm:intel_pipe_update_end [i915]] *ERROR* Atomic update failure on pipe A (start=48452 end=48453) time 385 us, min 763, max 767, scanline start 755, end 768
[10432.817354] [drm:intel_pipe_update_end [i915]] *ERROR* Atomic update failure on pipe A (start=54737 end=54738) time 481 us, min 763, max 767, scanline start 760, end 777
[10671.321949] [drm:intel_pipe_update_end [i915]] *ERROR* Atomic update failure on pipe A (start=69013 end=69014) time 369 us, min 763, max 767, scanline start 758, end 771
[11120.815604] [drm:intel_pipe_update_end [i915]] *ERROR* Atomic update failure on pipe A (start=95918 end=95919) time 452 us, min 763, max 767, scanline start 750, end 767
[11141.816256] [drm:intel_pipe_update_end [i915]] *ERROR* Atomic update failure on pipe A (start=97175 end=97176) time 181 us, min 763, max 767, scanline start 759, end 768
[11224.814967] [drm:intel_pipe_update_end [i915]] *ERROR* Atomic update failure on pipe A (start=102143 end=102144) time 201 us, min 763, max 767, scanline start 762, end 771
[11245.815758] [drm:intel_pipe_update_end [i915]] *ERROR* Atomic update failure on pipe A (start=103400 end=103401) time 202 us, min 763, max 767, scanline start 762, end 768
[11398.815188] [drm:intel_pipe_update_end [i915]] *ERROR* Atomic update failure on pipe A (start=112558 end=112559) time 457 us, min 763, max 767, scanline start 754, end 767
[11474.813774] [drm:intel_pipe_update_end [i915]] *ERROR* Atomic update failure on pipe A (start=117107 end=117108) time 281 us, min 763, max 767, scanline start 755, end 764
[12562.921421] [drm:intel_pipe_update_end [i915]] *ERROR* Atomic update failure on pipe A (start=182237 end=182238) time 340 us, min 763, max 767, scanline start 756, end 768
[13262.815423] [drm:intel_pipe_update_end [i915]] *ERROR* Atomic update failure on pipe A (start=224130 end=224131) time 421 us, min 763, max 767, scanline start 754, end 769
[13318.815706] [drm:intel_pipe_update_end [i915]] *ERROR* Atomic update failure on pipe A (start=227482 end=227483) time 207 us, min 763, max 767, scanline start 761, end 771
[13534.816570] [drm:intel_pipe_update_end [i915]] *ERROR* Atomic update failure on pipe A (start=240411 end=240412) time 204 us, min 763, max 767, scanline start 759, end 768
[13735.932096] [drm:intel_pipe_update_end [i915]] *ERROR* Atomic update failure on pipe A (start=252449 end=252450) time 396 us, min 763, max 767, scanline start 759, end 773
[14264.815898] [drm:intel_pipe_update_end [i915]] *ERROR* Atomic update failure on pipe A (start=284106 end=284107) time 349 us, min 763, max 767, scanline start 758, end 770
I'm on a Sony Vaio VPCEG series with an Intel Core i3-2310M and 2nd gen Intel integrated graphics, using the Cinnamon desktop.
If I kill and stop using Firefox, there are no GPU hangs after the first one happened,
or until I reboot.
https://bugs.freedesktop.org/show_bug.cgi?id=101237
https://bugs.freedesktop.org/show_bug.cgi?id=99720
Is there a kconfig option to disable sync objects? I didn't find it.
for more than a day, including GPU use, and none of the issues, not even the GPU
hangs happened with that. I'm thinking 4.13 DRM is in a bad state right now.
Looking through Greg's 4.13 stable queue, there's no DRM fixes so far but a long
list of XFS patches.
Also https://bugs.archlinux.org/task/55629#comment161179 is still outstanding. Is everyone assuming it is the same issue?
more frequently with 4.12 and 4.13. Even a completely IOMMU free
4.13.4 had occasional GPU freezes and I can confirm that I was using
VAAPI for a prolonged time while Firefox's GPU use triggered it
reliably. Quitting Firefox made it disappear but Firefox is just a
user of Mesa and DRM and can't be blamed. I think it's a combination
of Mesa and 4.12 or 4.13 DRM that provokes the bug.
You say you didn't see it without IOMMU, but I'm certain that IOMMU
helped increase chances of the bug and you're now merely less likely
to hit it.
The three bugzilla entries I posted above are all about this and it
seems the issue has only become more prominent with 4.12 and 4.13.
Xorg and a VAAPI client. No Firefox necessary. Back to 4.9-lts for now because
4.12 is EOL.
I think this is very likely related to FS#55744.
run into any GPU errors yet after two hours of concurrent VAAPI use and heavy CPU
utilization. Seems that merely disabling IOMMU in the kernel config isn't as effective
as disabling it in BIOS.
@sonix07 you might know this but to be safe: VT-d is only needed for
KVM if you want to share your physical devices with a VM. The VMM
only needs VT-x. In /proc/cpuinfo it's the vmx flag.
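The vmx check mentioned above can be sketched as follows. This is an illustrative helper, not part of any tool: `has_cpu_flag` is a made-up name, and it only assumes the usual `flags : ...` line layout of /proc/cpuinfo on x86.

```python
def has_cpu_flag(cpuinfo_text, flag="vmx"):
    """Return True if any 'flags' line in /proc/cpuinfo text lists the flag."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        # Only look at the per-CPU "flags" lines, and compare whole words.
        if sep and key.strip() == "flags" and flag in value.split():
            return True
    return False


if __name__ == "__main__":
    with open("/proc/cpuinfo") as fh:
        print("VT-x (vmx) present:", has_cpu_flag(fh.read()))
```

Note that this only covers VT-x. VT-d is a chipset/firmware feature and does not show up as a CPU flag.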
heavy GPU and CPU utilization, including VAAPI.
I booted custom 4.13.4 vanilla (kconfig disabled IOMMU completely) and it didn't
take an hour before using VAAPI and browsers like Firefox and Chrome caused
GPU errors.
It seems that disabling IOMMU in the kernel isn't a good idea, but keeping
it on and having VT-d disabled in the BIOS works. This is naturally just
a stupid workaround because disabling IOMMU in the kernel should not
cause problems, especially when IOMMU isn't available (BIOS switch).
4.13 and current drm-tip are in pretty bad shape.
Firefox:
[drm] GPU HANG: ecode 6:0:0x80202f7b, in Compositor [2620], reason: Hang on rcs0, action: reset
[drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
drm/i915: Resetting chip after gpu hang
Chrome:
drm/i915: Resetting chip after gpu hang
asynchronous wait on fence i915:[global]:a4255 timed out
drm/i915: Resetting chip after gpu hang
As I said above, applications use Mesa and Xorg through the APIs they provide,
so this is a bug in the graphics stack. If Chrome or Firefox or mpv or
ffmpeg (both when using VAAPI) did something wrong, the API would
return an error, avoiding GPU hangs. If you can cause a GPU hang, then
this is a local DoS, locking up the desktop for seconds.
contains no remedies.
Has anyone used VT-d on Intel Sandybridge or newer with zero driver issues?
I never had a need and ask myself if this is a new string of regressions
or whether it has always been a lottery.
Have you opened separate reports for each of your issues?
have been useful to move this ticket forward. The bugs are those linked
in this ticket. Sorry I can't be more involved with the debugging process.
I understand why you might assume it's more than one issue, and it might be
multiple bugs in combination causing problems, but they all are related
to VT-d somehow and as a user of the graphics stack it's all the same bug,
if we exclude the FIFO underrun which is fixed by disabling VT-d. Which then
leaves us with 4.13+ being more likely to hang the GPU than previous kernels.
What is interesting is that, like I found, if you disable IOMMU in the
kernel and VT-d in the BIOS, then that kernel will still provoke hangs,
while a kernel with IOMMU activated but VT-d disabled does not. I find
that the most interesting result so far.
The issue has 21 votes, so I assume 21 affected individuals, but no comment is linked to an upstream report from an Arch user.
Perhaps closing this bug report as an upstream issue would encourage reporting upstream instead.
My computer freezes randomly after the update to Linux 4.13.3.
Dell Latitude E5550
Intel(R) Core(TM) i5-5300U CPU @ 2.30GHz
oct. 02 18:12:27 pierre-dell-latitude kernel: DMAR: DRHD: handling fault status reg 3
oct. 02 18:12:27 pierre-dell-latitude kernel: DMAR: [DMA Write] Request device [00:02.0] fault addr 19e000 [fault reason 23] Unknown
oct. 02 18:12:35 pierre-dell-latitude kernel: [drm] GPU HANG: ecode 8:0:0x85dffffb, in Xwayland [749], reason: Hang on rcs0, action: reset
oct. 02 18:12:35 pierre-dell-latitude kernel: [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
oct. 02 18:12:35 pierre-dell-latitude kernel: [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
oct. 02 18:12:35 pierre-dell-latitude kernel: [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
oct. 02 18:12:35 pierre-dell-latitude kernel: [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
oct. 02 18:12:35 pierre-dell-latitude kernel: [drm] GPU crash dump saved to /sys/class/drm/card0/error
oct. 02 18:12:35 pierre-dell-latitude kernel: drm/i915: Resetting chip after gpu hang
oct. 02 18:12:46 pierre-dell-latitude sudo[18258]: pam_unix(sudo:session): session closed for user root
oct. 02 18:13:17 pierre-dell-latitude kernel: DMAR: DRHD: handling fault status reg 3
oct. 02 18:13:17 pierre-dell-latitude kernel: DMAR: [DMA Write] Request device [00:02.0] fault addr 47a7000 [fault reason 23] Unknown
oct. 02 18:13:25 pierre-dell-latitude kernel: drm/i915: Resetting chip after gpu hang
oct. 02 18:13:33 pierre-dell-latitude kernel: drm/i915: Resetting chip after gpu hang
oct. 02 18:13:36 pierre-dell-latitude kernel: asynchronous wait on fence i915:gnome-shell[725]/1:493a timed out
oct. 02 18:13:41 pierre-dell-latitude kernel: drm/i915: Resetting chip after gpu hang
oct. 02 18:13:44 pierre-dell-latitude kernel: asynchronous wait on fence i915:gnome-shell[725]/1:493b timed out
oct. 02 18:13:49 pierre-dell-latitude kernel: drm/i915: Resetting chip after gpu hang
Then my computer freezes.
I've attached the crash dump
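When comparing reports like the one above, it can help to count which kinds of i915/DMAR messages appear in a log. The following is a rough sketch: the category names and the function `count_i915_events` are invented for illustration, and the patterns are taken only from the messages quoted in this ticket, so they may not match other kernel versions.

```python
import re
from collections import Counter

# Patterns derived from the log lines quoted in this ticket.
PATTERNS = {
    "dmar_fault": re.compile(r"DMAR: \[DMA (?:Read|Write)\] Request device"),
    "gpu_hang": re.compile(r"\[drm\] GPU HANG:"),
    "chip_reset": re.compile(r"drm/i915: Resetting chip after gpu hang"),
    "fence_timeout": re.compile(r"asynchronous wait on fence .* timed out"),
    "fifo_underrun": re.compile(r"FIFO underrun"),
    "atomic_update_failure": re.compile(r"Atomic update failure on pipe"),
}


def count_i915_events(log_text):
    """Count how many log lines fall into each category above."""
    counts = Counter()
    for line in log_text.splitlines():
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts


if __name__ == "__main__":
    import sys
    # Usage assumption: kernel log piped in, e.g. the output of `journalctl -k`.
    for name, n in sorted(count_i915_events(sys.stdin.read()).items()):
        print(name, n)
```

A log that shows DMAR faults directly before GPU hangs, as in the excerpt above, points at the IOMMU; hangs without any DMAR lines would suggest a different trigger.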
However, I'm a little bit confused by this line in the report:
-Start X server and wait 30-40 seconds. Laptop will freeze and require a restart.
This sounds like you get a graphical output for a short while. For me, it locks up immediately when I try to start X (unless, curiously, I had wayland running beforehand).
@c: Regarding the Firefox issues when just iommu is disabled and not VT-d: That might be even more hardware specific. I have Firefox (nightly) running almost all the time when my laptop is turned on but I haven't had any problems before the 4.13 upgrade, or after turning off iommu.
https://bugs.freedesktop.org/show_bug.cgi?id=103076
The response from upstream was to disable iommu:
DMAR and death is nothing new, see bug 89360. Standard practice is to disable iommu, with intel_iommu=igfx_off.
Running with intel_iommu=igfx_off solves the problem for me. Without the option I get an almost immediate lockup in X. With the option, my laptop runs normally.
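After adding the option to the bootloader configuration, it is worth confirming that the running kernel actually received it. A small sketch, assuming the command line is read from /proc/cmdline; `has_igfx_off` is a hypothetical helper name, and it only looks for the igfx_off option, not for other intel_iommu= settings.

```python
def has_igfx_off(cmdline):
    """Return True if intel_iommu=... on the kernel command line includes igfx_off."""
    for token in cmdline.split():
        if token.startswith("intel_iommu="):
            # The value may hold several comma-separated options, e.g. "on,igfx_off".
            if "igfx_off" in token.split("=", 1)[1].split(","):
                return True
    return False


if __name__ == "__main__":
    with open("/proc/cmdline") as fh:
        print("igfx_off active:", has_igfx_off(fh.read()))
```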
Downgrading to 4.12 helps.
GPU errors discussed above with BIOS-IOMMU=off and intel_iommu=igfx_off on 4.13.5, which
validates my claim that IOMMU only makes it easier to trigger and there are bigger
bugs in 4.13.5.
An anecdote on my experience with intel-drm over the last two years:
Ever since atomic modesetting started in 4.2, the DRM stack has gotten more
regressive, which is funny since before I never thought about intel-drm at all.
It all used to work, no errors, no tearing (started with Sandybridge and
solved only with native Wayland or xf86-video-intel ddx in TearFree mode;
no, generic modesetting driver and glamor for that matter isn't tear free yet).
4.13 is wild with GPU hangs, fence timeouts and atomic-ms crashes :-).
One of the 4.13.5 GPU hangs today credited systemd-login, which I think means
it was mpv owned by logind, which owns the Xorg session. Something other than the
usual Firefox or Chrome compositor.
4.14 (maybe even 4.9?) will be extended-lts (4+ years) releases, by the way.
This should fix the issue for most affected systems.
@c as this does not resolve your issue, please report the issue upstream.
So this FS can be closed, as what remains would no longer be a packaging and integration issue.