FS#42353 - [linux] 3.17.x Lockups

Attached to Project: Arch Linux
Opened by jason ryan (jasonwryan) - Monday, 13 October 2014, 06:13 GMT
Last edited by Tobias Powalowski (tpowa) - Monday, 17 November 2014, 07:35 GMT
Task Type Bug Report
Category Kernel
Status Closed
Assigned To Tobias Powalowski (tpowa)
Architecture x86_64
Severity Critical
Priority Normal
Reported Version
Due in Version Undecided
Due Date Undecided
Percent Complete 100%
Votes 5
Private No

Details

Description: After installing a 3.17 kernel (either 3.17-1 or 3.17-2, or compiling 3.17 myself), my machine locks up completely anywhere between 3 and 15 minutes after booting. The screen freezes (with no degradation) and the machine accepts no input from keyboard or touchpad. Trying to SSH in fails with "no route to host". The only remedy is a hard shutdown.

The journal ends with (there are no other error messages leading up to the final line):
Oct 13 18:53:03 Shiv kernel: BUG: unable to handle kernel NULL pointer dereference at 000000000000001c

This happens with (3.17-2) and without (3.17-1) the microcode update.

3.16-4 continues to work without issue.


Additional info:
* Intel(R) Core(TM) i5-3317U CPU @ 1.70GHz



Steps to reproduce: Boot into 3.17.x and start working...

Closed by  Tobias Powalowski (tpowa)
Monday, 17 November 2014, 07:35 GMT
Reason for closing:  Fixed
Comment by jason ryan (jasonwryan) - Thursday, 16 October 2014, 00:24 GMT
Rebuilt 3.17-1 with CONFIG_PREEMPT_VOLUNTARY=y and, while the machine still locked up, it was more gradual. Nothing printed to the journal.

Just tried with 3.17.1-ARCH and the same lockup as initially described: nothing printed to the journal.

What else can I do to try and debug this?
Comment by patrick (potomac) - Thursday, 16 October 2014, 02:13 GMT
The best approach is to do a git bisect to find the faulty commit, though it can take a while to find it:

http://git-scm.com/book/en/Git-Tools-Debugging-with-Git

Also check that it's not a hardware problem (bad memory modules; you can use memtest).
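
For anyone unfamiliar with why bisecting is slow but still tractable, here is a minimal sketch of the search it performs. This is not git itself: the commit names, the history length and the `is_bad` check are all hypothetical, and in practice each check means building and booting a kernel and waiting to see whether it locks up.

```python
from typing import Callable, Sequence, Tuple


def first_bad(commits: Sequence[str],
              is_bad: Callable[[str], bool]) -> Tuple[str, int]:
    """Return the first bad commit and the number of commits tested.

    Assumes commits[0] is known good, commits[-1] is known bad, and
    every commit after the first bad one is also bad.
    """
    lo, hi = 0, len(commits) - 1  # lo: known good, hi: known bad
    tests = 0
    while hi - lo > 1:
        mid = (lo + hi) // 2
        tests += 1  # one build-and-boot cycle in real life
        if is_bad(commits[mid]):
            hi = mid  # regression is at mid or earlier
        else:
            lo = mid  # regression is after mid
    return commits[hi], tests


# Hypothetical history of 1000 commits with the regression at index 637.
history = [f"c{i}" for i in range(1000)]
culprit, builds = first_bad(history, lambda c: int(c[1:]) >= 637)
print(culprit, builds)
```

Because the range halves at every step, the number of kernels to build grows only with the logarithm of the number of commits between the good and bad versions.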
Comment by jason ryan (jasonwryan) - Thursday, 16 October 2014, 02:20 GMT
Thanks patrick. I'll check the memory but as it is comfortably running 3.16.4, I suspect it is a bug in 3.17.

I'll start investigating the bisect approach.
Comment by Clemens Koller (ckoller) - Thursday, 23 October 2014, 23:32 GMT
Let me just XREF to #42505 https://bugs.archlinux.org/task/42505?project=1
might be related - 3.17.1-1 crashes while booting, 3.16.4-* works.
Comment by Chris Jones (cmjones) - Friday, 24 October 2014, 01:33 GMT
I have the same issue (kernel version 3.17.1-1). It only seems to trigger when I navigate to certain websites, rather than being time-based. Another odd thing not mentioned originally is that only input seems to freeze for me (the clock on my terminal still updates).

I've attached some printout from journalctl leading up to the crash, since I had some strange ata2 stuff going on leading up to the "BUG: unable to handle kernel NULL pointer dereference" line. Hope any of this information is helpful.
Comment by jason ryan (jasonwryan) - Friday, 24 October 2014, 02:57 GMT
Just happened on my non-[testing] laptop, with (slightly) more in the journal:

Oct 24 15:28:38 Veles kernel: BUG: unable to handle kernel NULL pointer dereference at 000000000000001c
Oct 24 15:28:38 Veles kernel: IP: [<ffffffff811a3339>] mmu_notifier_unregister+0x19/0xe0
Oct 24 15:28:38 Veles kernel: PGD afe20067 PUD b1716067 PMD 0
Oct 24 15:28:38 Veles kernel: Oops: 0000 [#1] PREEMPT SMP
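
As a rough aid for reading the IP line above, here is a small sketch that pulls out the faulting symbol and offsets. The regular expression and the `parse_ip` helper are my own assumptions, written only against the line format shown in this journal excerpt; other kernel versions may format the line differently.

```python
import re

# Matches "IP: [<address>] symbol+0xoffset/0xsize" as printed in the oops above.
IP_RE = re.compile(
    r"IP: \[<(?P<addr>[0-9a-f]+)>\] "
    r"(?P<sym>[\w.]+)\+(?P<off>0x[0-9a-f]+)/(?P<size>0x[0-9a-f]+)"
)


def parse_ip(line):
    """Return address, symbol, offset and function size, or None if no match."""
    m = IP_RE.search(line)
    if m is None:
        return None
    return {
        "address": int(m.group("addr"), 16),
        "symbol": m.group("sym"),
        "offset": int(m.group("off"), 16),  # bytes into the function
        "size": int(m.group("size"), 16),   # total length of the function
    }


line = ("Oct 24 15:28:38 Veles kernel: IP: [<ffffffff811a3339>] "
        "mmu_notifier_unregister+0x19/0xe0")
info = parse_ip(line)
# The function start is the faulting address minus the offset.
function_start = info["address"] - info["offset"]
print(info["symbol"], hex(info["offset"]), hex(function_start))
```

Read this way, the fault happened 0x19 bytes into mmu_notifier_unregister, a function 0xe0 bytes long, which is the detail worth quoting when searching for or reporting the bug upstream.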
Comment by P.H. (Vain) - Friday, 24 October 2014, 17:11 GMT
I can reproduce this on two different machines (big workstation with an i7-3770 and a small netbook from 2011 with an Intel Atom N455 -- both Intel GPUs) by opening Google Maps in lariza (a WebKit browser), 3.17.1, x86_64.

Got a little more info in my journal; hope it is useful.

I *hope* that I can find the time to do some bisecting this weekend ... can't promise anything, though.
   oops.txt (9.8 KiB)
Comment by P.H. (Vain) - Friday, 24 October 2014, 17:49 GMT
If I understand correctly, this has already been fixed and the fix is supposed to be included in 3.18:

https://freedesktop.org/patch/34166/
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=e9681366ea9e76ab8f75e84351f2f3ca63ee542c
Comment by patrick (potomac) - Friday, 24 October 2014, 19:04 GMT
You can try backporting this patch to kernel 3.17 to see if it works.

The patch looks simple; it touches just one file: /drivers/gpu/drm/i915/i915_gem_userptr.c

Comment by jason ryan (jasonwryan) - Friday, 24 October 2014, 20:13 GMT
Patch applied and currently enjoying 30+ minutes of unfrozen uptime...
Comment by Peter Weber (hoschi) - Tuesday, 28 October 2014, 16:50 GMT
Is upstream aware of the necessity to apply this patch to the current stable series, i.e. kernel 3.17.2?
I can't see this patch in Kroah's queue as of now.
Comment by jason ryan (jasonwryan) - Thursday, 30 October 2014, 19:54 GMT
Comment by Peter Weber (hoschi) - Friday, 31 October 2014, 08:54 GMT
Thanks for the information. Is upstream already informed?
Comment by jason ryan (jasonwryan) - Friday, 31 October 2014, 09:03 GMT
I don't know how it works; I just (naively) assumed that the patch sent to the freedesktop ML would be referred to whoever looks after that part of the kernel...
Comment by Peter Weber (hoschi) - Friday, 31 October 2014, 09:31 GMT
I've written an email to Greg Kroah-Hartman and referred to the patch and this bug.

I've been suffering random hard freezes myself (at least of STDIN/STDOUT) on my laptop since 3.17-rc4. My logs look different, maybe
because of my custom kernel configuration. I currently don't know my PREEMPT setting and don't have access to my laptop, and
reproducing a random bug isn't that easy (I can work for hours before a freeze happens, or just half an hour).

// edit
Greg's mailbot slapped me (of course) and told me to send it to the usual mailing lists for this. Done.
Comment by Peter Weber (hoschi) - Tuesday, 11 November 2014, 10:34 GMT
Greg (Thanks!) has added the patch to the queue:
https://git.kernel.org/cgit/linux/kernel/git/stable/stable-queue.git/tree/queue-3.17/drm-i915-do-not-store-the-error-pointer-for-a-failed-userptr-registration.patch

We should see 3.17.3 soon; shall we just wait instead of patching your PKGBUILD?
I've already patched the kernel on my machine myself and it seems to be stable now.
Comment by jason ryan (jasonwryan) - Tuesday, 11 November 2014, 21:37 GMT
Nice one Peter: thank you.

I think it makes sense to wait for 3.17.3...
Comment by Mario (diraimondo) - Saturday, 15 November 2014, 16:04 GMT
kernel 3.17.3 is here: https://www.kernel.org/pub/linux/kernel/v3.x/ChangeLog-3.17.3

The patch is included! Waiting for Arch packaging.
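
For anyone wanting to check a machine quickly, here is a minimal sketch that tests whether a kernel release string falls in the range reported in this task (3.17 up to, but not including, 3.17.3). The helper names are hypothetical, the range is taken only from the reports in this thread, and a kernel with the patch applied by hand will still be flagged.

```python
import re


def kernel_version(release):
    """Parse a release such as '3.17.1-1-ARCH' into a (major, minor, patch) tuple."""
    m = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", release)
    if m is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    # A missing patch level (e.g. '3.17-2') is treated as 0.
    return (int(m.group(1)), int(m.group(2)), int(m.group(3) or 0))


def in_reported_range(release):
    """True if the release lies in the range reported as locking up in this task."""
    return (3, 17, 0) <= kernel_version(release) < (3, 17, 3)


for release in ("3.16.4-1-ARCH", "3.17.1-1-ARCH", "3.17.3-1-ARCH"):
    print(release, in_reported_range(release))
```

To check the running kernel, pass in the output of `platform.release()` from the standard library.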
