FS#29276 - [linux] 3.3.2 Recent kernels cause hard system freeze

Attached to Project: Arch Linux
Opened by cfr (cfr42) - Wednesday, 04 April 2012, 14:34 GMT
Last edited by Gaetan Bisson (vesath) - Monday, 15 October 2012, 04:47 GMT
Task Type: Bug Report
Category: Kernel
Status: Closed
Assigned To: Tobias Powalowski (tpowa), Thomas Bächler (brain0)
Architecture: All
Severity: Critical
Priority: Normal
Reported Version:
Due in Version: Undecided
Due Date: Undecided
Percent Complete: 100%
Votes: 13
Private: No

Details

Description:

Recent kernels (3.2.1[23], for example) trigger apparently random hard system freezes that force a reboot by holding down the power button. (I cannot switch to a tty, kill X, get a response from function keys, etc.) Note that this laptop has no LEDs on the keyboard, so I cannot tell whether, for example, Caps Lock still has any effect. (Note: the key combo to kill X has been re-enabled via KDE settings as instructed in the wiki, so it should work.)

The current LTS kernels did not seem to exhibit the same problem. However, I have just experienced one similar lockup with the most recent LTS kernel. What is most frustrating is that nothing gets logged. The only common factor seems to be that cron runs shortly beforehand and complains that sendmail is not available. However, it does this all the time and does not usually cause a lockup, so I doubt that's related.

The system is completely up to date (except for one or two things which have appeared as I've been filing this.) pacman -Syu was run a few hours ago.

Additional info:

* package version(s)
linux 3.2.13-1 (and previous recent kernels) but not usually/previously the lts line
3.0.26-1-lts (single lockup)
xf86-video-intel 2.18.0-1
intel-dri 8.0.2-1
xorg-server 1.12.0.901-1

* config and/or log files etc.
kernel.log - this is an excerpt showing two boots. The first uses 3.2.13-1 and the second the current LTS kernel. This is the only log file I can find which has anything even near the time of the freeze. There is nothing particular in errors.log (except a standard complaint about not finding sendmail), Xorg.0.log.old, acpid.log etc. I would however be happy to post anything which might be useful.

System specs:
Arch Linux | x86_64 | GPT | EFI boot
Lenovo x121e | Intel(R) Core(TM) i3-2367M CPU @ 1.40GHz GenuineIntel | Intel Centrino Wireless-N 1000 | US keyboard with Euro | 320G HDD

I'm not sure if this is related to other bugs. I couldn't find one which looked similar on closer inspection (e.g. similar symptoms but messages in logs I don't see or freezes but only with ipv6 which is disabled on my machine.)

I've included versions for the video drivers because I'm also seeing graphics corruption. This affects the LTS line as well but the problem is worse with the 3.2.+ kernels.

Steps to reproduce:

Boot a recent kernel. Work for a little bit - sorry, there's nothing in particular and no particular timeframe - and wait for everything to freeze. I had thought firefox was particularly likely to trigger this but tonight I was just working on LaTeX source in Kile. After I saved, that was it. I didn't see any graphics corruption though I'm seeing a bit now on the LTS kernel. But the freeze was just as hard.

Closed by Gaetan Bisson (vesath)
Monday, 15 October 2012, 04:47 GMT
Reason for closing: Fixed
Comment by Jung (prokrypt) - Saturday, 07 April 2012, 00:47 GMT
Happened 3 times to me too. The first time was when I was halfway through adding thousands of books to Calibre, the second time was when I was viewing security camera footage via sshfs+mplayer, and the third time I wasn't doing anything at all (only Firefox was running/idling) but it still froze overnight. Keyboard LEDs don't respond, and the magic SysRq keys don't work.

Right before the last crash, I kept a terminal open with a tail of the system log (I keep my /var/log on a tmpfs, derp), but unfortunately I forgot to turn off the screensaver, and switched on the monitor to find a blank, frozen screen.

I have a wireless card installed with modules loaded, but it is unused.
   dmesg.txt (56.6 KiB)
Comment by cfr (cfr42) - Saturday, 07 April 2012, 01:08 GMT
I just realised that I seem not to have attached the log I mentioned. Or was it deleted as of no use?
Comment by cfr (cfr42) - Saturday, 07 April 2012, 01:32 GMT
My current uptime is over 4 hours with the current kernel! The issue is very random i.e. sometimes it freezes after hours, sometimes after minutes. However, very recent non-lts kernels had been much closer to minutes than hours.

Is there something different about the latest kernel (3.2.14-1) which might have fixed the issue? Or is it just my lucky day?

Another possibility is that the issue is now being masked by increased RAM. I installed new RAM today (8G rather than 2G) and I'm wondering a bit if that might be related. Before I uninstalled the 2G stick, I rechecked it with memtest (even though it has only been a few weeks since I last did this - I wanted to check it in the other slot) and no problems were found. So if it is related to the change of RAM, it isn't because there's anything wrong with the 2G stick.
Comment by cfr (cfr42) - Monday, 09 April 2012, 21:04 GMT
Sadly, my hopes have been dashed. I just had another hard lockup with the latest kernel (3.3.1-1). I tried all the usual things - switching to another tty, killing X etc. and none worked. I also tried putting the laptop to sleep using both lid-close and the sleep button. (In the past, I've sometimes been able to tell later that the kernel was still responding by log messages even though sleep failed.) Nothing happened on sleep button. On lid close, the screen goes black but the machine does not sleep. On lid open, the screen comes back from black to its previous frozen state. I also tried removing the usb mouse. This rather bizarrely resulted in the system sound the machine makes when entering sleep.

As a precaution, I had enabled the sysreq key and I even managed to find the right scrap of paper. I am not quite clear from the wiki description what I should see but I pressed ctrl + alt + prtsc + l. It had no apparent effect. I fiddled some more and tried the combination in several different ways (pressing keys in sequence cumulatively, letting go before pressing l...) Nothing. I also tried following with 'reisub' with pauses between each letter. Nothing.

Pressed and held power button, booted into recovery mode. Machine complained that no resume device was set. (Not sure why - I don't set one for recovery mode and never have before.) fsck did the usual stuff to try to clean up filesystem corruption from the crash (lots of issues on my /home partition.)

I then took a look in the logs. I don't know if what I found is useful. I guess the sysreq key probably did something. I excerpted the relevant parts from kernel.log and messages.log and will attach them. Please note that they may be a bit repetitive because I tried the key combination several times, thinking I should get some tangible feedback. Please let me know if they are useful at all.

When I continued booting, I got an error or warning referring to hardware which I think has to do with sound (possibly the microphone) and saying it was handling it generically. I don't think this error is new. I can't find it in the logs, though, and it isn't displayed for long enough to read properly. After logging into KDE, KDE power management warned me that my battery has 0% capacity even though it certainly does not. The system does not seem to be registering or reading my battery state correctly, though:

cat /sys/devices/LNXSYSTM\:00/device\:00/PNP0C0A\:00/power_supply/BAT1/energy_*
160000
62160000
56680000

The first figure is energy_full, the second energy_full_design and the third energy_now. So, according to these figures, I currently have many times as much energy as the battery is capable of holding.
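
To put numbers on that inconsistency, here is a minimal Python sketch using the three values quoted above. The variable names are illustrative, the unit (µWh) is assumed from the usual sysfs power_supply convention, and the guess that a capacity figure is derived from energy_full / energy_full_design is only an assumption about how a power manager might compute it.

```python
# Values copied from the sysfs output above, in glob order:
# energy_full, energy_full_design, energy_now (assumed to be in µWh).
energy_full = 160_000
energy_full_design = 62_160_000
energy_now = 56_680_000

# On a sane battery, energy_now should not exceed energy_full.
ratio_to_full = energy_now / energy_full
ratio_to_design = energy_now / energy_full_design

# Assumption: a "capacity" figure could be energy_full relative to design.
capacity_vs_design = energy_full / energy_full_design

print(f"energy_now / energy_full         = {ratio_to_full:.2f}")
print(f"energy_now / energy_full_design  = {ratio_to_design:.3f}")
print(f"energy_full / energy_full_design = {capacity_vs_design:.4f}")

if energy_now > energy_full:
    print("energy_full is smaller than energy_now, so it is probably misreported")
```

If a capacity figure really is computed that way, a value well under 1% would round down to the 0% shown in the warning, while energy_now against the design figure looks plausible.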

I don't usually have issues with this. I'm just mentioning everything weird in case any of it is useful in the hope that *something* might be.
Comment by Jung (prokrypt) - Monday, 09 April 2012, 21:17 GMT
In order to be able to use sysrq, you need to enable it:
sysctl kernel.sysrq=1
OR set "kernel.sysrq = 1" in /etc/sysctl.conf (and reboot)
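
For anyone checking what a given kernel.sysrq value actually permits, here is a minimal Python sketch that decodes it. The bit meanings are taken from the kernel's SysRq documentation as I understand it; the function names describe_sysrq and read_sysrq are made up for this example.

```python
# Bit values for /proc/sys/kernel/sysrq, per the kernel SysRq documentation.
# 0 disables SysRq entirely, 1 enables everything, larger values are a bitmask.
SYSRQ_BITS = {
    2: "control of console logging level",
    4: "control of keyboard (SAK, unraw)",
    8: "debugging dumps of processes etc.",
    16: "sync command",
    32: "remount read-only",
    64: "signalling of processes (term, kill, oom-kill)",
    128: "reboot/poweroff",
    256: "nicing of all RT tasks",
}


def describe_sysrq(value):
    """Return the SysRq function groups enabled by a kernel.sysrq value."""
    if value == 0:
        return []
    if value == 1:
        return list(SYSRQ_BITS.values())
    return [name for bit, name in SYSRQ_BITS.items() if value & bit]


def read_sysrq(path="/proc/sys/kernel/sysrq"):
    """Read the current setting from procfs (Linux only)."""
    with open(path) as f:
        return int(f.read().strip())
```

On a running system, describe_sysrq(read_sysrq()) lists what is currently allowed. With the value 1 suggested above, every group is enabled.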

Anyways, as for me, I've been getting no more crashes [so far] ever since I removed a cheapo pci express sata raid controller from my computer yesterday--or perhaps I just haven't waited long enough. We'll see...
Comment by cfr (cfr42) - Monday, 09 April 2012, 22:23 GMT
Thanks. As noted, I had already enabled the sysreq key (both immediately and for subsequent boots). I take it that the fact that the relevant bits of the log files start with:

Apr 9 21:24:18 localhost kernel: [35077.533535] SysRq : Show backtrace of all active CPUs

means that I did this successfully. I also assume that it means that the key combination did achieve something - there is certainly a lot of stuff following this invocation in the logs.
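
A minimal Python sketch of how such entries could be picked out of a log, assuming the syslog-style line format quoted above. The pattern and the function name find_sysrq_events are illustrative only.

```python
import re

# Matches kernel log lines like the one quoted above and captures the
# kernel timestamp (seconds since boot) and the SysRq action text.
SYSRQ_LINE = re.compile(
    r"kernel: \[\s*(?P<uptime>\d+\.\d+)\] SysRq : (?P<action>.+)$"
)


def find_sysrq_events(lines):
    """Yield (uptime_seconds, action) for each SysRq line in an iterable of lines."""
    for line in lines:
        match = SYSRQ_LINE.search(line.rstrip("\n"))
        if match:
            yield float(match.group("uptime")), match.group("action")


# The sample is the log line quoted in this comment.
sample = [
    "Apr 9 21:24:18 localhost kernel: [35077.533535] SysRq : Show backtrace of all active CPUs",
]
events = list(find_sysrq_events(sample))
print(events)
```

Passing an open log file instead of the sample list (for example kernel.log, wherever it lives on the system) would list every SysRq invocation the kernel recorded, which shows whether the key presses registered during a freeze.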

It did not, however, allow me to boot cleanly and I had no sign that the key combo was having any effect while I was actually using it. I can only tell I even used it from the evidence in the logs.

It is pretty much definitely not hardware in my case unless it is something Lenovo issues in all laptops of the sort I have, which seems unlikely as Linux didn't mind them until a few kernels ago.

There is something it doesn't like about the sound hardware now and I've finally found it in the logs. It's in boot but not dmesg:

Wed Feb 22 17:41:40 2012: Found hardware: "HDA-Intel" "Intel CougarPoint HDMI" "HDA:14f1506e,17aa21ed,00100002 HDA:80862805,80860101,00100000" "0x17aa" "0x21ed"
Wed Feb 22 17:41:40 2012: Hardware is initialized using a generic method

I don't know why this gets printed to the terminal during boot. Is it getting output to the console rather than the log for some reason? I can't find a log entry later than this one from February although I now see what looks to be a very similar message every boot and shutdown. (At least it starts the same and the bit about initialisation is the same.) But it doesn't look like something which should be causing these sorts of lockups. Especially since I don't use HDMI. So I don't think it has any relevance to this bug.

Incidentally, a second reboot solved the battery issue. Maybe there is something not quite right about booting via recovery mode (using the fallback image etc.)? Or maybe it just needs a second reboot just to calm down and sort itself out properly?
Comment by Brendan MacDonell (bremac) - Sunday, 22 April 2012, 21:41 GMT
I've been experiencing this issue since 3.1 (or 3.2?) as well. I've attached the dmesg output from my system boot for reference, in case there's any hardware commonality. I'll update if I can get a list of locks and a trace when it happens again.
Comment by seby (Nekos) - Tuesday, 24 April 2012, 16:14 GMT
I have the same issue, with nothing logged. When I log in with ssh, I see that X uses 100% CPU.

xorg-server 1.12.1-1
linux 3.3.2-1
xf86-video-intel 2.18.0-3
intel-dri 8.0.2-1
Comment by Christian Sturm (Athaba) - Tuesday, 01 May 2012, 13:38 GMT
It's the same on Aspire ONE.

Last /var/log/messages before the freeze:

May 1 14:57:18 localhost kernel: [ 328.003163] fuse init (API version 7.18)
May 1 14:57:20 localhost dbus[586]: [system] Activating service name='org.freedesktop.UDisks2' (using servicehelper)
May 1 14:57:21 localhost dbus[586]: [system] Activating service name='org.freedesktop.UPower' (using servicehelper)
May 1 14:57:21 localhost udisksd[998]: udisks daemon version 1.94.0 starting
May 1 14:57:21 localhost dbus[586]: [system] Successfully activated service 'org.freedesktop.UDisks2'
May 1 14:57:21 localhost udisksd[998]: Acquired the name org.freedesktop.UDisks2 on the system message bus
May 1 14:57:23 localhost dbus[586]: [system] Successfully activated service 'org.freedesktop.UPower'
   dmesg (45.2 KiB)
Comment by Jimmie Butler (jimmiebtlr) - Sunday, 20 May 2012, 15:24 GMT
If it's of any use, running Arch without X seems to stop the crashing. I ran mine overnight from the command line to do some backups, because otherwise I wasn't able to complete any; it would crash before they finished. The backups used a USB drive and wireless heavily, which suggests that neither of those is crashing Arch here.

EDIT: It just crashed without X running, the only thing going on was a dd from one partition of an internal sata drive to the same internal sata drive.
Comment by Greg (dolby) - Monday, 01 October 2012, 21:02 GMT
Still a problem? Did anyone report to kernel bugzilla?
Comment by Brendan MacDonell (bremac) - Tuesday, 02 October 2012, 01:59 GMT
I haven't seen this issue in the past two months.
Comment by Greg (dolby) - Tuesday, 02 October 2012, 12:31 GMT
Can anyone else confirm this is fixed?
Comment by Jimmie Butler (jimmiebtlr) - Tuesday, 02 October 2012, 16:52 GMT
Sorry for the delay, my issue turned out to be hardware related. I'm not sure if it was bad hardware or a conflict of some sort, but removing a wifi card fixed it on my computer. It would still be nice if some logging were able to document such occurrences.
