FS#29276 - [linux] 3.3.2 Recent kernels cause hard system freeze
Attached to Project:
Arch Linux
Opened by cfr (cfr42) - Wednesday, 04 April 2012, 14:34 GMT
Last edited by Gaetan Bisson (vesath) - Monday, 15 October 2012, 04:47 GMT
Details
Description:
Recent kernels (3.2.1[23], for example) trigger apparently random hard system freezes forcing reboot by holding down power button. (Cannot switch to tty, kill X, get response from function keys etc.) Note that this laptop has no leds on keyboard so I cannot tell if, for example, capslock still has any effect or not. (Note: the key combo to kill X is reenabled via KDE settings as instructed in the wiki so it should work.)

The current LTS kernels did not seem to exhibit the same problem. However, I have just experienced one similar lockup with the most recent LTS kernel.

What is most frustrating is that I'm getting nothing logged. The only common factor seems to be that cron runs shortly before and complains that sendmail is not available. However, it does this all the time and does not usually cause a lock up so I can't think that's related.

The system is completely up to date (except for one or two things which have appeared as I've been filing this.) pacman -Syu was run a few hours ago.

Additional info:
* package version(s)
linux 3.2.13-1 (and previous recent kernels) but not usually/previously the lts line 3.0.26-1-lts (single lockup)
xf86-video-intel 2.18.0-1
intel-dri 8.0.2-1
xorg-server 1.12.0.901-1
* config and/or log files etc.
kernel.log - this is an excerpt showing two boots. The first uses 3.2.13-1 and the second the current LTS kernel. This is the only log file I can find which has anything even near the time of the freeze. There is nothing particular in errors.log (except a standard complaint about not finding sendmail), Xorg.0.log.old, acpid.log etc. I would however be happy to post anything which might be useful.

System specs:
Arch Linux | x86_64 | GPT | EFI boot
Lenovo x121e | Intel(R) Core(TM) i3-2367M CPU @ 1.40GHz GenuineIntel | Intel Centrino Wireless-N 1000 | US keyboard with Euro | 320G HDD

I'm not sure if this is related to other bugs. I couldn't find one which looked similar on closer inspection (e.g. similar symptoms but messages in logs I don't see or freezes but only with ipv6 which is disabled on my machine.)

I've included versions for the video drivers because I'm also seeing graphics corruption. This affects the LTS line as well but the problem is worse with the 3.2.+ kernels.

Steps to reproduce:
Boot a recent kernel. Work for a little bit - sorry, there's nothing in particular and no particular timeframe - and wait for everything to freeze. I had thought firefox was particularly likely to trigger this but tonight I was just working on LaTeX source in Kile. After I saved, that was it. I didn't see any graphics corruption though I'm seeing a bit now on the LTS kernel. But the freeze was just as hard.
Right before the last crash, I kept a terminal open with a tail of the system log (I keep my /var/log on a tmpfs, derp), but unfortunately I forgot to turn off the screensaver, and when I switched the monitor back on I found a blank, frozen screen.
I have a wireless card installed with modules loaded, but it is unused.
Is there something different about the latest kernel (3.2.14-1) which might have fixed the issue? Or is it just my lucky day?
Another possibility is that the issue is now being masked by increased RAM. I installed new RAM today (8G rather than 2G) and I'm wondering a bit if that might be related. Before I uninstalled the 2G stick, I rechecked it with memtest (even though it has only been a few weeks since I last did this - I wanted to check it in the other slot) and no problems were found. So if it is related to the change of RAM, it isn't because there's anything wrong with the 2G stick.
As a precaution, I had enabled the SysRq key and I even managed to find the right scrap of paper. I am not quite clear from the wiki description what I should see but I pressed ctrl + alt + prtsc + l. It had no apparent effect. I fiddled some more and tried the combination in several different ways (pressing keys in sequence cumulatively, letting go before pressing l...) Nothing. I also tried following with 'reisub' with pauses between each letter. Nothing.
Pressed and held power button, booted into recovery mode. Machine complained that no resume device was set. (Not sure why - I don't set one for recovery mode and never have before.) fsck did the usual stuff to try to clean up filesystem corruption from the crash (lots of issues on my /home partition.)
I then took a look in the logs. I don't know if what I found is useful. I guess the SysRq key probably did something. I excerpted the relevant parts from kernel.log and messages.log and will attach them. Please note that they may be a bit repetitive because I tried several times with the key combination thinking I should get some tangible feedback. Please let me know if they are useful at all.
When I continued booting, I got an error or warning referring to hardware which I think has to do with sound (possibly the microphone) and saying it was handling it generically. I don't think this error is new. I can't find it in the logs, though, and it isn't displayed for long enough to read properly. After logging into KDE, KDE power management warned me that my battery has 0% capacity even though it certainly does not. The system does seem to be misreading my battery state, though:
cat /sys/devices/LNXSYSTM\:00/device\:00/PNP0C0A\:00/power_supply/BAT1/energy_*
160000
62160000
56680000
The first figure is energy_full, the second energy_full_design and the third energy_now. So I currently have many times as much power as the battery is capable of holding.
I don't usually have issues with this. I'm just mentioning everything weird in case any of it is useful in the hope that *something* might be.
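For anyone wanting to check the arithmetic on the battery readings above, here is a minimal sketch. The function name `battery_report` and the plausibility rule (energy_now should not exceed energy_full) are assumptions made for illustration, not anything taken from the kernel or from KDE.

```python
def battery_report(energy_full, energy_full_design, energy_now):
    """Summarise sysfs energy_* readings (hypothetical helper).

    Returns (percent_of_full, percent_of_design, plausible).
    """
    if energy_full <= 0 or energy_full_design <= 0:
        raise ValueError("capacity readings must be positive")
    # Charge relative to what the battery currently claims it can hold.
    percent_of_full = 100.0 * energy_now / energy_full
    # Charge relative to the design capacity, for comparison.
    percent_of_design = 100.0 * energy_now / energy_full_design
    # Assumed rule: a sane reading never holds more than energy_full.
    plausible = 0 <= energy_now <= energy_full
    return percent_of_full, percent_of_design, plausible


# Values copied from the cat output above:
# energy_full, energy_full_design, energy_now.
full_pct, design_pct, ok = battery_report(160000, 62160000, 56680000)
print(f"{full_pct:.0f}% of energy_full, {design_pct:.1f}% of design, plausible={ok}")
```

With the figures above, the charge comes out at several hundred times energy_full but below energy_full_design, which suggests that energy_full is the reading that looks off.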
messages.log.2012-04-09 (150.7 KiB)
sysctl kernel.sysrq=1
OR set "kernel.sysrq = 1" in /etc/sysctl.conf (and reboot)
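To confirm the setting actually took effect, the current value can be read back from /proc/sys/kernel/sysrq. The sketch below is only an illustration; `sysrq_state` and `read_sysrq` are hypothetical helper names. It assumes the usual meaning of the value: 0 is disabled, 1 enables all functions, and anything larger is a bitmask of allowed functions.

```python
def sysrq_state(raw):
    """Interpret the text content of /proc/sys/kernel/sysrq (hypothetical helper)."""
    value = int(raw.strip())
    if value == 0:
        return "disabled"
    if value == 1:
        return "all functions enabled"
    # Any other value is a bitmask selecting individual SysRq functions.
    return f"bitmask {value:#x}: only selected functions enabled"


def read_sysrq(path="/proc/sys/kernel/sysrq"):
    """Read and interpret the live setting (Linux only)."""
    with open(path) as f:
        return sysrq_state(f.read())


# Interpreting literal values, so this runs anywhere:
print(sysrq_state("1\n"))
print(sysrq_state("16"))
```

On the affected machine, calling `read_sysrq()` after a reboot shows whether the /etc/sysctl.conf entry was picked up.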
Anyways, as for me, I've been getting no more crashes [so far] ever since I removed a cheapo pci express sata raid controller from my computer yesterday--or perhaps I just haven't waited long enough. We'll see...
Apr 9 21:24:18 localhost kernel: [35077.533535] SysRq : Show backtrace of all active CPUs
means that I did this successfully. I also assume that it means that the key combination did achieve something - there is certainly a lot of stuff following this invocation in the logs.
It did not, however, allow me to boot cleanly and I had no sign that the key combo was having any effect while I was actually using it. I can only tell I even used it from the evidence in the logs.
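Since the only evidence of the key combo is in the logs, a small script can pull the SysRq entries out of kernel.log after the fact. This is a rough sketch: the regular expression is an assumption based solely on the single log line quoted above, and `find_sysrq_events` is a hypothetical name.

```python
import re

# Pattern modelled on the one quoted log line; other kernels may format this differently.
SYSRQ_LINE = re.compile(
    r"kernel: \[\s*(?P<uptime>\d+\.\d+)\] SysRq : (?P<action>.+)$"
)


def find_sysrq_events(lines):
    """Return (uptime_seconds, action) for every SysRq line found."""
    events = []
    for line in lines:
        match = SYSRQ_LINE.search(line)
        if match:
            events.append((float(match.group("uptime")), match.group("action").strip()))
    return events


sample = [
    "Apr 9 21:24:18 localhost kernel: [35077.533535] SysRq : Show backtrace of all active CPUs",
    "May 1 14:57:18 localhost kernel: [ 328.003163] fuse init (API version 7.18)",
]
for uptime, action in find_sysrq_events(sample):
    print(f"{uptime:.3f}s: {action}")
```

Passing an open log file instead of `sample` works the same way, since the function just iterates over lines.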
It is pretty much definitely not hardware in my case unless it is something Lenovo issues in all laptops of the sort I have, which seems unlikely as Linux didn't mind them until a few kernels ago.
There is something it doesn't like about the sound hardware now and I've finally found it in the logs. It's in boot but not dmesg:
Wed Feb 22 17:41:40 2012: Found hardware: "HDA-Intel" "Intel CougarPoint HDMI" "HDA:14f1506e,17aa21ed,00100002 HDA:80862805,80860101,00100000" "0x17aa" "0x21ed"
Wed Feb 22 17:41:40 2012: Hardware is initialized using a generic method
I don't know why this gets printed to the terminal during boot. Is it getting output to the console rather than the log for some reason? I can't find a log entry later than this one from February although I now see what looks to be a very similar message every boot and shutdown. (At least it starts the same and the bit about initialisation is the same.) But it doesn't look like something which should be causing these sorts of lockups. Especially since I don't use HDMI. So I don't think it has any relevance to this bug.
Incidentally, a second reboot solved the battery issue. Maybe there is something not quite right about booting via recovery mode (using the fallback image etc.)? Or maybe it just needs a second reboot just to calm down and sort itself out properly?
xorg-server 1.12.1-1
linux 3.3.2-1
xf86-video-intel 2.18.0-3
intel-dri 8.0.2-1
Last /var/log/messages before the freeze:
May 1 14:57:18 localhost kernel: [ 328.003163] fuse init (API version 7.18)
May 1 14:57:20 localhost dbus[586]: [system] Activating service name='org.freedesktop.UDisks2' (using servicehelper)
May 1 14:57:21 localhost dbus[586]: [system] Activating service name='org.freedesktop.UPower' (using servicehelper)
May 1 14:57:21 localhost udisksd[998]: udisks daemon version 1.94.0 starting
May 1 14:57:21 localhost dbus[586]: [system] Successfully activated service 'org.freedesktop.UDisks2'
May 1 14:57:21 localhost udisksd[998]: Acquired the name org.freedesktop.UDisks2 on the system message bus
May 1 14:57:23 localhost dbus[586]: [system] Successfully activated service 'org.freedesktop.UPower'
EDIT: It just crashed without X running. The only thing going on was a dd from one partition of an internal SATA drive to another partition on the same drive.