FS#24397 - [kernel26] softlockup with kernel 2.6.39

Attached to Project: Arch Linux
Opened by Hussam Al-Tayeb (hussam) - Monday, 23 May 2011, 02:15 GMT
Last edited by Tobias Powalowski (tpowa) - Thursday, 16 February 2012, 17:57 GMT
Task Type Bug Report
Category Upstream Bugs
Status Closed
Assigned To Tobias Powalowski (tpowa)
Thomas Bächler (brain0)
Architecture All
Severity Critical
Priority Normal
Reported Version
Due in Version Undecided
Due Date Undecided
Percent Complete 100%
Votes 0
Private No

Details

After upgrading to kernel 2.6.39, I started having soft lockups due to disk activity. Anything more than low disk activity would cause a problem in an application.
dmesg would spit out something like:
[ 1920.307498] INFO: task java:25665 blocked for more than 120 seconds.
[ 1920.307499] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1920.307500] java D f036df98 0 25665 25393 0x00000000
[ 1920.307503] f036dee0 00000086 c15899c8 f036df98 e1c09590 00000001 00cbde28 00000170
[ 1920.307507] f036de98 00000064 f036de60 f036de60 f036de68 f036de68 e1c09590 c14e1440
[ 1920.307511] 081c6000 c14e1440 f5506440 e1c09590 e1c08450 00000000 ffffffff c15899c8
[ 1920.307515] Call Trace:
[ 1920.307518] [<c1073c8d>] ? get_futex_key+0x6d/0x1d0
[ 1920.307520] [<c10742c5>] ? futex_wake+0xe5/0x100
[ 1920.307522] [<c132fd65>] rwsem_down_failed_common+0x95/0xe0
[ 1920.307525] [<c1027640>] ? vmalloc_sync_all+0x120/0x120
[ 1920.307527] [<c132fde2>] rwsem_down_read_failed+0x12/0x14
[ 1920.307529] [<c132fe1f>] call_rwsem_down_read_failed+0x7/0xc
[ 1920.307531] [<c132f69d>] ? down_read+0xd/0x10
[ 1920.307534] [<c1027787>] do_page_fault+0x147/0x420
[ 1920.307536] [<c10760e4>] ? sys_futex+0xc4/0x130
[ 1920.307538] [<c1027640>] ? vmalloc_sync_all+0x120/0x120
[ 1920.307540] [<c1330c4b>] error_code+0x67/0x6c
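
For anyone triaging a similar log, the following is a minimal sketch of how the hung-task lines could be pulled out of dmesg output. The function name hung_tasks is hypothetical, and the pattern assumes the message format is exactly as in the trace above ("INFO: task NAME:PID blocked for more than N seconds."); other kernel versions may word it differently.

```shell
# hung_tasks: read kernel log text on stdin and print "NAME PID SECONDS"
# for every hung-task warning found. Hypothetical helper; the pattern is
# an assumption based on the message format shown in this report.
hung_tasks() {
    sed -n 's/.*INFO: task \([^:]*\):\([0-9]*\) blocked for more than \([0-9]*\) seconds.*/\1 \2 \3/p'
}
```

It would typically be fed from the kernel log, for example with "dmesg | hung_tasks", so that only the affected task names and PIDs remain.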

One application (let's call it A) would then become unable to read from or write to the disk. Other running applications could still read from and write to the disk fine.
I could even copy application A's data to another folder or delete it.
This isn't a hard lockup and I could still continue to use the computer, but it would then hang at shutdown.
At first I thought the disk (which I bought 12 days ago) was bad, so I ran badblocks -vs and didn't find a single bad block. I ran a smartctl long test and the disk is fine. It started to feel like some ext4 regression.

I downgraded to kernel 2.6.38.6 and performed a disk-intensive action, which was recompiling LibreOffice. This worked without a problem.
I also tried application A, which was giving problems earlier, but I couldn't reproduce the problem. So I compiled LibreOffice again to check and didn't have any lockups.

Closed by  Tobias Powalowski (tpowa)
Thursday, 16 February 2012, 17:57 GMT
Reason for closing:  Upstream
Comment by Tom Gundersen (tomegun) - Monday, 23 May 2011, 08:03 GMT
This is almost certainly an upstream issue, so should probably be reported at: <https://bugzilla.kernel.org/>.
Comment by Hussam Al-Tayeb (hussam) - Monday, 23 May 2011, 08:15 GMT
OK, I reported an upstream bug: https://bugzilla.kernel.org/show_bug.cgi?id=35662

In the meantime, is it possible to have an update in core to 2.6.38.7 while 2.6.39 is still in testing?

Comment by Jens Adam (byte) - Tuesday, 24 May 2011, 10:42 GMT
I had those "hung_task" messages throughout at least the whole 2.6.38 release series.
Mostly while dd'ing disk images onto USB sticks, md5summing CD-RWs or similar.
The first hint was always Firefox being completely frozen.
But when the long-running task was completed, all the mouse and keyboard input I had made in the meantime got fed into Firefox and everything was back to normal.
Comment by Hussam Al-Tayeb (hussam) - Tuesday, 24 May 2011, 16:41 GMT
Andrew Morton seems to suggest it is because of LUKS encryption in my case.
Comment by sergio (asgarth) - Wednesday, 06 July 2011, 16:53 GMT
Same problem here, without using encryption on any partition. The problem appears only after intensive CPU or disk usage, usually 4 or more hours after system startup.
Comment by robert r (crobe) - Saturday, 09 July 2011, 10:40 GMT
I also experienced this and I'm using LUKS with XFS.
Running the "sync" command made disk writes continue for a while, so as a short-term fix I made something like "while true; do sync; done", which is not the best fix :)
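
A slightly gentler variant of that loop might look like the sketch below, which pauses between sync calls instead of spinning continuously. The function name periodic_sync, the interval argument and the optional iteration count are all hypothetical additions for illustration; this is still only a stopgap, not a fix for the underlying kernel problem.

```shell
# periodic_sync INTERVAL [COUNT]
# Call sync every INTERVAL seconds. With COUNT omitted or 0 the loop runs
# until interrupted; otherwise it stops after COUNT iterations and prints
# how many times sync was called. Hypothetical helper, not from the report.
periodic_sync() {
    interval="$1"
    count="${2:-0}"
    i=0
    while [ "$count" -eq 0 ] || [ "$i" -lt "$count" ]; do
        sync                # flush dirty pages to disk
        i=$((i + 1))
        sleep "$interval"   # avoid busy-looping between flushes
    done
    echo "$i"
}
```

For example, "periodic_sync 5" would flush every five seconds until interrupted with Ctrl-C.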
