FS#18644 - pacman -Sy sometimes freezes mid-sync in uninterruptible sleep

Attached to Project: Pacman
Opened by Isaac Dupree (idupree) - Thursday, 11 March 2010, 21:01 GMT
Last edited by Dan McGee (toofishes) - Tuesday, 15 February 2011, 23:05 GMT
Task Type Bug Report
Category General
Status Closed
Assigned To No-one
Architecture x86_64
Severity Medium
Priority Normal
Reported Version 3.3.3
Due in Version Undecided
Due Date Undecided
Percent Complete 100%
Votes 0
Private No

Details

Summary and Info:

`sudo pacman -Sy` sometimes freezes mid-update in state 'D' -- "uninterruptible sleep", according to 'top'. It's using 0% CPU and very little RAM. But it can't be killed even by kill -9, and it prevents system suspend-to-ram from succeeding... it seems I have to shut down my computer in order to kill it, but then booting, removing db.lck, and running `sudo pacman -Syy` fixes things.

Steps to Reproduce:

Rarely. Don't know how.

This happens after my system has been running for a while, with some swap space in use. I upgraded my system a few days ago and haven't shut down since. This time, the only upgrades that look even possibly related to pacman are:
upgraded openssl (0.9.8m-1 -> 0.9.8m-2)
upgraded readline (6.1.001-1 -> 6.1.002-1)
upgraded shadow (4.1.4.2-1 -> 4.1.4.2-2)
upgraded sudo (1.7.2p4-1 -> 1.7.2p5-1)

Any debugging advice? (Especially while my system with this dead-pacman is still running? This issue happened to me once before, also. Unfortunately GDB isn't installed at the moment.)
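
One way to look at a stuck process without GDB is to read its /proc entries directly. The following is only a minimal sketch, assuming a Linux /proc filesystem; the helper names (parse_stat_state, read_optional, describe_process) are made up for illustration and are not part of pacman. Reading /proc/<pid>/stack normally needs root and a kernel built with stack-trace support, so the sketch treats that file as optional.

```python
from pathlib import Path


def parse_stat_state(stat_text):
    """Return the one-letter process state from the text of /proc/<pid>/stat.

    The command name is wrapped in parentheses and may itself contain spaces
    or parentheses, so split on the LAST ')' rather than on whitespace.
    """
    after_comm = stat_text.rsplit(")", 1)[1]
    return after_comm.split()[0]


def read_optional(path):
    """Read a /proc file, returning a placeholder if it is missing or unreadable."""
    try:
        return path.read_text().strip()
    except OSError as exc:
        return f"<unavailable: {exc.strerror}>"


def describe_process(pid):
    """Collect the state, wait channel, and kernel stack of one process."""
    proc = Path("/proc") / str(pid)
    return {
        "state": parse_stat_state((proc / "stat").read_text()),
        # Name of the kernel function the task is sleeping in, if exposed.
        "wchan": read_optional(proc / "wchan"),
        # Kernel-side stack trace; usually readable by root only.
        "stack": read_optional(proc / "stack"),
    }
```

Calling describe_process(18566) as root (with the PID of the hung pacman) would return a dictionary whose "state" entry is "D" for a process in uninterruptible sleep; the "wchan" and "stack" entries, when available, hint at where in the kernel it is blocked.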

I seem to remember that last time, I got a backtrace that looked something like the one in this comment: http://bugs.archlinux.org/task/16210#comment49650 but I can't remember how I did it, and I might be remembering wrong.

Closed by Dan McGee (toofishes)
Tuesday, 15 February 2011, 23:05 GMT
Reason for closing: Duplicate
Additional comments about closing: FS#15369
Comment by Isaac Dupree (idupree) - Friday, 12 March 2010, 21:52 GMT
Here's a log I found from the failed suspends. I think it shows the kernel stack of the stuck pacman process.
Comment by Gavin Bisesi (Daenyth) - Wednesday, 17 March 2010, 03:50 GMT
Can you reproduce the issue consistently? Try with pacman --debug.
Comment by Isaac Dupree (idupree) - Wednesday, 17 March 2010, 04:25 GMT
No, I don't currently have a way to reliably reproduce it.

I could always run pacman with --debug from now on, in case I hit the bug (although --debug seemed to slow pacman down a bit).
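
If --debug is going to be left on all the time, it may help to keep a timestamped log that survives a forced reboot. The wrapper below is a rough sketch under that assumption; run_logged and the log path in the usage note are hypothetical names, not anything pacman provides. It flushes and fsyncs after every line, which adds some overhead, and output that the child process buffers internally before writing to the pipe may still be lost if it hangs.

```python
import os
import subprocess
import time


def run_logged(command, log_path):
    """Run `command`, appending each output line to `log_path` with a timestamp.

    Every line is flushed and fsync'ed so that the tail of the log is on disk
    even if the machine has to be powered off. Returns the exit status.
    """
    with open(log_path, "a", encoding="utf-8") as log:
        proc = subprocess.Popen(
            command,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # keep debug and normal output in order
            text=True,
            errors="replace",
        )
        for line in proc.stdout:
            stamp = time.strftime("%Y-%m-%d %H:%M:%S")
            log.write(f"{stamp} {line.rstrip()}\n")
            log.flush()
            os.fsync(log.fileno())
        return proc.wait()
```

For example, run_logged(["pacman", "-Sy", "--debug"], "/var/log/pacman-debug.log") run as root would record when each message arrived, so the last entry before a freeze shows roughly how far the sync got.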

I've installed gdb now, though I'm not sure it'll help anything.

Is this process state a legitimate thing under the Linux kernel, or is its existence a kernel bug?

Could it be that pacman gets confused by wrong data when a server is in the middle of updating its package lists?

By watching pacman -Sy run normally, both on its console and in 'top', I noticed an analogous situation. I believe the bug happens right after pacman finishes downloading one of the package lists (e.g. 'extra' or 'community'): it pauses and goes into state "D". When there's no bug, that lasts just a moment, but the couple of times the bug hit, it has lasted "forever" :-)
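
To put a number on how long a normal run sits in state "D", the state could be sampled repeatedly (for instance from the state field of /proc/<pid>/stat) and the longest continuous stretch computed afterwards. The function below is a small sketch of that second step only; longest_run is a hypothetical name, and the (timestamp, state) sample format is an assumption made for this example.

```python
def longest_run(samples, state="D"):
    """Return the longest time span (in seconds) spent continuously in `state`.

    `samples` is an iterable of (timestamp_seconds, state_letter) pairs in
    chronological order. A run is measured from its first matching sample to
    the first later sample with a different state; a run still in progress at
    the end of the data is counted up to the last sample.
    """
    longest = 0.0
    run_start = None
    last_time = None
    for timestamp, current in samples:
        if current == state:
            if run_start is None:
                run_start = timestamp
        elif run_start is not None:
            longest = max(longest, timestamp - run_start)
            run_start = None
        last_time = timestamp
    if run_start is not None:
        longest = max(longest, last_time - run_start)
    return longest
```

A healthy sync would be expected to give a small value here, while a hung process would give a value that keeps growing for as long as sampling continues.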

I ran pacman -Syy under valgrind (sudo valgrind --track-origins=yes pacman -Syy), and the only issue it found occurred before downloading any of the three package lists (pacman -Sy gives the same valgrind result):

:: Synchronizing package databases...
==18566== Syscall param rt_sigaction(act->sa_flags) points to uninitialised byte(s)
==18566==    at 0x507B1CE: __libc_sigaction (in /lib/libc-2.11.1.so)
==18566==    by 0x4E37B5A: download_internal (in /usr/lib/libalpm.so.4.0.3)
==18566==    by 0x4E38229: _alpm_download_single_file (in /usr/lib/libalpm.so.4.0.3)
==18566==    by 0x4E31D77: alpm_db_update (in /usr/lib/libalpm.so.4.0.3)
==18566==    by 0x409C86: ??? (in /usr/bin/pacman)
==18566==    by 0x406E8B: ??? (in /usr/bin/pacman)
==18566==    by 0x5067B6C: (below main) (in /lib/libc-2.11.1.so)
==18566==  Address 0x7feffd648 is on thread 1's stack
==18566==  Uninitialised value was created by a stack allocation
==18566==    at 0x4E377D1: download_internal (in /usr/lib/libalpm.so.4.0.3)
==18566==
core is up to date
extra is up to date
community is up to date

OK, can you think of anything else to try? I suppose I could run pacman -Sy in a loop to see if I hit the bug, though that seems like it would be rather abusing the mirror servers :-/ It might also be worth seeking out a kernel/Unix expert to tell me what that state "D" could mean. If any of my other programs hung this way, I'd suspect a kernel issue more, but it's only been pacman.
Comment by Isaac Dupree (idupree) - Wednesday, 17 March 2010, 04:47 GMT
Although I guess it could still be a kernel issue if pacman uses a syscall that most programs don't, and if there's an obscure, intermittent bug in Linux 2.6.32.* x86_64 (ext3 root filesystem), which is what I've been running. http://linuxgazette.net/issue83/tag/6.html (from 2002, in the age of Linux 2.4) suggests that the kind of "D" that even the kernel can't get rid of is either a kernel bug or a hardware lock-up. (My hardware doesn't have anything noticeably wrong with it at any time, even when running for a couple of days after one of these pacman lockups starts. Well, besides irrelevant things like the mouse button!)

Also, that valgrind uninitialized-bytes issue seems mighty suspicious.
