FS#18334 - [kernel26] suspend to disk breaks the system (process segfault)

Attached to Project: Arch Linux
Opened by Adrian C. (anrxc) - Monday, 15 February 2010, 03:33 GMT
Last edited by Andrea Scarpino (BaSh) - Thursday, 05 August 2010, 18:11 GMT
Task Type Bug Report
Category Upstream Bugs
Status Closed
Assigned To Tobias Powalowski (tpowa)
Thomas Bächler (brain0)
Architecture i686
Severity Medium
Priority Normal
Reported Version
Due in Version Undecided
Due Date Undecided
Percent Complete 100%
Votes 4
Private No

Details

Hi,
I am not sure where to file this: kernel26, pm-utils, uswsusp, or some other package. I am tagging it as kernel26 because my last hibernation, 7 days ago, worked fine, and kernel26 has been upgraded in the meantime while uswsusp and pm-utils have not.

I have been using software suspend (uswsusp) for a long time now. At the moment pm-suspend calling s2ram still works as usual, but pm-hibernate calling s2disk does not. The machine wakes up from hibernation properly, but the system is broken: I cannot start any new processes because they crash, and sometimes even resumed processes segfault. I had an X11 crash, Emacs crashed...

Since I cannot start any new processes, it is hard to collect any useful data. The only things I could extract from the logs are these segfaults; some examples are below:

Additional info:
Feb 15 04:19:04 katana kernel: zsh[6376]: segfault at 0 ip b787bc02 sp bfa380b0 error 6 in ld-2.11.1.so[b7869000+1c000]
Feb 15 04:19:25 katana kernel: sudo[6411]: segfault at 0 ip b7785c02 sp bfba6690 error 6 in ld-2.11.1.so[b7773000+1c000]
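
A quick way to compare such lines is to subtract the mapping base (the first number in the square brackets) from the instruction pointer, which gives the offset of the faulting instruction inside the library. The following is only a minimal Python sketch; the function name and the regular expression are illustrative assumptions written against the line format shown above, not part of any existing tool.

```python
import re

# Pattern for kernel segfault lines in the format quoted above.
# All numeric fields except the PID are printed in hexadecimal.
SEGFAULT_RE = re.compile(
    r"(?P<comm>\S+)\[(?P<pid>\d+)\]: segfault at (?P<addr>[0-9a-f]+) "
    r"ip (?P<ip>[0-9a-f]+) sp (?P<sp>[0-9a-f]+) error (?P<err>[0-9a-f]+) "
    r"in (?P<obj>\S+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)


def parse_segfault(line):
    """Return the parsed fields of a segfault line, or None if it does not match."""
    m = SEGFAULT_RE.search(line)
    if m is None:
        return None
    ip = int(m.group("ip"), 16)
    base = int(m.group("base"), 16)
    return {
        "comm": m.group("comm"),
        "pid": int(m.group("pid")),
        "addr": int(m.group("addr"), 16),
        "error": int(m.group("err"), 16),
        "object": m.group("obj"),
        # Offset of the faulting instruction within the mapped object.
        "offset": ip - base,
    }
```

Applied to the two lines above, both zsh and sudo resolve to the same offset, 0x12c02, inside ld-2.11.1.so, even though the absolute addresses differ.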


Steps to reproduce:

Closed by Andrea Scarpino (BaSh)
Thursday, 05 August 2010, 18:11 GMT
Reason for closing: Fixed
Additional comments about closing: kernel26 2.6.34.2-2
Comment by Jan de Groot (JGC) - Monday, 15 February 2010, 07:39 GMT
Did you actually reboot with the new kernel since the last hibernation? If you upgrade your kernel and don't reboot, but hibernate instead, things could go very wrong.
Comment by Adrian C. (anrxc) - Monday, 15 February 2010, 15:56 GMT
Of course. I managed to squeeze another message from the system just now. Unfortunately that is all I have to go on:

ls: symbol lookup error: /lib/libc.so.6: undefined symbol: error_print_progname, version GLIBC_2.0


Every 3 months it breaks all over again. In the meantime, while it does work, you simply cannot use it, because it is like playing the lottery with your work, your data and your hardware: you never know where it will break again. So, to your earlier question, I say again that I did reboot, because I shut down 99% of the time as I don't dare hibernate. It is 2010, and for two years I have been in a constant battle with hibernation: system freeze on resume, no keyboard on resume, graphics corruption on resume, system broken on resume...

One could say that once it works you can stop upgrading, and that the rolling release model is to blame. On the other hand, you keep upgrading because other software or drivers are in a poor state.
Comment by Jan de Groot (JGC) - Monday, 15 February 2010, 16:01 GMT
Did you try the regular suspend option by hibernating to your swap partition?
Comment by Adrian C. (anrxc) - Monday, 15 February 2010, 16:29 GMT
I could try kernel hibernation, but the only major difference is that uswsusp compresses the image before writing it to swap. I will report my findings.
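
For reference, in-kernel hibernation is driven through /sys/power/state: the file lists the sleep states the kernel supports, and writing "disk" to it as root starts hibernation. Below is a minimal Python sketch that only checks whether "disk" is offered; the function names are illustrative assumptions, and the sketch deliberately does not write to the file.

```python
def supported_sleep_states(state_text):
    """Split the contents of /sys/power/state into a list of state names."""
    return state_text.split()


def kernel_hibernation_available(state_text):
    """True if the kernel advertises the 'disk' (hibernate) state."""
    return "disk" in supported_sleep_states(state_text)


def read_power_state(path="/sys/power/state"):
    """Read the raw contents of the sysfs power state file."""
    with open(path) as f:
        return f.read()
```

On a running system, `kernel_hibernation_available(read_power_state())` reports whether the kernel path can be tested independently of uswsusp.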

I don't have much hope, though; I'm almost certain it's not a uswsusp problem. The memory gets corrupted for some reason, and so far KMS looks like the culprit (judging from other bug reports I have now found). Unfortunately Arch now forces KMS on Intel graphics, so I am left without options to test this:
http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=534422
https://bugzilla.redhat.com/show_bug.cgi?id=524905
Comment by Adrian C. (anrxc) - Monday, 15 February 2010, 20:48 GMT
I did a few hibernate cycles with direct kernel hibernation. The problem remains the same: everything crashes upon resume. To the segfault messages above I can add another one (which may lead to other bug reports on Google):

ps: relocation error: /lib/libnss_files.so.2: symbol fgets_unlocked, version GLIBC_2.1 not defined in file libc.so.6 with link time reference

This corruption is probably caused by KMS. If so, I understand you cannot do much unless there is an existing patch that addresses the exact issue. At least this gives a point of reference to any Intel owner who suffers from the same problem. I have the GM965 chipset and have been using KMS for some time now; for me this problem started with kernel26 2.6.32.8.
Comment by Adrian C. (anrxc) - Wednesday, 17 February 2010, 20:22 GMT
Other Arch users started reporting the issue on the BBS: http://bbs.archlinux.org/viewtopic.php?pid=711241
Main upstream bug report is apparently this one: http://bugzilla.kernel.org/show_bug.cgi?id=13811
Comment by Adrian C. (anrxc) - Thursday, 01 April 2010, 14:01 GMT
Still very much broken: this time another machine is affected, and that one has the GM45/X4500MHD. Does anyone have any information about this from kernel development for 2.6.34?
Comment by C Anthony Risinger (extofme) - Tuesday, 29 June 2010, 16:53 GMT
ah yes, here too. eee s101, intel 945GME chip. everything seems to work fine, but no new processes can be started (segfaults). if you log out, gdm can't restart. little to nothing can run. some things seem to still work (NetworkManager doesn't segfault, and a couple others), but the system is unusable.

i thought it was related to me having btrfs as my root partition, but it doesn't seem so. no FS corruption... luckily.

from logs:

Jun 26 18:51:18 extofem-n0 kernel: PM: restore of devices complete after 936.999 msecs
Jun 26 18:51:18 extofem-n0 kernel: PM: Image restored successfully.
Jun 26 18:51:18 extofem-n0 kernel: Restarting tasks ...
Jun 26 18:51:18 extofem-n0 kernel: hald[1474]: segfault at fffffffe ip b755d000 sp bfe42f9c error 6 in libc-2.12.so[b749c000+145000]
Jun 26 18:51:18 extofem-n0 kernel: udevd[1288]: segfault at ffffffea ip b7793000 sp bfa994cc error 6 in libc-2.12.so[b76d2000+145000]
..........

with many other fails after that, including bash/anything that tries to start up. same error in libc-2.12.so for everything.
Comment by Dmytro Bagrii (dimich) - Saturday, 03 July 2010, 21:29 GMT
I confirm the issue. In the past I used old hardware: VIA KT266 / Athlon XP / 1 GB RAM. Everything worked fine, including hibernation. One week ago I upgraded my hardware to P5G41 / Pentium Dual Core E5400 / 4 GB RAM. Now, after resuming from hibernation, I get segfaults in various programs in libc-2.12.so and ld-2.12.so:

Jun 30 00:06:55 dimich kernel: fbrun[21342]: segfault at b73d8968 ip b7791410 sp bfe34e74 error 7 in ld-2.12.so[b7788000+1c000]
Jun 30 00:06:57 dimich kernel: fbrun[21343]: segfault at b73c5968 ip b777e410 sp bffaac44 error 7 in ld-2.12.so[b7775000+1c000]
Jul 2 00:58:17 dimich kernel: udevd[2899]: segfault at d10a439 ip b7772c22 sp bfa5690c error 4 in libc-2.12.so[b7706000+145000]
Jul 2 00:58:18 dimich kernel: acpid[3181]: segfault at d153439 ip b77bbc22 sp bfd92144 error 4 in libc-2.12.so[b774f000+145000]
Jul 2 00:58:20 dimich kernel: tilda[3427]: segfault at c99d439 ip b7005c22 sp bfa0923c error 4 in libc-2.12.so[b6f99000+145000]
Jul 3 23:57:42 dimich kernel: less[6921]: segfault at 69f30106 ip b76c2403 sp bff2563c error 6 in libc-2.12.so[b75ef000+145000]

(See detailed log in attachment).
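
The "error" number in these lines is the x86 page-fault error code, a bit field describing the failed access. Here is a minimal Python sketch that decodes the three low bits, which are the only ones needed for the codes seen in this report (4, 6 and 7); the function name and the returned wording are illustrative assumptions.

```python
# Low bits of the x86 page-fault error code.
PF_PROT = 0x1   # set: protection violation; clear: page not present
PF_WRITE = 0x2  # set: write access; clear: read access
PF_USER = 0x4   # set: fault raised in user mode; clear: kernel mode


def decode_fault_error(code):
    """Describe the low three bits of a page-fault error code."""
    return {
        "cause": "protection violation" if code & PF_PROT else "page not present",
        "access": "write" if code & PF_WRITE else "read",
        "mode": "user" if code & PF_USER else "kernel",
    }
```

With this decoding, error 4 is a user-mode read of a page that is not present, error 6 is a user-mode write to a page that is not present, and error 7 is a user-mode write that hit a protection violation.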
Comment by Adrian C. (anrxc) - Tuesday, 06 July 2010, 15:55 GMT
Supposed to be fixed in commit 985b823b9192 - http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=985b823b9192
Unfortunately it only went into 2.6.35-rc4, so it is still not working in Arch with the latest kernel26 2.6.34.1... I can't wait to have power management again; this was a very ugly bug.
Comment by Adrian C. (anrxc) - Wednesday, 04 August 2010, 14:54 GMT
I can confirm it is fixed in kernel26 2.6.35 in [testing]. Tested one hibernation cycle on two machines with G965 and GM45 generations of Intel graphics.
Comment by Jan de Groot (JGC) - Wednesday, 04 August 2010, 14:57 GMT
This is fixed in 2.6.34.2 also. The fix has been backported to 2.6.34 upstream.
Comment by Pete (tam1138) - Thursday, 05 August 2010, 16:03 GMT
Confirmed to be fixed in released package kernel26-2.6.34.2-2.
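
To summarise the version information in this report: the fix is in 2.6.35-rc4 and later, and in 2.6.34.2 and later in the 2.6.34 stable series. The Python sketch below encodes only that; the function name is an illustrative assumption, and it returns False for older stable series simply because this report does not say whether they received the backport.

```python
import re

# Accepts strings such as "2.6.34.1", "2.6.35-rc4" or "2.6.34.2-2".
# A trailing package release such as "-2" is ignored.
VERSION_RE = re.compile(r"(\d+)\.(\d+)\.(\d+)(?:\.(\d+))?(?:-rc(\d+))?")


def has_resume_fix(version):
    """True if the version is reported above as containing the fix."""
    m = VERSION_RE.match(version)
    if m is None:
        raise ValueError("unrecognised kernel version: %r" % version)
    base = (int(m.group(1)), int(m.group(2)), int(m.group(3)))
    stable = int(m.group(4) or 0)
    rc = int(m.group(5)) if m.group(5) else None
    if base > (2, 6, 35):
        return True
    if base == (2, 6, 35):
        # Release candidates before rc4 predate the fix.
        return rc is None or rc >= 4
    if base == (2, 6, 34):
        # Backported in the second 2.6.34 stable update.
        return rc is None and stable >= 2
    # Older series: not covered by this report.
    return False
```

For example, `has_resume_fix("2.6.34.1")` is False, while `has_resume_fix("2.6.34.2-2")` and `has_resume_fix("2.6.35")` are True.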
