Arch Linux

FS#18682 - Illegal instruction in glibc's code

Attached to Project: Arch Linux
Opened by Gilles Bedel (gillux) - Sunday, 14 March 2010, 21:33 GMT
Last edited by Allan McRae (Allan) - Monday, 15 March 2010, 21:48 GMT
Task Type: Bug Report
Category: Packages: Core
Status: Closed
Assigned To: Allan McRae (Allan)
Architecture: x86_64
Severity: High
Priority: Normal
Reported Version:
Due in Version: Undecided
Due Date: Undecided
Percent Complete: 100%
Votes: 1
Private: No

Details

Description:

I'm experiencing some random "Illegal instruction" crashes. They're random in that I'm still unable to reproduce them reliably and they vary from one time to another, although sometimes the same context comes back. I've been able to generate some coredumps that show the exact same error occurring in several apps. This includes gcc (when trying to compile gcc, or sometimes mencoder), rtorrent, mlnet (from mldonkey) and bash. I guess it also affects the kernel, because I got many kernel panics, from which I can only get the end of the stack trace displayed on the screen. What I can see in it varies (from what I remember): there was page_fault(), other stuff related to tcp, and other things.

For all the coredumps I have (11), the illegal instruction _always_ occurs as shown in the attached file (typical_crash). It's within the libc _IO_vfscanf_internal function, called from somewhere I don't know (because I'm unable to compile anything, I can't get the debug symbols). The instruction pointer is at _IO_vfscanf_internal+304.

Additional info:
Remember that I'm unable to compile anything "big" (such as gcc or mencoder) because it always ends up somewhere with an Illegal instruction error (either bash or gcc crashing). Smaller programs are OK.

These crashes are not easily reproducible. For example, I'm sure it will crash when compiling gcc, but I don't know exactly when. And when I rerun the faulty gcc command afterwards, it just compiles the file without any error.

I've done 7 memtest passes without any error.

When the illegal instruction happens, the instruction pointer doesn't seem to be aligned with the assembler code. So my guess is that the problem may come from some glibc alignment mismatches on x86_64.

All the programs used come from the binary archlinux packages. Versions of the packages mentioned:
* core/glibc 2.11.1-1
* community/rtorrent 0.8.6-2
* core/bash 4.1.002-2
* core/gcc 4.4.3-1
* core/gcc-libs 4.4.3-1
* core/kernel26 2.6.32.7-1

Closed by Allan McRae (Allan)
Monday, 15 March 2010, 21:48 GMT
Reason for closing: Not a bug
Comment by Allan McRae (Allan) - Sunday, 14 March 2010, 21:45 GMT
Everything here screams hardware issue to me... but memtest seems to contradict me. Try reinstalling the toolchain (glibc, gcc-libs, gcc, linux-api-headers, binutils) to rule out a corrupt file.
Comment by Gilles Bedel (gillux) - Monday, 15 March 2010, 01:24 GMT
No luck. I downloaded glibc, gcc-libs, gcc, linux-api-headers, binutils and kernel26 again, reinstalled them and rebooted, but nothing changed. Gcc compilation failed again with an Illegal instruction error.

I forgot to mention that the CPU temperatures are also fine: 40° under heavy load, 29° idle.

Another thing: I recently discovered that after several program crashes without a kernel panic, more and more processes begin to hang in uninterruptible sleep (D state in ps). But since wchan is not available (#17756), I can't see what's happening to them. And in the end I can't even reboot properly because rc.shutdown also becomes stuck in D state...
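
For anyone wanting to track this, processes in uninterruptible sleep can also be listed by reading the state field of /proc/<pid>/stat. The following is only a minimal sketch in Python, assuming a Linux /proc filesystem; the helper names parse_stat and processes_in_state are made up for illustration, and it reports the state only, not what the process is waiting on.

```python
import os


def parse_stat(stat_text):
    """Split the start of a /proc/<pid>/stat line into (pid, comm, state)."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so use the first '(' and the last ')' as delimiters.
    open_idx = stat_text.index("(")
    close_idx = stat_text.rindex(")")
    pid = int(stat_text[:open_idx])
    comm = stat_text[open_idx + 1:close_idx]
    state = stat_text[close_idx + 1:].split()[0]
    return pid, comm, state


def processes_in_state(wanted="D", proc_root="/proc"):
    """Return (pid, comm) for every process whose state equals `wanted`."""
    found = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a process directory
        try:
            with open(os.path.join(proc_root, entry, "stat")) as fh:
                pid, comm, state = parse_stat(fh.read())
        except (OSError, ValueError):
            continue  # process exited meanwhile, or stat was unreadable
        if state == wanted:
            found.append((pid, comm))
    return found
```

Calling processes_in_state("D") returns a list of (pid, name) pairs that can be printed or logged periodically to see whether the number of stuck processes grows after each crash.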
Comment by Allan McRae (Allan) - Monday, 15 March 2010, 04:40 GMT
I'm not sure how this is going to be tracked down given this appears to only affect you. I could find nothing that seemed related in either the gcc or glibc trackers.

I doubt it is a code bug in the toolchain, given the lack of consistency and the fact that you can continue the build after the crash. Hard-drive issues? Given that RAM issues are ruled out, perhaps try building in a RAM tmpfs and see if you get the same error.

Do you have a recollection of when these errors started occurring and what you updated around that time? Are all your core packages stock Arch versions?
Comment by Glenn Matthys (RedShift) - Monday, 15 March 2010, 06:54 GMT
Please run a memtest; all the issues you've summed up are a typical scenario of bad RAM. See http://www.memtest.org/.
Comment by Allan McRae (Allan) - Monday, 15 March 2010, 07:43 GMT
@Glenn: I see you read the bug report thoroughly...
Comment by Jan de Groot (JGC) - Monday, 15 March 2010, 08:08 GMT
Memtest isn't holy here though. I've faced internal compiler errors and random crashes on a system that could pass memtest all day. In the end it turned out that my mainboard didn't like the fact that I had all memory slots filled, and it would only fail if you stressed all banks at the same time for an extended period. Memtest is a great utility, but it fails to detect such cases.
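
To illustrate the idea of keeping a lot of memory busy at once, here is a rough Python sketch that fills several large buffers with test patterns and reads them all back. It is an assumption-laden illustration, not a replacement for memtest or a real stress tool: a user-space program cannot choose which physical banks it touches, and the names pattern_pass and stress are invented for this example.

```python
def pattern_pass(buffers, pattern):
    """Fill every buffer with `pattern`, then re-read them all.

    Returns the number of buffers whose contents no longer match.
    """
    expected = bytes([pattern]) * len(buffers[0])
    # Write to all buffers first, so every one of them holds data
    # before any verification starts.
    for buf in buffers:
        buf[:] = expected
    return sum(1 for buf in buffers if buf != expected)


def stress(num_buffers=4, size_bytes=16 * 1024 * 1024, passes=3):
    """Run several write/verify passes and return the total mismatch count."""
    buffers = [bytearray(size_bytes) for _ in range(num_buffers)]
    errors = 0
    for _ in range(passes):
        # Alternating bit patterns: all zeros, all ones, 1010..., 0101...
        for pattern in (0x00, 0xFF, 0xAA, 0x55):
            errors += pattern_pass(buffers, pattern)
    return errors
```

A non-zero result from stress() would point at unreliable memory (or something on the path to it); a zero result proves nothing, for the same reason a clean memtest run does not.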
Comment by Pierre Schmitz (Pierre) - Monday, 15 March 2010, 14:57 GMT
You could also run the prime(95) program to stress your system. Ideally do it from a live CD, to make sure it's not just a broken file system or similar.
Comment by Glenn Matthys (RedShift) - Monday, 15 March 2010, 17:33 GMT
@Allan Oops, missed that bit... but still, I would recommend a memtest with a bit fade test (which I assumed the reporter did not execute)
Comment by Gilles Bedel (gillux) - Monday, 15 March 2010, 18:43 GMT
Thank you for all your comments :)

I didn't know that memtest sometimes cannot detect all errors. And indeed, this bug has been happening since a "major upgrade", which included a classic pacman -Syu, but also 2 freshly bought RAM modules. Before I had 1x 512MB, and now I've added 2x 1GB modules and all the motherboard slots are used. Of course, RAM clocks are the same for the 3 modules. I'll try removing the 512MB one.
Comment by Gilles Bedel (gillux) - Monday, 15 March 2010, 21:46 GMT
Problem solved! Gcc compiled fine this time. No more errors. You can close the bug.

As Jan de Groot experienced, my motherboard didn't like having all its memory slots filled, even though each memory module worked fine on its own. Strange...

Thank you again for all your support, you've been very helpful :)
