FS#46562 - [gcc] Segfaults while building the kernel

Attached to Project: Arch Linux
Opened by Alexander Pavel (SuperIce97) - Monday, 05 October 2015, 03:34 GMT
Last edited by Allan McRae (Allan) - Wednesday, 10 February 2016, 23:39 GMT
Task Type Bug Report
Category Packages: Core
Status Closed
Assigned To Allan McRae (Allan)
Architecture All
Severity Medium
Priority Normal
Reported Version
Due in Version Undecided
Due Date Undecided
Percent Complete 100%
Votes 0
Private No

Details

Description: GCC segfaults at random while building the Linux kernel. The segfaults occur on a different file each time, so it does not look like a bug in the kernel, but in gcc. I installed gcc49 from the AUR and the kernel built perfectly fine.


Additional info:
Package: gcc 5.2.0-2
Attempted to build kernel using Arch Build System (ABSROOT=. abs core/linux). Works fine with gcc49 built from AUR but crashes with GCC 5.2.0-2 from the official repositories.
Computer Specs: Acer Chromebook C740, Intel Celeron 3205u (Broadwell architecture), 4GB RAM, 16GB SSD (btrfs with lzo compression)


Steps to reproduce:
Attempt to build kernel with Arch Build System and GCC 5.2.0-2. May need a Broadwell CPU to reproduce (not sure if architecture specific yet).

Closed by  Allan McRae (Allan)
Wednesday, 10 February 2016, 23:39 GMT
Reason for closing:  Works for me
Additional comments about closing:  Seems hardware specific
Comment by Alexander Pavel (SuperIce97) - Monday, 05 October 2015, 03:36 GMT
Crap. This is my first bug report and I forgot to put a summary. Sorry.
Comment by Doug Newgard (Scimmia) - Monday, 05 October 2015, 03:41 GMT
That's an earlier broadwell. Is your microcode updated?

Edit: Hmm, or not. It's a "U" Broadwell, so the microcode isn't going to do it.
Comment by Alexander Pavel (SuperIce97) - Monday, 05 October 2015, 03:43 GMT
How can I check?
Comment by Doug Newgard (Scimmia) - Monday, 05 October 2015, 03:46 GMT
See edited comment. Try to get a backtrace so we can see what's going on and if it's related to the lock elision issue or whatever else might be going on.
Comment by Allan McRae (Allan) - Monday, 05 October 2015, 04:44 GMT
gcc crashing randomly is always a hardware issue and never a gcc issue.
Comment by Alexander Pavel (SuperIce97) - Monday, 05 October 2015, 14:32 GMT
I don't believe it to be a hardware issue. While it does seem to fail at random points in the kernel build, this issue does not occur with GCC 4.9. It fails in 5.2, and I believe that a new feature added between 4.9 and 5.2 is unstable with certain newer processors such as the Broadwell in my device. If it was a hardware issue, the kernel build should have failed with 4.9 as well. I don't have much time to do a trace at the moment, but I believe I will be able to do one in a few hours.
Comment by Allan McRae (Allan) - Monday, 05 October 2015, 14:56 GMT
Try building with makeflags set to -j1 (no parallel jobs) to make the build a bit more deterministic. Then we will know if it is really random or not.
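One way to act on this suggestion is to save the output of several single-job builds and compare which files GCC crashed on. The following is only a sketch: the helper names `ice_files` and `compare_runs` are hypothetical, the log file names are placeholders, and it assumes GCC reports the crash in its usual `file:line:col: internal compiler error: Segmentation fault` form.

```shell
#!/bin/sh
# Hypothetical helper: list the source files on which GCC reported a
# segfault in one saved build log.
# Assumes lines of the form
#   path/file.c:LINE:COL: internal compiler error: Segmentation fault
ice_files() {
    grep 'internal compiler error: Segmentation fault' "$1" | cut -d: -f1 | sort -u
}

# Hypothetical helper: compare two saved logs.
# Prints "same" if both name the same crashing files, "different" otherwise.
# Note: two logs without any crash also compare as "same".
compare_runs() {
    if [ "$(ice_files "$1")" = "$(ice_files "$2")" ]; then
        echo same
    else
        echo different
    fi
}
```

Usage might look like `make -j1 2>&1 | tee build-1.log`, then a second run into `build-2.log`, then `compare_runs build-1.log build-2.log`. If the answer is "different" even with a single job, the crash does not depend on one particular source file.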
Comment by Alexander Pavel (SuperIce97) - Tuesday, 06 October 2015, 03:29 GMT
I have some interesting information from some tests I ran today. According to my Comp. Sci. teacher, the glitch seems to be some kind of timing issue. While I was trying to debug the program by getting a backtrace, it never crashed and always built the kernel perfectly fine. Whenever I ran it without trying to get a backtrace, it crashed at random. Since GDB adds a few extra instructions to monitor the activity, which slows down compilation, the segfault does not appear while the compiler is being traced. I'm not quite sure what to do from here, but I am willing to do more tests tomorrow. I've attached a file with 4 different instances of the compilation crashing if anyone is interested in seeing it. Like I said, the crashes appear to occur at random points.
Comment by patrick (potomac) - Friday, 09 October 2015, 02:42 GMT
check your ram modules with memtest
Comment by Alexander Pavel (SuperIce97) - Saturday, 10 October 2015, 00:23 GMT
Memtest86 doesn't boot on this Chromebook because Memtest 4.3.6 (the version available, since legacy boot on Chromebooks is not EFI) does not support the hardware. However, I built a kernel with the Memtest debug option set, which checks memory with 17 different patterns before using it, and flags and avoids any memory that is bad. Running that kernel with the "memtest" option enabled in grub, I filled up the memory as much as I could, causing some programs to crash. free -m was showing 34MB free at one point. According to dmesg, memtest was enabled and no memory blocks were flagged as bad, so the memory should not be faulty.
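The dmesg check described above could be scripted roughly as follows. This is a sketch under stated assumptions: the function name `memtest_report` is hypothetical, and the two strings it searches for (`early_memtest` for the test banner and `bad mem addr` for a flagged range) are assumptions about what the kernel's memtest prints, so verify them against your own dmesg output before relying on the result.

```shell
#!/bin/sh
# Hypothetical helper: summarise the kernel memtest result from log text
# supplied on stdin (for example, the output of dmesg).
# Assumptions (not taken from the report): the kernel announces the test
# with a line containing "early_memtest" and reports each bad range with
# a line containing "bad mem addr".
memtest_report() {
    log=$(cat)
    if ! printf '%s\n' "$log" | grep -q 'early_memtest'; then
        echo "memtest did not run"
    elif printf '%s\n' "$log" | grep -q 'bad mem addr'; then
        echo "bad memory flagged"
    else
        echo "no bad memory flagged"
    fi
}
```

For example: `dmesg | memtest_report`.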
Comment by patrick (potomac) - Saturday, 10 October 2015, 00:50 GMT
I also suspect that gcc 5.2.0 triggers weird bugs with some processors; glibc can also trigger segmentation faults on some processors,

but you need to be sure that it's not a configuration problem ( bad settings in /etc/makepkg.conf, a problem with LD_LIBRARY_PATH )
Comment by Alexander Pavel (SuperIce97) - Saturday, 10 October 2015, 00:54 GMT
I know it's not a makepkg configuration problem because even just running "make" or "make -j3" will lead to the error occurring. How can I check if it's a problem with LD_LIBRARY_PATH?
Comment by patrick (potomac) - Saturday, 10 October 2015, 01:01 GMT
you can use "ldd" to see if the location of the shared libraries used by gcc is correct :

$ ldd /usr/bin/gcc
linux-vdso.so.1 (0x00007fff853cf000)
libm.so.6 => /usr/lib/libm.so.6 (0x00007fa7720e3000)
libc.so.6 => /usr/lib/libc.so.6 (0x00007fa771d3f000)
/lib64/ld-linux-x86-64.so.2 (0x00007fa7723e1000)

if the environment variable LD_LIBRARY_PATH is empty ( echo $LD_LIBRARY_PATH ) then it's ok
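The two checks above can be combined into a small script. This is a minimal sketch, not an official tool: the function names `unexpected_libs` and `ld_path_empty` are hypothetical, and treating `/usr/lib`, `/lib` and `/lib64` as the only expected locations is an assumption based on the ldd output shown here.

```shell
#!/bin/sh
# Hypothetical helper: read ldd output on stdin and print every library
# that resolves to a path outside /usr/lib, /lib or /lib64.
# Lines without a resolved path (linux-vdso, "not found") are skipped.
unexpected_libs() {
    awk '$2 == "=>" && $3 ~ /^\// && $3 !~ /^\/(usr\/)?lib(64)?\// { print $1 " -> " $3 }'
}

# Hypothetical helper: succeed if LD_LIBRARY_PATH is unset or empty.
ld_path_empty() {
    [ -z "${LD_LIBRARY_PATH:-}" ]
}
```

For example, `ldd /usr/bin/gcc | unexpected_libs` prints nothing when every library comes from one of the expected directories, and `ld_path_empty && echo ok` confirms that the variable is not set.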
Comment by Alexander Pavel (SuperIce97) - Saturday, 10 October 2015, 01:04 GMT
The locations are all correct (and the environment variable is empty). Any ideas on what could be done to debug this?
Comment by Evangelos Foutras (foutrelis) - Saturday, 10 October 2015, 01:57 GMT
Crashes are logged and can be viewed using coredumpctl. `coredumpctl gdb <PID>` might provide a clue as to what is happening. (Probably won't though, due to lack of debugging symbols.)

Also, make sure you have configured early microcode updates: https://wiki.archlinux.org/index.php/Microcode

Some higher-end Broadwell processors seem to require even newer microcode; our intel-ucode package ships the latest version provided by Intel but that's from January 2015. [1] It's not likely that your CPU model is affected but do test any new microcode once it becomes available.

[1] https://bugzilla.kernel.org/show_bug.cgi?id=103351
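To see which microcode revision is actually loaded, and to compare it before and after an update, the revision can be read from /proc/cpuinfo. A minimal sketch, assuming an x86 system whose /proc/cpuinfo contains a `microcode` field; the function name `microcode_rev` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print the microcode revision of the first CPU
# listed in a cpuinfo-style file (defaults to /proc/cpuinfo).
# Assumes a line of the form "microcode<TAB>: 0x..." as on x86 Linux.
microcode_rev() {
    awk -F': *' '/^microcode/ { print $2; exit }' "${1:-/proc/cpuinfo}"
}
```

Running `microcode_rev` once before and once after installing new microcode (and rebooting) shows whether the update was applied.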
Comment by Alexander Pavel (SuperIce97) - Friday, 08 January 2016, 18:02 GMT
I've figured it out! It is a microcode issue. Intel released a new microcode package in early November last year; I tried it out and now the kernel builds fine. It must have been some kind of erratum in a new instruction that GCC 5 was using (in fact, with GCC 5.3.0 the issues went away for most other packages, but not for the kernel). I've been looking into why the intel-ucode package has not been updated, and I think it must be because the microcode for certain older CPUs is not in the new package (for instance, the 2630QM that I have in my gaming/workstation laptop). Intel in fact lists 3 microcode packages as "latest". Should I add a feature request to include all three ucode blobs in the package and have the system decide which one to use? Maybe we could have a chained type of loader that tries all 3 on boot in order from oldest to newest, which would make sure the newest version for that CPU is loaded. Obviously, simply not including the latest ucode for the latest CPUs is not a great option.
Comment by patrick (potomac) - Friday, 08 January 2016, 20:50 GMT
you can also flash the bios of your motherboard to the latest version, a recent bios could have the new microcode from november 2015
