FS#11141 - Kernel oops on Archlinux guest in Virtualbox 1.6.4

Attached to Project: Arch Linux
Opened by Mathias Burén (fackamato) - Thursday, 07 August 2008, 10:07 GMT
Last edited by Tobias Powalowski (tpowa) - Saturday, 04 October 2008, 13:07 GMT
Task Type Bug Report
Category Kernel
Status Closed
Assigned To Thomas Bächler (brain0)
Architecture i686
Severity Medium
Priority Normal
Reported Version None
Due in Version Undecided
Due Date Undecided
Percent Complete 100%
Votes 2
Private No

Details

I think this is a problem with the Archlinux kernel, therefore I'm posting this bug here. It can also be seen at: http://forums.virtualbox.org/viewtopic.php?t=8499

Description: Kernel oops on Archlinux guest in Virtualbox 1.6.4. I used the latest ISO install CD, installed the system, enabled the testing repo, pacman -Syu, reboot. Now I did pacman -S xorg kde, and it begins to download packages. Then all of a sudden (after ~6 minutes perhaps?) I get a kernel oops.

kernel 2.6.26-ARCH, boot options were rootflags=data=writeback noapic nosmp acpi=off
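
For anyone reproducing this, the boot options above can be split into individual settings to see exactly which ones are active. The following is a minimal Python sketch and not part of the original report; the helper name `parse_cmdline` is hypothetical, and the string is copied from the line above.

```python
def parse_cmdline(cmdline):
    """Split a kernel command line into a {name: value} dict.

    Bare flags such as "noapic" map to None. Only the first "=" splits
    name from value, so "rootflags=data=writeback" keeps its full value.
    """
    options = {}
    for token in cmdline.split():
        name, sep, value = token.partition("=")
        options[name] = value if sep else None
    return options


# Boot options as quoted in this report.
reported = "rootflags=data=writeback noapic nosmp acpi=off"
opts = parse_cmdline(reported)

for name, value in opts.items():
    print(name, "=", value)
```

On a running Linux guest the same function could be fed the contents of /proc/cmdline instead of the hard-coded string.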

Picture attached.

Closed by Tobias Powalowski (tpowa)
Saturday, 04 October 2008, 13:07 GMT
Reason for closing: Fixed
Comment by Mathias Burén (fackamato) - Thursday, 07 August 2008, 10:26 GMT
See http://pastebin.ca/1094562 for dmesg.txt
Comment by Mathias Burén (fackamato) - Thursday, 07 August 2008, 12:14 GMT
It seems to occur during disk usage, regardless of whether I use the SATA or PATA disk controller (Virtualbox has a SATA option). The oops seems related to disk writing/reading, whatever the filesystem. (I first got an oops with xfs, thought it was an error in Arch testing, so I reinstalled with ext3, and got an oops with ext3 too.)
Comment by Gerhard Brauer (GerBra) - Monday, 11 August 2008, 14:04 GMT
I can confirm this. With 2.6.26 in the guest (the host is still on 2.6.25) the guest crashes 8 out of 10 times when running, for example, bsdtar on a big file like Virtualbox-OSE. After such a crash, a reset often leads to an oops right after reboot, directly after the initrd starts ("Unable to handle kernel bug..." directly after "Freeing SMP alternatives").

At the moment my guest oopses on every reboot, regardless of which boot parameters I use, and also with the fallback image.
I will boot the guest from cd and try to rebuild the initrd again.

But this seems to happen only in virtualbox (or virtual machines in general?). On my laptop with 2.6.26 from testing I have never had such a thing. There is also nothing similar on the forums etc. from users on "real" machines.

I have a trace in /var/log/everything.log from the oops in the guest, and also a screenshot of the boot trap. If this is useful for you I can post/attach it.
Comment by Gerhard Brauer (GerBra) - Monday, 11 August 2008, 17:26 GMT
I was able to revive my guest from the boot trap by recreating the initrd. This was the only way to boot the guest again with 2.6.26.
Doing the same thing as before (makepkg virtualbox-modules) leads to an oops again, and to an unbootable guest again.
Comment by David Rosenstrauch (darose) - Friday, 22 August 2008, 19:40 GMT
I'm also seeing this in my virtualbox guest
Comment by David Rosenstrauch (darose) - Friday, 22 August 2008, 19:44 GMT
Forgot: disabling acpi on the guest seems to fix this (as the original poster indicated in the virtualbox forums post)
Comment by Xavier (shining) - Friday, 22 August 2008, 22:34 GMT
Confirming.

My archlinux guest also crashed in virtualbox with kernel 2.6.26.
Disabling acpi helped to boot most of the times, but the system was still highly unstable and *always* crashed eventually.
Come to think of it, it also happened very often while some IO actions were happening, which confirms what Mathias said.

I rebuilt both 2.6.26 and later 2.6.26.3 with a minimal config, and it has been working perfectly. I don't know which config changes were relevant.
As with GerBra, the crash during boot happened directly after "Freeing SMP alternatives", so I disabled SMP in my minimal config. Maybe this did it?
If anyone could confirm this, it would be interesting. I am not able to do it at the moment.
Comment by Xavier (shining) - Monday, 25 August 2008, 10:49 GMT
Did anyone find any upstream kernel reports, for example at bugzilla.kernel.org or on lkml?
I found one with some information on the virtualbox bug tracker:
http://www.virtualbox.org/ticket/1875
Comment by Xavier (shining) - Monday, 25 August 2008, 11:10 GMT
Re-adding SMP to my custom 2.6.26.3 kernel does not change anything. It still boots and works perfectly fine.
Comment by Gerhard Brauer (GerBra) - Monday, 25 August 2008, 11:18 GMT
At the Froscon conference last weekend we often installed the multi-arch ISO (kernel 2.6.26) on virtualbox on an x86_64 host.
That virtualbox was not the OSE version. There we had no oops or crash with 2.6.26. Perhaps some tests could help to isolate the problem. Does anyone have this problem on a non-OSE version? Does anyone have this problem on x86_64 (OSE or full version)?

From my few tests, I think enabling VT-x can sometimes help against the oops during boot, but not against the heavy-IO oops.

Comment by Gerhard Brauer (GerBra) - Monday, 25 August 2008, 13:33 GMT
OK, I also tried virtualbox_bin 1.6. Same problem as with the OSE version here on i686.

Some "facts":
Enabling VT-x, or disabling acpi in the guest settings (and using noacpi as a guest kernel parameter), always gets past the early oops after "Freeing SMP".
The oops can always be reproduced here during package installation in the installer (heavy disk IO). The oops goes to dmesg; package installation continues but ends with FAILED. It seems that nothing gets written to the HD after the first trap.
I have tested several filesystems.
If I shut down the guest (HOST+Q -> shutdown virtual machine) I often get an error dialog from virtualbox (logfile and screenshot).
Currently I cannot provide this logfile (I have several vbox logs of > 50MB) and it seems impossible to search them for the exact place where the fault happens. If I can reproduce this virtualbox error dialog I will isolate it.
I attach the guest dmesg and messages.log; maybe they are useful.
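
For large log files like the ones mentioned above, a small script can help locate the first trap. This is a rough Python sketch only: the marker strings are assumptions based on the messages quoted in this report, the sample lines are invented for illustration and are not taken from the attached logs, and `find_oops_lines` is a hypothetical helper.

```python
import re

# Assumed markers; extend the pattern if your logs use other wording.
OOPS_MARKERS = re.compile(r"BUG: unable to handle kernel|Oops:|Kernel panic")


def find_oops_lines(lines):
    """Yield (line_number, text) for every line that matches a marker.

    Accepts any iterable of lines, so an open file object can be passed
    directly and a log of many megabytes is never held in memory at once.
    """
    for number, line in enumerate(lines, start=1):
        if OOPS_MARKERS.search(line):
            yield number, line.rstrip("\n")


# Invented sample lines, for illustration only.
sample = [
    "Freeing SMP alternatives: 16k freed\n",
    "BUG: unable to handle kernel NULL pointer dereference at 00000000\n",
    "Oops: 0002 [#1] PREEMPT SMP\n",
    "EXT3-fs: mounted filesystem with writeback data mode.\n",
]

hits = list(find_oops_lines(sample))
for number, text in hits:
    print(number, text)
```

For a real log, something like `with open(path, errors="replace") as f: hits = list(find_oops_lines(f))` would scan the file line by line; the first hit is usually the most interesting one, since later traces are often follow-up damage.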
Comment by David Rosenstrauch (darose) - Monday, 25 August 2008, 14:18 GMT
"Comment by Gerhard Brauer (GerBra) - Monday, 25 August 2008, 11:18 GMT

Does anyone have this problem on a non-OSE version? Does anyone have this problem on x86_64 (OSE or full version)?"

Yes, I'm experiencing this problem with the non-OSE on x86_64.

It's a very weird problem - intermittent. Sometimes it boots; other times it fails with the "BUG: unable to handle kernel" message. And other times it boots, runs successfully for a while, and then segfaults on something later on (e.g., most recently when I was building the virtualbox-ose-additions package). And yes, when it segfaults, the trace does seem to be in some storage/file system code. Weird.
Comment by Xavier (shining) - Monday, 25 August 2008, 14:32 GMT
Off-topic: the virtualbox-ose-additions PKGBUILD is really dumb, I would not recommend building it to anyone.
It builds the whole source, and only keeps about 1% of it.
Instead, you can just build that 1%, and have a build 100 times faster for the same result :P
(If you are wondering: yes, I sent these to the maintainer (Bash) 10 days ago. He was on holiday at that time.)
Comment by Tiago Pereira (Mech) - Monday, 25 August 2008, 18:24 GMT
I have added noreplace-paravirt as a boot param in grub and I no longer get panics, but everything is really slow now... I am not sure what this option does; I found it on the Ubuntu Launchpad (https://bugs.launchpad.net/ubuntu/+source/linux/+bug/246067). It does solve the problem temporarily and may help in finding the origin of the problem.
Comment by Xavier (shining) - Monday, 25 August 2008, 21:41 GMT
Last off-topic message about the virtualbox packages:
Bash actually updated them a few days ago, but I confused him by sending him several versions of virtualbox-ose-additions.
He fixed it and it all looks great now, so you can simply use the packages from community and the PKGBUILDs from abs.
Comment by Gerhard Brauer (GerBra) - Tuesday, 26 August 2008, 13:10 GMT
Comment by Tiago Pereira (Mech) - Monday, 25 August 2008, 20:24 GMT+2
"I have added noreplace-paravirt as a boot param..."

I don't think this can be a solution. I tested it with the multi-arch ISO (2.6.26), and with this parameter it is extremely slow and the ISO doesn't boot (the initrd doesn't find the cdrom after the udev hook).
The "problem" with all these kernel parameters and virtualbox settings is, IMHO: there was never a problem with kernels < 2.6.26, and no need for tricks. With the same virtualbox version on my host, these errors only occur once the guests switch to 2.6.26 (on Arch guests via a pacman update, or for the install ISO by using the 2008.08-froscon-iso with 2.6.26 as kernel).
And when this error occurs on vbox-ose, vbox-sun, i686 and x86_64 (seen from the host side), then in my opinion it's a kernel bug (or a kernel change to which Sun must adapt their virtualbox code). Qemu, for example, does not have this problem.
Comment by Gerhard Brauer (GerBra) - Tuesday, 26 August 2008, 14:30 GMT
There is also a thread about this on lkml:
http://lkml.org/lkml/2008/8/20/359
Comment by Xavier (shining) - Tuesday, 26 August 2008, 14:51 GMT
Thanks for the link!
I like lkml.org, but I have had a bad experience with it when trying to track all the answers in a thread. marc or gmane worked better:
http://marc.info/?t=121926107200003&r=1&w=2
http://thread.gmane.org/gmane.linux.kernel/724038/focus=725005
So there has been quite a few exchanges, but unfortunately no real progress yet.
Comment by Gerhard Brauer (GerBra) - Tuesday, 26 August 2008, 15:24 GMT
I wrote a mail to lkml with a summary of my/our information. I am currently building a kernel with the hints Mathieu gave in his reply to my mail.
Comment by Dane (Xiol) - Tuesday, 26 August 2008, 22:25 GMT
Confirming this problem with VirtualBox 1.6.4 on Windows. Installed ArchLinux from 2008.06 FTP, booted fine but after installation (and installation of 2.6.26 kernel) system would not boot. Adding noapci to boot line worked for me (so far).
Comment by Gerhard Brauer (GerBra) - Tuesday, 26 August 2008, 22:44 GMT
I think we know a little more about the reasons after several posts on lkml. With the hints and changes to the kernel source, my virtualbox guest with Arch 2.6.26 is currently quite stable: it always boots without "tricks", and once booted I can compile and do similar things with heavy disk IO.
We are doing further testing (it's related to 2.6.26 irq state handling, as far as I understand, because I'm not a developer. And my C experience is much uglier than my English..<g>).
Also, these guys must work out whether the changes have side effects on "real" architectures. Linux *only* on virtualbox is not a goal...

I think I can post a patch for the Arch stock kernel tomorrow that some of you may like to test. You will need to build a new kernel for your guests.
Comment by Xavier (shining) - Wednesday, 27 August 2008, 06:37 GMT
GerBra : I just read the lkml thread, thanks a lot for all the testing you did!
By the way, being a C developer is one thing, learning and understanding the kernel internals is another :)
Finally, it seems like the last patch you had to test did not work. At this point, I don't think you need to do wider testing; I think you rather have to wait for other ideas/proposals.
Anyway, thanks again for all your effort and keep up the good testing, it will hopefully lead somewhere :)
Comment by Gerhard Brauer (GerBra) - Friday, 29 August 2008, 14:02 GMT
Could you please test it again with the new kernel26 2.6.26.3-1?

Short testing here showed a significant improvement (mostly for the early oops after "Freeing SMP"...). But I could still reproduce the kernel panic under heavy disk IO. I'll test more later this evening...
Comment by Xavier (shining) - Friday, 29 August 2008, 15:53 GMT
I confirm this behavior. Arch 2.6.26.3 kernel boots fine, but does not survive heavy disk io (extracting the linux kernel caused a panic at around half way)
My vanilla 2.6.26.3 with custom config (attached) still works perfectly fine in every situation.
Comment by David Rosenstrauch (darose) - Tuesday, 02 September 2008, 21:23 GMT
@shining:

Any chance you could post a kernel26 i686 package built using your custom config? I've only got x86_64 boxes anymore and so can't build an i686 package. (I could obviously try to build it inside a vbox guest, but that would most likely crash with a kernel oops due to this very problem.)
Comment by Xavier (shining) - Tuesday, 02 September 2008, 23:14 GMT
That is a good point :) I think what I did was simply downgrade to the 2.6.25 Arch kernel, which worked perfectly, and build my custom 2.6.26 from there.
The package is still available (found via http://wiki.archlinux.org/index.php/Downgrade_packages):
http://ftp.tu-chemnitz.de/pub/linux/sunsite.unc-mirror/distributions/archlinux/core/os/i686/kernel26-2.6.25.11-1-i686.pkg.tar.gz
Comment by David Rosenstrauch (darose) - Wednesday, 03 September 2008, 00:57 GMT
Huh! Learn something new every day! I've been using Arch for years, and didn't know that there were repos available with old package versions.

Well, that ought to do the trick then, until this is fixed. Thanks much for the tip.
Comment by Thayer Williams (thayer) - Sunday, 07 September 2008, 22:00 GMT
Just encountered this bug today myself. The interesting thing is that it only seems to happen when I attempt to install the 'kde' package group. I installed the gnome and xfce package groups without fail (and did so several times just to be sure), but as soon as I run 'pacman -S kde' it triggers an oops.
Comment by Jan M. (funkyou) - Thursday, 11 September 2008, 12:56 GMT
The bad news: Also encountered this some days ago while creating filesystems in a VM.

The good news: It is fixed in VirtualBox 2.0.2. Last comment here: http://www.virtualbox.org/ticket/1875
Comment by Gerhard Brauer (GerBra) - Sunday, 21 September 2008, 13:32 GMT
From my side I can also say: 2.0.2 solved the problem. I can boot/use the Arch Froscon ISO (2.6.26) in VMs and no longer have any problem when doing heavy disk IO in the VM.
So maybe we can close this in the next few days?
Comment by Corrado Primier (bardo) - Saturday, 04 October 2008, 12:11 GMT
Can this bug be closed, then?
Comment by Gerhard Brauer (GerBra) - Saturday, 04 October 2008, 12:21 GMT
Yes, I've posted the news on LKML, and the virtualbox.org bug report is also closed/fixed.
