FS#31493 - [syslinux] 4.05-7 Intermittent failure to boot with RAID1

Attached to Project: Arch Linux
Opened by Peter Hardman (shetland_breeder) - Tuesday, 11 September 2012, 08:30 GMT
Last edited by Thomas Bächler (brain0) - Monday, 29 October 2012, 15:32 GMT
Task Type Bug Report
Category Packages: Core
Status Closed
Assigned To Tobias Powalowski (tpowa)
Thomas Bächler (brain0)
Architecture x86_64
Severity High
Priority Normal
Reported Version
Due in Version Undecided
Due Date Undecided
Percent Complete 100%
Votes 1
Private No

Details

I have encountered a problem similar to that of FS#31175. As this is presumed fixed with 4.05-7, I thought I had better report it.

If I reboot the PC it will almost always fail to load the menu (sits for several seconds with a blank screen) and then reboot.
If I power the machine off for a few seconds and then power on it will mostly boot normally, but sometimes just keep restarting as above.

Judging from the time delay between the POST completing and the machine rebooting, _something_ is running in a loop. I have only once had SYSLINUX hang at the copyright notice.

This machine runs software RAID1 with two hard drives. I have / in the first partition; the swap (also RAID), /home and several other partitions are all RAID1 as well. The metadata is 0.90 and the partitions start at block 63. This array has been in use for 3 years or more now in a Dell PE840, hence the old partitioning and mdadm metadata.

I built a new machine a fortnight ago with a Gigabyte GB-H61M-S2PV motherboard and moved the RAID array to the new machine (very smooth - just installed the correct video driver and moved the drives).

When I built the machine and before removing the RAID array I installed a single drive and did a clean Arch install. This was using syslinux and there were no boot problems.

Syslinux was installed on the RAID array with the syslinux-install_update script, and that gave me the appropriate messages for installing on a RAID array.

I've not done any debugging apart from adding the syntax highlighter lines to syslinux.cfg to make sure there's no bad syntax. The machine is now back to booting with grub-legacy (just restored the MBR bootloader).

An obvious step is to swap the array back into the Dell (and the single drive back to the new machine). I shall report back once this is done.





Additional info:
* package version(s)
* config and/or log files etc.


Steps to reproduce:

Closed by Thomas Bächler (brain0)
Monday, 29 October 2012, 15:32 GMT
Reason for closing: Fixed
Additional comments about closing: _Probably_ fixed, reopen when the problem surfaces again.
Comment by Peter Hardman (shetland_breeder) - Tuesday, 11 September 2012, 11:14 GMT
I've now tried a single disk with syslinux 4.05-7, which boots fine on the Dell PE840.

In my new hardware this fails to boot every time. If I run the Arch live CD to the point where it logs in and then reboot it will get to the syslinux menu and boot OK.

If I downgrade syslinux to 4.05-6 then again it boots every time.

Seems there is a problem with 4.05-7 on this specific hardware - I have two other PCs (with old hardware) as well as the Dell and a Thinkpad T61 that boot fine with 4.05-7.
Comment by Thomas Bächler (brain0) - Tuesday, 11 September 2012, 15:38 GMT
First of all, I do use syslinux on a RAID 1 where the boot partition starts at 1MB and I use metadata 1.0 for the RAID (syslinux does not support 1.2). This works perfectly every time.

Note that syslinux 4.05-7 is now built the way upstream recommends: We do NOT use our own compiler to rebuild the lowlevel bootloader files, but instead we use the files shipped with syslinux ('make installer' target).

Bootloaders are incredibly complex, so I suggest you do the following:

1) Download the syslinux tarball from the homepage and install syslinux using their extlinux installer - this should yield the same result as the installer from Arch - if not, we have a bug I cannot explain.
2) Join the #syslinux IRC channel on freenode or the syslinux mailing list. There are experts there who know how to debug these kinds of problems - I sadly don't.

Also, as your issue seems partially fixed by powering on/off the machine, there could be a memory problem that wasn't visible with a different syslinux build. You can verify that your memory is fine with memtest, but that is time-consuming.
Comment by Peter Hardman (shetland_breeder) - Tuesday, 11 September 2012, 20:20 GMT
Since I also get the problem on a single disk installation it seems that RAID is not an issue.

You are right, there are plenty of lines of attack yet, but you might have known what the problem was already :)

memtest86+ throws no errors (over 4 runs) BTW.

I probably shan't be able to work on this again until October - away on holiday. If you leave this open I'll update it when I've done some more work on it.

Pete
Comment by Peter Hardman (shetland_breeder) - Monday, 01 October 2012, 11:56 GMT
Today I downloaded the syslinux tarball and installed extlinux from that, including copying all the com32 files to make an installation with the same files as in the Arch package.

It works flawlessly.

However - every single file apart from poweroff.com (including extlinux) has a different md5 sum from those in the Arch package. So I suppose our toolchain somehow does something different from Syslinux' toolchain.

Pete
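
For anyone who wants to repeat the checksum comparison described above, here is a minimal sketch in Python. It is not what was actually run in this report; it assumes the files from the upstream tarball and the files from the Arch package have each been copied into a flat directory, and the function names md5_of and compare_trees are made up for illustration.

```python
import hashlib
from pathlib import Path


def md5_of(path: Path) -> str:
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def compare_trees(dir_a: Path, dir_b: Path) -> dict[str, list[str]]:
    """Compare same-named regular files in two flat directories by MD5.

    Hypothetical helper: dir_a and dir_b stand for wherever the upstream
    and packaged syslinux files were copied. Subdirectories are ignored.
    """
    files_a = {p.name: p for p in dir_a.iterdir() if p.is_file()}
    files_b = {p.name: p for p in dir_b.iterdir() if p.is_file()}
    result = {"same": [], "different": [], "only_a": [], "only_b": []}
    for name in sorted(files_a.keys() | files_b.keys()):
        if name not in files_b:
            result["only_a"].append(name)
        elif name not in files_a:
            result["only_b"].append(name)
        elif md5_of(files_a[name]) == md5_of(files_b[name]):
            result["same"].append(name)
        else:
            result["different"].append(name)
    return result
```

Calling compare_trees(Path("upstream"), Path("arch")) with two such directories (the directory names here are placeholders) returns the file names grouped into "same", "different", "only_a" and "only_b". A differing checksum only shows that the bytes differ; it does not say which build step caused the difference.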
Comment by Thomas Bächler (brain0) - Monday, 01 October 2012, 12:07 GMT
This is surprising to me. The 'make installer' target which we use does not rebuild the relevant bootloader parts, but only the installer binaries.
Comment by Thomas Bächler (brain0) - Wednesday, 24 October 2012, 19:05 GMT
4.06 is in testing, and it has many fixes. Please test.
Comment by Peter Hardman (shetland_breeder) - Thursday, 25 October 2012, 09:59 GMT
Since my previous comment I've uninstalled syslinux, including the /boot/syslinux directory and ldlinux.sys and reinstalled. It now works >> 95% of the time. Very occasionally on a resume from hibernate it will hang at the copyright, or even more occasionally just spin rebooting. But so far it's always been recoverable by powering off and on again.

Can I just install syslinux from testing? I'm very reluctant to install everything from testing as this is my 'production' PC.

Comment by Thomas Bächler (brain0) - Thursday, 25 October 2012, 10:59 GMT
Yes, shouldn't be a problem.
Comment by Peter Hardman (shetland_breeder) - Monday, 29 October 2012, 15:00 GMT
I updated to 4.06 on Friday on both the primary installation and the secondary installation (chainloaded from the primary) and it all seems OK.
