FS#6514 - xfs - possible file corruption

Attached to Project: Arch Linux
Opened by david cheung (scruffidog) - Friday, 02 March 2007, 18:18 GMT
Last edited by Paul Mattal (paul) - Monday, 31 December 2007, 17:28 GMT
Task Type Bug Report
Category System
Status Closed
Assigned To Paul Mattal (paul)
Architecture All
Severity Critical
Priority Normal
Reported Version 0.7.2 Gimmick
Due in Version Undecided
Due Date Undecided
Percent Complete 100%
Votes 0
Private No

Details

i've been running into a very particular problem with xfs:

the setup:
vanilla linux 2.6.20 client and server
external usb HD via usb2/fw1394 (type of connection does not matter)
xfs filesystem

the issue:
on large files (> 500M) copied to a different location (the external disk or an external host) that uses xfs as the underlying filesystem, the checksums (or md5sums) do not always match, no matter what means of transfer is used (rsync, cp, cpio, tar, etc). worse yet, the checksums of problematic files change after updates:
after transferring files and running a checksum on it, the source and destination are different.
rsync --inplace -c --block-size=4096 -avu foo /foobar/foo
checksum the source and destination and it matches
run sync several times and re-checksum to still match
after some random time, a checksum returns a mismatch on src & dest
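
The source/destination comparison described above could be scripted roughly as follows. This is only a sketch: the function names `file_digest` and `copies_match` are made up for illustration, and md5 is used only because the report mentions md5sum.

```python
import hashlib


def file_digest(path, algorithm="md5", chunk_size=1024 * 1024):
    """Hash a file in fixed-size chunks so large files need not fit in RAM."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        # read() returns b"" at end of file, which stops the iterator
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copies_match(source, destination, algorithm="md5"):
    """Return True when both files currently hash to the same value."""
    return file_digest(source, algorithm) == file_digest(destination, algorithm)
```

Calling `copies_match("foo", "/foobar/foo")` right after the copy and again later would show when a mismatch first appears. Note that a read shortly after a write may be served from the page cache rather than from the disk, so a match immediately after copying does not prove the on-disk data is good.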

it is most likely not a problem with the hard drive(s), since this is the first thing i checked: consistency checks, SMART software, low level disk scans and testing against multiple drives in different connection configurations.

my first goal is to see if someone else can reproduce this problem with an external usb2/1394 configuration. I usually just replicate my media collection from the internal disk to an externally connected one to test.

Closed by  Paul Mattal (paul)
Monday, 31 December 2007, 17:28 GMT
Reason for closing:  Deferred
Additional comments about closing:  As far as we can tell, this was either a hardware issue or something that needs to be handled upstream.
Comment by Jan de Groot (JGC) - Friday, 02 March 2007, 18:55 GMT
Never had these issues with XFS before. One thing that can be the problem here is your external harddisk casing. I had two of those things that went dead and one of them took the physical harddisk with it while passing away. (it's not nice when a maxtor drive gets undervolted a lot and causes spindowns all the time).

To be certain about this, please attach your USB/FW casing to a mac or windows PC and test if the same things happen there. In this case we can have several points of failure, so we should try to exclude as many as possible.
Comment by david cheung (scruffidog) - Friday, 02 March 2007, 19:37 GMT
true and i agree that the external enclosures may be problematic, however, it does not explain the behavior when going over the network to a server with internal disks.
Comment by João Rodrigues (gothicknight) - Tuesday, 06 March 2007, 00:36 GMT
I've used my external HD drive with XFS and I've had no problems with file checksums, and I use the disk A LOT. Some time ago I had a problem with a "broken" USB cable which almost caused a warranty replacement.
Comment by Andreas Radke (AndyRTR) - Wednesday, 07 March 2007, 06:44 GMT
sounds like a bug in either the harddisc (IDE?) chip driver or in the kernel usb system layer.

can you please try to find out your external harddisc controller chip using lsusb (-v) and your onboard usb controller using lspci | grep -i usb?
Comment by Andreas Radke (AndyRTR) - Friday, 20 April 2007, 04:54 GMT
status?
Comment by Dale Blount (dale) - Friday, 27 April 2007, 19:34 GMT
since your backups are already pretty much useless, could you try ext3 on the external disk for kicks?
Comment by Andreas Hauser (buggs) - Thursday, 10 May 2007, 10:59 GMT
Well this could be related to problems i have with a git repository on XFS.
Git makes copies of the files and saves them in a file named after the SHA1 sum.
From time to time the files do not match their SHA1 filename anymore!
Looking at the files this is because of bit flips (one or two).
This has happened about 5 times in the last 2 months, so far only with files > 1MB.
This is on x86_64.
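
For reference, a loose Git object can be checked against its file name with a few lines of Python. This is a sketch under assumptions: it only handles loose (unpacked) objects in a repository using SHA-1 object names, and the function name `loose_object_is_intact` is invented here. `git fsck` performs this kind of verification for the whole repository.

```python
import hashlib
import zlib
from pathlib import Path


def loose_object_is_intact(object_path):
    """Check a loose Git object file against the SHA-1 encoded in its path.

    Loose objects live at .git/objects/<first 2 hex digits>/<remaining 38>,
    and the name is the SHA-1 of the zlib-decompressed file contents.
    """
    object_path = Path(object_path)
    expected = object_path.parent.name + object_path.name
    try:
        raw = zlib.decompress(object_path.read_bytes())
    except zlib.error:
        # A flipped bit can also break the compressed stream itself.
        return False
    return hashlib.sha1(raw).hexdigest() == expected
```

A `False` result for an object that was previously fine would match the symptom described above, where a stored file no longer matches its SHA1 file name.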
Comment by Andreas Hauser (buggs) - Thursday, 10 May 2007, 11:00 GMT
It's a local SATA Raptor.
Comment by Celti Burroughs (Celti) - Friday, 25 May 2007, 13:59 GMT
I'm not inclined to blame XFS here - I can't duplicate it on multiple XFS drives - some PATA, some SATA.
Comment by Adrián VH (cthulhufhtagn) - Thursday, 07 June 2007, 00:03 GMT
Try checking your memory with memtest86+. It might be faulty even when you don't notice in normal day use. I ran through a similar situation.
Comment by Andreas Hauser (buggs) - Monday, 11 June 2007, 06:13 GMT
Moving the repo to another disk with ext3 seems to have solved the issues.
Comment by Felix (thetrivialstuff) - Wednesday, 08 August 2007, 16:22 GMT
Hi -- have you ever used xfs_fsr on the corrupted data?

I recently recovered from some xfs corruption that I think was caused when I ran xfs_fsr (the defragger) on my /home partition. I've never had problems with xfs before, and I've been using it for about a year; but this was the first time I'd ever tried xfs_fsr. A few hours after the defrag, I noticed some of my mail in my mailboxes looked weird, like there were bits missing. I looked at the mailboxes and found that there were areas that had been blanked out with null bytes!

I wrote a little perl script to look for similar areas (since I could not for the life of me figure out how to make grep look for 0x00 characters...) and found that the problem was shockingly widespread on that partition (lots of text files matched). I also noticed that it only seemed to be affecting files that had been written to since the defrag.
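
A scan of this kind can also be sketched in Python. This is not the perl script mentioned above. The function name `find_null_runs` and the 64-byte threshold are assumptions chosen for illustration.

```python
import re


def find_null_runs(path, min_length=64):
    """Return (offset, length) pairs for runs of at least min_length NUL bytes."""
    with open(path, "rb") as handle:
        data = handle.read()  # simple, but loads the whole file into memory
    # The regex engine interprets \x00 as a NUL byte; {n,} means "n or more".
    pattern = re.compile(rb"\x00{%d,}" % min_length)
    return [(m.start(), m.end() - m.start()) for m in pattern.finditer(data)]
```

An empty list means no long NUL run was found. Binary files and sparse files can legitimately contain long runs of zero bytes, so a hit is only a hint that the file deserves a closer look, ideally against a known-good backup.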

Fortunately, being a person who has trust issues ( ;) ) I had made an incremental backup about 30 seconds before the defrag and was able to patch all the holes in each file from that, so I didn't lose any data (yet).

Anyway, I googled xfs_fsr and null bytes and found a couple discussions about the exact same problem, and in one of them someone had suggested a kernel compilation issue that might make O_DIRECT eat files -- http://osdir.com/ml/file-systems.xfs.general/2003-04/msg00070.html . I e-mailed the two people I'd found with this problem and one of them replied this morning to say that that post was correct -- the problem was fixed when he recompiled his 2.4 kernel.

I'm running a totally different kernel version from what he was -- Linux 2.6.22-ck #1 SMP PREEMPT Sun Jul 22 21:53:24 IST 2007 i686 Intel(R) Pentium(R) 4 CPU 3.20GHz GenuineIntel GNU/Linux -- but it may still be a compilation problem and I'm going to see if I can find some tests that specifically check whether O_DIRECT works (whatever that is exactly; I have no idea myself), and then maybe switch back to something other than the CK kernel.

Oh -- one more thing to note -- that Pentium 4 CPU I've got is dual core and linux sees it as 2 distinct CPU's on bootup. Could there be some kind of multithreading problem during file writes causing issues?

~Felix.

PS: What kind of corruption are your files suffering? Are those changes in checksums being caused by null byte areas? I'll post my bingrep thingie so you can use that to check -- use a command like this with it:

find <mount point of bad xfs filesystem> -mount -type f -exec bingrep.pl -s 000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000 '{}' \; > checkthesefiles
Comment by Andreas Hauser (buggs) - Thursday, 09 August 2007, 06:27 GMT
While I was also using a dual core processor, i was not using xfs_fsr, unless some cron job runs it.

The problems were as described bit flips.
Like
10101001
vs.
00101001
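
To locate flips like this between a good and a bad copy, the two byte sequences can be XORed. The helper below is an illustrative sketch and `differing_bits` is a made-up name. Bit index 0 is the least significant bit and 7 the most significant.

```python
def differing_bits(left, right):
    """Yield (byte_offset, bit_index) for each bit that differs.

    Only the common prefix is compared: zip() stops at the shorter input.
    """
    for offset, (a, b) in enumerate(zip(left, right)):
        delta = a ^ b  # set bits mark the positions where a and b disagree
        for bit in range(8):
            if delta & (1 << bit):
                yield offset, bit


# The example above: only the most significant bit differs.
print(list(differing_bits(bytes([0b10101001]), bytes([0b00101001]))))
# [(0, 7)]
```

Isolated single-bit differences at scattered offsets point in a different direction than whole blocks of zeroed or stale data, which is why the exact pattern is worth recording.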
Comment by James Rayner (iphitus) - Sunday, 16 December 2007, 01:53 GMT
Someone send this upstream to either the kernel bugzilla or LKML if this is still an issue.
Comment by Paul Mattal (paul) - Monday, 31 December 2007, 16:18 GMT
It appears nobody has successfully duplicated this bug, and at least one person (Patrick) has tried.

Andreas, have you let a memtest run overnight on the box and/or are you using ECC mobo and RAM? If the hardware test has actually been done and comes up clean, I'm willing to put in a little more time trying to replicate this.
Comment by Andreas Hauser (buggs) - Monday, 31 December 2007, 17:26 GMT
Yes, memtest ran fine, but no ECC here.

Personally, I just switched to ext3 on the disk. Then 6 months later the disk broke.
One point for the hardware failure case. On the other hand, the pattern of the error
did not really suggest a disk problem, more RAM or CPU cache, which both seem to be
OK in tests and do not cause problems in other areas.

I think, if it is a bug, then it's probably upstream and should be solved there.
Since it happened on x86_64, it is probably not related to the only Arch Linux-specific
condition that I could imagine, i686 optimization.
