FS#32204 - [linux] Incorporate Ted Ts'o patch that MAY fix ext4 corruption

Attached to Project: Arch Linux
Opened by John (graysky) - Wednesday, 24 October 2012, 19:44 GMT
Last edited by Dave Reisner (falconindy) - Thursday, 25 October 2012, 19:14 GMT
Task Type Bug Report
Category Upstream Bugs
Status Closed
Assigned To Tobias Powalowski (tpowa)
Thomas Bächler (brain0)
Architecture All
Severity Medium
Priority Normal
Reported Version
Due in Version Undecided
Due Date Undecided
Percent Complete 100%
Votes 7
Private No

Details

The bug I am referring to is the "Apparent serious progressive ext4 data corruption bug in 3.6.3 (and other stable branches?)" thread, which was posted to lkml earlier. I think the risk/benefit on this issue warrants discussion at the Arch kernel maintainer level. How do people feel about incorporating Ted's patch, which restores the behaviour prior to commit eeecef0af5? In his words, "...we know that my patch definitely restores the behaviour previous to commit eeecef0af5, so it can't hurt, but we do want to make 100% sure that it really fixes the problem."

The patch was posted to lkml here: https://lkml.org/lkml/2012/10/23/690
The quote I lifted from here: https://lkml.org/lkml/2012/10/23/741

I have patched this into 3.6.3 just fine.

Additional info:
* package version(s) 3.6.3-1

Closed by Dave Reisner (falconindy)
Thursday, 25 October 2012, 19:14 GMT
Reason for closing: Won't implement
Additional comments about closing: Nothing to fix.
Comment by Peter Wu (Lekensteyn) - Wednesday, 24 October 2012, 20:11 GMT
Has anyone actually experienced the bug? So far only two people seem to have this issue.

I'd say, have your backups ready and if your filesystem breaks, then join the upstream thread and help debugging this.
Comment by Bill Pickett (headkase) - Wednesday, 24 October 2012, 21:19 GMT
@Peter Wu,

The wider issue is that not all Arch users are trolling the forums every day to come across the thread in "Kernel and Hardware" or even a Linux-related news site.

Ignorance of this issue - especially on a system that appears to be working fine - can sometimes not be so blissful.

A high-visibility notice, say an announcement on the main page, perhaps coupled with reverting the problematic commit might be a course of action that potentially "un-ruins" someone's day. Even just an announcement warning against rapidly mounting/unmounting ext4 partitions in succession would be a nice heads-up.
Comment by Peter Wu (Lekensteyn) - Wednesday, 24 October 2012, 21:49 GMT
@Bill Pickett, I could not reproduce it with:

1. mount (w/ and w/o the mount options in the linked mailing list message [1]) an ext4 partition
2. write a new or an existing file
3. umount
Repeat this 10 times. Adding sleeps does not help. Tested this with a User-mode Linux kernel with busybox only. No corruption.
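
For anyone who wants to repeat the steps above, here is a minimal Python sketch of the mount/write/umount loop. It is only an illustration, not the script actually used for the test: the function name, the `testfile` name, and the `run` parameter are made up for this example, and it needs root plus a scratch ext4 partition you can afford to lose.

```python
import subprocess
from pathlib import Path


def mount_write_umount(device, mountpoint, options=None, cycles=10,
                       run=subprocess.run):
    """Mount `device`, write a file, unmount; repeat `cycles` times.

    `run` defaults to subprocess.run and can be swapped for a stub when
    dry-running. Returns the list of commands issued, in order.
    """
    issued = []
    for i in range(cycles):
        # Step 1: mount, with or without extra mount options.
        mount_cmd = ["mount", "-t", "ext4"]
        if options:
            mount_cmd += ["-o", options]
        mount_cmd += [str(device), str(mountpoint)]
        run(mount_cmd, check=True)
        issued.append(mount_cmd)

        # Step 2: the file is new on the first cycle and existing afterwards.
        target = Path(mountpoint) / "testfile"  # hypothetical file name
        with open(target, "a") as fh:
            fh.write(f"cycle {i}\n")

        # Step 3: unmount again.
        umount_cmd = ["umount", str(mountpoint)]
        run(umount_cmd, check=True)
        issued.append(umount_cmd)
    return issued
```

Called as root with something like `mount_write_umount("/dev/sdX1", "/mnt/test", options="nobarrier")`, where both paths are placeholders, it runs ten cycles. A clean run does not prove the bug is absent; it only shows that this particular sequence did not trigger it.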

There is still not a (reliable) way to reproduce this bug. Imo it's better to wait for upstream to find the cause, prepare and test a patch that fixes the issue for sure.

I am not trying to ignore the issue, just putting it into perspective. People are frightened, maybe for no reason at all.

[1]: https://lkml.org/lkml/2012/10/24/533
Comment by Bill Pickett (headkase) - Wednesday, 24 October 2012, 22:06 GMT
@Peter Wu,

Since you have a volume to test:

http://www.h-online.com/open/news/item/Stable-Linux-kernel-hit-by-ext4-data-corruption-bug-1736110.html

"The bug only manifests when the filesystem's starting block is zero. This will cause the kernel to truncate the journal when the filesystem is unmounted. This situation occurs if the filesystem is mounted and unmounted so quickly, that the journal log does not have a chance to be written completely. The first time this occurs, the ext4 driver can recover the journal which will not lead to any ill effects. Should the same situation arise twice, though, data from the newer mount session will get written to the journal before data from the older one, leading to metadata blocks that "can end up getting very scrambled indeed", according to Ts'o."

Can you replicate with that sequence?
Comment by John (graysky) - Wednesday, 24 October 2012, 23:29 GMT
1) Can someone change my original ticket to show that it is logged against core/linux, not core/linux-headers? (My error: I looked up "x-h" under http://www.archlinux.org/packages to find it faster.)
2) Whether or not a single user can reproduce this is not the issue :p
Comment by Till Matthiesen (high.entropy) - Wednesday, 24 October 2012, 23:59 GMT
Sounds like the patch does not resolve the issue.

https://lkml.org/lkml/2012/10/24/519
Comment by John (graysky) - Thursday, 25 October 2012, 00:18 GMT
Ted posted two patches. Wasn't clear to me which patch nix had applied.

Patch 1: https://lkml.org/lkml/2012/10/23/690
Patch 2: https://lkml.org/lkml/2012/10/24/14
Comment by Dave Reisner (falconindy) - Thursday, 25 October 2012, 00:25 GMT
Until Linus merges something, there's absolutely nothing to be done here.
Comment by John (graysky) - Thursday, 25 October 2012, 00:35 GMT
@Dave - If Patch 1 reverts a previous commit and _may_ solve the issue, I respectfully disagree with your statement.
Comment by Dave Reisner (falconindy) - Thursday, 25 October 2012, 00:45 GMT
Neither patch reverts anything. You're of course entitled to your opinion, but perhaps you should actually read and _understand_ a little more before posting knee-jerk reactionary bug reports.
Comment by Greg (dolby) - Thursday, 25 October 2012, 04:43 GMT
Comment by Tobias Powalowski (tpowa) - Thursday, 25 October 2012, 05:33 GMT
There is no need to hurry; keep cool. When everything is analyzed, we will patch or bump to the next kernel version.
Comment by Peter Wu (Lekensteyn) - Thursday, 25 October 2012, 08:31 GMT
@Bill The H is not a reliable source for this atm. Only those with the right technical insight (ext4 devs) can comment on it. They have now found the rare case where it happens, see the mailing list or the Plus page linked above.

This is not a special bug. In fact, if Michael had not spread panic, nobody would know anything about it. Here is a statement from Ts'o: https://lkml.org/lkml/2012/10/24/535

I agree with Tobias, let's wait for upstream now.
Comment by Nitro (nitr0) - Thursday, 25 October 2012, 08:52 GMT
Well it looks like (thankfully) that was caused by a very weird setup. Hard to know if that is really a bug anymore at this time.

Edit: Sorry I missed that the link above was already posted ;)
Comment by Thomas Bächler (brain0) - Thursday, 25 October 2012, 08:52 GMT
I'll summarize that G+ post that Greg posted earlier. The original problem happened when
a) The reporter used umount -l
b) He shut down before the actual umount finished
c) He used nobarrier (NOT the default)

So, this problem is in fact very rare and anyone here is unlikely to hit it. Has anyone here actually been able to reproduce it? I guess not.
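
Of the three conditions above, only (c) can be checked from the mount table. A rough Python sketch, assuming the usual /proc/mounts field layout (device, mount point, type, options) and assuming that disabled barriers show up as either `nobarrier` or `barrier=0` in the options field; the function name is made up for this example:

```python
def ext4_mounts_without_barriers(mounts_text):
    """Return (device, mountpoint) pairs for ext4 mounts with barriers off.

    `mounts_text` is the content of /proc/mounts (or text in that format).
    """
    found = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        device, mountpoint, fstype, opts = fields[:4]
        if fstype != "ext4":
            continue
        options = opts.split(",")
        # Assumption: both spellings mean write barriers are disabled.
        if "nobarrier" in options or "barrier=0" in options:
            found.append((device, mountpoint))
    return found
```

Running `ext4_mounts_without_barriers(open("/proc/mounts").read())` on a default setup should return an empty list. It says nothing about (a) or (b), which depend on how the machine is unmounted and shut down.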
Comment by John (graysky) - Thursday, 25 October 2012, 19:00 GMT
Thx for the links all. Seems as though likelihood is very very low; wait for an upstream sign off as tpowa suggested.
