FS#32204 - [linux] Incorporate Ted Ts'o patch that MAY fix ext4 corruption
Attached to Project:
Arch Linux
Opened by John (graysky) - Wednesday, 24 October 2012, 19:44 GMT
Last edited by Dave Reisner (falconindy) - Thursday, 25 October 2012, 19:14 GMT
Details
The bug I am referring to is the "Apparent serious progressive ext4 data corruption bug in 3.6.3 (and other stable branches?)" which was posted to lkml earlier. I think the risk/benefit on this issue warrants discussion at the Arch kernel maintainer level. How do people feel about incorporating Ted's patch restoring the behaviour prior to commit eeecef0af5? In his words, "...we know that my patch definitely restores the behaviour previous to commit eeecef0af5, so it can't hurt, but we do want to make 100% sure that it really fixes the problem."
The patch was posted to lkml here: https://lkml.org/lkml/2012/10/23/690
The quote above is from: https://lkml.org/lkml/2012/10/23/741
I have patched this into 3.6.3 just fine.
Additional info:
* package version(s): 3.6.3-1
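For reference, a minimal sketch of applying the lkml patch to an unpacked 3.6.3 source tree before rebuilding; the local file name ext4-fix.patch is an assumption (the patch has to be saved out of the lkml post first):

    cd linux-3.6.3                            # unpacked kernel source tree
    patch -p1 --dry-run < ../ext4-fix.patch   # check that it applies cleanly
    patch -p1 < ../ext4-fix.patch             # apply it for real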
This task depends upon
Closed by Dave Reisner (falconindy)
Thursday, 25 October 2012, 19:14 GMT
Reason for closing: Won't implement
Additional comments about closing: Nothing to fix.
I'd say: have your backups ready, and if your filesystem breaks, join the upstream thread and help debug it.
The wider issue is that not all Arch users trawl the forums every day and come across the thread in "Kernel and Hardware", or follow a Linux-related news site.
Ignorance of this issue - especially on a system that appears to be working fine - is not always so blissful.
A high-visibility notice, say an announcement on the main page, perhaps coupled with reverting the problematic commit, might be a course of action that "un-ruins" someone's day. Even an announcement warning against rapidly mounting and unmounting ext4 partitions in succession would be a nice heads-up.
1. mount an ext4 partition (with and without the mount options from the linked mailing list message)
2. write to a new or an existing file
3. umount
Repeat this 10 times. Adding sleeps does not help. I tested this with a User Mode Linux kernel running only busybox. No corruption.
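A scripted version of that loop, as a rough sketch; the device and mount point are placeholders, the extra mount options from the linked message can be added to the mount line, and this should only be run on a scratch filesystem:

    #!/bin/bash
    # Repro attempt: mount, write to, and umount an ext4 partition ten times.
    DEV=/dev/sdXn    # placeholder: use a disposable test partition
    MNT=/mnt/test
    set -e
    for i in $(seq 1 10); do
        mount -t ext4 "$DEV" "$MNT"            # optionally add the mount options from the thread
        echo "iteration $i" > "$MNT/testfile"  # write to a new or existing file
        umount "$MNT"
    done
    e2fsck -f -n "$DEV"                        # read-only check for damage afterwards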
There is still no (reliable) way to reproduce this bug. Imo it's better to wait for upstream to find the cause and to prepare and test a patch that fixes the issue for sure.
I am not trying to ignore the issue, just to put it in perspective. People are frightened, maybe for no reason at all.
[1]: https://lkml.org/lkml/2012/10/24/533
Since you have a volume to test:
http://www.h-online.com/open/news/item/Stable-Linux-kernel-hit-by-ext4-data-corruption-bug-1736110.html
"The bug only manifests when the filesystem's starting block is zero. This will cause the kernel to truncate the journal when the filesystem is unmounted. This situation occurs if the filesystem is mounted and unmounted so quickly, that the journal log does not have a chance to be written completely. The first time this occurs, the ext4 driver can recover the journal which will not lead to any ill effects. Should the same situation arise twice, though, data from the newer mount session will get written to the journal before data from the older one, leading to metadata blocks that "can end up getting very scrambled indeed", according to Ts'o."
Can you replicate with that sequence?
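For anyone who wants to try exactly that, a hedged sketch of the back-to-back mount/unmount cycle the article describes (placeholder device and mount point, disposable filesystem only; whether the journal is actually left unwritten depends on timing and hardware):

    DEV=/dev/sdXn    # placeholder: disposable test partition
    MNT=/mnt/test
    for i in 1 2; do                 # the article says the damage needs two such cycles
        mount -t ext4 "$DEV" "$MNT"
        touch "$MNT/marker-$i"       # a little journal activity
        umount "$MNT"                # unmount again immediately, no sync or pause
    done
    e2fsck -f -n "$DEV"              # inspect the result without modifying it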
2) Whether or not a single user can reproduce this is not the issue :p
https://lkml.org/lkml/2012/10/24/519
Patch 1: https://lkml.org/lkml/2012/10/23/690
Patch 2: https://lkml.org/lkml/2012/10/24/14
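For anyone testing these locally against the stable tree, a hedged sketch assuming the two messages above have been saved as mbox/patch files under the made-up names ext4-fix-1.patch and ext4-fix-2.patch:

    # in a git checkout of the stable kernel (linux-stable)
    git checkout v3.6.3                          # the affected release
    git am ext4-fix-1.patch ext4-fix-2.patch     # apply Patch 1 and Patch 2 as commits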
This is not a special bug. In fact, if Michael had not spread panic, nobody would know anything about it. Here is a statement from Ts'o: https://lkml.org/lkml/2012/10/24/535
I agree with Tobias, let's wait for upstream now.
Edit: Sorry I missed that the link above was already posted ;)
a) The reporter used umount -l
b) He shut down before the actual umount finished
c) He used nobarrier (NOT the default)
So, this problem is in fact very rare and anyone here is unlikely to hit it. Has anyone here actually been able to reproduce it? I guess not.
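For completeness, a sketch of conditions (a)-(c) above; nobarrier and umount -l (lazy unmount) are the deviations from a default setup, and since the last step forces an immediate power-off, this is only something to try inside a throwaway VM with a disposable image (device and mount point are placeholders):

    # throwaway VM only; /dev/vdb1 and /mnt/test are placeholders
    mount -t ext4 -o nobarrier /dev/vdb1 /mnt/test     # (c) nobarrier is NOT the default
    dd if=/dev/zero of=/mnt/test/junk bs=1M count=64   # create some dirty data
    umount -l /mnt/test                                # (a) lazy unmount returns immediately
    poweroff -f                                        # (b) power off before the unmount really finishes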