FS#63733 - [linux] BTRFS dev recommends not yet running 5.2 or 5.3

Attached to Project: Arch Linux
Opened by James Harvey (jamespharvey20) - Thursday, 12 September 2019, 08:26 GMT
Last edited by Jan Alexander Steffens (heftig) - Saturday, 14 September 2019, 12:03 GMT
Task Type Bug Report
Category Packages: Core
Status Closed
Assigned To Jan Alexander Steffens (heftig)
Architecture All
Severity Critical
Priority Normal
Reported Version
Due in Version Undecided
Due Date Undecided
Percent Complete 100%
Votes 1
Private No

Details

BTRFS strikes again.

BTRFS dev Filipe Manana (SUSE):
"So we definitely have a serious regression. Until the fix gets merged to 5.2 kernels (and 5.3), I don't really recommend running 5.2 or 5.3."

Description: A BTRFS regression introduced in 5.2 can cause either:
1. A system hang, which doesn't risk corruption.
2. A BTRFS transaction being committed despite required btree nodes not having been written, which leads to "parent transid verify failed on ..." messages, which are often volume-fatal.
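
To check whether a machine has already hit either effect, the kernel log can be searched for the matching messages. The following is only a rough sketch: the function name `scan_log` and the exact patterns are assumptions for illustration. The "parent transid verify failed on" text is taken from this report, and the hung-task pattern assumes the usual "task ... blocked for more than N seconds" kernel wording. A hung-task message on its own often indicates non-fs problems, so a hit there is only a hint.

```python
import re

# Patterns are illustrative assumptions, not an exhaustive list of symptoms.
PATTERNS = {
    # Effect #2: message quoted in this report.
    "transid": re.compile(r"parent transid verify failed on"),
    # Effect #1: generic hung-task warning; not specific to btrfs.
    "hung_task": re.compile(r"task .+ blocked for more than \d+ seconds"),
}


def scan_log(lines):
    """Return a dict mapping each symptom name to the log lines that match it."""
    hits = {name: [] for name in PATTERNS}
    for line in lines:
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                hits[name].append(line.rstrip("\n"))
    return hits


if __name__ == "__main__":
    import sys

    # Reads kernel log text from standard input.
    for name, matched in scan_log(sys.stdin).items():
        print(f"{name}: {len(matched)} matching line(s)")
```

The script reads log text from standard input, so any saved kernel log can be piped into it.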

I have run into effect #1 (a system hang) in a VM about 10 times under heavy I/O load. I've been tracking it down, initially thinking it was a QEMU bug.

Additional info:
* linux 5.2.x/5.3rc to date
* https://marc.info/?l=linux-btrfs&m=156827465218288&w=2


Steps to reproduce:
1. Use BTRFS
2. Use linux 5.2.x/5.3rc to date, or, it appears, even git master to date
3. Get unlucky
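
As a rough aid for step 2, a kernel release string can be compared against the versions named in this report. This is a minimal sketch under stated assumptions: the function name `listed_in_report` is hypothetical, and the release strings are assumed to look like "5.2.13-arch1-1-ARCH" or "5.3.0-rc8". It knows nothing about distro backports, so a patched kernel such as linux 5.2.14.arch2-1 would still be flagged, and it makes no statement about 5.3 final or git master.

```python
import platform
import re

# Assumed release format: MAJOR.MINOR[.PATCH][-rcN] followed by anything else.
_RELEASE = re.compile(r"^(\d+)\.(\d+)(?:\.(\d+))?(-rc\d+)?")


def listed_in_report(release: str) -> bool:
    """True if the release is a 5.2.x or 5.3-rc kernel, as named in the report.

    This does not detect backported fixes; it only compares version numbers.
    """
    match = _RELEASE.match(release)
    if match is None:
        return False
    major, minor = int(match.group(1)), int(match.group(2))
    is_rc = match.group(4) is not None
    if (major, minor) == (5, 2):
        return True
    return (major, minor) == (5, 3) and is_rc


if __name__ == "__main__":
    # platform.release() returns the same string as `uname -r`.
    release = platform.release()
    print(release, "listed in report:", listed_in_report(release))
```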


I've asked in the linked mailing list thread for recommendations to distros and users regarding backporting vs. downgrading.

Closed by  Jan Alexander Steffens (heftig)
Saturday, 14 September 2019, 12:03 GMT
Reason for closing:  Fixed
Additional comments about closing:  linux 5.2.14.arch2-1
Comment by James Harvey (jamespharvey20) - Thursday, 12 September 2019, 09:27 GMT
After seeing the couple of responses that have come in for distro/user recommendations, it looks to me like the best thing to do would be for Arch to apply https://patchwork.kernel.org/patch/11141559/
Comment by James Harvey (jamespharvey20) - Thursday, 12 September 2019, 11:04 GMT
Sorry, this should have been:
* on Packages: Core
* critical
Comment by Jan Alexander Steffens (heftig) - Thursday, 12 September 2019, 14:02 GMT
Backported in linux 5.2.14.arch2-1.
Comment by Maxim (mxfm) - Thursday, 12 September 2019, 18:59 GMT
This issue is overblown.

Some background and a brief description from the btrfs mailing list. Approximately in late summer, one user started a discussion about btrfs data corruption after updating to 5.2. In late June, another user reported data corruption after running 5.2 for some time. After fixing his problem, the second user continued running 5.2 without issues. His message, "I am running 5.2 and everything currently is OK", was sent in late August. This issue seemed to be resolved. Afterwards, the discussion switched to a relatively separate issue about spurious space cache warnings after running 5.2. Several users said that they had received such space cache warnings (I also found such a message in my journal log). One additional user reported data corruption. Some days ago, one developer (not a very high-volume contributor) claimed that there is a critical regression and proposed a patch.

Please note that there are currently several reported cases of data corruption after switching to 5.2 or running 5.2 for some time. In addition, there are several messages about space cache warnings, which do no harm. Kernel 5.2 was released some time ago, so the number of btrfs users running 5.2 without any issue is far higher than 3. The circumstances and conditions that trigger data corruption are not understood. The proposed patch has not been reviewed yet.
Comment by James Harvey (jamespharvey20) - Friday, 13 September 2019, 03:53 GMT
mxfm, could be, I don't have a strong opinion on if it's overblown. For sure, you have to get a bit unlucky. At least for me, it was a huge pain to track this down to btrfs. It's intermittent, and only happens after about an hour of heavy I/O. I replicated a similar looking lockup when switching the volume to XFS, as a test. Who knows how many users have experienced it with a lockup, and haven't looked into the cause or been vocal about it. The message of "task... blocked for more than..." often indicates non-fs problems. I also have no idea how many users would report a killed fs versus just get mad and move on. Luckily, it really looks like it wouldn't cause any silent corruption, so everyone who hasn't experienced a problem should not be worried.

I also wish the proposed patch had comments from other btrfs devs by now. Granted, it's only been 1.5 days. I can only say 2 things. 1) I've been running the proposed patch for quite a few hours now without a lockup. With intermittent issues, you never know for sure, but it absolutely appears to have fixed the problems I've been having. 2) I'm glad I'm not making the call on whether to include this patch or not, as it could always be premature.
Comment by loqs (loqs) - Friday, 13 September 2019, 15:01 GMT
