FS#54700 - [linux-hardened] Encrypted xfs filesystem fails to boot after upgrading to 4.12

Attached to Project: Community Packages
Opened by K.S. Bhaskar (ksbhaskar) - Tuesday, 04 July 2017, 22:13 GMT
Last edited by Daniel Micay (thestinger) - Tuesday, 11 July 2017, 20:01 GMT
Task Type Bug Report
Category Upstream Bugs
Status Closed
Assigned To Daniel Micay (thestinger)
Architecture All
Severity High
Priority Normal
Reported Version
Due in Version Undecided
Due Date Undecided
Percent Complete 0%
Votes 1
Private No

Details

Description:

On my machines, I have encrypted partitions for /home. After booting, I run a script that executes cryptsetup to make the decrypted partition available under /dev/mapper, e.g., /dev/mapper/home-aes, and then mounts the xfs filesystem on /dev/mapper/home-aes as /home. I have done this for years with Ubuntu and Arch, and it has worked just fine.

However, after applying a set of patches that included the 4.12 kernel, when I attempt the mount, I get errors such as the following:

[ 23.115192] XFS (dm-0): metadata I/O error: block 0x2 ("xfs_trans_read_buf_map") error 5 numblks 1
[ 23.115300] XFS (dm-0): metadata I/O error: block 0x324b002 ("xfs_trans_read_buf_map") error 5 numblks 1
[ 23.115380] XFS (dm-0): metadata I/O error: block 0x6496002 ("xfs_trans_read_buf_map") error 5 numblks 1
[ 23.115459] XFS (dm-0): metadata I/O error: block 0x96e1002 ("xfs_trans_read_buf_map") error 5 numblks 1
[ 23.115468] XFS (dm-0): Corruption of in-memory data detected. Shutting down filesystem
[ 23.115471] XFS (dm-0): Please umount the filesystem and rectify the problem(s)

Running xfs_repair makes no difference - the problem persists. However the same file system mounts just fine on Ubuntu 17.04 (my laptops dual boot).

Additional info:
* package version(s)
* config and/or log files etc.


Steps to reproduce:

Boot and run the following commands:

cryptsetup -c aes -s 256 create home-aes /dev/nvme0n1p5
mount -t xfs -o discard,noatime /dev/mapper/home-aes /home

Closed by  Daniel Micay (thestinger)
Tuesday, 11 July 2017, 20:01 GMT
Reason for closing:  Upstream
Additional comments about closing:  This has been narrowed down as an upstream issue reproducible with CONFIG_SLUB_DEBUG_ON=y or passing the equivalent slub_debug=FZPU on the kernel line.

Since it's specific to neither linux-hardened nor Arch Linux, it needs to be reported and fixed upstream. I'd be willing to backport an upstream fix for it, but otherwise it's not something that I'll be working on.

Let me know when there's an upstream fix and I'll apply it.
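
For anyone retesting on a stock kernel, it may help to confirm that the slub_debug parameter actually reached the running kernel by looking at /proc/cmdline. A minimal sketch, where the function name slub_debug_param is made up for illustration:

```shell
#!/bin/sh
# Print the slub_debug parameter from a kernel command line, if present.
# Reads /proc/cmdline by default; pass a file name to inspect a saved copy.
# The function name is hypothetical, not part of any package.
slub_debug_param() {
    cmdline_file=${1:-/proc/cmdline}
    # Put each parameter on its own line, then keep only the slub_debug token.
    tr -s '[:space:]' '\n' < "$cmdline_file" | grep -E '^slub_debug(=|$)'
}
```

Calling slub_debug_param with no arguments prints the parameter for the current boot and returns non-zero when it is absent.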
Comment by loqs (loqs) - Tuesday, 04 July 2017, 23:42 GMT
This appears to be an upstream issue. Please bisect the kernel and report the offending commit upstream.
Comment by Paul Adams (paul.zrexx12r) - Wednesday, 05 July 2017, 09:01 GMT
Exactly same problem, identical error message(s)... But if I boot the system from "standard" kernel[4.11.7-1], i.e. not hardened [4.12.a-1], all good....

Further, I have an encrypted ext4 partition, as well as encrypted xfs... Only the encrypted xfs have the failure, all other partitions (non-encrypted) mount fine...

Have just upgraded, another system (on same machine), and problem occurs, as above, with same conditions
Comment by loqs (loqs) - Wednesday, 05 July 2017, 10:22 GMT
$ uname -r
4.12.0-1-ARCH
$ cd /tmp/
$ dd if=/dev/zero of=disk.img bs=1M count=256
256+0 records in
256+0 records out
268435456 bytes (268 MB, 256 MiB) copied, 0.056453 s, 4.8 GB/s
$ sudo losetup /dev/loop0 disk.img
$ sudo cryptsetup -y -c aes -s 256 create home-aes /dev/loop0
Enter passphrase:
Verify passphrase:
$ sudo mkfs.xfs /dev/mapper/home-aes
meta-data=/dev/mapper/home-aes   isize=512    agcount=4, agsize=16384 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=0, rmapbt=0, reflink=0
data     =                       bsize=4096   blocks=65536, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0 ftype=1
log      =internal log           bsize=4096   blocks=855, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
$ sudo mount -t xfs -o discard,noatime /dev/mapper/home-aes /mnt
$ sudo umount /mnt
$ sudo cryptsetup close home-aes
$ sudo cryptsetup open /dev/loop0 home-aes -c aes -s 256 --type plain
Enter passphrase:
$ sudo mount -t xfs -o discard,noatime /dev/mapper/home-aes /mnt

This is using 4.12 from mainline with arch's 4.11 config plus all default selections on new options on x86_64
Possibly an issue caused by a linux-hardened specific change.
Edit: Can you replicate my test on linux-hardened to test this please?
Edit2: If you can not replicate my test with linux-hardened can you try to replicate with linux 4.12-1 from staging?
Comment by Daniel Micay (thestinger) - Thursday, 06 July 2017, 05:25 GMT
@paul.zrexx12r: So does this happen to you with staging/linux (4.12-1) or no?
Comment by Paul Adams (paul.zrexx12r) - Thursday, 06 July 2017, 06:11 GMT
Apology if I am being obtuse, but where would I get the staging/linux(4.12-1) kernel from?
My pacman.conf is set for Testing, Core, Extra, Community.
mirrorlist is (mostly) set to Australian servers/mirrors.

Latest kernel on testing is 4.11.9-1, but I have not tried that... (but happy to give it a go, if you want)...

Also, if you want to give me instructions for the staging setup to obtain the 4.12-1 kernel, cool to try that out...

cheers

Comment by loqs (loqs) - Thursday, 06 July 2017, 08:19 GMT
Make sure system is fully updated then
Method 1:
Add a section to pacman.conf above [testing]
[staging]
Include = /etc/pacman.d/mirrorlist
then do pacman -Syy linux linux-headers (this stops the rest of staging being brought in)
then disable/remove the section for staging run pacman -Syyu to avoid partial update.
Method 2:
Copy an enabled http/https mirror entry from /etc/pacman.d/mirrorlist and replace /$repo/os/$arch with /staging/os/x86_64/ (assuming x86_64)
Download linux and linux-headers and install with pacman -U. (as the packages are installed from local files the signatures will not be checked)
Comment by Paul Adams (paul.zrexx12r) - Thursday, 06 July 2017, 11:20 GMT
Thanks for the instructions... All good.

Used method 1. As a result, the standard kernel was "upped" to 4.12-1, and after commenting out the [staging] and [testing] sections (I have never used testing repos, btw) I ran -Syyu and linux-hardened went to 4.12-1b (up from "a")

tested both kernels...

Vanilla type [4.12-1] kernel, sweet as, no problem picking up encrypted xfs or ext4 vols..
Hardened type [4.12-1b] kernel, problem remains, spurious error msgs. as above for the encrypted xfs(s), but no problem with encrypted ext4...

cheers
Comment by Daniel Micay (thestinger) - Sunday, 09 July 2017, 02:34 GMT
This is likely an upstream memory corruption bug uncovered by one of the features.

Try rebuilding the package with CONFIG_SLAB_CANARY=n, CONFIG_SLAB_SANITIZE=n, CONFIG_SLAB_SANITIZE_VERIFY=n, CONFIG_PAGE_SANITIZE=n and CONFIG_PAGE_SANITIZE_VERIFY=n in config.x86_64.
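
The edit to config.x86_64 can be scripted. A rough sketch, assuming GNU sed (for -i) and that each option currently appears as a CONFIG_FOO=... line; the helper name disable_kconfig_opts is made up for illustration:

```shell
#!/bin/sh
# Turn the listed options off in a kernel config file, using the
# "# CONFIG_FOO is not set" form that kernel configs use for disabled options.
# Hypothetical helper; assumes GNU sed for the -i flag.
disable_kconfig_opts() {
    config_file=$1
    shift
    for opt in "$@"; do
        # The trailing "=" keeps CONFIG_FOO from also matching CONFIG_FOO_BAR.
        sed -i "s/^${opt}=.*/# ${opt} is not set/" "$config_file"
    done
}
```

For the options above that would be: disable_kconfig_opts config.x86_64 CONFIG_SLAB_CANARY CONFIG_SLAB_SANITIZE CONFIG_SLAB_SANITIZE_VERIFY CONFIG_PAGE_SANITIZE CONFIG_PAGE_SANITIZE_VERIFY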
Comment by loqs (loqs) - Sunday, 09 July 2017, 14:11 GMT
Using the reproducer from comment 3, the mount fails with those options enabled but succeeds with those options disabled.
@thestinger can you not replicate the issue locally?
Edit:
With CONFIG_PAGE_SANITIZE=y and CONFIG_PAGE_SANITIZE_VERIFY=y added back, the issue is still not reproduced.
Edit2:
With CONFIG_SLAB_CANARY=y added back, the issue is reproduced:
SGI XFS with ACLs, security attributes, realtime, no debug enabled
XFS (dm-2): Mounting V5 Filesystem
Ending clean mount
metadata I/O error: block 0x2 ("xfs_trans_read_buf_map") error 5 numblks 1
XFS (dm-2): metadata I/O error: block 0x20002 ("xfs_trans_read_buf_map") error 5 numblks 1
XFS (dm-2): metadata I/O error: block 0x40002 ("xfs_trans_read_buf_map") error 5 numblks 1
XFS (dm-2): metadata I/O error: block 0x60002 ("xfs_trans_read_buf_map") error 5 numblks 1
XFS (dm-2): Error -5 reserving per-AG metadata reserve pool.
XFS (dm-2): xfs_do_force_shutdown(0x8) called from line 1017 of file fs/xfs/xfs_fsops.c. Return address = 0xffffffffc0979280
XFS (dm-2): Corruption of in-memory data detected. Shutting down filesystem
XFS (dm-2): Please umount the filesystem and rectify the problem(s)
Edit3:
An error under here https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/xfs/libxfs/xfs_ag_resv.c?h=v4.12#n232 ?
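
When comparing many test boots, it can be handy to pull the affected device names out of the kernel log instead of reading it by eye. A small sketch that keys on the shutdown message from the log above; the function name xfs_shutdown_devices is made up for illustration:

```shell
#!/bin/sh
# Read kernel log text on stdin and print each device for which XFS reported
# "Corruption of in-memory data detected", one name per line, without repeats.
# Hypothetical helper; it matches only the message text quoted in this report.
xfs_shutdown_devices() {
    sed -n 's/.*XFS (\([^)]*\)): Corruption of in-memory data detected.*/\1/p' | sort -u
}
```

Typical use would be: dmesg | xfs_shutdown_devices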
Comment by Daniel Micay (thestinger) - Sunday, 09 July 2017, 18:54 GMT
> can you not replicate the issue locally?

Other people are capable of doing work too, and here's an opportunity to do that. I'm fairly convinced it's a memory corruption bug, perhaps being caught because something is checking beyond the bounds of an object and a canary is put there instead of it being padding. It's possible but unlikely that it's a bug in the SLAB_CANARY feature. It works with everything else, including ksize(...).

> CONFIG_SLAB_CANARY=y added back issue reproduced

Try with CONFIG_SLAB_CANARY=y, CONFIG_SLAB_SANITIZE=n, CONFIG_SLAB_SANITIZE_VERIFY=n, CONFIG_PAGE_SANITIZE=n and CONFIG_PAGE_SANITIZE_VERIFY=n.
Comment by loqs (loqs) - Sunday, 09 July 2017, 19:41 GMT
With just CONFIG_SLAB_CANARY=y I cannot reproduce the issue. Please note that I am only exercising my own test case, as I do not use linux-hardened or xfs in normal use.
Edit:
Do you think KASAN would be able to detect the out of bounds access or a kprobe inside xfs_ag_resv_init?
Comment by Daniel Micay (thestinger) - Monday, 10 July 2017, 03:16 GMT
No, but maybe CONFIG_SLUB_DEBUG_ALWAYS_ON would work.
Comment by Daniel Micay (thestinger) - Monday, 10 July 2017, 03:17 GMT
If you can't reproduce it with only CONFIG_SLAB_CANARY=y then what I said doesn't really hold.
Comment by Daniel Micay (thestinger) - Monday, 10 July 2017, 03:17 GMT
You should figure out the minimal set of options where it happens, before anything else.
Comment by loqs (loqs) - Monday, 10 July 2017, 09:47 GMT
Reran the test with only CONFIG_SLAB_CANARY=y and I was able to reproduce the issue.
Edit:
Rebuilt with CONFIG_XFS_WARN=y set as well: the first run did not reproduce the issue, the second run did.
No additional xfs related output generated with /proc/sys/kernel/printk set to 7.
Edit2:
After 10 test runs 9 reproduced the issue.
Edit3:
Rebuilt without CONFIG_XFS_WARN=y and with CONFIG_XFS_DEBUG=y: the first run did not reproduce the issue, the second run did.
No additional xfs related output generated with /proc/sys/kernel/printk set to 7. (identical behavior to that noted in Edit)
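
Since the failure is intermittent, counting failures over a fixed number of runs gives a clearer picture than a single attempt. A minimal sketch of such a loop; count_failures and try_mount are made-up names, and try_mount assumes that /dev/mapper/home-aes is already open, that /mnt is free, and that a failed mount exits non-zero:

```shell
#!/bin/sh
# Run a command N times and report how many runs exited non-zero.
# Hypothetical helper: count_failures N command [args...]
count_failures() {
    runs=$1
    shift
    failures=0
    i=0
    while [ "$i" -lt "$runs" ]; do
        "$@" >/dev/null 2>&1 || failures=$((failures + 1))
        i=$((i + 1))
    done
    echo "$failures/$runs runs failed"
}

# One trial, reusing the names from the reproducer in the earlier comment.
# Needs root. Whether the mapping must be closed and reopened between runs
# to trigger the issue is not established here.
try_mount() {
    mount -t xfs -o discard,noatime /dev/mapper/home-aes /mnt && umount /mnt
}
```

Run as root with: count_failures 10 try_mount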
Comment by Daniel Micay (thestinger) - Tuesday, 11 July 2017, 05:51 GMT
What about with CONFIG_SLUB_DEBUG_ON=y CONFIG_SLAB_CANARY=n, in a configuration where CONFIG_SLAB_CANARY=y would trigger the issue?
Comment by loqs (loqs) - Tuesday, 11 July 2017, 11:55 GMT
CONFIG_SLUB_DEBUG_ON=y, CONFIG_SLAB_CANARY=n, CONFIG_SLAB_SANITIZE=n, CONFIG_SLAB_SANITIZE_VERIFY=n, CONFIG_PAGE_SANITIZE=n and CONFIG_PAGE_SANITIZE_VERIFY=n.
This triggered the issue, with no additional output in dmesg.
Should the issue be reproducible with linux 4.12-2 with slub_debug on the kernel command line? I was not able to reproduce the issue that way, though that is a test sample of one.
Edit:
Reproduced on second run linux 4.12-2 with the command line slub_debug.
So ksbhaskar or paul.zrexx12r can report the issue upstream provided they can replicate my findings.
Edit2:
http://oss.sgi.com/bugzilla/ ( currently appears to have a backend issue )
http://xfs.org/index.php/XFS_FAQ#Q:_Where_can_I_find_documentation_about_XFS.3F ( as above is not functioning try the irc channel linked here )
http://xfs.org/index.php/XFS_FAQ#Q:_What_information_should_I_include_when_reporting_a_problem.3F
Edit3:
"https://bugzilla.kernel.org/buglist.cgi?product=File%20System&component=XFS&resolution=---" (possibly use the kernel bugzilla instead ( quoted as flyspray is not parsing the url correctly ) )
Comment by Daniel Micay (thestinger) - Tuesday, 11 July 2017, 18:36 GMT
Yeah, that means it's a memory corruption bug in the mainline Linux kernel that can be reported upstream. That's good news.
Comment by Daniel Micay (thestinger) - Tuesday, 11 July 2017, 19:34 GMT
@loqs: you could report it upstream yourself since you can replicate it with your simple test
Comment by loqs (loqs) - Tuesday, 11 July 2017, 19:52 GMT
I feel I have already made a fair contribution towards resolving an issue I am not impacted by in my normal system use.
Comment by Daniel Micay (thestinger) - Tuesday, 11 July 2017, 19:57 GMT
Definitely, I'll leave it to ksbhaskar to report it upstream then.
