FS#45129 - [linux] deadlock with stacked loop devices

Attached to Project: Arch Linux
Opened by Christian Hesse (eworm) - Friday, 29 May 2015, 13:06 GMT
Last edited by Evangelos Foutras (foutrelis) - Tuesday, 30 June 2015, 08:09 GMT
Task Type Bug Report
Category Kernel
Status Closed
Assigned To Tobias Powalowski (tpowa)
Thomas Bächler (brain0)
Evangelos Foutras (foutrelis)
Architecture All
Severity Medium
Priority Normal
Reported Version
Due in Version Undecided
Due Date Undecided
Percent Complete 100%
Votes 1
Private No

Details

Description:
Linux 4.0.x suffers from a problem with stacked loop devices [0]. All loop block devices are handled in one worker thread, which can deadlock. The system hangs and log messages like these are printed:

INFO: task kworker/u#:#:### blocked for more than 120 seconds
INFO: task kloopd/### blocked for more than 120 seconds
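
To check a saved kernel log for these warnings, something like the following sketch may help. It assumes the log text was captured beforehand (for example from dmesg) and that real lines carry numeric PIDs in the form "task NAME:PID", unlike the masked placeholders shown here. The function names are made up for illustration.

```python
import re
from collections import Counter

# Matches hung-task warnings such as:
#   INFO: task kworker/u16:3:4567 blocked for more than 120 seconds.
# The task name may itself contain ':' (kworker names do), so the greedy
# name group is followed by the last ":<digits>" before " blocked".
HUNG_TASK = re.compile(
    r"INFO: task (?P<comm>.+):(?P<pid>\d+) "
    r"blocked for more than (?P<secs>\d+) seconds"
)


def find_hung_tasks(log_text):
    """Return a list of (task name, pid, seconds) for each warning found."""
    hits = []
    for line in log_text.splitlines():
        match = HUNG_TASK.search(line)
        if match:
            hits.append(
                (match.group("comm"), int(match.group("pid")), int(match.group("secs")))
            )
    return hits


def count_by_task(log_text):
    """Count warnings per task name, e.g. to see whether kloopd is involved."""
    return Counter(comm for comm, _pid, _secs in find_hung_tasks(log_text))
```

Seeing kloopd among the reported tasks would be consistent with this bug, but it is not proof on its own; other causes of blocked tasks produce the same warning.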

The official install media is (or will be) hit by this problem.

Patches are available [1][2] and queued for linux 4.2 and stable (Cc: stable@vger.kernel.org). I've updated the patches to apply cleanly to linux 4.0.4 [3][4].

[0] http://marc.info/?l=linux-kernel&m=143280649731902&w=2
[1] http://git.kernel.org/cgit/linux/kernel/git/axboe/linux-block.git/commit/?id=f4aa4c7b
[2] http://git.kernel.org/cgit/linux/kernel/git/axboe/linux-block.git/commit/?id=4d4e41ae
[3] http://www.eworm.de/download/linux/linux-0001-loop.patch
[4] http://www.eworm.de/download/linux/linux-0002-loop.patch

Additional info:
linux 4.0.4-2

Closed by  Evangelos Foutras (foutrelis)
Tuesday, 30 June 2015, 08:09 GMT
Reason for closing:  Fixed
Additional comments about closing:  linux 4.0.7-2
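
A rough way to tell whether a running Arch kernel predates the fixed package is sketched below. It assumes release strings of the form "4.0.4-2-ARCH" (as quoted in this task) and only makes a statement about the 4.0 series; it cannot detect locally patched builds, and the function names are hypothetical.

```python
import re

# First fixed Arch package according to the closing comment: linux 4.0.7-2
FIXED = (4, 0, 7, 2)


def parse_arch_release(release):
    """Split e.g. '4.0.4-2-ARCH' into (4, 0, 4, 2); return None if it does not fit."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)-(\d+)", release)
    if match is None:
        return None
    return tuple(int(part) for part in match.groups())


def predates_fix(release):
    """True/False for 4.0.x kernels; None when this report says nothing about the series."""
    version = parse_arch_release(release)
    if version is None or version[:2] != (4, 0):
        return None
    return version < FIXED


# Usage on a live system (platform.release() returns the `uname -r` string):
#   import platform
#   print(predates_fix(platform.release()))
```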
Comment by Tido@Tido.com (Tido.com) - Sunday, 07 June 2015, 00:32 GMT
I can confirm that the official install media does have this problem.

Unlike [0] I was not using copytoram

It happened to me using archlinux-2015.06.01-dual.iso (Not tainted 4.0.4-2-ARCH #1)
and produced these to the console:

INFO: task kloopd:### blocked for more than 120 seconds.
INFO: task systemd-udevd:### blocked for more than 120 seconds.
INFO: task kworker/u##:##:#### blocked for more than 120 seconds.
(with kworker repeated 5 times)

Then 120 seconds later each was repeated in the same order one time
(after that no further console messages)

The system has not hung completely

processes in remote (ssh) sessions are still active
(top, iotop)

The console is still active
(presumably only for commands which don't trigger activity through the loop device)

Any command which triggers activity through the loop device locks hard

System load at the time was extremely heavy I/O:
badblocks to 6 devices ~470 M/s sustained write
CPU bound openssl encryption writing to 3 dm-crypt/LUKS devices
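
When the console still responds but commands touching the loop device lock hard, listing tasks in uninterruptible sleep can show what is stuck. The sketch below reads /proc/PID/stat and reports tasks in state "D". Treating "D" as the state of the blocked tasks is an assumption about this particular hang, and the helper names are illustrative.

```python
import os


def parse_stat(stat_text):
    """Parse the start of a /proc/PID/stat line: 'pid (comm) state ...'.

    The command name sits in parentheses and may itself contain spaces or
    parentheses, so the closing one is located from the right.
    """
    lpar = stat_text.index("(")
    rpar = stat_text.rindex(")")
    pid = int(stat_text[:lpar].strip())
    comm = stat_text[lpar + 1:rpar]
    state = stat_text[rpar + 1:].split()[0]
    return pid, comm, state


def uninterruptible_tasks(proc_root="/proc"):
    """Return sorted (pid, comm) pairs for tasks currently in state 'D'."""
    found = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "stat")) as handle:
                pid, comm, state = parse_stat(handle.read())
        except (OSError, ValueError, IndexError):
            # The task exited while being read, or the line was unexpected.
            continue
        if state == "D":
            found.append((pid, comm))
    return sorted(found)
```

Running this from a tmpfs-backed shell or a remote session avoids triggering further loop device I/O, which matters because the script itself must not block on the hung device.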
Comment by Tido@Tido.com (Tido.com) - Sunday, 07 June 2015, 16:54 GMT
After two more attempts to install this server with the 2015.06.01 ISO the results were similarly bad under even minimal load

Effectively 2015.06.01 is too fragile to be used to install Arch

Is it possible to release a new ISO with a patched kernel?
Or is the only alternative to use an older ISO?

Try1: Working with small shell scripts in /root triggered dm-0 and EXT4 to fail

[10697.019711] device-mapper: snapshots: Invalidating snapshot: Unable to allocate exception.
[10697.021529] EXT4-fs warning (device dm-0): ext4_end_bio:317: I/O error -5 writing to inode 1968623 (offset 0 size 16384 starting block 7907290)
[10697.021540] EXT4-fs warning (device dm-0): ext4_end_bio:317: I/O error

...

[11093.489824] EXT4-fs error (device dm-0): ext4_find_entry:1289: inode #917521: comm bash: reading directory lblock 0
[11093.495356] EXT4-fs (dm-0): previous I/O error to superblock detected
[11093.501021] Buffer I/O error on dev dm-0, logical block 0, lost sync page write
[11093.506834] EXT4-fs error (device dm-0): ext4_read_inode_bitmap:185: comm bash: Cannot read inode bitmap - block_group = 112, inode_bitmap = 3670032

...

[11091.153099] EXT4-fs (dm-0): This should not happen!! Data will be lost

tmpfs files were not impacted


Try2 & 3: Booting using copytoram

never made it to a shell prompt on either try

hung during booting with "task kworker" while cycling through 3 start jobs that never complete

Comment by Tido@Tido.com (Tido.com) - Wednesday, 10 June 2015, 11:14 GMT
I made additional tests and the short version is that using Christian's patched kernel solves the problem
(4.0.5-1-ARCH from nepulinux-2015.06.08.16.20.iso)

Try4: Booting 2015-06-01 with minimal use of anything but tmpfs

This worked for a while but eventually locked up with kworker

Try5: Booting 2015-05-01

Did not seem to exhibit the problem - but I did not put it through as much testing

Try6: Booting the latest Nepu

This worked flawlessly even with copytoram

On a purely read load across 9 devices it held ~980M/s sustained with peaks over 1000M/s
with concurrent usage of the loop based filesystems (read/write)
Comment by Christian Hesse (eworm) - Thursday, 18 June 2015, 07:07 GMT
The next ISO release is in less than two weeks... Any chance of getting a fixed linux build before then?
I have been using the patches for about a month without issues.
Comment by Christian Hesse (eworm) - Wednesday, 24 June 2015, 07:59 GMT
This did not make its way to 4.0.6-1... Looks like we will see another borked ISO release.
Comment by Christian Hesse (eworm) - Friday, 26 June 2015, 11:08 GMT
