FS#45129 - [linux] deadlock with stacked loop devices
Attached to Project:
Arch Linux
Opened by Christian Hesse (eworm) - Friday, 29 May 2015, 13:06 GMT
Last edited by Evangelos Foutras (foutrelis) - Tuesday, 30 June 2015, 08:09 GMT
Details
Description:
Linux 4.0.x suffers from a problem with stacked loop devices [0]. All loop block devices are handled in one worker thread, which can deadlock. The system hangs and log messages like these are printed:

INFO: task kworker/u#:#:### blocked for more than 120 seconds
INFO: task kloopd/### blocked for more than 120 seconds

The official install media is (or will be) hit by this problem.

Patches are available [1][2] and queued for linux 4.2 and stable (Cc: stable@vger.kernel.org). I've updated the patches to apply cleanly to linux 4.0.4 [3][4].

[0] http://marc.info/?l=linux-kernel&m=143280649731902&w=2
[1] http://git.kernel.org/cgit/linux/kernel/git/axboe/linux-block.git/commit/?id=f4aa4c7b
[2] http://git.kernel.org/cgit/linux/kernel/git/axboe/linux-block.git/commit/?id=4d4e41ae
[3] http://www.eworm.de/download/linux/linux-0001-loop.patch
[4] http://www.eworm.de/download/linux/linux-0002-loop.patch

Additional info:
linux 4.0.4-2
Closed by Evangelos Foutras (foutrelis)
Tuesday, 30 June 2015, 08:09 GMT
Reason for closing: Fixed
Additional comments about closing: linux 4.0.7-2
Unlike [0] I was not using copytoram
It happened to me using archlinux-2015.06.01-dual.iso (Not tainted 4.0.4-2-ARCH #1),
which printed these messages to the console:
INFO: task kloopd:### blocked for more than 120 seconds.
INFO: task systemd-udevd:### blocked for more than 120 seconds.
INFO: task kworker/u##:##:#### blocked for more than 120 seconds.
(with kworker repeated 5 times)
Then 120 seconds later each was repeated in the same order one time
(after that no further console messages)
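For anyone collecting the same evidence, here is a minimal sketch of how the hung-task warnings above could be counted per task in a saved kernel log. It assumes the usual "INFO: task NAME:PID blocked for more than N seconds" format; the function name hung_tasks and the PIDs in the usage example are made up for illustration, since the real PIDs are masked with # in this report.

```python
import re
from collections import Counter

# Assumed format of the kernel hung-task warning, e.g.
# "INFO: task kloopd:321 blocked for more than 120 seconds."
# The greedy name group keeps colons that are part of the task name
# (as in "kworker/u16:3") and leaves the final ":PID" for the pid group.
HUNG_TASK = re.compile(
    r"INFO: task (?P<comm>.+):(?P<pid>\d+) "
    r"blocked for more than (?P<secs>\d+) seconds"
)


def hung_tasks(log_text):
    """Return a Counter mapping task name to its number of hung-task warnings."""
    counts = Counter()
    for line in log_text.splitlines():
        match = HUNG_TASK.search(line)
        if match:
            counts[match.group("comm")] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical sample lines; PIDs and timestamps are invented.
    sample = (
        "[  240.1] INFO: task kloopd:321 blocked for more than 120 seconds.\n"
        "[  240.2] INFO: task kworker/u16:3:1234 blocked for more than 120 seconds.\n"
    )
    for name, count in hung_tasks(sample).items():
        print(name, count)
```

The text to scan could come from a saved copy of the dmesg or journal output; how the log is captured is left open here.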
The system has not hung completely
processes in remote (ssh) sessions are still active
(top, iotop)
The console is still active
(presumably only for commands which don't trigger activity through the loop device)
Any command which triggers activity through the loop device locks hard
System load at the time was extremely heavy I/O:
badblocks to 6 devices ~470 M/s sustained write
CPU bound openssl encryption writing to 3 dm-crypt/LUKS devices
Effectively 2015.06.01 is too fragile to be used to install Arch
Is it possible to release a new ISO with a patched kernel?
Or is the only alternative to use an older ISO?
Try1: Working with small shell scripts in /root triggered dm-0 and EXT4 to fail
[10697.019711] device-mapper: snapshots: Invalidating snapshot: Unable to allocate exception.
[10697.021529] EXT4-fs warning (device dm-0): ext4_end_bio:317: I/O error -5 writing to inode 1968623 (offset 0 size 16384 starting block 7907290)
[10697.021540] EXT4-fs warning (device dm-0): ext4_end_bio:317: I/O error
...
[11093.489824] EXT4-fs error (device dm-0): ext4_find_entry:1289: inode #917521: comm bash: reading directory lblock 0
[11093.495356] EXT4-fs (dm-0): previous I/O error to superblock detected
[11093.501021] Buffer I/O error on dev dm-0, logical block 0, lost sync page write
[11093.506834] EXT4-fs error (device dm-0): ext4_read_inode_bitmap:185: comm bash: Cannot read inode bitmap - block_group = 112, inode_bitmap = 3670032
...
[11091.153099] EXT4-fs (dm-0): This should not happen!! Data will be lost
tmpfs files were not impacted
Try2 & 3: Booting using copytoram
never made it to a shell prompt on either try
hung during booting with "task kworker" while cycling through 3 start jobs that never complete
(4.0.5-1-ARCH from nepulinux-2015.06.08.16.20.iso)
Try4: Booting 2015-06-01 with minimal use of anything but tmpfs
This worked for a while but eventually locked up with kworker
Try5: Booting 2015-05-01
Did not seem to exhibit the problem - but I did not put it through as much testing
Try6: Booting the latest Nepu
This worked flawlessly even with copytoram
On a purely read load across 9 devices it held ~980M/s sustained with peaks over 1000M/s
with concurrent usage of the loop based filesystems (read/write)
I have been using the patches for about a month without issues.
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=f4aa4c7b
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=4d4e41ae