FS#46002 - [linux] null pointer dereference with RAID5
Attached to Project:
Arch Linux
Opened by Christian Hesse (eworm) - Monday, 17 August 2015, 08:31 GMT
Last edited by Tobias Powalowski (tpowa) - Monday, 28 September 2015, 06:58 GMT
Details
Description:
Linux 4.1.x can crash hard with a null pointer dereference when using RAID5. I was hit twice since running 4.1.x.

Citing Neil Brown from his patch [0]:

> Cache size can grow or shrink due to various pressures at
> any time. So when we resize the cache as part of a 'grow'
> operation (i.e. change the size to allow more devices) we need
> to blocks that automatic growing/shrinking.
>
> So introduce a mutex. auto grow/shrink uses mutex_trylock()
> and just doesn't bother if there is a blockage.
> Resizing the whole cache holds the mutex to ensure that
> the correct number of new stripes is allocated.
>
> This bug can result in some stripes not being freed when an
> array is stopped. This leads to the kmem_cache not being
> freed and a subsequent array can try to use the same kmem_cache
> and get confused.

Should be sufficient to apply the patch by Neil Brown [0], but it does not apply cleanly to 4.1.x. Either we have to backport it or apply a series by Yuanhan Liu [1][2][3] in preparation.

[0] https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/patch/?id=2d5b569b
[1] https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/patch/?id=9f3520c3
[2] https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/patch/?id=b1b46486
[3] https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/patch/?id=e9e4c377

Additional info:
linux 4.1.5-1
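For readers unfamiliar with the locking scheme the quoted commit message describes, here is a rough userspace sketch in Python. It is only an analogy, not the kernel code: `StripeCache`, `resize_lock`, `auto_adjust` and `resize_all` are made-up names, and a non-blocking `threading.Lock.acquire()` stands in for the kernel's `mutex_trylock()`.

```python
import threading


class StripeCache:
    """Toy stand-in for the RAID5 stripe cache; not the kernel data structure."""

    def __init__(self, size):
        self.size = size
        # Hypothetical name; plays the role of the mutex the patch introduces.
        self.resize_lock = threading.Lock()

    def auto_adjust(self, delta):
        """Opportunistic grow/shrink: give up if a resize is in progress."""
        # A non-blocking acquire is the userspace analogue of mutex_trylock().
        if not self.resize_lock.acquire(blocking=False):
            return False  # blockage, so just don't bother
        try:
            self.size = max(0, self.size + delta)
            return True
        finally:
            self.resize_lock.release()

    def resize_all(self, new_size):
        """Full resize: hold the lock so automatic adjustments stay out."""
        with self.resize_lock:
            self.size = new_size


cache = StripeCache(256)
adjusted_while_free = cache.auto_adjust(+8)   # lock is free, adjustment runs

cache.resize_lock.acquire()                   # simulate a resize in progress
adjusted_while_held = cache.auto_adjust(+8)   # skipped, size stays unchanged
cache.resize_lock.release()

cache.resize_all(512)
print(adjusted_while_free, adjusted_while_held, cache.size)
```

The point of the trylock is that the automatic path never waits: if the whole-cache resize holds the lock, the automatic adjustment is simply dropped, so the resize can count on the number of stripes it allocates.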
Closed by Tobias Powalowski (tpowa)
Monday, 28 September 2015, 06:58 GMT
Reason for closing: Fixed
Additional comments about closing: 4.2.1-1
Another search made me stumble upon this:
https://lists.manjaro.org/pipermail/manjaro-dev/Week-of-Mon-20150727/000557.html
Just compiling to give it a shot.
What we need is another patch by Neil Brown [0][1].
[0] http://marc.info/?l=linux-raid&m=144039460103982&w=2
[1] https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/patch/?id=49895bcc
Once during a copy operation after 15GiB and once during a pvcreate on another (freshly created) raid5.
Both on up-to-date 4.1.6-1-ARCH.
The crash during the copy operation is the more critical one, as I was copying *from* the raid5. So this can happen during normal usage and normal read operations.
After some research I found the fixes from Neil Brown sent to GKH. They are still not in 4.1.7. I guess they will be in 4.1.8.
They are in 4.2.0.
Some more details:
http://comments.gmane.org/gmane.linux.raid/49638
http://permalink.gmane.org/gmane.linux.kernel.commits.head/538207
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=49895bcc7e566ba455eb2996607d6fbd3447ce16
Comment: "stable@vger.kernel.org (4.1 - please release with 2d5b569b665)"
Which is:
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=2d5b569b665
Comment: "stable@vger.kernel.org (4.1 - please delay until 2 weeks after release of 4.2)"
Thanks to Christian Hesse for bringing this up to Neil Brown.
Can we get these patches applied? It will take some time before we get 4.1.8 (if we get it at all) or 4.2.X in the stable repos, and this bug is hitting normal usage, even during read operations.
[35248.469766] BUG: unable to handle kernel NULL pointer dereference at (null)
[35248.469837] IP: [<ffffffffa015bb91>] get_free_stripe+0x31/0xf0 [raid456]
[...]
[35248.477237] RIP [<ffffffffa015bb91>] get_free_stripe+0x31/0xf0 [raid456]
[35248.478229] RSP <ffff8801bfddb718>
[35248.479200] CR2: 0000000000000000
[35248.483013] ---[ end trace 4a3497943502ed7e ]---
[35248.483955] note: cp[30471] exited with preempt_count 1
[73643.618384] RIP [<ffffffffa015ffb0>] __find_stripe+0x30/0xc0 [raid456]
[73643.618386] RSP <ffff880004193778>
[73643.618394] ---[ end trace 4a3497943502ed7f ]---
[73643.618401] note: pvcreate[10434] exited with preempt_count 1
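To compare traces like these across reports, the faulting function and module can be pulled out of the `IP:`/`RIP` lines. The following is a small sketch, assuming the line format shown in the log above (other kernel versions print it differently); `RIP_RE` and `parse_rip` are names invented for this example.

```python
import re

# Matches lines such as:
#   [35248.477237] RIP [<ffffffffa015bb91>] get_free_stripe+0x31/0xf0 [raid456]
RIP_RE = re.compile(
    r"\b(?:RIP|IP:)\s+\[<(?P<addr>[0-9a-f]+)>\]\s+"
    r"(?P<func>[\w.]+)\+0x(?P<off>[0-9a-f]+)/0x(?P<size>[0-9a-f]+)"
    r"(?:\s+\[(?P<module>\w+)\])?"
)


def parse_rip(line):
    """Return function, offset, size and module for an IP/RIP line, else None."""
    m = RIP_RE.search(line)
    if m is None:
        return None
    return {
        "func": m.group("func"),
        "offset": int(m.group("off"), 16),
        "size": int(m.group("size"), 16),
        "module": m.group("module"),  # None for code built into the kernel
    }


# Lines copied from the log in this report.
log = [
    "[35248.469837] IP: [<ffffffffa015bb91>] get_free_stripe+0x31/0xf0 [raid456]",
    "[35248.478229] RSP <ffff8801bfddb718>",
    "[73643.618384] RIP [<ffffffffa015ffb0>] __find_stripe+0x30/0xc0 [raid456]",
]

# Keep only the lines that parsed, then collect the distinct crash sites.
hits = [hit for hit in map(parse_rip, log) if hit is not None]
crash_sites = sorted({(hit["module"], hit["func"]) for hit in hits})
print(crash_sites)
```

Both crashes here are in raid456, in different functions of the stripe cache code.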
Can we somehow link this report to https://www.archlinux.org/packages/testing/x86_64/linux-lts/ ?
At the moment it has no reported bugs.
Btw. 4.1.8 is out at kernel.org and still misses the fix, I don't know why.
My server is crashing every two days without it. That's not a good base for linux-lts.