FS#46002 - [linux] null pointer dereference with RAID5

Attached to Project: Arch Linux
Opened by Christian Hesse (eworm) - Monday, 17 August 2015, 08:31 GMT
Last edited by Tobias Powalowski (tpowa) - Monday, 28 September 2015, 06:58 GMT
Task Type Bug Report
Category Kernel
Status Closed
Assigned To Tobias Powalowski (tpowa)
Thomas Bächler (brain0)
Architecture All
Severity Medium
Priority Normal
Reported Version
Due in Version Undecided
Due Date Undecided
Percent Complete 100%
Votes 3
Private No

Details

Description:
Linux 4.1.x can crash hard with a null pointer dereference when using RAID5. I have been hit twice since I started running 4.1.x. Citing Neil Brown from his patch [0]:

> Cache size can grow or shrink due to various pressures at
> any time. So when we resize the cache as part of a 'grow'
> operation (i.e. change the size to allow more devices) we need
> to blocks that automatic growing/shrinking.
>
> So introduce a mutex. auto grow/shrink uses mutex_trylock()
> and just doesn't bother if there is a blockage.
> Resizing the whole cache holds the mutex to ensure that
> the correct number of new stripes is allocated.
>
> This bug can result in some stripes not being freed when an
> array is stopped. This leads to the kmem_cache not being
> freed and a subsequent array can try to use the same kmem_cache
> and get confused.
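
The locking scheme described in the quote can be sketched roughly as follows. This is a user-space Python illustration only, not the kernel code: the class and method names (`StripeCache`, `auto_adjust`, `resize`) and the `during` hook are made up for the example, and `threading.Lock` merely stands in for the kernel mutex.

```python
import threading


class StripeCache:
    """Toy model of a cache whose size can change automatically or on demand."""

    def __init__(self, stripes):
        self.stripes = stripes
        # Plays the role of the mutex introduced by the patch.
        self._resize_lock = threading.Lock()

    def auto_adjust(self, delta):
        """Opportunistic grow/shrink: skip if a resize is in progress."""
        # Non-blocking acquire, analogous to mutex_trylock().
        if not self._resize_lock.acquire(blocking=False):
            return False  # there is a blockage, so just don't bother
        try:
            self.stripes = max(0, self.stripes + delta)
            return True
        finally:
            self._resize_lock.release()

    def resize(self, new_count, during=None):
        """Full resize: hold the lock so automatic adjustment cannot interfere."""
        with self._resize_lock:
            if during is not None:
                during()  # hypothetical hook, used only to show contention
            self.stripes = new_count


cache = StripeCache(stripes=256)
print(cache.auto_adjust(+1))  # lock is free, so the adjustment happens

attempts = []
# While the resize holds the lock, an automatic adjustment is skipped.
cache.resize(512, during=lambda: attempts.append(cache.auto_adjust(-1)))
print(attempts, cache.stripes)
```

The point of the non-blocking acquire is that automatic adjustment is best-effort: skipping one round is harmless, whereas letting it run in the middle of a full resize is what leaves the stripe count inconsistent.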

It should be sufficient to apply the patch by Neil Brown [0], but it does not apply cleanly to 4.1.x. We either have to backport it or first apply a preparatory series by Yuanhan Liu [1][2][3].

[0] https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/patch/?id=2d5b569b
[1] https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/patch/?id=9f3520c3
[2] https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/patch/?id=b1b46486
[3] https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/patch/?id=e9e4c377

Additional info:
linux 4.1.5-1

Closed by  Tobias Powalowski (tpowa)
Monday, 28 September 2015, 06:58 GMT
Reason for closing:  Fixed
Additional comments about closing:  4.2.1-1
Comment by Christian Hesse (eworm) - Saturday, 22 August 2015, 09:44 GMT
Looks like this still happens with the patches applied...
Another search made me stumble upon this:
https://lists.manjaro.org/pipermail/manjaro-dev/Week-of-Mon-20150727/000557.html

Just compiling to give it a shot.
Comment by Christian Hesse (eworm) - Monday, 24 August 2015, 06:46 GMT
Ignore my last post... That was nonsense.

What we need is another patch by Neil Brown [0][1].

[0] http://marc.info/?l=linux-raid&m=144039460103982&w=2
[1] https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/patch/?id=49895bcc
Comment by Kev (Kev) - Saturday, 19 September 2015, 09:34 GMT
I got hit by this bug, too.

Once during a copy operation after 15GiB and once during a pvcreate on another (freshly created) raid5.
Both on up-to-date 4.1.6-1-ARCH.

The crash during the copy operation is the more critical one, as I was copying *from* the raid5. So this can happen during normal usage and normal read operations.

After some research I found the fixes that Neil Brown sent to GKH. They are still not in 4.1.7. I guess they will be in 4.1.8.
They are in 4.2.0.

Some more details:
http://comments.gmane.org/gmane.linux.raid/49638
http://permalink.gmane.org/gmane.linux.kernel.commits.head/538207
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=49895bcc7e566ba455eb2996607d6fbd3447ce16
Comment: "stable@vger.kernel.org (4.1 - please release with 2d5b569b665)"
Which is:
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=2d5b569b665
Comment: "stable@vger.kernel.org (4.1 - please delay until 2 weeks after release of 4.2)"

Thanks to Christian Hesse for bringing this up to Neil Brown.

Can we get these patches applied? It will take some time before we get 4.1.8 (if we get it at all) or 4.2.x in the stable repos, and this bug hits normal usage, even during read operations.

[35248.469766] BUG: unable to handle kernel NULL pointer dereference at (null)
[35248.469837] IP: [<ffffffffa015bb91>] get_free_stripe+0x31/0xf0 [raid456]
[...]
[35248.477237] RIP [<ffffffffa015bb91>] get_free_stripe+0x31/0xf0 [raid456]
[35248.478229] RSP <ffff8801bfddb718>
[35248.479200] CR2: 0000000000000000
[35248.483013] ---[ end trace 4a3497943502ed7e ]---
[35248.483955] note: cp[30471] exited with preempt_count 1


[73643.618384] RIP [<ffffffffa015ffb0>] __find_stripe+0x30/0xc0 [raid456]
[73643.618386] RSP <ffff880004193778>
[73643.618394] ---[ end trace 4a3497943502ed7f ]---
[73643.618401] note: pvcreate[10434] exited with preempt_count 1


Comment by Christian Hesse (eworm) - Monday, 21 September 2015, 20:56 GMT
Uh, we have linux-lts 4.1.7-2 in [core], which has this RAID5 issue, no?
Comment by Kev (Kev) - Tuesday, 22 September 2015, 08:01 GMT
Yes, that's true. It's in [testing] at the moment.
Can we somehow link this report to https://www.archlinux.org/packages/testing/x86_64/linux-lts/ ?
At the moment it has no reported bugs.

By the way, 4.1.8 is out on kernel.org and still lacks the fix; I don't know why.
My server is crashing every two days without it. That's not a good base for linux-lts.
