FS#46002 : [linux] null pointer dereference with RAID5

Attached to Project: Arch Linux
Opened by Christian Hesse (eworm) - Monday, 17 August 2015, 08:31 GMT
Last edited by Tobias Powalowski (tpowa) - Monday, 28 September 2015, 06:58 GMT

Task Type	Bug Report
Category	Kernel
Status	Closed
Assigned To	Tobias Powalowski (tpowa) Thomas Bächler (brain0)
Architecture	All
Severity	Medium
Priority	Normal
Reported Version
Due in Version	Undecided
Due Date	Undecided
Percent Complete
Votes	3 Curtis Lee Bolin (curtisleebolin) (2015-09-27) Kev (Kev) (2015-09-19) Gene (GeneC) (2015-09-03)
Private	No

Details

Description:
Linux 4.1.x can crash hard with a null pointer dereference when using RAID5. I have been hit twice since moving to 4.1.x. Citing Neil Brown from his patch [0]:

> Cache size can grow or shrink due to various pressures at
> any time. So when we resize the cache as part of a 'grow'
> operation (i.e. change the size to allow more devices) we need
> to blocks that automatic growing/shrinking.
>
> So introduce a mutex. auto grow/shrink uses mutex_trylock()
> and just doesn't bother if there is a blockage.
> Resizing the whole cache holds the mutex to ensure that
> the correct number of new stripes is allocated.
>
> This bug can result in some stripes not being freed when an
> array is stopped. This leads to the kmem_cache not being
> freed and a subsequent array can try to use the same kmem_cache
> and get confused.
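
A rough userspace sketch of the locking scheme described in the quote, for illustration only. This is not the kernel code: the class `StripeCache`, its methods and the stripe counts are made up here, and Python's `Lock.acquire(blocking=False)` merely stands in for `mutex_trylock()`.

```python
import threading


class StripeCache:
    """Toy stand-in for the RAID5 stripe cache (hypothetical, not kernel code)."""

    def __init__(self, stripes):
        self.stripes = stripes
        # Plays the role of the mutex introduced by the patch.
        self.resize_lock = threading.Lock()

    def auto_adjust(self, delta):
        """Opportunistic grow/shrink: give up at once if the lock is taken."""
        # Non-blocking acquire is the closest analogue of mutex_trylock().
        if not self.resize_lock.acquire(blocking=False):
            return False  # a resize is in progress, so just don't bother
        try:
            self.stripes = max(1, self.stripes + delta)
            return True
        finally:
            self.resize_lock.release()

    def resize(self, new_size):
        """Full resize: hold the lock so no automatic adjustment interleaves."""
        with self.resize_lock:
            self.stripes = new_size
```

With this shape, an automatic adjustment that arrives while a full resize holds the lock is skipped rather than queued, so the resize always ends with exactly the stripe count it asked for.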

It should be sufficient to apply the patch by Neil Brown [0], but it does not apply cleanly to 4.1.x. Either we have to backport it, or we first apply a preparatory series by Yuanhan Liu [1][2][3].

[0] https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/patch/?id=2d5b569b
[1] https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/patch/?id=9f3520c3
[2] https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/patch/?id=b1b46486
[3] https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/patch/?id=e9e4c377

Additional info:
linux 4.1.5-1

Closed by Tobias Powalowski (tpowa)
Monday, 28 September 2015, 06:58 GMT
Reason for closing: Fixed
Additional comments about closing: 4.2.1-1

Comment by Christian Hesse (eworm) - Saturday, 22 August 2015, 09:44 GMT

Looks like this still happens with the patches applied...
Another search made me stumble upon this:
https://lists.manjaro.org/pipermail/manjaro-dev/Week-of-Mon-20150727/000557.html

Just compiling to give it a shot.

Comment by Christian Hesse (eworm) - Monday, 24 August 2015, 06:46 GMT

Ignore my last post... That was nonsense.

What we need is another patch by Neil Brown [0][1].

[0] http://marc.info/?l=linux-raid&m=144039460103982&w=2
[1] https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/patch/?id=49895bcc

Comment by Kev (Kev) - Saturday, 19 September 2015, 09:34 GMT

I got hit by this bug, too.

Once during a copy operation after 15GiB and once during a pvcreate on another (freshly created) raid5.
Both on up-to-date 4.1.6-1-ARCH.

The one during the copy operation is more critical, as I was copying *from* the raid5. So this can happen during normal usage and normal read operations.

After some research I found the fixes from Neil Brown sent to GKH. They are still not in 4.1.7. I guess they will be in 4.1.8.
They are in 4.2.0.

Some more details:
http://comments.gmane.org/gmane.linux.raid/49638
http://permalink.gmane.org/gmane.linux.kernel.commits.head/538207
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=49895bcc7e566ba455eb2996607d6fbd3447ce16
Comment: "stable@vger.kernel.org (4.1 - please release with 2d5b569b665)"
Which is:
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=2d5b569b665
Comment: "stable@vger.kernel.org (4.1 - please delay until 2 weeks after release of 4.2)"

Thanks to Christian Hesse for bringing this up to Neil Brown.

Can we get these patches applied? It will take some time before we get 4.1.8 (if we get it at all) or 4.2.X in the stable repos, and this bug is hitting normal usage, even during read operations.

[35248.469766] BUG: unable to handle kernel NULL pointer dereference at (null)
[35248.469837] IP: [<ffffffffa015bb91>] get_free_stripe+0x31/0xf0 [raid456]
[...]
[35248.477237] RIP [<ffffffffa015bb91>] get_free_stripe+0x31/0xf0 [raid456]
[35248.478229] RSP <ffff8801bfddb718>
[35248.479200] CR2: 0000000000000000
[35248.483013] ---[ end trace 4a3497943502ed7e ]---
[35248.483955] note: cp[30471] exited with preempt_count 1

[73643.618384] RIP [<ffffffffa015ffb0>] __find_stripe+0x30/0xc0 [raid456]
[73643.618386] RSP <ffff880004193778>
[73643.618394] ---[ end trace 4a3497943502ed7f ]---
[73643.618401] note: pvcreate[10434] exited with preempt_count 1

Comment by Christian Hesse (eworm) - Monday, 21 September 2015, 20:56 GMT

Uh, we have linux-lts 4.1.7-2 in [core], which has this RAID5 issue, no?

Comment by Kev (Kev) - Tuesday, 22 September 2015, 08:01 GMT

Yes, that's true. It's in [testing] at the moment.
Can we somehow link this report to https://www.archlinux.org/packages/testing/x86_64/linux-lts/ ?
At the moment it has no reported bugs.

Btw, 4.1.8 is out on kernel.org and still lacks the fix; I don't know why.
My server is crashing every two days without it. That's not a good base for linux-lts.
