FS#70992 - bvec_alloc crash with kernel 5.12.5-arch1-1
Attached to Project:
Arch Linux
Opened by Jens Stutte (jensstutte) - Saturday, 22 May 2021, 19:23 GMT
Last edited by Toolybird (Toolybird) - Tuesday, 06 June 2023, 03:21 GMT
Details
Description:
After upgrading to kernel 5.12.5-arch1-1, I experience frequent hangs and found the following in my journalctl:

~~~
Mai 21 19:09:06 vdr kernel: ------------[ cut here ]------------
Mai 21 19:09:06 vdr kernel: kernel BUG at block/bio.c:52!
Mai 21 19:09:06 vdr kernel: invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
Mai 21 19:09:06 vdr kernel: CPU: 13 PID: 272 Comm: kworker/u64:4 Not tainted 5.12.5-arch1-1 #1
Mai 21 19:09:06 vdr kernel: Hardware name: ASUS System Product Name/TUF GAMING B550M-PLUS, BIOS 1804 02/02/2021
Mai 21 19:09:06 vdr kernel: Workqueue: writeback wb_workfn (flush-9:0)
Mai 21 19:09:06 vdr kernel: RIP: 0010:biovec_slab.part.0+0x5/0x10
Mai 21 19:09:06 vdr kernel: Code: 81 18 63 00 48 8b 6b f0 48 85 ed 75 ca 5b 4c 89 e7 5d 41 5c e9 4c 18 63 00 48 c7 43 f8 00 00 00 00 eb c1 66 90 0f 1f 44 00 00 <0f> 0b 66 0f 1f 84 00 00 00 00 00 0f 1f 4>
Mai 21 19:09:06 vdr kernel: RSP: 0018:ffffb37500f27620 EFLAGS: 00010202
Mai 21 19:09:06 vdr kernel: RAX: 00000000000000bf RBX: ffffb37500f27654 RCX: 0000000000000100
Mai 21 19:09:06 vdr kernel: RDX: 0000000000000c00 RSI: ffffb37500f27654 RDI: ffff970080e9dc38
Mai 21 19:09:06 vdr kernel: RBP: 0000000000000c00 R08: ffff970080e9dc38 R09: ffff970109e06a00
Mai 21 19:09:06 vdr kernel: R10: 0000000000000004 R11: ffffb37500f27788 R12: ffff970080e9dc38
Mai 21 19:09:06 vdr kernel: R13: 0000000000000c00 R14: 0000000000000c00 R15: ffff970080e9dbf0
Mai 21 19:09:06 vdr kernel: FS: 0000000000000000(0000) GS:ffff97078ed40000(0000) knlGS:0000000000000000
Mai 21 19:09:06 vdr kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Mai 21 19:09:06 vdr kernel: CR2: 00007fa7679f3010 CR3: 000000014c682000 CR4: 0000000000350ee0
Mai 21 19:09:06 vdr kernel: Call Trace:
Mai 21 19:09:06 vdr kernel: bvec_alloc+0x90/0xc0
Mai 21 19:09:06 vdr kernel: bio_alloc_bioset+0x1b3/0x260
Mai 21 19:09:06 vdr kernel: raid1_make_request+0x9ce/0xc50 [raid1]
Mai 21 19:09:06 vdr kernel: ? __bio_clone_fast+0xa8/0xe0
Mai 21 19:09:06 vdr kernel: md_handle_request+0x158/0x1d0 [md_mod]
Mai 21 19:09:06 vdr kernel: md_submit_bio+0xcd/0x110 [md_mod]
Mai 21 19:09:06 vdr kernel: submit_bio_noacct+0x139/0x530
Mai 21 19:09:06 vdr kernel: ? __test_set_page_writeback+0x89/0x2d0
Mai 21 19:09:06 vdr kernel: submit_bio+0x78/0x1d0
Mai 21 19:09:06 vdr kernel: ext4_bio_write_page+0x1fd/0x630 [ext4]
Mai 21 19:09:06 vdr kernel: mpage_submit_page+0x46/0x80 [ext4]
Mai 21 19:09:06 vdr kernel: ext4_writepages+0x9ed/0x1170 [ext4]
Mai 21 19:09:06 vdr kernel: ? do_writepages+0x41/0x100
Mai 21 19:09:06 vdr kernel: do_writepages+0x41/0x100
Mai 21 19:09:06 vdr kernel: ? __wb_calc_thresh+0x4b/0x140
Mai 21 19:09:06 vdr kernel: __writeback_single_inode+0x3d/0x310
Mai 21 19:09:06 vdr kernel: ? wbc_detach_inode+0x13f/0x210
Mai 21 19:09:06 vdr kernel: writeback_sb_inodes+0x1fc/0x480
Mai 21 19:09:06 vdr kernel: __writeback_inodes_wb+0x4c/0xe0
Mai 21 19:09:06 vdr kernel: wb_writeback+0x22e/0x320
Mai 21 19:09:06 vdr kernel: wb_workfn+0x392/0x5c0
Mai 21 19:09:06 vdr kernel: process_one_work+0x214/0x3e0
Mai 21 19:09:06 vdr kernel: worker_thread+0x4d/0x3d0
Mai 21 19:09:06 vdr kernel: ? process_one_work+0x3e0/0x3e0
Mai 21 19:09:06 vdr kernel: kthread+0x133/0x150
Mai 21 19:09:06 vdr kernel: ? kthread_associate_blkcg+0xc0/0xc0
Mai 21 19:09:06 vdr kernel: ret_from_fork+0x22/0x30
Mai 21 19:09:06 vdr kernel: Modules linked in: cfg80211 8021q garp mrp stp llc nct6775 mousedev joydev intel_rapl_msr intel_rapl_common amdgpu edac_mce_amd snd_hda_codec_realtek snd_hda_codec_generic le>
Mai 21 19:09:06 vdr kernel: ---[ end trace 475f9c7132a03933 ]---
~~~

The previous kernel, 5.11.16-arch1-1, worked. I see that the kernel is now compiled with GCC 11 instead of GCC 10. Could there be some code generation going on here that is incompatible with the Ryzen 9 3900 XT?

Additional info:
* package version(s)
* config and/or log files etc.
* link to upstream bug report, if any

Steps to reproduce:
Upgrade kernel and boot.
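For context, block/bio.c:52 in 5.12 is the BUG() in biovec_slab(), which fires when a bio asks for more biovecs than BIO_MAX_VECS (256); the repeated 0xc00 (3072) values in the registers above suggest a request for far more vectors than that limit. A rough paraphrase of that check (a sketch based on my reading of the 5.12 sources, not a verbatim quote):

~~~
/* Rough paraphrase of biovec_slab() from 5.12 block/bio.c (not verbatim):
 * the slab lookup only covers vector counts up to BIO_MAX_VECS (256);
 * anything larger falls through to BUG(), which is the crash seen here. */
static struct biovec_slab *biovec_slab(unsigned short nr_vecs)
{
	switch (nr_vecs) {
	/* smaller counts use the bio's inline vecs */
	case 5 ... 16:
		return &bvec_slabs[0];
	case 17 ... 64:
		return &bvec_slabs[1];
	case 65 ... 128:
		return &bvec_slabs[2];
	case 129 ... BIO_MAX_VECS:
		return &bvec_slabs[3];
	default:
		BUG();
		return NULL;
	}
}
~~~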
Closed by Toolybird (Toolybird)
Tuesday, 06 June 2023, 03:21 GMT
Reason for closing: No response
Additional comments about closing: Old and stale. If still an issue, please follow PM's instructions and report the issue upstream and submit the patch.
https://bbs.archlinux.org/viewtopic.php?id=266125
BTW, I see the exact same behavior as in https://bbs.archlinux.org/viewtopic.php?pid=1971470#p1971470, thanks for pointing me there!
~~~
[root@vdr jens]# df -h
Dateisystem Größe Benutzt Verf. Verw% Eingehängt auf
dev 16G 0 16G 0% /dev
run 16G 1,4M 16G 1% /run
/dev/sda4 63G 33G 27G 56% /
tmpfs 16G 0 16G 0% /dev/shm
tmpfs 16G 0 16G 0% /tmp
/dev/md0 1,8T 739G 1002G 43% /mnt/raid
tmpfs 3,2G 60K 3,2G 1% /run/user/1000
tmpfs 3,2G 60K 3,2G 1% /run/user/969
~~~
https://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux-block.git/log/?h=block-5.13
and
https://lore.kernel.org/linux-bcache/180599cb-7c2e-da35-96a5-225462c6cd71@kernel.dk/T/#t
These two tested patches are supposed to fix the issue for actual bcache use. They are probably going into 5.13. Please consider applying them to all supported kernels affected by this issue.
https://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux-block.git/patch/?id=1616a4c2ab1a80893b6890ae93da40a2b1d0c691
https://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux-block.git/patch/?id=41fe8d088e96472f63164e213de44ec77be69478
https://cdn.kernel.org/pub/linux/kernel/v5.x/ChangeLog-5.12.11
https://git.kernel.org/pub/scm/linux/kernel/git/stable/stable-queue.git/diff/releases/5.12.11/bcache-avoid-oversized-read-request-in-cache-missing-code-path.patch?h=v5.12.11
https://git.kernel.org/pub/scm/linux/kernel/git/stable/stable-queue.git/diff/releases/5.12.11/bcache-remove-bcache-device-self-defined-readahead.patch?h=v5.12.11
My OS is on LVM, which is on mdraid. I don't use bcache.
Part of call trace:
---
[ 8.096941] Call Trace:
[ 8.097936] bvec_alloc+0x90/0xc0
[ 8.098934] bio_alloc_bioset+0x1b3/0x260
[ 8.099959] raid1_make_request+0x9ce/0xc50 [raid1]
[ 8.100988] ? __bio_clone_fast+0xa8/0xe0
[ 8.102008] md_handle_request+0x158/0x1d0 [md_mod]
[ 8.103050] md_submit_bio+0xcd/0x110 [md_mod]
[ 8.104084] submit_bio_noacct+0x139/0x530
[ 8.105127] submit_bio+0x78/0x1d0
[ 8.106163] ext4_io_submit+0x48/0x60 [ext4]
[ 8.107242] ext4_writepages+0x652/0x1170 [ext4]
[ 8.108300] ? do_writepages+0x41/0x100
[ 8.109338] ? __ext4_mark_inode_dirty+0x240/0x240 [ext4]
[ 8.110406] do_writepages+0x41/0x100
[ 8.111450] __filemap_fdatawrite_range+0xc5/0x100
[ 8.112513] file_write_and_wait_range+0x61/0xb0
[ 8.113564] ext4_sync_file+0x73/0x370 [ext4]
[ 8.114607] __x64_sys_fsync+0x33/0x60
[ 8.115635] do_syscall_64+0x33/0x40
[ 8.116670] entry_SYSCALL_64_after_hwframe+0x44/0xae
---
[1] https://github.com/archlinux/linux/commits/v5.13.9-arch1
I was trying to set up the Arch kernel build environment to do so, but I am relatively new to Arch (former Gentoo user) and was struggling a bit yesterday with makepkg and keys. I guess I'll just need to dedicate some more time to it.
But if it takes too long, I might just disable write-behind on those devices; IIUC that would fix the issue, too.
Compiling raid1.c with that patch applied fails:

~~~
drivers/md/raid1.c: In function ‘raid1_write_request’:
drivers/md/raid1.c:1454:67: error: ‘PAGE_SECTORS’ undeclared (first use in this function); did you mean ‘READ_SECTORS’?
 1454 | max_sectors = min_t(uint32_t, max_sectors, BIO_MAX_VECS * PAGE_SECTORS);
~~~
Edit:
Same error with 5.14-rc5. Attached the patch I used.
#include "bcache/util.h"
at the top of raid1.c, it compiles, at least. But it feels kind of wrong to make md depend on bcache?
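A possibly cleaner alternative (just a sketch, assuming PAGE_SECTORS is not already provided by a generic header in that kernel version) would be to define the constant locally in raid1.c instead of pulling in bcache/util.h:

~~~
/* Hypothetical local fallback for the missing PAGE_SECTORS symbol in
 * drivers/md/raid1.c, to avoid a dependency on bcache/util.h.
 * PAGE_SECTORS is simply the number of 512-byte sectors per page. */
#ifndef PAGE_SECTORS
#define PAGE_SECTORS	(PAGE_SIZE >> SECTOR_SHIFT)
#endif
~~~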
I just tried kernel 5.14.7-arch1-1 which is supposed to contain the patch. Unfortunately the problem persists, see attached log.
With the custom kernel I built from the initial tentative patch (which looks a bit different from what went into the kernel sources), it still works.
The condition used to decide whether we need to split differed from the condition used to decide how much to allocate.
This patch simply always splits when there is a bitmap and max_sectors is too big, as sketched below.
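A minimal sketch of that idea (simplified and with variable names assumed from raid1_write_request(); not the exact code that went upstream): clamp max_sectors so the write-behind bio can never need more than BIO_MAX_VECS pages, and use the same clamped value for both the split decision and the allocation.

~~~
/* Simplified sketch of the intended raid1_write_request() behaviour,
 * not the exact upstream patch: with a bitmap present, cap the request
 * at BIO_MAX_VECS pages so bvec_alloc() can never be asked for more
 * vectors than it supports, then split against that same cap. */
if (bitmap)
	max_sectors = min_t(int, max_sectors,
			    BIO_MAX_VECS * (PAGE_SIZE >> SECTOR_SHIFT));

if (max_sectors < bio_sectors(bio)) {
	struct bio *split = bio_split(bio, max_sectors, GFP_NOIO,
				      &conf->bio_split);
	bio_chain(split, bio);
	submit_bio_noacct(bio);
	bio = split;
}
~~~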