Arch Linux


FS#75925 - [linux] System freezes since 5.19.8 when using docker

Attached to Project: Arch Linux
Opened by Patrick (suiiii) - Friday, 16 September 2022, 20:34 GMT
Last edited by Toolybird (Toolybird) - Thursday, 22 September 2022, 01:22 GMT
Task Type Bug Report
Category Kernel
Status Waiting on Response
Assigned To No-one
Architecture x86_64
Severity Critical
Priority Normal
Reported Version
Due in Version Undecided
Due Date Undecided
Percent Complete 0%
Votes 0
Private No

Details

Description:

The system freezes when using docker on 5.19.8 and 5.19.9.
This happens when doing docker run, docker pull, and docker prune, most reliably when pruning the system. run and pull seem to work for a short time (1 run, maybe 2) before freezing.

After downgrading to 5.19.7 the system works fine again.
There was also no docker update during the kernel updates.

I figure it is an upstream problem, but I could not find any other reports of this, so I wanted to report here first before going upstream.

I attached 3 dumps from journalctl, but I am also pasting parts of the dumps so Google can pick them up.

First I am getting a warning:
WARNING: CPU: 26 PID: 1150 at fs/kernfs/dir.c:504 __kernfs_remove.part.0+0x2bf/0x300
...
Call Trace:
<TASK>
? cpumask_next+0x22/0x30
? kernfs_name_hash+0x12/0x80
kernfs_remove_by_name_ns+0x64/0xb0
sysfs_slab_add+0x166/0x200
__kmem_cache_create+0x3f1/0x4e0
kmem_cache_create_usercopy+0x172/0x2e0
kmem_cache_create+0x16/0x20
bioset_init+0x202/0x280
dm_alloc_md_mempools+0xe5/0x180 [dm_mod e0e7e531acb17cea3054e278f4217ef31a69a6b7]
dm_table_complete+0x3a0/0x690 [dm_mod e0e7e531acb17cea3054e278f4217ef31a69a6b7]
table_load+0x171/0x2f0 [dm_mod e0e7e531acb17cea3054e278f4217ef31a69a6b7]
? dev_suspend+0x2c0/0x2c0 [dm_mod e0e7e531acb17cea3054e278f4217ef31a69a6b7]
ctl_ioctl+0x206/0x460 [dm_mod e0e7e531acb17cea3054e278f4217ef31a69a6b7]
dm_ctl_ioctl+0xe/0x20 [dm_mod e0e7e531acb17cea3054e278f4217ef31a69a6b7]
__x64_sys_ioctl+0x94/0xd0
do_syscall_64+0x5f/0x90
? exit_to_user_mode_prepare+0x16f/0x1d0
? syscall_exit_to_user_mode+0x1b/0x40
? do_syscall_64+0x6b/0x90
? exc_page_fault+0x74/0x170
entry_SYSCALL_64_after_hwframe+0x63/0xcd



Followed by a kernel BUG:
kernel BUG at mm/slub.c:381!
invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
CPU: 26 PID: 1150 Comm: dockerd Tainted: G W 5.19.9-arch1-1 #1 3da5a84b9442a05cd5bc412feaf8d6ab31862ed4
...
Call Trace:
<TASK>
kernfs_put.part.0+0x58/0x1a0
__kernfs_remove.part.0+0x18c/0x300
? cpumask_next+0x22/0x30
? kernfs_name_hash+0x12/0x80
kernfs_remove_by_name_ns+0x64/0xb0
sysfs_slab_add+0x166/0x200
__kmem_cache_create+0x3f1/0x4e0
kmem_cache_create_usercopy+0x172/0x2e0
kmem_cache_create+0x16/0x20
bioset_init+0x202/0x280
dm_alloc_md_mempools+0xe5/0x180 [dm_mod e0e7e531acb17cea3054e278f4217ef31a69a6b7]
dm_table_complete+0x3a0/0x690 [dm_mod e0e7e531acb17cea3054e278f4217ef31a69a6b7]
table_load+0x171/0x2f0 [dm_mod e0e7e531acb17cea3054e278f4217ef31a69a6b7]
? dev_suspend+0x2c0/0x2c0 [dm_mod e0e7e531acb17cea3054e278f4217ef31a69a6b7]
ctl_ioctl+0x206/0x460 [dm_mod e0e7e531acb17cea3054e278f4217ef31a69a6b7]
dm_ctl_ioctl+0xe/0x20 [dm_mod e0e7e531acb17cea3054e278f4217ef31a69a6b7]
__x64_sys_ioctl+0x94/0xd0
do_syscall_64+0x5f/0x90
? exit_to_user_mode_prepare+0x16f/0x1d0
? syscall_exit_to_user_mode+0x1b/0x40
? do_syscall_64+0x6b/0x90
? exc_page_fault+0x74/0x170
entry_SYSCALL_64_after_hwframe+0x63/0xcd


Additional info:

docker --version
Docker version 20.10.18, build b40c2f6b5d

uname -r
5.19.9-arch1-1


Steps to reproduce:

* be on 5.19.8-arch1-1 or 5.19.9-arch1-1
* run `docker system prune -a -f --volumes` (the system needs to have pulled images, containers, volumes, etc. - essentially it needs data to prune); otherwise do a `docker pull` (maybe multiple)
* system freezes
   dumps.txt (34.1 KiB)

Comment by loqs (loqs) - Saturday, 17 September 2022, 10:41 GMT
As the issue seems reproducible I would suggest trying to bisect it [1] before contacting upstream. Below are links to built kernels for 5.19.7 and 5.19.8 without Arch's additional commits and the first bisection point.

https://drive.google.com/file/d/1yH0ImhBsv6eOXauulhaPsWVUUTWpVO4_/view?usp=sharing linux-5.19.7-1-x86_64.pkg.tar.zst
https://drive.google.com/file/d/1y6XCqxq-JgBS6vSQc733cESQViJ4XSrS/view?usp=sharing linux-headers-5.19.7-1-x86_64.pkg.tar.zst

https://drive.google.com/file/d/1JWip1texRp2iI8uFJPURYLh9u1-muwoO/view?usp=sharing linux-5.19.8-1-x86_64.pkg.tar.zst
https://drive.google.com/file/d/16ARyEKUjFCUrksJ3M5T60qxVXFbnfUe5/view?usp=sharing linux-headers-5.19.8-1-x86_64.pkg.tar.zst

https://drive.google.com/file/d/1ftqZrJtYiCSBW927VgPEIdmYYTR1JsKu/view?usp=sharing linux-5.19.7.r78.gbb4be611c2f5-1-x86_64.pkg.tar.zst
https://drive.google.com/file/d/1xedA6u5LIMih-kW8nGYFRbd6jbqIonpj/view?usp=sharing linux-headers-5.19.7.r78.gbb4be611c2f5-1-x86_64.pkg.tar.zst

[1] https://wiki.archlinux.org/title/Bisecting_bugs_with_Git
Comment by Patrick (suiiii) - Saturday, 17 September 2022, 19:46 GMT
@loqs thanks for the help

It looks like the problem is not reliably reproducible after all.

I did some testing with upstream 5.19.7 and 5.19.8 and both seemed to work fine. Afterwards I upgraded to arch 5.19.8 which was also fine (?). arch 5.19.9 also worked fine for some time until I tried another system prune.

Each time I did a bunch of pulls, runs, builds, and prunes which usually caused the problem after 1-3 operations.

I also found this bug report upstream where I linked this ticket too: https://bugzilla.kernel.org/show_bug.cgi?id=216493
There is also this discussion which seems to cover the root cause: https://lore.kernel.org/lkml/20220913121723.691454-1-lk@c--e.de/T/#mc068df068cfd19c43b16542e74d4b72dfc1b0569

I guess I'll stick with 5.19.7 on my main machine for now and try to get a VM test system up and running to reproduce the problem.
Comment by Patrick (suiiii) - Monday, 19 September 2022, 15:27 GMT
I did some more digging into the topic, as the issue might be triggered by docker using devicemapper as its storage driver. I don't know why it was configured that way, since the overlay2 driver should be the default. At the same time, podman was using the overlay driver. Anyway, forcing docker to use the overlay driver solved the issue for me.
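For reference, the storage driver can be pinned in `/etc/docker/daemon.json` (the `storage-driver` key is standard dockerd configuration); a minimal sketch:

```json
{
    "storage-driver": "overlay2"
}
```

Restart the daemon afterwards (`systemctl restart docker`). Note that each storage driver keeps its own data under /var/lib/docker, so images and containers created under devicemapper won't be visible after switching.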

I am still trying to reproduce the error in a VM, explicitly using the devicemapper storage driver and even replicating my main system's LVM-on-LUKS setup, but still no luck.
Comment by Arthur Carcano (acarcano) - Thursday, 22 September 2022, 09:27 GMT
Hi,

I've just encountered the very same bug. Don't know how I can help more.

Docker version 20.10.18, build b40c2f6b5d
uname -r: 5.19.10-arch1-1
Comment by loqs (loqs) - Thursday, 22 September 2022, 23:18 GMT
@acarcano can you reliably reproduce the issue?
Comment by Arthur Carcano (acarcano) - Friday, 23 September 2022, 14:17 GMT
Given that it results in a kernel freeze and that using the overlay storage driver seems to have fixed the issue, I haven't really tried to reproduce it.

However, it happened nearly immediately as I started using docker. I was using the default configuration, with my whole disk LUKS-encrypted, and things went south as described by Patrick pretty fast, so I'd guess that it is reproducible. Unfortunately, I don't have the time to create a VM-based test-bed to bisect the kernel.

Thanks for following up anyway,
