FS#75925 - [linux] System freezes since 5.19.8 when using docker

Attached to Project: Arch Linux
Opened by Patrick (suiiii) - Friday, 16 September 2022, 20:34 GMT
Last edited by Toolybird (Toolybird) - Wednesday, 16 August 2023, 03:08 GMT
Task Type: Bug Report
Category: Kernel
Status: Closed
Assigned To: Tobias Powalowski (tpowa), Jan Alexander Steffens (heftig), David Runge (dvzrv), Levente Polyak (anthraxx)
Architecture: x86_64
Severity: Critical
Priority: Normal
Reported Version:
Due in Version: Undecided
Due Date: Undecided
Percent Complete: 100%
Votes: 1
Private: No

Details

Description:

The system freezes when using docker on 5.19.8 and 5.19.9.
This happens when doing docker run, docker pull and docker prune, and is most reliably triggered when pruning the system. run and pull seem to work for a short time (one invocation, maybe two) before freezing.

After downgrading to 5.19.7 the system works fine again.
There was also no docker update during the kernel updates.

I figure it is an upstream problem, but I could not find any other reports of it, so I wanted to report it here first before going upstream.

I attached 3 dumps from journalctl, but I am also pasting part of them here so Google can pick them up.

First I get a warning:
WARNING: CPU: 26 PID: 1150 at fs/kernfs/dir.c:504 __kernfs_remove.part.0+0x2bf/0x300
...
Call Trace:
<TASK>
? cpumask_next+0x22/0x30
? kernfs_name_hash+0x12/0x80
kernfs_remove_by_name_ns+0x64/0xb0
sysfs_slab_add+0x166/0x200
__kmem_cache_create+0x3f1/0x4e0
kmem_cache_create_usercopy+0x172/0x2e0
kmem_cache_create+0x16/0x20
bioset_init+0x202/0x280
dm_alloc_md_mempools+0xe5/0x180 [dm_mod e0e7e531acb17cea3054e278f4217ef31a69a6b7]
dm_table_complete+0x3a0/0x690 [dm_mod e0e7e531acb17cea3054e278f4217ef31a69a6b7]
table_load+0x171/0x2f0 [dm_mod e0e7e531acb17cea3054e278f4217ef31a69a6b7]
? dev_suspend+0x2c0/0x2c0 [dm_mod e0e7e531acb17cea3054e278f4217ef31a69a6b7]
ctl_ioctl+0x206/0x460 [dm_mod e0e7e531acb17cea3054e278f4217ef31a69a6b7]
dm_ctl_ioctl+0xe/0x20 [dm_mod e0e7e531acb17cea3054e278f4217ef31a69a6b7]
__x64_sys_ioctl+0x94/0xd0
do_syscall_64+0x5f/0x90
? exit_to_user_mode_prepare+0x16f/0x1d0
? syscall_exit_to_user_mode+0x1b/0x40
? do_syscall_64+0x6b/0x90
? exc_page_fault+0x74/0x170
entry_SYSCALL_64_after_hwframe+0x63/0xcd



Followed by a kernel BUG:
kernel BUG at mm/slub.c:381!
invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
CPU: 26 PID: 1150 Comm: dockerd Tainted: G W 5.19.9-arch1-1 #1 3da5a84b9442a05cd5bc412feaf8d6ab31862ed4
...
Call Trace:
<TASK>
kernfs_put.part.0+0x58/0x1a0
__kernfs_remove.part.0+0x18c/0x300
? cpumask_next+0x22/0x30
? kernfs_name_hash+0x12/0x80
kernfs_remove_by_name_ns+0x64/0xb0
sysfs_slab_add+0x166/0x200
__kmem_cache_create+0x3f1/0x4e0
kmem_cache_create_usercopy+0x172/0x2e0
kmem_cache_create+0x16/0x20
bioset_init+0x202/0x280
dm_alloc_md_mempools+0xe5/0x180 [dm_mod e0e7e531acb17cea3054e278f4217ef31a69a6b7]
dm_table_complete+0x3a0/0x690 [dm_mod e0e7e531acb17cea3054e278f4217ef31a69a6b7]
table_load+0x171/0x2f0 [dm_mod e0e7e531acb17cea3054e278f4217ef31a69a6b7]
? dev_suspend+0x2c0/0x2c0 [dm_mod e0e7e531acb17cea3054e278f4217ef31a69a6b7]
ctl_ioctl+0x206/0x460 [dm_mod e0e7e531acb17cea3054e278f4217ef31a69a6b7]
dm_ctl_ioctl+0xe/0x20 [dm_mod e0e7e531acb17cea3054e278f4217ef31a69a6b7]
__x64_sys_ioctl+0x94/0xd0
do_syscall_64+0x5f/0x90
? exit_to_user_mode_prepare+0x16f/0x1d0
? syscall_exit_to_user_mode+0x1b/0x40
? do_syscall_64+0x6b/0x90
? exc_page_fault+0x74/0x170
entry_SYSCALL_64_after_hwframe+0x63/0xcd


Additional info:

docker --version
Docker version 20.10.18, build b40c2f6b5d

uname -r
5.19.9-arch1-1


Steps to reproduce:

* be on 5.19.8-arch1-1 or 5.19.9-arch1-1
* do `docker system prune -a -f --volumes` (the system needs to have pulled images, containers, volumes, etc. - essentially it needs to have data); otherwise do a `docker pull` (possibly several)
* system freezes
   dumps.txt (34.1 KiB)

Closed by: Toolybird (Toolybird)
Wednesday, 16 August 2023, 03:08 GMT
Reason for closing: Fixed
Additional comments about closing: Old and stale. Original issue no longer occurring with latest pkgs.
Comment by loqs (loqs) - Saturday, 17 September 2022, 10:41 GMT
As the issue seems reproducible, I would suggest trying to bisect it [1] before contacting upstream. Below are links to built kernels for 5.19.7 and 5.19.8 without Arch's additional commits, plus the first bisection point (a rough outline of the bisect commands follows at the end of this comment).

https://drive.google.com/file/d/1yH0ImhBsv6eOXauulhaPsWVUUTWpVO4_/view?usp=sharing linux-5.19.7-1-x86_64.pkg.tar.zst
https://drive.google.com/file/d/1y6XCqxq-JgBS6vSQc733cESQViJ4XSrS/view?usp=sharing linux-headers-5.19.7-1-x86_64.pkg.tar.zst

https://drive.google.com/file/d/1JWip1texRp2iI8uFJPURYLh9u1-muwoO/view?usp=sharing linux-5.19.8-1-x86_64.pkg.tar.zst
https://drive.google.com/file/d/16ARyEKUjFCUrksJ3M5T60qxVXFbnfUe5/view?usp=sharing linux-headers-5.19.8-1-x86_64.pkg.tar.zst

https://drive.google.com/file/d/1ftqZrJtYiCSBW927VgPEIdmYYTR1JsKu/view?usp=sharing linux-5.19.7.r78.gbb4be611c2f5-1-x86_64.pkg.tar.zst
https://drive.google.com/file/d/1xedA6u5LIMih-kW8nGYFRbd6jbqIonpj/view?usp=sharing linux-headers-5.19.7.r78.gbb4be611c2f5-1-x86_64.pkg.tar.zst

[1] https://wiki.archlinux.org/title/Bisecting_bugs_with_Git
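
For reference, the bisection workflow from [1] boils down to roughly the following. This is only a sketch against the stable kernel tree; the build/install step at each round (e.g. via the linux PKGBUILD) is assumed rather than spelled out, and the reproducer command is the one from the original report.

# fetch the stable tree and mark the known-good/known-bad releases
git clone https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git
cd linux
git bisect start
git bisect good v5.19.7    # last kernel that worked
git bisect bad v5.19.8     # first kernel that froze
# at each step: build, install and boot the checked-out kernel (see [1]),
# then run the reproducer, e.g. `docker system prune -a -f --volumes`,
# and tell git the outcome:
git bisect good            # or: git bisect bad
# repeat until git names the first bad commit, then clean up:
git bisect reset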
Comment by Patrick (suiiii) - Saturday, 17 September 2022, 19:46 GMT
@loqs thanks for the help

It looks like the problem is not reliably reproducible after all.

I did some testing with upstream 5.19.7 and 5.19.8 and both seemed to work fine. Afterwards I upgraded to Arch 5.19.8, which was also fine (?). Arch 5.19.9 also worked fine for some time, until I tried another system prune.

Each time I did a bunch of pulls, runs, builds, and prunes, which had usually triggered the problem after 1-3 operations.

I also found this upstream bug report, where I linked this ticket as well: https://bugzilla.kernel.org/show_bug.cgi?id=216493
There is also this discussion which seems to discuss the root cause: https://lore.kernel.org/lkml/20220913121723.691454-1-lk%40c--e.de/T/#mc068df068cfd19c43b16542e74d4b72dfc1b0569

I guess I'll stick with 5.19.7 on my main machine for now and try to get a VM test system up and running to reproduce the problem.
Comment by Patrick (suiiii) - Monday, 19 September 2022, 15:27 GMT
I did some more digging into the topic, as the issue might be triggered by docker using the devicemapper storage driver. I don't know why it was configured that way, since the overlay2 driver should be the default. At the same time, podman was using the overlay driver. Anyway, forcing docker to use the overlay driver solved the issue for me (a sketch of the configuration follows at the end of this comment).

I am still trying to reproduce the error in a VM, explicitly using the devicemapper storage driver and even replicating my main system's setup with LVM on LUKS, but still no luck.
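
For anyone hitting the same problem: the storage driver currently in use can be checked with docker info, and pinning docker to overlay2 is normally done through /etc/docker/daemon.json. A minimal sketch follows; note that images and containers created under the old driver are not migrated and will no longer be visible to docker after the switch.

docker info --format '{{.Driver}}'    # print the storage driver in use

# write a daemon.json selecting overlay2 (this overwrites any existing file;
# merge by hand if you already have one), then restart the daemon
cat <<'EOF' | sudo tee /etc/docker/daemon.json
{
  "storage-driver": "overlay2"
}
EOF
sudo systemctl restart docker.service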
Comment by Arthur Carcano (acarcano) - Thursday, 22 September 2022, 09:27 GMT
Hi,

I've just encountered the very same bug. Don't know how I can help more.

Docker version 20.10.18, build b40c2f6b5d
uname -r: 5.19.10-arch1-1
Comment by loqs (loqs) - Thursday, 22 September 2022, 23:18 GMT
@acarcano can you reliably reproduce the issue?
Comment by Arthur Carcano (acarcano) - Friday, 23 September 2022, 14:17 GMT
Given that it results in a kernel freeze and that using the overlay storage driver seems to have fixed the issue, I haven't really tried to reproduce it.

However, it happened almost immediately after I started using docker. I was using the default configuration, with my whole disk LUKS-encrypted, and things went south pretty fast, as described by Patrick, so I'd guess that it is reproducible. Unfortunately, I don't have the time to create a VM-based test bed to bisect the kernel.

Thanks for following up anyway,
Comment by Toolybird (Toolybird) - Sunday, 23 October 2022, 21:00 GMT
Is this still happening with the latest kernels? According to the upstream bugzilla link, this patch [1] from mainline might fix it. Someone wanna give it a go? (A rough sketch of trying the patch follows this comment.)

[1] https://github.com/archlinux/linux/commit/4abc9965
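
One possible way to test that commit, sketched under the assumption of a local checkout of the 5.19 stable sources; the hash comes from [1], and GitHub serves a commit as a mailbox-format patch when .patch is appended to its URL.

# download the fix and apply it to a local kernel source tree (assumed to
# live in ./linux and to be checked out at the affected 5.19.x release)
curl -LO https://github.com/archlinux/linux/commit/4abc9965.patch
cd linux
git am ../4abc9965.patch
# rebuild, install and boot the patched kernel (e.g. via the linux PKGBUILD),
# then re-run the docker prune reproducer from the original report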
Comment by Patrick (suiiii) - Wednesday, 26 October 2022, 13:46 GMT
I cannot reproduce the error anymore, either on my main machine or in test VMs. I tried different storage drivers and kernels (everything on LVM with LUKS), but I did not encounter the error again.
Comment by Chris Kankiewicz (PHLAK) - Tuesday, 15 August 2023, 23:20 GMT
I believe I'm experiencing this same (or awfully similar) issue. I'm running Docker 1:24.0.5-1 with the 6.1.43-1-lts kernel and I've been getting random complete system freezes requiring a hard reboot. This started out infrequent (e.g. once or twice a month) but has accelerated to multiple freezes a day. I've been struggling to pinpoint the exact source of the issue, but it's looking like Docker might be the culprit. It seems that as long as Docker is running, the freezes happen eventually.

I am using LUKS full-disk encryption for my OS (ext4) and storage (ZFS) drives. Originally, when the freezing started, I had Docker configured with the ZFS storage driver, storing layers and volumes on my ZFS array. However, I recently reverted to the default (i.e. overlay2, I think) on my OS drive and it's still freezing. My journal logs don't seem to report anything relevant at the moment of the freeze. I've currently disabled Docker completely and am letting the system run to see if it freezes, but so far it has not (it's been up over 24 hours). I'm at wits' end and had even started replacing hardware (SSD and RAM) before finding this bug report, and am hoping to find a resolution soon. If there's any way I can be helpful, let me know.
Comment by Chris Kankiewicz (PHLAK) - Tuesday, 15 August 2023, 23:22 GMT
@acarcano When you say that "using the overlay storage seems to have fixed the issue", how did you configure this? I thought overlay2 was the default.
Comment by Toolybird (Toolybird) - Wednesday, 16 August 2023, 03:07 GMT
@PHLAK, please don't "necro" old tickets. Please instead use the proper support channels (Forum/IRC/Mailing Lists/Reddit/etc) to seek troubleshooting assistance.
