FS#77481 - [docker] 1:23.0.1-1 broken, either won't start or sends SIGTERM to the app inside container

Attached to Project: Community Packages
Opened by Kirill (Zack) - Monday, 13 February 2023, 11:38 GMT
Last edited by Toolybird (Toolybird) - Sunday, 14 May 2023, 21:51 GMT
Task Type Bug Report
Category Packages
Status Closed
Assigned To Lukas Fleischer (lfleischer)
Sébastien Luttringer (seblu)
Morten Linderud (Foxboron)
Architecture All
Severity High
Priority Normal
Reported Version
Due in Version Undecided
Due Date Undecided
Percent Complete 100%
Votes 9
Private No

Details

Description:
Updating community/docker from 1:20.10.9-1 to 1:23.0.1-1 results in random container crashes, or in some cases the Docker engine failing to start at all.
The app inside the container receives SIGTERM.
Rolling back to 1:20.10.9-1 solves the problem.
Forum ref: https://bbs.archlinux.org/viewtopic.php?pid=2084430

Additional info:
* package version(s): 1:23.0.1-1
* config and/or log files etc. :
Posted by another user with crashing service:
> journalctl -xeu docker.service
Feb 13 10:39:26 beta systemd[1]: docker.service: Failed with result 'exit-code'.
░░ Subject: Unit failed
░░ Defined-By: systemd
░░ Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel
░░
░░ The unit docker.service has entered the 'failed' state with result 'exit-code'.
Feb 13 10:39:26 beta systemd[1]: Failed to start Docker Application Container Engine.
░░ Subject: A start job for unit docker.service has failed
░░ Defined-By: systemd
░░ Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel
░░
░░ A start job for unit docker.service has finished with a failure.
░░
░░ The job identifier is 1229 and the job result is failed.


Steps to reproduce:
Update to 1:23.0.1-1 and start the Docker engine. If it starts, run a container such as lscr.io/linuxserver/wireguard:latest and try to connect with a client; the app inside the container receives SIGTERM.
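A rough shell sketch of the reproduction (the container flags here are a minimal assumption; see the linuxserver/wireguard documentation for a complete working setup):

# upgrade and restart the engine
sudo pacman -Syu docker
sudo systemctl restart docker.service

# start the container from the report (minimal, assumed flags)
docker run -d --name wireguard --cap-add NET_ADMIN lscr.io/linuxserver/wireguard:latest

# connect with a WireGuard client and watch for the app receiving SIGTERM
docker logs -f wireguard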
This task depends upon

Closed by  Toolybird (Toolybird)
Sunday, 14 May 2023, 21:51 GMT
Reason for closing:  Fixed
Additional comments about closing:  docker 1:23.0.6-1
Comment by Evgenii Morozov (emorozov) - Monday, 13 February 2023, 12:16 GMT
The MariaDB 5.6 container starts but consumes 16.5 GB of RAM on start. If your system has 16 GB of RAM or less, it may freeze. I've upgraded Docker to the latest version on another machine with 64 GB of RAM; the MySQL container starts there successfully (still eating 16.5 GB of RAM) but seems to perform significantly slower.

Downgraded docker to 20.10.23, rebooted, now everything works just fine.
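For reference, a downgrade on Arch can be done from the local package cache, assuming the old package is still cached (the exact filename and pkgrel may differ):

sudo pacman -U /var/cache/pacman/pkg/docker-1:20.10.23-1-x86_64.pkg.tar.zst
sudo reboot    # per the reports in this thread, restarting docker.service alone may not be enough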
Comment by Simon Tunnat (imon) - Tuesday, 14 February 2023, 07:48 GMT
I also had strange problems with Docker yesterday. I was trying to run some Ansible roles in a Docker container and Ansible just got completely stuck when retrieving the facts.
It seems downgrading to Docker 20.10.23 has resolved that issue for now.

It also seems that restarting docker.service alone was not enough; I needed to restart my machine.
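In case it helps others, restarting containerd together with docker may be an alternative to a full reboot (assuming both services run under systemd with the stock unit names):

sudo systemctl restart containerd.service docker.service
systemctl status containerd.service docker.service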
Comment by Peter Jung (ptr1337) - Tuesday, 14 February 2023, 08:46 GMT
I also hit an issue when using https://github.com/WhyNotHugo/docker-makepkg to build packages.
During the packaging step it just hung for about 3 minutes in the "fakeroot" process until it finally started packaging.
Downgrading also fixed the issue.
Comment by Toolybird (Toolybird) - Thursday, 16 February 2023, 04:22 GMT
So far nobody has provided enough detailed info to pinpoint an Arch packaging bug. Anything in the system logs? Any crashes reported by coredumpctl? You'll need to debug this somehow. An upstream issue is still a possibility.
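Some starting points for collecting that info (these assume systemd-coredump is in use):

journalctl -b -u docker.service -u containerd.service --no-pager
coredumpctl list
coredumpctl info dockerd    # only if a dockerd crash shows up in the list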
Comment by Evgenii Morozov (emorozov) - Thursday, 16 February 2023, 05:20 GMT
This is most probably an upstream issue. I've tried to debug it, but gave up after spending a couple of hours. In my case MySQL works if you have more than 16 GB of RAM. On the previous version of Docker it grabs no more than 4 GB of RAM, while on this latest version it grabs 16.5 GB immediately on start and also becomes slow answering queries.

I tried the most obvious things: disabling hugepages completely, looking at the MySQL process memory maps, but there is no obvious solution. The MySQL process simply grabs 4-5 times more RAM than usual for some reason. Other containers are affected too (rabbitmq, for example), but not as much; MySQL is the largest offender. The issue is so low-level that I have no idea how to debug it.

I probably should've reported this upstream, but the package is also unusable in this state.
Comment by Morten Linderud (Foxboron) - Thursday, 16 February 2023, 22:33 GMT
I have been unable to reproduce this issue and haven't had a lot of time to try and figure out what has broken.

Please supply proper verbose debugging logs from docker and attach them. Output from failing services isn't super useful.
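One way to get verbose daemon logs, assuming the default config location (merge with any existing options in the file):

# /etc/docker/daemon.json
{
  "debug": true
}

sudo systemctl restart docker.service
journalctl -b -u docker.service --no-pager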
Comment by Bodo Graumann (bodo) - Friday, 17 February 2023, 11:54 GMT
Not sure if this is the same problem, but for me docker didn't even start after updating to v23.
Luckily on this machine I don't have to keep any containers or volumes, so I just ran `sudo rm -rf /var/lib/docker`.
Now it works again, but that really shouldn't be the way to go :-/

Relevant logs:

Feb 17 12:29:55 LOCALHOST dockerd[1040]: time="2023-02-17T12:29:55.726920766+01:00" level=warning msg="Usage of loopback devices is strongly discouraged for production use. Please use `--storage-opt dm.thinpooldev` or use `man dockerd` to refer to dm.thinpooldev section." storage-driver=devicemapper
Feb 17 12:29:55 LOCALHOST dockerd[1040]: time="2023-02-17T12:29:55.764271718+01:00" level=warning msg="Base device already exists and has filesystem xfs on it. User specified filesystem will be ignored." storage-driver=devicemapper
Feb 17 12:29:55 LOCALHOST dockerd[1040]: time="2023-02-17T12:29:55.827565418+01:00" level=error msg="[graphdriver] prior storage driver devicemapper is deprecated and will be removed in a future release; update the the daemon configuration and explicitly choose this storage driver to continue using it; visit https://docs.docker.com/go/storage-driver/ for more information"
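For the devicemapper deprecation in those logs, a minimal sketch of explicitly choosing a storage driver in /etc/docker/daemon.json (note: switching drivers hides existing images and containers until you switch back or migrate them):

{
  "storage-driver": "overlay2"
}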
Comment by Morten Linderud (Foxboron) - Friday, 17 February 2023, 11:58 GMT
Users have to deal with deprecated features when packages update. So I don't consider that particular error a bug; however, I suspect it's a different issue.
Comment by l3s2d (l3s2d) - Saturday, 18 February 2023, 02:08 GMT
I believe this issue is related to https://bugs.archlinux.org/task/77548, please see that one for more info.
Comment by Toolybird (Toolybird) - Thursday, 23 February 2023, 22:20 GMT
Dupe  FS#77622 
Comment by Anatoly Shipov (aashipov) - Friday, 24 February 2023, 09:31 GMT
Unsure if it's a distro-specific issue, so I filed it in the moby bug tracker https://github.com/moby/moby/issues/45072; copying it here for future reference.
Comment by Amir Zarrinkafsh (nightah) - Monday, 27 February 2023, 03:00 GMT
To provide some context, I had the same issue as logged here: https://bugs.archlinux.org/task/77622?project=5&string=docker.

I could reproduce the issue with Docker 23.0.1 if the storage backend was overlay2; however, the issue wasn't reproducible with the ZFS storage backend.

It seems that https://bugs.archlinux.org/task/77548 is related and perhaps part of the cause. With Docker 23.0.1, setting LimitNOFILE=1048576 in containerd.service makes services that previously ballooned in memory and were often non-functional actually work, though I'm not sure whether this is just hiding some other issue.
This also seems to explain why an outright downgrade and restart of the docker service alone does not suffice, and why users reported having to reboot. Testing a downgrade together with a restart of both docker and containerd seems to restore services.
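A minimal sketch of that workaround as a systemd drop-in (the values are taken from this thread; adjust as needed):

# /etc/systemd/system/containerd.service.d/override.conf
[Service]
LimitNOFILE=1048576

sudo systemctl daemon-reload
sudo systemctl restart containerd.service docker.service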

EDIT: I also noticed that this issue manifests itself in k3s because it uses containerd by default.
It seems that systemd changes the value of nr_open, per the following commit: https://github.com/systemd/systemd/commit/a8b627aaed409a15260c25988970c795bf963812, which results in LimitNOFILE=infinity actually resolving to 1073741816. https://access.redhat.com/solutions/1479623 has a pretty good summary of fs.nr_open and its changes over time.

Testing on Ubuntu and other Linux variants where this issue doesn't manifest with Docker 23.0.1 showed that nr_open was instead set to 1048576, which aligns with https://github.com/containerd/containerd/issues/3201.
I'm not sure why this seems to affect the newer version of Docker but I'll provide that context in https://github.com/moby/moby/issues/45072 for follow-up.
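To compare these values on a running Arch system (process names assume the stock service units):

cat /proc/sys/fs/nr_open
grep 'Max open files' /proc/$(pidof containerd)/limits
grep 'Max open files' /proc/$(pidof dockerd)/limits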
Comment by Pieter Frenssen (pfrenssen) - Monday, 03 April 2023, 08:26 GMT
This thread also has a lot of interesting information about this bug: https://github.com/containerd/containerd/pull/7566
Comment by Toolybird (Toolybird) - Tuesday, 02 May 2023, 21:44 GMT
So what's the latest here? Are folks still affected with latest pkgs? The evidence seems to point to this being an upstream config issue that can be worked around?
Comment by Morten Linderud (Foxboron) - Tuesday, 02 May 2023, 21:50 GMT
Actually, this seems to have been a Go compiler issue. I didn't catch this before doing the package release earlier today, so I'll rebuild docker against the latest go release, and someone can help me verify whether that solves it.

https://github.com/golang/go/commit/ecf7e00db8b8f5fff622f232bd8c515814c2ecc6

https://github.com/golang/go/issues/59064
Comment by Morten Linderud (Foxboron) - Tuesday, 02 May 2023, 21:59 GMT
Docker 1:23.0.5-2 has been rebuilt with go 2:1.20.4-1 which *should* contain the correct rlimit fix for this.

Please test the package.
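A quick way to verify the rebuilt package, assuming it is installed and both containerd and docker have been restarted (on affected setups the in-container limit was reported around 1073741816):

docker version --format '{{.Server.Version}}'
docker run --rm alpine sh -c 'ulimit -n'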
Comment by Morten Linderud (Foxboron) - Thursday, 11 May 2023, 20:59 GMT
Nobody?
Comment by Kirill (Zack) - Thursday, 11 May 2023, 22:18 GMT
I'll have a look on the weekend.
Comment by Kirill (Zack) - Sunday, 14 May 2023, 11:11 GMT
Seems to be working fine on docker 1:23.0.6-1.
