FS#77481 - [docker] 1:23.0.1-1 broken, either won't start or sends SIGTERM to the app inside container
Attached to Project:
Community Packages
Opened by Kirill (Zack) - Monday, 13 February 2023, 11:38 GMT
Last edited by Toolybird (Toolybird) - Sunday, 14 May 2023, 21:51 GMT
Details
Description:
Updating to community/docker 1:23.0.1-1 from 1:20.10.9-1 results in random container crashes, or in some cases the Docker engine not being able to start at all. The app inside the container catches SIGTERM. Rolling back to 1:20.10.9-1 solves the problem.

Forum ref: https://bbs.archlinux.org/viewtopic.php?pid=2084430

Additional info:
* package version(s): 1:23.0.1-1
* config and/or log files etc.: posted by another user with a crashing service:

> journalctl -xeu docker.service
Feb 13 10:39:26 beta systemd[1]: docker.service: Failed with result 'exit-code'.
░░ Subject: Unit failed
░░ Defined-By: systemd
░░ Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel
░░
░░ The unit docker.service has entered the 'failed' state with result 'exit-code'.
Feb 13 10:39:26 beta systemd[1]: Failed to start Docker Application Container Engine.
░░ Subject: A start job for unit docker.service has failed
░░ Defined-By: systemd
░░ Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel
░░
░░ A start job for unit docker.service has finished with a failure.
░░
░░ The job identifier is 1229 and the job result is failed.

Steps to reproduce:
Update to 1:23.0.1-1 and start the Docker engine. If it starts at all, run for example lscr.io/linuxserver/wireguard:latest and try to connect with a client; the container receives SIGTERM.
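A rough reproduction sketch, assuming default daemon settings; the run flags for the wireguard image (e.g. NET_ADMIN) follow that image's documentation and may need adjusting for your setup:

# update to the affected version and restart the engine
sudo pacman -S docker            # 1:23.0.1-1 at the time of this report
sudo systemctl restart docker.service

# start the container that was observed to receive SIGTERM
sudo docker run -d --name wireguard --cap-add=NET_ADMIN lscr.io/linuxserver/wireguard:latest

# connect with a WireGuard client, then watch the container logs for the SIGTERM
sudo docker logs -f wireguard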
This task depends upon
Closed by Toolybird (Toolybird)
Sunday, 14 May 2023, 21:51 GMT
Reason for closing: Fixed
Additional comments about closing: docker 1:23.0.6-1
Downgraded docker to 20.10.23, rebooted, now everything works just fine.
It seems downgrading to Docker 20.10.23 has resolved that issue for now.
It also seems a restart of the docker.service was not enough and I needed to restart my machine.
During the packaging step it just hung for about 3 minutes in the "fakeroot" process until it finally started to package.
Downgrading also fixed the issue for me.
I tried the most obvious things: disabling hugepages completely, looking at the MySQL process memory maps, but there is no obvious solution. It's just that the MySQL process grabs 4-5 times more RAM than usual for some reason. Other containers are affected too (rabbitmq, for example), but not as much; MySQL is the largest offender. The issue is so low-level that I have no idea how to debug it.
Probably should've reported upstream, but the package in this state is also unusable.
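For anyone comparing memory usage before and after the update, a minimal sketch (the container and process names are the ones mentioned above; adjust to your setup):

# per-container memory usage at a glance
sudo docker stats --no-stream

# resident and virtual size of the mysqld process as seen from the host
grep -E 'VmRSS|VmSize' /proc/$(pidof mysqld)/status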
Please supply proper verbose debugging logs from docker and attach them. Output from failing services isn't super useful.
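For reference, one way to produce such logs, assuming logging goes to journald and that /etc/docker/daemon.json either doesn't exist yet or can be merged by hand:

# enable verbose daemon output (merge with any existing daemon.json instead of overwriting)
echo '{ "debug": true }' | sudo tee /etc/docker/daemon.json
sudo systemctl restart docker.service

# reproduce the failure, then collect the log since the restart and attach it
journalctl -u docker.service --no-pager --since "15 min ago" > docker-debug.log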
Luckily on this machine I don't have to keep any containers or volumes, so I just `sudo rm /var/lib/docker -rf`.
Now it works again, but that really shouldn't be the way to go :-/
Relevant logs:
Feb 17 12:29:55 LOCALHOST dockerd[1040]: time="2023-02-17T12:29:55.726920766+01:00" level=warning msg="Usage of loopback devices is strongly discouraged for production use. Please use `--storage-opt dm.thinpooldev` or use `man dockerd` to refer to dm.thinpooldev section." storage-driver=devicemapper
Feb 17 12:29:55 LOCALHOST dockerd[1040]: time="2023-02-17T12:29:55.764271718+01:00" level=warning msg="Base device already exists and has filesystem xfs on it. User specified filesystem will be ignored." storage-driver=devicemapper
Feb 17 12:29:55 LOCALHOST dockerd[1040]: time="2023-02-17T12:29:55.827565418+01:00" level=error msg="[graphdriver] prior storage driver devicemapper is deprecated and will be removed in a future release; update the the daemon configuration and explicitly choose this storage driver to continue using it; visit https://docs.docker.com/go/storage-driver/ for more information"
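Side note on that last warning: a minimal daemon.json sketch for explicitly choosing a storage driver (overlay2 shown here only as an illustration; switching drivers hides existing images and containers):

/etc/docker/daemon.json:
{
    "storage-driver": "overlay2"
}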
FS#77622
I could reproduce the issue with Docker 23.0.1 if the storage backend was overlay2; however, the issue wasn't reproducible with the ZFS storage backend.
It seems that https://bugs.archlinux.org/task/77548 is related and perhaps part of the cause. With Docker 23.0.1, setting LimitNOFILE=1048576 in containerd.service makes services that previously would start ballooning in memory and often be non-functional actually work, though I'm not sure whether this is just hiding some other issue.
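A minimal sketch of applying that override as a systemd drop-in rather than editing the shipped unit (the 1048576 value is the one from the test above):

sudo systemctl edit containerd.service
# add in the editor:
#   [Service]
#   LimitNOFILE=1048576
sudo systemctl restart containerd.service docker.service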
This also seems to explain why an outright downgrade and restart of the docker service alone would not suffice, and why users report having to reboot. In testing, downgrading and restarting docker along with restarting containerd seems to restore services.
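A rough sketch of such a downgrade-plus-restart, assuming the old package is still in the pacman cache (the exact filename, including the pkgrel, will vary):

# roll back docker from the local package cache
sudo pacman -U /var/cache/pacman/pkg/docker-1:20.10.23-1-x86_64.pkg.tar.zst

# restart containerd as well, not only docker
sudo systemctl restart containerd.service docker.service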
EDIT: I also noticed that this issue manifests itself in k3s because it uses containerd by default.
It seems that systemd changes the value of nr_open, per the following commit: https://github.com/systemd/systemd/commit/a8b627aaed409a15260c25988970c795bf963812, which results in LimitNOFILE=infinity actually representing the value 1073741816. https://access.redhat.com/solutions/1479623 has a pretty good summary of fs.nr_open and its changes over time.
Testing on Ubuntu and other Linux variants where this issue doesn't manifest with Docker 23.0.1 showed that nr_open was instead set to 1048576 which aligns with https://github.com/containerd/containerd/issues/3201.
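For comparison across machines, both values can be checked like this, assuming containerd is running:

# kernel ceiling that systemd raises (LimitNOFILE=infinity then resolves to it)
sysctl fs.nr_open

# effective open-files limits of the running containerd process
grep "open files" /proc/$(pidof containerd)/limits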
I'm not sure why this seems to affect the newer version of Docker but I'll provide that context in https://github.com/moby/moby/issues/45072 for follow-up.
https://github.com/golang/go/commit/ecf7e00db8b8f5fff622f232bd8c515814c2ecc6
https://github.com/golang/go/issues/59064
Please test the package.
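A quick way to verify, assuming the rebuilt package (1:23.0.6-1 per the closing note) is available in the repositories:

sudo pacman -Syu docker
sudo systemctl restart containerd.service docker.service
# re-run the reproduction from the description and check that the container no longer receives SIGTERM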