FS#74397 - [kubelet] Image garbage collection broken because imageFs can't be found

Attached to Project: Community Packages
Opened by Wolfgang Walther (wolfgangwalther) - Friday, 08 April 2022, 07:01 GMT
Last edited by David Runge (dvzrv) - Friday, 29 April 2022, 07:02 GMT
Task Type Bug Report
Category Packages
Status Closed
Assigned To David Runge (dvzrv)
Christian Rebischke (Shibumi)
Morten Linderud (Foxboron)
Architecture All
Severity High
Priority Normal
Reported Version
Due in Version Undecided
Due Date Undecided
Percent Complete 100%
Votes 0
Private No

Details

I'm running a Kubernetes cluster on Arch. The journal for kubelet.service is flooded with the following errors every few minutes:

E0407 [timestamp] 479 kubelet.go:1347] "Image garbage collection failed multiple times in a row" err="failed to get imageFs info: non-existent label \"docker-images\""

After running the cluster for a few weeks with a lot of containers spinning up and down (running a GitLab CI instance with GitLab Runner), the disk was 100% full because no images were cleaned up. I observe the same on both nodes, which are set up identically:
- Using docker/containerd as the container runtime
- btrfs as the root filesystem
- btrfs as the docker storage driver

I'm running kubelet 1.23.5-1 right now, but the problem started a few versions back.

This issue reliably shows up when kubelet.service is started before docker.service, as described in [1]. When docker.service is started first, the problem disappears. Adding After=docker.service to the [Unit] section of kubelet.service fixed it for me.
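
For reference, a minimal drop-in expressing this ordering could look like the following (the drop-in path and file name below are only placeholders; the After=docker.service line is the actual change):

    # /etc/systemd/system/kubelet.service.d/10-after-docker.conf (placeholder name)
    # Start kubelet only after docker, so the "docker-images" imageFs label
    # exists when kubelet runs its image garbage collection.
    [Unit]
    After=docker.service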

It seems like this was fixed in cri-o last year [2], although I would argue that this is better fixed in the kubelet.service file for all container runtimes together.

[1]: https://github.com/cri-o/cri-o/issues/4437
[2]: https://github.com/cri-o/cri-o/pull/4443

Closed by  David Runge (dvzrv)
Friday, 29 April 2022, 07:02 GMT
Reason for closing:  Fixed
Additional comments about closing:  Fixed with kubelet 1.23.6-2
Comment by David Runge (dvzrv) - Thursday, 28 April 2022, 07:17 GMT
@wolfgangwalther: Thanks for the ticket and thanks for being one of the brave ones running their own cluster! :)

We're always happy about feedback. I'll look into this as soon as possible!
Comment by David Runge (dvzrv) - Thursday, 28 April 2022, 10:28 GMT
@wolfgangwalther: Can you check whether it is enough to actually order containerd.service before kubelet.service (as that is the container runtime)?
Comment by David Runge (dvzrv) - Thursday, 28 April 2022, 10:35 GMT
I have created a PR upstream for this: https://github.com/containerd/containerd/pull/6873
Comment by Wolfgang Walther (wolfgangwalther) - Thursday, 28 April 2022, 11:28 GMT
> @wolfgangwalther: Can you check whether it is enough to actually order containerd.service before kubelet.service (as that is the container runtime)?

I checked, and this does **not** fix it.

I did the following:
1. Removed After=docker.service from kubelet.service.
2. Rebooted and observed the error message showing up again after about 5 minutes.
3. Added Before=kubelet.service to containerd.service.
4. Rebooted and observed the error message again at 5-minute intervals.
5. Moved Before=kubelet.service from containerd.service to docker.service.
6. Rebooted, and after 15 minutes there was still no error message; it's working again.

Adding this to containerd.service does not fix it, but adding it to docker.service does.
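
For clarity, the working variant from steps 5 and 6 corresponds to a drop-in on docker.service roughly like this (path and file name are again placeholders):

    # /etc/systemd/system/docker.service.d/10-before-kubelet.conf (placeholder name)
    # Make sure docker is started before kubelet.
    [Unit]
    Before=kubelet.service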
Comment by David Runge (dvzrv) - Thursday, 28 April 2022, 11:43 GMT
@wolfgangwalther: Many thanks! I'll open a PR for the other upstream as well then and will add a system-wide override for docker.service in the meantime.
Comment by David Runge (dvzrv) - Friday, 29 April 2022, 06:59 GMT
Okay, never mind. I don't think either containerd or docker is interested in a change like that for their respective services.
I'll modify kubelet.service, although I am not happy about gathering this "special knowledge" in various downstream locations.
Hopefully this can be upstreamed to the Kubernetes project.
