FS#77736 - [docker] cvmfs container mounts slowly after 23.0.1 upgrade

Attached to Project: Community Packages
Opened by WuJJ (wujj123456) - Monday, 06 March 2023, 01:33 GMT
Last edited by Toolybird (Toolybird) - Thursday, 23 November 2023, 21:35 GMT
Task Type Bug Report
Category Packages
Status Closed
Assigned To Lukas Fleischer (lfleischer)
Sébastien Luttringer (seblu)
Morten Linderud (Foxboron)
Architecture All
Severity Medium
Priority Normal
Reported Version
Due in Version Undecided
Due Date Undecided
Percent Complete 100%
Votes 0
Private No

Details

Description: After I upgraded to docker 23.0.1 (build a5ee5b1dfc) through pacman -Syu, cvmfs container mounts very slowly. What used take a few seconds for each mount now takes 10+ minutes. The cvmfs image and version hasn't changed. This issue started happening a few weeks ago. Today I finally figured out that downgrading docker to 20.10.9 restores previously good behavior. I've been able to reproduce this in a freshly installed ArchLinux VM as well.

Installing a similar kernel (6.2.2-060202-generic) and same docker version (23.0.1, build a5ee5b1) on Ubuntu 22.10 server doesn't reproduce such behavior, which is why I believe this might be Arch specific.

Additional info:
* package version(s)
* docker-1:23.0.1-1
* kernel: linux-zen-6.2.2.zen1-1
* config and/or log files etc.
* link to upstream bug report, if any

Steps to reproduce:
1. Start the container.

docker run --rm -d --name cvmfs \
-e CVMFS_HTTP_PROXY="auto;DIRECT" \
-e CVMFS_REPOSITORIES="atlas.cern.ch,atlas-condb.cern.ch,grid.cern.ch,cernvm-prod.cern.ch,sft.cern.ch,alice.cern.ch" \
--cap-add SYS_ADMIN \
--device /dev/fuse \
--volume /cvmfs:/cvmfs:shared \
cvmfs/service:latest

2. This would print out output like the following. Note that /cvmfs will be created with corresponding mount points.
```
CernVM-FS service container version 2.10.0-1
<omitted>
[INF] mounting cvmfs-config.cern.ch atlas.cern.ch atlas-condb.cern.ch grid.cern.ch cernvm-prod.cern.ch sft.cern.ch alice.cern.ch
CernVM-FS: loading Fuse module... done
CernVM-FS: mounted cvmfs on /cvmfs/cvmfs-config.cern.ch
<more loading/mounted lines omitted>
[INF] done mounting, entering service life cycle
```

Normally, it should only take a few seconds for each pair of "loading Fuse module" and "mounted ..." to print. However, currently on Arch, each repo would take 10+ minutes before the mount completes. They do eventually succeed and once the "entering service life cycle" line is printed, the mounted directories behave as normal, serving files and directories just fine.

Unfortunately I am not the developer of cvmfs, so couldn't easily explain what it's trying to do. Their overview page might be useful: https://cvmfs.readthedocs.io/en/stable/cpt-overview.html. The repos here are all public, so you can run the command as is.
This task depends upon

Closed by  Toolybird (Toolybird)
Thursday, 23 November 2023, 21:35 GMT
Reason for closing:  None
Additional comments about closing:  Unfortunately, no-one seems to have any ideas about this. If it's still an issue, please take it to the support channels (Forum etc) to see if anyone can assist troubleshoot it.
Comment by WuJJ (wujj123456) - Monday, 06 March 2023, 01:40 GMT
Forgot to mention the downgrade part. Running the following command and then reboot would restore the good startup behavior for the cvmfs container.

$ pacman -U https://archive.archlinux.org/packages/d/docker/docker-1%3A20.10.9-1-x86_64.pkg.tar.zst
<reboot>
$ docker --version
Docker version 20.10.9, build c2ea9bc90b

Upgrading with "pacman -Syu" and reboot would reproduce the slow mounts again.
Comment by Toolybird (Toolybird) - Monday, 06 March 2023, 05:44 GMT
This is more than likely dupe of  FS#77481 . Could you please try the LimitNOFILE=xxx workaround mentioned there?
Comment by WuJJ (wujj123456) - Monday, 06 March 2023, 06:54 GMT
The workaround didn't help here. I changed LimitNOFILE to 1048576 in /usr/lib/systemd/system/docker.service and the issue persisted after a reboot. I also tried forcing a system wide fs.nr_open by adding it in /etc/sysctl.d/, verified number after reboot, but the mount is still as slow.

Regarding other behaviors in that task, the container's memory usage didn't balloon during the startup period. The container always successfully start eventually, just super slow. I indeed also need to reboot after downgrading docker in order to get the good behavior back though. Just restarting the docker service after downgrading doesn't help.
Comment by Toolybird (Toolybird) - Monday, 06 March 2023, 07:15 GMT
Ok, thanks for trying. Your Ubuntu tests are good info, but still inconclusive because nobody can pinpoint an Arch packaging bug. You will need to debug this somehow. I would suggest performing a git bisection [1]. If successful we will have a much better idea of where the problem lies.

[1] https://wiki.archlinux.org/title/Bisecting_bugs_with_Git
Comment by WuJJ (wujj123456) - Monday, 06 March 2023, 08:04 GMT
You mean bisecting docker to find a potential relevant commit? https://github.com/docker/docker directs to https://github.com/moby/moby now. Would that be the right one to start?
Comment by Toolybird (Toolybird) - Monday, 06 March 2023, 21:00 GMT
> You mean bisecting docker to find a potential relevant commit?

Yes. You'll need to be reasonably proficient at building Arch pkgs. But now that I look at the docker PKGBUILD, it's a bit complicated so it could be a challenge. Let's see if the PM's have any ideas.
Comment by WuJJ (wujj123456) - Tuesday, 07 March 2023, 03:48 GMT
Do you mind pointing me to the git tree anyway? I later realized it can't be the upstream tree if the same version works in Ubuntu. I've only built a few small AUR, so I am a bit curious where the tree for non-AUR lives, even if I may not be able to successfully build.

In addition, I found some more interesting behavior. It feels like some state on the system permanently changed until reboot after starting the docker systemd service.

1. Fresh reboot of a newly installed VM.
2. `sudo dockerd` in one terminal, which runs what pacman installed in /usr/bin/.
3. Start the cvmfs container with the command described above. This mounts directories fast. Stop the container.
4. Kill the dockerd (ctrl+c)
5. `sudo systemctl start docker`
6. Start the cvmfs container again. Now the mount is very slow. Stop the container.
7. `sudo systemctl stop docker`
8. `sudo dockerd` in terminal again
9. Start the cvmfs container again. The mount remains very slow. Stop the container.
10. Kill the dockerd (ctrl+c)
11. Only a reboot can recover the good mounting time.

PS: I repeated step 2-4, 5-7, 8-10 twice before proceeding to next steps and the behavior was consistent.
Comment by Toolybird (Toolybird) - Tuesday, 23 May 2023, 06:32 GMT
Still happening with latest pkgs?
Comment by WuJJ (wujj123456) - Tuesday, 23 May 2023, 15:34 GMT
Yes, still happening as of docker-1:24.0.0-1
Comment by WuJJ (wujj123456) - Sunday, 02 July 2023, 19:11 GMT
While verifying on latest docker (24.0.2) today, I noticed the container is consuming 100% of a core when the issue is happening. That's probably what was going on previously too but I didn't check then. I took a perf trace as attached. perf.data is the raw data captured with `perf record -g -p <pid>`, while perf.txt is the output of `perf report -g`. Hopefully someone more familiar with the kernel stack might be able to tell what could be causing the issue.
   perf.data (238.7 KiB)
   perf.txt (34.7 KiB)

Loading...