FS#65197 - Increase database size limit

Attached to Project: Pacman
Opened by Matt McDonald (gardotd426) - Saturday, 18 January 2020, 19:46 GMT
Last edited by Andrew Gregory (andrewgregory) - Thursday, 02 July 2020, 20:06 GMT
Task Type: Bug Report
Category: General
Status: Closed
Assigned To: No-one
Architecture: All
Severity: Medium
Priority: Normal
Reported Version: 5.2.1
Due in Version: 5.2.2
Due Date: Undecided
Percent Complete: 100%
Votes: 2
Private: No

Details

Description: After adding the chaotic-aur repository to /etc/pacman.conf, and receiving and signing the keys, pacman -Fy refuses to get the file list:

sudo pacman -Fy --verbose
[sudo] password for matt:
Root : /
Conf File : /etc/pacman.conf
DB Path : /var/lib/pacman/
Cache Dirs: /var/cache/pacman/pkg/
Hook Dirs : /usr/share/libalpm/hooks/ /etc/pacman.d/hooks/
Lock File : /var/lib/pacman/db.lck
Log File : /var/log/pacman.log
GPG Dir : /etc/pacman.d/gnupg/
Targets : None
:: Synchronizing package databases...
core is up to date
extra is up to date
community is up to date
multilib is up to date
valveaur is up to date
error: failed retrieving file 'chaotic-aur.files' from lonewolf-builder.duckdns.org : Maximum file size exceeded
error: failed retrieving file 'chaotic-aur.files' from chaotic.bangl.de : Maximum file size exceeded
error: failed retrieving file 'chaotic-aur.files' from chaotic.bangl.de : Maximum file size exceeded
error: failed retrieving file 'chaotic-aur.files' from repo.kitsuna.net : Maximum file size exceeded
error: failed to update chaotic-aur (download library error)
error: failed to synchronize all databases


I opened an issue on the chaotic-aur GitHub page. pedrohlc, the maintainer, thought he might need to rebuild the repo, but after doing that he ran into the same issue himself and suggested that this must be a problem with Arch and that we should file a bug report. I have a Manjaro installation on this same PC, and the repo works there with zero issues. After he rebuilt the repo I tried again just in case, and the issue persists. If I download chaotic-aur.files myself and place it in /var/lib/pacman/sync, then it all works fine.

Since adding the repo to pacman.conf and receiving and signing the keys (as instructed on the Arch wiki) works fine in Manjaro, the problem is upstream of Manjaro, and since Arch is responsible for pacman bugs and there's nothing upstream of Arch for pacman, this bug falls under Arch's responsibility as per the Arch wiki. I'm more than happy to provide any needed info from either my Arch or Manjaro installations (in case you need to compare). From what I read in the wiki article on how to file a bug report, I figured this should be categorized as a Medium-level bug (non-essential broken function); I apologize if I categorized it wrong.



Additional info:
* package version(s)
5.2.1-4

* config and/or log files etc.
will attach /etc/pacman.conf


Steps to reproduce:

Add the following to /etc/pacman.conf:
[chaotic-aur]
Server = http://lonewolf-builder.duckdns.org/$repo/x86_64
Server = http://chaotic.bangl.de/chaotic-aur/x86_64
Server = http://chaotic.bangl.de/$repo/x86_64
Server = https://repo.kitsuna.net/x86_64

Run the following:
sudo pacman-key --keyserver keys.mozilla.org -r 3056513887B78AEB
sudo pacman-key --lsign-key 3056513887B78AEB


Run:
sudo pacman -Syy

After which, run:

sudo pacman -Fy
..to get the package database. At that point, it will fail with the errors listed above. If, however, you add chaotic-aur.files to /var/lib/pacman/sync after it fails, and run it again, it will succeed.
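
The manual workaround mentioned above can be sketched roughly as follows. This is only an illustration: the helper names files_db_url and fetch_files_db are made up here, and it assumes the files database sits next to the regular database on the mirror as <repo>.files.

```shell
#!/bin/sh
# Expand the $repo and $arch placeholders of a pacman.conf Server line and
# append the name of the files database (assumed to be <repo>.files).
files_db_url() {
    printf '%s\n' "$1" |
        sed -e 's|\$repo|'"$2"'|g' \
            -e 's|\$arch|'"$3"'|g' \
            -e 's|/*$|/'"$2"'.files|'
}

# Download the files database by hand and copy it into pacman's sync
# directory. Needs network access and root; nothing calls it automatically.
fetch_files_db() {
    url=$(files_db_url "$1" "$2" "$3")
    curl -fLo "/tmp/$2.files" "$url" &&
        sudo install -m 644 "/tmp/$2.files" /var/lib/pacman/sync/
}
```

For example, fetch_files_db 'http://chaotic.bangl.de/$repo/x86_64' chaotic-aur x86_64 would fetch chaotic-aur.files from the third Server line above. The single quotes keep the shell from expanding $repo itself.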

Here is the link to the GitHub issue, if anyone needs it: https://github.com/PedroHLC/chaotic-aur/issues/36
pedrohlc also said that Arch's own repo-add tool generated the db files in the first place. I'm more than happy to help however I can.

Closed by Andrew Gregory (andrewgregory)
Thursday, 02 July 2020, 20:06 GMT
Reason for closing:  Fixed
Additional comments about closing:  Commit 2856a7dea3c0d4584e126b5ca5957e13e23f83d1
Comment by Robin Broda (coderobe) - Saturday, 18 January 2020, 19:48 GMT
There is a hard limit on database sizes that pacman enforces.
Comment by Matt McDonald (gardotd426) - Saturday, 18 January 2020, 20:20 GMT
Any idea why the limit is different (or if it's different) on Manjaro? From what little I can tell, it pretty much just uses the regular pacman; as far as I know it's not a heavily modified version.
Comment by Matt McDonald (gardotd426) - Saturday, 18 January 2020, 20:24 GMT
Also, wouldn't that require chaotic-aur to be larger than any of the official repos since they don't go over this limit? It doesn't seem like chaotic-aur is anywhere near the size of extra or community.
Comment by Alexander Schnaidt (Namarrgon) - Saturday, 18 January 2020, 20:25 GMT
Comment by Matt McDonald (gardotd426) - Saturday, 18 January 2020, 20:27 GMT
Ahh. Yeah, it does seem that chaotic-aur does go over the 25MB limit for database files. I thought it was a strictly number-of-packages-based thing.
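
A quick way to see which local databases would trip the limit, as a rough sketch. It assumes the limit is 25 MiB (the figure given later in this thread) and that it applies to the size of the downloaded file; the names LIMIT_BYTES, exceeds_limit and report_oversized are made up for illustration and are not part of pacman.

```shell
#!/bin/sh
# 25 MiB in bytes (assumed value of the limit discussed in this thread).
LIMIT_BYTES=$((25 * 1024 * 1024))

# Succeed when the file is strictly larger than the limit.
# $1 = file, $2 = limit in bytes (optional, defaults to LIMIT_BYTES).
exceeds_limit() {
    size=$(wc -c <"$1") || return 2
    size=$((size))   # normalise any padding in wc output
    [ "$size" -gt "${2:-$LIMIT_BYTES}" ]
}

# Print every sync database in directory $1 that is over the limit.
report_oversized() {
    for f in "$1"/*.db "$1"/*.files; do
        [ -e "$f" ] || continue
        if exceeds_limit "$f"; then
            printf '%s\n' "$f"
        fi
    done
}
```

Running report_oversized /var/lib/pacman/sync lists the offending files, if any. A database that was never downloaded because of the error will of course not show up there.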
Comment by Allan McRae (Allan) - Saturday, 18 January 2020, 23:35 GMT
Why is this closed as not a bug? Clearly pacman needs to be adjusted here.
Comment by Robin Broda (coderobe) - Saturday, 18 January 2020, 23:50 GMT
Maybe instead of being hardcoded at compile-time this should be an option available in pacman.conf. What do you think?
Comment by Matt McDonald (gardotd426) - Sunday, 19 January 2020, 00:51 GMT
I think Robin's suggestion makes a ton of sense. The limit does seem arbitrarily low when Manjaro has to patch pacman just for their community repo to work, and when community repositories that are listed on the Arch Wiki, even if they're not "official," can't be used in vanilla Arch.
Comment by Allan McRae (Allan) - Sunday, 19 January 2020, 01:13 GMT
Not adding a config option yet. That would be a big change, and not suitable for backporting. Submitted a patch bumping to 128MB in the meantime.
Comment by Matt McDonald (gardotd426) - Sunday, 19 January 2020, 01:15 GMT
That's totally understandable, I was mainly just thinking about in the future.
Comment by Allan McRae (Allan) - Sunday, 19 January 2020, 01:26 GMT
For the future, I'd prefer removing the limit altogether. It is arbitrary, and a user can see the size of the database being downloaded and stop the download if they perceive something wrong. I see no real justification for a limit at all.
Comment by Matt McDonald (gardotd426) - Sunday, 19 January 2020, 01:28 GMT
Yeah it definitely surprised me at how low it is, it seems arbitrary having a limit at all, but especially having one that low, even more so when Arch very much allows and gives instructions on community repositories which could be very likely to have databases on the larger side.
Comment by Eli Schwartz (eschwartz) - Sunday, 19 January 2020, 04:27 GMT
I can understand wanting to remove the limits altogether, but at the same time I also feel like maybe repositories should not be so large. For e.g. community.files you can get 25% savings just by recompressing from gzip to zstd, and I'd actually prefer more, individually smaller repos since they incur a smaller download cost when updating only one package in a 30 or 40 MB database.
Comment by Allan McRae (Allan) - Sunday, 19 January 2020, 04:36 GMT
@Eli: Create a new bug report for Arch dbscripts. Or a thread on arch-dev-public. Saying that, sticking the ~1000 perl packages that are not needed by anything else in their own repo would be a start!
Comment by Philip Müller (philm) - Sunday, 19 January 2020, 09:31 GMT
We mentioned it in IRC in #archlinux-projects on the 7th of January and pointed to the reason and our quick resolution on our end. Some Arch devs even started to discuss the future of the community repo... splitting it to reduce its size, changing the database file compression... Maybe check the chat logs of that day.
Comment by Matt McDonald (gardotd426) - Sunday, 19 January 2020, 12:06 GMT
Yeah it definitely surprised me at how low it is, it seems arbitrary having a limit at all, but especially having one that low, even more so when Arch very much allows and gives instructions on community repositories which could be very likely to have databases on the larger side.
Comment by Allan McRae (Allan) - Sunday, 19 January 2020, 12:40 GMT
The limit was set when we did not have files databases, and likely before we had signatures in databases too. Without those, repo databases could have grown by a large amount without hitting the limit, something in the range of 100,000 packages. With signatures, this limit could handle a repo of about 40,000 packages. Essentially, perfectly reasonable limits until the files database came along.

Having some limit also prevents a rogue repo database download continuing until the (probably root) partition is full.
Comment by Matt McDonald (gardotd426) - Sunday, 19 January 2020, 13:02 GMT
That's perfectly understandable. I imagine some limits are necessary, but like you said, now with the signatures and everything, 25MB seems incredibly small and like it wouldn't be that detrimental to raise it at least a little.
Comment by Dave Reisner (falconindy) - Monday, 20 January 2020, 13:06 GMT
> The limit was set when we did not have files databases

I think you mean this predates pacman's second coming of the -F flag. Dan and I introduced the limit after we switched from libfetch to libcurl. You can see the rationale at the time in 6dc71926f9b16e. This has aged decently well, aside from the claim that 25MiB is double the size of all the repos including files (currently weighing in at 39MB including staging and testing repos).

I'm not sure where you're getting the idea that DB bloat comes from the addition of signatures. Even with all packages being signed, the files database is generally (at least in current Arch repos) about 2-5x larger than the db[0]. I have a hard time believing that a fixed addition of <500 bytes is responsible for exceeding the DB size limit.

Generally, I'm in favor of keeping the limit and this might actually be a case where it makes sense to make the unknown package/db upper limit a pacman.conf knob. Distros should have an understanding of how big their repos are in order to set sane defaults in the distributed config, and users can bump the limit as they see fit (for cases like custom repos). Alternatively, we can extend the logic in be_sync.c to set a payload max_size based on the db extension, but that's probably just going to bite us again O(years) from now.

[0] for f in /var/lib/pacman/sync/*.db; do awk -v d="$(wc -c <"$f")" -v f="$(wc -c <"${f%.db}.files")" 'BEGIN { inc=(f-d)/d; if (inc) printf "%.2f%%\n", inc }'; done
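
A more readable variant of the one-liner in [0], as a sketch. It multiplies by 100 so the output is a true percentage; the original prints the raw ratio followed by a % sign, so "2.50%" there reads as 2.5x. The function names pct_increase and compare_sync_dbs are made up for illustration.

```shell
#!/bin/sh
# Percentage by which the files database ($2, bytes) exceeds the db ($1, bytes).
pct_increase() {
    awk -v d="$1" -v f="$2" 'BEGIN {
        if (d > 0) printf "%.2f%%\n", (f - d) / d * 100
        else print "n/a"
    }'
}

# Compare every <repo>.db with its <repo>.files in a sync directory.
compare_sync_dbs() {
    dir=${1:-/var/lib/pacman/sync}
    for db in "$dir"/*.db; do
        files=${db%.db}.files
        # Skip unmatched globs and repos without a files database.
        [ -e "$db" ] && [ -e "$files" ] || continue
        printf '%s: ' "${db##*/}"
        pct_increase "$(wc -c <"$db")" "$(wc -c <"$files")"
    done
}
```

Calling compare_sync_dbs with no argument walks /var/lib/pacman/sync, the same directory the original loop uses.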
Comment by Matt McDonald (gardotd426) - Monday, 20 January 2020, 13:55 GMT
Your first suggestion, having some limit with distros having an understanding of how large their repos are getting and setting sane defaults, but also allowing users to adjust the limit as needed with a pacman.conf line, sounds like a very pragmatic solution, albeit that's without me knowing how difficult that would be to implement. But like Allan said, clearly having a 25MB strict limit with no way for users to adjust this seems untenable. The actual community repository is already at 20M, and I mean I guess splitting repos up could be an alternative solution, but that seems like it's a really awkward and overly complicated solution (with many more potential issues), all for what? I imagine there's some principle behind that argument, such as like Eli mentioned with having smaller download costs, but at a certain point, I don't see it.

But also regarding another thing Eli mentioned, I spoke with pedrohlc and I tested using zstd myself and when I untar-ed the original chaotic-aur.files database and then re-archived it with zstd, it did show a 40 percent decrease in total size, but pedrohlc said he couldn't figure out how to change the compression level with repo-add, and that's how he's making the database archives. And the 40 percent decrease requires the ZSTD_CLEVEL=19 envvar, I'm not sure if repo-add respects that or not.
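
The recompression test described above can be approximated without unpacking the archive, by re-encoding the tar stream directly. This is a sketch, not what repo-add does: it assumes the existing database is a gzip-compressed tar, passes level 19 explicitly with -19 instead of relying on ZSTD_CLEVEL, and the function name recompress_zst is made up.

```shell
#!/bin/sh
# Re-encode a gzip-compressed database archive as zstd at level 19.
# $1 = existing gzip-compressed archive, $2 = output path for the zstd copy.
recompress_zst() {
    gzip -dc -- "$1" | zstd -19 -q -f -o "$2"
}
```

Comparing wc -c on the two files afterwards shows the saving for a given database; the 25% and 40% figures in this thread came from different repos and will vary.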
Comment by Eli Schwartz (eschwartz) - Monday, 20 January 2020, 15:01 GMT
repo-add would respect the COMPRESSZST=() tuneable once I polish up https://patchwork.archlinux.org/patch/1042/
Comment by Matt McDonald (gardotd426) - Tuesday, 12 May 2020, 13:18 GMT
I see this is still open since it's not actually been permanently fixed yet, but the chaotic-aur maintainer just informed me that he's back over the limit again, at 27MB. So apparently the temporary raising of the limit to 128MB has been revoked? I see that the fix is supposed to come in pacman 5.2.2, but there's no due date for when that might be, so just wondering if the limit was indeed revoked and if so if maybe a smaller raise could be reinstated? Right now chaotic-aur like I said is at 27MB.
Comment by Eli Schwartz (eschwartz) - Tuesday, 12 May 2020, 13:22 GMT
It hasn't been reopened since day one, when it was closed as not a bug by one person, then reopened by another person. Nothing has changed. If something had changed, you would have gotten a notification email from this bug report.
Comment by Matt McDonald (gardotd426) - Tuesday, 12 May 2020, 13:24 GMT
Yeah I realized that right after I made the comment, I was editing it as I got the notification for your comment. Sorry about that. So the limit is still 128 then?
Comment by Allan McRae (Allan) - Tuesday, 12 May 2020, 13:27 GMT
The limit in git is 128MB. But this has not been backported to the current release series.
