Pacman

Historical bug tracker for the Pacman package manager.

The pacman bug tracker has moved to GitLab:
https://gitlab.archlinux.org/pacman/pacman/-/issues

This tracker remains open for interaction with historical bugs during the transition period. Any new bug reports will be closed without further action.

FS#73217 - Please consider providing some sort of diff for packages

Attached to Project: Pacman
Opened by Eric Engestrom (1ace) - Saturday, 01 January 2022, 23:58 GMT
Last edited by Allan McRae (Allan) - Saturday, 12 March 2022, 23:42 GMT
Task Type Feature Request
Category Packages: Core
Status Closed
Assigned To No-one
Architecture All
Severity Low
Priority Normal
Reported Version
Due in Version Undecided
Due Date Undecided
Percent Complete 100%
Votes 0
Private No

Details

My specific use-case might be niche (I sometimes have to work on my phone's connection for a week or two), but I think everyone would benefit from being able to download the binary diff since the previous package version instead of downloading the entire package every time.

This would obviously increase the mirror's storage usage, but it would also greatly reduce their bandwidth usage, which I believe to be where most of the cost lies for a mirror provider, so I expect they would approve. This would need to be confirmed with them though, obviously.

Implementation-wise, I'm thinking that whenever a new package is generated, `makepkg` would uncompress the old & new packages and diff them (using `bsdiff` or equivalent), and that diff would be stored next to the new package.
This way, someone who doesn't have the previous package in their cache would download the full package, as before, nothing changes.
But if `pacman` sees a previous version in its cache (such as when performing an `-Syu` update), it would query the mirror for the diff between that version and the latest, and fall back to downloading the full package if that diff doesn't exist.
If it does exist, the diff is applied onto the package in the cache (or directly on the local install?)
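To make the fallback described above concrete, here is a minimal sketch of the client-side decision. It is not pacman code: the function name, the `_to_` diff naming scheme and the idea of passing the mirror's diff listing as a set are all assumptions made up for illustration.

```python
def pick_download(name, target, cached, diffs_on_mirror, arch="x86_64"):
    """Decide whether to fetch a diff or the full package.

    name            -- package name
    target          -- version to install, as "pkgver-pkgrel"
    cached          -- version found in the local cache, or None
    diffs_on_mirror -- set of diff filenames the mirror offers (hypothetical)
    """
    full = f"{name}-{target}-{arch}.pkg.tar.zst"
    if cached is not None and cached != target:
        # Hypothetical naming scheme for the diff stored next to the package.
        diff = f"{name}-{cached}_to_{target}-{arch}.diff"
        if diff in diffs_on_mirror:
            return ("diff", diff)
    # No cached version, or no matching diff: behave exactly as today.
    return ("full", full)
```

With nothing in the cache, or with a mirror that carries no diffs, this always returns the full package, so the existing behaviour stays the default.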

To avoid diffs that are too large (resulting in too much storage cost and not enough bandwidth saving), a threshold can be added: the diff isn't uploaded unless it is smaller than 50% of the full package size. This would apply, for instance, to packages that contain mostly binary files that are recompiled into something completely different each time. The whole "reproducible builds" effort should also help with this, by avoiding compilation outputs that change from one version to the next when the source files haven't changed.
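The threshold check itself would be tiny. A sketch, assuming sizes in bytes and a hypothetical `max_ratio` parameter (the 50% figure is only the example value from this proposal, not a measured optimum):

```python
def should_publish_diff(diff_size, full_size, max_ratio=0.5):
    """Return True if the diff is small enough to be worth uploading.

    diff_size and full_size are sizes in bytes; max_ratio is the
    hypothetical cut-off (0.5 means "strictly less than 50% of the
    full package").
    """
    if full_size <= 0:
        raise ValueError("full_size must be positive")
    return diff_size < max_ratio * full_size
```

For example, `should_publish_diff(40, 100)` is true, while a diff that is exactly half the package size is rejected because the comparison is strict.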

In summary:
- The downside is (on top of the implementation effort) a slightly[*] increased storage requirement and a slightly[*] longer package build & upload for package maintainers.
- The upside is faster download/updates for users, and less bandwidth consumed for users & mirror providers.

[*] Note that I /assume/ that the build time & storage space cost are small, but I haven't built a prototype of this and run it against the existing repos' packages, which is the only way to actually know.
I'm willing to make these implementations, but I first want to make sure this is an idea that has a chance of being accepted, and also I'll need a contact person (or some access to gitlab.archlinux.org) to discuss the makepkg & pacman implementations :)

Closed by  Allan McRae (Allan)
Saturday, 12 March 2022, 23:42 GMT
Reason for closing:  None
Additional comments about closing:  package diffs will only be reconsidered if someone provides details of an implementation with numbers demonstrating its value
Comment by Doug Newgard (Scimmia) - Sunday, 02 January 2022, 00:21 GMT
Considering Pacman removed support for deltas a few years ago, there's really no way for Arch to do this.
Comment by Jonathon (jonathon) - Sunday, 02 January 2022, 00:51 GMT
One approach would be to use zsync as it doesn't need intermediary deltas (use the cached package as seed, remote package as source). It would need a fair bit of work to implement though - both as a pacman download agent, and on the server-side with zsync file generation for packages (or added as part of makepkg).

A PoC might involve a mirror that generates zsync files for packages along with a wrapper that uses zsync/zsync2 to download packages to the cache for use by pacman.
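For anyone unfamiliar with the seed/source idea, here is a heavily simplified sketch of the block-reuse planning step. It is not zsync's actual file format or algorithm (the real tool uses a rolling checksum rather than hashing every window), and `block_digests`, `plan_download` and the tiny block size are all made up for the illustration.

```python
import hashlib

BLOCK = 4  # unrealistically small block size, just to keep the demo readable


def block_digests(data, block=BLOCK):
    """Digests of consecutive blocks; stands in for the server-side control file."""
    return [hashlib.sha256(data[i:i + block]).digest()
            for i in range(0, len(data), block)]


def plan_download(seed, remote_digests, block=BLOCK):
    """Split remote blocks into those found in the seed and those to fetch.

    Returns (reuse, fetch): reuse maps remote block index -> offset in the
    seed, fetch lists the remote block indices that must be downloaded.
    """
    # Index every block-sized window of the seed, at any byte offset.
    available = {}
    for off in range(0, max(len(seed) - block + 1, 0)):
        digest = hashlib.sha256(seed[off:off + block]).digest()
        available.setdefault(digest, off)

    reuse, fetch = {}, []
    for idx, digest in enumerate(remote_digests):
        if digest in available:
            reuse[idx] = available[digest]
        else:
            fetch.append(idx)
    return reuse, fetch
```

The client only needs the digest list and HTTP range requests for the blocks in `fetch`; no per-version delta has to exist on the server.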
Comment by Eric Engestrom (1ace) - Sunday, 02 January 2022, 01:06 GMT
> Considering Pacman removed support for deltas a few years ago

Oh, I wasn't aware of that; do you have a link to the discussion that led to this decision?

> One approach would be to use zsync as it doesn't need intermediary deltas (use the cached package as seed, remote package as source)

That would be a diff of the compressed package then, right? If so, the better the compression algorithm, the more the diff would tend towards 100%; I have no idea how the current zstd would fare, but this doesn't sound like a viable long term solution :/
I don't know zsync though, I'll have a look; thanks!
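A quick way to see the concern about diffing compressed data is to count how many fixed-size blocks two nearly identical payloads share before and after compression. The sketch below uses Python's `zlib` only because it ships with the standard library; it says nothing about how zstd behaves, and the payload generator, block size and patch position are arbitrary choices for the demo.

```python
import random
import zlib

BLOCK = 256  # arbitrary block size for the comparison


def make_payload(seed, n_words=40000):
    """Deterministic, moderately compressible test data."""
    rng = random.Random(seed)
    vocab = [f"sym{i:04d}" for i in range(2000)]
    return " ".join(rng.choice(vocab) for _ in range(n_words)).encode()


def shared_fraction(old, new, block=BLOCK):
    """Fraction of block-aligned blocks of `new` that also occur in `old`."""
    old_blocks = {old[i:i + block] for i in range(0, len(old), block)}
    new_blocks = [new[i:i + block] for i in range(0, len(new), block)]
    hits = sum(1 for b in new_blocks if b in old_blocks)
    return hits / len(new_blocks)


old = make_payload(1)
# Overwrite 1 KiB near the start with different content of the same length.
patch = make_payload(2)[:1024]
new = old[:2048] + patch + old[2048 + 1024:]

raw_shared = shared_fraction(old, new)
comp_shared = shared_fraction(zlib.compress(old), zlib.compress(new))

print(f"shared blocks, uncompressed: {raw_shared:.1%}")
print(f"shared blocks, compressed:   {comp_shared:.1%}")
```

The uncompressed payloads share nearly all of their blocks, while the compressed streams share far fewer, which is why a block-matching tool has little to reuse unless it can look inside the archive.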
Comment by Allan McRae (Allan) - Sunday, 02 January 2022, 01:22 GMT
Binary diffs are near to useless on a rolling release distro. The toolchain and package versions change often enough that a large proportion of updates are not small bug fixes (which binary diffs best support). You save maybe 50% of the download for substantial uncompress/recompress time (the signatures are for the compressed package).
Comment by Eric Engestrom (1ace) - Sunday, 02 January 2022, 01:25 GMT
Answering myself (I found it after asking), the discussion when removing the deltas is here:
https://lists.archlinux.org/pipermail/pacman-dev/2019-March/023211.html
https://lists.archlinux.org/pipermail/pacman-dev/2019-March/023217.html
https://lists.archlinux.org/pipermail/pacman-dev/2019-March/023218.html

It sounds like basically, it was just a bad implementation (inefficient & insecure) so it was removed, but another implementation of the same general idea might work, right?
Comment by Eric Engestrom (1ace) - Sunday, 02 January 2022, 01:26 GMT
> the signatures are for the compressed package

But signing the diff the same way the package is signed would avoid having to do all that, right?
Comment by Eric Engestrom (1ace) - Sunday, 02 January 2022, 01:38 GMT
(Sorry if I seem pushy, I'm just trying to understand what is an actual problem, versus what "just" needs work; I appreciate everyone's time to answer these questions)
Comment by Allan McRae (Allan) - Sunday, 02 January 2022, 02:15 GMT
Other implementations showed little net benefit. If people can provide actual numbers to quantify the benefits over a reasonable period of time (say a month), then it may be considered again.

You could solve your issue by not updating when you only have phone access. I think it would be more constructive to work on your update addiction!

Comment by Eric Engestrom (1ace) - Sunday, 02 January 2022, 16:33 GMT
> I think it would be more constructive to work on your update addiction!

Haha, that's fair :)

I'm leaving this task open as I might give it a shot anyway at some point (and measure the actual savings!), or maybe someone else will want to give it a go.
I'll post a link here if/when I start the makepkg/pacman implementations.
Comment by Jonathon (jonathon) - Monday, 03 January 2022, 19:46 GMT
After an initial bit of testing with zsync2: it does nothing for standard zstd package archives. zsync supports looking inside gzip-compressed archives, so it could potentially be extended to support zstd in a similar way, and zstd has an `--rsyncable` flag that could also help zsync identify common blocks.