FS#4633 - Use hash to check package database version on mirror sites

Attached to Project: Pacman
Opened by Anonymous Submitter - Sunday, 14 May 2006, 00:00 GMT
Last edited by arjan timmerman (blaasvis) - Thursday, 25 May 2006, 15:53 GMT
Task Type Feature Request
Category
Status Closed
Assigned To Judd Vinet (judd)
Architecture not specified
Severity Low
Priority Normal
Reported Version 0.7.1 Noodle
Due in Version Undecided
Due Date Undecided
Percent Complete 100%
Votes 0
Private No

Details

Hi,

Every time I run "pacman -Sy" to update my package database files, pacman downloads the full database files. Although downloading around a megabyte each time is probably not a big load on the mirror sites, it seems to me that it would be more efficient (and more polite to the mirror sites) to have pacman download a file containing an md5sum or a sha1sum (or similar hash), compare it with the locally calculated or stored hash, and download the package database file only if there is a difference.

The goal of this feature request is similar to FS#2583, although that one proposes using the HTTP Last-Modified header to achieve the same thing. I think the drawback of Last-Modified is that its value is generated by the mirror site's web server rather than calculated upstream by the Arch Linux distribution. I think it would be better to have pacman rely only on data generated by the Arch Linux project itself, rather than on the web server at a mirror site.

Because of the importance of the package database, if the database is downloaded after a hash mismatch, I think there would be value in then comparing a locally calculated hash of the database with the downloaded hash file. If there is a mismatch at that point, it would mean that the mirror site is inconsistent, and therefore a different mirror should be used.
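
The proposed check could be sketched roughly as follows. This is only an illustration of the idea in this request, not anything pacman does: the function names, the return values, and the choice of SHA-1 are assumptions, and the two callables stand in for whatever would actually download the hash file and the database from a mirror.

```python
import hashlib
from typing import Callable, Optional, Tuple


def sha1_hex(data: bytes) -> str:
    """Return the SHA-1 digest of data as a lowercase hex string."""
    return hashlib.sha1(data).hexdigest()


def sync_db(
    local_hash: Optional[str],
    fetch_hash: Callable[[], str],
    fetch_db: Callable[[], bytes],
) -> Tuple[str, Optional[bytes]]:
    """Hypothetical sync step: download the database only when its hash changed.

    local_hash is the stored hash of the local database (None if there is none).
    fetch_hash and fetch_db are placeholders for the real downloads.
    """
    remote_hash = fetch_hash().strip().lower()

    # Same hash as the local copy: nothing to download.
    if local_hash is not None and local_hash.strip().lower() == remote_hash:
        return ("up-to-date", None)

    # Hash differs (or no local copy): fetch the full database.
    db = fetch_db()

    # Verify the download against the published hash. A mismatch here
    # suggests the mirror is inconsistent and another one should be tried.
    if sha1_hex(db) != remote_hash:
        return ("mirror-inconsistent", None)

    return ("updated", db)
```

In this sketch the caller would react to "mirror-inconsistent" by moving on to the next configured mirror.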

Anyway, just a thought.

Regards,
Mark.

Closed by Judd Vinet (judd)
Thursday, 25 May 2006, 17:12 GMT
Reason for closing: Works for me
Comment by Jens Adam (byte) - Sunday, 14 May 2006, 14:31 GMT
There is no need for this as Pacman only downloads the db.tar.gz files if their server time is newer than their .lastupdate entries, and the files of [current], [extra], [community], [unstable] and [testing] together are only ~400 KB.
Comment by Anonymous Submitter - Tuesday, 16 May 2006, 10:50 GMT
Hi Jens,

Thanks for answering, I didn't realise pacman did that. Every time I've run it, it seemed to be downloading the files.

Do you happen to know whether what you've said about the check still applies if an external download utility is used? Fairly recently I set the XferCommand option in my pacman.conf file to use wget, to enable download speed limiting via the --limit-rate option (note that I thought it was always downloading the db.tar.gz files even before I made this change, so I don't think the two coincided). Being able to set a download speed limit is quite useful on a slower link, so that you can still do other things like browse the web while the package downloads are occurring (of course I could use the Linux kernel's network traffic shaping capabilities to do this, but that is a lot more work than specifying a limit on wget to achieve the same or a similar result).

I do agree 400 KB isn't a lot to download. It just seems to me that downloading a hash file of a few KB and then comparing hashes would achieve the same result with a lot more network/mirror site efficiency, as well as providing a further level of assurance of the integrity of the repository files. (Actually, thinking about it from a security point of view, it would be better to download the hash from the main archlinux.org site, and then compare it with the mirror site. This would further protect against a mirror site either being out of date or having been intentionally subverted.)

Thanks,
Mark.

Comment by Jens Adam (byte) - Tuesday, 16 May 2006, 14:52 GMT
I've always used pacman itself for syncing and downloading. I have two different pacman.conf files: one with all default repositories except [testing] enabled and Server=ftp.archlinux.org, and a second that is identical but uses a different mirror (closer/faster). pacman's time comparison is very simple: 1) log in anonymously via FTP and cd to ...current/os/i686/ 2) MDTM current.db.tar.gz 3) compare that string to /var/lib/pacman/current/.lastupdate 4) if it's different (older or newer), fetch the db file.
If you use an XferCommand or an HTTP mirror, then I'd guess you are right: pacman always refreshes the db files.
The security aspect is a nice one, never thought of that.
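
A minimal sketch of the four-step comparison described above, with the network part left out. The function names are made up for illustration, and it assumes .lastupdate holds the same timestamp string that the server returns; it is not pacman's actual code. A successful MDTM reply has the form "213 YYYYMMDDHHMMSS"; anything else is treated as "MDTM not available".

```python
from typing import Optional


def parse_mdtm_reply(reply: str) -> Optional[str]:
    """Extract the timestamp from an FTP MDTM reply such as '213 20060516145200'.

    Returns None when the reply is not a successful (213) response.
    """
    parts = reply.strip().split(None, 1)
    if len(parts) == 2 and parts[0] == "213":
        return parts[1].strip()
    return None


def should_fetch_db(mdtm_reply: Optional[str], lastupdate: Optional[str]) -> bool:
    """Decide whether to download the db file (hypothetical helper).

    mdtm_reply is the raw server reply, or None if no reply was obtained.
    lastupdate is the stored .lastupdate string, or None if the file is missing.
    """
    if mdtm_reply is None:
        return True  # no timestamp available: always fetch
    remote = parse_mdtm_reply(mdtm_reply)
    if remote is None:
        return True  # server does not support MDTM: always fetch
    if lastupdate is None:
        return True  # nothing stored locally yet
    # Fetch whenever the strings differ, whether older or newer.
    return remote != lastupdate.strip()
```

With Python's ftplib, the reply string could be obtained with something like `ftp.sendcmd("MDTM current.db.tar.gz")`, which raises an exception rather than returning when the server rejects the command.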
Comment by Jens Adam (byte) - Tuesday, 16 May 2006, 14:56 GMT
Forgot one thing: when using an FTP server that does not support the MDTM command, pacman would also fetch the db on every -Sy.
Comment by Robert Howard (iBertus) - Saturday, 20 May 2006, 23:18 GMT
I would think it more efficient to only download the changed parts of the database, but this would be a major pain to implement and would interfere with the gzipping of the database. As long as the database is under 1MB it shouldn't be a problem to download.
Comment by Judd Vinet (judd) - Thursday, 25 May 2006, 17:12 GMT
Jens, as you said, pacman needs an FTP server with the MDTM command in order to stat the DB files before downloading. If it has this, then DB files are only downloaded if they're newer than the ones already on your local machine.

We don't support this method when using HTTP in pacman2, but we should have it in pacman3.
