FS#2264 - pacman should make SOME use of bzip2

Attached to Project: Pacman
Opened by Nikos Kouremenos (zeppelin) - Wednesday, 23 February 2005, 17:47 GMT
Last edited by Dan McGee (toofishes) - Thursday, 04 September 2008, 15:21 GMT
Task Type Feature Request
Category
Status Closed
Assigned To Aaron Griffin (phrakture)
Architecture not specified
Severity Very Low
Priority Normal
Reported Version 0.7 Wombat
Due in Version Undecided
Due Date Undecided
Percent Complete 100%
Votes 2
Private No

Details

to the beloved judd,

CPUs evolve fast, and so does memory. Arch targets i686 and has never said it wants to go i486 or i386 or anything like that. I wonder why we keep insisting on gz when we have bzip2, which is *superior* in file size. bzip2 is no longer the new, unproven thing that rzip is now: it's here, it's stable, and it's working. If you don't like it for some reason, I can understand that. But I shouldn't get the "the compression time costs more than the benefit you get in file size" answer when we're talking about packages such as xorg, ooo, samba, mono, etc. I mean, every day that passes Arch becomes more famous, and I worry about the server costs you guys have to pay. At least this way you could free some MB (maybe even GB) per week on archlinux.org.

As a start, the db should be bzip2ed instead of gzipped. At least this! :)

Judd, I was happy to propose comparing the file sizes of the DBs and not updating them if not needed [and could only propose it, because it's in pure old C :P]; now both I and your servers would be happy to see this. :)
This task depends upon

Closed by  Dan McGee (toofishes)
Thursday, 04 September 2008, 15:21 GMT
Reason for closing:  Won't implement
Additional comments about closing:  We can reopen this if it becomes necessary or someone identifies more use cases on the ML and helps us implement more generic archive handling.
Comment by Andreas Radke (AndyRTR) - Wednesday, 27 December 2006, 06:08 GMT
I vote for the change gz -> bz2.

Especially for the x86_64 port: our binaries are somewhat bigger than on i686, and CPU power is not a real problem there. But bandwidth still is.
Comment by Aaron Griffin (phrakture) - Thursday, 28 December 2006, 17:05 GMT
This is a much larger issue than a simple change from gz to bz2. It requires lots of discussion and the like, and it has been discussed a lot before. The basic issue is that there is a trade-off: binaries won't compress that much more, percentage-wise, and you're increasing CPU load and time while decreasing bandwidth and disk space. It's the typical speed/space trade-off that every computer science student discusses in some class.

The fact is, it's been decided and there are pros and cons for each side.

I will leave this open so it can be referenced at a later date.

I leave you with one interesting tidbit:
Try out a bz2 database with pacman3. I _think_ it should work out of the box, assuming you change the DB_EXT definition to "db.bz2".
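A minimal sketch of that experiment, assuming DB_EXT really is a compile-time define in the pacman 3 source (the exact macro name and location are an assumption here):

```c
/* Hypothetical: if the sync-db extension is a compile-time define as
 * suggested above, the whole change might be this one line; libarchive
 * would detect the bzip2 stream on its own. */
#define DB_EXT "db.bz2"   /* previously something like "db.tar.gz" */
```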
Comment by Nikos Kouremenos (zeppelin) - Friday, 29 December 2006, 20:18 GMT
bz2 should be the default. Arch is i686, so CPU doesn't matter.
Comment by Aaron Griffin (phrakture) - Sunday, 31 December 2006, 09:33 GMT
Saying CPU doesn't matter because of the i686 requirement doesn't make sense. Even on a dual-core machine like the one I am sitting at now, extracting a 1 MB bz2 file takes a good 2 seconds longer than the same data gzipped. That is highly significant.

Again. This matter has come up 4 million times. It is over and done with. Arch will use gzip. Sorry.
Comment by Nikos Kouremenos (zeppelin) - Sunday, 31 December 2006, 16:27 GMT
Dude, you must be kidding me. 2 secs? How much faster did you get the smaller package (let's say it's an Xorg package)? How much lower is the bandwidth bill for running archlinux.org?

I don't care for either gzip or bzip2, but if you can't see how it would benefit us, then of course I can't force you.
Comment by Roman Kyrylych (Romashka) - Monday, 01 January 2007, 21:35 GMT
Of course bzip2 is slower at compressing/decompressing.
But using bzip2 will result in a shorter _total_ install time, provided you don't install from the cache or a local mirror. That's because _downloading_ smaller files takes less time. And for most of us the difference will be much more than seconds.
This is my point.
Anyway, I cannot insist on anything. It depends on your choice, Aaron.

Of course delta sync downloads would be even better, but they are harder to implement and maintain.
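(In formula form, with t the unzip time, s the compressed size, and x the download rate: total = t + (s / x). bzip2 wins exactly when its smaller s saves more download time than its larger t costs; the measurements in a later comment work this out with real numbers.)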
Comment by Roman Kyrylych (Romashka) - Monday, 01 January 2007, 21:39 GMT
About decompression speed again:
I suggest adding a 'NOCOMPRESS' option to makepkg, and tar-only package support to pacman.
This would be useful for large packages that consist mostly of barely compressible data, like big binary games whose game data is already compressed into "packs".
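A sketch of how this could look as a makepkg.conf setting; PKGEXT is the existing knob, and treating a plain tar extension as "no compression" is the assumption here:

```
# makepkg.conf sketch: assumes makepkg picks the compressor from the
# extension, so a bare .pkg.tar would mean an uncompressed package.
PKGEXT='.pkg.tar'
```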
Comment by Andreas Radke (AndyRTR) - Thursday, 15 March 2007, 06:19 GMT
I'm still voting to switch to bzip2 compression, just for the faster download time, at least for CPU-powerful ports like x86_64. How about making use of the dual-core power some of our CPUs now have?

http://compression.ca/pbzip2/ - the project has reached 1.0 status. Maybe we could even use it as a replacement for bzip2 in current/base after some testing.
Comment by Aaron Griffin (phrakture) - Thursday, 15 March 2007, 06:27 GMT
I'm going to say the same thing I did with the xdelta guys. I don't really consider this all that important. If you do, however, you are welcome to provide a patch and I will see what I can do about integrating it. As it stands, using gzip works, and has been working for some time, so there's not much reason to change it. I have many other features/fixes I want to work on before something like this.

(Side note: please do not 'replace' bzip2 with something like the above. If anything, provide it as an option in extra or something using the standard provides/replaces way of doing things. A full replacement is probably not a good idea)
Comment by Dan McGee (toofishes) - Tuesday, 08 May 2007, 07:42 GMT
I would be another vote against this change: given the speed of my connection, the unzipping portion takes much longer than the download itself.

Instead of adding to the flames, I'll present some real numbers. This is on an exceptionally large file, the kernel source tarball, so repeating this experiment on something smaller (say, our kernel binary package) might be more useful. Realize that you only need to measure the download time and speed for one of tar.gz and tar.bz2; you can use math to figure out the other one given the sizes of the two packages.

File: linux-2.6.21.1.*
tar: 243 MB
bz2: 42 MB
gz: 54 MB

bz2:
download: 9.4 sec
unzip: 38.6 sec
zip: 136.95 sec (wow!)

gz:
download: 11.7 sec
unzip: 4.8 sec
zip: 24.4 sec

Notice how much faster gzip is for me: it takes 1/3 of the time of the bzip2 download/unzip process. The best thing I can produce here is another number: the connection speed, given my machine's unzip times, at which the two would break even. This test was done with a 4.5 MB/sec download rate, which is pretty fast. Doing some linear equations, I can tell you that my connection would have to be slower than 327 KB/sec for bzip2 to be preferable. That is an awfully slow connection for those with broadband, isn't it?

(If you want the answer I found above, find the intersection of these two equations:
y1 = 38.6 + (42984 / x)
y2 = 4.8 + (54029 / x)
It should be fairly obvious where the numbers come from; they are the unzip times in seconds and the sizes in KB from above.)
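A tiny sketch that reproduces that intersection arithmetic (the sizes and times are the ones quoted above; the wrapper program itself is just illustration):

```c
/* Break-even download rate: solve
 * 38.6 + 42984/x = 4.8 + 54029/x for x. */
#include <stdio.h>

int main(void) {
    double bz2_kb = 42984, gz_kb = 54029;     /* compressed sizes, KB */
    double bz2_unzip = 38.6, gz_unzip = 4.8;  /* unzip times, seconds */
    double x = (gz_kb - bz2_kb) / (bz2_unzip - gz_unzip);
    printf("break-even: %.0f KB/sec\n", x);   /* prints ~327 */
    return 0;
}
```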
Comment by Nagy Gabor (combo) - Tuesday, 15 January 2008, 20:40 GMT
Well, libarchive can handle .tar.gz and .tar.bz2 transparently, so which compression you prefer is your choice.
So pacman -U foo.tar.bz2 will also work, as long as .PKGINFO and such things exist in the archive. And in sync repos we use the %FILENAME% field to determine the filename, so this is quite flexible.
The same is true for repo decompression (in Arch Linux we use the db.tar.gz DBEXT, which could be a bit misleading in case of a compression change ;-); a plain .db extension would be transparent).

So if you implement a compression-type option for makepkg, this task can be closed, since this was a distro-specific discussion :-P
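To illustrate the transparency point, a minimal libarchive sketch (not pacman's actual code; the calls are modern libarchive API names):

```c
/* List the entries of a package regardless of compression: libarchive
 * auto-detects gzip vs bzip2, so .tar.gz and .tar.bz2 read the same. */
#include <stdio.h>
#include <archive.h>
#include <archive_entry.h>

int main(int argc, char **argv) {
    struct archive *a = archive_read_new();
    struct archive_entry *entry;
    if (argc < 2) return 1;
    archive_read_support_filter_all(a);   /* gzip, bzip2, ... */
    archive_read_support_format_tar(a);
    if (archive_read_open_filename(a, argv[1], 10240) != ARCHIVE_OK) {
        fprintf(stderr, "%s\n", archive_error_string(a));
        return 1;
    }
    while (archive_read_next_header(a, &entry) == ARCHIVE_OK) {
        puts(archive_entry_pathname(entry)); /* e.g. .PKGINFO */
        archive_read_data_skip(a);
    }
    archive_read_free(a);
    return 0;
}
```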
