FS#8586 - Implement tar database backend

Attached to Project: Pacman
Opened by Dan McGee (toofishes) - Friday, 09 November 2007, 19:23 GMT
Last edited by Allan McRae (Allan) - Thursday, 14 October 2010, 10:19 GMT
Task Type: Feature Request
Category: Backend/Core
Status: Closed
Assigned To: Dan McGee (toofishes), Allan McRae (Allan)
Architecture: All
Severity: Low
Priority: Normal
Reported Version: git
Due in Version: 3.5.0
Due Date: Undecided
Percent Complete: 100%
Votes: 10
Private: No


Everyone wants things sped up, so this is one possible approach that I think is worth it. We could easily download a db.tar.gz, inflate it but leave it tarred, and use it as our database. We would still have the benefits of the files approach, but gain a massive speed increase as we could load straight from the tar file into memory.

$ time tar cf local.db.tar local/
real 0m0.112s
user 0m0.023s
sys 0m0.087s

$ time tar xf local.db.tar
real 0m0.229s
user 0m0.010s
sys 0m0.210s
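
A minimal sketch of the "load straight from the tar file into memory" idea, using Python's standard tarfile module rather than the libarchive calls pacman would actually use. The package names and the desc contents are made up for illustration, and `load_descs` and `make_sample_tar` are hypothetical helper names, not pacman functions.

```python
import io
import tarfile


def load_descs(tar_bytes):
    """Return {member name: file contents} for every */desc entry in a tar."""
    entries = {}
    # Mode "r:" means plain, uncompressed tar; nothing is extracted to disk.
    with tarfile.open(fileobj=io.BytesIO(tar_bytes), mode="r:") as tar:
        for member in tar:
            if member.isfile() and member.name.endswith("/desc"):
                entries[member.name] = tar.extractfile(member).read().decode()
    return entries


def make_sample_tar():
    """Build a tiny made-up database in memory (package names are hypothetical)."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for pkg in ("foo-1.0-1", "bar-2.3-1"):
            data = f"%NAME%\n{pkg.split('-')[0]}\n".encode()
            info = tarfile.TarInfo(f"{pkg}/desc")
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


db = load_descs(make_sample_tar())
print(sorted(db))
```

The whole database ends up in one dictionary after a single sequential pass over the archive, which is the property the proposal relies on.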

I'm not sure yet if I want to convert both the local and the sync DBs to this format. It makes a lot more sense to convert the sync DBs because they are read-only. In reality, the local DB and sync DB shouldn't have to be the same format, so this feature request is also a reminder to fix that issue.

Closed by Allan McRae (Allan) - Thursday, 14 October 2010, 10:19 GMT
Reason for closing: Implemented
Additional comments about closing: Lots of git commits culminating in commit 4a8e396a
Comment by Aaron Griffin (phrakture) - Friday, 09 November 2007, 19:32 GMT
I did this actually. I implemented this WITHOUT extracting the tarfile.

The complexity is in writing. You can't write to the "stream" that you get from libarchive.
So this means that the local database needs to be something else. We can't do it with a tar file. This is fine, but it's a lot of complexity.

I may still have some of this code. I will check tonight.
Comment by Nagy Gabor (combo) - Tuesday, 18 December 2007, 12:15 GMT
Well, I am also afraid of the fact that we cannot use the same method for sync and local DBs :-(
Some small comments about disk usage:
tar is not really a space-saving format (compared to zip -0, for example), probably because it is mainly used together with gzip and bzip2. It pads the archive with a lot of \0 bytes, and since our files are usually small, on my system the tar archive came out 3 times bigger than needed (zip vs tar). (However, this was still smaller than du -hs repo/, because of ext3 filesystem overhead.)
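
The \0 padding comes from tar's 512-byte block layout: each regular file gets one header block, and its data is rounded up to a whole number of blocks. A small sketch of that arithmetic, checked against Python's tarfile module; the 300-byte file size is a made-up example, not a measurement from a real repo, and the helper names are hypothetical.

```python
import io
import tarfile

BLOCK = 512  # tar stores everything in 512-byte blocks


def tar_entry_size(payload_len):
    """Bytes one regular file takes in a tar: one header block plus padded data."""
    data_blocks = -(-payload_len // BLOCK)  # ceiling division
    return BLOCK + data_blocks * BLOCK


def member_offsets(files):
    """Write files (name -> bytes) to an in-memory ustar archive, return header offsets."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.USTAR_FORMAT) as tar:
        for name, data in files.items():
            info = tarfile.TarInfo(name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    buf.seek(0)
    with tarfile.open(fileobj=buf, mode="r:") as tar:
        return [m.offset for m in tar.getmembers()]


sample = {"a/desc": b"x" * 300, "b/desc": b"y" * 300}
offsets = member_offsets(sample)
print(offsets[1] - offsets[0], "bytes used in the archive for a 300-byte file")
```

Under these assumptions a 300-byte file occupies 1024 bytes in the archive, which is the kind of overhead described above for a database made of many small files.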
Comment by Dan McGee (toofishes) - Tuesday, 18 December 2007, 14:11 GMT
Why does it matter if sync and local DBs are using the same method? You gave no reasoning...

Disk usage? I wasn't concerned about disk usage whatsoever; our current DB is as big as you will get. I was concerned about extraction speed, and not zipping the contents will give us a ton of speed. Either way, I think my backend will support zipped or unzipped files, although I plan on actually implementing the latter.
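
A backend that accepts both zipped and unzipped files can be sketched with Python's tarfile module, whose "r:*" mode detects the compression on its own. This is only an illustration of the idea, not the pacman backend; `read_db` and `build_db` are hypothetical names and the sample entry is made up.

```python
import io
import tarfile


def read_db(raw):
    """Return {name: bytes} from a tar that may or may not be compressed."""
    # "r:*" lets tarfile detect gzip, bzip2, xz or plain tar by itself.
    with tarfile.open(fileobj=io.BytesIO(raw), mode="r:*") as tar:
        return {m.name: tar.extractfile(m).read() for m in tar if m.isfile()}


def build_db(mode):
    """Build a one-entry sample archive; mode is "w" (plain) or "w:gz" (gzip)."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode=mode) as tar:
        data = b"%NAME%\nfoo\n"
        info = tarfile.TarInfo("foo-1.0-1/desc")
        info.size = len(data)
        tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


# The same reader handles both variants and returns identical contents.
print(read_db(build_db("w")) == read_db(build_db("w:gz")))
```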
Comment by Nagy Gabor (combo) - Tuesday, 18 December 2007, 16:23 GMT
"Why does it matter if sync and local DBs are using the same method? You gave no reasoning..."
IMHO that's why we haven't implemented the one-file method yet: I cannot quote him, but IIRC Aaron didn't like this at all when I mentioned it on the ML many months ago (he found it ugly). I cannot name any real cons, but indeed, we will need different treatment for sync and local repos in be_files.c.

"Disk usage?..."
Tar's \0 fill is just (very) ugly IMHO. I mentioned it because I was surprised when I saw it. I didn't say that you should compress the file, but why the hell does tar add so many "needless" \0 bytes to the archive? Of course this is not a big problem at all, but even a primitive format of our own would be "nicer" than this. [OFF: It is a mystery why people use tar, gzip and bz2 in the 21st century... ;-) /OFF]
Comment by Aaron Griffin (phrakture) - Tuesday, 18 December 2007, 16:44 GMT
Can we not get into random semantics for something that doesn't even have an implementation? You're trying to argue against something that has 0 code.
Open another FR if you have another idea please.
Comment by Aaron Griffin (phrakture) - Thursday, 17 January 2008, 18:43 GMT
http://phraktured.net/dbread.c.txt <-- libarchive implementation ./dbread extra.db.tar.gz
http://phraktured.net/dbreadX.c.txt <-- file implementation ./dbreadX /var/lib/pacman/sync/extra/

Both of these simply read through all /desc files in full. No parsing is done, but for comparison, that will be constant on both sides of the equation.

These were written ages ago (I think even before Dan joined, heh) so they're probably assy
Comment by Pas (PAStheLoD) - Wednesday, 13 January 2010, 04:37 GMT

What about SQLite?

$ time pacman -Ss divx > /dev/null

real 0m0.670s
user 0m0.070s
sys 0m0.570s

$ time sqlite3 extra.db3 'select * from packages where pkg LIKE "%divx%" or contents LIKE "%divx%"' > /dev/null

real 0m0.102s
user 0m0.020s
sys 0m0.047s


It scales very well, and it doesn't eat up a lot of disk space the way many small files do because of the 4K block size (those also consume filesystem metadata space).
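
For reference, the query above can be sketched with Python's sqlite3 module. The table name `packages` and the columns `pkg` and `contents` are taken from the command line above; everything else (the schema details, the sample rows, the `search` helper) is assumed for illustration.

```python
import sqlite3


def search(conn, term):
    """Return names of packages whose name or contents contain term."""
    pattern = f"%{term}%"
    rows = conn.execute(
        "SELECT pkg FROM packages WHERE pkg LIKE ? OR contents LIKE ? ORDER BY pkg",
        (pattern, pattern),
    )
    return [name for (name,) in rows]


def make_sample_db():
    """Create an in-memory database with a few made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE packages (pkg TEXT PRIMARY KEY, contents TEXT)")
    conn.executemany(
        "INSERT INTO packages VALUES (?, ?)",
        [
            ("divx4linux", "DivX codec"),
            ("mplayer", "movie player with divx support"),
            ("zlib", "compression library"),
        ],
    )
    return conn


print(search(make_sample_db(), "divx"))
```

Note that a LIKE pattern with a leading % generally cannot use an ordinary index, so this kind of search is still a full table scan.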
Comment by Allan McRae (Allan) - Saturday, 03 July 2010, 14:31 GMT
Since Aaron's demos of the speed difference seem long gone, here are some I made. They compare reading in all the desc files from /var/lib/pacman/sync/extra with reading them from extra.db.tar (uncompressed repo db).

> sync; echo 3 > /proc/sys/vm/drop_caches
> time ./readfile

real 0m27.914s
user 0m0.137s
sys 0m0.660s

> sync; echo 3 > /proc/sys/vm/drop_caches
> time ./readtar

real 0m0.247s
user 0m0.070s
sys 0m0.017s

And that is without the "fgets" function we have for reading from tar files being optimised...
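
A rough sketch of what such a comparison could look like, written in Python rather than C; it is not the readfile/readtar code used for the numbers above. It builds a small made-up tree in a temporary directory, so the caches are warm and the timings will not show the cold-cache gap measured after drop_caches (which needs root). All names here are hypothetical.

```python
import os
import tarfile
import tempfile
import time


def read_files(root):
    """Read every <root>/<pkg>/desc file; return total bytes read."""
    total = 0
    for pkg in sorted(os.listdir(root)):
        with open(os.path.join(root, pkg, "desc"), "rb") as fh:
            total += len(fh.read())
    return total


def read_tar(path):
    """Read every */desc member of an uncompressed tar; return total bytes read."""
    total = 0
    with tarfile.open(path, mode="r:") as tar:
        for member in tar:
            if member.isfile() and member.name.endswith("/desc"):
                total += len(tar.extractfile(member).read())
    return total


def make_sample(workdir, count=200):
    """Create count made-up package directories plus a tar of the same tree."""
    root = os.path.join(workdir, "extra")
    for i in range(count):
        pkgdir = os.path.join(root, f"pkg{i}-1.0-1")
        os.makedirs(pkgdir)
        with open(os.path.join(pkgdir, "desc"), "wb") as fh:
            fh.write(f"%NAME%\npkg{i}\n".encode())
    tar_path = os.path.join(workdir, "extra.db.tar")
    with tarfile.open(tar_path, mode="w") as tar:
        tar.add(root, arcname="extra")
    return root, tar_path


def timed(func, arg):
    """Run func(arg) once and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = func(arg)
    return result, time.perf_counter() - start


with tempfile.TemporaryDirectory() as tmp:
    root, tar_path = make_sample(tmp)
    for func, arg in ((read_files, root), (read_tar, tar_path)):
        total, elapsed = timed(func, arg)
        print(f"{func.__name__}: {total} bytes in {elapsed:.4f}s")
```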
Comment by Allan McRae (Allan) - Sunday, 04 July 2010, 11:43 GMT
Here is some code to take the compressed database and convert it to a non-compressed one. This will be the basis of a modified alpm_db_update.
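
The attachment itself is not reproduced here. As a stand-in, this is a minimal Python sketch of the conversion step, assuming a gzip-compressed database; it is not the attached code and not alpm_db_update. `decompress_db` and `make_sample_gz` are hypothetical names, and the sample entry is made up.

```python
import gzip
import io
import os
import shutil
import tarfile
import tempfile


def decompress_db(src, dst):
    """Inflate the gzip-compressed tar at src into an uncompressed tar at dst."""
    tmp = dst + ".part"  # write under a temporary name so dst is never half-written
    with gzip.open(src, "rb") as fin, open(tmp, "wb") as fout:
        shutil.copyfileobj(fin, fout)
    os.replace(tmp, dst)


def make_sample_gz(path):
    """Write a one-entry gzip-compressed tar with made-up contents."""
    with tarfile.open(path, mode="w:gz") as tar:
        data = b"%NAME%\nfoo\n"
        info = tarfile.TarInfo("foo-1.0-1/desc")
        info.size = len(data)
        tar.addfile(info, io.BytesIO(data))


with tempfile.TemporaryDirectory() as workdir:
    src = os.path.join(workdir, "extra.db.tar.gz")
    dst = os.path.join(workdir, "extra.db.tar")
    make_sample_gz(src)
    decompress_db(src, dst)
    with tarfile.open(dst, mode="r:") as tar:
        print(tar.getnames())
```

The archive is only inflated, never unpacked, so the result can be read entry by entry as a plain tar.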