FS#8586 - Implement tar database backend
Attached to Project:
Pacman
Opened by Dan McGee (toofishes) - Friday, 09 November 2007, 19:23 GMT
Last edited by Allan McRae (Allan) - Thursday, 14 October 2010, 10:19 GMT
Details
Everyone wants things sped up, so this is one possible
approach that I think is worth it. We could easily download
a db.tar.gz, inflate it but leave it tarred, and use it as
our database. We would still have the benefits of the files
approach, but gain a massive speed increase as we could load
straight from the tar file into memory.
$ time tar cf local.db.tar local/
real	0m0.112s
user	0m0.023s
sys	0m0.087s
$ time tar xf local.db.tar
real	0m0.229s
user	0m0.010s
sys	0m0.210s
I'm not sure yet whether I want to convert both the local and the sync DBs to this format. It makes much more sense to convert the sync DBs because they are read-only. In reality, the local DB and the sync DBs shouldn't have to share a format, so this feature request is also a reminder to fix that issue.
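The idea above — load package entries straight from the tarball into memory, never extracting to disk — can be sketched as follows. This is an illustration only, using Python's stdlib tarfile module rather than the C/libarchive code pacman would actually use; the "pkgname-pkgver/desc" entry layout is an assumption based on the existing files backend.

```python
import tarfile

# Illustrative sketch (not pacman code): read every package's "desc"
# entry directly from a sync db tarball, without unpacking it to disk.
# Entry names are assumed to follow the "pkgname-pkgver/desc" layout.
def load_descs(db_path):
    descs = {}
    # "r:*" lets tarfile autodetect gzip/bzip2 compression.
    with tarfile.open(db_path, "r:*") as tar:
        for member in tar:
            if member.isfile() and member.name.endswith("/desc"):
                pkgdir = member.name.rsplit("/", 1)[0]
                descs[pkgdir] = tar.extractfile(member).read().decode()
    return descs
```

One sequential pass over the archive replaces thousands of small-file opens, which is where the speedup in the timings above comes from.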
This task depends upon
Closed by Allan McRae (Allan)
Thursday, 14 October 2010, 10:19 GMT
Reason for closing: Implemented
Additional comments about closing: Lots of git commits culminating in commit 4a8e396a
The complexity is in writing: you can't write to the "stream" that you get from libarchive.
That means the local database needs to be something else; we can't do it with a tar file. That's fine, but it adds a lot of complexity.
I may still have some of this code. I will check tonight
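The write problem described above is inherent to the format: tar has no in-place update, so changing a single package's entry means streaming the entire archive into a replacement file. A hedged sketch of that rewrite cycle, again in stdlib Python rather than C/libarchive, with illustrative names:

```python
import io
import tarfile

# Illustrative sketch: to "modify" one entry in a tar-backed db you must
# copy every other member into a brand-new archive. This is why a
# read-write local db is awkward as a plain tarball.
def replace_entry(src_path, dst_path, name, new_data):
    with tarfile.open(src_path, "r:*") as src, \
         tarfile.open(dst_path, "w") as dst:
        for member in src:
            if member.name == name:
                member.size = len(new_data)
                dst.addfile(member, io.BytesIO(new_data))
            elif member.isfile():
                dst.addfile(member, src.extractfile(member))
            else:
                dst.addfile(member)  # directories etc. carry no data
```

For read-only sync DBs none of this matters, which is why converting only those is the easier win.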
------
Some little comments about disk usage:
tar is not really a diskspace-saving format (compared to zip -0, for example), probably because it is usually paired with gzip or bzip2. It pads the archive with too many \0 bytes, and since our files are usually small, on my system a tar archive came out 3 times bigger than necessary (zip vs tar). (Even so, it was smaller than du -hs repo/, because of ext3 filesystem overhead.)
Disk usage? I wasn't concerned about disk usage whatsoever; our current DB is as big as it will get. I was concerned about extraction speed, and leaving the contents uncompressed will give us a ton of speed. Either way, I think my backend will support both compressed and uncompressed files, although I plan on actually implementing the latter.
Imho that's why we haven't implemented a one-file method yet: I can't quote it, but IIRC Aaron didn't like this at all when I mentioned it on the ML many months ago (he found it ugly). I can't name any real downsides, but we will indeed need different treatment for sync and local repos in be_files.c.
"Disk usage?..."
Tar's \0 fill is just (very) ugly imho <- I mentioned this because I was surprised when I saw it. I didn't say that you should compress the file, but why the hell does tar add so many "needless" \0 bytes to the archive? Of course this is not a big problem at all, but even a primitive format of our own would be "nicer" than this. [OFF: It is a mystery why people still use tar, gzip and bzip2 in the 21st century... ;-) /OFF]
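The padding complained about here is easy to measure: tar stores each member as a 512-byte header plus data rounded up to 512-byte blocks, and finishes the archive with zero-filled end-of-archive blocks, so a tiny desc file carries a large fixed overhead. A quick stdlib illustration (Python, not pacman code):

```python
import io
import tarfile

# Measures the on-disk size of a tar archive holding one tiny member.
# tar's format rounds everything to 512-byte blocks, so the result is
# always far larger than the payload itself.
def tar_size(payload):
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as t:
        info = tarfile.TarInfo("desc")
        info.size = len(payload)
        t.addfile(info, io.BytesIO(payload))
    return len(buf.getvalue())
```

A 10-byte payload costs at least a 512-byte header plus a 512-byte data block, before the end-of-archive padding is even counted.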
Open another FR if you have another idea please.
http://phraktured.net/dbreadX.c.txt <-- file implementation
./dbreadX /var/lib/pacman/sync/extra/
Both of these simply read through all /desc files in full. No parsing is done, but for comparison purposes that cost is constant on both sides of the equation.
These were written ages ago (I think even before Dan joined, heh) so they're probably rough.
What about SQLite?
$ time pacman -Ss divx > /dev/null
real 0m0.670s
user 0m0.070s
sys 0m0.570s
$ time sqlite3 extra.db3 'select * from packages where pkg LIKE "%divx%" or contents LIKE "%divx%"' > /dev/null
real 0m0.102s
user 0m0.020s
sys 0m0.047s
http://pasthelod.hell-and-heaven.org/paclite/
It can scale very well, and it doesn't eat up disk space the way thousands of tiny files on a 4K-blocksize filesystem do (a file-per-entry layout also consumes filesystem metadata space).
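For reference, the experiment quoted above boils down to a single table and a LIKE query. A minimal stand-in using Python's bundled sqlite3 module; the table name and the pkg/contents columns are taken from the query shown, everything else (the sample rows) is invented for illustration:

```python
import sqlite3

# Tiny in-memory mock of the "paclite" schema implied by the query
# above: one packages table with pkg and contents columns.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE packages (pkg TEXT, contents TEXT)")
conn.executemany(
    "INSERT INTO packages VALUES (?, ?)",
    [("mplayer", "plays divx video"),   # sample rows, not real db data
     ("vim", "text editor")])
rows = conn.execute(
    "SELECT pkg FROM packages WHERE pkg LIKE ? OR contents LIKE ?",
    ("%divx%", "%divx%")).fetchall()
```

An index can't help a leading-wildcard LIKE, so even this fast result is a full table scan; SQLite wins mainly by avoiding per-file I/O, much like the tar approach.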
> sync; echo 3 > /proc/sys/vm/drop_caches
> time ./readfile
real 0m27.914s
user 0m0.137s
sys 0m0.660s
> sync; echo 3 > /proc/sys/vm/drop_caches
> time ./readtar
real 0m0.247s
user 0m0.070s
sys 0m0.017s
And that is without the "fgets" function we have for reading from tar files being optimised...
readtar.c (1.5 KiB)