FS#11225 - repo-add creates wrong CSIZE entry

Attached to Project: Pacman
Opened by Nicolai Lissner (blackpenguin) - Friday, 15 August 2008, 15:08 GMT
Last edited by Dan McGee (toofishes) - Saturday, 23 August 2008, 13:50 GMT
Task Type Bug Report
Category General
Status Closed
Assigned To Xavier (shining)
Architecture All
Severity Low
Priority Normal
Reported Version 3.2.0
Due in Version 3.2.1
Due Date Undecided
Percent Complete 100%
Votes 0
Private No


repo-add uses an invalid method to create the %CSIZE% entry in *.db.tar.gz

/bin/du is NOT a tool to find size of a file, it reports the disk-usage of a file,
which is not the same as it depends on the used filesystem.

please replace

csize=$(du -kL "$deltafile" | awk '{print $1 * 1024}')
csize=$(stat -c %s "$deltafile")

csize=$(du -kL "$pkgfile" | awk '{print $1 * 1024}')
csize=$(stat -c %s "$pkgfile")

Thank you :)
This task depends upon

Closed by  Dan McGee (toofishes)
Saturday, 23 August 2008, 13:50 GMT
Reason for closing:  Fixed
Additional comments about closing:  Commit 282eeadc68fec1da8651d0c65ad0dfebd11a9c7f
Comment by Xavier (shining) - Friday, 15 August 2008, 19:14 GMT
See  FS#10459 
du can also report the size, but the argument doing that is not portable so it was dropped.
And stat is not portable either unfortunately.
Comment by Gavin Bisesi (Daenyth) - Friday, 15 August 2008, 19:40 GMT
"Not portable" in what manner? Across filesystems? Perhaps it can check the filesystem of the target partition and then use the correct tool?
Comment by Nicolai Lissner (blackpenguin) - Friday, 15 August 2008, 20:16 GMT
@Daenyth: portable to BSD (according to  FS#10459 )

However, I can't grasp why there never have been a problem with wrong %CSIZE% fields before. I guess gensync used a working method?
And what is wrong about the way Xilon suggested in http://bugs.archlinux.org/task/10459#comment28616
What exactly is not clean in using $(uname -s) to achieve portability if portability is an issue.
IMHO it's the only clean way, and actually much cleaner than using a method that reports wrong sizes. At least if you want to use coreutils to do it.

I thought you are trying to improve things. Don't make archlinux worse just for BSD portability, please. repo-add is generally a good idea, but it shouldn't break things.

Comment by Xavier (shining) - Friday, 15 August 2008, 20:16 GMT
Did you read the other bug report? Not portable across operating systems : Linux and BSD/OSX
Maybe we could have a check for using a different command for each OS, but I am not even sure how to do that.

BTW, the bug category is wrong, it should be pacman.
Comment by Xavier (shining) - Friday, 15 August 2008, 20:27 GMT
1) portability is important
2) disk usage and real usage are close enough, the size does not need to be perfectly accurate
3) OS specific checks are never clean

That said, we indeed had to introduce a minor regression in order to avoid an os specific check.
You are welcome to work on a patch using uname -s, possibly with the help of Xilon, and I will support it.
Comment by Gavin Bisesi (Daenyth) - Friday, 15 August 2008, 21:04 GMT
I am fine with either behavior. If we use the same imprecise version on all, perhaps it's worth a caveat in the manpage that it might not always be 100% accurate
Comment by Nicolai Lissner (blackpenguin) - Friday, 15 August 2008, 21:27 GMT
you are talking about a security risk here. md5sums can be faked. it's pretty much harder to fake an md5sum while keeping filesize. I can chose to download *.db.tar.gz from one mirror and the package from another - which will make it much harder to manipulate the binary-package. Call me a paranoid, I call this sane caution. However, I'm not sure what OSX and BSD will report on "uname -s" but it should work very simply like this:

case $(uname -s) in Linux) csize=$(stat -c %s "$pkgname"); ;; BSD) csize=$(stat -f "%z" "$pkgname"); ;; OSX) csize=$(...insert proper OSX command here...); ;;esac

Or, if you don't like this, just adding a line
[[ $(uname -s) == "Linux" ]] && csize=$(stat -c %s "$pkgname")
behind the current line that fills csize would at least report exact sizes when run on linux (I still believe most archlinux users use archLINUX)

Last not least, it would be very easy to write a small helper-tool that just reports the correct filesize based on c function stat() and would be portable (I wouldn't call this more clean, I just add this as an idea if someone doesn't like the "case $(uname -s) in..."

I know pacman doesn't care about filesizes when downloading files (however it actually _should_ care, but that's another issue). An md5sum only (without filesize) is pretty much useless. I also know, in an ideal world we wouldn't need to worry about manipulation of binary packages. Unfortunately we don't live in an ideal world. But we should do what we can do to make it better.

Sure, if you don't care about security at all and you just want the value to report an approx. downloadsize only, you are right, the value doesn't need to be accurate.
OTOH it's easy to implement an accurate version.

Comment by Nicolai Lissner (blackpenguin) - Friday, 15 August 2008, 21:30 GMT
oh... and yes, I felt the category should be pacman, but unfortunately that isn't available to chose.
Comment by Xavier (shining) - Friday, 15 August 2008, 21:57 GMT
The databases and the packages lie all together on every mirror. Surely if you can manipulate packages on one mirror, you can also edit the md5sum and filesize so that they match. These are not a security measure in any way.
Comment by Nicolai Lissner (blackpenguin) - Friday, 15 August 2008, 23:44 GMT
To repeat myself: "I can chose to download *.db.tar.gz from one mirror and the package from another - which will make it much harder to manipulate the binary-package" I'm NOT forced to use the *.db.tar.gz from the same mirror I get my packages from. And actually I don't do this.
Comment by Xavier (shining) - Saturday, 16 August 2008, 08:57 GMT
pacman actually forces you to do this, it does not let you the choice. You would either have to hack pacman or to do it manually. I doubt many users do that.
And what if the main server is compromised?

If you are concerned about security, you should rather help implementing package signing :
Comment by Nicolai Lissner (blackpenguin) - Saturday, 16 August 2008, 10:51 GMT
I don't use pacman but spaceman instead. spaceman is the package manage of gnuffy (and yes, it supports signing). See http://gnuffy.chaotika.org for more information.
Comment by Nicolai Lissner (blackpenguin) - Saturday, 16 August 2008, 11:00 GMT
however, if the mainserver is compromised and mirrors copy the compromised packages it's really worst case. But I wouldn't ignore security completely just because there is nothing like 100% secure.
And even with pacman you can do this very simply, a simple script "syncpkglist" that would change mirror, download *.db.tar.gz, and change mirror back would do it. You are probably right, not very much archlinux users do that. OTOH I guess not very much archlinux users would expect, that pacman doesn't compare filesizes but MD5SUM only. So, these users who get *.db.tar.gz from main-server but packages from other mirrors are even lost if CSIZE is not accurate.
Of course, signed packages would be much better and I would definetely welcome if archlinux would sign packages, too.
Comment by Nagy Gabor (combo) - Saturday, 16 August 2008, 21:41 GMT
"Don't make archlinux worse just for BSD portability."
Comment by Xavier (shining) - Monday, 18 August 2008, 08:48 GMT
Again, if you care, submit a real patch.
Even if you don't know the exact uname -s output, it does not matter, it can be fixed later.
Note that du is used in 3 different places : http://projects.archlinux.org/?p=pacman.git;a=commitdiff;h=149839c5391e9a93465f86dbb8d095a0150d755d
So the code has to be factorized a bit (at least inside repo-add).
And in makepkg, it is a recursive du (du -s) which is used. So if you want to implement that with stat, you will probably need a combination of find, stat, and bc, a bit like this example from my last comment in the other bug report :
echo $(find . -printf "%s + ") 0 |bc
Comment by Nagy Gabor (combo) - Monday, 18 August 2008, 10:47 GMT
I vote for implementing the linux-way and forget about other OSs.
Or make somehow configurable this get size check.

My main reason: I blame linux and BSD leaders because of this "incompatibility" (I may be wrong here, but this is my standpoint). Implementing two/three/... methods for certain OSs would support this "state", not implement is simply a protest vote from my side.
Comment by Xavier (shining) - Monday, 18 August 2008, 11:16 GMT
I like that idea of making it configurable. Besides makepkg and repo-add both already source /etc/makepkg.conf so we can just put the command there.

# Linux / Cygwin
STATCMD="stat -c %s"
#STATCMD="stat -f %z"

Something like that?

If that is ok, we just have to implement this in makepkg and repo-add ...
Comment by Nagy Gabor (combo) - Monday, 18 August 2008, 11:53 GMT
"Something like that?"
Yes. IMHO this way is better.

Comment by Nicolai Lissner (blackpenguin) - Monday, 18 August 2008, 21:36 GMT
I like that, too :D
Comment by Xavier (shining) - Monday, 18 August 2008, 22:24 GMT
Please test.
Comment by Nicolai Lissner (blackpenguin) - Monday, 18 August 2008, 22:48 GMT
works fine with linux
Comment by Xavier (shining) - Monday, 18 August 2008, 22:51 GMT
For bsd, I had a report that the following command works as expected :
find /tmp -exec stat -f %z '{}' ';' | awk '{sum+=$1} END {printf("%d\n", sum)}'

So we should be fine.
Comment by Dan McGee (toofishes) - Wednesday, 20 August 2008, 00:50 GMT
Holy smokes, this was a long discussion that I hadn't looked at until just now.

A few thoughts from my end, which unfortunately all is the path to getting code into pacman. :P
1. Md5sums are not for security. Ever. So any post implying this is just dumb. Package signing is the place to look for security.
2. How big of a difference are we talking here on sizes? No one has really identified this yet.
3. Why was more objection not raised when this went into git through the ML...
4. Designing applications to be portable is somewhat fun at least for me, so reverting this change and saying "we using Linux are superior" is a dumb justification to me.
5. I'm not a big fan of uname hacks.
6. If we do this STATCMD thing, it would make more sense for configure to fill this in, just as we do with the makeflags.
Comment by Nicolai Lissner (blackpenguin) - Wednesday, 20 August 2008, 03:08 GMT

1. Right, but archlinux doesn't sign packages yet, does it? And without signatures, the *combination* of md5 and filesize is more secure than nothing, if you get your *db.tar.gz from a different site than the package. Nobody claims it would be 100% secure or eligible to replace signatures. Of course, I prefer signing, too. And if archlinux would have signatures I wouldn never have opened this bug, because I could live with ignoring filesizes then. No need to discuss that. And especially no need to call anyone dumb.

2. Actually most (all?) packages officially released by archlinux seem to match the sizes given in their *.db.tar.gz (Are these *.db.tar.gz actually created with repo-add? I doubt that.) The problem appeared to me first with the kdemod-repository. So for me it has been a new problem.

3. I've neither have read the ML nor do I follow the complete development of pacman. The problem that a downloaded file matched md5sum but didn't match the filesize given in *.db.tar.gz appeared to me the first time exactly one day before reporting this bug. I hunted down the reason for this and then reported it as a bug. So that's the reason why.

4. There's absolutely nothing wrong in designing applications portable. I don't question this. But I question changes of code behaviour to less accurate filesizes than the tools formerly used reported, I still would call this a bug. Btw, who said "we using Linux are superior"? I can't find anything like that and I don't think anyone in this thread really thought something like that. Anyway it's not about Linux. The problem here is GNU vs Unix. not Linux vs BSD or something. However it's senseless to discuss this point, as we know, GNU is not Unix. There _are_ some differences between the systems, and we have to handle this if we want portability. It's absolutely not a question of being superior or something. Differences rarely imply superiority. They are just what they are: differences. And they make the world more colorful. Let's handle this colors in full saturation instead of blurring them.

5. as you have seen - the solution of Xavier doesn't use uname.

6. How is configure supposed to find out the OS? Oh, damn, it is going to use uname ;-) But sure, it should fill this in automatically.

Comment by Dan McGee (toofishes) - Wednesday, 20 August 2008, 04:34 GMT
1. Ignoring.
2. The Arch repositories do in fact use repo-add, but the pre-3.2.0 version at the moment which used the old way of 'du -b' to get the size.
3. Knowing that they were the exact size before is helpful, so thanks- I see now that the more important problem lies in repo-add and not makepkg.
4. Filesize was not intended to be a verification utility, and the "close enough" rule got applied here. And I saw this comment which seemed to imply some sense of superiority: "Don't make archlinux worse just for BSD portability, please." Having a slightly-off filesize is a far cry from making Arch terrible, that is what I was trying to convey.
5. I see this, I was just responding to the *entire* thread as I said.
6. But configure is specifically designed to do OS and machine-specific checks, which is a whole lot better than makepkg failing the first time you use it, for instance. Runtime detection is not our best friend. And in this case, should we even abstract it out to makepkg.conf at all, or just leave it in makepkg? That would make more sense.

I'm not trying to be hostile here btw. I'm just getting out my thoughts on this issue. And I've even attached my proposed solution to this whole fun problem. Note that I did not touch makepkg because our size there is not critical- a size to the nearest K is just fine, and switching to a find/stat way of doing it would cause all hard links to get double-counted.
Comment by Allan McRae (Allan) - Wednesday, 20 August 2008, 05:46 GMT
Looks like a good solution to me.
Comment by Xavier (shining) - Wednesday, 20 August 2008, 06:50 GMT
Looks good to me too!
(and du is indeed much more practical for getting the size of a directory)
Comment by Dan McGee (toofishes) - Wednesday, 20 August 2008, 11:48 GMT
Cool- can you guys possibly give this a test so we can get this into 3.2.1? I didn't test it at all. :)
Comment by Nicolai Lissner (blackpenguin) - Wednesday, 20 August 2008, 12:16 GMT
tested with linux: works perfectly. thanks :)