FS#64252 - [makepkg] new size calculation doesn't handle hard links
Attached to Project:
Pacman
Opened by Evangelos Foutras (foutrelis) - Friday, 25 October 2019, 03:00 GMT
Last edited by Allan McRae (Allan) - Tuesday, 05 November 2019, 05:01 GMT
Opened by Evangelos Foutras (foutrelis) - Friday, 25 October 2019, 03:00 GMT
Last edited by Allan McRae (Allan) - Tuesday, 05 November 2019, 05:01 GMT
Details
As seen in
The new size computation method causes hard links to be counted multiple times. [1] This is not representative of actual installed size as most (all?) Linux file systems support hard links. Under /usr/lib/dri, mesa ships 14 files but only 2 are unique: $ du -sb --apparent-size /usr/lib/dri 32548112 /usr/lib/dri $ find /usr/lib/dri -type f -exec cat {} + | wc -c 243613616 [1] https://git.archlinux.org/pacman.git/commit/?id=f26cb61cb6 |
This task depends upon
This task blocks these from closing
FS#64248 - mesa package upgrade increases installed size by
400%
Closed by Allan McRae (Allan)
Tuesday, 05 November 2019, 05:01 GMT
Reason for closing: Fixed
Additional comments about closing: git commit 0272fca9
Tuesday, 05 November 2019, 05:01 GMT
Reason for closing: Fixed
Additional comments about closing: git commit 0272fca9
04:54 PM <eschwartz> The mesa package claims to have grown in size by 201.22 MiB, but it turns out this is because a bunch of files are hardlinked together so wc -c counts them 9 times and 5 times respectively. I don't suppose there is anything we can do about this...
04:54 PM <eschwartz> at least it's nice and consistent when counting the files 14 times and getting the same result each time.
05:22 PM <dreisner> we've been through this too many damn times. someone needs to go write some portable tool to count unique inodes and sum sizes.
05:27 PM <elibrokeit> I can live with that as long as we don't have packages that repeated dly grow by 50mb then shrink by the same amount, whenever different comaintainers of a package build it on different filesystems!
It was suggested (and shot down) that we could write a lint warning to not use hardlinks:
05:35 PM <agregory> should tell packagers to stop using hardlinks
05:37 PM <dreisner> well, it's solving the wrong problem. we shouldn't disallow hardlinks just because we can't size the package correctly.
Allan observed:
06:19 PM <amcrae> I don't care what number we use, as long as it is approximately correct and is reproducible
06:21 PM <dreisner> in that case the current solution is Totally Fine™
...
Overall I'm not sure whether we should actually do something about this, but if we do do something, what should we do?
I guess we need to write our own solution to this. I'm happy for a script to be added that steps through package files and skip "wc -c" any that have an inode the same size as previous.
write_pkginfo() {
local IFS=$'\n'
local size="$(find . -type f -links 1 -exec cat {} + 2>/dev/null | wc -c)"
inum='0'
for i in $(find . -type f -links +1 -printf '%i %p\n'|sort); do
n=${i% *}
fname=${i#* }
[ "$inum" != "$n" ] && {
z=$(cat "$fname"|wc -c)
(( size += z ))
}
inum=$n
done
echo $size
}
write_pkginfo
The former is relevant because we try to be distro and even OS agnostic (we have code to run on bsd/macOS), the latter is relevant both because other Linux distros might want to use busybox, and because the upcoming alternatives system might mean archlinux itself lets you use busybox as the system find program.
#!/bin/bash
write_pkginfo() {
local IFS=$'\n'
local size="$(find . -xdev -type f -links 1 -exec cat {} + 2>/dev/null | wc -c)"
inum='0'
for i in $(find . -xdev -type f -links +1 -exec /bin/ls -lin {} +|sort); do
n=${i%% *}
fname=${i##* }
[ "$inum" != "$n" ] && {
z=$(cat "$fname"|wc -c)
(( size += z ))
}
inum=$n
done
echo $size
}
write_pkginfo
$ cd /usr/lib/dri; for i in $(find . -xdev -type f -links +1 -exec /bin/ls -lin {} +|sort); do n=${i%% *}; fname=${i##* }; printf 'n: <%s>, fname: <%s>\n' "$n" "$fname"; done
n: <210649>, fname: <210649>
n: <-rwxr-xr-x>, fname: <-rwxr-xr-x>
n: <5>, fname: <5>
n: <0>, fname: <0>
n: <0>, fname: <0>
n: <12320632>, fname: <12320632>
n: <Oct>, fname: <Oct>
n: <24>, fname: <24>
n: <12:51>, fname: <12:51>
n: <./i915_dri.so>, fname: <./i915_dri.so>
n: <210649>, fname: <210649>
n: <-rwxr-xr-x>, fname: <-rwxr-xr-x>
n: <5>, fname: <5>
n: <0>, fname: <0>
n: <0>, fname: <0>
n: <12320632>, fname: <12320632>
n: <Oct>, fname: <Oct>
n: <24>, fname: <24>
n: <12:51>, fname: <12:51>
n: <./i965_dri.so>, fname: <./i965_dri.so>
n: <210649>, fname: <210649>
n: <-rwxr-xr-x>, fname: <-rwxr-xr-x>
n: <5>, fname: <5>
n: <0>, fname: <0>
n: <0>, fname: <0>
n: <12320632>, fname: <12320632>
n: <Oct>, fname: <Oct>
n: <24>, fname: <24>
n: <12:51>, fname: <12:51>
n: <./nouveau_vieux_dri.so>, fname: <./nouveau_vieux_dri.so>
n: <210649>, fname: <210649>
n: <-rwxr-xr-x>, fname: <-rwxr-xr-x>
n: <5>, fname: <5>
n: <0>, fname: <0>
n: <0>, fname: <0>
n: <12320632>, fname: <12320632>
n: <Oct>, fname: <Oct>
n: <24>, fname: <24>
n: <12:51>, fname: <12:51>
n: <./r200_dri.so>, fname: <./r200_dri.so>
n: <210649>, fname: <210649>
n: <-rwxr-xr-x>, fname: <-rwxr-xr-x>
n: <5>, fname: <5>
n: <0>, fname: <0>
n: <0>, fname: <0>
n: <12320632>, fname: <12320632>
n: <Oct>, fname: <Oct>
n: <24>, fname: <24>
n: <12:51>, fname: <12:51>
n: <./radeon_dri.so>, fname: <./radeon_dri.so>
n: <210650>, fname: <210650>
n: <-rwxr-xr-x>, fname: <-rwxr-xr-x>
n: <9>, fname: <9>
n: <0>, fname: <0>
n: <0>, fname: <0>
n: <20223384>, fname: <20223384>
n: <Oct>, fname: <Oct>
n: <24>, fname: <24>
n: <12:51>, fname: <12:51>
n: <./iris_dri.so>, fname: <./iris_dri.so>
n: <210650>, fname: <210650>
n: <-rwxr-xr-x>, fname: <-rwxr-xr-x>
n: <9>, fname: <9>
n: <0>, fname: <0>
n: <0>, fname: <0>
n: <20223384>, fname: <20223384>
n: <Oct>, fname: <Oct>
n: <24>, fname: <24>
n: <12:51>, fname: <12:51>
n: <./kms_swrast_dri.so>, fname: <./kms_swrast_dri.so>
n: <210650>, fname: <210650>
n: <-rwxr-xr-x>, fname: <-rwxr-xr-x>
n: <9>, fname: <9>
n: <0>, fname: <0>
n: <0>, fname: <0>
n: <20223384>, fname: <20223384>
n: <Oct>, fname: <Oct>
n: <24>, fname: <24>
n: <12:51>, fname: <12:51>
n: <./nouveau_dri.so>, fname: <./nouveau_dri.so>
n: <210650>, fname: <210650>
n: <-rwxr-xr-x>, fname: <-rwxr-xr-x>
n: <9>, fname: <9>
n: <0>, fname: <0>
n: <0>, fname: <0>
n: <20223384>, fname: <20223384>
n: <Oct>, fname: <Oct>
n: <24>, fname: <24>
n: <12:51>, fname: <12:51>
n: <./r300_dri.so>, fname: <./r300_dri.so>
n: <210650>, fname: <210650>
n: <-rwxr-xr-x>, fname: <-rwxr-xr-x>
n: <9>, fname: <9>
n: <0>, fname: <0>
n: <0>, fname: <0>
n: <20223384>, fname: <20223384>
n: <Oct>, fname: <Oct>
n: <24>, fname: <24>
n: <12:51>, fname: <12:51>
n: <./r600_dri.so>, fname: <./r600_dri.so>
n: <210650>, fname: <210650>
n: <-rwxr-xr-x>, fname: <-rwxr-xr-x>
n: <9>, fname: <9>
n: <0>, fname: <0>
n: <0>, fname: <0>
n: <20223384>, fname: <20223384>
n: <Oct>, fname: <Oct>
n: <24>, fname: <24>
n: <12:51>, fname: <12:51>
n: <./radeonsi_dri.so>, fname: <./radeonsi_dri.so>
n: <210650>, fname: <210650>
n: <-rwxr-xr-x>, fname: <-rwxr-xr-x>
n: <9>, fname: <9>
n: <0>, fname: <0>
n: <0>, fname: <0>
n: <20223384>, fname: <20223384>
n: <Oct>, fname: <Oct>
n: <24>, fname: <24>
n: <12:51>, fname: <12:51>
n: <./swrast_dri.so>, fname: <./swrast_dri.so>
n: <210650>, fname: <210650>
n: <-rwxr-xr-x>, fname: <-rwxr-xr-x>
n: <9>, fname: <9>
n: <0>, fname: <0>
n: <0>, fname: <0>
n: <20223384>, fname: <20223384>
n: <Oct>, fname: <Oct>
n: <24>, fname: <24>
n: <12:51>, fname: <12:51>
n: <./virtio_gpu_dri.so>, fname: <./virtio_gpu_dri.so>
n: <210650>, fname: <210650>
n: <-rwxr-xr-x>, fname: <-rwxr-xr-x>
n: <9>, fname: <9>
n: <0>, fname: <0>
n: <0>, fname: <0>
n: <20223384>, fname: <20223384>
n: <Oct>, fname: <Oct>
n: <24>, fname: <24>
n: <12:51>, fname: <12:51>
n: <./vmwgfx_dri.so>, fname: <./vmwgfx_dri.so>
Also please note that while the most obvious problem of using a for loop on the contents of a subprocess could be solved using while read instead, you still cannot parse the filenames from ls while ignoring the very strong warning to never, ever, ever use 'ls' in a script because
a) filenames can contain newlines, pacman forbids newlines as a side effect of the database format but this could change so we should write robust code which does not rely on the database format
b) filenames can contain spaces, and many archlinux packages contain filenames with spaces, so you cannot even declare newlines illegal then accurately parse the output of ls and assume the last component of each line is a filename -- you still need to actually parse the line and do heuristics to detect the implementation-defined output of various fields like "date". Furthermore, filenames could contain valid dates in them...
IFS=$'\n' is important
pkg_size works for me on linux and gives the good size for /usr/lib/dri 32544016
pkg_size does not support spaces in filenames.
I think than ls -linm and fname=${i#* } works. i try it on linux and slitaz busybox and i come back tomorrow with the good solution (perhaps)
IFS=$'\n' is also unambiguously wrong, because "cmd | while read -r line" is infinitely safer by default and doesn't require changing a global shell feature (and no, using local to scope it to just this function is at best a half solution).
> pkg_size does not support spaces in filenames.
Then pkg_size does not support makepkg, end of story.
Nor does ls -linm work, because you're just shifting the problem around. ", " is permitted in filenames too. The only thing which is not permitted in filenames is the null byte \0 (and, as an implementation detail of pacman, $'\n' is also not permitted)
- works on GNU/linux
- works on bsd ( not tested )
- doesn't work with busybox but makepkg either ( < < )
- without hardlinks gives the same result than cat | wc -c
- with hardlinks give the good size ( 32544016 for /usr/lib/dri )
- 10% faster than cat | wc -c ( tested on /usr/lib )
- fin de l'histoire
The INODECMD is stat -c '%i %n', which on my machine has output like:
12345 file
with
newlines
12346 otherfile
Also, the $file should be quoted in assignment and cat "$file" as well now that the cat is executed in bash. Otherwise the filename will be split. Wasn't a problem with find because -exec always passed the filename as one arg.
I don't have any non-linux machines to test, but if it appears correct size-wise configure could be modified to substitute the correct stat flags per platform, similar to Allan's patch.
Works for me on mesa pkg and with newline files.
If we're going with the above patch, I have more concerns to voice, other than that `cat $file` must be quoted.
1) `read file` must use -r to avoid read consuming backslash characters in filenames as well as erroneously interpreting a trailing backslash as an escaped newline.
2) `cat "$file" | wc -c` will fail when the filename is a valid option flag for cat. You might expect `cat -- "$file"` to work, but cat still interprets a single dash following the double dash as an alias for stdin, making files named '-' stall makepkg indefinitely waiting for input.
3) read ignores trailing delimiters, e.g. `read file <<< 'filename '` stores only "filename" in file, not "filename ", causing an error.
4) using ' ' (SPACE 0x20) as a delimiter for the first read causes read to consume _all_ space characters after the first word break. As a result, files will have prefix space characters stripped when parsed from stat -c '%i %n', causing an error.
So files containing '\', '-*', ' *', or '* ' in addition to $'\n' are not supported. Generally speaking, I am more confident in the null-delimited approach to correctly handle unforseen cases involving special characters.
These issues are effectively circumvented as well in my approach, so even if we don't care about newlines I think it should be reconsidered.
INODECMD doesn't break on spaces, which is good. It could be trivially extended to not break on newlines either, I think... in the only other use of it, we're also doing a while read loop, so we might as well add \0 to its output, I think.
@Brocellous, that being said while find -print0 is not a POSIX option and POSIX options are generally safest, it is an option that is implemented on I think all actual find implementations, for example the busybox applet and https://man.openbsd.org/find.1 -- because -print is sort of useless, and -print0 is needed for sanity, and basically everyone writing a find implementation seems to have agreed or independently come up with the desire to to implement null-delimited output.
We already use -print0 in several places in makepkg. -printf is a different story, and is indeed unique to GNU find.
That is a good note about -print0, thanks for pointing it out. I can submit a patch with -print0 instead if you like, I think it should work fine and may be more readable even if its not POSIX.
Is this the right place for me to submit and discuss patches? Or should they be submitted to pacman-dev?
BTW, I'm also quite happy to add a lint to prevent file names starting or ending with a space... I'm of the opinion those file names should not exist.
Summing up the filesizes in a directory isn't really pacman related, and we already found a way to count filesizes without writing a custom C utility for it, so unless something changes, the current state of pacman master is, this is fixed: https://git.archlinux.org/pacman.git/commit/?id=0272fca993718460bf7ecb7fdc3ca7dad1c7e6cd