FS#64252 - [makepkg] new size calculation doesn't handle hard links

Attached to Project: Pacman
Opened by Evangelos Foutras (foutrelis) - Friday, 25 October 2019, 03:00 GMT
Last edited by Allan McRae (Allan) - Tuesday, 05 November 2019, 05:01 GMT
Task Type Bug Report
Category makepkg
Status Closed
Assigned To No-one
Architecture All
Severity Low
Priority Normal
Reported Version 5.2.0
Due in Version Undecided
Due Date Undecided
Percent Complete 100%
Votes 1
Private No

Details

As seen in  FS#64248 , mesa's installed size (as reported by pacman) grew by over 200 MB.

The new size computation method causes hard links to be counted multiple times. [1] This is not representative of actual installed size as most (all?) Linux file systems support hard links.

Under /usr/lib/dri, mesa ships 14 files but only 2 are unique:

$ du -sb --apparent-size /usr/lib/dri
32548112 /usr/lib/dri

$ find /usr/lib/dri -type f -exec cat {} + | wc -c
243613616

[1] https://git.archlinux.org/pacman.git/commit/?id=f26cb61cb6
This task depends upon

Closed by  Allan McRae (Allan)
Tuesday, 05 November 2019, 05:01 GMT
Reason for closing:  Fixed
Additional comments about closing:  git commit 0272fca9
Comment by Eli Schwartz (eschwartz) - Friday, 25 October 2019, 03:13 GMT
We were coincidentally discussing this in IRC today, as I observed the same package changes. The problem is that all previous methods invariably returned the wrong information entirely, and our current method is returning very consistent information which is reproducible and correctly describes what the package is like... taken on a per-file level.


04:54 PM <eschwartz> The mesa package claims to have grown in size by 201.22 MiB, but it turns out this is because a bunch of files are hardlinked together so wc -c counts them 9 times and 5 times respectively. I don't suppose there is anything we can do about this...
04:54 PM <eschwartz> at least it's nice and consistent when counting the files 14 times and getting the same result each time.
05:22 PM <dreisner> we've been through this too many damn times. someone needs to go write some portable tool to count unique inodes and sum sizes.
05:27 PM <elibrokeit> I can live with that as long as we don't have packages that repeated dly grow by 50mb then shrink by the same amount, whenever different comaintainers of a package build it on different filesystems!


It was suggested (and shot down) that we could write a lint warning to not use hardlinks:

05:35 PM <agregory> should tell packagers to stop using hardlinks
05:37 PM <dreisner> well, it's solving the wrong problem. we shouldn't disallow hardlinks just because we can't size the package correctly.


Allan observed:

06:19 PM <amcrae> I don't care what number we use, as long as it is approximately correct and is reproducible
06:21 PM <dreisner> in that case the current solution is Totally Fine™


...

Overall I'm not sure whether we should actually do something about this, but if we do do something, what should we do?
Comment by Allan McRae (Allan) - Friday, 25 October 2019, 03:50 GMT
We are not changing this again, unless there is a genuine solution!

I guess we need to write our own solution to this. I'm happy for a script to be added that steps through package files and skip "wc -c" any that have an inode the same size as previous.
Comment by alfred nenesse (nenesse) - Friday, 25 October 2019, 11:22 GMT
#!/bin/bash

write_pkginfo() {
local IFS=$'\n'
local size="$(find . -type f -links 1 -exec cat {} + 2>/dev/null | wc -c)"
inum='0'
for i in $(find . -type f -links +1 -printf '%i %p\n'|sort); do
n=${i% *}
fname=${i#* }
[ "$inum" != "$n" ] && {
z=$(cat "$fname"|wc -c)
(( size += z ))
}
inum=$n
done
echo $size
}
write_pkginfo
   pkg_size (0.4 KiB)
Comment by Eli Schwartz (eschwartz) - Friday, 25 October 2019, 12:00 GMT
find -printf is very GNU specific, it won't work if the system find is provided by something else like BSD find or busybox find.

The former is relevant because we try to be distro and even OS agnostic (we have code to run on bsd/macOS), the latter is relevant both because other Linux distros might want to use busybox, and because the upcoming alternatives system might mean archlinux itself lets you use busybox as the system find program.
Comment by alfred nenesse (nenesse) - Friday, 25 October 2019, 13:59 GMT
with -exec /bin/ls

#!/bin/bash

write_pkginfo() {
local IFS=$'\n'
local size="$(find . -xdev -type f -links 1 -exec cat {} + 2>/dev/null | wc -c)"
inum='0'
for i in $(find . -xdev -type f -links +1 -exec /bin/ls -lin {} +|sort); do
n=${i%% *}
fname=${i##* }
[ "$inum" != "$n" ] && {
z=$(cat "$fname"|wc -c)
(( size += z ))
}
inum=$n
done
echo $size
}
write_pkginfo
   pkg_size (0.4 KiB)
Comment by alfred nenesse (nenesse) - Friday, 25 October 2019, 15:47 GMT
for busybox replace (( size += z )) by size=$(( size + z ))
   pkg_size (0.4 KiB)
Comment by Eli Schwartz (eschwartz) - Friday, 25 October 2019, 17:20 GMT
Have you actually tried this??? for i in $(subshell) is automatically wrong no matter what, always, and now you are also parsing the output of ls: https://mywiki.wooledge.org/ParsingLs

$ cd /usr/lib/dri; for i in $(find . -xdev -type f -links +1 -exec /bin/ls -lin {} +|sort); do n=${i%% *}; fname=${i##* }; printf 'n: <%s>, fname: <%s>\n' "$n" "$fname"; done
n: <210649>, fname: <210649>
n: <-rwxr-xr-x>, fname: <-rwxr-xr-x>
n: <5>, fname: <5>
n: <0>, fname: <0>
n: <0>, fname: <0>
n: <12320632>, fname: <12320632>
n: <Oct>, fname: <Oct>
n: <24>, fname: <24>
n: <12:51>, fname: <12:51>
n: <./i915_dri.so>, fname: <./i915_dri.so>
n: <210649>, fname: <210649>
n: <-rwxr-xr-x>, fname: <-rwxr-xr-x>
n: <5>, fname: <5>
n: <0>, fname: <0>
n: <0>, fname: <0>
n: <12320632>, fname: <12320632>
n: <Oct>, fname: <Oct>
n: <24>, fname: <24>
n: <12:51>, fname: <12:51>
n: <./i965_dri.so>, fname: <./i965_dri.so>
n: <210649>, fname: <210649>
n: <-rwxr-xr-x>, fname: <-rwxr-xr-x>
n: <5>, fname: <5>
n: <0>, fname: <0>
n: <0>, fname: <0>
n: <12320632>, fname: <12320632>
n: <Oct>, fname: <Oct>
n: <24>, fname: <24>
n: <12:51>, fname: <12:51>
n: <./nouveau_vieux_dri.so>, fname: <./nouveau_vieux_dri.so>
n: <210649>, fname: <210649>
n: <-rwxr-xr-x>, fname: <-rwxr-xr-x>
n: <5>, fname: <5>
n: <0>, fname: <0>
n: <0>, fname: <0>
n: <12320632>, fname: <12320632>
n: <Oct>, fname: <Oct>
n: <24>, fname: <24>
n: <12:51>, fname: <12:51>
n: <./r200_dri.so>, fname: <./r200_dri.so>
n: <210649>, fname: <210649>
n: <-rwxr-xr-x>, fname: <-rwxr-xr-x>
n: <5>, fname: <5>
n: <0>, fname: <0>
n: <0>, fname: <0>
n: <12320632>, fname: <12320632>
n: <Oct>, fname: <Oct>
n: <24>, fname: <24>
n: <12:51>, fname: <12:51>
n: <./radeon_dri.so>, fname: <./radeon_dri.so>
n: <210650>, fname: <210650>
n: <-rwxr-xr-x>, fname: <-rwxr-xr-x>
n: <9>, fname: <9>
n: <0>, fname: <0>
n: <0>, fname: <0>
n: <20223384>, fname: <20223384>
n: <Oct>, fname: <Oct>
n: <24>, fname: <24>
n: <12:51>, fname: <12:51>
n: <./iris_dri.so>, fname: <./iris_dri.so>
n: <210650>, fname: <210650>
n: <-rwxr-xr-x>, fname: <-rwxr-xr-x>
n: <9>, fname: <9>
n: <0>, fname: <0>
n: <0>, fname: <0>
n: <20223384>, fname: <20223384>
n: <Oct>, fname: <Oct>
n: <24>, fname: <24>
n: <12:51>, fname: <12:51>
n: <./kms_swrast_dri.so>, fname: <./kms_swrast_dri.so>
n: <210650>, fname: <210650>
n: <-rwxr-xr-x>, fname: <-rwxr-xr-x>
n: <9>, fname: <9>
n: <0>, fname: <0>
n: <0>, fname: <0>
n: <20223384>, fname: <20223384>
n: <Oct>, fname: <Oct>
n: <24>, fname: <24>
n: <12:51>, fname: <12:51>
n: <./nouveau_dri.so>, fname: <./nouveau_dri.so>
n: <210650>, fname: <210650>
n: <-rwxr-xr-x>, fname: <-rwxr-xr-x>
n: <9>, fname: <9>
n: <0>, fname: <0>
n: <0>, fname: <0>
n: <20223384>, fname: <20223384>
n: <Oct>, fname: <Oct>
n: <24>, fname: <24>
n: <12:51>, fname: <12:51>
n: <./r300_dri.so>, fname: <./r300_dri.so>
n: <210650>, fname: <210650>
n: <-rwxr-xr-x>, fname: <-rwxr-xr-x>
n: <9>, fname: <9>
n: <0>, fname: <0>
n: <0>, fname: <0>
n: <20223384>, fname: <20223384>
n: <Oct>, fname: <Oct>
n: <24>, fname: <24>
n: <12:51>, fname: <12:51>
n: <./r600_dri.so>, fname: <./r600_dri.so>
n: <210650>, fname: <210650>
n: <-rwxr-xr-x>, fname: <-rwxr-xr-x>
n: <9>, fname: <9>
n: <0>, fname: <0>
n: <0>, fname: <0>
n: <20223384>, fname: <20223384>
n: <Oct>, fname: <Oct>
n: <24>, fname: <24>
n: <12:51>, fname: <12:51>
n: <./radeonsi_dri.so>, fname: <./radeonsi_dri.so>
n: <210650>, fname: <210650>
n: <-rwxr-xr-x>, fname: <-rwxr-xr-x>
n: <9>, fname: <9>
n: <0>, fname: <0>
n: <0>, fname: <0>
n: <20223384>, fname: <20223384>
n: <Oct>, fname: <Oct>
n: <24>, fname: <24>
n: <12:51>, fname: <12:51>
n: <./swrast_dri.so>, fname: <./swrast_dri.so>
n: <210650>, fname: <210650>
n: <-rwxr-xr-x>, fname: <-rwxr-xr-x>
n: <9>, fname: <9>
n: <0>, fname: <0>
n: <0>, fname: <0>
n: <20223384>, fname: <20223384>
n: <Oct>, fname: <Oct>
n: <24>, fname: <24>
n: <12:51>, fname: <12:51>
n: <./virtio_gpu_dri.so>, fname: <./virtio_gpu_dri.so>
n: <210650>, fname: <210650>
n: <-rwxr-xr-x>, fname: <-rwxr-xr-x>
n: <9>, fname: <9>
n: <0>, fname: <0>
n: <0>, fname: <0>
n: <20223384>, fname: <20223384>
n: <Oct>, fname: <Oct>
n: <24>, fname: <24>
n: <12:51>, fname: <12:51>
n: <./vmwgfx_dri.so>, fname: <./vmwgfx_dri.so>


Also please note that while the most obvious problem of using a for loop on the contents of a subprocess could be solved using while read instead, you still cannot parse the filenames from ls while ignoring the very strong warning to never, ever, ever use 'ls' in a script because

a) filenames can contain newlines, pacman forbids newlines as a side effect of the database format but this could change so we should write robust code which does not rely on the database format
b) filenames can contain spaces, and many archlinux packages contain filenames with spaces, so you cannot even declare newlines illegal then accurately parse the output of ls and assume the last component of each line is a filename -- you still need to actually parse the line and do heuristics to detect the implementation-defined output of various fields like "date". Furthermore, filenames could contain valid dates in them...
Comment by alfred nenesse (nenesse) - Friday, 25 October 2019, 17:48 GMT
cd /usr/lib/dri; for i in $(find . -xdev -type f -links +1 -exec /bin/ls -lin {} +|sort); do n=${i%% *}; fname=${i##* }; printf 'n: <%s>, fname: <%s>\n' "$n" "$fname"; done

IFS=$'\n' is important

pkg_size works for me on linux and gives the good size for /usr/lib/dri 32544016

pkg_size does not support spaces in filenames.

I think than ls -linm and fname=${i#* } works. i try it on linux and slitaz busybox and i come back tomorrow with the good solution (perhaps)



Comment by Eli Schwartz (eschwartz) - Friday, 25 October 2019, 19:58 GMT
IFS=$'\n' is also inscrutable and easy to miss, since it is not actually in the location where it is used.

IFS=$'\n' is also unambiguously wrong, because "cmd | while read -r line" is infinitely safer by default and doesn't require changing a global shell feature (and no, using local to scope it to just this function is at best a half solution).

> pkg_size does not support spaces in filenames.

Then pkg_size does not support makepkg, end of story.

Nor does ls -linm work, because you're just shifting the problem around. ", " is permitted in filenames too. The only thing which is not permitted in filenames is the null byte \0 (and, as an implementation detail of pacman, $'\n' is also not permitted)
Comment by alfred nenesse (nenesse) - Saturday, 26 October 2019, 09:47 GMT
New version of pkg_size with stat.
- works on GNU/linux
- works on bsd ( not tested )
- doesn't work with busybox but makepkg either ( < < )
- without hardlinks gives the same result than cat | wc -c
- with hardlinks give the good size ( 32544016 for /usr/lib/dri )
- 10% faster than cat | wc -c ( tested on /usr/lib )
- fin de l'histoire
   pkg_size (0.4 KiB)
Comment by Allan McRae (Allan) - Saturday, 26 October 2019, 11:36 GMT
.
Comment by Katharina Bönninghoff (kathh) - Saturday, 26 October 2019, 13:05 GMT
Couldn't this be solved by a C program, or does `pkg_size` need to be a shell script?
Comment by Allan McRae (Allan) - Saturday, 26 October 2019, 13:12 GMT
why write a C program when 7 lines of bash does the job?
Comment by Katharina Bönninghoff (kathh) - Saturday, 26 October 2019, 14:26 GMT
That's a very good point.
Comment by Ronan Pigott (Brocellous) - Saturday, 26 October 2019, 19:41 GMT
I thought it was a goal to support newlines in file name, no? If I'm not mistaken, the `read file` in that last patch will choke on newlines, right?

The INODECMD is stat -c '%i %n', which on my machine has output like:

12345 file
with
newlines
12346 otherfile

Also, the $file should be quoted in assignment and cat "$file" as well now that the cat is executed in bash. Otherwise the filename will be split. Wasn't a problem with find because -exec always passed the filename as one arg.
Comment by Ronan Pigott (Brocellous) - Saturday, 26 October 2019, 21:37 GMT
Here is my proposal. I use bash printf to format a null-delimited file list for parsing, since the find -print0 and -printf are GNU extensions.

I don't have any non-linux machines to test, but if it appears correct size-wise configure could be modified to substitute the correct stat flags per platform, similar to Allan's patch.

Works for me on mesa pkg and with newline files.
Comment by Allan McRae (Allan) - Saturday, 26 October 2019, 22:47 GMT
pacman does not support newlines in file names, so I see no reason that makepkg should. In fact, we should have a lint file that detects filenames with newlines and abandons packaging.
Comment by Allan McRae (Allan) - Saturday, 26 October 2019, 22:47 GMT
And we already do!
Comment by Ronan Pigott (Brocellous) - Saturday, 26 October 2019, 23:59 GMT
Yes eschwartz pointed that out above, but still seemed interested in supporting newlines in filenames for makepkg. Thoughts? TBH, I think I'm on the side of not supporting newlines but that's beside the point.

If we're going with the above patch, I have more concerns to voice, other than that `cat $file` must be quoted.

1) `read file` must use -r to avoid read consuming backslash characters in filenames as well as erroneously interpreting a trailing backslash as an escaped newline.
2) `cat "$file" | wc -c` will fail when the filename is a valid option flag for cat. You might expect `cat -- "$file"` to work, but cat still interprets a single dash following the double dash as an alias for stdin, making files named '-' stall makepkg indefinitely waiting for input.
3) read ignores trailing delimiters, e.g. `read file <<< 'filename '` stores only "filename" in file, not "filename ", causing an error.
4) using ' ' (SPACE 0x20) as a delimiter for the first read causes read to consume _all_ space characters after the first word break. As a result, files will have prefix space characters stripped when parsed from stat -c '%i %n', causing an error.

So files containing '\', '-*', ' *', or '* ' in addition to $'\n' are not supported. Generally speaking, I am more confident in the null-delimited approach to correctly handle unforseen cases involving special characters.

These issues are effectively circumvented as well in my approach, so even if we don't care about newlines I think it should be reconsidered.
Comment by Eli Schwartz (eschwartz) - Sunday, 27 October 2019, 00:08 GMT
As I said above, it's not a deal breaker if we don't support files with newlines. My main concern was we absolutely must support filenames with spaces... It would still be nice if, wherever possible, we used code which does not depend on that assumption.

INODECMD doesn't break on spaces, which is good. It could be trivially extended to not break on newlines either, I think... in the only other use of it, we're also doing a while read loop, so we might as well add \0 to its output, I think.

@Brocellous, that being said while find -print0 is not a POSIX option and POSIX options are generally safest, it is an option that is implemented on I think all actual find implementations, for example the busybox applet and https://man.openbsd.org/find.1 -- because -print is sort of useless, and -print0 is needed for sanity, and basically everyone writing a find implementation seems to have agreed or independently come up with the desire to to implement null-delimited output.

We already use -print0 in several places in makepkg. -printf is a different story, and is indeed unique to GNU find.
Comment by Ronan Pigott (Brocellous) - Sunday, 27 October 2019, 00:29 GMT
@eshwartz But doesn't the use of INODECMD break on spaces in this case? How would you address cases 3 and 4 above? I am quite certain that Allan's patch does not correctly handle ' name' or 'name ' files.

That is a good note about -print0, thanks for pointing it out. I can submit a patch with -print0 instead if you like, I think it should work fine and may be more readable even if its not POSIX.

Is this the right place for me to submit and discuss patches? Or should they be submitted to pacman-dev?
Comment by Allan McRae (Allan) - Sunday, 27 October 2019, 00:40 GMT
pacman-dev is the better place to submit and discuss patches.

BTW, I'm also quite happy to add a lint to prevent file names starting or ending with a space... I'm of the opinion those file names should not exist.
Comment by Eli Schwartz (eschwartz) - Sunday, 27 October 2019, 01:06 GMT
@Brocellous, I carelessly did not read the man page or help text for stat, and did not realize it forces newlines and the actual printf option which suppresses newlines and can just give you null-delimited output is... Not portable! So kill off that thought anyway.
Comment by Ruslan (ruff) - Thursday, 31 October 2019, 16:13 GMT
why don't you just du the target fakeroot dir?
Comment by Eli Schwartz (eschwartz) - Thursday, 31 October 2019, 16:17 GMT
Because that's the method we explicitly switched away from, and therefore in order to understand why we do not use du you need to check the history of how we ended up using cat | wc -c in the first place.

Comment by Ruslan (ruff) - Thursday, 31 October 2019, 16:28 GMT
Yea sorry i thought that after posting. So apparently neither you can reliably cal the size using shell script. Do you need to two-liner c utility to do that? perhaps as function of makepkg binary?
Comment by Ruslan (ruff) - Thursday, 31 October 2019, 16:31 GMT
sorry, pacman binary, as you have deptest
Comment by Eli Schwartz (eschwartz) - Thursday, 31 October 2019, 17:00 GMT
Well, --deptest queries pacman for useful information from the pacman binary about the libalpm database.

Summing up the filesizes in a directory isn't really pacman related, and we already found a way to count filesizes without writing a custom C utility for it, so unless something changes, the current state of pacman master is, this is fixed: https://git.archlinux.org/pacman.git/commit/?id=0272fca993718460bf7ecb7fdc3ca7dad1c7e6cd

Loading...