FS#61717 - Packages not reproducible due to different size

Attached to Project: Pacman
Opened by Jelle van der Waa (jelly) - Monday, 11 February 2019, 10:25 GMT
Last edited by Allan McRae (Allan) - Friday, 11 October 2019, 10:24 GMT
Task Type Bug Report
Category makepkg
Status Closed
Assigned To No-one
Architecture All
Severity Low
Priority Normal
Reported Version 5.1.2
Due in Version 5.2.0
Due Date Undecided
Percent Complete 100%
Votes 0
Private No


Summary and Info:

For reproducible builds we started reproducing packages using archlinux-repro (on Github.com/archlinux ) and discovered that some packages where not reproducible due to a difference in PKGINFO's size. [1]
This issue seems to be caused by the size calculation in makepkg ie. du -ks --apparent-size reports a different size for files under
btrfs compressed FS and for example ext4. [2]

[1] https://gist.github.com/jelly/19edcbe1a9531b6890044adb9adc5152
[2]bug-coreutils@gnu.org/msg30689.html"> https://www.mail-archive.com/bug-coreutils@gnu.org/msg30689.html

Steps to Reproduce:

- Create two loopback fs's one ext4 and one btrfs
- dd if=/dev/zero of=testbtrfs.img bs=1M count=200
- mkfs.btrfs testbtrfs.img
- dd if=/dev/zero of=testfs.img bs=1M count=10
- mkfs.ext4 testfs.img
- export SOURCE_DATE_EPOCH=1544997248
- mount it somewhere but with the same destdir
- mount -t ext4 testfs.img /mnt/builddir
- makepkg archlinux-keyring in /mnt/builddir

- umount /mnt/builddir
- mount -t btrfs -o loop,compress=zlib testbtrfs.img /mnt/builddir
- makepkg archlinux-keyring
- umount /mnt/builddir
- diffoscope the result

See that the size is different for both packages.

Another testcase is against tmpfs.

- mkdir ~/builddir
- mount -t tmpfs -o size=1G tmpfs /home/jelle/builddir
- makepkg in builddir
- umount ~/builddir
- makepkg in builddir
- difffoscope result

Notice that the size is also different!
This task depends upon

Closed by  Allan McRae (Allan)
Friday, 11 October 2019, 10:24 GMT
Reason for closing:  Fixed
Additional comments about closing:  git commit f26cb61c
Comment by Robin Broda (coderobe) - Monday, 11 February 2019, 10:52 GMT
Thoughts about using `wc`?

It's most definitely gonna be slower than asking the fs metadata, but will be consistent.
Comment by Jelle van der Waa (jelly) - Monday, 11 February 2019, 11:21 GMT
Note that we had this solution before https://git.archlinux.org/pacman.git/commit/scripts/makepkg.sh.in?id=3f1ea8b62f46a915c94a5b46e21ad39ea2628f65 and the package archlinux-keyring package was build in tmpfs and reproduced on ext4.
Comment by Allan McRae (Allan) - Monday, 11 February 2019, 11:23 GMT
argh... this has changed so many times...
Comment by Allan McRae (Allan) - Monday, 11 February 2019, 11:54 GMT Comment by Bernhard M. Wiedemann (bmwiedemann) - Monday, 11 February 2019, 14:12 GMT
Reminds me of https://github.com/rpm-software-management/rpm/pull/229

Instead of wc -c $file
you could also use
stat -c %s $file

strace shows, both use fstat to read file metadata and not the content itself.

It only differs for directories.
Comment by Allan McRae (Allan) - Monday, 11 February 2019, 21:07 GMT
We did have stat -c once. Not very portable from memory.
Comment by Eli Schwartz (eschwartz) - Friday, 08 March 2019, 03:11 GMT
We already configure stat with different flags, three times -- two of which can be removed, BTW: https://lists.archlinux.org/pipermail/pacman-dev/2019-March/023241.html
The remaining case is INODECMD, used in zipman.

Since we currently assume -c can format information most places (it works for GNU and busybox at a minimum), and we fallback to -f on BSD/Darwin, we can assume our existing assumptions stand strong. Can we use that instead?
Comment by Allan McRae (Allan) - Friday, 08 March 2019, 03:23 GMT
So back two revisions ago? That also did not work...
Comment by Eli Schwartz (eschwartz) - Friday, 08 March 2019, 03:51 GMT
From a quick reading of the history it looked like we moved back to du, and only then added the sleep workaround. I think a conflicting merge threw me off though.

Darn it.
Comment by Eli Schwartz (eschwartz) - Friday, 08 March 2019, 04:04 GMT
Welp, given how comparatively little time is actually wasted by wc -c (especially as files are only just created and should still be in the disk cache) I suppose this is pretty reasonable after all. It feels dirty but it would seem to be very reliable (and *extremely* portable) which is a lot more interesting.

It's also more reliable, apparently, to not try calculating the size of a directory entry.
Comment by Jan Alexander Steffens (heftig) - Tuesday, 12 March 2019, 15:01 GMT
Yeah, the size of directories is the problem here.

du without --apparent-size has the problem of FS-dependent block sizes and thus rounding.
du, even with --apparent-size, has the problem of FS-dependent directory sizes. We can't tell it to ignore directories.

So using du is right out.

I would really like to avoid pushing all the file contents into a pipe (twice, if we count the actual archive creation).

I've tested
find . -type f -exec stat -c %s {} + | awk '{ x+=$1 } END { print x }'
find . -type f -exec cat {} + | wc -c
and both produce the same result for the reproduction case above, but the former is much faster.
Test case for speed was a directory with 10417 files totaling 335022335 bytes. (~30ms vs ~810ms)

Is `stat -c %s` really not portable enough?
Comment by Jan Alexander Steffens (heftig) - Tuesday, 12 March 2019, 15:29 GMT
Ah, I see, we had it as SIZECMD until 1af766987f66c63b029578374d679a353960d46d, but only for getting the size of single files, unrelated to this issue.

Using POSIX-only commands like in that commit, one can construct a more complex pipeline:

find . -type f -exec sh -c 'wc -c "$@" | tail -n 1' sh {} + | awk '{ x+=$1 } END { print x }'

Not quite as fast, at ~100ms, but this code is getting unwieldy and I'd prefer the cat|wc again.
Comment by Eli Schwartz (eschwartz) - Thursday, 21 March 2019, 22:48 GMT
  • Field changed: Category (General → makepkg)
  • Field changed: Due in Version (Undecided → 5.2.0)
  • Field changed: Percent Complete (0% → 100%)