FS#70954 - [gcc] [toolchain] build order / reproducibility issue

Attached to Project: Arch Linux
Opened by Toolybird (Toolybird) - Thursday, 20 May 2021, 02:32 GMT
Last edited by Toolybird (Toolybird) - Monday, 09 October 2023, 21:23 GMT
Task Type General Gripe
Category Packages: Core
Status Closed
Assigned To Giancarlo Razzolini (grazzolini)
freswa (frederik)
Architecture All
Severity Low
Priority Normal
Reported Version
Due in Version Undecided
Due Date Undecided
Percent Complete 100%
Votes 0
Private No

Details

(I'm sure "the powers that be" are fully aware of any issues raised here, but it still might be good to have this documented somewhere. Feel free to shut this down if not appropriate for the bug tracker.)

Whenever Arch upgrades the toolchain with a major new GCC version (e.g. GCC 10 -> GCC 11), a situation arises where the toolchain becomes (kind of) unreproducible. This is evidenced by (as of this writing) new entries appearing on the Arch Repro Status Page[1] for toolchain components (in particular binutils, gcc and gcc-libs).

Part of the cause is quite obvious when considering the current toolchain build order. Glibc startfiles compiled with the previous GCC are linked into the final binaries for both binutils and gcc. For example (as of this writing):

$ strings /usr/bin/ld | grep GCC:
GCC: (GNU) 10.2.0
GCC: (GNU) 11.1.0
$ strings /usr/bin/gcc | grep GCC:
GCC: (GNU) 10.2.0
GCC: (GNU) 11.1.0

Code from the previous toolchain is leaking into the current. This will of course sort itself out as toolchain components get rebuilt for minor revisions. But it would be nice if everything "just worked" from the get-go.
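
The check above can be wrapped in a small helper. This is only a sketch: the function name `distinct_gcc_versions` is made up for illustration, and it assumes the compiler idents show up as `GCC: (GNU) x.y.z` lines in `strings` output, as in the example above.

```shell
# Hypothetical helper: read `strings` output on stdin and print each
# distinct "GCC: ..." ident once, sorted.
distinct_gcc_versions() {
    grep '^GCC:' | sort -u
}

# Usage (commented out so the snippet runs on any machine):
# strings /usr/bin/ld | distinct_gcc_versions
```

More than one line of output means objects built by more than one compiler version were linked into the binary.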

One way to fix this would be a slight tweak to the current toolchain build order. The status quo has clearly served Arch well over the years (albeit with this tiny flaw) so this is merely a suggestion from the peanut gallery :)

Current:
linux-api-headers->glibc->binutils->gcc->binutils->glibc

Proposed:
linux-api-headers->glibc->binutils->gcc->glibc->binutils->gcc

Because GCC is such a beast to compile, it would make sense (and is perfectly acceptable IMHO) for the first GCC to be compiled with `--disable-bootstrap' (in fact I would advocate for both GCCs to be non-bootstrapped, but that's a separate, possibly controversial, topic for another day).

Part of my thinking is based on experience building cross toolchains. Sidenote: there is a python script in the glibc sources `build-many-glibcs.py' which IMHO represents state-of-the-art methodology for building cross toolchains. If you haven't already, check it out, it's awesome!

Anyway, just throwing it out there for comment.

[1]: https://reproducible.archlinux.org/

Closed by  Toolybird (Toolybird)
Monday, 09 October 2023, 21:23 GMT
Reason for closing:  Fixed
Comment by Allan McRae (Allan) - Thursday, 20 May 2021, 03:30 GMT
Hrm... I'm trying to work out if this also suggests the need to rebuild glibc for more minor gcc bumps (e.g. 11.1.0 to 11.2.0, or even 11.1.0 to 11.1.1) which we have not done in the past. e.g. binutils built after a minor gcc bump is still reproducible, but will contain references to multiple gcc versions.
Comment by Eli Schwartz (eschwartz) - Thursday, 20 May 2021, 04:07 GMT
There shouldn't be any reproducibility problems except when bootstrapping using packages that never get published.

Comment by Toolybird (Toolybird) - Thursday, 20 May 2021, 05:19 GMT
> except when bootstrapping using packages that never get published

Bingo! Because that's *exactly* what happens now in the Arch toolchain bootstrap. There are intermediate packages along the way.

There's got to be a way to express these dependencies properly in terms of the package manager (e.g.: glibc-first, binutils-stage1 etc.) but so far I am yet to come up with a clean solution.
Comment by Allan McRae (Allan) - Thursday, 20 May 2021, 06:55 GMT
As I said in my comment, deciding whether to do a full bootstrap on more minor versions is not about reproducibility. It is about whether we should ensure there is a reference to only a single gcc version in these files.
Comment by Emil (xexaxo) - Thursday, 20 May 2021, 18:40 GMT
Humble request - please document (as you get the chance of course) our current rebuild order.
A while ago Allan pointed me to LFS for more details about the sequence, yet they recommend/use what Toolybird is proposing just above.
Comment by Giancarlo Razzolini (grazzolini) - Friday, 21 May 2021, 01:35 GMT
Comment by Toolybird (Toolybird) - Friday, 09 July 2021, 21:20 GMT
While this bug is about the Arch toolchain bootstrap procedure, it turns out there is also an upstream GCC 11 bug affecting reproducibility.

I dug around a bit and filed an upstream bug report[1].

A fix has been proposed[2].

Until the fix is committed, the patch can be grabbed from here[3].

[1]: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101383
[2]: https://gcc.gnu.org/pipermail/gcc-patches/2021-July/574802.html
[3]: https://patchwork.ozlabs.org/project/gcc/patch/p17p7o-28o1-271o-6950-42oq6rnrs42@fhfr.qr/

sorry about the last link, flyspray has mangled it..
Comment by Toolybird (Toolybird) - Sunday, 11 July 2021, 21:47 GMT
Also on the rebuilderd status page is gcc-go. I have made some progress on this one.

Filed an upstream bug report[1] but no response so far. Folks interested in repro should take a look.

[1]: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101407
Comment by Toolybird (Toolybird) - Thursday, 15 July 2021, 19:59 GMT
Upstream have committed fixes for both bugs. Yay, GCC is reproducible again. The first will be included in upcoming 11.2. The gcc-go fix is only on mainline but is an easy cherry-pick.

Still mulling on how to fix the "bootstrap whole Arch toolchain reproducibly" issue...
Comment by Toolybird (Toolybird) - Sunday, 13 February 2022, 20:37 GMT
So it's really good to see some toolchain progress! But I have *major* concerns over the length of time of the bootstrap procedure. The whole thing is now horrendously inefficient. This is partly my fault for suggesting the additional GCC build. But throw LTO into the mix, and now even PGO (WTF??) and the whole thing is starting to get out of control. It's probably fine for Arch itself with access to powerful build machines, but the procedure needs to be attainable by average punters on consumer hardware.

After thinking about this for some time, I have some ideas on how to improve the process. But the only way I can see this working is if we have an official bootstrap script. Is this on the radar? It could maybe live in devtools?

We currently build:

linux-api-headers once
binutils twice
glibc twice
gcc (fat) twice (but actually 6 times !! due to 3-stage bootstrap)

If we borrow some ideas from cross compilation procedures, we could trim this down to:

linux-api-headers once
binutils twice
glibc once
gcc (thin) once
gcc (fat) once (but actually 3 times due to 3-stage bootstrap)

I'm not suggesting we do any actual cross compilation (although, I have experimented with this and proved that a cross compiled glibc can be byte-for-byte identical to a native compiled one). BTW, I previously mentioned `build-many-glibcs.py'. I've put some introductory usage notes up here [1] for anyone who would like to dabble. Studying the sequence and the log files produced is an excellent way to learn about the inner workings of toolchains IMHO.

The reason for the bootstrap script would be to employ an ENV VAR in the gcc PKGBUILD. For example:

(pseudo)
if ARCH_BOOTSTRAP
do thin gcc
else
do full fat gcc
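
In actual shell, the conditional might look roughly like the sketch below. The function name `gcc_bootstrap_flags` is hypothetical, and equating a "thin" GCC with `--disable-bootstrap` is an assumption based on the suggestion earlier in this report; a real PKGBUILD would pass the result to GCC's `configure` alongside its other options.

```shell
# Hypothetical helper: print the bootstrap-related configure flag for the
# current pass. ARCH_BOOTSTRAP is the env var proposed in this comment.
gcc_bootstrap_flags() {
    if [ -n "${ARCH_BOOTSTRAP:-}" ]; then
        # thin gcc: single-stage build (assumed meaning of "thin")
        echo "--disable-bootstrap"
    else
        # full fat gcc: GCC's default 3-stage bootstrap
        echo "--enable-bootstrap"
    fi
}
```

Because the variable is unset in a normal clean chroot build, the `else` branch is what regular package builds would take.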

I know this kind of thing is normally frowned upon in PKGBUILDs, but a special case like this might be acceptable?
Any thoughts? I'm working on a proof of concept..

[1] https://gitlab.com/-/snippets/2250210
Comment by Allan McRae (Allan) - Sunday, 13 February 2022, 23:17 GMT
As a metric, binutils with --enable-pgo-build=lto takes 4x longer to build. I gave up timing the gcc difference.

We build packages in clean chroots, so passing environmental flags is not straightforward. I was thinking of having a PKGBUILD.pass1 and PKGBUILD.pass2 in my build directory, and have a buildscript symlink it to PKGBUILD as needed. Not ideal, as you have code duplication across PKGBUILDs, but not seeing a great solution here.
Comment by Toolybird (Toolybird) - Monday, 14 February 2022, 04:57 GMT
4x longer? Ouch.

Clean chroot builds shouldn't be a problem, because the env var won't be set. i.e., conditional code will be bypassed. It's only when the Arch toolchain maintainer runs the bootstrap script that the env var will take effect. Allow me to demonstrate by stealing your build script posted in the forum :) This is just an example:

build linux-api-headers
build glibc --nocheck
build binutils --nocheck
export ARCH_BOOTSTRAP=1
build gcc --nocheck
unset ARCH_BOOTSTRAP
build glibc
build binutils
build gcc

I've already tried the PKGBUILD.pass1 approach but just couldn't stomach it.
Comment by Emil (xexaxo) - Wednesday, 16 February 2022, 12:08 GMT
Perhaps a silly question:
Do we enable LTO/PGO for anything but the final build? If so what does it bring us - both in terms of performance and build times?
Comment by Alexander Epaneshnikov (alex19EP) - Thursday, 17 February 2022, 17:10 GMT
> what does it bring us - both in terms of performance and build times?

https://gist.github.com/0849a33d8bdcb081f64274e3c6fa31f0
Comment by Toolybird (Toolybird) - Thursday, 24 March 2022, 20:58 GMT
So, I have strong suspicions GCC is no longer reproducible, probably due to PGO.

But we don't know for sure because the Arch Reproducible Status page [1] is currently horked WRT GCC.

"fatal: unable to access 'https://github.com/archlinux/svntogit-packages.git/': Could not resolve host: github.com"

(Could someone please fix the Arch rebuilderd instance to get it working again for GCC? Thanks!)

PGO was previously rejected here [2] and here [3]. What is different now? GCC is arguably *the most important* package that needs to be reproducible. If it pans out that PGO causes GCC to be unreproducible, then I vote we get rid of it. Any thoughts?

[1] https://reproducible.archlinux.org/
[2] FS#49129
[3] FS#56856
Comment by freswa (frederik) - Saturday, 26 March 2022, 19:58 GMT
Checked build with bootstrap instead of profiledbootstrap and `repro -fd` succeeded. Will investigate if we can fix this somehow...
Comment by Toolybird (Toolybird) - Sunday, 27 March 2022, 05:01 GMT
Just saw latest commit. Really appreciate the detailed commit message @freswa! I hadn't seen that link before. It led me to another recent posting [1].

It's a tough one because we apparently sacrifice compiler performance for reproducibility. Which is the more important goal? I guess that's a decision for Arch leadership. It appears other distros value the former. Fedora don't seem to care much about repro. Debian do, but their site is a nightmare to navigate. Judging from above link, openSUSE seem to say their GCC is reproducible, but then they ship a profiled GCC in production?

The other interesting thing in that link is the bit about "deterministic filesystem readdir order" which is something I hadn't considered before. It might possibly be a factor in another recent GCC bug I reported upstream [2]. Great, another rabbit hole for me to go down! Anyway, it appears GCC devs do try to improve the reproducibility of profiled builds from time-to-time which can only be good.

BTW, with bootstrap instead of profiledbootstrap, my GCC build time (--nocheck) goes from 2h:23m down to 1h:51m (Ryzen 2700X, -j16). Without LTO it goes down to about 45 mins IIRC. I need a faster build box :)

[1] https://lists.reproducible-builds.org/pipermail/rb-general/2022-February/002478.html
[2] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104832
Comment by Toolybird (Toolybird) - Tuesday, 12 April 2022, 07:22 GMT
> a proof of concept

It turns out that passing env vars through to the build environment is fully supported [1]. I've finally started a new toolchain repo [2] where you can see this in action.

When bootstrapping a full toolchain, there is absolutely no need in the *first passes* for LTO, PGO, debug pkgs or in the case of GCC, libgccjit.

Just omitting these simple things can speed up the build quite a lot. For example, on a beefy cloud VM with plenty of cores, a full bootstrap cycle (including test suites) of current Arch toolchain takes about 2h:52m:56s. With tweaks as per my repo, it goes down to 2h:21m:35s. That's a fair saving for very little effort. There is *heaps* more low hanging fruit to optimize this a lot further.
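
As a rough check of those figures (a sketch only; the helper name `hms_to_s` is made up here), converting both timings to seconds shows the size of the saving:

```shell
# Hypothetical helper: convert hours, minutes, seconds to seconds.
hms_to_s() {
    echo $(( $1 * 3600 + $2 * 60 + $3 ))
}

before=$(hms_to_s 2 52 56)   # current Arch toolchain bootstrap
after=$(hms_to_s 2 21 35)    # with the first-pass tweaks
saved=$(( before - after ))

# Integer arithmetic, so the percentage is truncated.
echo "saved ${saved}s, about $(( saved * 100 / before ))% of the original time"
```

That works out to 1881 seconds (31m:21s), roughly 18% of the original run time.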

Pros: faster toolchain builds
Cons:
1. ((_ARCH_BOOTSTRAP)) && "do this or that"
sprinkled throughout the toolchain PKGBUILDs
2. a toolchain build script is mandatory when bootstrapping
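
For readers unfamiliar with the idiom in point 1: in bash, `((_ARCH_BOOTSTRAP))` is an arithmetic test that succeeds only when the variable holds a non-zero number, so an unset variable behaves like 0 and the guarded command is skipped. The sketch below is illustrative only; the function name `pass_options` and the particular makepkg options are assumptions, not taken from the repo.

```shell
# bash only: relies on (( )) and arrays, as PKGBUILDs do.
# Hypothetical helper: build a makepkg-style options list for one pass.
pass_options() {
    local opts=('!emptydirs')    # placeholder base option
    # First passes skip LTO and debug packages when _ARCH_BOOTSTRAP is non-zero.
    ((_ARCH_BOOTSTRAP)) && opts+=('!lto' '!debug')
    printf '%s\n' "${opts[@]}"
}
```

With the variable unset or set to 0 only the base option is printed, which is why ordinary clean chroot builds would be unaffected.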

I've taken Allan's build script and added stuff for my own purposes. If this idea ever becomes official then an Arch guru could conceivably polish it up for inclusion in devtools.

Regarding reproducibility, I'm still struggling with gccgo / libgo. There is also a crazy binutils issue [3] apparently exposed by the non-PIC libiberty.a oopsie. We could *really* do with a new binutils upload (hint, hint, hi freswa!)

[1] https://bbs.archlinux.org/viewtopic.php?pid=1474035#p1474035
[2] https://gitlab.com/Toolybird/toolchain
[3] https://sourceware.org/bugzilla/show_bug.cgi?id=29042
Comment by Emil (xexaxo) - Tuesday, 12 April 2022, 18:16 GMT
> When bootstrapping a full toolchain, there is absolutely no need in the *first passes* for LTO, PGO,...

Precisely what I meant earlier with:

> Do we need to* enable LTO/PGO for anything but the final build?

Glad to see it shaved ~20% off the runtime. Out of curiosity - any reason why you didn't short-circuit the check functions as well?
Comment by Allan McRae (Allan) - Tuesday, 12 April 2022, 19:54 GMT
> Out of curiosity - any reason why you didn't short-circuit the check functions as well?

The script does

build linux-api-headers
build glibc --nocheck
build binutils --nocheck _ARCH_BOOTSTRAP=1
build gcc --nocheck _ARCH_BOOTSTRAP=1

Comment by Buggy McBugFace (bugbot) - Tuesday, 08 August 2023, 19:11 GMT
This is an automated comment as this bug has been open for more than 2 years. Please reply if you still experience this bug, otherwise this issue will be closed after 1 month.
