FS#77688 - After upgrade to systemd 253-1 system fails to start properly

Attached to Project: Arch Linux
Opened by Niels (zee99) - Wednesday, 01 March 2023, 12:13 GMT
Last edited by Christian Hesse (eworm) - Friday, 03 March 2023, 14:36 GMT
Task Type Bug Report
Category Packages: Core
Status Closed
Assigned To No-one
Architecture All
Severity High
Priority Normal
Reported Version
Due in Version Undecided
Due Date Undecided
Percent Complete 100%
Votes 2
Private No

Details

Building ARCH Linux AMI images for use on AWS EC2: https://wiki.archlinux.org/title/Arch_Linux_AMIs_for_Amazon_Web_Services

Images are build as plain (minimal) as possible using pacstrap and generally following the same process as outline here: https://wiki.archlinux.org/title/Installation_guide#Installation

The following packages are directly installed and all dependencies are pulled in automatically:

base grub mkinitcpio systemd-sysvcompat dosfstools e2fsprogs exfatprogs ntfs-3g xfsprogs man which lsof reflector rsync vi python3 audit irqbalance openssh haveged cloud-init cloud-utils aws-cli jq linux linux-headers

at the moment I do force a downgrade from 253-1 to 252.5-1 (by getting the 252 packages from the Arch Archive and installing them with pacman -U) to get the images to boot, but otherwise this is it.

A network is configured as:

/etc/systemd/network/20.ethernet

[Match]
Name = en* eth*
[Network]
DHCP = yes
[DHCP]
UseMTU = yes
UseDNS = yes
UseDomains = yes

and we enable the following services specifically:

systemctl enable systemd-timesyncd.service
systemctl enable nscd.service
systemctl enable auditd.service
systemctl enable haveged.service
systemctl enable irqbalance.service
systemctl enable cloud-init.service
systemctl enable cloud-config.service
systemctl enable cloud-final.service
systemctl enable systemd-networkd
systemctl enable systemd-resolved
systemctl enable sshd.service

These images start without issue when ussing systemd 252.5-1 but fail to start fully on 253-1.

Before updating to using 252.5 systemd we have the following targets and services:

Feb 25 06:25:08 ip-172-31-50-96 systemd[1]: Reached target Local Encrypted Volumes.
Feb 25 06:25:08 ip-172-31-50-96 systemd[1]: Reached target Local Integrity Protected Volumes.
Feb 25 06:25:08 ip-172-31-50-96 systemd[1]: Reached target Path Units.
Feb 25 06:25:08 ip-172-31-50-96 systemd[1]: Reached target Remote File Systems.
Feb 25 06:25:08 ip-172-31-50-96 systemd[1]: Reached target Slice Units.
Feb 25 06:25:08 ip-172-31-50-96 systemd[1]: Reached target Swaps.
Feb 25 06:25:08 ip-172-31-50-96 systemd[1]: Reached target Local Verity Protected Volumes.
Feb 25 06:25:08 ip-172-31-50-96 systemd[1]: Reached target Preparation for Local File Systems.
Feb 25 06:25:08 ip-172-31-50-96 systemd[1]: Reached target Local File Systems.
Feb 25 06:25:08 ip-172-31-50-96 systemd[1]: Reached target System Time Set.
Feb 25 06:25:09 ip-172-31-50-96 systemd[1]: Reached target Host and Network Name Lookups.
Feb 25 06:25:09 ip-172-31-50-96 systemd[1]: Reached target System Initialization.
Feb 25 06:25:09 ip-172-31-50-96 systemd[1]: Reached target Timer Units.
Feb 25 06:25:09 ip-172-31-50-96 systemd[1]: Reached target Socket Units.
Feb 25 06:25:09 ip-172-31-50-96 systemd[1]: Reached target Basic System.
Feb 25 06:25:16 ip-172-31-50-96 systemd[1]: Reached target Preparation for Network.
Feb 25 06:25:16 ip-172-31-50-96 systemd[1]: Reached target Network.
Feb 25 06:25:20 ip-172-31-50-96 systemd[1]: Reached target Cloud-config availability.
Feb 25 06:25:20 ip-172-31-50-96 systemd[1]: Reached target Network is Online.
Feb 25 06:25:20 ip-172-31-50-96 systemd[1]: Reached target Login Prompts.
Feb 25 06:25:20 ip-172-31-50-96 systemd[1]: Reached target Multi-User System.
Feb 25 06:25:21 ip-172-31-50-96 systemd[1]: Reached target Cloud-init target.


Feb 25 06:25:08 ip-172-31-50-96 systemd[1]: Starting Create List of Static Device Nodes...
Feb 25 06:25:08 ip-172-31-50-96 systemd[1]: Starting Load Kernel Module configfs...
Feb 25 06:25:08 ip-172-31-50-96 systemd[1]: Starting Load Kernel Module drm...
Feb 25 06:25:08 ip-172-31-50-96 systemd[1]: Starting Load Kernel Module fuse...
Feb 25 06:25:08 ip-172-31-50-96 systemd[1]: Starting Load Kernel Modules...
Feb 25 06:25:08 ip-172-31-50-96 systemd[1]: Starting Generate network units from Kernel command line...
Feb 25 06:25:08 ip-172-31-50-96 systemd[1]: Starting Remount Root and Kernel File Systems...
Feb 25 06:25:08 ip-172-31-50-96 systemd[1]: Starting Coldplug All udev Devices...
Feb 25 06:25:08 ip-172-31-50-96 systemd[1]: Starting Load/Save Random Seed...
Feb 25 06:25:08 ip-172-31-50-96 systemd[1]: Starting Apply Kernel Variables...
Feb 25 06:25:08 ip-172-31-50-96 systemd[1]: Starting Create System Users...
Feb 25 06:25:08 ip-172-31-50-96 systemd[1]: Starting Create Static Device Nodes in /dev...
Feb 25 06:25:08 ip-172-31-50-96 systemd[1]: Starting Rebuild Dynamic Linker Cache...
Feb 25 06:25:08 ip-172-31-50-96 systemd[1]: Starting Journal Service...
Feb 25 06:25:08 ip-172-31-50-96 systemd[1]: Starting Rule-based Manager for Device Events and Files...
Feb 25 06:25:08 ip-172-31-50-96 systemd[1]: Starting Flush Journal to Persistent Storage...
Feb 25 06:25:08 ip-172-31-50-96 systemd[1]: Starting Create Volatile Files and Directories...
Feb 25 06:25:08 ip-172-31-50-96 systemd[1]: Starting Security Auditing Service...
Feb 25 06:25:08 ip-172-31-50-96 systemd[1]: Starting Rebuild Journal Catalog...
Feb 25 06:25:08 ip-172-31-50-96 systemd[1]: Starting Network Name Resolution...
Feb 25 06:25:08 ip-172-31-50-96 systemd[1]: Starting Network Time Synchronization...
Feb 25 06:25:08 ip-172-31-50-96 systemd[1]: Starting Update is Completed...
Feb 25 06:25:09 ip-172-31-50-96 systemd[1]: Starting Record System Boot/Shutdown in UTMP...
Feb 25 06:25:09 ip-172-31-50-96 systemd[1]: Starting Initial cloud-init job (pre-networking)...
Feb 25 06:25:09 ip-172-31-50-96 systemd[1]: Starting D-Bus System Message Bus...
Feb 25 06:25:09 ip-172-31-50-96 systemd[1]: Starting Name Service Cache Daemon...
Feb 25 06:25:09 ip-172-31-50-96 systemd[1]: Starting User Login Management...
Feb 25 06:25:16 ip-172-31-50-96 systemd[1]: Starting Network Configuration...
Feb 25 06:25:17 ip-172-31-50-96 systemd[1]: Starting Wait for Network to be Configured...
Feb 25 06:25:17 ip-172-31-50-96 systemd[1]: Starting Hostname Service...
Feb 25 06:25:19 ip-172-31-50-96 systemd[1]: Starting Initial cloud-init job (metadata service crawler)...
Feb 25 06:25:20 ip-172-31-50-96 systemd[1]: Starting Apply the settings specified in cloud-config...
Feb 25 06:25:20 ip-172-31-50-96 systemd[1]: Starting Permit User Sessions...
Feb 25 06:25:20 ip-172-31-50-96 systemd[1]: Starting Execute cloud user/final scripts...
Feb 25 06:25:44 ip-172-31-50-96 systemd[1]: Starting User Runtime Directory /run/user/1000...
Feb 25 06:25:44 ip-172-31-50-96 systemd[1]: Starting User Manager for UID 1000...

After update to 253 we get:

Feb 25 06:28:14 ip-172-31-50-96 systemd[1]: Reached target Local Encrypted Volumes.
Feb 25 06:28:14 ip-172-31-50-96 systemd[1]: Reached target Local Integrity Protected Volumes.
Feb 25 06:28:14 ip-172-31-50-96 systemd[1]: Reached target Path Units.
Feb 25 06:28:14 ip-172-31-50-96 systemd[1]: Reached target Slice Units.
Feb 25 06:28:14 ip-172-31-50-96 systemd[1]: Reached target Swaps.
Feb 25 06:28:14 ip-172-31-50-96 systemd[1]: Reached target Local Verity Protected Volumes.
Feb 25 06:28:14 ip-172-31-50-96 systemd[1]: Reached target Preparation for Network.
Feb 25 06:28:14 ip-172-31-50-96 systemd[1]: Reached target Preparation for Local File Systems.
Feb 25 06:28:14 ip-172-31-50-96 systemd[1]: Reached target Local File Systems.
Feb 25 06:28:17 ip-172-31-50-96 systemd[1]: Reached target System Time Set.
Feb 25 06:28:17 ip-172-31-50-96 systemd[1]: Reached target Network.
Feb 25 06:28:17 ip-172-31-50-96 systemd[1]: Reached target Host and Network Name Lookups.
Feb 25 06:28:17 ip-172-31-50-96 systemd[1]: Reached target System Initialization.
Feb 25 06:28:17 ip-172-31-50-96 systemd[1]: Reached target Timer Units.
Feb 25 06:28:17 ip-172-31-50-96 systemd[1]: Reached target Socket Units.
Feb 25 06:28:17 ip-172-31-50-96 systemd[1]: Reached target Basic System.
Feb 25 06:28:17 ip-172-31-50-96 systemd[1]: Reached target Multi-User System.

Feb 25 06:28:14 ip-172-31-50-96 systemd[1]: Starting Create List of Static Device Nodes...
Feb 25 06:28:14 ip-172-31-50-96 systemd[1]: Starting Load Kernel Module configfs...
Feb 25 06:28:14 ip-172-31-50-96 systemd[1]: Starting Load Kernel Module dm_mod...
Feb 25 06:28:14 ip-172-31-50-96 systemd[1]: Starting Load Kernel Module fuse...
Feb 25 06:28:14 ip-172-31-50-96 systemd[1]: Starting Load Kernel Module loop...
Feb 25 06:28:14 ip-172-31-50-96 systemd[1]: Starting Load Kernel Modules...
Feb 25 06:28:14 ip-172-31-50-96 systemd[1]: Starting Generate network units from Kernel command line...
Feb 25 06:28:14 ip-172-31-50-96 systemd[1]: Starting Remount Root and Kernel File Systems...
Feb 25 06:28:14 ip-172-31-50-96 systemd[1]: Starting Coldplug All udev Devices...
Feb 25 06:28:14 ip-172-31-50-96 systemd[1]: Starting Load/Save OS Random Seed...
Feb 25 06:28:14 ip-172-31-50-96 systemd[1]: Starting Apply Kernel Variables...
Feb 25 06:28:14 ip-172-31-50-96 systemd[1]: Starting Create System Users...
Feb 25 06:28:14 ip-172-31-50-96 systemd[1]: Starting Create Static Device Nodes in /dev...
Feb 25 06:28:14 ip-172-31-50-96 systemd[1]: Starting Rebuild Dynamic Linker Cache...
Feb 25 06:28:14 ip-172-31-50-96 systemd[1]: Starting Journal Service...
Feb 25 06:28:14 ip-172-31-50-96 systemd[1]: Starting Rule-based Manager for Device Events and Files...
Feb 25 06:28:15 ip-172-31-50-96 systemd[1]: Starting Flush Journal to Persistent Storage...
Feb 25 06:28:15 ip-172-31-50-96 systemd[1]: Starting Network Configuration...
Feb 25 06:28:15 ip-172-31-50-96 systemd[1]: Starting Create Volatile Files and Directories...
Feb 25 06:28:16 ip-172-31-50-96 systemd[1]: Starting Rebuild Journal Catalog...
Feb 25 06:28:16 ip-172-31-50-96 systemd[1]: Starting Network Name Resolution...
Feb 25 06:28:16 ip-172-31-50-96 systemd[1]: Starting Network Time Synchronization...
Feb 25 06:28:16 ip-172-31-50-96 systemd[1]: Starting Record System Boot/Shutdown in UTMP...
Feb 25 06:28:17 ip-172-31-50-96 systemd[1]: Starting Update is Completed...
Feb 25 06:28:17 ip-172-31-50-96 systemd[1]: Starting D-Bus System Message Bus...
Feb 25 06:28:17 ip-172-31-50-96 systemd[1]: Starting Hostname Service...

Particular concerning is that we do never reach "target Login Prompts.". Thus even on the serial remote interface that I can use to access the instance and see what is wrong I dont get a login prompt making it very hard to do any digging into this issue.

In both cases the network is initialized and an IP address is assigned. But a number of service/targets are never started/reached.

Even on a freshly build AMI (where cloud-init has never run) it fails to boot (and cloud-init never runs) when running with systemd 253 but works flawlessly with 252.5-1. I point this out specifically as cloud-init will modify the network device as part of its initialization (basically that is what cloud-init is supposed to do along with seeding the allowed certificates/keys for the sshd).

Output from networkctl (before 253 upgrade as I can not get this after the upgrade).

IDX LINK TYPE OPERATIONAL SETUP
1 lo loopback carrier unmanaged
2 ens5 ether routable configured

Output from: systemctl list-unit-files (again before upgrade, if upgrade to 253 and DO NOT REBOOT, running the same command produces same output with the only difference being systemd-journald-audit.socket set to disabled):

UNIT FILE STATE PRESET
proc-sys-fs-binfmt_misc.automount static -
-.mount generated -
dev-hugepages.mount static -
dev-mqueue.mount static -
proc-sys-fs-binfmt_misc.mount disabled disabled
sys-fs-fuse-connections.mount static -
sys-kernel-config.mount static -
sys-kernel-debug.mount static -
sys-kernel-tracing.mount static -
tmp.mount static -
var-lib-machines.mount static -
systemd-ask-password-console.path static -
systemd-ask-password-wall.path static -
session-1.scope transient -
archlinux-keyring-wkd-sync.service static -
auditd.service enabled disabled
autovt@.service alias -
cloud-config.service enabled disabled
cloud-final.service enabled disabled
cloud-init-hotplugd.service static -
cloud-init-local.service disabled disabled
cloud-init.service enabled disabled
console-getty.service disabled disabled
container-getty@.service static -
dbus-org.freedesktop.hostname1.service alias -
dbus-org.freedesktop.import1.service alias -
dbus-org.freedesktop.locale1.service alias -
dbus-org.freedesktop.login1.service alias -
dbus-org.freedesktop.machine1.service alias -
dbus-org.freedesktop.network1.service alias -
dbus-org.freedesktop.portable1.service alias -
dbus-org.freedesktop.resolve1.service alias -
dbus-org.freedesktop.timedate1.service alias -
dbus-org.freedesktop.timesync1.service alias -
dbus.service static -
debug-shell.service disabled disabled
dhclient@.service disabled disabled
dm-event.service static -
e2scrub@.service static -
e2scrub_all.service static -
e2scrub_fail@.service static -
e2scrub_reap.service disabled disabled
emergency.service static -
fstrim.service static -
getty@.service enabled enabled
git-daemon@.service static -
haveged-once.service static -
haveged.service enabled disabled
initrd-cleanup.service static -
initrd-parse-etc.service static -
initrd-switch-root.service static -
initrd-udevadm-cleanup-db.service static -
ip6tables.service disabled disabled
iptables.service disabled disabled
irqbalance.service enabled disabled
kmod-static-nodes.service static -
krb5-kadmind.service disabled disabled
krb5-kdc.service disabled disabled
krb5-kpropd.service disabled disabled
krb5-kpropd@.service static -
ldconfig.service static -
man-db.service static -
mkinitcpio-generate-shutdown-ramfs.service static -
modprobe@.service static -
nscd.service enabled disabled
nvmefc-boot-connections.service disabled disabled
nvmf-autoconnect.service disabled disabled
nvmf-connect@.service static -
pam_namespace.service static -
quotaon.service static -
reflector.service disabled disabled
rescue.service static -
rfkill-block@.service disabled disabled
rfkill-unblock@.service disabled disabled
rsyncd.service disabled disabled
rsyncd@.service static -
serial-getty@.service indirect disabled
shadow.service static -
sshd.service enabled disabled
sshdgenkeys.service static -
sudo_logsrvd.service disabled disabled
system-update-cleanup.service static -
systemd-ask-password-console.service static -
systemd-ask-password-wall.service static -
systemd-backlight@.service static -
systemd-binfmt.service static -
systemd-bless-boot.service static -
systemd-boot-check-no-failures.service disabled disabled
systemd-boot-system-token.service static -
systemd-boot-update.service disabled enabled
systemd-coredump@.service static -
systemd-exit.service static -
systemd-firstboot.service static -
systemd-fsck-root.service enabled-runtime disabled
systemd-fsck@.service static -
systemd-halt.service static -
systemd-hibernate-resume@.service static -
systemd-hibernate.service static -
systemd-homed-activate.service disabled disabled
systemd-homed.service disabled enabled
systemd-hostnamed.service static -
systemd-hwdb-update.service static -
systemd-hybrid-sleep.service static -
systemd-importd.service static -
systemd-journal-catalog-update.service static -
systemd-journal-flush.service static -
systemd-journal-gatewayd.service indirect disabled
systemd-journal-remote.service indirect disabled
systemd-journal-upload.service disabled disabled
systemd-journald.service static -
systemd-journald@.service static -
systemd-kexec.service static -
systemd-localed.service static -
systemd-logind.service static -
systemd-machine-id-commit.service static -
systemd-machined.service static -
systemd-modules-load.service static -
systemd-network-generator.service enabled enabled
systemd-networkd-wait-online.service enabled disabled
systemd-networkd-wait-online@.service disabled disabled
systemd-networkd.service enabled enabled
systemd-nspawn@.service disabled disabled
systemd-oomd.service disabled disabled
systemd-pcrphase-initrd.service static -
systemd-pcrphase-sysinit.service static -
systemd-pcrphase.service static -
systemd-portabled.service static -
systemd-poweroff.service static -
systemd-pstore.service disabled enabled
systemd-quotacheck.service static -
systemd-random-seed.service static -
systemd-reboot.service static -
systemd-remount-fs.service enabled-runtime disabled
systemd-repart.service static -
systemd-resolved.service enabled enabled
systemd-rfkill.service static -
systemd-suspend-then-hibernate.service static -
systemd-suspend.service static -
systemd-sysctl.service static -
systemd-sysext.service disabled disabled
systemd-sysupdate-reboot.service indirect disabled
systemd-sysupdate.service indirect disabled
systemd-sysusers.service static -
systemd-time-wait-sync.service disabled disabled
systemd-timedated.service static -
systemd-timesyncd.service enabled enabled
systemd-tmpfiles-clean.service static -
systemd-tmpfiles-setup-dev.service static -
systemd-tmpfiles-setup.service static -
systemd-udev-settle.service static -
systemd-udev-trigger.service static -
systemd-udevd.service static -
systemd-update-done.service static -
systemd-update-utmp.service static -
systemd-user-sessions.service static -
systemd-userdbd.service indirect disabled
systemd-vconsole-setup.service static -
systemd-volatile-root.service static -
user-runtime-dir@.service static -
user@.service static -
uuidd.service indirect disabled
xfs_scrub@.service static -
xfs_scrub_all.service static -
xfs_scrub_fail@.service static -
machine.slice static -
system-systemd\x2dcryptsetup.slice static -
user.slice static -
cloud-init-hotplugd.socket disabled disabled
dbus.socket static -
dm-event.socket static -
git-daemon.socket disabled disabled
krb5-kpropd.socket disabled disabled
rsyncd.socket disabled disabled
syslog.socket static -
systemd-coredump.socket static -
systemd-journal-gatewayd.socket disabled disabled
systemd-journal-remote.socket disabled disabled
systemd-journald-audit.socket static -
systemd-journald-dev-log.socket static -
systemd-journald-varlink@.socket static -
systemd-journald.socket static -
systemd-journald@.socket static -
systemd-networkd.socket enabled disabled
systemd-oomd.socket disabled disabled
systemd-rfkill.socket static -
systemd-udevd-control.socket static -
systemd-udevd-kernel.socket static -
systemd-userdbd.socket disabled enabled
uuidd.socket disabled disabled
basic.target static -
blockdev@.target static -
bluetooth.target static -
boot-complete.target static -
cloud-config.target static -
cloud-init.target enabled-runtime disabled
cryptsetup-pre.target static -
cryptsetup.target static -
ctrl-alt-del.target alias -
default.target alias -
emergency.target static -
exit.target disabled disabled
factory-reset.target static -
final.target static -
first-boot-complete.target static -
getty-pre.target static -
getty.target static -
graphical.target static -
halt.target disabled disabled
hibernate.target static -
hybrid-sleep.target static -
initrd-fs.target static -
initrd-root-device.target static -
initrd-root-fs.target static -
initrd-switch-root.target static -
initrd-usr-fs.target static -
initrd.target static -
integritysetup-pre.target static -
integritysetup.target static -
kexec.target disabled disabled
local-fs-pre.target static -
local-fs.target static -
machines.target disabled enabled
multi-user.target indirect disabled
network-online.target static -
network-pre.target static -
network.target static -
nss-lookup.target static -
nss-user-lookup.target static -
nvmf-connect.target static -
paths.target static -
poweroff.target disabled disabled
printer.target static -
reboot.target disabled enabled
remote-cryptsetup.target disabled enabled
remote-fs-pre.target static -
remote-fs.target enabled enabled
remote-veritysetup.target disabled disabled
rescue.target static -
rpcbind.target static -
shutdown.target static -
sigpwr.target static -
sleep.target static -
slices.target static -
smartcard.target static -
sockets.target static -
sound.target static -
suspend-then-hibernate.target static -
suspend.target static -
swap.target static -
sysinit.target static -
system-update-pre.target static -
system-update.target static -
time-set.target static -
time-sync.target static -
timers.target static -
umount.target static -
usb-gadget.target static -
veritysetup-pre.target static -
veritysetup.target static -
archlinux-keyring-wkd-sync.timer disabled disabled
e2scrub_all.timer disabled disabled
fstrim.timer disabled disabled
man-db.timer disabled disabled
reflector.timer disabled disabled
shadow.timer static -
systemd-sysupdate-reboot.timer disabled disabled
systemd-sysupdate.timer disabled disabled
systemd-tmpfiles-clean.timer static -
xfs_scrub_all.timer disabled disabled









This task depends upon

Closed by  Christian Hesse (eworm)
Friday, 03 March 2023, 14:36 GMT
Reason for closing:  Fixed
Additional comments about closing:  systemd 253.1-1
Comment by Toolybird (Toolybird) - Wednesday, 01 March 2023, 21:03 GMT
It's a regression so have you bisected it to identify the offending commit? Having that information first before reporting upstream would be wise.
Comment by Niels (zee99) - Wednesday, 01 March 2023, 23:15 GMT
Upstream: https://github.com/systemd/systemd/issues/26613

There are 1811 commits between version 252 and 253 ... unless we have some idea on where this could be that is a near impossible task.
Comment by Niels (zee99) - Wednesday, 01 March 2023, 23:16 GMT
Please find debug log files of systemd from a successful boot and a 253 failed boot. Not sure what to look for though.
Comment by loqs (loqs) - Wednesday, 01 March 2023, 23:30 GMT
$ git bisect start
status: waiting for both good and bad commits
$ git bisect bad v253
status: waiting for good commit(s), bad commit known
$ git bisect good v252
Bisecting: 902 revisions left to test after this (roughly 10 steps)
[60e84f02055b76bd8916822939b686ba9b5798d5] Merge pull request #25732 from enr0n/unit-test-machine-id-initialized
Edit:
See attached source tarball. Note you need to revert 0001-Use-Arch-Linux-device-access-groups.patch before recording the git bisect result and re-apply it afterwards or build in a clean chroot.
Comment by Toolybird (Toolybird) - Thursday, 02 March 2023, 04:30 GMT
> that is a near impossible task.

???

It seems you're unaware of the power of git bisection. Please read up on the topic [1]. I had to do exactly this myself recently in  FS#77569 . It wasn't super hard, just a fair bit of effort, but worth it. @loqs has kindly given you a head start.

[1] https://wiki.archlinux.org/title/Bisecting_bugs_with_Git
Comment by Niels (zee99) - Thursday, 02 March 2023, 08:56 GMT
Thanks for the info on bisect.

As part of the setup for bisect I build systemd 253.r62851.7a67afe331-1-arch from AUR. And then before proceeding with the bisecting decided to install it. And now everything works just as well again as before.

That indicates that there was a bug somewhere that subsequently was fixed upstream with a patch.

Two questions:

1) Is the process for "reverse" bisecting the same? IMHO it would be easier finding the patch that fixed the issue in 253.r62851.7a67afe331-1-arch going back against the official Arch 253-1 ? If we find the patch that fixed the issue we know the cause. I would still like to know what caused this and how in the future I'm better prepared to find these issues ... any help and tips appreciated.

2) What is the typical merge time for Arch Linux to pull GIT builds into the official repo?
Comment by Christian Hesse (eworm) - Thursday, 02 March 2023, 09:03 GMT
Sure, you can still bisect, but the logic is a bit screwed. Bisect always wants to find a bind commit, you want to find a good one. So call a working state "bad", and a broken boot "good" to find your fix.

Usually we do not package git state into official packages, but of course I will cherry-pick important fixes. Also v253.1 is expected soon - possibly that will fix the issue already.
Comment by Christian Hesse (eworm) - Thursday, 02 March 2023, 09:06 GMT
Oh, if this confuses too much... Have a look a git-bisect(1) and search for "Alternate terms".
Comment by Niels (zee99) - Thursday, 02 March 2023, 09:31 GMT
Regarding the bisecting ... so the process is:

bisect
build
test
-if error still persists
bisect-again (would I do the bisect with the same series of commands and bisect then just "chooses" a new commit point?)
build
test

Rinse and repeat?

Seems I have some problems with building after bisecting ... get the following error during the check phase:

1088/1089 systemd:dist-check / check-version-ukify FAIL 0.20s exit status 4
>>> MALLOC_PERTURB_=56 /home/arch/systemd-built/systemd-git/src/systemd/tools/check-version.sh /home/arch/systemd-built/systemd-git/src/build/ukify 252

... stuck again.

Comment by Miguel (gypaetus) - Thursday, 02 March 2023, 11:31 GMT
To prevent this systemd bug in the Arch AMI images, one should include systemd, systemd-libs and systemd-sysvcompat under IgnorePkg in pacman.conf ?
Comment by Niels (zee99) - Thursday, 02 March 2023, 11:48 GMT
Yes - in the new images currently available here: http://arch-ami-list.drzee.net/ this has been done already and they are "frozen" on Systemd 252.5.
Comment by loqs (loqs) - Thursday, 02 March 2023, 14:36 GMT
The following change will allow the build to continue with check failures:
meson test -C build
to
meson test -C build || true
Comment by Niels (zee99) - Thursday, 02 March 2023, 21:35 GMT
After doing the bisection it would appear that

5bc244aaa90211ccd8370535274c266cdff6a1cb is the first bad commit
commit 5bc244aaa90211ccd8370535274c266cdff6a1cb
Date: Sun Sep 25 09:18:41 2022 +0200

chase-symlinks: Add chase_symlinks_at()

Is one of the problematic commits that breaks things.

While everything works fine with 252.r60801.15b4b4ed06 and fails with 252.r60802.5bc244aaa9 I also observed that everything is fine with a much later release that I went through as part of the bisect, 252.r60846.a92401d558

I find odd as it does not look like the bisect centered around that one ... maybe I need to repeat the bisect?
Comment by loqs (loqs) - Thursday, 02 March 2023, 21:54 GMT
You could try to find the fix for the issue with a reverse bisect

$ git bisect start --term-new=fixed --term-old=broken
$ git bisect broken v253
$ git bisect fixed origin/HEAD
Comment by Christian Hesse (eworm) - Thursday, 02 March 2023, 22:03 GMT
Wondering if (one of) these two fix your issue...

chase-symlinks: chase_symlinks_at() AT_FDCWD fixes
Commit: c677e13c35ff2d475199bcc9432892396485b6b8
Author: Daan De Meyer <daan.j.demeyer@gmail.com>
Date: Fri Jan 6 13:49:09 2023 +0100

chase-symlinks: Always open a dirfd to the root directory
Commit: accc26a0e543c9f9e2e5adc1ad6c5a2b9d64670d
Author: Daan De Meyer <daan.j.demeyer@gmail.com>
Date: Sun Jan 8 18:19:11 2023 +0100
Comment by Niels (zee99) - Friday, 03 March 2023, 12:15 GMT
After spending some time doing the reverse bisect (thanks loqs for pointing out the process) starting from 253.r62851.7a67afe331 (working) the guess by Christian was pretty good:

accc26a0e543c9f9e2e5adc1ad6c5a2b9d64670d is the first fixed commit
commit accc26a0e543c9f9e2e5adc1ad6c5a2b9d64670d
Date: Sun Jan 8 18:19:11 2023 +0100

chase-symlinks: Always open a dirfd to the root directory

Instead of special casing "/", let's just always open a dirfd to the
root directory.

253.r62382.c677e13c35 is NOT working and 253.r62383.accc26a0e5 is working.

So the conclusion must be that the change introduced in 252.r60802.5bc244aaa9 is what breaks things and its fixed again with 253.r62383.accc26a0e5 ...
Comment by Christian Hesse (eworm) - Friday, 03 March 2023, 12:23 GMT
Thanks for testing and verification!

The tag of v253.1 is imminent (see https://github.com/systemd/systemd-stable/pull/262) and already includes these commits cherry-picked.
Comment by Niels (zee99) - Friday, 03 March 2023, 12:46 GMT
As soon as 253.1 hits the repo I will spin up a AWS EC2 machine and test it :)

Side note: Again thanks to loqs for "teaching" me the bisect process.

It would have taken ages to do if not for using the AWS EC2 instances.

After every build of a new systemd and a failed "boot" I basically had to "trash" the EC2 instance and start with a fresh one (yes most likely I could have detached the boot volume and attached it to a working instance, nounted and chroot'ed it, then fixed it, unmounted, detached, attached it again to the original instance and booted it - but that would have taken longer and more steps).

This way I basically had a machine (with 4 CPUs and plenty of mem, could have chosen a bigger one to make the compilation even faster) to use where I compiled (and re-compiled after each bisect) and a second machine I could boot from the working AWS EC2 Arch Linux AMIs and installed the newly compiled packages on, reboot it and see if it starts or not. If it starts I could use the same machine for the "next" version again and if it failed I could just start with a fresh AMI for the "next" version.

Turn around between bisects was about 20-30 min most of the time spend on waiting on the systemd compilation.

Loading...