FS#77571 - [systemd] 253-1 changes behaviour of systemd-networkd-wait-online.service, fail to start services

Attached to Project: Arch Linux
Opened by Joakim Ekblad (aio) - Sunday, 19 February 2023, 13:27 GMT
Last edited by Toolybird (Toolybird) - Thursday, 15 June 2023, 05:51 GMT
Task Type Bug Report
Category Documentation
Status Closed
Assigned To Christian Hesse (eworm)
Architecture All
Severity High
Priority Normal
Reported Version
Due in Version Undecided
Due Date Undecided
Percent Complete 100%
Votes 2
Private No

Details

Description:

If you have services listening to a specific IP-address e.g. "ListenAddress" with sshd or any other service, they will probably fail to start at boot as of systemd-253-1. There is a change of behavior in systemd-networkd-wait-online, where you will need to add either --any or specific --interface to maintain the logical and old way wait-online worked.

Services listening on 0.0.0.0 or "any" will probably start just fine. It's only the ones with a specific listen address that fails, since network-online.target will be reached regardless if the network is up or not due to changes in systemd.

Maybe this is worthy of a "manual intervention" notification or some generic change to the systemd-neworkd-wait-online.service file?

Additional info:
* systemd-253-1
* network configured with systemd-networkd (no netctl, no NetworkManager etc.)
* Info: https://github.com/systemd/systemd/releases (see the part about systemd-networkd-wait-online)

Steps to reproduce:

Upgrade systemd, see your hardened servers fail with all its glory :D
This task depends upon

Closed by  Toolybird (Toolybird)
Thursday, 15 June 2023, 05:51 GMT
Reason for closing:  Upstream
Additional comments about closing:  See comments
Comment by Christian Hesse (eworm) - Sunday, 19 February 2023, 13:37 GMT
Do you configure your network interfaces with systemd-networkd?
Comment by Joakim Ekblad (aio) - Sunday, 19 February 2023, 15:14 GMT
Indeed, Sir.

"network configured with systemd-networkd (no netctl, no NetworkManager etc.)"
Comment by Christian Hesse (eworm) - Sunday, 19 February 2023, 16:05 GMT
Then give the output of `networkctl` please and some insight on your configuration and where services are listening.
Comment by Joakim Ekblad (aio) - Sunday, 19 February 2023, 21:29 GMT
Sorry for the late reply.

I have no un-fixed system currently. I changed my systemd-networkd-wait-online.servcice, on all 20 or so systems, to have the --interface option to the binary, as suggested in the initial report. So i can't really provide much more info without breaking one of the systems.

What I believe happens is: network-online.target becomes available directly (not waiting for any network-devices to come online) and services depending on this obviously try to start as expected when network-online.target is reached.

For example i have sshd listening on a specific IP (NIC) with "ListenAddress" set to my Ethernet cards IP, same with sockd and tinyproxy. They all listen on a specific IP. All of them failed. But services listening on any interface started. Since that address is not resolvable until the nics are up the daemons fail.

Sooner or alter systemd-networkd are done, and the services could be manually restarted successfully. This stuff just affects the dependencies between services during boot.
Comment by Toolybird (Toolybird) - Monday, 20 February 2023, 21:39 GMT
@aio, do you think this change in behavior is expected by upstream? In other words, is this something that needs reporting there? The Arch Wiki is the best place for documentation. Anyone can edit the Wiki so feel free to add something.
Comment by Joakim Ekblad (aio) - Tuesday, 21 February 2023, 18:38 GMT
The change is most def upstream (See link provided). It feels like they decided to make life simpler for systems running NetworkManager or similar. So that network-online.target just falls through for a faster boot or something.

I suppose many people will wonder why services won't start anymore (if listening on specific IP's) though.
Comment by Niels (zee99) - Friday, 24 February 2023, 23:54 GMT
This also breaks booting the Arch Linux EC2 AMIs. It looks like the cloud-init targets and other targets (audit) are never run and this the system fails to complete it configuration and boot up properly hanging after reaching the multi-user.target.

Adding options to the call to "systemd-networkd-wait-online" like -4 or -i with the particular interface seem not to have any effect and/or change.

I can see that on a system running 252-5.1 the following targets are reached (in this order):

Feb 25 06:25:08 ip-172-31-50-96 systemd[1]: Reached target Local Encrypted Volumes.
Feb 25 06:25:08 ip-172-31-50-96 systemd[1]: Reached target Local Integrity Protected Volumes.
Feb 25 06:25:08 ip-172-31-50-96 systemd[1]: Reached target Path Units.
Feb 25 06:25:08 ip-172-31-50-96 systemd[1]: Reached target Remote File Systems.
Feb 25 06:25:08 ip-172-31-50-96 systemd[1]: Reached target Slice Units.
Feb 25 06:25:08 ip-172-31-50-96 systemd[1]: Reached target Swaps.
Feb 25 06:25:08 ip-172-31-50-96 systemd[1]: Reached target Local Verity Protected Volumes.
Feb 25 06:25:08 ip-172-31-50-96 systemd[1]: Reached target Preparation for Local File Systems.
Feb 25 06:25:08 ip-172-31-50-96 systemd[1]: Reached target Local File Systems.
Feb 25 06:25:08 ip-172-31-50-96 systemd[1]: Reached target System Time Set.
Feb 25 06:25:09 ip-172-31-50-96 systemd[1]: Reached target Host and Network Name Lookups.
Feb 25 06:25:09 ip-172-31-50-96 systemd[1]: Reached target System Initialization.
Feb 25 06:25:09 ip-172-31-50-96 systemd[1]: Reached target Timer Units.
Feb 25 06:25:09 ip-172-31-50-96 systemd[1]: Reached target Socket Units.
Feb 25 06:25:09 ip-172-31-50-96 systemd[1]: Reached target Basic System.
Feb 25 06:25:16 ip-172-31-50-96 systemd[1]: Reached target Preparation for Network.
Feb 25 06:25:16 ip-172-31-50-96 systemd[1]: Reached target Network.
Feb 25 06:25:20 ip-172-31-50-96 systemd[1]: Reached target Cloud-config availability.
Feb 25 06:25:20 ip-172-31-50-96 systemd[1]: Reached target Network is Online.
Feb 25 06:25:20 ip-172-31-50-96 systemd[1]: Reached target Login Prompts.
Feb 25 06:25:20 ip-172-31-50-96 systemd[1]: Reached target Multi-User System.
Feb 25 06:25:21 ip-172-31-50-96 systemd[1]: Reached target Cloud-init target.

After upgrading the system to 253 we only get these targets:

Feb 25 06:28:14 ip-172-31-50-96 systemd[1]: Reached target Local Encrypted Volumes.
Feb 25 06:28:14 ip-172-31-50-96 systemd[1]: Reached target Local Integrity Protected Volumes.
Feb 25 06:28:14 ip-172-31-50-96 systemd[1]: Reached target Path Units.
Feb 25 06:28:14 ip-172-31-50-96 systemd[1]: Reached target Slice Units.
Feb 25 06:28:14 ip-172-31-50-96 systemd[1]: Reached target Swaps.
Feb 25 06:28:14 ip-172-31-50-96 systemd[1]: Reached target Local Verity Protected Volumes.
Feb 25 06:28:14 ip-172-31-50-96 systemd[1]: Reached target Preparation for Network.
Feb 25 06:28:14 ip-172-31-50-96 systemd[1]: Reached target Preparation for Local File Systems.
Feb 25 06:28:14 ip-172-31-50-96 systemd[1]: Reached target Local File Systems.
Feb 25 06:28:17 ip-172-31-50-96 systemd[1]: Reached target System Time Set.
Feb 25 06:28:17 ip-172-31-50-96 systemd[1]: Reached target Network.
Feb 25 06:28:17 ip-172-31-50-96 systemd[1]: Reached target Host and Network Name Lookups.
Feb 25 06:28:17 ip-172-31-50-96 systemd[1]: Reached target System Initialization.
Feb 25 06:28:17 ip-172-31-50-96 systemd[1]: Reached target Timer Units.
Feb 25 06:28:17 ip-172-31-50-96 systemd[1]: Reached target Socket Units.
Feb 25 06:28:17 ip-172-31-50-96 systemd[1]: Reached target Basic System.
Feb 25 06:28:17 ip-172-31-50-96 systemd[1]: Reached target Multi-User System.

In essence these targets are never reached after the upgrade:
Remote File Systems.
Cloud-config availability.
Network is Online.
Login Prompts.
Cloud-init target.


Services started on 252.5-1

Feb 25 06:25:08 ip-172-31-50-96 systemd[1]: Starting Create List of Static Device Nodes...
Feb 25 06:25:08 ip-172-31-50-96 systemd[1]: Starting Load Kernel Module configfs...
Feb 25 06:25:08 ip-172-31-50-96 systemd[1]: Starting Load Kernel Module drm...
Feb 25 06:25:08 ip-172-31-50-96 systemd[1]: Starting Load Kernel Module fuse...
Feb 25 06:25:08 ip-172-31-50-96 systemd[1]: Starting Load Kernel Modules...
Feb 25 06:25:08 ip-172-31-50-96 systemd[1]: Starting Generate network units from Kernel command line...
Feb 25 06:25:08 ip-172-31-50-96 systemd[1]: Starting Remount Root and Kernel File Systems...
Feb 25 06:25:08 ip-172-31-50-96 systemd[1]: Starting Coldplug All udev Devices...
Feb 25 06:25:08 ip-172-31-50-96 systemd[1]: Starting Load/Save Random Seed...
Feb 25 06:25:08 ip-172-31-50-96 systemd[1]: Starting Apply Kernel Variables...
Feb 25 06:25:08 ip-172-31-50-96 systemd[1]: Starting Create System Users...
Feb 25 06:25:08 ip-172-31-50-96 systemd[1]: Starting Create Static Device Nodes in /dev...
Feb 25 06:25:08 ip-172-31-50-96 systemd[1]: Starting Rebuild Dynamic Linker Cache...
Feb 25 06:25:08 ip-172-31-50-96 systemd[1]: Starting Journal Service...
Feb 25 06:25:08 ip-172-31-50-96 systemd[1]: Starting Rule-based Manager for Device Events and Files...
Feb 25 06:25:08 ip-172-31-50-96 systemd[1]: Starting Flush Journal to Persistent Storage...
Feb 25 06:25:08 ip-172-31-50-96 systemd[1]: Starting Create Volatile Files and Directories...
Feb 25 06:25:08 ip-172-31-50-96 systemd[1]: Starting Security Auditing Service...
Feb 25 06:25:08 ip-172-31-50-96 systemd[1]: Starting Rebuild Journal Catalog...
Feb 25 06:25:08 ip-172-31-50-96 systemd[1]: Starting Network Name Resolution...
Feb 25 06:25:08 ip-172-31-50-96 systemd[1]: Starting Network Time Synchronization...
Feb 25 06:25:08 ip-172-31-50-96 systemd[1]: Starting Update is Completed...
Feb 25 06:25:09 ip-172-31-50-96 systemd[1]: Starting Record System Boot/Shutdown in UTMP...
Feb 25 06:25:09 ip-172-31-50-96 systemd[1]: Starting Initial cloud-init job (pre-networking)...
Feb 25 06:25:09 ip-172-31-50-96 systemd[1]: Starting D-Bus System Message Bus...
Feb 25 06:25:09 ip-172-31-50-96 systemd[1]: Starting Name Service Cache Daemon...
Feb 25 06:25:09 ip-172-31-50-96 systemd[1]: Starting User Login Management...
Feb 25 06:25:16 ip-172-31-50-96 systemd[1]: Starting Network Configuration...
Feb 25 06:25:17 ip-172-31-50-96 systemd[1]: Starting Wait for Network to be Configured...
Feb 25 06:25:17 ip-172-31-50-96 systemd[1]: Starting Hostname Service...
Feb 25 06:25:19 ip-172-31-50-96 systemd[1]: Starting Initial cloud-init job (metadata service crawler)...
Feb 25 06:25:20 ip-172-31-50-96 systemd[1]: Starting Apply the settings specified in cloud-config...
Feb 25 06:25:20 ip-172-31-50-96 systemd[1]: Starting Permit User Sessions...
Feb 25 06:25:20 ip-172-31-50-96 systemd[1]: Starting Execute cloud user/final scripts...
Feb 25 06:25:44 ip-172-31-50-96 systemd[1]: Starting User Runtime Directory /run/user/1000...
Feb 25 06:25:44 ip-172-31-50-96 systemd[1]: Starting User Manager for UID 1000...

Services Started after upgrade to 253.1 (no changes where made to the systemd settings):

Feb 25 06:28:14 ip-172-31-50-96 systemd[1]: Starting Create List of Static Device Nodes...
Feb 25 06:28:14 ip-172-31-50-96 systemd[1]: Starting Load Kernel Module configfs...
Feb 25 06:28:14 ip-172-31-50-96 systemd[1]: Starting Load Kernel Module dm_mod...
Feb 25 06:28:14 ip-172-31-50-96 systemd[1]: Starting Load Kernel Module fuse...
Feb 25 06:28:14 ip-172-31-50-96 systemd[1]: Starting Load Kernel Module loop...
Feb 25 06:28:14 ip-172-31-50-96 systemd[1]: Starting Load Kernel Modules...
Feb 25 06:28:14 ip-172-31-50-96 systemd[1]: Starting Generate network units from Kernel command line...
Feb 25 06:28:14 ip-172-31-50-96 systemd[1]: Starting Remount Root and Kernel File Systems...
Feb 25 06:28:14 ip-172-31-50-96 systemd[1]: Starting Coldplug All udev Devices...
Feb 25 06:28:14 ip-172-31-50-96 systemd[1]: Starting Load/Save OS Random Seed...
Feb 25 06:28:14 ip-172-31-50-96 systemd[1]: Starting Apply Kernel Variables...
Feb 25 06:28:14 ip-172-31-50-96 systemd[1]: Starting Create System Users...
Feb 25 06:28:14 ip-172-31-50-96 systemd[1]: Starting Create Static Device Nodes in /dev...
Feb 25 06:28:14 ip-172-31-50-96 systemd[1]: Starting Rebuild Dynamic Linker Cache...
Feb 25 06:28:14 ip-172-31-50-96 systemd[1]: Starting Journal Service...
Feb 25 06:28:14 ip-172-31-50-96 systemd[1]: Starting Rule-based Manager for Device Events and Files...
Feb 25 06:28:15 ip-172-31-50-96 systemd[1]: Starting Flush Journal to Persistent Storage...
Feb 25 06:28:15 ip-172-31-50-96 systemd[1]: Starting Network Configuration...
Feb 25 06:28:15 ip-172-31-50-96 systemd[1]: Starting Create Volatile Files and Directories...
Feb 25 06:28:16 ip-172-31-50-96 systemd[1]: Starting Rebuild Journal Catalog...
Feb 25 06:28:16 ip-172-31-50-96 systemd[1]: Starting Network Name Resolution...
Feb 25 06:28:16 ip-172-31-50-96 systemd[1]: Starting Network Time Synchronization...
Feb 25 06:28:16 ip-172-31-50-96 systemd[1]: Starting Record System Boot/Shutdown in UTMP...
Feb 25 06:28:17 ip-172-31-50-96 systemd[1]: Starting Update is Completed...
Feb 25 06:28:17 ip-172-31-50-96 systemd[1]: Starting D-Bus System Message Bus...
Feb 25 06:28:17 ip-172-31-50-96 systemd[1]: Starting Hostname Service...

Reported this with systemd: https://github.com/systemd/systemd/issues/26613
Comment by Christian Hesse (eworm) - Wednesday, 01 March 2023, 10:47 GMT
I have not seen any issues with my own setups, and nobody provided the information I requested (output of `networkctl` and some insight on configuration). So still no idea what's going on here.

And I am not even sure the issue described by zee99 is the same as initially reported here.
Comment by Christian Hesse (eworm) - Wednesday, 01 March 2023, 10:51 GMT
BTW, behavior can be controlled from network files, see systemd.network(5) and search for `RequiredForOnline=`. Perhaps that can help for some setups, especially if services listen on addresses that are configured on a specific interface.
Comment by Niels (zee99) - Wednesday, 01 March 2023, 11:39 GMT
Let me create a separate issue for my problem -> https://bugs.archlinux.org/task/77688
Comment by Christian Hesse (eworm) - Friday, 03 March 2023, 14:40 GMT
Just preparing systemd 253.1-1... Please check if that fixes your issue.
Comment by Niels (zee99) - Friday, 03 March 2023, 16:45 GMT
I can confirm that 253.1-1 fixes the issue I reported ... if it also resolves the initial issue for this report I can not confirm. Thanks for the help!.
Comment by Joakim Ekblad (aio) - Sunday, 05 March 2023, 11:52 GMT
Sadly no improvement on systemd-networkd-wait-online.service on my end with the new version.

I removed my work-around with "--interface ens0" on one of my qemu-guests again (like it was before all this started). Still making sshd only listen on a specific (ens0) IP address. With "ListenAddress". And the problem remains (or the new behavior remains, rather). systemd-networkd-wait-online.service just falls through, not waiting for anything and the boot-up process will after a while try to start sshd.service, and sshd can not bind to the specified interface, since it does not exist yet. Resulting in "Failed to start OpenSSH Daemon", obviously.

If I after a while manually login (using VNC to the QEMU instance in this case, since the machines do not have any physical keyboard/monitor), i can se that the NIC is eventually brought up, and i can ofc manually start/restart sshd.service once they are.

So I will go back to manually editing an override with "--interface ens0" in the systemd-networkd-wait-online.service. One can wonder... Why systemd-project did not provide systemd-networkd-wait-really-online.service since this new "feature" was implemented. :D

Quoting the release notes of systemd:

"systemd-networkd-wait-online exits successfully when all interfaces
are ready or unmanaged. Previously, if neither '--any' nor
'--interface=' options were used, at least one interface had to be in
configured state. This change allows the case where systemd-networkd
is enabled, but no interfaces are configured, to be handled
gracefully. It may occur in particular when a different network
manager is also enabled and used."

My /etc/systemd/network/ens0.network has "Virtualization=true" ... And I have static IP assignment and such. So it should come up quite quickly. Which I believe is correct (i may believe wrong) if that makes any difference in the grans scheme of things.

Re
Comment by Joakim Ekblad (aio) - Sunday, 05 March 2023, 12:02 GMT
Output from networkctl:

$ networkctl
IDX LINK TYPE OPERATIONAL SETUP
1 lo loopback carrier unmanaged
2 ens3 ether routable configured

(on a system that is booted up)
Comment by Siegfried Metz (NiceGuy) - Sunday, 05 March 2023, 15:20 GMT
What if you add to the systemd-networkd-wait-online.service --timeout=0 with or without your parameters of --any / --interface=


systemd-networkd-wait-online.service unit file example of the above:

[Service]
ExecStart=/usr/lib/systemd/systemd-networkd-wait-online --timeout=0


Does this make a difference? As --timeout=0 from the manual page states:
'Fail the service if the network is not online by the time the timeout elapses. A timeout of 0 disables the timeout. Defaults to 120 seconds.'
Comment by Christian Hesse (eworm) - Monday, 06 March 2023, 09:34 GMT
Wondering if this is a timing issue with the interface not yet being available to the system.

My guess: Your system boots and the services start, including `systemd-networkd-wait-online.service` which waits for all (currently none) available network interfaces. Some services fail due to missing address on interface. Some time later the network interface (ens0? or ens3? - both are mentioned above, probably different systems) becomes available and is configured.

Any chance you verify this from log messages?

And what does your device configuration in Qemu look like? Would be interesting that know what virtual hardware is used... Perhaps selecting something like "virtio-net-pci-non-transitional" (which is non-hotplug if I understand correctly) helps?
Comment by Toolybird (Toolybird) - Wednesday, 05 April 2023, 21:59 GMT
What the status here? Any new info to share?

Possibly related [1]? (probably not..)

[1] https://gitlab.archlinux.org/archlinux/arch-boxes/-/issues/158#note_92582
Comment by Joakim Ekblad (aio) - Wednesday, 12 April 2023, 07:14 GMT
Have not had the opportunity to switch back to not explicitly wait for the specific NIC to become "online". This work-around is sufficient for me for now.

@eworm : the same thing happened on non-virtual machines too.

Has no-one else set-up their services to listen on a specific IP (NIC) and experienced the same issue?

I see that there has come quite a lot of systemd releases since I reported this, tho.
Comment by Toolybird (Toolybird) - Friday, 12 May 2023, 23:46 GMT
Ping? Any progress on getting to the bottom of this?
Comment by Toolybird (Toolybird) - Thursday, 15 June 2023, 05:50 GMT
It looks like someone finally reported this upstream [1] and a fix has been merged. We'll get it in a new release in the not too distant future.

[1] https://github.com/systemd/systemd/issues/27822

Loading...