FS#66397 - [dhcpcd] 9.0.2-1 breaks boot

Attached to Project: Arch Linux
Opened by roby (hariseldon78) - Friday, 24 April 2020, 22:48 GMT
Last edited by Giancarlo Razzolini (grazzolini) - Thursday, 11 June 2020, 16:45 GMT
Task Type Bug Report
Category Packages: Core
Status Closed
Assigned To Ronald van Haren (pressh)
Giancarlo Razzolini (grazzolini)
Architecture All
Severity Critical
Priority Normal
Reported Version
Due in Version Undecided
Due Date Undecided
Percent Complete 100%
Votes 8
Private No

Details

Description: [dhcpcd] 9.0.2-1 breaks boot

Some days ago I updated 10-20 packages and the boot was broken. These messages were visible in the console:

------------
[FAILED] Failed to start Load/Save RF Kill Switch Status.
[FAILED] Failed to start Load/Save RF Kill Switch Status.
[FAILED] Failed to start Load/Save RF Kill Switch Status.
[FAILED] Failed to start Bluetooth service.
[FAILED] Failed to start Load/Save RF Kill Switch Status.
[FAILED] Failed to start Load/Save RF Kill Switch Status.
[FAILED] Failed to start Load/Save RF Kill Switch Status.
[FAILED] Failed to mount /mnt/nas/backup-linux
[DEPEND] Dependency failed for Remote File Systems.
[FAILED] Failed to mount /mnt/nas/roby
[FAILED] Failed to start Docker Application Container Engine.
[FAILED] Failed to start CUPS Scheduler.
[FAILED] Failed to start Permit User Sessions.
[FAILED] Failed to start GNOME Display Manager.

After these messages the system didn't start the usual graphical login, neither the textual one.

Then I have downgraded all packages, using the archive mirror pointing to 2020/04/19 and it started booting again. Then i tried updating the packages one or two at a time and rebooting each time, and I narrowed it down to dhcpcd being the culprit of this failure.

But this time the behavior was slightly different: at the first reboot everything worked. I didn't install/upgrade any package, then when shutting down i saw some messages about failing unmounting home. The next reboot instead was broken, with the following messages, similar but a little different:

------------
[FAILED] Failed to start Hostname Service.
[FAILED] Failed to mount /mnt/nas/backup-linux
[DEPEND] Dependency failed for Remote File Systems.
[FAILED] Failed to mount /mnt/nas/roby
[FAILED] Failed to start Docker Application Container Engine.
[FAILED] Failed to start CUPS Scheduler.
[FAILED] Failed to start Permit User Sessions.
[FAILED] Failed to start GNOME Display Manager.

I tried another reboot and it failed the same way.

So in attachment you will find:
0: the first reboot after upgrading dhcpcd (not failing)
1: the first failing reboot after that
2: the second failing reboot
3: the normal reboot with the package downgraded (after chroot mounting from arch usb-key)

I wasn't able to connect to the support site of dhcpcd, the link from the github page doesn't work.
I attach 2 files: the booting log with the failure and the booting log without, after downgrading it again.

One thing i must say: in one of my many tries, maybe when downgrading or reinstalling or after chrooting, i saw a message from pacman reporting that the /var/lib/dhcpcd folder was permission 700 instead of 755, which make it unreadable from users. I did change it to 755 that one time, but i see that at reinstall it goes back to 700. That could very much be the root of the problem.

Maybe it's also worth saying that i don't have a wifi card on my computer.

$ pacman -Q linux systemd
linux 5.6.6.arch1-1
systemd 245.5-1

Please let me know if there is more useful information that I can provide, for the moment I'm stuck with no upgrading so the conditions will be preserved for some time.
This task depends upon

Closed by  Giancarlo Razzolini (grazzolini)
Thursday, 11 June 2020, 16:45 GMT
Reason for closing:  Fixed
Additional comments about closing:  Upstream fixed the underlying bug and the iso switched to systemd-networkd
Comment by roby (hariseldon78) - Saturday, 25 April 2020, 00:12 GMT
Just a note: I freezed the package version by putting it in the pacman.conf - IgnorePkg option, and upgraded all the remaining packages. the system boot without problems.

Please, if possible, increase the severity of the bug report because it is blocking. Sorry for not setting it myself on creation.
Comment by Roy Marples (rsmarples) - Monday, 04 May 2020, 20:03 GMT
systemd[1]: Starting dhcpcd on all interfaces...
systemd[1]: Starting dhcpcd on enp3s0...
Apr 25 00:09:21 roby-arch dhcpcd[490]: dhcpcd-9.0.2 starting
Apr 25 00:09:21 roby-arch dhcpcd[489]: dhcpcd-9.0.2 starting

Can Arch please kill the service on interface? It's causing a lot of problems.

You have more than one dhcpcd process running - this is bad.
You have an invalid dhcpcd.conf setting - this error is being reported: dhcpcd[523]: unknown option: roby-arch

Other than that, I don't see any more problems in the logs.
Comment by roby (hariseldon78) - Tuesday, 05 May 2020, 23:12 GMT
Ok, I removed the bad option in the dhcpcd.conf (it's my hostname, i don't know how it ended up there, i replaced it with 'hostname'), but the same problem is happening, so that is not the cause of the failing boot.
I'll try now disabling dhcpcd on the interface.
[...]
Ok, i did disabled the service on the interface, rebooted, upgraded, and rebooted again: the system boots correctly. The only problem i see (with both versions, after removing the service on interface) is that it fails to mount my nfs filesystem at boot; but if i mount it manually when the system has finished booting there is no problem.

In conclusion: my problem is solved, but why is it that nobody else is reporting the same problem? If it comes from having 2 instances started it should be happening to somebody else! Am i missing something?
Comment by andy brandt (smorks) - Wednesday, 06 May 2020, 14:08 GMT
i'm having the same issue, downgrading from 9.0.2 to 8.1.6 fixed the problem, but i can also try changing it from per-interface to one service. attached is my failing boot log too, if that helps.

perhaps the dhcpcd wiki page should be updated, because it currently suggests using the per-interface unit mode?
Comment by Doug Newgard (Scimmia) - Friday, 08 May 2020, 23:48 GMT
This has broken the May ISO for people, too.
Comment by Yernar Akimzhanov (ernarkazakh07) - Friday, 15 May 2020, 23:04 GMT
Having the exact same issue.

Downgraded dhcpcd to 8.1.6
Disabled all systemd-services of dhcpcd, netct, and other network managers.

Intersting things happen on reboot. No errors are shown, but I can`t physically type inside my tty. Like its not allowing me to input any character.

(BTW, I also update my kernel to the latest patch. Still nothing :( )
Comment by Hayden Parker (hkparker) - Sunday, 17 May 2020, 21:04 GMT
I believe I'm experiencing this issue as well. In my case, I'm able to boot but getty can't start and I don't have any ttys available. I believe getty is having issues due to something that's failing in systemd. Some relevant journal logs after boot:

May 09 16:33:19 bastion systemd[1]: user.slice: Failed to create cgroup /user.slice: No such file or directory
May 09 16:33:19 bastion systemd[1]: Failed to realize cgroups for queued unit user.slice, ignoring: No such file or directory
May 09 16:33:19 bastion systemd[537]: user-runtime-dir@1000.service: Failed to attach to cgroup /user.slice/user-1000.slice/user-runtime-dir@1000.service: No such file or directory
May 09 16:33:19 bastion systemd[537]: user-runtime-dir@1000.service: Failed at step CGROUP spawning /usr/lib/systemd/systemd-user-runtime-dir: No such file or directory
-- Subject: Process /usr/lib/systemd/systemd-user-runtime-dir could not be executed

[...]

May 09 16:32:35 bastion dhcpcd[360]: eth0: probing address 192.168.1.16/24
May 09 16:32:35 bastion systemd[424]: serial-getty@hvc0.service: Failed to attach to cgroup /system.slice/system-serial\x2dgetty.slice/serial-getty@hvc0.service: No such file or directory
May 09 16:32:35 bastion systemd[424]: serial-getty@hvc0.service: Failed at step CGROUP spawning /sbin/agetty: No such file or directory
-- Subject: Process /sbin/agetty could not be executed

Downgrading dhcpcd also works for me.
Comment by Hayden Parker (hkparker) - Sunday, 17 May 2020, 21:09 GMT
I just realized that I only noticed this on my host that is dual-homed on a 192.168.1/24 and a 10.0.0.0/24 network. Another VM that isn't dual homed doesn't appear to be affected. I can collect more information from both if it would be helpful.
Comment by Vasili (3point2) - Monday, 18 May 2020, 20:39 GMT
Same issue here, non-booting system after upgrading dhcpcd from 8.1.6 to 9.0.2-1. Downgrading dhcpcd from 9.0.2-1 to 8.1.7-1 resolved the issue.

If indeed this is a dhcpcd issue, can someone with enough knowledge of the boot process and systemd to explain why it has such a severe impact? I would expect something along the lines of no networking, whereas the actual impact was boot failure. The first line output by 'journalctl -p err' from a boot with dhcpcd 9.0.2-1 is

May 18 22:57:30 hal systemd[1]: syncthing@vasili.service: Failed to create cgroup /system.slice/system-syncthing.slice/syncthing@vasili.service: No such file or directory

followed by a slew of similar errors for other units. I'd very much appreciate an understanding of how the dhcpcd issue impacts systemd in this way.
Comment by Jesse Buhagiar (quaker762) - Wednesday, 20 May 2020, 15:23 GMT
Having the exact same issue with 9.0.2. Seems to be the same issue as above. Here's an image (sorry for the quality) I took of the `journalctl` boot log during an `arch-chroot` session from a live USB (https://i.imgur.com/EyAj8uA.jpg). This has been an issue for weeks now, and the second time `pacman -Syyuu` has put my machine into an unbootable state. What's going on?
Comment by Roy Marples (rsmarples) - Thursday, 21 May 2020, 00:02 GMT
It seems that Systemd logs stdout and stderr as well.
Please disable this for dhcpcd because ALL output will go via syslog.

Add this to dhcpcd service files.

[Service]
StandardOutput=null
StandardError=null

https://www.freedesktop.org/software/systemd/man/systemd.exec.html#StandardOutput=
Comment by roby (hariseldon78) - Thursday, 21 May 2020, 00:38 GMT
@quaker762 and others: if you have the same problem as me, like it seems, the workaround suggested by Roy is to `systemctl disable` the dhcpcd services bound to the specific interfaces, and only leave the main one, not bound to interfaces.
That said, i don't think that the point is to solve our personal configuration problems, but to find a solution that can fix the installation for everybody. Is there anyone with enough knowledge about arch packages that could provide an update to the package that automatically removes the problematic services on installation? @rsmarples do you think that could be possible and safe?

If it's not possible maybe the solution is to just put a line on the arch wiki, so everybody updating knows what to do. Not desirable, because the boot is broken, but at least we could announce how to solve it...
Comment by Doug Newgard (Scimmia) - Thursday, 21 May 2020, 00:54 GMT
Packages should not be disabling something. That's not the solution, either, as the ISO is broken for a lot of people and it doesn't have two instances enabled.
Comment by Roy Marples (rsmarples) - Thursday, 21 May 2020, 01:02 GMT
At this time, only systemd users are reporting this issue to me.
So far, it's only Arch users.

Has anyone tested a non privsep enabled build of dhcpcd to see if that breaks the same way?
I'm guessing it wont .... the only difference outside of dhcpcd is that the chroot requires to have directories mounted in a specific way to appease udev .... and so dhcpcd can interogate the interfaces.
Comment by Roy Marples (rsmarples) - Thursday, 21 May 2020, 15:17 GMT
https://roy.marples.name/cgit/dhcpcd.git/commit/?id=6ddd035d2d3258e4a14aa352f331997a1054e278

Instead of the StandOutput/Error redirect I said above, use this patch and add more -q to the command line arguments.
Comment by Tomasz Cebula (tomaszc) - Friday, 22 May 2020, 16:43 GMT
To the developers:
Maybe you should consider changing iso archlinux dhcpcd to dhclient.

I use archlinux in PXE and iso, and solved this problem by:

- creating dhclient.service (in the attachment),
- disabling the dhcpcd service and uninstalling the dhcpcd package,
- dhclient installation and enabling the above service,

After the image remastering, everything works properly, downloading IPs on the found interfaces.
Previously, dhcpcd had problems with devpts which made it impossible to connect via SSH or even run the screen.
https://bbs.archlinux.org/viewtopic.php?id=254999
Comment by Vasili (3point2) - Friday, 22 May 2020, 18:19 GMT
Hi Roy, thanks for taking the time to look at this issue. Is the patch you mentioned above just to produce cleaner logs to aid debugging, or is it supposed to fix the boot failure issue? I'm guessing the former?

I'd be happy to test a non-privsep build for you. Would producing that as simple as removing the --privsepuser=dhcpcd option when running ./configure?

Warm regards and thanks for your work on dhcpcd, it's stood me in good stead for the better part of 20 years now.
Comment by Roy Marples (rsmarples) - Friday, 22 May 2020, 18:34 GMT
The patch allows for cleaner logs.
I did not know that systemd logs syslog AND console output to the same place AND under the pid of the process.
This was confusing, leading to double entries leading me to believe there was a forking error.

So the patch and using -qq in the service will help diagnostics immensely.

The second thing to try is to disable privsep entirely - compile it out:
./configure --disable-privsep

The third thing to try is my latest master branch with privsep enabled again - it's almost ready for release (see below).

The fouth thing to try is the latest master branch with privsep disabled.

That's a few things to test.
Let me know how this works out.

One thing I am currently working on is moving the mounting of /dev, /proc and /sys into it's own namespace, so it's hidden from the rest of the system.
This will require an additional script runner process (so scripts can mount dependant on ssid connected to for example) and linux-3.8 I believe.
It *may* help the issue, but this is only a guess because I don't yet understand why.

For what it's worth, I only have reports of this failing in Arch Linux.
Other systemd based OS's such as Debian and Gentoo are all working fine.

Lastly thanks for the comments about supporting dhcpcd: it means a to me - a lot of my life is in there!
If you feel the need to swap out to the abandoned upstream dhclient, don't let me stop you. What I do is all about choice, you choose what you want and if it works for you then great.
Comment by Vasili (3point2) - Friday, 22 May 2020, 23:02 GMT
My pleasure, that's really good to hear :)

I've now done the following tests:

1) dhcpcd 9.0.2 compiled with --disable-privsep and -qq patch - works fine

2) dhcpcd latest master (5a93267) with privsep - systemd does not boot

3) dhcpcd latest master (5a93267) compiled with --disable-privsep - works fine

I am attaching the output of journalctl -u dhcpcd@wlan0 for each case. They are pretty much identical, apart from the fact that the non-booting case 2 lacks the lines

May 23 01:16:08 hal dhcpcd[1909]: sending signal TERM to pid 719
May 23 01:16:08 hal dhcpcd[1909]: sending signal TERM to pid 719
May 23 01:16:08 hal dhcpcd[1909]: waiting for pid 719 to exit
May 23 01:16:08 hal dhcpcd[1909]: waiting for pid 719 to exit


The output of journalctl -p err for the good cases (1 and 3) only contains a few lines of irrelevant errors:

May 23 01:15:09 hal kernel: thinkpad_acpi: acpi_evalf(BCTG, dd, ...) failed: AE_NOT_FOUND
May 23 01:15:09 hal kernel: ACPI: \_SB_.PCI0.LPC_.EC__.HKEY: BCTG: evaluate failed
May 23 01:15:09 hal kernel: thinkpad_acpi: acpi_evalf(BCTG, dd, ...) failed: AE_NOT_FOUND
May 23 01:15:09 hal kernel: ACPI: \_SB_.PCI0.LPC_.EC__.HKEY: BCTG: evaluate failed
May 23 01:16:09 hal kernel: watchdog: watchdog0: watchdog did not stop!

but for case 2) contains over 100 lines of

May 22 23:48:48 hal systemd[1]: syncthing@vasili.service: Failed to create cgroup /system.slice/system-syncthing.slice/syncthing@vasili.service: No such file or directory
May 22 23:48:48 hal systemd[1]: systemd-user-sessions.service: Failed to create cgroup /system.slice/systemd-user-sessions.service: No such file or directory
May 22 23:48:48 hal systemd[1]: unbound.service: Failed to create cgroup /system.slice/unbound.service: No such file or directory
May 22 23:48:48 hal systemd[773]: unbound.service: Failed to attach to cgroup /system.slice/unbound.service: No such file or directory
May 22 23:48:48 hal systemd[773]: unbound.service: Failed at step CGROUP spawning /usr/bin/unbound: No such file or directory
... and so on for pretty much every unit.

It does seem like the privsep functionality is tripping up systemd somehow, although it's not obvious to me how that's happening.

Please let me know if there's anything else you'd like me to test to shed a bit more light on this.

For the record I'm using systemd 245.5 and kernel 5.6.13-arch1-1
Comment by Vasili (3point2) - Saturday, 23 May 2020, 14:34 GMT
Not sure if this information will be helpful or add to the confusion, but occasionally my system does successfully boot to X with stock dhcpcd 9.0.2 - it seems to be a chance occurrence. When this happens, I'm unable to launch terminals under X - I get a message about no PTYs being available. I'm also unable to launch firefox, I get a error message about shared memory. It looks like both issues are caused by /dev/pty and /dev/shm not being mounted correctly. Closer inspection shows that /dev has been mounted *twice* instead of once as observed on boots with dhcpcd 8.1.6. Furthermore, rebooting hangs for several minutes when this happens. The logs show

May 23 17:23:15 hal dhcpcd[3178]: dhcpcd not running
May 23 17:23:15 hal dhcpcd[3179]: dhcpcd not running
May 23 17:23:15 hal dhcpcd[651]: process 651 unexpectedly terminating on signal 15
May 23 17:23:16 hal dhcpcd[656]: process 656 unexpectedly terminating on signal 15
May 23 17:23:16 hal dhcpcd[650]: arp_change: Connection refused
May 23 17:23:16 hal dhcpcd[650]: arp_close: Transport endpoint is not connected
May 23 17:23:16 hal dhcpcd[650]: script_runreason: Transport endpoint is not connected
May 23 17:23:16 hal dhcpcd[650]: main: control_stop: Transport endpoint is not connected
May 23 17:23:16 hal dhcpcd[650]: ps_dostop: Connection refused
May 23 17:23:16 hal dhcpcd[650]: ps_dostop: Transport endpoint is not connected

just before rebooting, systemd prints error messages that don't make it into the journal about not being able to unmount /var/lib/dhcpcd/dev and /var/lib/dhcpcd/proc.
Comment by Roy Marples (rsmarples) - Sunday, 24 May 2020, 05:57 GMT
Thanks for testing!

I've just pushed a fix to the dhcpcd master branch so that no special mounts are needed anymore.
Please remove the 00-linux hook if it's installed before trying - it no longer exists in my source.

Hopefully this fixes this issue.
Comment by Roy Marples (rsmarples) - Sunday, 24 May 2020, 10:38 GMT
And I've just fixed udev working with privsep as well. So hopefully this works fully now.

BTW, the chroot directory should now be an empty directory and NOT where dhcpcd stores it's files.

Roy
Comment by Vasili (3point2) - Sunday, 24 May 2020, 12:48 GMT
Hi again Roy, I'm happy to report I'm writing this from my system booted with dhcpcd's latest master at ea03eec9 (I did double check that 00-linux was gone). Thanks very much for this fix.

Best wishes,

Vasili
Comment by Roy Marples (rsmarples) - Sunday, 24 May 2020, 15:19 GMT
Awesome news! So glad it's now fixed :)

I've made a new more changes since and am entering the final round of testing before making a new release.
Comment by Vasili (3point2) - Sunday, 24 May 2020, 16:06 GMT
Excellent, thanks again.

Out of curiosity, do we know at this point what the root cause of the boot failure was?
Comment by Roy Marples (rsmarples) - Sunday, 24 May 2020, 19:54 GMT
Sadly, I don't have a sausage of a clue as to the root cause.
As far as dhcpcd is concerned, the only pratical difference I've made is removing the mount --bind of /dev, /sys, /proc and /run/udev into the chroot.

Why that broke systemd / cgroup (which one I don't know) I have no idea.
What's really interesting is that this is the recommneded way of setting up a chroot by Arch:
https://wiki.archlinux.org/index.php/Chroot#Using_chroot

So dhcpcd is not doing anything un-ordinary here.
Luckily enough, thanks to dhcpcd gaining sandbox support for Capsicum (FreeBSD) and Pledge (OpenBSD) there was nearly enough IPC present to make this a trivial change.
And because the IPC I now have is readily extensible, it wasn't too much effort to add the missing pieces.
So the actual fix wasn't too big - you can see the initial changeset here to fix it:
https://roy.marples.name/cgit/dhcpcd.git/commit/?id=709e6b1a38ef87ada75354e968a9b35eca3131aa

Quite a few patches followed to normalise it on other OS's, but as far as Linux goes that's it.

But the most curious thing is that this seems to be limited to Arch Linux, and I'm hoping an Arch dev can explain why using the recommended way of setting up a chroot proved so fragile to break the whole boot.
Comment by Roy Marples (rsmarples) - Sunday, 31 May 2020, 11:59 GMT
dhcpcd-9.1.0 has now been released which solves this issue.
Comment by Giancarlo Razzolini (grazzolini) - Monday, 01 June 2020, 09:03 GMT
It's on [testing] already. But, the iso for June switched from using dhcpcd to a systemd-networkd solution.
Comment by roby (hariseldon78) - Sunday, 07 June 2020, 23:21 GMT
This is just a note for people who have problems for having multiple dhcpcd services started: when i removed the interface bound service i started seeing that my nfs mounts where not mounting correctly at boot.

After some research i found out the cause: the unbound dhcpcd service is started with the option -b, which cause it to not wait for obtaining an address and just forks immediatly. That unlocks the network-online.target too early, so when the nfs mount is tried the network is still not ready. My solution was to edit /usr/lib/systemd/system/dhcpcd.service and replace the -b option with -w.

In my case i only have one wired interface, and i am on a desktop computer (not laptop) so 99% of the times when i boot the home network is online. When it's not i guess it will have to wait the 30 seconds timeout, and that is acceptable for me.

For people using laptops or unreliable wifi i guess the best option is to use the automount feature for nfs mounts, and leave the dhcpcd.service alone with the -b option.
Comment by Giancarlo Razzolini (grazzolini) - Monday, 08 June 2020, 10:17 GMT
@roby,

Does this also happen with the latest dhcpcd? Also, I suggest you use systemctl edit instead of editing files on /usr/lib/systemd, because those will be overwritten on updates.
Comment by roby (hariseldon78) - Monday, 08 June 2020, 21:57 GMT
@grazzolini Yes, the nfs mount error does happen also with the latest version. But i think it's by design, using the -w option by default would probably be bad for laptop users.
Maybe the situation was hidden when having the interface bound service because when launched on a single interface dhcpcd use the -w option by default.

Regarding the systemctl edit: thanks, i didn't know of its existance!
Comment by Giancarlo Razzolini (grazzolini) - Thursday, 11 June 2020, 16:45 GMT
Yes, I don't think we should switch to -w. Anyway, I'm closing this bug report because both the original dhcpcd issue is fixed upstream, as well the ISO switched to systemd-networkd by default.

Loading...