FS#66397 - [dhcpcd] 9.0.2-1 breaks boot
Attached to Project:
Arch Linux
Opened by roby (hariseldon78) - Friday, 24 April 2020, 22:48 GMT
Last edited by Giancarlo Razzolini (grazzolini) - Thursday, 11 June 2020, 16:45 GMT
Opened by roby (hariseldon78) - Friday, 24 April 2020, 22:48 GMT
Last edited by Giancarlo Razzolini (grazzolini) - Thursday, 11 June 2020, 16:45 GMT
|
Details
Description: [dhcpcd] 9.0.2-1 breaks boot
Some days ago I updated 10-20 packages and the boot was broken. These messages were visible in the console: ------------ [FAILED] Failed to start Load/Save RF Kill Switch Status. [FAILED] Failed to start Load/Save RF Kill Switch Status. [FAILED] Failed to start Load/Save RF Kill Switch Status. [FAILED] Failed to start Bluetooth service. [FAILED] Failed to start Load/Save RF Kill Switch Status. [FAILED] Failed to start Load/Save RF Kill Switch Status. [FAILED] Failed to start Load/Save RF Kill Switch Status. [FAILED] Failed to mount /mnt/nas/backup-linux [DEPEND] Dependency failed for Remote File Systems. [FAILED] Failed to mount /mnt/nas/roby [FAILED] Failed to start Docker Application Container Engine. [FAILED] Failed to start CUPS Scheduler. [FAILED] Failed to start Permit User Sessions. [FAILED] Failed to start GNOME Display Manager. After these messages the system didn't start the usual graphical login, neither the textual one. Then I have downgraded all packages, using the archive mirror pointing to 2020/04/19 and it started booting again. Then i tried updating the packages one or two at a time and rebooting each time, and I narrowed it down to dhcpcd being the culprit of this failure. But this time the behavior was slightly different: at the first reboot everything worked. I didn't install/upgrade any package, then when shutting down i saw some messages about failing unmounting home. The next reboot instead was broken, with the following messages, similar but a little different: ------------ [FAILED] Failed to start Hostname Service. [FAILED] Failed to mount /mnt/nas/backup-linux [DEPEND] Dependency failed for Remote File Systems. [FAILED] Failed to mount /mnt/nas/roby [FAILED] Failed to start Docker Application Container Engine. [FAILED] Failed to start CUPS Scheduler. [FAILED] Failed to start Permit User Sessions. [FAILED] Failed to start GNOME Display Manager. I tried another reboot and it failed the same way. So in attachment you will find: 0: the first reboot after upgrading dhcpcd (not failing) 1: the first failing reboot after that 2: the second failing reboot 3: the normal reboot with the package downgraded (after chroot mounting from arch usb-key) I wasn't able to connect to the support site of dhcpcd, the link from the github page doesn't work. I attach 2 files: the booting log with the failure and the booting log without, after downgrading it again. One thing i must say: in one of my many tries, maybe when downgrading or reinstalling or after chrooting, i saw a message from pacman reporting that the /var/lib/dhcpcd folder was permission 700 instead of 755, which make it unreadable from users. I did change it to 755 that one time, but i see that at reinstall it goes back to 700. That could very much be the root of the problem. Maybe it's also worth saying that i don't have a wifi card on my computer. $ pacman -Q linux systemd linux 5.6.6.arch1-1 systemd 245.5-1 Please let me know if there is more useful information that I can provide, for the moment I'm stuck with no upgrading so the conditions will be preserved for some time. |
This task depends upon
Closed by Giancarlo Razzolini (grazzolini)
Thursday, 11 June 2020, 16:45 GMT
Reason for closing: Fixed
Additional comments about closing: Upstream fixed the underlying bug and the iso switched to systemd-networkd
Thursday, 11 June 2020, 16:45 GMT
Reason for closing: Fixed
Additional comments about closing: Upstream fixed the underlying bug and the iso switched to systemd-networkd
Please, if possible, increase the severity of the bug report because it is blocking. Sorry for not setting it myself on creation.
systemd[1]: Starting dhcpcd on enp3s0...
Apr 25 00:09:21 roby-arch dhcpcd[490]: dhcpcd-9.0.2 starting
Apr 25 00:09:21 roby-arch dhcpcd[489]: dhcpcd-9.0.2 starting
Can Arch please kill the service on interface? It's causing a lot of problems.
You have more than one dhcpcd process running - this is bad.
You have an invalid dhcpcd.conf setting - this error is being reported: dhcpcd[523]: unknown option: roby-arch
Other than that, I don't see any more problems in the logs.
I'll try now disabling dhcpcd on the interface.
[...]
Ok, i did disabled the service on the interface, rebooted, upgraded, and rebooted again: the system boots correctly. The only problem i see (with both versions, after removing the service on interface) is that it fails to mount my nfs filesystem at boot; but if i mount it manually when the system has finished booting there is no problem.
In conclusion: my problem is solved, but why is it that nobody else is reporting the same problem? If it comes from having 2 instances started it should be happening to somebody else! Am i missing something?
perhaps the dhcpcd wiki page should be updated, because it currently suggests using the per-interface unit mode?
Downgraded dhcpcd to 8.1.6
Disabled all systemd-services of dhcpcd, netct, and other network managers.
Intersting things happen on reboot. No errors are shown, but I can`t physically type inside my tty. Like its not allowing me to input any character.
(BTW, I also update my kernel to the latest patch. Still nothing :( )
May 09 16:33:19 bastion systemd[1]: user.slice: Failed to create cgroup /user.slice: No such file or directory
May 09 16:33:19 bastion systemd[1]: Failed to realize cgroups for queued unit user.slice, ignoring: No such file or directory
May 09 16:33:19 bastion systemd[537]: user-runtime-dir@1000.service: Failed to attach to cgroup /user.slice/user-1000.slice/user-runtime-dir@1000.service: No such file or directory
May 09 16:33:19 bastion systemd[537]: user-runtime-dir@1000.service: Failed at step CGROUP spawning /usr/lib/systemd/systemd-user-runtime-dir: No such file or directory
-- Subject: Process /usr/lib/systemd/systemd-user-runtime-dir could not be executed
[...]
May 09 16:32:35 bastion dhcpcd[360]: eth0: probing address 192.168.1.16/24
May 09 16:32:35 bastion systemd[424]: serial-getty@hvc0.service: Failed to attach to cgroup /system.slice/system-serial\x2dgetty.slice/serial-getty@hvc0.service: No such file or directory
May 09 16:32:35 bastion systemd[424]: serial-getty@hvc0.service: Failed at step CGROUP spawning /sbin/agetty: No such file or directory
-- Subject: Process /sbin/agetty could not be executed
Downgrading dhcpcd also works for me.
If indeed this is a dhcpcd issue, can someone with enough knowledge of the boot process and systemd to explain why it has such a severe impact? I would expect something along the lines of no networking, whereas the actual impact was boot failure. The first line output by 'journalctl -p err' from a boot with dhcpcd 9.0.2-1 is
May 18 22:57:30 hal systemd[1]: syncthing@vasili.service: Failed to create cgroup /system.slice/system-syncthing.slice/syncthing@vasili.service: No such file or directory
followed by a slew of similar errors for other units. I'd very much appreciate an understanding of how the dhcpcd issue impacts systemd in this way.
Please disable this for dhcpcd because ALL output will go via syslog.
Add this to dhcpcd service files.
[Service]
StandardOutput=null
StandardError=null
https://www.freedesktop.org/software/systemd/man/systemd.exec.html#StandardOutput=
That said, i don't think that the point is to solve our personal configuration problems, but to find a solution that can fix the installation for everybody. Is there anyone with enough knowledge about arch packages that could provide an update to the package that automatically removes the problematic services on installation? @rsmarples do you think that could be possible and safe?
If it's not possible maybe the solution is to just put a line on the arch wiki, so everybody updating knows what to do. Not desirable, because the boot is broken, but at least we could announce how to solve it...
So far, it's only Arch users.
Has anyone tested a non privsep enabled build of dhcpcd to see if that breaks the same way?
I'm guessing it wont .... the only difference outside of dhcpcd is that the chroot requires to have directories mounted in a specific way to appease udev .... and so dhcpcd can interogate the interfaces.
Instead of the StandOutput/Error redirect I said above, use this patch and add more -q to the command line arguments.
Maybe you should consider changing iso archlinux dhcpcd to dhclient.
I use archlinux in PXE and iso, and solved this problem by:
- creating dhclient.service (in the attachment),
- disabling the dhcpcd service and uninstalling the dhcpcd package,
- dhclient installation and enabling the above service,
After the image remastering, everything works properly, downloading IPs on the found interfaces.
Previously, dhcpcd had problems with devpts which made it impossible to connect via SSH or even run the screen.
https://bbs.archlinux.org/viewtopic.php?id=254999
I'd be happy to test a non-privsep build for you. Would producing that as simple as removing the --privsepuser=dhcpcd option when running ./configure?
Warm regards and thanks for your work on dhcpcd, it's stood me in good stead for the better part of 20 years now.
I did not know that systemd logs syslog AND console output to the same place AND under the pid of the process.
This was confusing, leading to double entries leading me to believe there was a forking error.
So the patch and using -qq in the service will help diagnostics immensely.
The second thing to try is to disable privsep entirely - compile it out:
./configure --disable-privsep
The third thing to try is my latest master branch with privsep enabled again - it's almost ready for release (see below).
The fouth thing to try is the latest master branch with privsep disabled.
That's a few things to test.
Let me know how this works out.
One thing I am currently working on is moving the mounting of /dev, /proc and /sys into it's own namespace, so it's hidden from the rest of the system.
This will require an additional script runner process (so scripts can mount dependant on ssid connected to for example) and linux-3.8 I believe.
It *may* help the issue, but this is only a guess because I don't yet understand why.
For what it's worth, I only have reports of this failing in Arch Linux.
Other systemd based OS's such as Debian and Gentoo are all working fine.
Lastly thanks for the comments about supporting dhcpcd: it means a to me - a lot of my life is in there!
If you feel the need to swap out to the abandoned upstream dhclient, don't let me stop you. What I do is all about choice, you choose what you want and if it works for you then great.
I've now done the following tests:
1) dhcpcd 9.0.2 compiled with --disable-privsep and -qq patch - works fine
2) dhcpcd latest master (5a93267) with privsep - systemd does not boot
3) dhcpcd latest master (5a93267) compiled with --disable-privsep - works fine
I am attaching the output of journalctl -u dhcpcd@wlan0 for each case. They are pretty much identical, apart from the fact that the non-booting case 2 lacks the lines
May 23 01:16:08 hal dhcpcd[1909]: sending signal TERM to pid 719
May 23 01:16:08 hal dhcpcd[1909]: sending signal TERM to pid 719
May 23 01:16:08 hal dhcpcd[1909]: waiting for pid 719 to exit
May 23 01:16:08 hal dhcpcd[1909]: waiting for pid 719 to exit
The output of journalctl -p err for the good cases (1 and 3) only contains a few lines of irrelevant errors:
May 23 01:15:09 hal kernel: thinkpad_acpi: acpi_evalf(BCTG, dd, ...) failed: AE_NOT_FOUND
May 23 01:15:09 hal kernel: ACPI: \_SB_.PCI0.LPC_.EC__.HKEY: BCTG: evaluate failed
May 23 01:15:09 hal kernel: thinkpad_acpi: acpi_evalf(BCTG, dd, ...) failed: AE_NOT_FOUND
May 23 01:15:09 hal kernel: ACPI: \_SB_.PCI0.LPC_.EC__.HKEY: BCTG: evaluate failed
May 23 01:16:09 hal kernel: watchdog: watchdog0: watchdog did not stop!
but for case 2) contains over 100 lines of
May 22 23:48:48 hal systemd[1]: syncthing@vasili.service: Failed to create cgroup /system.slice/system-syncthing.slice/syncthing@vasili.service: No such file or directory
May 22 23:48:48 hal systemd[1]: systemd-user-sessions.service: Failed to create cgroup /system.slice/systemd-user-sessions.service: No such file or directory
May 22 23:48:48 hal systemd[1]: unbound.service: Failed to create cgroup /system.slice/unbound.service: No such file or directory
May 22 23:48:48 hal systemd[773]: unbound.service: Failed to attach to cgroup /system.slice/unbound.service: No such file or directory
May 22 23:48:48 hal systemd[773]: unbound.service: Failed at step CGROUP spawning /usr/bin/unbound: No such file or directory
... and so on for pretty much every unit.
It does seem like the privsep functionality is tripping up systemd somehow, although it's not obvious to me how that's happening.
Please let me know if there's anything else you'd like me to test to shed a bit more light on this.
For the record I'm using systemd 245.5 and kernel 5.6.13-arch1-1
2_5a93267-privsep.log (2.2 KiB)
3_5a93267-no-privsep.log (2.3 KiB)
May 23 17:23:15 hal dhcpcd[3178]: dhcpcd not running
May 23 17:23:15 hal dhcpcd[3179]: dhcpcd not running
May 23 17:23:15 hal dhcpcd[651]: process 651 unexpectedly terminating on signal 15
May 23 17:23:16 hal dhcpcd[656]: process 656 unexpectedly terminating on signal 15
May 23 17:23:16 hal dhcpcd[650]: arp_change: Connection refused
May 23 17:23:16 hal dhcpcd[650]: arp_close: Transport endpoint is not connected
May 23 17:23:16 hal dhcpcd[650]: script_runreason: Transport endpoint is not connected
May 23 17:23:16 hal dhcpcd[650]: main: control_stop: Transport endpoint is not connected
May 23 17:23:16 hal dhcpcd[650]: ps_dostop: Connection refused
May 23 17:23:16 hal dhcpcd[650]: ps_dostop: Transport endpoint is not connected
just before rebooting, systemd prints error messages that don't make it into the journal about not being able to unmount /var/lib/dhcpcd/dev and /var/lib/dhcpcd/proc.
I've just pushed a fix to the dhcpcd master branch so that no special mounts are needed anymore.
Please remove the 00-linux hook if it's installed before trying - it no longer exists in my source.
Hopefully this fixes this issue.
BTW, the chroot directory should now be an empty directory and NOT where dhcpcd stores it's files.
Roy
Best wishes,
Vasili
I've made a new more changes since and am entering the final round of testing before making a new release.
Out of curiosity, do we know at this point what the root cause of the boot failure was?
As far as dhcpcd is concerned, the only pratical difference I've made is removing the mount --bind of /dev, /sys, /proc and /run/udev into the chroot.
Why that broke systemd / cgroup (which one I don't know) I have no idea.
What's really interesting is that this is the recommneded way of setting up a chroot by Arch:
https://wiki.archlinux.org/index.php/Chroot#Using_chroot
So dhcpcd is not doing anything un-ordinary here.
Luckily enough, thanks to dhcpcd gaining sandbox support for Capsicum (FreeBSD) and Pledge (OpenBSD) there was nearly enough IPC present to make this a trivial change.
And because the IPC I now have is readily extensible, it wasn't too much effort to add the missing pieces.
So the actual fix wasn't too big - you can see the initial changeset here to fix it:
https://roy.marples.name/cgit/dhcpcd.git/commit/?id=709e6b1a38ef87ada75354e968a9b35eca3131aa
Quite a few patches followed to normalise it on other OS's, but as far as Linux goes that's it.
But the most curious thing is that this seems to be limited to Arch Linux, and I'm hoping an Arch dev can explain why using the recommended way of setting up a chroot proved so fragile to break the whole boot.
After some research i found out the cause: the unbound dhcpcd service is started with the option -b, which cause it to not wait for obtaining an address and just forks immediatly. That unlocks the network-online.target too early, so when the nfs mount is tried the network is still not ready. My solution was to edit /usr/lib/systemd/system/dhcpcd.service and replace the -b option with -w.
In my case i only have one wired interface, and i am on a desktop computer (not laptop) so 99% of the times when i boot the home network is online. When it's not i guess it will have to wait the 30 seconds timeout, and that is acceptable for me.
For people using laptops or unreliable wifi i guess the best option is to use the automount feature for nfs mounts, and leave the dhcpcd.service alone with the -b option.
Does this also happen with the latest dhcpcd? Also, I suggest you use systemctl edit instead of editing files on /usr/lib/systemd, because those will be overwritten on updates.
Maybe the situation was hidden when having the interface bound service because when launched on a single interface dhcpcd use the -w option by default.
Regarding the systemctl edit: thanks, i didn't know of its existance!