FS#33328 - [iproute2] ip netns doesn't respect /sys in a shared namespace
Attached to Project:
Arch Linux
Opened by A Web (aweb) - Tuesday, 08 January 2013, 06:47 GMT
Last edited by Dave Reisner (falconindy) - Friday, 09 August 2013, 01:59 GMT
Details
Description:
On a pure systemd system (nothing in my /etc/rc.conf), periodically systemd enters a state where I cannot use systemctl, which fails with the error "Failed to get D-Bus Connection." I've found several reports on-line of this, but mostly it turns out to be people who aren't running systemd. This is not the problem in my case. Systemctl works for a while before this happens.

A workaround is to run "kill -15 1", causing systemd to re-exec itself, which then fixes the problem. Note that "kill -USR1 1", which is supposed to make systemd re-connect to D-Bus, does not fix the problem.

Additional info:
* package version(s): systemd 196-2
* config and/or log files etc.

Steps to reproduce:
The problem occurs reliably when using network namespaces. Not only does systemctl not work in a non-default network namespace (which is okay), but even trying to use systemctl as a non-root user in a non-default namespace makes systemctl no longer work for root users in the default namespace (at least until you run kill -15 1). The following transcript illustrates the issue:

[code]
# systemctl > /dev/null
# ip netns add test
# ip netns exec test su -p nobody systemctl > /dev/null
/usr/bin/systemctl: /usr/bin/systemctl: cannot execute binary file
# systemctl > /dev/null
Failed to get D-Bus connection: No connection to service manager.
#
[/code]

Note the last systemctl is being executed as root in the default network namespace, and it fails. (To fix your system after doing this, run [code]kill -15 1[/code].)
Closed by Dave Reisner (falconindy)
Friday, 09 August 2013, 01:59 GMT
Reason for closing: Fixed
Additional comments about closing: iproute2 3.8.0
There's no packaging bug here to fix. You can call 'mount --make-rprivate /' at boot to counteract this, and/or mention it upstream. I suspect you'd be doing the Linux community a better service by talking to iproute2 folks about how to make ip's netns feature suck a whole lot less.
Curious, what's the use case for creating a network namespace that isn't contained within a chroot?
I agree that ip netns is particularly broken in that if you create an /etc/netns/myvpn/resolv.conf file, it bind-mounts it in both namespaces, which makes no sense and must mean the feature has never even been tested. I guess it's also somehow breaking D-Bus, but even when I don't have an /etc/netns/myvpn directory? I thought it didn't do any bind mounts in that case. What is it mounting over to break D-Bus?
Would you mind explaining exactly what is going wrong if you understand it? Because systemd is so monolithic, I don't even know how to go about diagnosing these sorts of problems, and would appreciate a better understanding. (Also I can't be the only one to have this problem, but after searching extensively first, I think this will be the only google result actually discussing the problem, so others might appreciate too.)
In terms of the use case, network namespaces are great for VPN isolation. I can run separate namespaces for the internet and my VPN, and run one browser in each if need be. Then it is much harder for some browser or extension exploit to cross between networks--even if my intranet is full of XSS vulnerabilities. I'm not paranoid enough to need separate file system namespaces.
Note that on Windows, many companies have strict policies that you cannot access the internet at all while on the VPN; it's either/or at any given time. Namespaces are a big improvement on this, as you can access both networks simultaneously (using the same X server), just not in the same process.
ip does some weird (stupid? broken?) things when you call ip netns exec:
# ip netns add t
# strace -e mount,umount,unshare ip netns exec t ls
unshare(CLONE_NEWNS) = 0
umount("/sys", MNT_DETACH) = 0
mount("t", "/sys", "sysfs", 0, NULL) = 0
This is the problem. unshare(CLONE_NEWNS) creates a new *mount* namespace, and ip then unmounts /sys and mounts a fresh sysfs in its place. Because you're in a new mount namespace when that mount is called, it's a new copy of sysfs that's dropped on /sys. Because the real root is shared, the umount and mount from the new namespace propagate BACK INTO the toplevel namespace. This new sysfs does not have /sys/fs/cgroup/systemd mounted inside of it, and so you get the behavior you see -- systemctl fails. Asking systemd to re-exec restores the expected behavior, as it remounts the cgroup hierarchy in sysfs.
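To check whether a given mount is actually marked shared (and so able to propagate changes back as described above), one can read the optional fields in /proc/self/mountinfo. The following is a minimal sketch rather than anything iproute2 or systemd ships: the function names and the sample lines are made up for illustration, and it assumes the field layout documented in proc(5), where the tags "shared:N" and "master:N" appear before the "-" separator.

```python
def parse_mountinfo_line(line):
    """Return (mount_point, propagation) for one /proc/self/mountinfo line.

    Assumes the proc(5) layout: six fixed fields, then zero or more
    optional fields, then a literal "-" separator.
    """
    fields = line.split()
    separator = fields.index("-", 6)      # optional fields end at "-"
    tags = {f.split(":")[0] for f in fields[6:separator]}
    if "shared" in tags and "master" in tags:
        state = "shared+slave"
    elif "shared" in tags:
        state = "shared"
    elif "master" in tags:
        state = "slave"
    elif "unbindable" in tags:
        state = "unbindable"
    else:
        state = "private"                 # no propagation tag at all
    return fields[4], state


def propagation_table(mountinfo_text):
    """Parse every non-empty line of a mountinfo dump."""
    return [parse_mountinfo_line(l)
            for l in mountinfo_text.splitlines() if l.strip()]


# Hypothetical sample resembling a systemd host where "/" is shared.
SAMPLE = """\
20 1 8:1 / / rw,relatime shared:1 - ext4 /dev/sda1 rw
15 20 0:14 / /sys rw,nosuid,nodev,noexec,relatime shared:6 - sysfs sysfs rw
25 15 0:21 / /sys/fs/cgroup/systemd rw,nosuid shared:9 - cgroup cgroup rw,name=systemd
"""

for mount_point, state in propagation_table(SAMPLE):
    print(mount_point, state)
```

On a real system, passing the contents of /proc/self/mountinfo to propagation_table() should show whether /sys is still "shared" or has been made "private" by the workaround.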
In the short term, you can work around this by calling 'mount --make-rprivate /sys'. This means that the new mount namespace can't affect the toplevel namespace and so you're safe from ip's weirdness.
In the long term... I'm not sure. I've never liked that systemd marks the entirety of root shared. It seems to cause a lot of subtle problems like this, and there's already a hack or two in place to work around some of the other problems it causes. For example, the equiv. of "mount --make-rprivate /" is called on shutdown so that pivot_root() doesn't fail. However, I'm not sure they'll buy into the idea that systemd is at fault for this. On some level, I'm not sure I blame them, but side effects (ahem, bugs) like this are still a real annoyance.
Could you at least bring this up with iproute2 upstream so that they're aware of the problem? Maybe they have some ideas about how to make this work more nicely.
@falconindy: I will pursue upstream with iproute2. However, any reason not to mount just /sys MS_PRIVATE|MS_REC? Or to mount / MS_SLAVE|MS_REC?
Ideally, a solution would actually leave everything working properly, which would mean replicating the mount points on /sys in each namespace. It's slightly annoying that systemd chooses to mount all those file systems on top of /sys, when the semantics of sys depend on the namespace in which it was mounted, and hence drive people to unmount and remount it. (I've tried mount --make-private /sys followed by mount -o remount /sys, but only a fresh mount causes different network devices to show up in /sys/class/net with this tagged directory approach.)
Anyway, a big thank you for all this information. The Arch team is incredibly responsive to bug reports, and it is much appreciated.
On your side? If that fulfills your needs, then go for it. systemd won't do this because MS_SHARED is wanted for visibility into containers. With MS_REC|MS_SLAVE, you won't see a container's mounts back in the toplevel namespace. Marking "/" MS_PRIVATE|MS_REC is certainly an option as well if you don't think you'll need to deal with containers on your system (note that this includes systemd-nspawn). It shouldn't break anything, but I'd be interested to know if it does. I mentioned marking only /sys with MS_PRIVATE|MS_REC because it seemed to be the minimum required to work around this issue.
> It's slightly annoying that systemd chooses to mount all those file systems on top of /sys
This is how cgroup controllers are intended to work in the kernel.
I'm thinking a complete solution in iproute2 between the unshare and the mount stuff would be (in horrible pseudo-code):
make /sys MS_REC|MS_PRIVATE;
if (/etc/netns/myvpn/ exists
    && / is not MS_PRIVATE /* requires parsing /proc/self/mountinfo? */) {
    if (/etc is not a mount point)
        bind /etc onto itself to make it a mount point;
    make /etc MS_PRIVATE;
}
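For anyone who wants to try the idea by hand, the pseudo-code above can be transcribed into a small planner that only decides which mount(8) commands would be needed. This is a sketch under assumptions: the function name and its three boolean inputs are hypothetical, it prints commands instead of running them, and it is not a description of what iproute2 actually does.

```python
def plan_netns_mounts(netns_etc_exists, root_is_private, etc_is_mount_point):
    """Return the mount(8) command lines implied by the pseudo-code.

    All three arguments are booleans the caller must work out first
    (for example by inspecting /proc/self/mountinfo); this function
    is pure and never touches the system.
    """
    # Always stop /sys changes from propagating to other namespaces.
    steps = [["mount", "--make-rprivate", "/sys"]]

    if netns_etc_exists and not root_is_private:
        if not etc_is_mount_point:
            # Bind /etc onto itself so it becomes a mount point
            # whose propagation type can be changed.
            steps.append(["mount", "--bind", "/etc", "/etc"])
        steps.append(["mount", "--make-private", "/etc"])
    return steps


# Example: /etc/netns/myvpn exists, "/" is shared, /etc is a plain directory.
for argv in plan_netns_mounts(True, False, False):
    print(" ".join(argv))
```

The commands would have to run after the unshare(CLONE_NEWNS) step and before any bind mounts from /etc/netns/myvpn, which is why the sketch only plans them rather than executing anything.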
http://git.kernel.org/cgit/linux/kernel/git/shemminger/iproute2.git/commit/?id=144e6ce1679a768e987230efb4afa402a5ab58ac