FS#33328 - [iproute2] ip netns doesn't respect /sys in a shared namespace

Attached to Project: Arch Linux
Opened by A Web (aweb) - Tuesday, 08 January 2013, 06:47 GMT
Last edited by Dave Reisner (falconindy) - Friday, 09 August 2013, 01:59 GMT
Task Type Bug Report
Category Packages: Core
Status Closed
Assigned To Ronald van Haren (pressh)
Dave Reisner (falconindy)
Architecture All
Severity Low
Priority Normal
Reported Version
Due in Version Undecided
Due Date Undecided
Percent Complete 100%
Votes 0
Private No

Details

Description:

On a pure systemd system (nothing in my /etc/rc.conf), periodically systemd enters a state where I cannot use systemctl, which fails with the error "Failed to get D-Bus Connection."

I've found several reports on-line of this, but mostly it turns out to be people who aren't running systemd. This is not the problem in my case. Systemctl works for a while before this happens.

A workaround is to run "kill -15 1", causing systemd to re-exec itself, which then fixes the problem. Note that "kill -USR1 1", which is supposed to make systemd re-connect to D-Bus, does not fix the problem.

Additional info:
* package version(s)

systemd 196-2

* config and/or log files etc.


Steps to reproduce:

The problem occurs reliably when using network namespaces. Not only does systemctl not work in a non-default network namespace (which is okay), but even trying to use systemctl as a non-root user in a non-default namespace makes systemctl no longer work for root users in the default namespace (at least until you run kill -15 1). The following transcript illustrates the issue:

[code]
# systemctl > /dev/null
# ip netns add test
# ip netns exec test su -p nobody systemctl > /dev/null
/usr/bin/systemctl: /usr/bin/systemctl: cannot execute binary file
# systemctl > /dev/null
Failed to get D-Bus connection: No connection to service manager.
#
[/code]

Note the last systemctl is being executed as root in the default network namespace, and it fails. (To fix your system after doing this, run [code]kill -15 1[/code].)

Closed by  Dave Reisner (falconindy)
Friday, 09 August 2013, 01:59 GMT
Reason for closing:  Fixed
Additional comments about closing:  iproute2 3.8.0
Comment by Dave Reisner (falconindy) - Tuesday, 08 January 2013, 13:14 GMT
ip seems to do a crap job (aka, none at all) of tearing down the namespace it creates, and therefore it leaks into the main namespace via shared submounts (as systemd effectively calls 'mount --make-rshared /' at boot).

There's no packaging bug here to fix. You can call 'mount --make-rprivate /' at boot to counteract this, and/or mention it upstream. I suspect you'd be doing the Linux community a better service by talking to iproute2 folks about how to make ip's netns feature suck a whole lot less.

Curious, what's the use case for creating a network namespace that isn't contained within a chroot?
Comment by A Web (aweb) - Tuesday, 08 January 2013, 19:27 GMT
There are so many things I wish I could change about iproute2... this isn't anywhere near the top of the list. (E.g., try using a network interface called "h," which breaks command-line parsing, or try figuring out what most of the options do or what the syntax is from the documentation.) At least the ip netns functionality is simple enough that I can write my own simple C wrapper program to replace it.

I agree that ip netns is particularly broken in that if you create an /etc/netns/myvpn/resolv.conf file, it bind-mounts it in both namespaces, which makes no sense and must mean the feature has never even been tested. I guess it's also somehow breaking D-Bus, but even when I don't have an /etc/netns/myvpn directory? I thought it didn't do any bind mounts in that case. What is it mounting over to break D-Bus?

Would you mind explaining exactly what is going wrong if you understand it? Because systemd is so monolithic, I don't even know how to go about diagnosing these sorts of problems, and would appreciate a better understanding. (Also I can't be the only one to have this problem, but after searching extensively first, I think this will be the only google result actually discussing the problem, so others might appreciate too.)

In terms of the use case, network namespaces are great for VPN isolation. I can run separate namespaces for the internet and my VPN, and run one browser in each if need be. Then it is much harder for some browser or extension exploit to cross between networks--even if my intranet is full of XSS vulnerabilities. I'm not paranoid enough to need separate file system namespaces.

Note that on Windows, many companies have strict policies that you cannot access the internet at all while on the VPN; it's either/or at any given time. Namespaces are a big improvement on this, as you can access both networks simultaneously (using the same X server), just not in the same process.
Comment by Dave Reisner (falconindy) - Tuesday, 08 January 2013, 20:26 GMT
Looked into this a little and I think I understand what's going on:

ip does some weird (stupid? broken?) things when you call ip netns exec:

# ip netns add t
# strace -e mount,umount,unshare ip netns exec t ls
unshare(CLONE_NEWNS) = 0
umount("/sys", MNT_DETACH) = 0
mount("t", "/sys", "sysfs", 0, NULL) = 0

This is the problem. unshare(CLONE_NEWNS) creates a new *mount* namespace, and then very effectively remounts /sys. Because you're in a new mount namespace when the remount is called, it's a new copy of sysfs that's dropped on /sys. Because the real root is shared, the umount and mount from the new namespace propagate BACK INTO the toplevel namespace. This new sysfs does not have /sys/fs/cgroup/systemd mounted inside of it, and so you get the behavior you see -- systemctl fails. Asking systemd to re-exec restores normal behavior, as it remounts the cgroup hierarchy under /sys.

In the short term, you can work around this by calling 'mount --make-rprivate /sys'. This means that the new mountspace can't affect the toplevel namespace and so you're safe from ip's weirdness.
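
As a rough illustration of the workaround above, the following sketch reads the propagation type of a mount point from /proc/self/mountinfo and only marks /sys private when it is currently shared. The function names mount_propagation and make_sys_private_if_shared are made up for this example, and the parsing is deliberately simplified: it reports "shared" whenever a shared: tag is present (even if the mount is also a slave), treats everything untagged as "private", and does not handle mount points containing escaped characters such as spaces.

```shell
#!/bin/sh
# Hypothetical helper: print the propagation type of mount point $1.
# $2 is the mountinfo file to read (defaults to /proc/self/mountinfo).
# Returns non-zero if $1 is not listed as a mount point.
mount_propagation() {
    awk -v target="$1" '
        $5 == target {
            # Optional fields start at column 7 and end at the "-" separator.
            prop = "private"
            for (i = 7; i <= NF && $i != "-"; i++) {
                if ($i ~ /^shared:/) { prop = "shared"; break }
                if ($i ~ /^master:/) prop = "slave"
            }
            result = prop   # the last matching line wins (topmost mount)
        }
        END { if (result == "") exit 1; print result }
    ' "${2:-/proc/self/mountinfo}"
}

# Hypothetical helper: apply the workaround only when it is needed.
# Must run as root; intended to be called once at boot.
make_sys_private_if_shared() {
    if [ "$(mount_propagation /sys)" = "shared" ]; then
        mount --make-rprivate /sys
    fi
}
```

Running mount_propagation /sys before and after the workaround should show the state change from shared to private, which is a quick way to confirm that the remount performed by ip can no longer propagate back.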

In the long term... I'm not sure. I've never liked that systemd marks the entirety of root shared. It seems to cause a lot of subtle problems like this, and there's already a hack or two in place to work around some of the other problems it causes. For example, the equiv. of "mount --make-rprivate /" is called on shutdown so that pivot_root() doesn't fail. However, I'm not sure they'll buy into the idea that systemd is at fault for this. On some level, I'm not sure I blame them, but side effects (ahem, bugs) like this are still a real annoyance.

Could you at least bring this up with iproute2 upstream so that they're aware of the problem? Maybe they have some ideas about how to make this work more nicely.
Comment by Dave Reisner (falconindy) - Tuesday, 08 January 2013, 20:35 GMT
If you talk to upstream iproute2, you might suggest that they call mount(NULL, "/", NULL, MS_PRIVATE|MS_REC, NULL) between the unshare and the umount/mount dance instead of making assumptions about the state of the root. In my limited testing, this seems to fix the problem.
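
For anyone who wants to try that ordering without patching ip, here is a minimal shell sketch of a replacement for "ip netns exec". It assumes a util-linux recent enough to provide nsenter and unshare with the --propagation option (newer than the versions current when this task was opened), and it assumes that "ip netns add NAME" leaves the namespace handle at /var/run/netns/NAME. The names netns_path and netns_exec_private are invented for this example. The steps mirror the suggestion above: enter the network namespace, unshare the mount namespace, mark everything private, and only then replace /sys.

```shell
#!/bin/sh
# Directory where ip netns keeps its namespace handles (assumed default).
NETNS_RUN_DIR=${NETNS_RUN_DIR:-/var/run/netns}

# Hypothetical helper: map a namespace name to its handle file.
# Rejects empty names and names containing a slash.
netns_path() {
    case "$1" in
        ''|*/*) echo "invalid namespace name: '$1'" >&2; return 1 ;;
    esac
    printf '%s/%s\n' "$NETNS_RUN_DIR" "$1"
}

# Hypothetical helper: run a command inside network namespace $1 with a
# private mount namespace, so the /sys remount cannot propagate back.
# Usage (as root): netns_exec_private NAME COMMAND [ARGS...]
netns_exec_private() {
    ns_file=$(netns_path "$1") || return 1
    shift
    nsenter --net="$ns_file" \
        unshare --mount --propagation private \
        sh -c 'umount -l /sys && mount -t sysfs sysfs /sys && exec "$@"' sh "$@"
}
```

For example, netns_exec_private t ls /sys/class/net should list only the interfaces of namespace t. Like ip itself, this leaves the new mount namespace without the cgroup mounts under /sys, so it addresses the leak into the toplevel namespace but not the missing mount points inside the new one.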
Comment by Mantas Mikulėnas (grawity) - Tuesday, 08 January 2013, 20:53 GMT
@aweb: The bind-mount problem – "if you create an /etc/netns/myvpn/resolv.conf file, it bind-mounts it in both namespaces" – is caused by the same thing too ("/" being shared and allowing mounts to propagate back).
Comment by A Web (aweb) - Wednesday, 09 January 2013, 07:55 GMT
Okay, I understand the problem iproute2 is creating. I was still confused as to why systemctl couldn't get a D-Bus connection without /sys, especially since strace only showed a couple of lstats of /sys. Tracing through the source, I find systemctl never even attempts to connect to systemd over D-Bus unless /sys/fs/cgroup and /sys/fs/cgroup/systemd are on different file systems! So "Failed to get D-Bus connection: No connection to service manager" is a somewhat misleading error message.
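
The check described above can be approximated from the shell, which makes it easy to tell whether a system is in the broken state without reading strace output. This is only a sketch of the idea (compare the device numbers of the two directories without following symlinks), not a copy of the systemctl source; the function names are made up for this example, and the paths are the ones relevant to the systemd version discussed in this task, so they may not apply to other cgroup layouts.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if both paths exist and sit on
# different devices. stat without -L does not follow symlinks.
on_different_filesystems() {
    dev_a=$(stat -c %d -- "$1" 2>/dev/null) || return 1
    dev_b=$(stat -c %d -- "$2" 2>/dev/null) || return 1
    [ "$dev_a" != "$dev_b" ]
}

# Hypothetical helper: approximate the condition under which systemctl
# will try to talk to systemd at all, as described in the comment above.
systemd_cgroup_visible() {
    on_different_filesystems /sys/fs/cgroup /sys/fs/cgroup/systemd
}
```

If systemd_cgroup_visible fails right after an "ip netns exec" call but succeeds again after "kill -15 1", the system has hit the problem described in this task.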

@falconindy: I will pursue upstream with iproute2. However, any reason not to mount just /sys MS_PRIVATE|MS_REC? Or to mount / MS_SLAVE|MS_REC?

Ideally, a solution would actually leave everything working properly, which would mean replicating the mount points on /sys in each namespace. It's slightly annoying that systemd chooses to mount all those file systems on top of /sys, when the semantics of sys depend on the namespace in which it was mounted, and hence drive people to unmount and remount it. (I've tried mount --make-private /sys followed by mount -o remount /sys, but only a fresh mount causes different network devices to show up in /sys/class/net with this tagged directory approach.)

Anyway, a big thank you for all this information. The Arch team is incredibly responsive to bug reports, and it is much appreciated.
Comment by Dave Reisner (falconindy) - Wednesday, 09 January 2013, 11:09 GMT
> However, any reason not to mount just /sys MS_PRIVATE|MS_REC? Or to mount / MS_SLAVE|MS_REC?
On your side? If that fulfills your needs, then go for it. systemd won't do this because MS_SHARED is wanted for visibility into containers. With MS_REC|MS_SLAVE, you won't see a container's mounts back in the toplevel namespace. Marking "/" MS_PRIVATE|MS_REC is certainly an option as well if you don't think you'll need to deal with containers on your system (note that this includes systemd-nspawn). It shouldn't break anything, but I'd be interested to know if it does. I mentioned marking only /sys with MS_PRIVATE|MS_REC because it seemed to be the minimum required to work around this issue.

> It's slightly annoying that systemd chooses to mount all those file systems on top of /sys
This is how cgroup controllers are intended to work in the kernel.
Comment by A Web (aweb) - Wednesday, 09 January 2013, 19:46 GMT
@falconindy: Sorry, I was ambiguous about the context. Your suggestion to --make-rprivate /sys is fine as a workaround, but then I was wondering about what patch to submit to iproute2. You suggested making / MS_REC|MS_PRIVATE, which would work, but why / and not just /sys?

I'm thinking a complete solution in iproute2 between the unshare and the mount stuff would be (in horrible pseudo-code):

make /sys MS_REC|MS_PRIVATE;
if (/etc/netns/myvpn/ exists
    && / is not MS_PRIVATE /* requires parsing /proc/self/mountinfo? */) {
    if (/etc is not a mount point)
        bind /etc onto itself to make it a mount point;
    make /etc MS_PRIVATE;
}
Comment by Dave Reisner (falconindy) - Wednesday, 09 January 2013, 19:54 GMT
See attached.
Comment by Dave Reisner (falconindy) - Thursday, 10 January 2013, 14:57 GMT
Updated title to be more accurate. @aweb: please link back to any contact you make with upstream
Comment by Ronald van Haren (pressh) - Wednesday, 30 January 2013, 20:14 GMT
Any word from upstream?
Comment by A Web (aweb) - Wednesday, 30 January 2013, 22:22 GMT
Sorry, I have not talked to upstream yet, because I'm experiencing a weird bug where every other time (deterministically) I try to use a namespace it gives me an unmount error. I haven't had time to diagnose this yet, and want to submit a clean patch upstream. I understand this has been open for a long time, so if you want to mark it as closed or wontfix, that's fine. It really is an upstream bug, and not a critical one now that there is a workaround. I'll post updates here regardless as I have to fix the problem eventually.
Comment by A Web (aweb) - Thursday, 08 August 2013, 21:22 GMT
One way or another, this problem appears to be fixed now, hence I am requesting closure.
Comment by Dave Reisner (falconindy) - Friday, 09 August 2013, 01:54 GMT
