FS#36969 - [linux] 3.13 add CONFIG_USER_NS
Attached to Project:
Arch Linux
Opened by Florian Klink (flokli) - Tuesday, 17 September 2013, 20:55 GMT
Last edited by Eli Schwartz (eschwartz) - Wednesday, 13 December 2017, 15:13 GMT
Details
Description: Add user namespaces to kernel configuration:
Support user namespaces. This allows containers, i.e. vservers, to use user namespaces to provide different user info for different servers. Turning this on is recommended when using lxc (lxc-checkconfig complains about it). It's also needed to be able to run commands inside an lxc container while using virsh:

Latest libvirt has a new command for running stuff inside a container
virsh -c lxc:/// lxc-enter-namespace mycontainername -- /bin/ps -auxf
This requires a fairly new kernel (3.7 or even 3.8 kernel is preferred) since it _needs all 6 namespaces present in /proc/self/ns to work properly_.
(from https://www.redhat.com/archives/libvirt-users/2013-February/msg00058.html)

Seems like this option got missed while closing
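To check the "all 6 namespaces" condition from the quoted libvirt message on a running kernel, something like the following sketch could be used. It assumes the six entries are named ipc, mnt, net, pid, user and uts under /proc/self/ns; the helper names are made up for this example and are not part of lxc or libvirt.

```python
import os

# Assumption: the six namespace entries the quoted message refers to,
# as they are named under /proc/<pid>/ns.
EXPECTED_NAMESPACES = {"ipc", "mnt", "net", "pid", "user", "uts"}


def missing_namespaces(present):
    """Return the expected namespace names that are absent from `present`, sorted."""
    return sorted(EXPECTED_NAMESPACES - set(present))


def list_namespaces(ns_dir="/proc/self/ns"):
    """List the namespace entries of the current process (empty list if unavailable)."""
    try:
        return sorted(os.listdir(ns_dir))
    except OSError:
        return []


if __name__ == "__main__":
    missing = missing_namespaces(list_namespaces())
    print("missing namespaces:", ", ".join(missing) or "none")
```

On a kernel built without CONFIG_USER_NS, the "user" entry would be expected to show up as missing.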
Closed by Eli Schwartz (eschwartz)
Wednesday, 13 December 2017, 15:13 GMT
Reason for closing: Fixed
Additional comments about closing: [core]/linux 4.14.5-1
Allowing non-admin users to create namespaces is one of the goals of the whole "user namespace" work. For instance, Ubuntu plans to be able to deploy unprivileged containers in 14.04 [1], [2].
[1] http://s3hh.wordpress.com/2013/02/12/user-namespaces-lxc-meeting/
[2] https://bugs.launchpad.net/ubuntu/+source/lxc/+bug/1191596
@tpowa:
Well, there are no outstanding security issues with user namespaces that I'm aware of. However, the above commit has already produced at least 2 serious vulnerabilities, so I guess people at Fedora security decided to play it safe (and I agree with them). I suggest delaying enabling user namespaces by default until at least 3.13. Is it possible to rename this bug report to say "3.13" and keep it open?
I never said that it wasn't a neat idea, I think it is awesome. The problem is that it hardly has any testing and does expose a much larger section of the kernel to user torture than was previously available before. This is a very major change to many kernel subsystems and has already enabled new attacks. I'm not saying it's something that shouldn't ever be enabled, it just needs time to bake.
CentOS 7, Debian 7.x, Ubuntu 14.04 all have USER_NS enabled in their kernels
https://github.com/sandstorm-io/sandstorm/issues/162
That said, there is a legitimate security issue as userns opens up a large new attack surface for local privilege-escalation exploits, and there have indeed been a few vulnerabilities discovered over the last few months (e.g. CVE-2014-5206/CVE-2014-5207).
Of course, for many installations -- especially single-user desktops and servers that run only trusted code* -- local privilege escalation may not be a big issue.
Debian and Ubuntu have the kernel.unprivileged_userns_clone sysctl to control access to this feature, which they default off (last I checked), but the Sandstorm installer assists the user in turning it on. Arch could do the same if you want to be cautious.
* Sandstorm is a server that runs untrusted code, but it already uses seccomp to prevent untrusted code from creating user namespaces.
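For reference, the state of the kernel.unprivileged_userns_clone knob mentioned above can be read from /proc/sys. This is only a sketch: the knob exists only on kernels carrying the distribution patch, and the helper name read_sysctl_int is made up for this example.

```python
from pathlib import Path


def read_sysctl_int(name, proc_sys="/proc/sys"):
    """Read an integer sysctl such as 'kernel.unprivileged_userns_clone'.

    Returns None when the knob does not exist, for example on kernels
    that do not carry the patch adding it.
    """
    path = Path(proc_sys, *name.split("."))
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        return None


if __name__ == "__main__":
    value = read_sysctl_int("kernel.unprivileged_userns_clone")
    if value is None:
        print("knob not present on this kernel")
    elif value == 0:
        print("unprivileged user namespace creation is switched off")
    else:
        print("unprivileged user namespace creation is switched on")
```

Turning the knob on would need root, e.g. by writing 1 to the same file or via sysctl(8).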
That's not at all true. A local privilege escalation in the kernel is the key to escaping from the Chromium sandbox, or escalating privileges after exploiting a server / other service running as an unprivileged user.
> Debian and Ubuntu have the kernel.unprivileged_userns_clone sysctl to control access to this feature, which they default off (last I checked), but the Sandstorm installer assists the user in turning it on. Arch could do the same if you want to be cautious.
Arch doesn't add new features via patches. If you want to see this feature enabled, then land something like this upstream. Note that CONFIG_USER_NS is already enabled in the linux-grsec package because it fully removes the ability to have unprivileged user namespaces.
Sure, let's talk about Chrome, because it's actually pretty relevant.
Chrome's sandbox uses seccomp to prohibit exotic kernel features, so enabling unprivileged user namespaces has no immediate effect on Chrome's security.
Meanwhile Chrome currently relies on a setuid binary to set up its sandbox, because unshare() requires privilege -- unless, of course, unprivileged user namespaces are allowed. So presumably Chrome will start using userns at some point so that it can get rid of the setuid binary which is itself a security liability. So in the long run, enabling unprivileged user namespaces is actually a security win for Chrome.
> or escalating privileges after exploiting a server / other service running as an unprivileged user.
We could have a lengthy debate about the practical usefulness of UID separation in the presence of RCEs, but I think it's beside the point here, so I withdraw the statement.
> Arch doesn't add new features via patches. If you want to see this feature enabled, then land something like this upstream.
Sorry, to be clear, I only commented here to provide what I thought might be useful information for this bug. I'm not asking for anything, nor can I volunteer for anything.
> Note that CONFIG_USER_NS is already enabled in the linux-grsec package because it fully removes the ability to have unprivileged user namespaces.
Most things that use user namespaces use them explicitly because they don't require privilege. E.g. Sandstorm (like Chrome) would rather not rely on a setuid binary for sandboxing.
That's not true. It allows calling clone without parameter checks in some of the sandboxed processes. It doesn't allow calling it in the renderer process.
> Meanwhile Chrome currently relies on a setuid binary to set up its sandbox, because unshare() requires privilege -- unless, of course, unprivileged user namespaces are allowed. So presumably Chrome will start using userns at some point so that it can get rid of the setuid binary which is itself a security liability. So in the long run, enabling unprivileged user namespaces is actually a security win for Chrome.
This doesn't make much sense. A small setuid binary is way saner than a completely broken kernel feature with a vulnerability discovered every other week. Please do a quick search for user namespaces in the kernel log. AFAICT, there has never been a disclosed privesc vulnerability for the chrome-sandbox helper. There was a potentially exploitable bug[1] but in the end it didn't appear to offer a way to escalate privileges.
[1] https://code.google.com/p/chromium/issues/detail?id=76542
> Sorry, to be clear, I only commented here to provide what I thought might be useful information for this bug. I'm not asking for anything, nor can I volunteer for anything.
If a way to disable unprivileged user namespaces by default doesn't land upstream, then Arch is not going to enable the feature. The ability to opt-in to this insanity is fine, but I can't see it being enabled by default.
> Most things that use user namespaces use them explicitly because they don't require privilege. E.g. Sandstorm (like Chrome) would rather not rely on a setuid binary for sandboxing.
Exactly, it's not actually a useful feature. The reason people want it is the ability to enter containers without root, but the feature doesn't actually provide that. Enabling it makes every user with access to clone / unshare (without a parameter check) into a superuser. Thanks to the lag in getting new kernel versions into [core] there would usually be a usable user namespace exploit available. In fact, there's one *right now* and it's not even fixed in 3.17.4 in [testing] because no fix has been created.
Ouch, really? That seems like a bug in Chrome. Has someone reported it?
(Sandstorm does not allow CLONE_NEWUSER in clone() calls.)
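The kind of "parameter check" being argued about here boils down to looking at the flags argument of clone() / unshare(). The sketch below shows only that logic in Python; a real seccomp policy is a BPF program installed in the kernel, and neither Chromium's nor Sandstorm's actual rules are reproduced here. The function name is made up, and the kernel's name for the flag is CLONE_NEWUSER.

```python
# Flag value as defined in the kernel's <linux/sched.h>.
CLONE_NEWUSER = 0x10000000
CLONE_NEWNS = 0x00020000  # mount namespace, shown only for contrast
SIGCHLD = 17              # typical exit signal in the low byte of clone flags


def clone_flags_allowed(flags):
    """Hypothetical policy check: reject any call asking for a new user namespace."""
    return (flags & CLONE_NEWUSER) == 0


if __name__ == "__main__":
    for flags in (SIGCHLD, CLONE_NEWNS | SIGCHLD, CLONE_NEWUSER | SIGCHLD):
        verdict = "allow" if clone_flags_allowed(flags) else "deny"
        print(f"{flags:#010x}: {verdict}")
```

A filter that allows clone() without such a check leaves user namespace creation reachable from the sandboxed process whenever the kernel permits it to unprivileged users.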
> a completely broken kernel feature
I take it you have an opinion about this.
> vulnerability discovered every other week
To be fair, on the off weeks, non-userns vulnerabilities are found. Anyone relying on the lack of LPE in Linux, without using seccomp, is not in a great place, sadly.
> In fact, there's one *right now*
There's also at least one unpatched non-userns LPE *right now*, that I know of. Just saying. :)
Anyway, you've made your position clear. I guess this issue should be closed again?
It's not a bug. I'm not aware of a common seccomp sandbox that's locked down as much as it could be. It would be nice if it was done by tracing the code and identifying the minimal set of system calls and flag parameters but it doesn't work that way. For one thing, the necessary system calls / flags vary across platforms so it's not as simple as it seems. The current rules are created by fallible humans who are going to miss many opportunities to lock down specific system calls.
Originally, Chromium was just using seccomp for the renderer process but it is going to be extended to the other processes outside of that most restricted sandbox over time.
> To be fair, on the off weeks, non-userns vulnerabilities are found. Anyone relying on the lack of LPE in Linux, without using seccomp, is not in a great place, sadly.
Sure, and I'm against moving in the wrong direction. New features known to add significant attack surfaces should be opt-in at runtime. The BPF JIT compiler is a nice example of that because it's disabled by default, as it should be.
Arch *could* use the out-of-tree patch to make it opt-in, but I can't see that happening. It goes against the patching policy and I don't think the kernel maintainers are interested in deviating from it for this. Anyway, this is the frustrating side to using software as shipped by upstream.
> The reason people want it is the ability to enter containers without root, but the feature doesn't actually provide that.
Why not? I've written the tool: https://github.com/tailhook/vagga which allows just that, without any single setuid binary.
> Enabling it makes every user with access to clone / unshare (without a parameter check) into a superuser.
Well you can't unshare into super-user. You become pseudo super-user in new namespace, but that super-user can mount and that's pretty much it. You can't setuid to real root, even if you have a setuid binary for it (and that's the reason FUSE doesn't work in namespace)
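How the "pseudo super-user" relates to a real user is visible in /proc/<pid>/uid_map, where each line has the form "inside-ID outside-ID count". A minimal sketch of reading that mapping follows; the function names are made up for this example, and it only describes the intended mapping, not what a kernel bug might allow.

```python
def parse_id_map(text):
    """Parse uid_map / gid_map content into (inside, outside, count) tuples."""
    entries = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 3:
            entries.append(tuple(int(field) for field in fields))
    return entries


def outside_id(entries, inside_id):
    """Translate an ID seen inside the namespace to the ID in the parent namespace.

    Returns None when the ID is not mapped at all.
    """
    for inside, outside, count in entries:
        if inside <= inside_id < inside + count:
            return outside + (inside_id - inside)
    return None


if __name__ == "__main__":
    with open("/proc/self/uid_map") as handle:
        mapping = parse_id_map(handle.read())
    print("uid 0 in this namespace maps to:", outside_id(mapping, 0))
```

For a namespace created by an unprivileged user with UID 1000, a typical map is "0 1000 1": root inside is UID 1000 outside, and no other ID is mapped.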
The statement loses meaning when you quote it out of context... CONFIG_USER_NS=y turns all users into superusers because it doesn't work as intended. The kernel wasn't written with user namespaces in mind so there's an endless stream of privilege escalation issues via user namespaces. There are no doubt going to be a hundred more disclosed over the next few years.
> Well you can't unshare into super-user. You become pseudo super-user in new namespace, but that super-user can mount and that's pretty much it. You can't setuid to real root, even if you have a setuid binary for it (and that's the reason FUSE doesn't work in namespace)
I'm aware of how user namespaces are intended to work. The fact is that they don't actually work that way in practice, as many parts of the kernel weren't written with them in mind.
It might be okay to enable this now?
You yourself said with 6 months you'd be okay turning it on; which is only 2 months away :)
The Apport and Abrt vulnerabilities disclosed on April 14th were only exploitable by unprivileged users due to unprivileged user namespaces. It's going to be some time before userspace is ready for the implications of the feature, let alone the kernel itself.
We're at 2 days right now and the odds that it'll make it 6 months with no vulnerabilities discovered are pretty low... CONFIG_USER_NS is pretty much CONFIG_PRIVILEGE_ESCALATION without patching away the ability for unprivileged users to use it.
As for enabling it *without* that... it has now been 2 days since the last fix:
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=51dfcb076d1e1ce7006aa272cb7c4514740c7e47
http://www.openwall.com/lists/oss-security/2015/05/29/5
It's the consequence of ever-increasing complexity though.
That's the approach taken by the linux-hardened package that I maintain, although it's done via an out-of-tree sysctl controlling whether unprivileged usage is allowed. That means it can still be enabled by default for linux-hardened, just not for unprivileged users, and unprivileged use only requires toggling the sysctl.
The upstream sysctl *fully* disables it which works fine but it's a bit less useful.
The sysctl conf disabling it should not be added to systemd, etc. because it shouldn't be disabled for linux-hardened, only linux, linux-lts and linux-zen. It belongs in the kernel packages themselves, if they aren't going to carry a patch.
There's a way to dynamically turn it off by enabling it but setting user.max_user_namespaces to 0. There isn't a way to set that by default in the kernel configuration so it would need to be set at runtime as mentioned above.
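A sketch of what "set at runtime" could look like: read the current limit from /proc/sys/user/max_user_namespaces and format the line that a sysctl.d drop-in would need. The helper names are made up for this example, and where such a drop-in would ship (if anywhere) is left open.

```python
def read_max_user_namespaces(path="/proc/sys/user/max_user_namespaces"):
    """Return the current limit, or None if the knob is missing or unreadable."""
    try:
        with open(path) as handle:
            return int(handle.read().strip())
    except (OSError, ValueError):
        return None


def sysctl_dropin_line(limit):
    """Format one line in sysctl.d 'key = value' syntax."""
    return f"user.max_user_namespaces = {int(limit)}\n"


if __name__ == "__main__":
    current = read_max_user_namespaces()
    if current is None:
        print("user.max_user_namespaces is not available on this kernel")
    elif current == 0:
        print("creating user namespaces is disabled")
    else:
        print(f"current limit: {current}")
    print("drop-in line to disable at boot:", sysctl_dropin_line(0), end="")
```

Note that a limit of 0 switches the feature off entirely rather than only for unprivileged users.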
Nothing has changed since Linux 4.9.
There was a proposal to support an unprivileged user namespace capability mask to allow unprivileged user namespaces without granting capabilities within it, but just like the toggle for unprivileged user namespaces that seems to be going nowhere. There isn't interest in making this feature less risky upstream.
Re bubblewrap, IMHO, it is more straightforward to use virtual machines, unless you have a really old and weak CPU...
What about giving containers to untrusted users? They could exploit many of the same vulnerabilities, so for someone hosting containers as virtual private servers, attack surface would be probably quite similar, no matter if unprivileged user namespaces are or are not allowed.
Real root in a container is real root on the host so your proposed scenario doesn't make much sense.
The only case where there's a benefit is for OS containers running an entire Linux userspace which is usually only done for development purposes. In that case, the userspace OS in the container needs a quasi-root and user namespaces provide that. However, providing it doesn't imply exposing unprivileged access. On linux-hardened, user namespaces are enabled and can be used to replace root with an isolated quasi-root. The difference is that it doesn't make namespaces, iptables, mount, etc. exposed as unprivileged attack surface.
If you want this feature so badly, you can use linux-hardened which enables CONFIG_USER_NS but disables unprivileged access by default. If you really want to accept the drawbacks of unprivileged access you can enable it. Arch doesn't maintain downstream patches as a general rule and neither linux nor linux-hardened have any applied. The linux-hardened project is a fork of the Linux kernel maintained outside of Arch Linux and isn't actually focused on desktop Linux.
Unless the package maintainers are willing to apply a patch making user.max_user_namespaces 0 by default in the kernel or providing a sysctl to toggle unprivileged access, there isn't an obvious approach to dealing with this.
The problem also isn't likely to change even years from now. It's still going to be a poorly designed feature with much better approaches like the one taken by bubblewrap which exposes much less attack surface. By holding off on enabling this, Arch Linux played a large part in making the alternate approach in bubblewrap into a reality. Otherwise, they could have just assumed unprivileged user namespaces were present rather than making a safer implementation.
Until something like the Chromium sandbox has a hard dependency on this, I think Arch should continue down the current path. An approach like http://www.openwall.com/lists/kernel-hardening/2017/09/29/1 may be accepted upstream in the near future which would provide a safe way to enable this. Waiting until they make the feature safe makes sense. It has so little utility compared to the danger it creates.
That's a nice thought, except for the part where Arch routinely enables everything and the kitchen sink whenever it doesn't negatively impact users.
Asking people to build their own kernels just to use USER_NS with containers is... less than ideal... unless you are proposing that on top of there being no general-purpose benefit, there is also some *negative* impact for the average user.
AFAIK no one has suggested that enabling compiled-in support but disabling it via the upstream sysctl knob is any less secure than simply not enabling compiled-in support. And after years of argumentation on USER_NS, this can finally be done without any downstream patches (which was the original reason this bug report had been closed as WONTIMPLEMENT).
Then don't ask them to do that! Ask them to use community/linux-hardened, which does have CONFIG_USER_NS=y (it also has the unprivileged_userns_clone sysctl knob; default to off).
This has been enabled with linux 4.14.4-2 currently in [testing], along with the same unprivileged_userns_clone sysctl knob patch as used by linux-hardened.
http://www.openwall.com/lists/kernel-hardening/2017/12/05/13
It would make sense to figure out which userns capabilities can be disabled for the Chromium userns sandbox, etc. but it's outside the scope of what I care about now. Android doesn't use user namespaces so it's not relevant to my world.
On the one hand, user namespaces allow gaining capabilities without having them previously, and this increases attack surface...
But most (all?) bugs I know about that are exposed there are not directly related to user namespaces implementation, but existed before them and were just exposed.
And bugs are, well, bugs: something that should have been fixed long ago, but either was not found or was not deemed critical. Would that exposure increase their chance of finally being fixed? I'm not trying to object to blocking unprivileged userns with a stupid argument, just asking.
Also I have one question. When using user namespaces blocked from being created by normal non root users, normal users won't be able to just go and gain arbitrary caps, but root of course would. Someone once mentioned that the usage of userns for system containers is useful mainly for dev/testing. But what about the case of virtual private server hosting?
Virtual private server hosting companies often provide virtual machines, but, at least here in Poland, they also provide openvz containers. lxc system containers would probably be very similar to those openvz containers in the way they operate. So here we have system containers given out for money to untrusted users, and user namespaces would be useful for them. What about their security? those users would have the full set of capabilities, but users would not create them directly using an unprivileged user account of the host.
- hosts: localhost
  pre_tasks:
  - add_host:
      name: testboot.lxd
      lxd_alias: "{{ lookup('env', 'LXD_ALIAS', 'alpine/3.4/amd64' ).strip(',') }}"
  roles:
  - novafloss.boot-lxd
- hosts: testboot
  # test all your stuff here, was it easy enough ? really ? ;)
  tasks:
  - shell: uname -a
I don't understand why I can't use this feature by default, it's like prohibition of marijuana because some users would abuse it and put their own security at risk.
I mean, we support docker, how is lxd any worse than docker in terms of security, seriously ? Not the same issue perhaps, but still as dangerous if not even more.
It's time to free the weed, and free CONFIG_USER_NS.
Yeah, sorry about the analogy. I know it sucks; it was more of a joke than anything else. Would docker perhaps make a better analogy? What about systemd? We've had it since before it was mature, and there have been security issues with it too.
James Pic (is_null): yes, without a custom knob like the one we have right now, if it's activated it's always available for unprivileged users, so the threat is always exposed without a way to opt out. Just compare the number of root privilege escalations; do you really think the two can even be compared? Nope. Last USERNS vuln: ~2 weeks ago.
USER_NS now has compiled-in support, and all it takes is making the configuration option "I want to use USER_NS". I hardly think your freedom is being trampled. OTOH I am sort of beginning to regret reopening this bug if this is the reaction once we *do* give people what they asked for.
If you want unprivileged users to be able to use it you now have that option, and by default privileged users can already make use of it. An alternative would be changing the default value of user.max_user_namespaces to 0 which would only require patching the default value but would unnecessarily block usage by privileged users. Using sysctl doesn't work properly because this needs to be per-kernel since other kernels like linux-hardened may take the privileged approach or a completely different alternative.
There is ongoing work like http://www.openwall.com/lists/kernel-hardening/2017/12/05/13 which may land in some form upstream and will make it possible to make limited, safer usage of the feature by not exposing all of the network administration features like netfilter, etc. to every unprivileged user / sandbox not disabling user namespaces.
There are always going to be vulnerabilities exposed by user namespaces. It greatly increases attack surface and that's not something that can be fixed. You're talking about this as if there's a fixed number of bugs with a count going down over time. That's not how software works at all. The code is actively developed and gaining new bugs. The attack surface added by user namespaces will increase over time as more complexity is added to areas like network administration that it exposes to unprivileged users. The endless stream of user namespace vulnerabilities is not slowing down.
Fixing bugs case by case is a very poor approach to security. Bugs need to be targeted in a systemic way to significantly improve security. For example, Arch Linux uses a bunch of compiler exploit mitigations: SSP, FORTIFY_SOURCE, PIE, RELRO, BIND_NOW. Those features only have value in making memory corruption bugs harder or in some cases even impossible (FORTIFY_SOURCE) to exploit.
SSP (-fstack-protector / -fstack-protector-strong) often reduces performance by 5% (sometimes less but sometimes even substantially more) and it can only stop exploitation of an increasingly small subset of memory corruption bugs: sequential stack corruption. SSP is imperfect and can be bypassed via other security-relevant bugs able to leak the random value. Despite those things, every major distribution enables it by default. If the performance cost was 30%, it probably wouldn't be enabled by anything but a security-oriented distribution, but it's not 30%.
Arch being willing to accept a 5% performance cost across the board but not a barely significant inconvenience for a poorly designed, extremely niche feature would be hard to justify. There are projects like flatpak / bubblewrap providing alternatives to user namespaces that are more secure. Their efforts would be wasted if everyone gave up on having decent local security and exposed it. The reason for the Chromium developers not taking that path is lack of caring about desktop Linux. They wouldn't ever consider exposing user namespaces everywhere on Android or ChromeOS... only very limited exposure for sandbox creation and no exposure to unprivileged code.