Arch Linux


FS#36969 - [linux] 3.13 add CONFIG_USER_NS

Attached to Project: Arch Linux
Opened by Florian Klink (flokli) - Tuesday, 17 September 2013, 20:55 GMT
Last edited by Eli Schwartz (eschwartz) - Wednesday, 13 December 2017, 15:13 GMT
Task Type Feature Request
Category Packages: Core
Status Closed
Assigned To Tobias Powalowski (tpowa)
Thomas Bächler (brain0)
Architecture All
Severity Low
Priority Normal
Reported Version
Due in Version Undecided
Due Date Undecided
Percent Complete 100%
Votes 66
Private No

Details

Description: Add user namespaces to kernel configuration:


Support user namespaces. This allows containers, i.e. vservers, to use user namespaces to provide different user info for different servers.


It is recommended to turn this on when using LXC (lxc-checkconfig complains about it).
It's also needed to run commands inside an LXC container via virsh:

Latest libvirt has a new command for running stuff inside a container

virsh -c lxc:/// lxc-enter-namespace mycontainername -- /bin/ps -auxf

This requires a fairly new kernel (3.7 or even 3.8 kernel is preferred)
since it _needs all 6 namespaces present in /proc/self/ns to work properly_.
(from https://www.redhat.com/archives/libvirt-users/2013-February/msg00058.html)
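The namespaces mentioned above appear as entries under /proc/self/ns. A minimal sketch, assuming Python 3 on Linux and assuming the six types meant are ipc, mnt, net, pid, user and uts, that reports which are missing. Newer kernels expose extra entries (cgroup, time, ...), which this check ignores. The helper name is illustrative only.

```python
import os

# Assumed set of the six namespace types referred to in the libvirt message.
EXPECTED_NS = {"ipc", "mnt", "net", "pid", "user", "uts"}

def missing_namespaces(entries=None):
    """Return the sorted list of expected namespace types that are absent.

    `entries` defaults to the contents of /proc/self/ns; passing an explicit
    iterable makes the check testable on any machine.
    """
    if entries is None:
        entries = os.listdir("/proc/self/ns")
    # Extra entries such as "cgroup" are simply ignored by the set difference.
    return sorted(EXPECTED_NS - set(entries))
```

On a 3.11 Arch kernel built without CONFIG_USER_NS, calling `missing_namespaces()` would be expected to report `['user']`.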


It seems this option was missed when FS#16715 was closed.

Closed by Eli Schwartz (eschwartz)
Wednesday, 13 December 2017, 15:13 GMT
Reason for closing: Fixed
Additional comments about closing: [core]/linux 4.14.5-1
Comment by Gerardo Exequiel Pozzi (djgera) - Tuesday, 17 September 2013, 23:43 GMT
This cannot be done now with 3.11, because it cannot be enabled if XFS is present. We must wait for 3.12 (see commit d6970d4b726cea6d7a9bc4120814f95c09571fc3 [enable building user namespace with xfs] and the related merge commit 300893b08f3bc7057a7a5f84074090ba66c8b5ca [Merge tag 'xfs-for-linus-v3.12-rc1' of git://oss.sgi.com/xfs/xfs]).
Comment by Leonid Isaev (lisaev) - Tuesday, 01 October 2013, 16:39 GMT
A related Fedora bug: https://bugzilla.redhat.com/show_bug.cgi?id=917708 . Please note that there seem to be some security implications of enabling user namespaces...
Comment by Florian Klink (flokli) - Tuesday, 01 October 2013, 16:49 GMT
What about taking their approach, reverting http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=5eaf563e53294d6696e651466697eb9d491f3946 "userns: Allow unprivileged users to create user namespaces"?
Comment by William Kennington (Webhostbudd) - Sunday, 06 October 2013, 03:55 GMT
I agree with Florian: allowing non-root users to elevate themselves to a local root seems like a huge attack surface. Preferably this would be a sysctl with a huge warning attached to it when it is switched on.
Comment by Tobias Powalowski (tpowa) - Thursday, 10 October 2013, 09:38 GMT
I have not enabled USER_NS in the 3.12 configs; it will stay off until it is safe.
Comment by Leonid Isaev (lisaev) - Monday, 04 November 2013, 22:56 GMT
@Webhostbudd:
Allowing non-admin users to create namespaces is one of the goals of the whole "user namespace" work. For instance, Ubuntu plans to be able to deploy unprivileged containers in 14.04 [1], [2].

[1] http://s3hh.wordpress.com/2013/02/12/user-namespaces-lxc-meeting/
[2] https://bugs.launchpad.net/ubuntu/+source/lxc/+bug/1191596

@tpowa:
Well, there are no outstanding security issues with user namespaces that I'm aware of. However, the above commit has already produced at least 2 serious vulnerabilities, so I guess people at Fedora security decided to play it safe (and I agree with them). I suggest delaying enabling user namespaces by default until at least 3.13. Is it possible to rename this bug report to say "3.13" and keep it open?
Comment by Daniel Micay (thestinger) - Thursday, 16 January 2014, 02:50 GMT
This is a very useful feature even without allowing unprivileged users to use it, so I think Arch should enable it and, like Fedora, revert the patch that removed the need to be a superuser.
Comment by William Kennington (Webhostbudd) - Wednesday, 19 February 2014, 04:42 GMT
@lisaev
I never said that it wasn't a neat idea; I think it is awesome. The problem is that it has hardly had any testing and it exposes a much larger section of the kernel to user torture than was previously available. This is a major change to many kernel subsystems and has already enabled new attacks. I'm not saying it's something that should never be enabled; it just needs time to bake.
Comment by Daniel Micay (thestinger) - Tuesday, 22 April 2014, 20:34 GMT
It doesn't appear that user namespaces are very useful with the new restrictions... as far as I can tell they can not yet be mixed with a chroot in any way.
Comment by Damjan Georgievski (damjan) - Friday, 24 October 2014, 08:50 GMT
Revisit this issue?

CentOS 7, Debian 7.x, Ubuntu 14.04 all have USER_NS enabled in their kernels
Comment by Kenton Varda (kentonv) - Sunday, 23 November 2014, 19:31 GMT
Note that Sandstorm.io does not work on Arch because of this:

https://github.com/sandstorm-io/sandstorm/issues/162

That said, there is a legitimate security issue as userns opens up a large new attack surface for local privilege-escalation exploits, and there have indeed been a few vulnerabilities discovered over the last few months (e.g. CVE-2014-5206/CVE-2014-5207).

Of course, for many installations -- especially single-user desktops and servers that run only trusted code* -- local privilege escalation may not be a big issue.

Debian and Ubuntu have the kernel.unprivileged_userns_clone sysctl to control access to this feature, which they default off (last I checked), but the Sandstorm installer assists the user in turning it on. Arch could do the same if you want to be cautious.
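The knob mentioned above lives at /proc/sys/kernel/unprivileged_userns_clone; `sysctl -w kernel.unprivileged_userns_clone=1` (as root) writes to that file. A small read-only sketch, assuming Python 3 on Linux; the helper names are hypothetical, and the knob only exists on kernels carrying the out-of-tree patch, so its absence is reported as `None` rather than as "enabled" or "disabled".

```python
from pathlib import Path
from typing import Optional

# Out-of-tree knob (Debian/Ubuntu patch); mainline kernels do not have it.
KNOB = Path("sys/kernel/unprivileged_userns_clone")

def knob_enabled(text: str) -> bool:
    """Interpret the knob's contents: "0" blocks unprivileged users."""
    return text.strip() != "0"

def unprivileged_userns_clone(proc_root: str = "/proc") -> Optional[bool]:
    """Return the knob's state, or None when the running kernel lacks it."""
    try:
        return knob_enabled((Path(proc_root) / KNOB).read_text())
    except FileNotFoundError:
        return None
```

Reporting `None` separately matters because on a mainline kernel with CONFIG_USER_NS=y the missing knob means unprivileged use is *not* restricted by this mechanism.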

* Sandstorm is a server that runs untrusted code, but it already uses seccomp to prevent untrusted code from creating user namespaces.
Comment by Daniel Micay (thestinger) - Monday, 24 November 2014, 03:55 GMT
> Of course, for many installations -- especially single-user desktops and servers that run only trusted code* -- local privilege escalation may not be a big issue.

That's not at all true. A local privilege escalation in the kernel is the key to escaping from the Chromium sandbox, or escalating privileges after exploiting a server / other service running as an unprivileged user.

> Debian and Ubuntu have the kernel.unprivileged_userns_clone sysctl to control access to this feature, which they default off (last I checked), but the Sandstorm installer assists the user in turning it on. Arch could do the same if you want to be cautious.

Arch doesn't add new features via patches. If you want to see this feature enabled, then land something like this upstream. Note that CONFIG_USER_NS is already enabled in the linux-grsec package because it fully removes the ability to have unprivileged user namespaces.
Comment by Kenton Varda (kentonv) - Monday, 24 November 2014, 05:37 GMT
> That's not at all true. A local privilege escalation in the kernel is the key to escaping from the Chromium sandbox,

Sure, let's talk about Chrome, because it's actually pretty relevant.

Chrome's sandbox uses seccomp to prohibit exotic kernel features, so enabling unprivileged user namespaces has no immediate effect on Chrome's security.

Meanwhile Chrome currently relies on a setuid binary to set up its sandbox, because unshare() requires privilege -- unless, of course, unprivileged user namespaces are allowed. So presumably Chrome will start using userns at some point so that it can get rid of the setuid binary, which is itself a security liability. So in the long run, enabling unprivileged user namespaces is actually a security win for Chrome.

> or escalating privileges after exploiting a server / other service running as an unprivileged user.

We could have a lengthy debate about the practical usefulness of UID separation in the presence of RCEs, but I think it's beside the point here, so I withdraw the statement.

> Arch doesn't add new features via patches. If you want to see this feature enabled, then land something like this upstream.

Sorry, to be clear, I only commented here to provide what I thought might be useful information for this bug. I'm not asking for anything, nor can I volunteer for anything.

> Note that CONFIG_USER_NS is already enabled in the linux-grsec package because it fully removes the ability to have unprivileged user namespaces.

Most things that use user namespaces use them explicitly because they don't require privilege. E.g. Sandstorm (like Chrome) would rather not rely on a setuid binary for sandboxing.
Comment by Daniel Micay (thestinger) - Monday, 24 November 2014, 06:06 GMT
> Chrome's sandbox uses seccomp to prohibit exotic kernel features, so enabling unprivileged user namespaces has no immediate effect on Chrome's security.

That's not true. It allows calling clone without parameter checks in some of the sandboxed processes. It doesn't allow calling it in the renderer process.

> Meanwhile Chrome currently relies on a setuid binary to set up its sandbox, because unshare() requires privilege -- unless, of course, unprivileged user namespaces are allowed. So presumably Chrome will start using userns at some point so that it can get rid of the setuid binary, which is itself a security liability. So in the long run, enabling unprivileged user namespaces is actually a security win for Chrome.

This doesn't make much sense. A small setuid binary is way saner than a completely broken kernel feature with a vulnerability discovered every other week. Please do a quick search for user namespaces in the kernel log. AFAICT, there has never been a disclosed privesc vulnerability for the chrome-sandbox helper. There was a potentially exploitable bug[1] but in the end it didn't appear to offer a way to escalate privileges.

[1] https://code.google.com/p/chromium/issues/detail?id=76542

> Sorry, to be clear, I only commented here to provide what I thought might be useful information for this bug. I'm not asking for anything, nor can I volunteer for anything.

If a way to disable unprivileged user namespaces by default doesn't land upstream, then Arch is not going to enable the feature. The ability to opt-in to this insanity is fine, but I can't see it being enabled by default.

> Most things that use user namespaces use them explicitly because they don't require privilege. E.g. Sandstorm (like Chrome) would rather not rely on a setuid binary for sandboxing.

Exactly, it's not actually a useful feature. The reason people want it is the ability to enter containers without root, but the feature doesn't actually provide that. Enabling it makes every user with access to clone / unshare (without a parameter check) into a superuser. Thanks to the lag in getting new kernel versions into [core] there would usually be a usable user namespace exploit available. In fact, there's one *right now* and it's not even fixed in 3.17.4 in [testing] because no fix has been created.
Comment by Kenton Varda (kentonv) - Monday, 24 November 2014, 06:23 GMT
> That's not true. It allows calling clone without parameter checks

Ouch, really? That seems like a bug in Chrome. Has someone reported it?

(Sandstorm does not allow CLONE_NEWUSER in clone() calls.)
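The "parameter check" in this exchange means a seccomp rule that inspects the flags argument of clone()/unshare() instead of allowing or denying the call wholesale. The sketch below only models that decision in Python; it is not a real seccomp filter, and the function names are hypothetical. The flag values are the ones from the kernel's <linux/sched.h>.

```python
# Namespace flag values from <linux/sched.h>.
CLONE_NEWNS = 0x00020000
CLONE_NEWUSER = 0x10000000
CLONE_NEWPID = 0x20000000
CLONE_NEWNET = 0x40000000

def allow_clone_unchecked(flags: int) -> bool:
    """A rule that allows the syscall without looking at its arguments."""
    return True

def allow_clone_checked(flags: int) -> bool:
    """Model of a stricter policy: permit clone()/unshare() unless the
    flags request a new user namespace."""
    return not (flags & CLONE_NEWUSER)
```

With the unchecked rule, `CLONE_NEWUSER | CLONE_NEWNET` is allowed, which is the situation described for some Chromium sandboxed processes; the checked rule rejects it while still permitting, for example, `CLONE_NEWNS | CLONE_NEWPID`.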

> a completely broken kernel feature

I take it you have an opinion about this.

> vulnerability discovered every other week

To be fair, on the off weeks, non-userns vulnerabilities are found. Anyone relying on the lack of LPE in Linux, without using seccomp, is not in a great place, sadly.

> In fact, there's one *right now*

There's also at least one unpatched non-userns LPE *right now*, that I know of. Just saying. :)

Anyway, you've made your position clear. I guess this issue should be closed again?
Comment by Daniel Micay (thestinger) - Monday, 24 November 2014, 06:41 GMT
> Ouch, really? That seems like a bug in Chrome. Has someone reported it?

It's not a bug. I'm not aware of a common seccomp sandbox that's locked down as much as it could be. It would be nice if it was done by tracing the code and identifying the minimal set of system calls and flag parameters but it doesn't work that way. For one thing, the necessary system calls / flags vary across platforms so it's not as simple as it seems. The current rules are created by fallible humans who are going to miss many opportunities to lock down specific system calls.

Originally, Chromium was just using seccomp for the renderer process but it is going to be extended to the other processes outside of that most restricted sandbox over time.

> To be fair, on the off weeks, non-userns vulnerabilities are found. Anyone relying on the lack of LPE in Linux, without using seccomp, is not in a great place, sadly.

Sure, and I'm against moving in the wrong direction. New features known to add significant attack surfaces should be opt-in at runtime. The BPF JIT compiler is a nice example of that because it's disabled by default, as it should be.
Comment by Daniel Micay (thestinger) - Monday, 24 November 2014, 06:56 GMT
I would like it if this feature was available because I have use cases for it, but it definitely shouldn't be enabled by default. I don't think upstream is going to admit that the feature is so broken, so I don't see them adding a way to opt-in at runtime. The people working on it are still convinced that they can make it quite robust in the near future, despite all of the evidence to the contrary.

Arch *could* use the out-of-tree patch to make it opt-in, but I can't see that happening. It goes against the patching policy and I don't think the kernel maintainers are interested in deviating from it for this. Anyway, this is the frustrating side to using software as shipped by upstream.
Comment by Daniel Micay (thestinger) - Monday, 24 November 2014, 06:59 GMT
Another good way for upstream to handle this would be adding a capability for entering user namespaces, so sandboxes could be marked with a USERNS capability instead of setuid / CAP_SYS_ADMIN.
Comment by Paul Colomiets (pc) - Monday, 24 November 2014, 09:16 GMT
Well, actually AFAICS it's enabled in Ubuntu by default. And there is no `kernel.unprivileged_userns_clone` flag in 14.04. Is it a Debian flag?

> The reason people want it is the ability to enter containers without root, but the feature doesn't actually provide that.

Why not? I've written a tool, https://github.com/tailhook/vagga, which allows just that, without a single setuid binary.

> Enabling it makes every user with access to clone / unshare (without a parameter check) into a superuser.

Well, you can't unshare into a superuser. You become a pseudo-superuser in the new namespace, but that superuser can mount and that's pretty much it. You can't setuid to real root, even if you have a setuid binary for it (and that's the reason FUSE doesn't work in a namespace).
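Becoming that pseudo-superuser takes three steps: unshare a user namespace, deny setgroups() (required for unprivileged callers since Linux 3.19), then map the caller's IDs to 0 via uid_map/gid_map. A minimal sketch, assuming Python 3.12+ (for `os.unshare`) and a kernel that permits unprivileged user namespaces; the function names are hypothetical.

```python
import os

def map_line(inside: int, outside: int, count: int = 1) -> str:
    """Format one line of /proc/<pid>/uid_map or gid_map."""
    return f"{inside} {outside} {count}\n"

def become_namespace_root() -> None:
    """Enter a new user namespace and map the caller's IDs to 0 inside it.

    Must run in a single-threaded process; unshare(CLONE_NEWUSER) fails
    with EINVAL otherwise.
    """
    uid, gid = os.geteuid(), os.getegid()
    os.unshare(os.CLONE_NEWUSER)
    # An unprivileged process must disable setgroups() before writing gid_map.
    with open("/proc/self/setgroups", "w") as f:
        f.write("deny")
    # Unprivileged callers may map only their own effective ID, one line.
    with open("/proc/self/uid_map", "w") as f:
        f.write(map_line(0, uid))
    with open("/proc/self/gid_map", "w") as f:
        f.write(map_line(0, gid))
```

After `become_namespace_root()`, `os.getuid()` returns 0, but only inside the namespace; the process keeps its original unprivileged UID as seen from the host.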
Comment by Daniel Micay (thestinger) - Monday, 24 November 2014, 09:38 GMT
> Why not? I've written a tool, https://github.com/tailhook/vagga, which allows just that, without a single setuid binary.

The statement loses meaning when you quote it out of context... CONFIG_USER_NS=y turns all users into superusers because it doesn't work as intended. The kernel wasn't written with user namespaces in mind, so there's an endless stream of privilege escalation issues via user namespaces. There are no doubt going to be a hundred more disclosed over the next few years.

> Well, you can't unshare into a superuser. You become a pseudo-superuser in the new namespace, but that superuser can mount and that's pretty much it. You can't setuid to real root, even if you have a setuid binary for it (and that's the reason FUSE doesn't work in a namespace).

I'm aware of how user namespaces are intended to work. The fact is that they don't actually work that way in practice, as many parts of the kernel weren't written with them in mind.
Comment by Daniel Micay (thestinger) - Monday, 24 November 2014, 09:53 GMT
Why not wait until the feature is mature before enabling it? After 6 months with no disclosed USERNS local root exploits, I'll support enabling it. If you really think the feature is ready for the real world then that shouldn't be a problem, but IMO it means it wouldn't be enabled for 5-10 years.
Comment by Daurnimator (daurnimator) - Friday, 17 April 2015, 05:30 GMT
Comment by Daniel Micay (thestinger) - Friday, 17 April 2015, 12:14 GMT
Pointing out that a bunch of vulnerabilities were recently closed is hardly evidence that it's safe to enable. There were *more* vulnerabilities discovered since that commit anyway...
Comment by Daurnimator (daurnimator) - Monday, 20 April 2015, 01:11 GMT
I was pointing out the most recent raft of fixes I know about, which occurred 4 months ago.
You yourself said that after 6 months you'd be okay with turning it on, which is only 2 months away :)
Comment by Daniel Micay (thestinger) - Monday, 20 April 2015, 01:35 GMT
The most recent set of fixes landed yesterday (April 18th) and includes a fix for a vulnerability made public in October 2014. There were issues discovered in between the fixes from Eric Biederman anyway.

The Apport and Abrt vulnerabilities disclosed on April 14th were only exploitable by unprivileged users due to unprivileged user namespaces. It's going to be some time before userspace is ready for the implications of the feature, let alone the kernel itself.

We're at 2 days right now and the odds that it'll make it 6 months with no vulnerabilities discovered are pretty low... CONFIG_USER_NS is pretty much CONFIG_PRIVILEGE_ESCALATION without patching away the ability for unprivileged users to use it.
Comment by Goekcen (gokcen) - Sunday, 26 April 2015, 20:02 GMT
Why is it so difficult to add the patch that makes it a sysctl flag? Right now there is a kernel package in the AUR that just enables the CONFIG_USER_NS flag: https://aur.archlinux.org/packages/linux-user-ns-enabled/ . Using nice virtualization software like vagga (https://vagga.readthedocs.org/en/latest/) is a pain on Arch now.
Comment by Daniel Micay (thestinger) - Tuesday, 12 May 2015, 01:14 GMT
@gokcen: I think that would be fine, but it's at the discretion of the packager. Arch typically doesn't apply downstream patches other than backports.

As for enabling it *without* that... it has now been 2 days since the last fix:

https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=51dfcb076d1e1ce7006aa272cb7c4514740c7e47
Comment by Daniel Micay (thestinger) - Tuesday, 12 May 2015, 01:15 GMT
"fix" as in a serious security vulnerability closed, not just a bug
Comment by Daniel Micay (thestinger) - Friday, 29 May 2015, 16:44 GMT
This one doesn't really count, since it's not a domain-specific bug:

http://www.openwall.com/lists/oss-security/2015/05/29/5

It's the consequence of ever-increasing complexity though.
Comment by Daurnimator (daurnimator) - Tuesday, 16 June 2015, 04:30 GMT
Comment by John C (tancrackers) - Monday, 29 June 2015, 20:32 GMT
Wouldn't enabling this enable Namespace Sandbox in Chromium?
Comment by Julian Brost (julian) - Monday, 07 August 2017, 03:35 GMT
Linux 4.9 added a sysctl user.max_user_namespaces which prevents the creation of user namespaces when set to 0.
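The sysctl appears as /proc/sys/user/max_user_namespaces. A read-only sketch, assuming Python 3 on Linux; the helper names are hypothetical, and a missing file is reported as `None` rather than interpreted.

```python
from pathlib import Path
from typing import Optional

def max_user_namespaces(proc_root: str = "/proc") -> Optional[int]:
    """Read user.max_user_namespaces; None if the file does not exist
    (e.g. on kernels older than 4.9)."""
    try:
        # int() tolerates the trailing newline in the sysctl file.
        return int((Path(proc_root) / "sys/user/max_user_namespaces").read_text())
    except FileNotFoundError:
        return None

def userns_disabled_by_limit(limit: Optional[int]) -> bool:
    """A limit of 0 disables creation of new user namespaces entirely."""
    return limit == 0
```

Because a value of 0 blocks creation for every user rather than only unprivileged ones, this knob is coarser than the out-of-tree unprivileged_userns_clone switch discussed earlier.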
Comment by John Ramsden (johnramsden) - Monday, 07 August 2017, 05:09 GMT
@julian Wouldn't that remove the reason this was not added to Arch? If the functionality that Arch didn't want to maintain on its own has been added to the kernel, it should be possible to enable it in the default kernel and just turn it off by default.
Comment by Julian Brost (julian) - Monday, 07 August 2017, 12:32 GMT
There's now a way to disable it. However, I'm not sure if it's possible to configure it to be disabled by default, so is it also fine to be enabled by default? It would of course be possible to ship a sysctl config to set it but that sounds suboptimal to me as it would also affect other installed kernels.
Comment by James Pic (is_null) - Monday, 07 August 2017, 12:43 GMT
Arch Linux still doesn't support LXD with the default kernel. It's a bit of a pain not to have this for everyone relying on it for testing rather than security.
Comment by Damjan Georgievski (damjan) - Monday, 07 August 2017, 13:23 GMT
Then perhaps Arch can supply a /usr/lib/sysctl.d/10-arch-default.conf sysctl defaults file that disables user_ns by default.
Comment by Daniel Micay (thestinger) - Monday, 07 August 2017, 16:43 GMT
It makes sense to enable it at compile-time with it disabled by default via the sysctl.

That's the approach taken by the linux-hardened package that I maintain, although it's done via an out-of-tree sysctl controlling whether unprivileged usage is allowed. That means it can still be enabled by default for linux-hardened, just not for unprivileged users, and unprivileged use only requires toggling the sysctl.

The upstream sysctl *fully* disables it which works fine but it's a bit less useful.
Comment by Daniel Micay (thestinger) - Monday, 07 August 2017, 16:48 GMT
i.e. you can currently use linux-hardened and do `sysctl kernel.unprivileged_userns_clone=1` to enable it for unprivileged users. I'd expect most people interested in this for sandboxing want a hardened kernel build sacrificing a bit of kernel performance for security anyway, so not having this isn't a huge problem.

The sysctl conf disabling it should not be added to systemd, etc. because it shouldn't be disabled for linux-hardened, only linux, linux-lts and linux-zen. It belongs in the kernel packages themselves, if they aren't going to carry a patch.
Comment by Michał Zegan (webczat) - Sunday, 08 October 2017, 13:08 GMT
What is the current state? I wanted to track CVEs related to this, but I'm not sure how to do it without tracking all of them.
Comment by Daniel Micay (thestinger) - Sunday, 08 October 2017, 15:42 GMT
The current state is that it's still a substantial risk without a way to enable it for only privileged usage. It exposes a bunch of additional kernel attack surface to unprivileged users by letting them enter a user namespace and gain all of the capabilities within that user namespace. They can control network administration within that namespace including iptables, mounts and a lot more. Many additional vulnerabilities are still ending up exposed due to unprivileged user namespaces.
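The capabilities gained inside a user namespace show up in the CapEff line of /proc/self/status. A small sketch, assuming Python 3, that parses that bitmask and checks the two capabilities behind the network-administration and mount access mentioned above: CAP_NET_ADMIN (12) and CAP_SYS_ADMIN (21), numbered as in <linux/capability.h>. The helper names are hypothetical.

```python
CAP_NET_ADMIN = 12  # network administration, e.g. iptables
CAP_SYS_ADMIN = 21  # mount() and many other administrative operations

def effective_caps(status_text: str) -> int:
    """Extract the CapEff bitmask from the text of /proc/<pid>/status."""
    for line in status_text.splitlines():
        if line.startswith("CapEff:"):
            return int(line.split()[1], 16)
    raise ValueError("no CapEff line found")

def has_cap(mask: int, cap: int) -> bool:
    """True if capability number `cap` is set in `mask`."""
    return bool((mask >> cap) & 1)
```

Run inside a fresh user namespace, this would typically show both bits set even for an ordinary user. Those capabilities only govern resources owned by the namespace, but the kernel code paths behind them become reachable, which is the attack-surface argument.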

There's a way to dynamically turn it off: enable it at compile time but set user.max_user_namespaces to 0. There isn't a way to set that by default in the kernel configuration, so it would need to be set at runtime as mentioned above.

Nothing has changed since Linux 4.9.

There was a proposal to support an unprivileged user namespace capability mask to allow unprivileged user namespaces without granting capabilities within it, but just like the toggle for unprivileged user namespaces that seems to be going nowhere. There isn't interest in making this feature less risky upstream.
Comment by Michał Zegan (webczat) - Sunday, 08 October 2017, 16:22 GMT
Well, from what I know, enabling unprivileged user namespaces was actually intended behavior. In theory, inside a user namespace you should only be able to control things governed by the other namespaces created along with it, but bugs in that... well.
Comment by Daniel Micay (thestinger) - Sunday, 08 October 2017, 16:35 GMT
Yes, it's intended that it gives access to a huge amount of additional kernel attack surface that was never expected to be exposed like this. For something that's meant to be a security feature, it certainly doesn't work very well. The costs are way higher than any benefits. It's only truly useful to OS containers, not app containers. It's a hack to get a quasi-root within the container. Everything else is better served by https://github.com/projectatomic/bubblewrap.
Comment by Daniel Micay (thestinger) - Sunday, 08 October 2017, 16:37 GMT
Not really sure what the point is of rehashing this over and over again. The whole discussion is already here.
Comment by Michał Zegan (webczat) - Sunday, 08 October 2017, 17:00 GMT
My intention was to ask about any changes in the current status from people who (probably) track it more successfully than I do. That discussion is mostly a side effect.
Comment by Leonid Isaev (lisaev) - Sunday, 08 October 2017, 20:36 GMT
You need to keep in mind that Arch is a general-purpose distro, and it is unclear what the benefits of user namespaces are for people who don't use containers. If you understand that you need userns, just build your own kernel.

Re bubblewrap, IMHO, it is more straightforward to use virtual machines, unless you have a really old and weak CPU...
Comment by Michał Zegan (webczat) - Sunday, 08 October 2017, 20:40 GMT
Well, most other namespaces are sometimes used by non-container software, see systemd. User namespaces probably are too, although I'm not sure. And all namespaces are currently enabled by default; userns is the only exception. So I'm not sure this argument holds; security concerns make more sense to me.
Comment by Michał Zegan (webczat) - Sunday, 08 October 2017, 20:42 GMT
To point out one significant example, PID namespaces are arguably not useful outside of containers, but they are compiled in by default.
Comment by Lucas Werkmeister (lucaswerkmeister) - Sunday, 08 October 2017, 20:46 GMT
Yes, systemd can also utilize user namespaces, with the PrivateUsers= setting (hides all other users/groups from a process; see systemd.exec(5)). Currently, `systemd-run -qt -p PrivateUsers=yes -p User=daemon echo hi` fails.
Comment by Leonid Isaev (lisaev) - Sunday, 08 October 2017, 20:50 GMT
Yes, I meant security concerns. Enabling all namespaces except userns makes sense because with them -ARCH kernels are usable without sacrificing security. Userns is different because it does not add much usability (at least from my perspective, and I routinely run about 10 containers). And btw, systemd also uses namespaces for containers...
Comment by Michał Zegan (webczat) - Sunday, 08 October 2017, 21:02 GMT
Probably the biggest improvement, at least in theory, is that in a container using userns, root processes don't actually run as real root, so breaking out of such a container would give them only normal user privileges outside. But I'm not sure what the nature of most CVEs relating to user namespaces is, so that theory may be too optimistic.
Comment by Levente Polyak (anthraxx) - Sunday, 08 October 2017, 21:09 GMT
Most of them allow code execution in ring 0, which actually leads to escaping the container/sandbox as root, i.e. privilege escalation. This was explained quite verbosely by Daniel at the beginning/middle of this thread, so can we please avoid going round and round yet again? Thanks.
Comment by Michał Zegan (webczat) - Sunday, 08 October 2017, 23:06 GMT
Just asking for opinion/info, not trying to argue.
What about giving containers to untrusted users? They could exploit many of the same vulnerabilities, so for someone hosting containers as virtual private servers, attack surface would be probably quite similar, no matter if unprivileged user namespaces are or are not allowed.
Comment by Daniel Micay (thestinger) - Sunday, 08 October 2017, 23:52 GMT
Normal application / server containers don't benefit from user namespaces since they don't need root and are in fact negatively impacted by the attack surface. Exposing unprivileged user namespaces reduces application container security along with the security of the system as a whole.

Real root in a container is real root on the host so your proposed scenario doesn't make much sense.

The only case where there's a benefit is for OS containers running an entire Linux userspace which is usually only done for development purposes. In that case, the userspace OS in the container needs a quasi-root and user namespaces provide that. However, providing it doesn't imply exposing unprivileged access. On linux-hardened, user namespaces are enabled and can be used to replace root with an isolated quasi-root. The difference is that it doesn't make namespaces, iptables, mount, etc. exposed as unprivileged attack surface.

If you want this feature so badly, you can use linux-hardened, which enables CONFIG_USER_NS but disables unprivileged access by default. If you really want to accept the drawbacks of unprivileged access, you can enable it. Arch doesn't maintain downstream patches as a general rule, and neither linux nor linux-hardened has any applied. The linux-hardened project is a fork of the Linux kernel maintained outside of Arch Linux and isn't actually focused on desktop Linux.
Comment by Michał Zegan (webczat) - Monday, 09 October 2017, 00:02 GMT
Many hosting companies I know of here used OpenVZ instead of virtual machines to run virtual private servers for users in OS containers. So someone could use something like LXC for the same purpose now.
Comment by Daniel Micay (thestinger) - Monday, 09 October 2017, 00:04 GMT
Enabling this for the linux package would be fine if it was disabled by default via the user.max_user_namespaces sysctl but there isn't a clean way to have a kernel-specific sysctl configuration. The linux-hardened package takes a different approach where it makes sense for it to be enabled by default since unprivileged users can't use it in the default configuration.

Unless the package maintainers are willing to apply a patch making user.max_user_namespaces 0 by default in the kernel or providing a sysctl to toggle unprivileged access, there isn't an obvious approach to dealing with this.

The problem also isn't likely to change even years from now. It's still going to be a poorly designed feature with much better approaches like the one taken by bubblewrap which exposes much less attack surface. By holding off on enabling this, Arch Linux played a large part in making the alternate approach in bubblewrap into a reality. Otherwise, they could have just assumed unprivileged user namespaces were present rather than making a safer implementation.

Until something like the Chromium sandbox has a hard dependency on this, I think Arch should continue down the current path. An approach like http://www.openwall.com/lists/kernel-hardening/2017/09/29/1 may be accepted upstream in the near future which would provide a safe way to enable this. Waiting until they make the feature safe makes sense. It has so little utility compared to the danger it creates.
Comment by Eli Schwartz (eschwartz) - Monday, 09 October 2017, 21:22 GMT
> You need to keep in mind that Arch is a general-purpose distro, and it is unclear what the benefits of user namespaces are for people who don't use containers. If you understand that you need userns, just build your own kernel.

That's a nice thought, except for the part where Arch routinely enables everything and the kitchen sink whenever it doesn't negatively impact users.

Asking people to build their own kernels just to use USER_NS with containers is... less than ideal... unless you are proposing that on top of there being no general-purpose benefit, there is also some *negative* impact for the average user.

AFAIK no one has suggested that enabling compiled-in support but disabling it via the upstream sysctl knob, is any less secure than simply not enabling compiled-in support. And after years of argumentation on USER_NS, this can finally be done without any downstream patches (which was the original reason this bugreport had been closed as WONTIMPLEMENT).
Comment by Daniel Micay (thestinger) - Tuesday, 10 October 2017, 03:44 GMT
We don't have a way to ship kernel-specific sysctl configuration so it would need to be done with a tiny patch changing the default value of allowed user namespaces to 0 since there isn't an existing configuration option.
Comment by Luke Shumaker (lukeshu) - Thursday, 12 October 2017, 01:52 GMT
> Asking people to build their own kernels just to use USER_NS with containers is... less than ideal...

Then don't ask them to do that! Ask them to use community/linux-hardened, which does have CONFIG_USER_NS=y (it also has the unprivileged_userns_clone sysctl knob, defaulting to off).
Comment by Eli Schwartz (eschwartz) - Friday, 08 December 2017, 00:55 GMT
https://git.archlinux.org/svntogit/packages.git/commit/?h=packages/linux&id=e42e6ffc6243370215eb33690b3c68f96f181cdb

This has been enabled with linux 4.14.4-2 currently in [testing], along with the same unprivileged_userns_clone sysctl knob patch as used by linux-hardened.
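A minimal sketch, assuming a Linux system with glibc, of how one might check from userspace whether unprivileged user namespace creation is currently allowed. It reads an integer sysctl (for example /proc/sys/kernel/unprivileged_userns_clone, which only exists on kernels carrying the Debian/Arch patch, or the upstream /proc/sys/user/max_user_namespaces) and then simply tries unshare(CLONE_NEWUSER) in a forked child so the caller's own namespaces are not changed. The helper names read_sysctl_int and can_create_userns are hypothetical, not part of any package.

```c
#define _GNU_SOURCE          /* needed for unshare() and CLONE_NEWUSER */
#include <sched.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: read one integer from a sysctl file under /proc/sys.
 * Returns 0 on success, -1 if the file is missing (e.g. the
 * unprivileged_userns_clone knob on an unpatched kernel) or unparsable. */
int read_sysctl_int(const char *path, int *out)
{
    FILE *f = fopen(path, "r");
    if (f == NULL)
        return -1;
    int ok = (fscanf(f, "%d", out) == 1);
    fclose(f);
    return ok ? 0 : -1;
}

/* Hypothetical helper: attempt to create a new user namespace in a child
 * process, leaving the caller untouched.
 * Returns 1 if unshare(CLONE_NEWUSER) succeeded, 0 if it failed
 * (knob set to 0, max_user_namespaces exhausted, seccomp filter, ...),
 * and -1 if fork()/waitpid() itself failed. */
int can_create_userns(void)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        /* Child: report the result through the exit status only. */
        _exit(unshare(CLONE_NEWUSER) == 0 ? 0 : 1);
    }
    int status;
    if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status) == 0;
}
```

Run it as a normal user to see the effect of the knob; as noted later in this thread, privileged users are not blocked by it, so a result of 1 when running as root says nothing about unprivileged access.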
Comment by Daniel Micay (thestinger) - Friday, 08 December 2017, 00:58 GMT
There's a proposed new approach on kernel-hardening too, but I don't really expect it to land either, and it remains to be seen how much of the added attack surface it can disable while using user namespaces for something useful:

http://www.openwall.com/lists/kernel-hardening/2017/12/05/13

It would make sense to figure out which userns capabilities can be disabled for the Chromium userns sandbox, etc., but it's outside the scope of what I care about now. Android doesn't use user namespaces, so it's not relevant to my world.
Comment by Michał Zegan (webczat) - Friday, 08 December 2017, 13:41 GMT
hey, honestly:
on the one hand, user namespaces allow gaining capabilities without having them previously, and this increases attack surface...
But most (all?) bugs I know about that are exposed there are not directly related to the user namespace implementation; they existed before it and were just exposed.
And, bugs are, well, bugs. Something that should have been fixed long ago, but either was not found or not deemed critical. Would that exposure increase their chance of being fixed, finally? Not trying to argue against blocking unprivileged userns with a stupid argument, just asking.
Also I have one question. When user namespaces are blocked from being created by normal non-root users, normal users won't be able to just go and gain arbitrary caps, but root of course could. Someone once mentioned that using userns for system containers is useful mainly for dev/testing. But what about the case of virtual private server hosting?
Virtual private server hosting companies often provide virtual machines, but, at least here in Poland, they also provide OpenVZ containers. LXC system containers would probably be very similar to those OpenVZ containers in the way they operate. So here we have system containers given out for money to untrusted users, and user namespaces would be useful for them. What about their security? Those users would have the full set of capabilities inside the container, but they would not create the namespaces directly using an unprivileged account on the host.
Comment by James Pic (is_null) - Friday, 08 December 2017, 13:47 GMT
I don't care about security; I love LXD instead of virtual machines for testing deployment playbooks locally and for continuous integration, because it's blazing fast, and I have a beautiful Ansible role, novafloss.boot-lxd, which allows testing roles as simply as:


- hosts: localhost

  pre_tasks:
    - add_host:
        name: testboot.lxd
        lxd_alias: "{{ lookup('env', 'LXD_ALIAS', 'alpine/3.4/amd64' ).strip(',') }}"

  roles:
    - novafloss.boot-lxd

- hosts: testboot
  # test all your stuff here, was it easy enough ? really ? ;)
  tasks:
    - shell: uname -a

I don't understand why I can't use this feature by default; it's like the prohibition of marijuana because some users would abuse it and put their own security at risk.

I mean, we support Docker, so how is LXD any worse than Docker in terms of security, seriously? Not the same issue perhaps, but just as dangerous, if not more.

It's time to free the weed, and free CONFIG_USER_NS.
Comment by Levente Polyak (anthraxx) - Friday, 08 December 2017, 13:52 GMT
James Pic (is_null): simply because if you force-enable it by default without a way to deactivate it, then it's like forcing everyone to smoke weed instead of breathing air, whether they wish to or not, so your analogy doesn't add up.
Comment by James Pic (is_null) - Friday, 08 December 2017, 13:54 GMT
Do you mean that if this option is activated, people will be forced to use it ?

Yeah, sorry about the analogy; I know it sucks, it was more for the joke than anything else. Would Docker perhaps make a better analogy? What about systemd? We had it before it was mature, and there have been security issues with it too.
Comment by Remi Gacogne (rgacogne) - Friday, 08 December 2017, 13:57 GMT
So, we are going to support the feature you wanted and you are complaining because you will need to enable it? Seriously?
Comment by Levente Polyak (anthraxx) - Friday, 08 December 2017, 14:01 GMT
Michał Zegan (webczat): you are missing the point. If you are inside the userns you have the cap, and it's running the same kernel, so if there is such a bug you get root access on the host whether or not you had a user account on the host. The difference is that those areas originally could only be reached as root, so you needed root privileges to exploit them... not sure what you expect to additionally gain from doing so. However, enabling it gives unprivileged users access to those areas of code, which were never designed to be exposed in the first place. An endless rabbit hole of vulns. RHEL has deactivated it by default and grsec has disabled it on purpose.

James Pic (is_null): yes. Without a custom knob like the current one, if it's activated it's always available to unprivileged users, so the threat is always exposed without a way to opt out. Just compare the number of root privilege escalations; do you really think they're even comparable? Nope. Last USERNS vuln: ~2 weeks ago.
Comment by Eli Schwartz (eschwartz) - Friday, 08 December 2017, 14:01 GMT
We didn't compile docker into the kernel and autostart it on boot so every process ever could do whatever they wanted whenever they wanted, so again a terrible analogy.

USER_NS now has compiled-in support, and all it takes is setting the configuration option that says "I want to use USER_NS". I hardly think your freedom is being trampled. OTOH I am sort of beginning to regret reopening this bug if this is the reaction once we *do* give people what they asked for.
Comment by Michał Zegan (webczat) - Friday, 08 December 2017, 14:14 GMT
I see this point. I'd just note that those things should be fixed whether or not they are so exposed, because most or all of them, or at least all I know of, are not misdesign of a feature, but just bugs. Actually, I'm not sure how userns would need to be designed to be truly safe... By the way, where/how do you track newly appearing CVEs related to user namespaces? I would like to do this but don't know where.
Comment by James Pic (is_null) - Friday, 08 December 2017, 14:25 GMT
This is my own reaction and not "people"'s reaction, and it only represents my own point of view. I'm sorry I spoke up then; I should have known I don't have the skills to discuss this (I thought it was OK to make mistakes and learn from people who know better through discussion). I love Arch with or without this option, and of course I can always use my own kernel. Please go on; I will not disrupt this thread again. Thanks for your answers, best regards, with love.
Comment by Michał Zegan (webczat) - Friday, 08 December 2017, 14:36 GMT
James Pic (is_null): you would be able to enable unprivileged userns if you needed them. Not sure whether LXD uses them or creates user namespaces as root; in the second case this does not even affect you.
Comment by Daniel Micay (thestinger) - Friday, 08 December 2017, 16:18 GMT
is_null: Consider not saying something like "I don't care about security" if you expect to be taken seriously. You don't care about security, and that's why your opinion doesn't matter. Arch Linux has to balance the needs of different people and exposing everyone to a huge amount of extra attack surface with an endless stream of serious vulnerabilities wouldn't be doing that.

If you want unprivileged users to be able to use it you now have that option, and by default privileged users can already make use of it. An alternative would be changing the default value of user.max_user_namespaces to 0 which would only require patching the default value but would unnecessarily block usage by privileged users. Using sysctl doesn't work properly because this needs to be per-kernel since other kernels like linux-hardened may take the privileged approach or a completely different alternative.

There is ongoing work like http://www.openwall.com/lists/kernel-hardening/2017/12/05/13 which may land in some form upstream and will make it possible to make limited, safer usage of the feature by not exposing all of the network administration features like netfilter, etc. to every unprivileged user / sandbox not disabling user namespaces.
Comment by Daniel Micay (thestinger) - Friday, 08 December 2017, 16:47 GMT
> And, bugs are, well, bugs. something that should be fixed long ago, but either was not found or not deemed critical. Would that exposure increase their chance of being fixed, finally?

There are always going to be vulnerabilities exposed by user namespaces. It greatly increases attack surface and that's not something that can be fixed. You're talking about this as if there's a fixed number of bugs with a count going down over time. That's not how software works at all. The code is actively developed and gaining new bugs. The attack surface added by user namespaces will increase over time as more complexity is added to areas like network administration that it exposes to unprivileged users. The endless stream of user namespace vulnerabilities is not slowing down.

Fixing bugs case by case is a very poor approach to security. Bugs need to be targeted in a systemic way to significantly improve security. For example, Arch Linux uses a bunch of compiler exploit mitigations: SSP, FORTIFY_SOURCE, PIE, RELRO, BIND_NOW. Those features only have value in making memory corruption bugs harder or in some cases even impossible (FORTIFY_SOURCE) to exploit.

SSP (-fstack-protector / -fstack-protector-strong) often reduces performance by 5% (sometimes less but sometimes even substantially more) and it can only stop exploitation of an increasingly small subset of memory corruption bugs: sequential stack corruption. SSP is imperfect and can be bypassed via other security-relevant bugs able to leak the random value. Despite those things, every major distribution enables it by default. If the performance cost was 30%, it probably wouldn't be enabled by anything but a security-oriented distribution, but it's not 30%.

Arch being willing to accept a 5% performance cost across the board but not a barely significant inconvenience for a poorly designed, extremely niche feature would be hard to justify. There are projects like flatpak / bubblewrap providing alternatives to user namespaces that are more secure. Their efforts would be wasted if everyone gave up on having decent local security and exposed it. The reason for the Chromium developers not taking that path is lack of caring about desktop Linux. They wouldn't ever consider exposing user namespaces everywhere on Android or ChromeOS... only very limited exposure for sandbox creation and no exposure to unprivileged code.
Comment by Leonid Isaev (lisaev) - Friday, 08 December 2017, 20:57 GMT
Maybe I'm missing something, but unprivileged_userns_clone seems uninitialized in the patch. Is it 0 because that's how uninitialized variables are in C? (For comparison, in a similar patch, Ubuntu sets it to 1.)
Comment by Leonid Isaev (lisaev) - Friday, 08 December 2017, 20:59 GMT
Also, was enabling AUDIT necessary for this feature? And BTW, nobody mentioned CRIU also got enabled. IMHO, the latter is more useful than user_ns.
Comment by Daniel Micay (thestinger) - Friday, 08 December 2017, 21:17 GMT
Per the C standard, all globals are zero-initialized, so it's well-defined to rely on that and it's guaranteed to be what happens in practice. It's more common than not to rely on that in the Linux kernel, although there isn't a standard code style.
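A small sketch of the rule being cited: in C11 (6.7.9p10), an object with static storage duration that has no explicit initializer is initialized to zero for arithmetic types and to a null pointer for pointer types. The variable below only reuses the knob's name for illustration; it is a hypothetical stand-in, not the actual kernel source (in the kernel, such variables are placed in .bss, which is zeroed at boot).

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the patch's knob: file scope, no initializer.
 * Static storage duration means it starts at 0 without an explicit "= 0". */
int unprivileged_userns_clone;

/* The same rule covers other types: booleans start false, pointers NULL. */
static bool example_flag;
static const char *example_ptr;

int knob_default(void) { return unprivileged_userns_clone; }
bool flag_default(void) { return example_flag; }
const char *ptr_default(void) { return example_ptr; }
```

This is why the patch can leave the variable uninitialized and still default to "off"; Ubuntu's variant differs only in giving it an explicit initializer of 1.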
Comment by Daniel Micay (thestinger) - Friday, 08 December 2017, 21:23 GMT
AUDIT isn't related. I think heftig wanted to make container stuff available, thus enabling CRIU and AUDIT too. AUDIT is very monolithic, when really they should probably split it up into more options upstream along with providing a default-disable option.
Comment by loqs (loqs) - Friday, 08 December 2017, 21:55 GMT
AUDIT has been disabled again in linux 4.14.4-3. It would be nice to know the reasoning behind some of the other changes in 4.14.4-2: https://bbs.archlinux.org/viewtopic.php?id=232510
