presentation held at SUSE Linux Expert Forum December 2014
Linux container history and Linux namespaces
Examples include:
* Move a VPN connection to its own namespace (p. 25)
* User namespaces demo (p. 28)
See the collection of useful articles and advanced container use cases (from p. 29)
3. 3
Container examples
‒ Non-Linux:
‒ Solaris Containers(Zones), FreeBSD jails, WPAR(AIX)
‒ Linux:
‒ Vserver, OpenVZ, and FreeVPS
‒ Out of tree
‒ Process containers:
‒ OpenAFS's PAGs process authentication group membership
‒ Inheritance through fork()
‒ Cached token used for access control
‒ http://docs.openafs.org/AdminGuide/ch02s10.html
‒ Process containers: http://lwn.net/Articles/236529/
‒ Plan9:
‒ Everything as a filesystem(naming, access, protection methods)
‒ per-process namespaces
4. 4
Linux containers - a conceptual artifice
‒ Namespaces
‒ Isolation, virtualization
‒ clone() and unshare()
‒ Resource containers
‒ manage the use of resources outside the operating system
‒ disk, network, memory and processor
‒ cgroups
‒ Capability bounding sets
‒ divide the privileges traditionally associated with superuser into distinct units
‒ limit the privileges available to containers, e.g. CAP_SYS_ADMIN
‒ Checkpoint/restart
‒ Requires the features above
6. 6
Looking forward..
‒ 16 Aug 2006 Andrew Morton
‒ “Generally, I am not very comfortable merging any
namespace/containerization/resource management patches into
mainline until we have some sort of high-level agreed-to roadmap
which will take us to an agreed-to-at-a-high-level destination.
‒ Now, I _am_ OK with merging useless infrastructure as long as all
the prime stakeholders are OK with it. ..
‒ That would not be a useful patchset on its own because nothing
_uses_ it..
‒ We don't normally merge useless patches, but this is a special
case.
‒ So, (policy making on the fly), let's start merging the well-tested,
well-isolated, low-overhead generally-agreed-to features into
mainline.”
7. 7
Multiple Instances of the Global Linux Namespaces (2006)
Eric W. Biederman, Linux Networx
‒ By adding additional namespaces .. we can, at a trivial cost,
extend the UNIX concept and make novel uses of Linux
possible
‒ Multiple instances of a namespace simply means that you can
have two things with the same name.
‒ Implementation: an application with the right capabilities gets full control
over a namespace and still cannot escape from it
‒ https://www.kernel.org/doc/ols/2006/ols2006v1-pages-101-112.pdf
9. 9
Coordinated Efforts 2007
Companies And Individuals Involved
‒ Arista Networks (Arastra): Eric Biederman - all, initial approach
‒ SGI: Paul Jackson - original cpusets, now part of cgroups
‒ Linux-VServer: Herbert Poetzl - namespaces, containers
‒ OpenVZ: Pavel Emelyanov, Kir Kolyshkin
‒ Google: Paul Menage - task containers, cgroups
‒ Zap project: Oren Laadan - C/R
‒ IBM: Serge E. Hallyn, Dave Hansen, Cedric Le Goater, Daniel Lezcano -
ns, C/R; Balbir Singh, Srivatsa Vaddagiri - task containers
‒ Others: NEC, XtreemOS, Kerlabs, Bull, HP, PlanetLab
‒ Source: container mailing list - containers development plans (Aug 8 2007)
10. 10
Coordinated Efforts
‒ post anything container-related to the containers mailing list
(containers@lists.osdl.org) before any attempt to send it upstream
‒ make sure what is in -mm fits openvz, VServer and other
products
‒ make sure initial framework also fits requirements of basic
resource management system
12. 12
Namespaces
• Namespaces - lightweight process virtualization
• Isolation: Enable a process (or several processes) to
have different views of the system than other
processes
• Currently 6 namespaces:
‒ mnt, pid, net, ipc, uts, user
‒ 4 more planned..(2006)
‒ security namespace
‒ security keys namespace
‒ device namespace
‒ time namespace
13. 13
Mount namespace
‒ Mount namespaces were the first namespace type, by Al Viro, 2002
‒ Kernel 2.4.19
‒ CLONE_NEWNS
‒ 6 CLONE_NEW* flags were added (include/linux/sched.h)
‒ These flags (or a combination of them) can be used in clone()
or unshare() syscalls to create a namespace
15. 15
Namespace: Systemcalls
‒ 3 system calls are used
‒ clone()
‒ creates a new process and a new namespace, and attaches the process to it
‒ unshare()
‒ creates a new namespace and attaches the current process to it
‒ reverses sharing that was done using the clone(2) system call (2005)
‒ setns(int fd, int nstype)
‒ join an existing namespace
16. 16
• None of these system calls takes a namespace name as a parameter
• 6 entries (inodes) added under /proc/<pid>/ns
‒ Kernel 3.8
• nsproxy
• Kernel config items:
‒ CONFIG_UTS_NS
‒ CONFIG_IPC_NS
‒ CONFIG_USER_NS
‒ CONFIG_PID_NS
‒ CONFIG_NET_NS
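The /proc/<pid>/ns entries can be inspected from the shell. A minimal sketch, assuming a Linux system with /proc mounted; the helper names ns_id, same_ns and list_ns are made up for this illustration:

```shell
#!/bin/sh
# ns_id PID TYPE: print the namespace identifier, e.g. "net:[4026531992]".
# TYPE is one of mnt, pid, net, ipc, uts, user.
ns_id() {
    readlink "/proc/$1/ns/$2"
}

# same_ns PID1 PID2 TYPE: succeed if both processes are in the same namespace.
same_ns() {
    a=$(ns_id "$1" "$3") && b=$(ns_id "$2" "$3") && [ "$a" = "$b" ]
}

# list_ns PID: print the identifiers of all six namespaces of PID.
list_ns() {
    for t in mnt pid net ipc uts user; do
        ns_id "$1" "$t" 2>/dev/null
    done
    return 0
}

list_ns "$$"
```

Two processes share a namespace exactly when the identifiers match, so `same_ns $$ 1 net` tells whether the current shell shares the network namespace of PID 1 (reading another user's process entries may need root).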
17. 17
Namespace: User space additions
‒ nsenter(util-linux >= 2.23)
‒ wrapper around setns
‒ allows running a new process in context of existing process
‒ iproute
‒ ip netns
‒ add, del, exec
‒ util-linux
‒ unshare
‒ All 6 namespaces
18. 18
UTS namespace
‒ UTS - UNIX Time-sharing System
‒ new_utsname struct:
‒ sysname, nodename, release, version, machine, domainname
‒ CLONE_NEWUTS
‒ Since 2.6.19
‒ Initial use case: vserver/openvz - clone a new uts namespace
for each new virtual server
‒ http://lwn.net/Articles/179345/
‒ Demo: unshare -u /bin/bash
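The demo can also be run non-interactively. A sketch, assuming util-linux unshare with --map-root-user and a kernel that permits unprivileged user namespaces (otherwise drop the two user-namespace options and run as root); uts_demo is a made-up helper name:

```shell
#!/bin/sh
# uts_demo NAME: set the hostname inside a fresh UTS namespace and print it.
# The hostname of the calling shell's namespace is left untouched.
uts_demo() {
    unshare --user --map-root-user --uts \
        sh -c 'hostname "$1" && hostname' sh "$1"
}
```

Running `uts_demo uts-demo; uname -n` should print the name set inside the namespace, followed by the unchanged host name.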
19. 19
IPC namespace
‒ same principle as uts
‒ process will have independent namespace for System V
message queues, semaphore sets and shared memory
segments
‒ CONFIG_IPC_NS, CONFIG_SYSVIPC
‒ CLONE_NEWIPC flag:
‒ since 2.6.19
20. 20
Network namespace
‒ A network namespace is logically another copy of the network
stack, with its own routes, firewall rules, and network devices
‒ a network device belongs to exactly one network namespace
‒ a socket belongs to exactly one network namespace
‒ a new network namespace only includes the loopback device
‒ communication between namespaces using veth or unix
sockets
21. 21
Network namespace: Usecases
‒ Turn off network inside namespace:
‒ ensure that processes running there will be unable to make connections
outside of namespace
‒ e.g. spam, botnets
‒ Restricted namespace:
‒ Even processes that handle network traffic (a web server worker process or
web browser rendering process for example) can be placed into a restricted
namespace
‒ Namespace without network devices
‒ make it impossible for child or worker processes to make additional network
connections
‒ http://lwn.net/Articles/580893/
22. 22
Network namespace
‒ man ip-netns
‒ ip netns add <net_ns>
‒ creates /var/run/netns/<net_ns>
‒ ip netns exec NAME cmd ... - Run cmd in the named network namespace
‒ /etc/netns/<net_ns>/resolv.conf overrides /etc/resolv.conf
‒
‒ Communicate between net ns by
‒ creating a pair of network devices (veth) and moving one to another network
namespace
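The veth approach can be sketched as follows. The device names veth0/veth1 and the 10.0.0.0/24 addresses are assumptions chosen for illustration; by default the script only prints the commands (dry run), since applying them needs root:

```shell
#!/bin/sh
# Connect the host to network namespace $NS with a veth pair.
# DRY_RUN=1 (default) prints the commands; DRY_RUN=0 executes them.
NS=${NS:-tns0}

run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then echo "$*"; else "$@"; fi
}

connect_ns() {
    run ip netns add "$NS"
    run ip link add veth0 type veth peer name veth1   # create the pair
    run ip link set veth1 netns "$NS"                 # move one end
    run ip addr add 10.0.0.1/24 dev veth0
    run ip link set veth0 up
    run ip netns exec "$NS" ip addr add 10.0.0.2/24 dev veth1
    run ip netns exec "$NS" ip link set veth1 up
    run ip netns exec "$NS" ip link set lo up
}

connect_ns
```

After a real run, `ip netns exec tns0 ping 10.0.0.1` should reach the host end of the pair.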
24. 24
Network namespace example
Move a VPN connection to its own namespace
‒ ip netns add tns0
‒ mkdir /etc/netns/tns0
‒ openconnect -s /etc/vpnc/vpnc-script <your-vpn-network>
‒ ip link set dev tun0 netns tns0
‒ #example: VPN_IP_ADDRESS=`ip a|grep 149|sed -e 's/..*149/149/' -e 's#/32.*##'`
‒ ip netns exec tns0 ip addr add $VPN_IP_ADDRESS dev tun0
‒ ip netns exec tns0 ip link set tun0 up
‒ ip netns exec tns0 ip link set lo up
‒ #test: ip netns exec tns0 ping $VPN_IP_ADDRESS
‒ #ip netns exec tns0 ip route restore </tmp/ip-route-save-vpn
‒ ip route|sed -E -e 's/ (scope|proto) .*//' -e 's/^/ip route add /' >/tmp/ip-route-add
‒ chmod 755 /tmp/ip-route-add
‒ ip netns exec tns0 /tmp/ip-route-add
‒ #test: ip netns exec tns0 ip route
‒ echo nameserver <your_VPN_specific_nameserver> >/etc/netns/tns0/resolv.conf
‒ ip netns exec tns0 cat /etc/resolv.conf
‒ ip netns exec tns0 wget <IP_ADDRESS_only_available_via_VPN>
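The two text-processing steps of this example (picking the VPN address and turning the route table into "ip route add" commands) can be written as small filters. A sketch only; routes_to_cmds and first_ipv4 are hypothetical helper names, and the address in the usage note is made up:

```shell
#!/bin/sh
# routes_to_cmds: read "ip route" output on stdin and print one
# "ip route add ..." command per route, dropping everything from the
# " proto" or " scope" keyword onward.
routes_to_cmds() {
    sed -E -e 's/ (proto|scope) .*//' -e 's/^/ip route add /'
}

# first_ipv4 DEV: read "ip -o -4 addr show" output on stdin and print the
# first IPv4 address (with prefix length) configured on DEV.
first_ipv4() {
    awk -v dev="$1" '$2 == dev && $3 == "inet" { print $4; exit }'
}
```

Possible use: `VPN_IP_ADDRESS=$(ip -o -4 addr show | first_ipv4 tun0)` and `ip route | routes_to_cmds >/tmp/ip-route-add`. With an input line such as `3: tun0    inet 149.44.1.2/32 scope global tun0`, first_ipv4 prints `149.44.1.2/32`.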
25. 25
User namespace
‒ only namespace which can be created without CAP_SYS_ADMIN capability
‒ A process will have distinct set of UIDs, GIDs and capabilities
‒ User namespaces allow per-namespace mappings of user and group IDs.
‒ users and groups may have privileges for certain operations inside the
container without having those privileges outside the container
‒ Capabilities
‒ have root privileges for operations inside the container only
‒ map user IDs on the host system to corresponding user IDs in the
namespace
‒ Since 3.8 complete
‒ Having a full set of caps in your local user namespace is safe
‒ user namespace root users can create network namespaces
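The ID mapping is visible in /proc/<pid>/uid_map, where each line holds three numbers: first ID inside the namespace, first ID outside, and the length of the range. A sketch of a lookup over that format; map_uid is a made-up helper name and the sample mappings in the usage note are illustrative:

```shell
#!/bin/sh
# map_uid INSIDE_UID: read uid_map lines ("inside outside length") on stdin
# and print the corresponding ID outside the namespace.
# Exits non-zero if the ID is not covered by any mapping.
map_uid() {
    awk -v id="$1" '
        id >= $1 && id < $1 + $3 { print $2 + (id - $1); found = 1; exit }
        END { exit !found }
    '
}
```

For example, `echo '0 1000 1' | map_uid 0` prints 1000: root inside the namespace is UID 1000 outside. Where unprivileged user namespaces are enabled, `unshare --user --map-root-user cat /proc/self/uid_map` shows such a one-line mapping for the calling user.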
30. 30
cgroup only container
‒ One of the cgroup only container uses we see at Parallels (so no separate
filesystem and no net namespaces) is pure apache load balancer type
shared hosting. In this scenario, base apache is effectively brought up in
the host environment, but then spawned instances are resource limited
using cgroups according to what the customer has paid.
‒ Obviously all apache instances are sharing /var and /run from the host
(mostly for logging and pid storage and static pages). The reason some
hosters do this is that it allows much higher density simple web serving
(either static pages from quota limited chroots or dynamic pages limited by
database space constraints) because each "instance" shares so much from
the host. The service is obviously much more basic than giving each
customer a container running apache, but it's much easier for the hoster to
administer and it serves the customer just as well for a large cross section
of use cases and for those it doesn't serve, the hoster usually has separate
container hosting (for a higher price, of course).
‒ systemd-devel ml: Sun, 25 Aug 13, 19:16 CEST James Bottomley
31. 31
PaaS SaaS Container
‒ I gave you one example: a really simplistic one. A more sophisticated
example is a PaaS or SaaS container where you bring the OS up in the host
but spawn a particular application into its own container (this is essentially
similar to what Docker does). Often in this case, you do add separate
mount and network namespaces to make the application isolated and
migrateable with its own IP address. The reason you share init and most of
the OS from the host is for elasticity and density, which are fast becoming a
holy grail type quest of cloud orchestration systems: if you don't have to
bring up the OS from init and you can just start the application from a C/R
image (orders of magnitude smaller than a full system image) and slap on
the necessary namespaces as you clone it, you have something that comes
online in milliseconds which is a feat no hypervisor based virtualisation can
match.
‒ systemd-devel ml, Sun, 25 Aug 13, 20:16 CEST James Bottomley
32. 32
tidbits
‒ mboxgrep namespace systemd-devel201*
‒ It sounds like you're setting up your containers wrongly. If a container can
reboot the system it means that host root capabilities have leaked into the
container, which is a big security no-no. The upstream way of avoiding this
is USER_NS (because root in the container is now not root in the host).
The OpenVZ kernel uses a different mechanism to solve the problem, but
we think USER_NS is the better way to go on this.
‒ For launching new services in a container simply sending a message to the
init process is probably what you want. I think those messages already
traverse unix domain sockets so it isn't too shabby.
33. 33
tidbits
‒ mboxgrep namespace systemd-devel201*
‒ Feb 2014
‒ > FYI I have successfully run Fedora 19 with systemd inside a container
‒ > with libvirt LXC, however, I did *not* enable user namespaces. Every
‒ > time I try user namespaces I find some other bug in either the kernel
‒ > or libvirt, so I wouldn't be surprised if yet more breakage has
‒ > occurred in user namespaces :-(
‒ Those bugs should now be fixed, if you don't enable the option, how are we
supposed to know what is left to be done? :)
34. 34
tidbits
‒ https://lkml.org/lkml/2013/4/25/596
‒ > Final question, is it by design that uid 0 within a namespace is not
‒ > allowed to write to
‒ > /proc/*/oom_score_adj?
‒
‒ Essentially. It is by design that uid 0 within a namespace be mapped to some
other uid outside the namespace, and that the permissions on writes should use
the permission needed outside of the user namespace.
‒ Which means there are all kinds of things only uid 0 can write to, that you can't
touch in a user namespace. Some of those things the policy may need to be
reconsidered. A lot of those things the default policy is good. Regardless we are
now defaulting to not letting root in a container do risky things which is a good
thing.
‒ Eric
35. 35
Capabilities
‒ http://man7.org/linux/man-pages/man7/user_namespaces.7.html
‒ The child process created by clone(2) with the CLONE_NEWUSER flag starts out
with a complete set of capabilities in the new user namespace. Likewise, a
process that creates a new user namespace using unshare(2) or joins an existing
user namespace using setns(2) gains a full set of capabilities in that namespace.
On the other hand, that process has no capabilities in the parent (in the case of
clone(2)) or previous (in the case of unshare(2) and setns(2)) user namespace,
even if the new namespace is created or joined by the root user (i.e., a process
with user ID 0 in the root namespace).
‒ Note that a call to execve(2) will cause a process's capabilities to be recalculated
in the usual way (see capabilities(7)), so that usually, unless it has a user ID of 0
within the namespace or the executable file has a nonempty inheritable
capabilities mask, it will lose all capabilities.
‒ Having a capability inside a user namespace permits a process to perform
operations (that require privilege) only on resources governed by that namespace.
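The capability sets of a process are exposed as hex masks in /proc/<pid>/status (CapEff, CapBnd and others). A small sketch for checking a single bit; has_cap and cap_eff are made-up helper names, and 21 is the bit number of CAP_SYS_ADMIN in <linux/capability.h>:

```shell
#!/bin/sh
CAP_SYS_ADMIN=21   # bit number from <linux/capability.h>

# has_cap MASK BIT: succeed if bit BIT is set in the hex capability mask
# MASK, as printed in the CapEff/CapBnd lines of /proc/<pid>/status.
has_cap() {
    [ $(( (0x$1 >> $2) & 1 )) -eq 1 ]
}

# cap_eff: print the effective capability mask of the current shell.
cap_eff() {
    awk '$1 == "CapEff:" { print $2 }' "/proc/$$/status"
}
```

`has_cap "$(cap_eff)" "$CAP_SYS_ADMIN" && echo "CAP_SYS_ADMIN present"` run inside a new user namespace and again outside it shows the difference described above, always with the caveat that the capability only applies to resources governed by that namespace.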
36. 36
Socketat - network namespaces
‒ http://lwn.net/Articles/407615/
‒ The use case is the handful of networking applications that find that it
makes sense to listen to sockets from multiple network namespaces at once. Say a
home machine that has a vpn into your office network and the vpn into the office network
runs in a different network namespace so you don't have to worry about address conflicts
between the two networks, the chance of accidentally bridging between them, and so you
can use different dns resolvers for the different networks.
‒ In that scenario it would be nice if I could run some services on both networks. Starting
two+ copies of the daemons just so they can live in all of the networks is ok, but in the
fullness of time I expect that there will be daemons that want to optimize things and have
sockets in all of the network namespaces you are connected to.
‒ In a multiple network namespace aware application when it goes to open a socket it will
want to specify which network namespace the socket is in. If it is a general listener it will
probably be listening to events in /proc/mounts waiting for extra namespaces to be mounted
under a standard location say: /var/run/netns/<netnsname>/ns.
‒ Once the application receives the event for a new network namespace showing up it will
want to create a new socket listening for connections in the new network namespace.
‒ In that scenario none of those network namespaces are foreign, but one network
namespace will be the default and the rest will be non-default network namespaces.
37. 37
socketat
‒ http://lists.openvz.org/pipermail/devel/2010-October/025720.html
‒ [Devel] Re: [PATCH 8/8] net: Implement socketat.
‒ Just to clarify this point. You enter the namespace, create the socket and go back
to the initial namespace (or create a new one). Further operations can be made
against this fd because it is the network namespace stored in the sock struct
which is used, not the current process network namespace which is used at the
socket creation only.
‒ We can actually already do that by unsharing and then create a socket. This
socket will pin the namespace and can be used as a control socket for the
namespace (assuming the socket domain will be ok for all the operations).
‒ .. if I assume you want to create a process controlling 1024 netns, let's try to
identify what happens with setns and with socketat:
‒ With setns:
‒ * open /proc/self/ns/net (1)
‒ * unshare the netns
‒ * open /proc/self/ns/net (2)
‒ * setns (1)
‒ * create a virtual network device
‒ * move the virtual device to (2) (using the set netns by fd)
38. 38
socketat
‒ http://lists.openvz.org/pipermail/devel/2010-October/025736.html
‒ > The app control point is in namespace0. I still want to be able to
‒ > "boot" namespaces first and maybe a few seconds later do a socketat()...
‒ > and create devices, tcp sockets etc. I suspect create_ns(namespace-name)
‒ > would involve:
‒ > * open /proc/self/ns/net (namespace-name)
‒ > * unshare the netns
‒ > Is this correct?
‒
‒ Almost.
‒ create should be:
‒ * verify namespace-name is not already in use
‒ * mkdir -p /var/run/netns/<namespace-name>
‒ * unshare the netns
‒ * mount --bind /proc/self/ns/net /var/run/netns/<namespace-name>
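The create steps quoted above can be sketched as a shell function, roughly what "ip netns add" does today. One deliberate difference, stated as an assumption: the mount point is created as an empty file with touch rather than as a directory, because the bind-mount source /proc/self/ns/net is a file. create_ns and NETNS_DIR are made-up names, and by default the commands are only printed (dry run):

```shell
#!/bin/sh
# DRY_RUN=1 (default) prints the commands; DRY_RUN=0 executes them (root).
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then echo "$*"; else "$@"; fi
}

# create_ns NAME: create a persistent, named network namespace.
create_ns() {
    name=$1
    dir=${NETNS_DIR:-/var/run/netns}
    # 1. verify the name is not already in use
    if [ -e "$dir/$name" ]; then
        echo "namespace $name already exists" >&2
        return 1
    fi
    # 2. create the mount point (a file, see the assumption above)
    run mkdir -p "$dir"
    run touch "$dir/$name"
    # 3. and 4. unshare the netns and bind-mount it from inside, so the
    # namespace stays alive after the mount command exits
    run unshare --net mount --bind /proc/self/ns/net "$dir/$name"
}
```

`DRY_RUN=0 create_ns tns1` followed by `ip netns exec tns1 ip link` should then list only the loopback device.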
40. 40
References – old
‒ Paul B. Menage. Adding Generic Process Containers to the Linux Kernel. Proceedings
of the Ottawa Linux Symposium, 2007.
‒ http://www.kernel.org/doc/ols/2007/ols2007v2-pages-45-58.pdf
‒ Linux-CR: Transparent Application Checkpoint-Restart in Linux
‒ http://www1.cs.columbia.edu/~orenl/papers/ols2010-linuxcr.pdf
‒ Making applications mobile using containers
‒ http://lxc.sourceforge.net/doc/ols2006/lxc-ols2006-slides.pdf
‒ Virtual Servers and Checkpoint/Restart in Mainstream Linux
‒ describes the general namespace support in Linux and its usage
‒ Transparent Checkpoint-Restart of Multiple Processes on Commodity Operating
Systems - Oren Laadan
‒ Source: Operating System Virtualization: Practice and Experience, Oren
Laadan (systor2010_osvirt.pdf)
41. 41
References
‒ http://lwn.net/Articles/531114/#series_index
‒ Namespaces in operation, 6 part series by Michael Kerrisk
‒ https://github.com/bigbighd604/C-Notes
‒ Git repository with demo code from the namespaces series
‒ www.haifux.org/lectures/299/netLec7.pdf (Rami Rosen, 2013)
‒ https://www.kernel.org/doc/ols/2006/ols2006v1-pages-101-112.pdf (Biederman)
‒ http://books.google.de/books?id=RpsQAwAAQBAJ&pg=PA424&lpg=PA423&ots=rAqP4sxMXn&focus=viewport&dq=Rami+Rosen+network+namespaces&hl=de
‒ Linux Kernel Networking(Rami Rosen)
‒ http://www.makelinux.net/kernel_map/
‒ http://en.wikipedia.org/wiki/Operating_system-level_virtualization
‒ /usr/src/linux/Documentation/unshare.txt
‒ How to find namespaces in a Linux system
‒ http://www.opencloudblog.com/?p=251