Docker and other container runtimes are gaining momentum and becoming the new industry standard for server applications. Linux namespaces, commonly used to run Docker apps, expose a large attack surface that is difficult to reduce. Intel’s Clear Containers use KVM to run containers as VMs to provide additional isolation. It is possible to provide VM-like isolation for containers without sacrificing performance.
This talk focuses on the benefits of using Xen to provide an execution environment for Docker apps. The presentation starts by listing the requirements of this environment. It explains why monitoring container syscalls is important and what its security benefits are. The talk introduces a new paravirtualized protocol to virtualize IP sockets and provides the design and implementation details. The presentation clarifies the impact of the new protocol from a security perspective. The discussion concludes by comparing performance figures with the traditional PV network frontend and backend drivers in Linux, explaining the reasons for any performance gaps.
4. Security Recommendations
(from NCC White Paper)
• From “Understanding and Hardening Linux Containers” by NCC Group:
• Run unprivileged containers (user namespaces, root capability dropping)
• Apply a Mandatory Access Control system, such as SELinux
• Build a custom kernel binary with as few modules as possible
• Apply sysctl hardening
• Apply disk and storage limits
• Control device access and limit resource usage with cgroups
• Drop any capabilities which are not required for the application within the container
• Use custom mount options to increase defense in depth
• Apply grsecurity and PaX patches to Linux
• Reduce Linux attack surface with seccomp-bpf
• Isolate containers based on trust and exposure
• Logging, auditing and monitoring are important for container deployment
• Use hardware virtualization along application trust zones
12. System Call Virtualization
• Introduce proxy kernel
• Same as root kernel
• Allows memory page reuse
• Single kernel to manage
• Subset of syscalls delivered to machine kernel
• Socket, file, time
• Majority of system calls restricted within syscall proxy
[Figure: containers in ring 3, each above a syscall kernel proxy in ring 0, alongside the root kernel. Labels: syscall virtualization, unprotected, proxied/translated, hypercall.]
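The split described on this slide can be sketched as a small dispatcher. This is a toy model, not the proxy kernel's code: the class name `SyscallProxy`, the `dispatch` method, and the specific syscall names placed in each category are assumptions made for illustration. Only the three forwarded categories (socket, file, time) come from the slide.

```python
from dataclasses import dataclass, field

# The slide names three forwarded categories: socket, file, time.
# The syscall names listed under each category are illustrative assumptions.
FORWARDED = {
    "socket": {"socket", "connect", "bind", "listen", "accept"},
    "file": {"open", "read", "write", "close"},
    "time": {"clock_gettime", "gettimeofday"},
}


@dataclass
class SyscallProxy:
    """Hypothetical model of a proxy that forwards only a subset of syscalls."""

    forwarded_log: list = field(default_factory=list)
    local_log: list = field(default_factory=list)

    def category(self, name):
        """Return the forwarded category for a syscall name, or None."""
        for cat, names in FORWARDED.items():
            if name in names:
                return cat
        return None

    def dispatch(self, name, *args):
        """Record where a syscall would be handled and return that location."""
        cat = self.category(name)
        if cat is not None:
            # In the real design this would be a hypercall to the machine kernel.
            self.forwarded_log.append((cat, name, args))
            return "forwarded"
        # Everything else stays inside the proxy.
        self.local_log.append((name, args))
        return "local"
```

With this model, `SyscallProxy().dispatch("connect", "192.168.2.1", 80)` is routed to the machine kernel, while a call such as `getpid` never leaves the proxy.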
16. Example: Network Access with Namespaces
• Container namespace created at the host as before
• Container process is launched inside a protected VM
• Through system call virtualization, system calls are applied to the namespace context
• Container gets the IP address of the network namespace
• Transparent to Docker and other container systems
[Figure: two containers in ring 3, each above a syscall kernel proxy in ring 0, alongside the root. Labels: syscall virtualization, connect, 192.168.2.1, bridge 192.168.2.1.]
18. First Implementation
• Design document
• http://marc.info/?l=xen-devel&m=147033114613017
• Code
• First, simple implementation on Xen
• 1 Command ring
• Per socket:
• data ring
• event ring
• Variable ring data sizes configurable per socket
• Supported functions (socket, connect, release, bind, listen, accept, poll)
• git://git.kernel.org/pub/scm/linux/kernel/git/sstabellini/xen.git pvcalls-5
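The ring layout listed above (one command ring, plus a data ring and an event ring per socket, with per-socket sizes) can be sketched as follows. This is a simplified model and not the wire format from the design document: the `Ring` and `Frontend` classes, the dictionary-shaped requests, and the default sizes are all assumptions made for illustration. Only the command names are taken from the slide.

```python
class Ring:
    """Fixed-capacity FIFO with free-running producer/consumer indices."""

    def __init__(self, size):
        if size <= 0 or size & (size - 1):
            raise ValueError("size must be a power of two")
        self.size = size
        self.slots = [None] * size
        self.prod = 0  # total items ever pushed
        self.cons = 0  # total items ever popped

    def push(self, item):
        """Add an item; return False if the ring is full."""
        if self.prod - self.cons == self.size:
            return False
        self.slots[self.prod & (self.size - 1)] = item
        self.prod += 1
        return True

    def pop(self):
        """Remove and return the oldest item, or None if the ring is empty."""
        if self.prod == self.cons:
            return None
        item = self.slots[self.cons & (self.size - 1)]
        self.cons += 1
        return item


# Command names as listed on the slide.
COMMANDS = {"socket", "connect", "release", "bind", "listen", "accept", "poll"}


class Frontend:
    """Hypothetical frontend: one command ring, two rings per socket."""

    def __init__(self, command_ring_size=8):
        self.command_ring = Ring(command_ring_size)
        self.sockets = {}
        self.next_id = 0

    def socket(self, data_ring_size=16, event_ring_size=4):
        """Allocate per-socket rings (sizes chosen per socket) and queue 'socket'."""
        sock_id = self.next_id
        self.next_id += 1
        self.sockets[sock_id] = {
            "data": Ring(data_ring_size),
            "event": Ring(event_ring_size),
        }
        self.submit("socket", sock_id)
        return sock_id

    def submit(self, command, sock_id, **params):
        """Queue a request on the shared command ring; False if it is full."""
        if command not in COMMANDS:
            raise ValueError(f"unsupported command: {command}")
        return self.command_ring.push({"cmd": command, "id": sock_id, **params})
```

A backend in this model would pop requests from `command_ring` in order and use the socket id to find the matching data and event rings. The design document linked above defines the actual request layout.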