The document describes Linux containerization and virtualization technologies including containers, control groups (cgroups), namespaces, and backups. It discusses:
1) How cgroups isolate and limit system resources for containers through mechanisms like cpuset, cpuacct, cpu, memory, blkio, and freezer.
2) How namespaces isolate processes by ID, mounting, networking, IPC, and other resources to separate environments for containers.
3) The new backup system, which uses thin provisioning and snapshotting to efficiently back up container environments to backup servers and restore individual accounts or full servers as needed.
6. What do we have?
● cpuset - whole cores and cpu mapping
● cpuacct - cpu cycle accounting
● cpu - less than core granularity
● memory - limits and accounting
● blkio - limits and accounting
● net_cls - network classification
● net_prio - network priority
● Freezer + checkpoint/restore - migration
7. General structure
● tasks
– attach a task (thread) and show the list of threads
● cgroup.procs
– show list of processes
# mount -t cgroup none /cgroups
# mount -t cgroup -o cpuset cpuset /cg/cpuset
8. How to use them?
● Create cgroup
# mkdir /cgroup/GRP
● Prepare minimum limits
# echo 0-2 > /cgroup/GRP/cpuset.cpus
# echo 0-1 > /cgroup/GRP/cpuset.mems
● Add a process to a cgroup:
# echo PID > /cgroup/GRP/tasks
● Verify that a process is in the cgroup
# grep PID /cgroup/GRP/tasks
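The steps above can be collected into one small helper. This is only a sketch, assuming a cgroup v1 cpuset hierarchy is mounted at the directory named by `CGROOT`; the `CGROOT` variable and the `cg_attach` function are hypothetical names, not part of any tool.

```shell
#!/bin/sh
# Root of the mounted cgroup hierarchy (assumption: /cgroup, as on the slide).
CGROOT="${CGROOT:-/cgroup}"

# cg_attach GROUP CPUS MEMS PID   (hypothetical helper)
cg_attach() {
    grp="$CGROOT/$1"
    # Create the cgroup
    mkdir -p "$grp" || return 1
    # cpuset needs cpus and mems set before any task can be attached
    echo "$2" > "$grp/cpuset.cpus" || return 1
    echo "$3" > "$grp/cpuset.mems" || return 1
    # Add the process to the cgroup
    echo "$4" > "$grp/tasks" || return 1
    # Verify: the PID must appear as a whole line in tasks
    grep -qx "$4" "$grp/tasks"
}
```

Usage would look like `cg_attach GRP 0-2 0-1 "$$"`. The check uses `grep -qx` rather than a plain `grep` so that PID 12 does not match a line containing 123.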
9. cpuset
● Physical CPU & Memory limits
– cpuset.cpus - list of allowed CPUs
– cpuset.mems - list of allowed memory nodes
– cpuset.cpu_exclusive - 0/1 are the CPUs exclusive to this group
– cpuset.mem_exclusive - 0/1 are the memory nodes exclusive to this group
Documentation/cgroups/cpusets.txt
10. CPU accounting
● CPU usage combined for all CPUs (in nanoseconds)
● CPU usage per CPU (in nanoseconds)
● per CPU and user/system (in USER_HZ)
● Documentation/cgroups/cpuacct.txt
11. CPU
● CPU scheduler limits (CONFIG_CGROUP_SCHED)
– cpu.shares
– cpu.cfs_quota_us: in microseconds
– cpu.cfs_period_us: in microseconds (default 100ms)
– cpu.stat: exports throttling statistics
  nr_throttled: number of times the group has been throttled/limited
  throttled_time: total time (in nanoseconds) for which entities of the group have been throttled
● Documentation/scheduler/sched-bwc.txt
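The throttling counters in cpu.stat are easier to read when the nanosecond total is converted to seconds. A minimal sketch, assuming the cgroup v1 cpu.stat layout of one "name value" pair per line; `cpu_throttle_stats` is a hypothetical helper name.

```shell
#!/bin/sh
# cpu_throttle_stats FILE   (hypothetical helper)
# Prints "<nr_throttled> <throttled seconds>" from a cpu.stat file.
cpu_throttle_stats() {
    awk '$1 == "nr_throttled"   { n = $2 }
         $1 == "throttled_time" { t = $2 }
         END { printf "%d %.3f\n", n, t / 1000000000 }' "$1"
}
```

It would be called as `cpu_throttle_stats /cgroup/GRP/cpu.stat`, where the path is only an example.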
12. CPU examples
q - quota
p - period
[Figure: CPU 0, CPU 1, CPU 2, CPU 3]
q: 500, p: 500
q: 1000, p: 500
q: 1500, p: 500
q: 2000, p: 500
q: 250, p: 500
# echo 250000 > cpu.cfs_quota_us
# echo 500000 > cpu.cfs_period_us
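With CFS bandwidth control, quota divided by period gives how many CPUs' worth of run time the group may use in each period. The arithmetic can be sketched as below; `cfs_cpus` is a hypothetical helper, and it does not handle a quota of -1 (no limit).

```shell
#!/bin/sh
# cfs_cpus QUOTA_US PERIOD_US   (hypothetical helper)
# Prints the CPU allowance as a number of CPUs: quota / period.
cfs_cpus() {
    awk -v q="$1" -v p="$2" 'BEGIN { printf "%.2f\n", q / p }'
}

cfs_cpus 250000 500000     # prints 0.50 (half of one CPU)
cfs_cpus 2000000 500000    # prints 4.00 (four full CPUs)
```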
13. memory
Only Memory
● memory.usage_in_bytes
– show current res_counter usage for memory
● memory.limit_in_bytes
– set/show limit of memory usage
● memory.failcnt
– show the number of times memory usage hit the limit
Memory + Swap
● memory.memsw.usage_in_bytes
● memory.memsw.limit_in_bytes
● memory.memsw.failcnt
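The usage and limit files combine naturally into a utilisation figure. A minimal sketch, assuming a cgroup v1 memory controller directory is passed as the argument; `mem_usage_pct` is a hypothetical helper name.

```shell
#!/bin/sh
# mem_usage_pct CGROUP_DIR   (hypothetical helper)
# Prints memory usage as a percentage of the configured limit.
mem_usage_pct() {
    usage=$(cat "$1/memory.usage_in_bytes") || return 1
    limit=$(cat "$1/memory.limit_in_bytes") || return 1
    awk -v u="$usage" -v l="$limit" 'BEGIN { printf "%.1f\n", 100 * u / l }'
}
```

For a group with no limit set, the limit file holds a very large number, so the result will be close to 0.0.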
14. memory
Kernel Memory limits
● memory.kmem.limit_in_bytes
– set/show hard limit for kernel memory
● memory.kmem.usage_in_bytes
– show current kernel memory allocation
● memory.kmem.failcnt
– show the number of times kernel memory usage hit the limit
19. blkio
/ 1024
|- lxc/ 900
| |- c120 450
| |- c121 450
| |- c122 450
| |- c123 450
So each container can get only 50% of the total I/O of the LXC cgroup
20. Network
● Adding network class to each cgroup so you
can later limit it with tc
– Documentation/cgroups/net_cls.txt
● Prioritizing network traffic on interface
– Documentation/cgroups/net_prio.txt
21. Freezer + CRIU
● freezer.state
– THAWED
– FREEZING
– FROZEN
● freezer.self_freezing
– 0 (thawed)/ 1 (frozen)
● freezer.parent_freezing
– 0 if no ancestor is frozen / 1 otherwise
● CRIU - Checkpoint and Restore In Userspace
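Freezing is asynchronous: after FROZEN is written, freezer.state may read FREEZING until every task has stopped. A minimal sketch of a wait loop, assuming a cgroup v1 freezer directory is passed as the argument; `cg_freeze` and `cg_thaw` are hypothetical helper names.

```shell
#!/bin/sh
# cg_freeze CGROUP_DIR [TRIES]   (hypothetical helper)
# Requests a freeze and polls once per second until the state is FROZEN.
cg_freeze() {
    tries="${2:-10}"
    echo FROZEN > "$1/freezer.state" || return 1
    while [ "$tries" -gt 0 ]; do
        [ "$(cat "$1/freezer.state")" = "FROZEN" ] && return 0
        tries=$((tries - 1))
        sleep 1
    done
    return 1    # still FREEZING after all tries
}

# cg_thaw CGROUP_DIR   (hypothetical helper)
cg_thaw() {
    echo THAWED > "$1/freezer.state"
}
```

A checkpoint tool would run between `cg_freeze` and `cg_thaw`, so that the tasks cannot change state while they are being dumped.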
28. User namespace
User authentication and mapping files:
● /etc/passwd
● /etc/group
● /etc/shadow
- What if we want to create a username called pesho, but such a user already exists?
- What if we want to create user joan with UID 1005, but there is already user pesho with UID 1005?
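User namespaces address both questions by translating UIDs per namespace: each line of /proc/PID/uid_map has the form "inside-start outside-start length", so the same UID inside two containers can map to two different UIDs on the host. The lookup can be sketched as below; `map_uid` is a hypothetical helper, and the host ranges used in the comments (100000 and 200000) are made-up examples.

```shell
#!/bin/sh
# map_uid INSIDE_UID [MAP_FILE]   (hypothetical helper)
# Prints the host UID that INSIDE_UID maps to, using uid_map-style lines:
#   <inside-start> <outside-start> <length>
map_uid() {
    awk -v uid="$1" '
        uid >= $1 && uid < $1 + $3 { print $2 + (uid - $1); found = 1; exit }
        END { exit !found }' "${2:-/proc/self/uid_map}"
}

# With a map line "0 100000 65536", UID 1005 inside is 101005 on the host.
# With a map line "0 200000 65536", UID 1005 inside is 201005 on the host.
```

The function returns a non-zero status when the UID falls outside every mapped range.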
31. Network namespace
- IP
- IPv6
- Routing
- TCP
- UDP
- SCTP
- DCCP
- RDS
● Having a separate loopback device for a process
● Or simply test the MySQL server on the same IP
● Completely different routing for a process
32. Mount namespace
the most complex one...
having only one / is a problem...
- at around 22000 mounts everything on your machine starts to lag... no matter how many cores or how much RAM you have :(
- having a different /proc/mounts per process would be nice and very interesting to implement... :)
33. PID namespace
Migration of processes between machines (CRIU)
It allows you to have two or more processes running with the same PID.
PID - is the PID on the host machine
NSPID - is the PID that the process sees
PID NSPID
1421 5420 ssh-agent
1730 5420 xchat
1756 5420 firefox
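On kernels that export the `NSpid:` field in /proc/PID/status, the PID a process sees inside its own namespace can be read from the host. A minimal sketch, assuming that field is present and that its last value is the PID in the innermost namespace; `ns_pid` is a hypothetical helper name.

```shell
#!/bin/sh
# ns_pid HOST_PID [PROC_ROOT]   (hypothetical helper)
# Prints the PID of HOST_PID as seen in its innermost PID namespace.
ns_pid() {
    awk '$1 == "NSpid:" { print $NF }' "${2:-/proc}/$1/status"
}
```

For a process that is not in a nested PID namespace the field holds a single value, so the host PID itself is printed.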
39. Avatar Design
Avatar Master
Host Server    Backup Server
Start backups
Each backup server has a limit on the number of simultaneous jobs.
- max jobs
- max backups
- max restores
40. Avatar Design
Avatar Master
Host Server    Backup Server
Report status
each backup reports a lot of things:
- thinpool data usage
- mounted df output
- LV df output
- archive_size
- broken dbs
- remote_addr
- user IP
- exit_code
- caller_pid
- interface_type
- archive_size
- last_progress
46. Full server restore
Avatar Master
Host Server    Backup Server
Report status
account 1
ns1 & ns2 restore here
account 3