One Cluster to Serve Them All
How to run a multi-tenant K8s cluster for 1000+ users in research and education at a University
06.02.18 christian.huening@haw-hamburg.de | Twitter: @chrishuen

First: 2 Things

1. Wide Range of Experience

2. Resources at Universities

University Requirements
• Flexible compute resources for Research & Teaching purposes
• Students: Try technologies, host small services etc.
• Research projects: Host project websites, services and run
large workloads in the cloud
• Must be simple to use but allow for complex setups!
• Large variety in technologies!
• 1000+ students
• AWS, Azure, GKE etc. not an option due to administrative
restrictions

Multi-Tenancy @ HAW
• Lab & Research projects each buy their own resources:
• Setup consumes too much time → Project elapsed before anything runs
• Large vendor variety → very hard to maintain
• Objectives:
• Consolidate heterogeneous compute resources
• Datacenter De-Fragmentation → Due to scarcity of power, cooling and space
• Goals:
• Democratize Compute Resources
• Increase Research & Development ramp-up speed and efficiency
• Improve Resource Utilization
• Simplify usage

Worked well, but...
• VMware at scale is too expensive
• Resources became scarce as people demanded larger VM instances
• Also: lack of flexibility
• VMs are never returned
• VMs never get patched → Users need to maintain Operating Systems (hint: they won’t)
• Problems with security rules: Either too strict or too weak → Users dissatisfied

Containers to the Rescue!
• Lightweight
• Fast
• Flexible
• Resource Efficient
… you know it
• But:
Requires Orchestration
• Enter Kubernetes

Multiple Clusters?
• Requires the skill to run K8s
• Even if setup is automated:
• Still leaves configuration of cluster to the users
• Does not help in error cases
• Does not help with special setups
• Essentially same provisioning problem as with VMs
• aka: Who gets how many resources and when?

Other reasons for single-cluster
• “(…) Not needing to deploy and monitor multiple clusters (i.e.
build all the tooling we did to run GKE at Google)” – David
Oppenheimer
• “(…) with the increasing emergence of "secure container"
technologies, this tendency will only increase, primarily driven
by resource cost considerations” – Quinton Hoole
• Source: https://goo.gl/ypCtzg

To the multi-tenant cluster we go!

Initial Cluster Setup
• Kubernetes the Hard Way (https://github.com/kelseyhightower/kubernetes-the-hard-way)
→ ICC the Hard Way (https://github.com/christianhuening/kubernetes-the-haw-hamburg-way)
• 3 Master Nodes
• VM, 1 Core, 4 GB
• 3 Worker Nodes
• Bare Metal, 8 Core, 128GB
• 5 Node etcd cluster
• VM, 1 Core, 8 GB, HDD Storage
• Canal (Flannel + Calico) as Overlay Network Solution

We need AAA
• AuthN
• Who?
• AuthZ
• What?
• Admission
• How much?

AuthN
• Login via HAW Accounts through LDAP
• “Let’s key the LDAP settings into the K8s LDAP module”
• ...oh... wait...
• Auth Token Webhook in API-Servers
• kubernetes-ldap service forked & extended from Apprenda/Kismatic
• Code: https://github.com/christianhuening/kubernetes-ldap
• API-Server Config:
--authentication-token-webhook-config-file=/etc/kubernetes/ssl/ldap-webhook-config.yaml
--authentication-token-webhook-cache-ttl=30m0s
--runtime-config=authentication.k8s.io/v1beta1=true

AuthN
• Kubernetes-ldap service hosts two endpoints:
• /ldapAuth: Listens for login requests and returns JWT token, exposed
via Ingress
• /authenticate: Endpoint for API-Server to validate incoming tokens
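Sequence diagram (participants: kubelogin, kubectl, K8s-api, K8s-ldap, HAW-IDM):
• kubelogin → K8s-ldap: /ldapAuth
• K8s-ldap → HAW-IDM: LDAP bind, answered with "Bind ok"
• K8s-ldap → kubelogin: 200, JWT token
• kubelogin: Write kube/config
• kubectl → K8s-api: Any API call
• K8s-api → K8s-ldap: /authenticate, answered with OK / NOK
• K8s-api: proceed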
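
The two endpoints can be sketched roughly as follows. This is a minimal illustration, not the kubernetes-ldap code: it swaps the real JWT for a simple HMAC-signed token built from the standard library, skips the LDAP bind, and the names `SECRET`, `issue_token` and `review_token` are hypothetical. Only the TokenReview shape (`spec.token` in, `status.authenticated` and `status.user` out) follows the Kubernetes webhook token authentication API.

```python
import base64
import hashlib
import hmac
import json
import time

SECRET = b"demo-signing-key"        # hypothetical; a real service would use a proper key
TOKEN_TTL_SECONDS = 12 * 60 * 60    # the slides state that tokens are valid for 12 hours


def _b64(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).decode().rstrip("=")


def _unb64(text: str) -> bytes:
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def _sign(payload: str) -> str:
    return _b64(hmac.new(SECRET, payload.encode(), hashlib.sha256).digest())


def issue_token(username, groups, now=None):
    """Roughly what /ldapAuth returns once the LDAP bind has succeeded."""
    now = time.time() if now is None else now
    claims = {"sub": username, "groups": groups, "exp": now + TOKEN_TTL_SECONDS}
    payload = _b64(json.dumps(claims).encode())
    return payload + "." + _sign(payload)


def review_token(token_review, now=None):
    """Roughly what /authenticate answers to the API server's TokenReview."""
    now = time.time() if now is None else now
    response = {
        "apiVersion": "authentication.k8s.io/v1beta1",
        "kind": "TokenReview",
        "status": {"authenticated": False},
    }
    try:
        payload, signature = token_review["spec"]["token"].split(".")
        if not hmac.compare_digest(signature, _sign(payload)):
            return response
        claims = json.loads(_unb64(payload))
        if claims["exp"] < now:
            return response  # token older than 12 hours
    except (KeyError, ValueError, TypeError, AttributeError):
        return response      # malformed request or token
    response["status"] = {
        "authenticated": True,
        "user": {"username": claims["sub"], "groups": claims["groups"]},
    }
    return response
```

The API server caches the webhook answer for the configured `--authentication-token-webhook-cache-ttl`, so the webhook is not hit on every call.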

AuthN
• Users use kubelogin to authenticate
• Creates/Updates ~/.kube/config file
• Set default namespace
• Activate Context
• Stored token is valid for 12 hours
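
What kubelogin does to the config file can be sketched like this. It is a hedged illustration working on an already parsed kubeconfig (a plain dict); the function names, the cluster name `icc` and the context naming scheme are assumptions, only the kubeconfig field names (`users`, `contexts`, `current-context`) are standard.

```python
import copy


def upsert(entries, name, key, value):
    """Replace the entry called `name` in a kubeconfig list, or append it."""
    for entry in entries:
        if entry.get("name") == name:
            entry[key] = value
            return
    entries.append({"name": name, key: value})


def apply_login(config, username, token, namespace, cluster="icc"):
    """Return a copy of `config` with token, default namespace and context set."""
    config = copy.deepcopy(config)
    config.setdefault("apiVersion", "v1")
    config.setdefault("kind", "Config")
    context_name = f"{cluster}-{username}"  # hypothetical naming scheme
    # Create or update the user entry holding the fresh token.
    upsert(config.setdefault("users", []), username, "user", {"token": token})
    # Create or update the context, including the default namespace.
    upsert(
        config.setdefault("contexts", []),
        context_name,
        "context",
        {"cluster": cluster, "user": username, "namespace": namespace},
    )
    # Activate the context.
    config["current-context"] = context_name
    return config
```

Running it a second time with a new token updates the existing entries instead of adding duplicates, which matches the "Creates/Updates" behaviour on the slide.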

LDAP Webhook Config

AuthZ
• Source of Truth required
• Majority of project and course work at HAW is done via Gitlab
• We built a Gitlab Integrator Service which:
• maps Groups, Projects and Personal Repos to Namespaces
• maps User roles from Gitlab to RoleBindings
• also applies PodSecurityPolicies & Docker Registry Secrets
• supports Webhook feature and full-sync every 3 hours
• allows for namespaces to be excluded from synchronization
• kube-system once got “cleaned up”, whoops
• sets up K8s Integration in Gitlab (i.e. for Continuous Delivery)
• can run inside of cluster or externally
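
The mapping idea can be sketched as below. This is not the integrator's code: the numeric access levels (10 Guest to 50 Owner) are GitLab's own constants, but the choice of the default ClusterRoles `view`, `edit` and `admin`, the exclusion list and the naming helpers are assumptions for illustration.

```python
import re

# GitLab access levels: 10 Guest, 20 Reporter, 30 Developer, 40 Maintainer, 50 Owner.
# The ClusterRole on the right is an assumed mapping, not the integrator's.
ACCESS_LEVEL_TO_ROLE = {10: "view", 20: "view", 30: "edit", 40: "admin", 50: "admin"}

# Namespaces the sync must never touch (hypothetical list).
EXCLUDED_NAMESPACES = {"kube-system", "kube-public", "default"}


def namespace_for(gitlab_path):
    """Turn a GitLab path such as 'Group/My_Project' into a DNS-1123 label."""
    name = re.sub(r"[^a-z0-9-]+", "-", gitlab_path.lower()).strip("-")
    return name[:63].rstrip("-")


def role_binding(gitlab_path, username, access_level):
    """Build a RoleBinding manifest, or None if nothing should be created."""
    namespace = namespace_for(gitlab_path)
    if namespace in EXCLUDED_NAMESPACES or access_level not in ACCESS_LEVEL_TO_ROLE:
        return None
    return {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "RoleBinding",
        "metadata": {"name": f"gitlab-{username}", "namespace": namespace},
        "roleRef": {
            "apiGroup": "rbac.authorization.k8s.io",
            "kind": "ClusterRole",
            "name": ACCESS_LEVEL_TO_ROLE[access_level],
        },
        "subjects": [
            {"apiGroup": "rbac.authorization.k8s.io", "kind": "User", "name": username}
        ],
    }
```

Checking the exclusion list before building any manifest is what keeps a sync run away from namespaces like kube-system.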

AuthZ - Custom Roles and Bindings
• Special permissions are
granted through a ConfigMap
• Integrator ensures these are
present in the cluster
• Code: https://github.com/k8s-tamias/gitlab-k8s-integrator

AuthZ
• The service has reached a point where it does too many things
• Reengineering:
• Tenant Operator/Controller
• Adapters for sources of truth like Gitlab, Github, LDAP, etc…
• Discussion at https://goo.gl/CQFvd8
• And in Multi-Tenancy Workgroup:
• Mailing List: https://goo.gl/fZ8g6B
• Slack: https://kubernetes.slack.com/messages/wg-multitenancy
• Come in, join the fun ☺ !

Admission
• Currently: Free4All Model
• We’ll add Quotas soon
• Ideas:
• Resource Leasing between tenants
• Node Ownership & Limited Node Control
• Ongoing discussion: https://goo.gl/vs5A3q

Current Architecture

Ready? Not Yet! More Hardware…
• 6 x Storage Nodes:
• 2 x 10 Core Xeon & 96 GB RAM
• 8 x 4TB HDD @ 7200rpm
• 1 x 2TB NVMe SSD
• 8 x Compute Nodes:
• 1 TB HDD for Images
• 1 x GPU Node:
• 4 x Tesla V100
• 1 x 2TB NVMe SSD
• 32 x 10Gbit Port Cisco Nexus Switch
• Boot via iPXE & ContainerLinux

Compute Node

Storage Node

GPU Node

Storage – Ceph & rook.io
• rook.io on ContainerLinux:
• Runs Ceph cluster as Pods in Kubernetes
• Same benefits for your storage cluster as you have for your apps
• Requires persistent storage for ceph-mon storage to be shutdown/restart-safe
→ Mount /var/lib/rook to extra hard-drive
• BTW: No need for multiple pools due to single, large cluster! ☺

Also: Logging
• No open-source logging solution is capable of multi-tenancy out of the box
• Option 1: Deploy a Graylog+ES to every namespace → 2-4 GB mem
• Option 2: Provide Helm chart for people who want it → won’t be used
• Option 3: Graylog can do it through Streams and Rules in combination with User permissions
• However: it has problems and is slow
• Gets set up via gitlab-integrator
• As I said: it’s doing too many things…

Even More:
• SSL Certificate auto-provisioning via kube-lego
• Discontinued: Need to migrate to cert-manager!
• Monitoring via Prometheus-Operator
• No multi-tenancy yet, suggestions?
• GPGPU pods via https://github.com/NVIDIA/k8s-device-plugin
• And special PSPs in Namespaces via Gitlab-Integrator
• Dynamic Nodes from PC-Pools
• Add up to 1.2 TB memory and 600 cores
• Utilizes the csrapproval-controller (since K8s 1.7)

Summary
• Everything worked fine!
• Without actual users…
• Go-Live in September 2017 (Winter-Semester)
• ~150 concurrent users
• 2 very heavy users (master theses)
• Sort of brought down the cluster several times ☺
• Several problems showed up:

Problems – Control Plane
API Server Metrics:

Problems - Control Plane
• API Servers were running out of capacity:
• Increased memory to 32GB
• Increased Cores to 4
• Increased API Server count to 6
• However: Problems persisted
• kubectl commands timed out
• Deployments didn’t start
• Nodes failed due to API-Servers not responding
• etcd?

Problems – etcd I
• Obviously etcd ran out of memory
• Disable Swap!
• Increase mem to 16 GB per Node

Problems – etcd II
• Switch to pure SSD storage recommended!
Dez 12 09:13:23 icc-etcd-1 etcd[3875]: failed to send out heartbeat on time (exceeded the 100ms timeout for 228.310831ms)
Dez 12 09:13:23 icc-etcd-1 etcd[3875]: server is likely overloaded
Dez 12 09:13:23 icc-etcd-1 etcd[3875]: failed to send out heartbeat on time (exceeded the 100ms timeout for 228.294797ms)
Dez 12 09:13:23 icc-etcd-1 etcd[3875]: server is likely overloaded
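
To see how often and how badly heartbeats are late, warnings like the ones above can be pulled out of the journal. The following is a small sketch, assuming the message format shown in these four log lines; the function name is hypothetical.

```python
import re

# Matches the etcd warning text exactly as it appears in the log lines above.
HEARTBEAT = re.compile(
    r"failed to send out heartbeat on time "
    r"\(exceeded the (?P<limit>[\d.]+)ms timeout for (?P<over>[\d.]+)ms\)"
)


def heartbeat_overruns(lines):
    """Return (timeout_ms, exceeded_by_ms) for every late-heartbeat warning."""
    result = []
    for line in lines:
        match = HEARTBEAT.search(line)
        if match:
            result.append((float(match["limit"]), float(match["over"])))
    return result
```

Feeding it the etcd journal output gives one tuple per warning; lines such as "server is likely overloaded" are ignored.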

Problems – etcd III - ‘large-scale’
• 36 Nodes + 75 dynamic Nodes
• 2147 Namespaces
• 908 - 2500 Pods
• 10538 RoleBindings
• High Pod churn
https://coreos.com/etcd/docs/latest/op-guide/hardware.html

Problems – etcd III
• We hit etcd’s default storage limit of 2GB
• etcd only accepted READ and DELETE requests
• Increase the size via --quota-backend-bytes flag
• Max is 8GB
• Effectively caused downtime for 1 day; services remained up
• Recovery took about 7 hours at full utilization
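
The flag takes its value in bytes, which is easy to get wrong by a factor of 1024. A tiny helper, assuming the 8GB ceiling from the slide is meant as 8 GiB, could look like this (the function name is hypothetical):

```python
GIB = 1024 ** 3
MAX_QUOTA_GIB = 8  # ceiling named on the slide, interpreted here as GiB


def quota_flag(gib):
    """Build the etcd --quota-backend-bytes flag for a quota given in whole GiB."""
    if not isinstance(gib, int) or not 0 < gib <= MAX_QUOTA_GIB:
        raise ValueError(f"quota must be a whole number of GiB between 1 and {MAX_QUOTA_GIB}")
    return f"--quota-backend-bytes={gib * GIB}"
```

For example, `quota_flag(8)` yields `--quota-backend-bytes=8589934592`.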

Other Performance Impacts
• kube-state-metrics’ pod_nanny required higher settings (extra_mem = 150Mi) per Node due to higher pod churn

Lessons Learned
• Large-Scale is not necessarily bound to #nodes
• etcd really is your Pet and you want to make it feel good
• Multi-Tenancy possible but complex
• Requires especially good monitoring, logging & auditing
• Students are very curious and use the new technologies

What’s next?
• Node Security & Container Isolation
• Network Policies
• Resource Management via Self-Service (tamias.io)
• Priorities / kube-arbitrator
• Improve usage of owned, but idle resources
• PodTolerationRestriction Controller
• IPv6 & multi-network setup (IoT research et al.)

Thanks for listening!
