One Cluster to Serve Them All
How to run a multi-tenant K8s cluster for 1000+ users in research and education at a university
06.02.18 christian.huening@haw-hamburg.de | Twitter: @chrishuen 1
First: 2 Things
1. Wide Range of Experience
2. Resources at Universities
University Requirements
• Flexible compute resources for Research & Teaching purposes
• Students: Try technologies, host small services etc.
• Research projects: Host project websites, services and run large workloads in the cloud
• Must be simple to use but allow for complex setups!
• Large variety in technologies!
• 1000+ students
• AWS, Azure, GKE etc. are not an option due to administrative restrictions
Multi-Tenancy @ HAW
• Lab & Research projects each buy their own resources:
• Setup consumes too much time → project has ended before anything runs
• Large vendor variety → very hard to maintain
• Objectives:
• Consolidate heterogeneous compute resources
• Datacenter De-Fragmentation → due to scarcity of power, cooling and space
• Goals:
• Democratize Compute Resources
• Increase Research & Development ramp-up speed and efficiency
• Improve Resource Utilization
• Simplify usage
Worked well, but...
• VMware at scale is too expensive
• Resources became scarce as people demanded larger VM instances
• Also: lack of flexibility
• VMs are never returned
• VMs never get patched → users need to maintain operating systems (hint: they won’t)
• Problems with security rules: either too strict or too lax → users unsatisfied
Containers to the Rescue!
• Lightweight
• Fast
• Flexible
• Resource Efficient
… you know it
• But:
Requires Orchestration
• Enter Kubernetes
Multiple Clusters?
• Requires the skill to run K8s
• Even if setup is automated:
• Still leaves configuration of cluster to the users
• Does not help in error cases
• Does not help with special setups
• Essentially same provisioning problem as with VMs
• aka: Who gets how many resources and when?
Other reasons for single-cluster
• “(…) Not needing to deploy and monitor multiple clusters (i.e. build all the tooling we did to run GKE at Google)” – David Oppenheimer
• “(…) with the increasing emergence of "secure container" technologies, this tendency will only increase, primarily driven by resource cost considerations” – Quinton Hoole
• Source: https://goo.gl/ypCtzg
To the multi-tenant cluster we go!
Initial Cluster Setup
• Kubernetes the Hard Way (https://github.com/kelseyhightower/kubernetes-the-hard-way)
→ ICC the Hard Way (https://github.com/christianhuening/kubernetes-the-haw-hamburg-way)
• 3 Master Nodes
• VM, 1 Core, 4 GB
• 3 Worker Nodes
• Bare Metal, 8 Core, 128GB
• 5 Node etcd cluster
• VM, 1 Core, 8 GB, HDD Storage
• Canal (Flannel + Calico) as overlay network solution
We need AAA
• AuthN
• Who?
• AuthZ
• What?
• Admission
• How much?
AuthN
• Login via HAW Accounts through LDAP
• “Let’s key the LDAP settings into the K8s LDAP module”
• ...oh... wait...
• Auth Token Webhook in API-Servers
• kubernetes-ldap service forked & extended from Apprenda/Kismatic
• Code: https://github.com/christianhuening/kubernetes-ldap
• API-Server Config:
--authentication-token-webhook-config-file=/etc/kubernetes/ssl/ldap-webhook-config.yaml
--authentication-token-webhook-cache-ttl=30m0s
--runtime-config=authentication.k8s.io/v1beta1=true
AuthN
• kubernetes-ldap service hosts two endpoints:
• /ldapAuth: Listens for login requests and returns a JWT token, exposed via Ingress
• /authenticate: Endpoint for API-Server to validate incoming tokens
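To make the /authenticate contract more concrete, here is a minimal Python sketch of the TokenReview exchange that a token webhook answers. The request and response shapes follow the Kubernetes authentication.k8s.io/v1beta1 TokenReview object; the verify_token callback, the demo token and the user and group names are assumptions for illustration and are not taken from the kubernetes-ldap code.

```python
from typing import Callable, Optional

API_VERSION = "authentication.k8s.io/v1beta1"


def review_token(token_review: dict,
                 verify_token: Callable[[str], Optional[dict]]) -> dict:
    """Answer a TokenReview request the way a token webhook would.

    verify_token is a hypothetical callback: it returns a dict with
    "username" and "groups" for a valid token, or None otherwise.
    """
    token = token_review.get("spec", {}).get("token", "")
    claims = verify_token(token) if token else None

    status: dict = {"authenticated": claims is not None}
    if claims is not None:
        status["user"] = {
            "username": claims["username"],
            "groups": list(claims.get("groups", [])),
        }
    return {"apiVersion": API_VERSION, "kind": "TokenReview", "status": status}


# Stand-in verifier for the example: accepts exactly one hard-coded token.
def demo_verify(token: str) -> Optional[dict]:
    if token == "valid-demo-token":
        return {"username": "student1", "groups": ["students"]}
    return None


request = {
    "apiVersion": API_VERSION,
    "kind": "TokenReview",
    "spec": {"token": "valid-demo-token"},
}
ok = review_token(request, demo_verify)
rejected = review_token({"spec": {"token": "bogus"}}, demo_verify)
```

In a real service the callback would check the JWT signature and expiry; the API server then caches the answer for the configured webhook cache TTL.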
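Login flow (sequence diagram: kubelogin, kubectl, K8s-api, K8s-ldap, HAW-IDM):
• kubelogin → K8s-ldap: /ldapAuth
• K8s-ldap → HAW-IDM: LDAP bind → Bind ok
• K8s-ldap → kubelogin: 200, JWT token
• kubelogin: Write kube/config
• kubectl → K8s-api: Any API call
• K8s-api → K8s-ldap: /authenticate → OK / NOK
• K8s-api: proceed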
AuthN
• Users use kubelogin to authenticate
• Creates/updates the ~/.kube/config file
• Sets the default namespace
• Activates the context
• Stored token is valid for 12 hours
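The kubelogin steps above can be sketched roughly as follows. This is not the actual kubelogin code: the function names, the context naming scheme, the cluster name "icc" as used here and the server URL are assumptions. Only the kubeconfig layout (clusters, users, contexts, current-context) and the 12-hour validity come from Kubernetes and the slide respectively.

```python
import json
from datetime import datetime, timedelta, timezone

TOKEN_LIFETIME = timedelta(hours=12)  # validity period stated on the slide


def upsert_login(config: dict, *, cluster: str, server: str, user: str,
                 token: str, namespace: str) -> dict:
    """Return a copy of a kubeconfig dict with the login applied."""
    def replace(entries, new):
        # Drop any entry with the same name, then append the fresh one.
        return [e for e in entries if e.get("name") != new["name"]] + [new]

    context = f"{user}@{cluster}"  # hypothetical naming scheme
    updated = dict(config)
    updated.setdefault("apiVersion", "v1")
    updated.setdefault("kind", "Config")
    updated["clusters"] = replace(
        config.get("clusters", []),
        {"name": cluster, "cluster": {"server": server}})
    updated["users"] = replace(
        config.get("users", []),
        {"name": user, "user": {"token": token}})
    updated["contexts"] = replace(
        config.get("contexts", []),
        {"name": context,
         "context": {"cluster": cluster, "user": user,
                     "namespace": namespace}})
    updated["current-context"] = context  # activate the context
    return updated


def token_expired(issued_at: datetime, now: datetime) -> bool:
    """True once the 12-hour validity window has passed."""
    return now - issued_at >= TOKEN_LIFETIME


cfg = upsert_login({}, cluster="icc", server="https://k8s.example.invalid:6443",
                   user="student1", token="first-token", namespace="student1")
# Logging in again replaces the stored token instead of adding a second user.
cfg = upsert_login(cfg, cluster="icc", server="https://k8s.example.invalid:6443",
                   user="student1", token="second-token", namespace="student1")
print(json.dumps(cfg, indent=2))  # JSON is also valid YAML
```

Replacing entries by name keeps repeated logins idempotent, so the file does not grow with every token refresh.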
LDAP Webhook Config
AuthZ
• Source of Truth required
• Majority of project and course work at HAW is done via Gitlab
• We built a Gitlab Integrator Service which:
• maps Groups, Projects and Personal Repos to Namespaces
• maps User roles from Gitlab to RoleBindings
• also applies PodSecurityPolicies & Docker Registry Secrets
• supports Webhook feature and full-sync every 3 hours
• allows for namespaces to be excluded from synchronization
• (we learned this when kube-system got “cleaned up”, whoops)
• sets up K8s Integration in Gitlab (i.e. for Continuous Delivery)
• can run inside of cluster or externally
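As a rough illustration of the mapping idea, the Python sketch below turns a Gitlab path into a namespace name and a Gitlab member into a RoleBinding manifest. The numeric access levels (Guest 10 to Owner 50) are Gitlab's; the choice of the built-in view/edit/admin ClusterRoles, the naming scheme and the example user are assumptions and may differ from what the integrator actually does.

```python
import re

# Gitlab's numeric access levels (Guest=10 ... Owner=50) mapped to
# Kubernetes ClusterRole names. The mapping itself is an assumption.
ACCESS_LEVEL_TO_ROLE = {
    10: "view",
    20: "view",
    30: "edit",
    40: "admin",
    50: "admin",
}


def namespace_for(gitlab_path: str) -> str:
    """Turn a Gitlab group/project path into a DNS-1123 label."""
    name = re.sub(r"[^a-z0-9-]+", "-", gitlab_path.lower()).strip("-")
    return name[:63].rstrip("-")  # namespace names are limited to 63 chars


def role_binding(namespace: str, username: str, access_level: int) -> dict:
    """Build a RoleBinding manifest for one Gitlab member."""
    role = ACCESS_LEVEL_TO_ROLE[access_level]
    return {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "RoleBinding",
        "metadata": {"name": f"{username}-{role}", "namespace": namespace},
        "subjects": [{
            "kind": "User",
            "name": username,
            "apiGroup": "rbac.authorization.k8s.io",
        }],
        "roleRef": {
            "kind": "ClusterRole",
            "name": role,
            "apiGroup": "rbac.authorization.k8s.io",
        },
    }


ns = namespace_for("Research/IoT_Lab")         # hypothetical Gitlab group
binding = role_binding(ns, "student1", 30)     # 30 = Developer
```

Binding to a ClusterRole from a namespaced RoleBinding keeps the role definitions in one place while limiting each grant to the tenant's own namespace.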
AuthZ - Custom Roles and Bindings
• Special permissions are granted through a ConfigMap
• Integrator ensures these are present in the cluster
• Code: https://github.com/k8s-tamias/gitlab-k8s-integrator
AuthZ
• The service has reached a point where it does too many things
• Reengineering:
• Tenant Operator/Controller
• Adapters for sources of truth like Gitlab, Github, LDAP, etc…
• Discussion at https://goo.gl/CQFvd8
• And in Multi-Tenancy Workgroup:
• Mailing List: https://goo.gl/fZ8g6B
• Slack: https://kubernetes.slack.com/messages/wg-multitenancy
• Come in, join the fun ☺ !
Admission
• Currently: Free4All Model
• We’ll add Quotas soon
• Ideas:
• Resource Leasing between tenants
• Node Ownership & Limited Node Control
• Ongoing discussion: https://goo.gl/vs5A3q
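For the planned quotas, a per-namespace ResourceQuota is the standard Kubernetes building block. The sketch below builds such a manifest in Python; the quota name and all limit values are placeholders chosen for illustration, not figures from the talk.

```python
import json


def resource_quota(namespace: str, cpu: str, memory: str, pods: int) -> dict:
    """Build a ResourceQuota manifest capping one tenant namespace."""
    return {
        "apiVersion": "v1",
        "kind": "ResourceQuota",
        "metadata": {"name": "tenant-quota", "namespace": namespace},
        "spec": {"hard": {
            "requests.cpu": cpu,        # sum of CPU requests in the namespace
            "requests.memory": memory,  # sum of memory requests
            "pods": str(pods),          # number of non-terminal pods
        }},
    }


# Placeholder values; a real policy would come from the tenant's role.
quota = resource_quota("student1", cpu="2", memory="4Gi", pods=10)
print(json.dumps(quota, indent=2))
```

Once a quota on compute requests exists in a namespace, pods without matching resource requests are rejected, so such a quota is usually paired with a LimitRange that supplies defaults.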
Current Architecture
Ready? Not Yet! More Hardware…
• 6 x Storage Nodes:
• 2x10 Core Xeon & 96 GB RAM
• 8 x 4TB HDD @ 7200rpm
• 1 x 2TB NVMe SSD
• 8 x Compute Nodes:
• 2x10 Core Xeon & 192 GB RAM
• 1 TB HDD for Images
• 1 x GPU Node:
• 2x10 Core Xeon & 768 GB RAM
• 4 x Tesla V100
• 1 x 2TB NVMe SSD
• 32 x 10Gbit Port Cisco Nexus Switch
• Boot via iPXE & ContainerLinux
Compute Node
Storage Node
GPU Node
Storage – CEPH & rook.io
• rook.io on ContainerLinux:
• Runs CEPH cluster as Pods in Kubernetes
• Same benefits for your storage cluster as you have for your apps
• Requires persistent storage for ceph-mon data to be shutdown/restart-safe
→ Mount /var/lib/rook on an extra hard drive
• BTW: No need for multiple pools due to single, large cluster! ☺
Also: Logging
• No open-source logging solution capable of multi-tenancy out-of-the-box
• Option 1: Deploy Graylog + ES to every namespace → 2-4 GB memory each
• Option 2: Provide a Helm chart for people who want it → won’t be used
• Option 3: Graylog can do it through Streams and Rules in combination with user permissions
• However: has problems and is slow
• Gets set up via gitlab-integrator
• As I said: it’s doing too many things…
Even More:
• SSL Certificate auto-provisioning via kube-lego
• Discontinued: Need to migrate to cert-manager!
• Monitoring via Prometheus-Operator
• No multi-tenancy yet, suggestions?
• GPGPU pods via https://github.com/NVIDIA/k8s-device-plugin
• And special PSPs in Namespaces via Gitlab-Integrator
• Dynamic Nodes from PC-Pools
• Add up to 1.2 TB memory and 600 cores
• Utilizes the csrapproval-controller (since K8s 1.7)
Summary
• Everything worked fine!
• Without actual users…
• Go-Live in September 2017 (Winter-Semester)
• ~150 concurrent users
• 2 very heavy users (master theses)
• Sort of brought down the cluster several times ☺
• Several problems showed up:
Problems – Control Plane
API Server Metrics:
Problems - Control Plane
• API Servers were running out of capacity:
• Increased memory to 32GB
• Increased Cores to 4
• Increased API Server count to 6
• However: Problems persisted
• kubectl commands timed out
• Deployments didn’t start
• Nodes failed due to API-Servers not responding
• etcd?
Problems – etcd I
• Obviously etcd ran out of memory
• Disable Swap!
• Increase mem to 16 GB per Node
Problems – etcd II
• Switch to pure SSD storage recommended!
Dez 12 09:13:23 icc-etcd-1 etcd[3875]: failed to send out heartbeat on time (exceeded the 100ms timeout for 228.310831ms)
Dez 12 09:13:23 icc-etcd-1 etcd[3875]: server is likely overloaded
Dez 12 09:13:23 icc-etcd-1 etcd[3875]: failed to send out heartbeat on time (exceeded the 100ms timeout for 228.294797ms)
Dez 12 09:13:23 icc-etcd-1 etcd[3875]: server is likely overloaded
Problems – etcd III - ‘large-scale’
• 36 Nodes + 75 dynamic Nodes
• 2147 Namespaces
• 908 - 2500 Pods
• 10538 RoleBindings
• High Pod churn
https://coreos.com/etcd/docs/latest/op-guide/hardware.html
Problems – etcd III
• We hit etcd’s default storage limit of 2GB
• etcd only accepted READ and DELETE requests
• Increase the size via the --quota-backend-bytes flag
• Suggested maximum is 8GB
• Effectively caused downtime for 1 day; services remained up
• Recovery took about 7 hours at full utilization
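Since --quota-backend-bytes takes a raw byte count, a small worked calculation helps avoid off-by-a-factor mistakes. The Python sketch below derives the flag value for a given size and adds a simple usage check; the helper names and the 80% warning threshold are assumptions for illustration.

```python
GIB = 1024 ** 3

DEFAULT_QUOTA_BYTES = 2 * GIB    # etcd's default backend quota
SUGGESTED_MAX_BYTES = 8 * GIB    # suggested upper bound


def quota_flag(gib: int) -> str:
    """Render the etcd flag for a quota given in GiB (1 to 8)."""
    size = gib * GIB
    if not 0 < size <= SUGGESTED_MAX_BYTES:
        raise ValueError("quota should be between 1 and 8 GiB")
    return f"--quota-backend-bytes={size}"


def near_quota(db_size_bytes: int, quota_bytes: int,
               threshold: float = 0.8) -> bool:
    """True when the database has used at least `threshold` of its quota."""
    return db_size_bytes >= threshold * quota_bytes


flag = quota_flag(8)
```

Alerting on a check like near_quota before the limit is reached leaves time to compact and defragment instead of discovering the limit through rejected writes.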
Other Performance Impacts
• kube-state-metrics’ pod_nanny required higher settings (extra_mem = 150Mi per node) due to higher pod churn
Lessons Learned
• Large-scale is not necessarily bound to #nodes
• etcd really is your pet and you want to make it feel good
• Multi-tenancy is possible but complex
• Requires especially good monitoring, logging & auditing
• Students are very curious and use the new technologies
What’s next?
• Node Security & Container Isolation
• Network Policies
• Resource Management via Self-Service (tamias.io)
• Priorities / kube-arbitrator
• Improve usage of owned, but idle resources
• PodTolerationRestriction Controller
• IPv6 & multi-network setup (IoT research et al.)
Thanks for listening!
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 

One cluster to serve them all

  • 1. One Cluster to Serve Them All How to run a multi-tenant K8s cluster for 1000+ users in research and education at a University 06.02.18 christian.huening@haw-hamburg.de | Twitter: @chrishuen 1
  • 2. First: 2 Things
  • 3. 1. Wide Range of Experience
  • 4. 2. Resources at Universities
  • 5. University Requirements • Flexible compute resources for Research & Teaching purposes • Students: Try technologies, host small services etc. • Research projects: Host project websites, services and run large workloads in the cloud • Must be simple to use but allow for complex setups! • Large variety in technologies! • 1000+ students • AWS, Azure, GKE etc. not an option due to administrative restrictions
  • 6. Multi-Tenancy @ HAW • Lab & Research projects each buy their own resources: • Setup consumes too much time → Project is over before anything runs • Large vendor variety → very hard to maintain • Objectives: • Consolidate heterogeneous compute resources • Datacenter De-Fragmentation → Due to scarcity of power, cooling and space • Goals: • Democratize Compute Resources • Increase Research & Development ramp-up speed and efficiency • Improve Resource Utilization • Simplify usage
  • 8. Worked well, but... • VMware at scale is too expensive • Resources became scarce as people demanded larger VM instances • Also: lack of flexibility • VMs are never returned • VMs never get patched → Users need to maintain Operating Systems (hint: they won’t) • Problems with security rules: Either too strict or too weak → Users unsatisfied
  • 9. Containers to the Rescue! • Lightweight • Fast • Flexible • Resource Efficient … you know it • But: Requires Orchestration • Enter Kubernetes
  • 10. Multiple Clusters? • Requires the skill to run K8s • Even if setup is automated: • Still leaves configuration of cluster to the users • Does not help in error cases • Does not help with special setups • Essentially same provisioning problem as with VMs • aka: Who gets how many resources and when?
  • 11. Other reasons for single-cluster • “(…) Not needing to deploy and monitor multiple clusters (i.e. build all the tooling we did to run GKE at Google)” – David Oppenheimer • “(…) with the increasing emergence of "secure container" technologies, this tendency will only increase, primarily driven by resource cost considerations” – Quinton Hoole • Source: https://goo.gl/ypCtzg
  • 12. To the multi-tenant cluster we go!
  • 13. Initial Cluster Setup • Kubernetes the Hard Way (https://github.com/kelseyhightower/kubernetes-the-hard-way) → ICC the Hard Way (https://github.com/christianhuening/kubernetes-the-haw-hamburg-way) • 3 Master Nodes • VM, 1 Core, 4 GB • 3 Worker Nodes • Bare Metal, 8 Core, 128 GB • 5 Node etcd cluster • VM, 1 Core, 8 GB, HDD Storage • Canal (Flannel + Calico) as Overlay Network Solution
  • 14. We need AAA • AuthN • Who? • AuthZ • What? • Admission • How much?
  • 15. AuthN • Login via HAW Accounts through LDAP • "Let's key in LDAP settings into the K8s LDAP module" • ...oh... wait... • Auth Token Webhook in API-Servers • kubernetes-ldap service forked & extended from Apprenda/Kismatic • Code: https://github.com/christianhuening/kubernetes-ldap • API-Server Config: --authentication-token-webhook-config-file=/etc/kubernetes/ssl/ldap-webhook-config.yaml --authentication-token-webhook-cache-ttl=30m0s --runtime-config=authentication.k8s.io/v1beta1=true
  • 16. AuthN • Kubernetes-ldap service hosts two endpoints: • /ldapAuth: Listens for login requests and returns JWT token, exposed via Ingress • /authenticate: Endpoint for API-Server to validate incoming tokens • [Sequence diagram between kubelogin, kubectl, K8s-api, K8s-ldap and HAW-IDM: kubelogin calls /ldapAuth; K8s-ldap performs an LDAP bind against HAW-IDM (bind ok) and returns 200 with a JWT token; kubelogin writes kube/config; on any API call, K8s-api calls /authenticate on K8s-ldap and proceeds on OK / rejects on NOK]
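The /authenticate step can be pictured with a small sketch. It is not the kubernetes-ldap code: the HS256 signing scheme, the `SECRET` key, and the helper names `issue_token` and `review_token` are assumptions made for illustration. What it does follow is the shape of the `TokenReview` object that the API server posts to a token webhook and expects back, and the 12-hour token lifetime stated in the deck.

```python
import base64
import hashlib
import hmac
import json
import time

SECRET = b"demo-secret"  # hypothetical signing key, not from the talk


def _b64url(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()


def _b64url_decode(text: str) -> bytes:
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def _sign(signing_input: str) -> str:
    return _b64url(hmac.new(SECRET, signing_input.encode(), hashlib.sha256).digest())


def issue_token(username, groups, ttl_seconds=12 * 3600, now=None):
    """What /ldapAuth might return after a successful LDAP bind (assumed HS256 JWT)."""
    now = int(time.time()) if now is None else now
    header = _b64url(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    claims = {"sub": username, "groups": groups, "exp": now + ttl_seconds}
    payload = _b64url(json.dumps(claims).encode())
    return f"{header}.{payload}.{_sign(f'{header}.{payload}')}"


def review_token(token_review, now=None):
    """What /authenticate might do: validate spec.token and fill in status."""
    now = int(time.time()) if now is None else now
    status = {"authenticated": False}
    parts = token_review.get("spec", {}).get("token", "").split(".")
    if len(parts) == 3:
        header, payload, signature = parts
        expected = _sign(f"{header}.{payload}")
        if hmac.compare_digest(signature.encode(), expected.encode()):
            claims = json.loads(_b64url_decode(payload))
            if claims.get("exp", 0) > now:
                status = {
                    "authenticated": True,
                    "user": {"username": claims["sub"], "groups": claims.get("groups", [])},
                }
    return {
        "apiVersion": "authentication.k8s.io/v1beta1",
        "kind": "TokenReview",
        "status": status,
    }


if __name__ == "__main__":
    token = issue_token("alice", ["students"])
    print(json.dumps(review_token({"spec": {"token": token}}), indent=2))
```

The API server only looks at `status`; with the cache TTL of 30 minutes configured above, a positive answer is reused for that long before the webhook is asked again.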
  • 17. AuthN • Users use kubelogin to authenticate • Creates/Updates ~/.kube/config file • Set default namespace • Activate Context • Stored token is valid for 12 hours
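A rough sketch of what a login helper has to write is shown here. It does not reproduce kubelogin itself; the function name `build_kubeconfig`, the cluster name `icc`, and the server URL are placeholders. The field names follow the standard kubeconfig layout (`clusters`, `users`, `contexts`, `current-context`), and the output is JSON, which is also valid YAML.

```python
import json


def build_kubeconfig(server, username, token, namespace, cluster_name="icc"):
    """Assemble a minimal kubeconfig with a bearer token and a default namespace."""
    context_name = f"{username}@{cluster_name}"  # hypothetical naming scheme
    return {
        "apiVersion": "v1",
        "kind": "Config",
        "clusters": [{"name": cluster_name, "cluster": {"server": server}}],
        "users": [{"name": username, "user": {"token": token}}],
        "contexts": [
            {
                "name": context_name,
                "context": {
                    "cluster": cluster_name,
                    "user": username,
                    "namespace": namespace,  # "Set default namespace"
                },
            }
        ],
        "current-context": context_name,  # "Activate Context"
    }


if __name__ == "__main__":
    config = build_kubeconfig(
        "https://k8s.example.invalid:6443", "alice", "<JWT from /ldapAuth>", "alice"
    )
    print(json.dumps(config, indent=2))
```

Since the stored token expires after 12 hours, a helper like this would simply be run again to overwrite the `users` entry.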
  • 18. LDAP Webhook Config
  • 19. AuthZ • Source of Truth required • Majority of project and course work at HAW is done via Gitlab • We built a Gitlab Integrator Service which: • maps Groups, Projects and Personal Repos to Namespaces • maps User roles from Gitlab to RoleBindings • also applies PodSecurityPolicies & Docker Registry Secrets • supports Webhook feature and full-sync every 3 hours • allows for namespaces to be excluded from synchronization • kube-system “cleaned up”, whooops • sets up K8s Integration in Gitlab (i.e. for Continuous Delivery) • can run inside of cluster or externally
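The mapping idea can be sketched as follows. This is not taken from the integrator's source: the namespace naming rule and the assignment of Gitlab access levels to the built-in `view`, `edit` and `admin` ClusterRoles are assumptions. The numeric access levels (10 Guest up to 50 Owner) are Gitlab's own, and the manifest follows the `rbac.authorization.k8s.io/v1` RoleBinding schema.

```python
import re

# Assumed mapping from Gitlab access level to a built-in ClusterRole.
ACCESS_LEVEL_TO_ROLE = {10: "view", 20: "view", 30: "edit", 40: "admin", 50: "admin"}


def namespace_for(gitlab_path):
    """Turn a Gitlab group/project path into a valid namespace name (DNS label)."""
    name = re.sub(r"[^a-z0-9-]+", "-", gitlab_path.lower()).strip("-")
    return name[:63].rstrip("-")


def role_binding(gitlab_path, username, access_level):
    """Build a RoleBinding granting the mapped ClusterRole inside the namespace."""
    role = ACCESS_LEVEL_TO_ROLE[access_level]
    return {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "RoleBinding",
        "metadata": {
            "name": f"{username}-{role}",
            "namespace": namespace_for(gitlab_path),
        },
        "subjects": [
            {"kind": "User", "name": username, "apiGroup": "rbac.authorization.k8s.io"}
        ],
        "roleRef": {
            "kind": "ClusterRole",
            "name": role,
            "apiGroup": "rbac.authorization.k8s.io",
        },
    }


if __name__ == "__main__":
    print(role_binding("Research/IoT_Lab", "alice", 30))
```

An exclusion list for namespaces such as kube-system would sit in front of any code that deletes or rewrites bindings, which is the lesson behind the "cleaned up" remark.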
  • 20. AuthZ - Custom Roles and Bindings • Special permissions are granted through a ConfigMap • Integrator ensures these are present in the cluster • Code: https://github.com/k8s-tamias/gitlab-k8s-integrator
  • 21. AuthZ • The service has reached a point where it does too many things • Reengineering: • Tenant Operator/Controller • Adapters for sources of truth like Gitlab, Github, LDAP, etc… • Discussion at https://goo.gl/CQFvd8 • And in Multi-Tenancy Workgroup: • Mailing List: https://goo.gl/fZ8g6B • Slack: https://kubernetes.slack.com/messages/wg-multitenancy • Come in, join the fun ☺ !
  • 22. Admission • Currently: Free4All Model • We’ll add Quotas soon • Ideas: • Resource Leasing between tenants • Node Ownership & Limited Node Control • Ongoing discussion: https://goo.gl/vs5A3q
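For the planned quotas, a per-namespace `ResourceQuota` is the usual building block. The sketch below only generates such a manifest; the quota name `tenant-quota` and the default limits are made-up values, not numbers from the talk.

```python
def resource_quota(namespace, cpu="4", memory="8Gi", pods="20"):
    """Build a ResourceQuota manifest capping a tenant namespace (values are examples)."""
    return {
        "apiVersion": "v1",
        "kind": "ResourceQuota",
        "metadata": {"name": "tenant-quota", "namespace": namespace},
        "spec": {
            "hard": {
                "requests.cpu": cpu,
                "requests.memory": memory,
                "limits.cpu": cpu,
                "limits.memory": memory,
                "pods": pods,
            }
        },
    }


if __name__ == "__main__":
    print(resource_quota("research-iot-lab"))
```

Once a quota on compute resources exists in a namespace, pods there must declare requests and limits for those resources, so a default `LimitRange` is typically rolled out alongside it.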
  • 24. Ready? Not Yet! More Hardware… • 6 Storage Nodes: • 2x10 Core Xeon & 96 GB RAM • 8 x 4TB HDD @ 7200rpm • 1 x 2TB NVMe SSD • 8 Compute Nodes: • 2x10 Core Xeon & 192 GB RAM • 1 TB HDD for Images • 1 GPU Node: • 2x10 Core Xeon & 768 GB RAM • 4 x Tesla V100 • 1 x 2TB NVMe SSD • 32 x 10Gbit Port Cisco Nexus Switch • Boot via iPXE & ContainerLinux
  • 28. Storage – CEPH & rook.io • rook.io on ContainerLinux: • Runs CEPH cluster as Pods in Kubernetes • Same benefits for your storage cluster as you have for your apps • Requires persistent storage for ceph-mon storage to be shutdown/restart-safe → Mount /var/lib/rook to extra hard-drive • BTW: No need for multiple pools due to single, large cluster! ☺
  • 29. Also: Logging • No OpenSource Logging solution capable of multi-tenancy out-of-the-box • Option 1: Deploy a Graylog+ES to every namespace → 2-4 GB mem • Option 2: Provide Helm chart for people who want it → won't be used • Option 3: Graylog can do it through Streams and Rules in combination with User permissions • However: problematic and slow ☹ • Gets set up via gitlab-integrator • As I said: it's doing too many things…
  • 30. Even More: • SSL Certificate auto-provisioning via kube-lego • Discontinued: Need to migrate to cert-manager! • Monitoring via Prometheus-Operator • No multi-tenancy yet, suggestions? • GPGPU pods via https://github.com/NVIDIA/k8s-device-plugin • And special PSPs in Namespaces via Gitlab-Integrator • Dynamic Nodes from PC-Pools • Add up to 1.2 TB memory and 600 cores • Utilizes the csrapproval-controller (since K8s 1.7)
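With the NVIDIA device plugin installed, a pod asks for GPUs through the extended resource `nvidia.com/gpu` in its container limits. The sketch below builds such a pod manifest; the function name, pod name, and image are placeholders.

```python
def gpu_pod(name, namespace, image, gpus=1):
    """Build a Pod manifest that requests `gpus` NVIDIA GPUs via the device plugin."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "restartPolicy": "Never",
            "containers": [
                {
                    "name": "main",
                    "image": image,  # placeholder image name
                    "resources": {"limits": {"nvidia.com/gpu": gpus}},
                }
            ],
        },
    }


if __name__ == "__main__":
    print(gpu_pod("train-job", "research-iot-lab", "example.invalid/cuda-job:latest", gpus=2))
```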
  • 31. Summary • Everything worked fine! • Without actual users… • Go-Live in September 2017 (Winter-Semester) • ~150 concurrent users • 2 very heavy users (master theses) • Sort of brought down the cluster several times ☺ • Several problems showed up:
  • 32. Problems – Control Plane • [Figure: API Server metrics]
  • 33. Problems - Control Plane • API Servers were running out of capacity: • Increased memory to 32GB • Increased Cores to 4 • Increased API Server count to 6 • However: Problems persisted • kubectl commands timed out • Deployments didn’t start • Nodes failed due to API-Servers not responding • etcd?
  • 34. Problems – etcd I • Obviously etcd ran out of memory • Disable Swap! • Increase mem to 16 GB per Node
  • 35. Problems – etcd II • etcd log: Dez 12 09:13:23 icc-etcd-1 etcd[3875]: failed to send out heartbeat on time (exceeded the 100ms timeout for 228.310831ms) Dez 12 09:13:23 icc-etcd-1 etcd[3875]: server is likely overloaded Dez 12 09:13:23 icc-etcd-1 etcd[3875]: failed to send out heartbeat on time (exceeded the 100ms timeout for 228.294797ms) Dez 12 09:13:23 icc-etcd-1 etcd[3875]: server is likely overloaded • Switch to pure SSD storage recommended!
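Heartbeat warnings of this kind are easy to turn into a number worth alerting on. The sketch below parses the message format shown in the log excerpt; it assumes both durations are printed in milliseconds, as they are there, and the function name `heartbeat_overshoot` is made up.

```python
import re

# Matches the etcd warning as it appears in the excerpt above (ms units assumed).
HEARTBEAT_RE = re.compile(
    r"failed to send out heartbeat on time "
    r"\(exceeded the (?P<timeout>\d+(?:\.\d+)?)ms timeout for (?P<over>\d+(?:\.\d+)?)ms\)"
)


def heartbeat_overshoot(line):
    """Return (timeout_ms, exceeded_by_ms) for a heartbeat warning, else None."""
    match = HEARTBEAT_RE.search(line)
    if match is None:
        return None
    return float(match.group("timeout")), float(match.group("over"))


if __name__ == "__main__":
    sample = (
        "Dez 12 09:13:23 icc-etcd-1 etcd[3875]: failed to send out heartbeat on time "
        "(exceeded the 100ms timeout for 228.310831ms)"
    )
    print(heartbeat_overshoot(sample))
```

Counting these matches per minute gives a crude indicator of slow disk or an overloaded etcd member before the API servers start timing out.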
  • 36. Problems – etcd III - ‘large-scale’ • 36 Nodes + 75 dynamic Nodes • 2147 Namespaces • 908 - 2500 Pods • 10538 RoleBindings • High Pod churn • Reference: https://coreos.com/etcd/docs/latest/op-guide/hardware.html
  • 37. Problems – etcd III • We hit etcd's default storage limit of 2GB • etcd only accepted READ and DELETE requests • Increase the size via --quota-backend-bytes flag • Max is 8GB • Effectively caused downtime for 1 day; services remained up • Recovery took about 7 hours at full utilization
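The flag takes a value in bytes, so the sizes above have to be converted. A tiny sketch, assuming the 2GB and 8GB figures are meant as binary gigabytes (GiB); the helper name `quota_flag` is made up, while `--quota-backend-bytes` is the flag named above.

```python
GIB = 1024 ** 3  # bytes per GiB

MAX_QUOTA_GIB = 8  # upper bound stated in the talk


def quota_flag(gib):
    """Render the etcd backend quota flag for a size given in GiB."""
    if not 0 < gib <= MAX_QUOTA_GIB:
        raise ValueError(f"quota must be between 0 and {MAX_QUOTA_GIB} GiB")
    return f"--quota-backend-bytes={gib * GIB}"


if __name__ == "__main__":
    print(quota_flag(8))  # --quota-backend-bytes=8589934592
```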
  • 38. Other Performance Impacts • kube-state-metrics' pod_nanny required higher settings (extra_mem = 150Mi) per Node due to higher pod churn
  • 39. Lessons Learned • Large-Scale is not necessarily bound to #nodes • etcd really is your Pet and you want to make it feel good • Multi-Tenancy possible but complex • Requires especially good monitoring, logging & auditing • Students are very curious and use the new technologies
  • 40. What’s next? • Node Security & Container Isolation • Network Policies • Resource Management via Self-Service (tamias.io) • Priorities / kube-arbitrator • Improve usage of owned, but idle resources • PodTolerationRestriction Controller • IPv6 & multi-network setup (IoT research et al.)
  • 41. Thanks for listening!