
Kubernetes in Production: Lessons Learnt

What does it feel like to run a high-traffic, large-scale application on Kubernetes, with 100+ microservices and 1600+ pods, handling 2K requests/second in production? Experience these developers' journey through the do's, the don'ts, the pains, the pleasures, and the gotchas on the way to production.

  1. KUBERNETES IN PRODUCTION: LESSONS LEARNT
  2. Introduction
     ● Kubernetes in production for 6+ months, handling 2K requests/second
     ● 100+ microservices and 200+ components such as databases, cache stores, and queues
     ● 1800+ pods
     ● New environment setup in weeks through automation
     ● Cost savings through optimal utilization of resources
  3. Cluster Creation
     ● Cluster pod address range (cluster-ipv4-cidr)
       ○ Size
       ○ IP conflicts between clusters in different Google Cloud Platform (GCP) projects
     ● Cluster type
       ○ Zonal
       ○ Regional
     ● Add a storage class for SSD (see the manifest sketch below)
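
     On GKE, the default storage class provisions standard persistent disks, so an SSD class has to be added by hand. A minimal sketch, assuming the in-tree GCE persistent-disk provisioner; the class name "ssd" is our choice:

        apiVersion: storage.k8s.io/v1
        kind: StorageClass
        metadata:
          name: ssd                         # our choice of name; referenced later from volume claims
        provisioner: kubernetes.io/gce-pd   # in-tree GCE persistent disk provisioner
        parameters:
          type: pd-ssd                      # provision SSD-backed persistent disks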
  4. Namespaces != Environments
     [Diagram: a Staging cluster and a Production cluster, each with its own namespace and pods, contrasted with a single cluster split into "staging" and "production" namespaces]
  5. Team as Namespace
     [Diagram: a single cluster with per-team namespaces such as "platform" and "promotions", each containing that team's pods]
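
     A per-team namespace is a one-object manifest; "platform" here mirrors a team name from the diagram:

        apiVersion: v1
        kind: Namespace
        metadata:
          name: platform      # one namespace per team, not per environment
          labels:
            team: platform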
  6. Helm & Tiller
  7. Global vs Namespace-scoped Tiller
  8. Global vs Namespace-scoped Tiller
     ● Caveat: a ClusterRoleBinding cannot be created using these Tillers
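
     A namespace-scoped Tiller follows the RBAC pattern from the Helm 2 docs: a ServiceAccount plus a Role and RoleBinding confined to one namespace. A sketch for a hypothetical "promotions" team namespace; because the Role is namespaced, this Tiller cannot create cluster-scoped objects such as ClusterRoleBindings, which is the caveat above:

        apiVersion: v1
        kind: ServiceAccount
        metadata:
          name: tiller
          namespace: promotions
        ---
        apiVersion: rbac.authorization.k8s.io/v1
        kind: Role                      # namespaced, unlike a ClusterRole
        metadata:
          name: tiller-manager
          namespace: promotions
        rules:
        - apiGroups: ["", "batch", "extensions", "apps"]
          resources: ["*"]
          verbs: ["*"]
        ---
        apiVersion: rbac.authorization.k8s.io/v1
        kind: RoleBinding
        metadata:
          name: tiller-binding
          namespace: promotions
        subjects:
        - kind: ServiceAccount
          name: tiller
          namespace: promotions
        roleRef:
          kind: Role
          name: tiller-manager
          apiGroup: rbac.authorization.k8s.io

     Tiller is then installed into that namespace with helm init --service-account tiller --tiller-namespace promotions, keeping each team's releases confined to its own namespace.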
  9. CI/CD for Helm
  10. Rolling Update & Readiness Probe
      ● With maxSurge: 1, maxUnavailable: 1, minReadySeconds: 3, a rollout proceeds as: deploy one instance of the new version → attach it to the load balancer → delete one instance of the old version → deploy another new instance → delete another old instance
      ● If the new version crash-loops, its readiness probe keeps it Unhealthy, so it is never attached to the load balancer and traffic stays on the remaining old instances
      [Diagram: a Service fronting V1/V2 pods at each step of the rollout, including a crash-loop scenario where the V2 pod stays Unhealthy]
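
      The rollout above corresponds to a Deployment rolling-update strategy plus a readiness probe. A sketch with the slide's settings; the names, image, and /healthz endpoint are hypothetical:

        apiVersion: apps/v1
        kind: Deployment
        metadata:
          name: my-service              # hypothetical service name
        spec:
          replicas: 2
          minReadySeconds: 3            # pod must stay Ready this long before counting as available
          strategy:
            type: RollingUpdate
            rollingUpdate:
              maxSurge: 1               # at most one extra pod during the rollout
              maxUnavailable: 1         # at most one pod below the desired count
          selector:
            matchLabels:
              app: my-service
          template:
            metadata:
              labels:
                app: my-service
            spec:
              containers:
              - name: my-service
                image: gcr.io/my-project/my-service:v2   # hypothetical image
                ports:
                - containerPort: 8080
                readinessProbe:          # gates traffic and halts a crash-looping rollout
                  httpGet:
                    path: /healthz       # hypothetical health endpoint
                    port: 8080
                  initialDelaySeconds: 5
                  periodSeconds: 10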
  11. Database on Containers
      ● High availability is important in the container world
        ○ Pods are not durable
      ● Use persistent volumes
      ● StatefulSet: what & why?
        ○ Ordered creation, deletion, and scaling
        ○ Stable identifier for pods
        ○ Each pod gets a dedicated persistent volume
  12. Database on Containers
      ● A StatefulSet alone is not enough to achieve high availability
      ● Postgres cluster => Stolon
      ● Use pod anti-affinity to reduce the impact of a node failure (see the sketch below)
      [Diagram: a K8s cluster running a StatefulSet of three pods: one master and two slaves]
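
      A sketch of the shape of such a StatefulSet: volumeClaimTemplates give each pod its own SSD-backed volume, and required pod anti-affinity keeps replicas on distinct nodes. The image tag and sizes are illustrative, and a real Stolon deployment involves more components (keepers, sentinels, proxies) than shown here:

        apiVersion: apps/v1
        kind: StatefulSet
        metadata:
          name: postgres
        spec:
          serviceName: postgres       # headless Service giving each pod a stable DNS identity
          replicas: 3
          selector:
            matchLabels:
              app: postgres
          template:
            metadata:
              labels:
                app: postgres
            spec:
              affinity:
                podAntiAffinity:      # spread replicas across nodes
                  requiredDuringSchedulingIgnoredDuringExecution:
                  - labelSelector:
                      matchLabels:
                        app: postgres
                    topologyKey: kubernetes.io/hostname
              containers:
              - name: postgres
                image: postgres:10    # illustrative; a Stolon setup replaces this
                volumeMounts:
                - name: data
                  mountPath: /var/lib/postgresql/data
          volumeClaimTemplates:       # one dedicated PersistentVolume per pod
          - metadata:
              name: data
            spec:
              accessModes: ["ReadWriteOnce"]
              storageClassName: ssd   # the SSD class defined earlier
              resources:
                requests:
                  storage: 100Gi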
  13. Isolate Stateful & Stateless Apps
      ● Why?
        ○ Separation of concerns
        ○ Different resource consumption patterns for stateful and stateless apps
        ○ Apps undergo frequent updates while stateful components do not
      ● Separate node pool (see the sketch below)
      ● Separate cluster
        ○ Consul and kube-consul-register for service discovery
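
      With a separate node pool, stateful pods can be pinned to it via GKE's built-in node-pool label; "stateful-pool" is a hypothetical pool name:

        apiVersion: v1
        kind: Pod
        metadata:
          name: redis-example         # illustrative stateful pod
        spec:
          nodeSelector:
            cloud.google.com/gke-nodepool: stateful-pool   # hypothetical pool name
          containers:
          - name: redis
            image: redis:4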
  14. Inter-Cluster Service Discovery
  15. Resource Requests & Limits
      ● Requests: when containers have resource requests specified, the K8s scheduler can make better decisions about which nodes to place pods on
      ● Limits: when containers have limits specified, contention for resources on a node can be handled in a defined way (limits are enforced on the node itself, not by the scheduler)
  16. Resource Requests & Limits
      ● How we approached it:
        ○ Start with the defaults, i.e. no requests or limits (unlimited)
        ○ Learn the usage patterns over time and introduce appropriate requests and limits
      ● Advantages:
        ○ Measures the full utilization requirement of each application separately
      ● Disadvantages:
        ○ Unbalanced pod scheduling, which led to a resource crunch
        ○ Node autoscaling in GKE doesn't work, since the autoscaler's scale-up decisions rely on pending pods' resource requests
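
      Once the patterns are known, requests and limits go on each container. A sketch; the numbers are illustrative, not recommendations:

        apiVersion: v1
        kind: Pod
        metadata:
          name: api-example
        spec:
          containers:
          - name: api
            image: gcr.io/my-project/api:v1   # hypothetical image
            resources:
              requests:             # used by the scheduler for placement decisions
                cpu: 250m
                memory: 256Mi
              limits:               # enforced on the node via cgroups
                cpu: 500m
                memory: 512Mi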
  17. Monitoring in K8s
      ● Why is it important in the container world?
      ● Tools:
        ○ Prometheus in K8s: the Prometheus Operator
        ○ Grafana
      ● Metrics exporters as separate pods:
        ○ Independent from the actual component
      ● Metrics exporters as a sidecar of the component pod (see the sketch below):
        ○ Requires a restart of the actual component when the exporter is updated
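
      The sidecar variant puts the exporter in the same pod as the component, e.g. Redis with the community redis_exporter; updating the exporter image then restarts the whole pod, including Redis. A sketch, with an illustrative image tag:

        apiVersion: v1
        kind: Pod
        metadata:
          name: redis
          labels:
            app: redis
        spec:
          containers:
          - name: redis              # the actual component
            image: redis:4
            ports:
            - containerPort: 6379
          - name: exporter           # sidecar: updating it restarts the whole pod
            image: oliver006/redis_exporter:v0.21.1   # community exporter; tag is illustrative
            ports:
            - containerPort: 9121    # the exporter's default metrics port, scraped by Prometheus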
  18. Monitoring in K8s
      ● Dashboards
        ○ Node metrics
        ○ Node pod metrics
        ○ Ingress controller
        ○ K8s API latency
        ○ K8s persistent volumes
  19. Alerting in K8s
      ● Pods: crash loops, readiness
      ● Nodes: restarts, kubelet process restarts, Docker daemon restarts
      ● Sudden CPU, memory, and disk utilization spikes of pods and nodes
        ○ Indicate an anomaly
        ○ If a node's resource consumption goes beyond the configured eviction policy, pods are evicted based on priority
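
      With the Prometheus Operator, a crash-loop alert can be expressed as a PrometheusRule over the restart counter exposed by kube-state-metrics. A sketch; the threshold, durations, and the label the Operator's ruleSelector matches on are deployment-specific assumptions:

        apiVersion: monitoring.coreos.com/v1
        kind: PrometheusRule
        metadata:
          name: pod-alerts
          labels:
            role: alert-rules        # assumption: matches the Prometheus resource's ruleSelector
        spec:
          groups:
          - name: pods
            rules:
            - alert: PodCrashLooping
              # restarts over the last 15 minutes, scaled to restarts per 5 minutes
              expr: rate(kube_pod_container_status_restarts_total[15m]) * 60 * 5 > 0
              for: 10m
              labels:
                severity: critical
              annotations:
                summary: Pod {{ $labels.namespace }}/{{ $labels.pod }} is crash looping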
  20. Monitoring & Alerting Setup
      [Diagram: Prometheus and AlertManager pods running in a dedicated "Monitoring & Alerting" node pool, scraping pods in the default node pool, with Grafana for dashboards and Slack for notifications]
  21. Kubernetes API Gotchas
      ● Downtime during K8s master upgrades in GKE
        ○ Applications dependent on the Kubernetes API are affected
        ○ Maintenance Window (beta): GKE lets you configure a 4-hour window for upgrades
      ● Reduce applications' runtime dependency on the K8s API
  22. GKE Limitations
      ● Only 16 disks can be attached per node
      ● Only 8 SSD disks can be attached per node
      ● A maximum of 50 internal load balancers is allowed per project
      ● The pod IP range caps the number of nodes (each node reserves a slice of the pod CIDR)
      ● No control over K8s master nodes
  23. Development Practices that Help Containerization
      ● Config: store config in the environment (see the sketch below)
      ● Logs: treat logs as event streams
        ○ Centralized logging: Stackdriver / ELK
      ● Processes: execute the app as one or more stateless processes
      ● Concurrency: scale out via the process model
      ● The Twelve-Factor App: https://12factor.net
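
      "Config in the environment" maps naturally onto Kubernetes ConfigMaps injected as environment variables. A sketch; all names and values here are hypothetical:

        apiVersion: v1
        kind: ConfigMap
        metadata:
          name: app-config           # hypothetical
        data:
          DATABASE_HOST: postgres.platform.svc.cluster.local
          LOG_LEVEL: info
        ---
        apiVersion: v1
        kind: Pod
        metadata:
          name: app
        spec:
          containers:
          - name: app
            image: gcr.io/my-project/app:v1   # hypothetical image
            envFrom:
            - configMapRef:
                name: app-config     # every key becomes an environment variable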
  24. QUESTIONS?
  25. THANK YOU
      Arunvel Sriram & Prabhu Jayakumar
