
Kubernetes in Production: Lessons Learnt

What does it feel like to run a high-traffic, large-scale application on Kubernetes, with 100+ microservices and 1600+ pods, handling 2K requests/second in production? Experience these developers' journey through the do's, the don'ts, the pains, the pleasures, and the gotchas on the way to production.

  1. KUBERNETES IN PRODUCTION: LESSONS LEARNT
  2. Introduction
     ● Kubernetes in production for 6+ months, handling 2K requests/second
     ● 100+ microservices and 200+ components such as databases, cache stores, and queues
     ● 1800+ pods
     ● New environment setup in weeks through automation
     ● Cost savings through optimal utilization of resources
  3. Cluster Creation
     ● Cluster pod address range (cluster-ipv4-cidr)
       ○ Size
       ○ IP conflicts between clusters in different Google Cloud Platform (GCP) projects
     ● Cluster type
       ○ Zonal
       ○ Regional
     ● Add a storage class for SSD (see the manifest sketch below)
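
     On GKE, the default storage class provisions standard persistent disks, so an SSD class has to be added by hand. A minimal sketch, assuming the in-tree GCE persistent-disk provisioner; the class name "ssd" is our choice:

        apiVersion: storage.k8s.io/v1
        kind: StorageClass
        metadata:
          name: ssd                         # our choice of name; referenced later from volume claims
        provisioner: kubernetes.io/gce-pd   # in-tree GCE persistent disk provisioner
        parameters:
          type: pd-ssd                      # provision SSD-backed persistent disks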
  4. Namespaces != Environments
     [Diagram: a Staging cluster and a Production cluster, each with its own namespace and pods, contrasted with a single cluster split into "staging" and "production" namespaces]
  5. Team as Namespace
     [Diagram: a single cluster with per-team namespaces such as "platform" and "promotions", each containing that team's pods]
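
     A per-team namespace is a one-object manifest; "platform" here mirrors a team name from the diagram:

        apiVersion: v1
        kind: Namespace
        metadata:
          name: platform      # one namespace per team, not per environment
          labels:
            team: platform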
  6. Helm & Tiller
  7. Global vs Namespace-scoped Tiller
  8. Global vs Namespace-scoped Tiller
     ● Caveat: a ClusterRoleBinding cannot be created using these Tillers
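
     A namespace-scoped Tiller follows the RBAC pattern from the Helm 2 docs: a ServiceAccount plus a Role and RoleBinding confined to one namespace. A sketch for a hypothetical "promotions" team namespace; because the Role is namespaced, this Tiller cannot create cluster-scoped objects such as ClusterRoleBindings, which is the caveat above:

        apiVersion: v1
        kind: ServiceAccount
        metadata:
          name: tiller
          namespace: promotions
        ---
        apiVersion: rbac.authorization.k8s.io/v1
        kind: Role                      # namespaced, unlike a ClusterRole
        metadata:
          name: tiller-manager
          namespace: promotions
        rules:
        - apiGroups: ["", "batch", "extensions", "apps"]
          resources: ["*"]
          verbs: ["*"]
        ---
        apiVersion: rbac.authorization.k8s.io/v1
        kind: RoleBinding
        metadata:
          name: tiller-binding
          namespace: promotions
        subjects:
        - kind: ServiceAccount
          name: tiller
          namespace: promotions
        roleRef:
          kind: Role
          name: tiller-manager
          apiGroup: rbac.authorization.k8s.io

     Tiller is then installed into that namespace with helm init --service-account tiller --tiller-namespace promotions, keeping each team's releases confined to its own namespace.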
  9. CI/CD for Helm
  10. Rolling Update & Readiness Probe
      ● With maxSurge: 1, maxUnavailable: 1, minReadySeconds: 3, a rollout proceeds as: deploy one instance of the new version → attach it to the load balancer → delete one instance of the old version → deploy another new instance → delete another old instance
      ● If the new version crash-loops, its readiness probe keeps it Unhealthy, so it is never attached to the load balancer and traffic stays on the remaining old instances
      [Diagram: a Service fronting V1/V2 pods at each step of the rollout, including a crash-loop scenario where the V2 pod stays Unhealthy]
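
      The rollout above corresponds to a Deployment rolling-update strategy plus a readiness probe. A sketch with the slide's settings; the names, image, and /healthz endpoint are hypothetical:

        apiVersion: apps/v1
        kind: Deployment
        metadata:
          name: my-service              # hypothetical service name
        spec:
          replicas: 2
          minReadySeconds: 3            # pod must stay Ready this long before counting as available
          strategy:
            type: RollingUpdate
            rollingUpdate:
              maxSurge: 1               # at most one extra pod during the rollout
              maxUnavailable: 1         # at most one pod below the desired count
          selector:
            matchLabels:
              app: my-service
          template:
            metadata:
              labels:
                app: my-service
            spec:
              containers:
              - name: my-service
                image: gcr.io/my-project/my-service:v2   # hypothetical image
                ports:
                - containerPort: 8080
                readinessProbe:          # gates traffic and halts a crash-looping rollout
                  httpGet:
                    path: /healthz       # hypothetical health endpoint
                    port: 8080
                  initialDelaySeconds: 5
                  periodSeconds: 10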
  11. Database on Containers
      ● High availability is important in the container world
        ○ Pods are not durable
      ● Use persistent volumes
      ● StatefulSet: what & why?
        ○ Ordered creation, deletion, and scaling
        ○ Stable identifier for pods
        ○ Each pod gets a dedicated persistent volume
  12. Database on Containers
      ● A StatefulSet alone is not enough to achieve high availability
      ● Postgres cluster => Stolon
      ● Use pod anti-affinity to reduce the impact of a node failure (see the sketch below)
      [Diagram: a K8s cluster running a StatefulSet of three pods: one master and two slaves]
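
      A sketch of the shape of such a StatefulSet: volumeClaimTemplates give each pod its own SSD-backed volume, and required pod anti-affinity keeps replicas on distinct nodes. The image tag and sizes are illustrative, and a real Stolon deployment involves more components (keepers, sentinels, proxies) than shown here:

        apiVersion: apps/v1
        kind: StatefulSet
        metadata:
          name: postgres
        spec:
          serviceName: postgres       # headless Service giving each pod a stable DNS identity
          replicas: 3
          selector:
            matchLabels:
              app: postgres
          template:
            metadata:
              labels:
                app: postgres
            spec:
              affinity:
                podAntiAffinity:      # spread replicas across nodes
                  requiredDuringSchedulingIgnoredDuringExecution:
                  - labelSelector:
                      matchLabels:
                        app: postgres
                    topologyKey: kubernetes.io/hostname
              containers:
              - name: postgres
                image: postgres:10    # illustrative; a Stolon setup replaces this
                volumeMounts:
                - name: data
                  mountPath: /var/lib/postgresql/data
          volumeClaimTemplates:       # one dedicated PersistentVolume per pod
          - metadata:
              name: data
            spec:
              accessModes: ["ReadWriteOnce"]
              storageClassName: ssd   # the SSD class defined earlier
              resources:
                requests:
                  storage: 100Gi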
  13. Isolate Stateful & Stateless Apps
      ● Why?
        ○ Separation of concerns
        ○ Different resource consumption patterns for stateful and stateless apps
        ○ Apps undergo frequent updates while stateful components do not
      ● Separate node pool (see the sketch below)
      ● Separate cluster
        ○ Consul and kube-consul-register for service discovery
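
      With a separate node pool, stateful pods can be pinned to it via GKE's built-in node-pool label; "stateful-pool" is a hypothetical pool name:

        apiVersion: v1
        kind: Pod
        metadata:
          name: redis-example         # illustrative stateful pod
        spec:
          nodeSelector:
            cloud.google.com/gke-nodepool: stateful-pool   # hypothetical pool name
          containers:
          - name: redis
            image: redis:4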
  14. Inter-Cluster Service Discovery
  15. Resource Requests & Limits
      ● Requests: when containers have resource requests specified, the K8s scheduler can make better decisions about which nodes to place pods on
      ● Limits: when containers have limits specified, contention for resources on a node can be handled in a defined way (limits are enforced on the node itself, not by the scheduler)
  16. Resource Requests & Limits
      ● How we approached it:
        ○ Start with the defaults, i.e. no requests or limits (unlimited)
        ○ Learn the usage patterns over time and introduce appropriate requests and limits
      ● Advantages:
        ○ Measures the full utilization requirement of each application separately
      ● Disadvantages:
        ○ Unbalanced pod scheduling, which led to a resource crunch
        ○ Node autoscaling in GKE doesn't work, since the autoscaler's scale-up decisions rely on pending pods' resource requests
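
      Once the patterns are known, requests and limits go on each container. A sketch; the numbers are illustrative, not recommendations:

        apiVersion: v1
        kind: Pod
        metadata:
          name: api-example
        spec:
          containers:
          - name: api
            image: gcr.io/my-project/api:v1   # hypothetical image
            resources:
              requests:             # used by the scheduler for placement decisions
                cpu: 250m
                memory: 256Mi
              limits:               # enforced on the node via cgroups
                cpu: 500m
                memory: 512Mi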
  17. Monitoring in K8s
      ● Why is it important in the container world?
      ● Tools:
        ○ Prometheus in K8s: the Prometheus Operator
        ○ Grafana
      ● Metrics exporters as separate pods:
        ○ Independent from the actual component
      ● Metrics exporters as a sidecar of the component pod (see the sketch below):
        ○ Requires a restart of the actual component when the exporter is updated
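
      The sidecar variant puts the exporter in the same pod as the component, e.g. Redis with the community redis_exporter; updating the exporter image then restarts the whole pod, including Redis. A sketch, with an illustrative image tag:

        apiVersion: v1
        kind: Pod
        metadata:
          name: redis
          labels:
            app: redis
        spec:
          containers:
          - name: redis              # the actual component
            image: redis:4
            ports:
            - containerPort: 6379
          - name: exporter           # sidecar: updating it restarts the whole pod
            image: oliver006/redis_exporter:v0.21.1   # community exporter; tag is illustrative
            ports:
            - containerPort: 9121    # the exporter's default metrics port, scraped by Prometheus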
  18. Monitoring in K8s
      ● Dashboards
        ○ Node metrics
        ○ Node pod metrics
        ○ Ingress controller
        ○ K8s API latency
        ○ K8s persistent volumes
  19. Alerting in K8s
      ● Pods: crash loops, readiness
      ● Nodes: restarts, kubelet process restarts, Docker daemon restarts
      ● Sudden CPU, memory, and disk utilization spikes of pods and nodes
        ○ Indicate an anomaly
        ○ If a node's resource consumption goes beyond the configured eviction policy, pods are evicted based on priority
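
      With the Prometheus Operator, a crash-loop alert can be expressed as a PrometheusRule over the restart counter exposed by kube-state-metrics. A sketch; the threshold, durations, and the label the Operator's ruleSelector matches on are deployment-specific assumptions:

        apiVersion: monitoring.coreos.com/v1
        kind: PrometheusRule
        metadata:
          name: pod-alerts
          labels:
            role: alert-rules        # assumption: matches the Prometheus resource's ruleSelector
        spec:
          groups:
          - name: pods
            rules:
            - alert: PodCrashLooping
              # restarts over the last 15 minutes, scaled to restarts per 5 minutes
              expr: rate(kube_pod_container_status_restarts_total[15m]) * 60 * 5 > 0
              for: 10m
              labels:
                severity: critical
              annotations:
                summary: Pod {{ $labels.namespace }}/{{ $labels.pod }} is crash looping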
  20. Monitoring & Alerting Setup
      [Diagram: Prometheus and AlertManager pods running in a dedicated "Monitoring & Alerting" node pool, scraping pods in the default node pool, with Grafana for dashboards and Slack for notifications]
  21. Kubernetes API Gotchas
      ● Downtime during K8s master upgrades in GKE
        ○ Applications dependent on the Kubernetes API are affected
        ○ Maintenance Window (beta): GKE lets you configure a 4-hour window for upgrades
      ● Reduce applications' runtime dependency on the K8s API
  22. GKE Limitations
      ● Only 16 disks can be attached per node
      ● Only 8 SSD disks can be attached per node
      ● A maximum of 50 internal load balancers is allowed per project
      ● The pod IP range caps the number of nodes (each node reserves a slice of the pod CIDR)
      ● No control over K8s master nodes
  23. Development Practices that Help Containerization
      ● Config: store config in the environment (see the sketch below)
      ● Logs: treat logs as event streams
        ○ Centralized logging: Stackdriver / ELK
      ● Processes: execute the app as one or more stateless processes
      ● Concurrency: scale out via the process model
      ● The Twelve-Factor App: https://12factor.net
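
      "Config in the environment" maps naturally onto Kubernetes ConfigMaps injected as environment variables. A sketch; all names and values here are hypothetical:

        apiVersion: v1
        kind: ConfigMap
        metadata:
          name: app-config           # hypothetical
        data:
          DATABASE_HOST: postgres.platform.svc.cluster.local
          LOG_LEVEL: info
        ---
        apiVersion: v1
        kind: Pod
        metadata:
          name: app
        spec:
          containers:
          - name: app
            image: gcr.io/my-project/app:v1   # hypothetical image
            envFrom:
            - configMapRef:
                name: app-config     # every key becomes an environment variable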
  24. QUESTIONS?
  25. THANK YOU
      Arunvel Sriram & Prabhu Jayakumar
