SlideShare una empresa de Scribd logo
1 de 47
Descargar para leer sin conexión
you should watch
Jorge Salamero - @bencerillo
15 Kubernetes
failure points
Jorge Salamero
Tech Marketing aka container gamer @ Sysdig
github.com/bencer
@bencerillo
OSS fan
Monitoring, containers, IoT/home-automation, cars
About me
Monitoring & Security Platform for Containers
Monitoring 15 Kubernetes failure points
- Apps
- Hosts
- Orchestration
- Containers
- Yourself
https://sysdig.com/blog/monitoring-kubernetes-with-sysdig-cloud/
https://sysdig.com/blog/alerting-kubernetes/
The holy service metrics
- KPI / biz metrics / synthetic
monitoring / user metrics
- Google SRE book:
“The Four Golden Signals”
Latency+Traffic+Errors+Saturation
USE method
- Utilization
(how busy we are, close to 100% bottleneck)
- Saturation
(amount of work waiting on the queue)
- Errors
RED method
- Request Rate
- Request Errors
- Request Duration
The holy service metrics
- Code instrumentation (statsd, JMX
or Prometheus metrics):
var httpDurationsHistogram := prometheus.NewHistogramVec(prometheus.HistogramOpts{
Name: "http_durations_histogram_seconds",
Help: "Seconds spent serving HTTP requests.",
Buckets: prometheus.DefBuckets,
}, []string{"method", "route", "status_code"})
prometheus.MustRegister(httpDurationsHistogram)
- or Sysdig autodiscovery ;-)
1. connections per second
net.request.count
2. response time
net.response.time
3. errors
net.request.error.count
Prometheus + Grafana UI
Kubernetes orchestration
Kubernetes hierarchy
Services vs hosts+containers
Kubernetes metadata: labels
Pod
app: shopping
tier: api
Pod
app: shopping
tier: db
Pod
app: social
tier: api
role: search
Pod
app: social
tier: api
role: search
Leverage metadata (by service)
Leverage metadata (by pod)
Health vs state monitoring
- Health:
- CPU, memory, disk
- connections, response time,
errors
Health vs state monitoring
- State (orchestration):
- Are containers up and
running properly?
Health vs state monitoring
- kube-state-metrics
https://github.com/kubernetes/kube-state-metrics
https://sysdig.com/blog/introducing-kube-state-metrics/
calculate new metrics based on
the state of Kubernetes
resources
Container scheduling
- Need to deploy a container:
- given the requirements,
where can we run it?
and let’s ignore affinity, taints and tolerations:
https://sysdig.com/blog/kubernetes-scheduler/
- capacity planning
4. node availability
Based on the host or the kubelet component status:
kube_node_status_condition{condition="Ready",status="true"} == 0
count(kube_node_status_condition{condition="Ready",status="true"} == 0) > 1 and
(count(kube_node_status_condition{condition="Ready",status="true"} == 0) /
count(kube_node_status_condition{condition="Ready",status="true"})) > 0.2
count(up{job="kubelet"} == 0) / count(up{job="kubelet"}) * 100 > 3
kube_node_status_condition: kube_node_status_ready,
kube_node_status_out_of_disk, kube_node_status_memory_pressure,
kube_node_status_disk_pressure, and kube_node_status_network_unavailable
Sysdig alert UI
Container resource requirements
resources:
requests:
memory: "256Mi"
cpu: "250m"
limits:
memory: "512Mi"
cpu: "500m"
https://github.com/kubernetes-incubator/cluster-capacity
5. CPU resources
6. memory resources
kube_node_status_capacity_pods
kube_node_status_allocatable_pods
kube_node_status_capacity_cpu_cores
kube_node_status_capacity_memory_bytes
kube_node_status_allocatable_cpu_cores
kube_node_status_allocatable_memory_bytes
capacity - used (by OS and kube services) = allocatable
Container disk requirements
here things get more complicated...
- ephemeral disk usage
- persistent volumes claims
7. disk resources
predict_linear(node_filesystem_free[30m], 3600 * 2) < 0
kube_node_status_condition: kube_node_status_out_of_disk
but within containers this is still WIP, at least Kubernetes 1.8:
container_fs_* doesn’t work with PV
https://github.com/kubernetes/kubernetes/pull/59170
https://github.com/kubernetes/kubernetes/pull/51553
https://kubernetes.io/docs/concepts/cluster-administration/controller-metrics/
Container orchestration
- ReplicationController
- ReplicaSet
- Deployment
- DaemonSet
- StatefulSet
Kubernetes deployments
Is Kubernetes doing what is
supposed to to?
Orchestration needs monitoring too.
8. running instances
9. desired instances
((kube_deployment_status_replicas_updated != kube_deployment_spec_replicas)
or
(kube_deployment_status_replicas_available != kube_deployment_spec_replicas))
10. deployment updates glitches
kube_deployment_status_observed_generation !=
kube_deployment_metadata_generation
kube_deployment_spec_paused
kube_deployment_spec_strategy_rollingupdate_max_unavailable
Container livecycle state
https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/
Liveness probes
To know when to restart a container:
livenessProbe:
httpGet:
path: /healthz
port: 8080
httpHeaders:
- name: X-Custom-Header
value: Awesome
initialDelaySeconds: 3
periodSeconds: 3
Ready-ness probes
To know when a container is ready to start accepting traffic:
readinessProbe:
exec:
command:
- cat
- /tmp/healthy
initialDelaySeconds: 5
periodSeconds: 5
11. pod status
kube_pod_status_phase: Pending|Running|Succeeded|Failed|Unknown
kube_pod_status_ready
kube_pod_status_scheduled
kube_pod_container_status_waiting
kube_pod_container_status_running
kube_pod_container_status_terminated
kube_pod_container_status_ready
12. pod restarts
You can look at this as a metric or as an event:
ALERT PodRestartingTooMuch
IF rate(k8s_pod_status_restartCount[1m]) > 1/(5*60)
FOR 1h
LABELS { severity="warning" }
ANNOTATIONS {
summary = "Pod {{$labels.namespace}}/{{$label.name}} restarting too
much.",
description = "Pod {{$labels.namespace}}/{{$label.name}} restarting too
much.",
}
CrashLoopBackOff event
https://sysdig.com/blog/debug-kubernetes-crashloopbackoff/
Sysdig Inspect
https://github.com/draios/sysdig-inspect
Kubernetes internals
- APIserver
- KubeDNS / Istio
- container registry
- any other piece of Kubernetes
https://sysdig.com/blog/monitor-etcd/
13. APIserver
rate(apiserver_request_count{code=~"^(?:5..)$"}[5m]) /
rate(apiserver_request_count[5m])* 100 > 5
apiserver_latency_seconds:quantile{quantile="0.99",subresource!="log",verb!
~"^(?:WATCH|WATCHLIST|PROXY|CONNECT)$"}> 4
Or just do Golden signals on APIserver endpoint too :-)
14. KubeDNS / Istio
histogram_quantile(0.95,
sum(rate(kubedns_probe_kubedns_latency_ms_bucket[1m])) BY (le,
kubernetes_pod_name)) > 1000
All export native metrics in Prometheus format, just scrape them!
https://sysdig.com/blog/monitor-istio/
What are we deploying?
- CI/CD and commits
- Manual deploys
You need to validate what you
tell Kubernetes too!
15. monitor your commands
kubeval: validates YAML and JSON config files
https://github.com/garethr/kubeval
kube-diff: show differences between running state and version controlled configuration
https://github.com/weaveworks/kubediff
Configuration reconciliation discussion:
https://github.com/kubernetes/kubernetes/issues/1702
Although this is getting automated too:
https://sysdig.com/blog/kubernetes-scaler/
Recap
1. connections per second
2. response time
3. errors
4. node availability
5. CPU resources
6. memory resources
7. disk and external resources
Recap (2)
8. running instances
9. desired instances
10. deployment updates glitches
Recap (3)
11. pod status
12. pod restarts
13. APIserver health
14. KubeDNS / Istio health
15. monitor your commands
Grazie!
Jorge Salamero - @bencerillo
https://sysdig.com/blog/

Más contenido relacionado

La actualidad más candente

MariaDB 10.11 key features overview for DBAs
MariaDB 10.11 key features overview for DBAsMariaDB 10.11 key features overview for DBAs
MariaDB 10.11 key features overview for DBAsFederico Razzoli
 
NoSQL @ CodeMash 2010
NoSQL @ CodeMash 2010NoSQL @ CodeMash 2010
NoSQL @ CodeMash 2010Ben Scofield
 
Naver속도의, 속도에 의한, 속도를 위한 몽고DB (네이버 컨텐츠검색과 몽고DB) [Naver]
Naver속도의, 속도에 의한, 속도를 위한 몽고DB (네이버 컨텐츠검색과 몽고DB) [Naver]Naver속도의, 속도에 의한, 속도를 위한 몽고DB (네이버 컨텐츠검색과 몽고DB) [Naver]
Naver속도의, 속도에 의한, 속도를 위한 몽고DB (네이버 컨텐츠검색과 몽고DB) [Naver]MongoDB
 
Parallel Replication in MySQL and MariaDB
Parallel Replication in MySQL and MariaDBParallel Replication in MySQL and MariaDB
Parallel Replication in MySQL and MariaDBMydbops
 
MySQL Audit using Percona audit plugin and ELK
MySQL Audit using Percona audit plugin and ELKMySQL Audit using Percona audit plugin and ELK
MySQL Audit using Percona audit plugin and ELKYoungHeon (Roy) Kim
 
Rate Limiting with NGINX and NGINX Plus
Rate Limiting with NGINX and NGINX PlusRate Limiting with NGINX and NGINX Plus
Rate Limiting with NGINX and NGINX PlusNGINX, Inc.
 
Upgrade to MySQL 8.0!
Upgrade to MySQL 8.0!Upgrade to MySQL 8.0!
Upgrade to MySQL 8.0!Ted Wennmark
 
Improving Python and Spark Performance and Interoperability: Spark Summit Eas...
Improving Python and Spark Performance and Interoperability: Spark Summit Eas...Improving Python and Spark Performance and Interoperability: Spark Summit Eas...
Improving Python and Spark Performance and Interoperability: Spark Summit Eas...Spark Summit
 
Hacking WebApps for fun and profit : how to approach a target?
Hacking WebApps for fun and profit : how to approach a target?Hacking WebApps for fun and profit : how to approach a target?
Hacking WebApps for fun and profit : how to approach a target?Yassine Aboukir
 
How Booking.com avoids and deals with replication lag
How Booking.com avoids and deals with replication lagHow Booking.com avoids and deals with replication lag
How Booking.com avoids and deals with replication lagJean-François Gagné
 
MySQL Performance for DevOps
MySQL Performance for DevOpsMySQL Performance for DevOps
MySQL Performance for DevOpsSveta Smirnova
 
SpecterOps Webinar Week - Kerberoasting Revisisted
SpecterOps Webinar Week - Kerberoasting RevisistedSpecterOps Webinar Week - Kerberoasting Revisisted
SpecterOps Webinar Week - Kerberoasting RevisistedWill Schroeder
 
모두가 성장하는 스터디 만들기
모두가 성장하는 스터디 만들기모두가 성장하는 스터디 만들기
모두가 성장하는 스터디 만들기BYUNGHOKIM10
 
[D2]pinpoint 개발기
[D2]pinpoint 개발기[D2]pinpoint 개발기
[D2]pinpoint 개발기NAVER D2
 
DRP (Stretch Cluster) for HDP - Future of Data : Paris
DRP (Stretch Cluster) for HDP - Future of Data : Paris DRP (Stretch Cluster) for HDP - Future of Data : Paris
DRP (Stretch Cluster) for HDP - Future of Data : Paris Mohamed Mehdi Ben Aissa
 
RedisConf17 - Redis Graph
RedisConf17 - Redis GraphRedisConf17 - Redis Graph
RedisConf17 - Redis GraphRedis Labs
 
AI 연구자를 위한 클린코드 - GDG DevFest Seoul 2019
AI 연구자를 위한 클린코드 - GDG DevFest Seoul 2019AI 연구자를 위한 클린코드 - GDG DevFest Seoul 2019
AI 연구자를 위한 클린코드 - GDG DevFest Seoul 2019Kenneth Ceyer
 

La actualidad más candente (20)

MariaDB 10.11 key features overview for DBAs
MariaDB 10.11 key features overview for DBAsMariaDB 10.11 key features overview for DBAs
MariaDB 10.11 key features overview for DBAs
 
NoSQL @ CodeMash 2010
NoSQL @ CodeMash 2010NoSQL @ CodeMash 2010
NoSQL @ CodeMash 2010
 
Naver속도의, 속도에 의한, 속도를 위한 몽고DB (네이버 컨텐츠검색과 몽고DB) [Naver]
Naver속도의, 속도에 의한, 속도를 위한 몽고DB (네이버 컨텐츠검색과 몽고DB) [Naver]Naver속도의, 속도에 의한, 속도를 위한 몽고DB (네이버 컨텐츠검색과 몽고DB) [Naver]
Naver속도의, 속도에 의한, 속도를 위한 몽고DB (네이버 컨텐츠검색과 몽고DB) [Naver]
 
Parallel Replication in MySQL and MariaDB
Parallel Replication in MySQL and MariaDBParallel Replication in MySQL and MariaDB
Parallel Replication in MySQL and MariaDB
 
Confianza cero
Confianza ceroConfianza cero
Confianza cero
 
MySQL Audit using Percona audit plugin and ELK
MySQL Audit using Percona audit plugin and ELKMySQL Audit using Percona audit plugin and ELK
MySQL Audit using Percona audit plugin and ELK
 
Rate Limiting with NGINX and NGINX Plus
Rate Limiting with NGINX and NGINX PlusRate Limiting with NGINX and NGINX Plus
Rate Limiting with NGINX and NGINX Plus
 
Redis
RedisRedis
Redis
 
Upgrade to MySQL 8.0!
Upgrade to MySQL 8.0!Upgrade to MySQL 8.0!
Upgrade to MySQL 8.0!
 
Improving Python and Spark Performance and Interoperability: Spark Summit Eas...
Improving Python and Spark Performance and Interoperability: Spark Summit Eas...Improving Python and Spark Performance and Interoperability: Spark Summit Eas...
Improving Python and Spark Performance and Interoperability: Spark Summit Eas...
 
Graph database
Graph databaseGraph database
Graph database
 
Hacking WebApps for fun and profit : how to approach a target?
Hacking WebApps for fun and profit : how to approach a target?Hacking WebApps for fun and profit : how to approach a target?
Hacking WebApps for fun and profit : how to approach a target?
 
How Booking.com avoids and deals with replication lag
How Booking.com avoids and deals with replication lagHow Booking.com avoids and deals with replication lag
How Booking.com avoids and deals with replication lag
 
MySQL Performance for DevOps
MySQL Performance for DevOpsMySQL Performance for DevOps
MySQL Performance for DevOps
 
SpecterOps Webinar Week - Kerberoasting Revisisted
SpecterOps Webinar Week - Kerberoasting RevisistedSpecterOps Webinar Week - Kerberoasting Revisisted
SpecterOps Webinar Week - Kerberoasting Revisisted
 
모두가 성장하는 스터디 만들기
모두가 성장하는 스터디 만들기모두가 성장하는 스터디 만들기
모두가 성장하는 스터디 만들기
 
[D2]pinpoint 개발기
[D2]pinpoint 개발기[D2]pinpoint 개발기
[D2]pinpoint 개발기
 
DRP (Stretch Cluster) for HDP - Future of Data : Paris
DRP (Stretch Cluster) for HDP - Future of Data : Paris DRP (Stretch Cluster) for HDP - Future of Data : Paris
DRP (Stretch Cluster) for HDP - Future of Data : Paris
 
RedisConf17 - Redis Graph
RedisConf17 - Redis GraphRedisConf17 - Redis Graph
RedisConf17 - Redis Graph
 
AI 연구자를 위한 클린코드 - GDG DevFest Seoul 2019
AI 연구자를 위한 클린코드 - GDG DevFest Seoul 2019AI 연구자를 위한 클린코드 - GDG DevFest Seoul 2019
AI 연구자를 위한 클린코드 - GDG DevFest Seoul 2019
 

Similar a 15 kubernetes failure points you should watch

DevOpsDaysRiga 2018: Andrew Martin - Continuous Kubernetes Security
DevOpsDaysRiga 2018: Andrew Martin - Continuous Kubernetes Security DevOpsDaysRiga 2018: Andrew Martin - Continuous Kubernetes Security
DevOpsDaysRiga 2018: Andrew Martin - Continuous Kubernetes Security DevOpsDays Riga
 
Monitoring akka cluster on kubernetes
Monitoring akka cluster on kubernetesMonitoring akka cluster on kubernetes
Monitoring akka cluster on kubernetesSeva Dolgopolov
 
Kubernetes - Sailing a Sea of Containers
Kubernetes - Sailing a Sea of ContainersKubernetes - Sailing a Sea of Containers
Kubernetes - Sailing a Sea of ContainersKel Cecil
 
Control Plane: Continuous Kubernetes Security (DevSecOps - London Gathering, ...
Control Plane: Continuous Kubernetes Security (DevSecOps - London Gathering, ...Control Plane: Continuous Kubernetes Security (DevSecOps - London Gathering, ...
Control Plane: Continuous Kubernetes Security (DevSecOps - London Gathering, ...Michael Man
 
Hands-On Introduction to Kubernetes at LISA17
Hands-On Introduction to Kubernetes at LISA17Hands-On Introduction to Kubernetes at LISA17
Hands-On Introduction to Kubernetes at LISA17Ryan Jarvinen
 
Build Your Own CaaS (Container as a Service)
Build Your Own CaaS (Container as a Service)Build Your Own CaaS (Container as a Service)
Build Your Own CaaS (Container as a Service)HungWei Chiu
 
Keynote #Tech - Google : aperçu de la gestion des services distribués chez Go...
Keynote #Tech - Google : aperçu de la gestion des services distribués chez Go...Keynote #Tech - Google : aperçu de la gestion des services distribués chez Go...
Keynote #Tech - Google : aperçu de la gestion des services distribués chez Go...Paris Open Source Summit
 
Kubernetes extensibility: crd & operators
Kubernetes extensibility: crd & operators Kubernetes extensibility: crd & operators
Kubernetes extensibility: crd & operators Giacomo Tirabassi
 
Kubernetes extensibility: CRDs & Operators
Kubernetes extensibility: CRDs & OperatorsKubernetes extensibility: CRDs & Operators
Kubernetes extensibility: CRDs & OperatorsSIGHUP
 
Using kubernetes to lose your fear of using containers
Using kubernetes to lose your fear of using containersUsing kubernetes to lose your fear of using containers
Using kubernetes to lose your fear of using containersjosfuecas
 
Azure kubernetes service (aks) part 3
Azure kubernetes service (aks)   part 3Azure kubernetes service (aks)   part 3
Azure kubernetes service (aks) part 3Nilesh Gule
 
Monitoring Kubernetes with Prometheus
Monitoring Kubernetes with PrometheusMonitoring Kubernetes with Prometheus
Monitoring Kubernetes with PrometheusTobias Schmidt
 
A GitOps model for High Availability and Disaster Recovery on EKS
A GitOps model for High Availability and Disaster Recovery on EKSA GitOps model for High Availability and Disaster Recovery on EKS
A GitOps model for High Availability and Disaster Recovery on EKSWeaveworks
 
using Mithril.js + postgREST to build and consume API's
using Mithril.js + postgREST to build and consume API'susing Mithril.js + postgREST to build and consume API's
using Mithril.js + postgREST to build and consume API'sAntônio Roberto Silva
 
使用 Prometheus 監控 Kubernetes Cluster
使用 Prometheus 監控 Kubernetes Cluster 使用 Prometheus 監控 Kubernetes Cluster
使用 Prometheus 監控 Kubernetes Cluster inwin stack
 
How Honestbee Does CI/CD on Kubernetes - Vincent DeSmet
How Honestbee Does CI/CD on Kubernetes - Vincent DeSmetHow Honestbee Does CI/CD on Kubernetes - Vincent DeSmet
How Honestbee Does CI/CD on Kubernetes - Vincent DeSmetDevOpsDaysJKT
 
Jacopo Nardiello - Monitoring Cloud-Native applications with Prometheus - Cod...
Jacopo Nardiello - Monitoring Cloud-Native applications with Prometheus - Cod...Jacopo Nardiello - Monitoring Cloud-Native applications with Prometheus - Cod...
Jacopo Nardiello - Monitoring Cloud-Native applications with Prometheus - Cod...Codemotion
 
Cluster management with Kubernetes
Cluster management with KubernetesCluster management with Kubernetes
Cluster management with KubernetesSatnam Singh
 
Serverless Machine Learning Model Inference on Kubernetes with KServe.pdf
Serverless Machine Learning Model Inference on Kubernetes with KServe.pdfServerless Machine Learning Model Inference on Kubernetes with KServe.pdf
Serverless Machine Learning Model Inference on Kubernetes with KServe.pdfStavros Kontopoulos
 
Kubernetes x PaaS – コンテナアプリケーションのNoOpsへの挑戦
Kubernetes x PaaS – コンテナアプリケーションのNoOpsへの挑戦Kubernetes x PaaS – コンテナアプリケーションのNoOpsへの挑戦
Kubernetes x PaaS – コンテナアプリケーションのNoOpsへの挑戦Yoichi Kawasaki
 

Similar a 15 kubernetes failure points you should watch (20)

DevOpsDaysRiga 2018: Andrew Martin - Continuous Kubernetes Security
DevOpsDaysRiga 2018: Andrew Martin - Continuous Kubernetes Security DevOpsDaysRiga 2018: Andrew Martin - Continuous Kubernetes Security
DevOpsDaysRiga 2018: Andrew Martin - Continuous Kubernetes Security
 
Monitoring akka cluster on kubernetes
Monitoring akka cluster on kubernetesMonitoring akka cluster on kubernetes
Monitoring akka cluster on kubernetes
 
Kubernetes - Sailing a Sea of Containers
Kubernetes - Sailing a Sea of ContainersKubernetes - Sailing a Sea of Containers
Kubernetes - Sailing a Sea of Containers
 
Control Plane: Continuous Kubernetes Security (DevSecOps - London Gathering, ...
Control Plane: Continuous Kubernetes Security (DevSecOps - London Gathering, ...Control Plane: Continuous Kubernetes Security (DevSecOps - London Gathering, ...
Control Plane: Continuous Kubernetes Security (DevSecOps - London Gathering, ...
 
Hands-On Introduction to Kubernetes at LISA17
Hands-On Introduction to Kubernetes at LISA17Hands-On Introduction to Kubernetes at LISA17
Hands-On Introduction to Kubernetes at LISA17
 
Build Your Own CaaS (Container as a Service)
Build Your Own CaaS (Container as a Service)Build Your Own CaaS (Container as a Service)
Build Your Own CaaS (Container as a Service)
 
Keynote #Tech - Google : aperçu de la gestion des services distribués chez Go...
Keynote #Tech - Google : aperçu de la gestion des services distribués chez Go...Keynote #Tech - Google : aperçu de la gestion des services distribués chez Go...
Keynote #Tech - Google : aperçu de la gestion des services distribués chez Go...
 
Kubernetes extensibility: crd & operators
Kubernetes extensibility: crd & operators Kubernetes extensibility: crd & operators
Kubernetes extensibility: crd & operators
 
Kubernetes extensibility: CRDs & Operators
Kubernetes extensibility: CRDs & OperatorsKubernetes extensibility: CRDs & Operators
Kubernetes extensibility: CRDs & Operators
 
Using kubernetes to lose your fear of using containers
Using kubernetes to lose your fear of using containersUsing kubernetes to lose your fear of using containers
Using kubernetes to lose your fear of using containers
 
Azure kubernetes service (aks) part 3
Azure kubernetes service (aks)   part 3Azure kubernetes service (aks)   part 3
Azure kubernetes service (aks) part 3
 
Monitoring Kubernetes with Prometheus
Monitoring Kubernetes with PrometheusMonitoring Kubernetes with Prometheus
Monitoring Kubernetes with Prometheus
 
A GitOps model for High Availability and Disaster Recovery on EKS
A GitOps model for High Availability and Disaster Recovery on EKSA GitOps model for High Availability and Disaster Recovery on EKS
A GitOps model for High Availability and Disaster Recovery on EKS
 
using Mithril.js + postgREST to build and consume API's
using Mithril.js + postgREST to build and consume API'susing Mithril.js + postgREST to build and consume API's
using Mithril.js + postgREST to build and consume API's
 
使用 Prometheus 監控 Kubernetes Cluster
使用 Prometheus 監控 Kubernetes Cluster 使用 Prometheus 監控 Kubernetes Cluster
使用 Prometheus 監控 Kubernetes Cluster
 
How Honestbee Does CI/CD on Kubernetes - Vincent DeSmet
How Honestbee Does CI/CD on Kubernetes - Vincent DeSmetHow Honestbee Does CI/CD on Kubernetes - Vincent DeSmet
How Honestbee Does CI/CD on Kubernetes - Vincent DeSmet
 
Jacopo Nardiello - Monitoring Cloud-Native applications with Prometheus - Cod...
Jacopo Nardiello - Monitoring Cloud-Native applications with Prometheus - Cod...Jacopo Nardiello - Monitoring Cloud-Native applications with Prometheus - Cod...
Jacopo Nardiello - Monitoring Cloud-Native applications with Prometheus - Cod...
 
Cluster management with Kubernetes
Cluster management with KubernetesCluster management with Kubernetes
Cluster management with Kubernetes
 
Serverless Machine Learning Model Inference on Kubernetes with KServe.pdf
Serverless Machine Learning Model Inference on Kubernetes with KServe.pdfServerless Machine Learning Model Inference on Kubernetes with KServe.pdf
Serverless Machine Learning Model Inference on Kubernetes with KServe.pdf
 
Kubernetes x PaaS – コンテナアプリケーションのNoOpsへの挑戦
Kubernetes x PaaS – コンテナアプリケーションのNoOpsへの挑戦Kubernetes x PaaS – コンテナアプリケーションのNoOpsへの挑戦
Kubernetes x PaaS – コンテナアプリケーションのNoOpsへの挑戦
 

Más de Sysdig

Wordpress y Docker, de desarrollo a produccion
Wordpress y Docker, de desarrollo a produccionWordpress y Docker, de desarrollo a produccion
Wordpress y Docker, de desarrollo a produccionSysdig
 
What Prometheus means for monitoring vendors
What Prometheus means for monitoring vendorsWhat Prometheus means for monitoring vendors
What Prometheus means for monitoring vendorsSysdig
 
Docker Runtime Security
Docker Runtime SecurityDocker Runtime Security
Docker Runtime SecuritySysdig
 
CI / CD / CS - Continuous Security in Kubernetes
CI / CD / CS - Continuous Security in KubernetesCI / CD / CS - Continuous Security in Kubernetes
CI / CD / CS - Continuous Security in KubernetesSysdig
 
Continuous Security
Continuous SecurityContinuous Security
Continuous SecuritySysdig
 
The top 5 Kubernetes metrics to monitor
The top 5 Kubernetes metrics to monitorThe top 5 Kubernetes metrics to monitor
The top 5 Kubernetes metrics to monitorSysdig
 
The top 5 Kubernetes metrics to monitor
The top 5 Kubernetes metrics to monitorThe top 5 Kubernetes metrics to monitor
The top 5 Kubernetes metrics to monitorSysdig
 
Behavioural activity monitoring on CoreOS with Sysdig Falco
Behavioural activity monitoring on CoreOS with Sysdig FalcoBehavioural activity monitoring on CoreOS with Sysdig Falco
Behavioural activity monitoring on CoreOS with Sysdig FalcoSysdig
 
How to Monitor Microservices
How to Monitor MicroservicesHow to Monitor Microservices
How to Monitor MicroservicesSysdig
 
WTF my container just spawned a shell!
WTF my container just spawned a shell!WTF my container just spawned a shell!
WTF my container just spawned a shell!Sysdig
 
Trace everything, when APM meets SysAdmins
Trace everything, when APM meets SysAdminsTrace everything, when APM meets SysAdmins
Trace everything, when APM meets SysAdminsSysdig
 
You're monitoring Kubernetes Wrong
You're monitoring Kubernetes WrongYou're monitoring Kubernetes Wrong
You're monitoring Kubernetes WrongSysdig
 
The Dark Art of Container Monitoring - Spanish
The Dark Art of Container Monitoring - SpanishThe Dark Art of Container Monitoring - Spanish
The Dark Art of Container Monitoring - SpanishSysdig
 
Lions, Tigers and Deers: What building zoos can teach us about securing micro...
Lions, Tigers and Deers: What building zoos can teach us about securing micro...Lions, Tigers and Deers: What building zoos can teach us about securing micro...
Lions, Tigers and Deers: What building zoos can teach us about securing micro...Sysdig
 
Building Trustworthy Containers
Building Trustworthy ContainersBuilding Trustworthy Containers
Building Trustworthy ContainersSysdig
 
A brief history of system calls
A brief history of system callsA brief history of system calls
A brief history of system callsSysdig
 
Designing Tracing Tools
Designing Tracing ToolsDesigning Tracing Tools
Designing Tracing ToolsSysdig
 
Extending Sysdig with Chisel
Extending Sysdig with ChiselExtending Sysdig with Chisel
Extending Sysdig with ChiselSysdig
 
Intro to sysdig in 15 minutes
Intro to sysdig in 15 minutesIntro to sysdig in 15 minutes
Intro to sysdig in 15 minutesSysdig
 
Troubleshooting Kubernetes
Troubleshooting KubernetesTroubleshooting Kubernetes
Troubleshooting KubernetesSysdig
 

Más de Sysdig (20)

Wordpress y Docker, de desarrollo a produccion
Wordpress y Docker, de desarrollo a produccionWordpress y Docker, de desarrollo a produccion
Wordpress y Docker, de desarrollo a produccion
 
What Prometheus means for monitoring vendors
What Prometheus means for monitoring vendorsWhat Prometheus means for monitoring vendors
What Prometheus means for monitoring vendors
 
Docker Runtime Security
Docker Runtime SecurityDocker Runtime Security
Docker Runtime Security
 
CI / CD / CS - Continuous Security in Kubernetes
CI / CD / CS - Continuous Security in KubernetesCI / CD / CS - Continuous Security in Kubernetes
CI / CD / CS - Continuous Security in Kubernetes
 
Continuous Security
Continuous SecurityContinuous Security
Continuous Security
 
The top 5 Kubernetes metrics to monitor
The top 5 Kubernetes metrics to monitorThe top 5 Kubernetes metrics to monitor
The top 5 Kubernetes metrics to monitor
 
The top 5 Kubernetes metrics to monitor
The top 5 Kubernetes metrics to monitorThe top 5 Kubernetes metrics to monitor
The top 5 Kubernetes metrics to monitor
 
Behavioural activity monitoring on CoreOS with Sysdig Falco
Behavioural activity monitoring on CoreOS with Sysdig FalcoBehavioural activity monitoring on CoreOS with Sysdig Falco
Behavioural activity monitoring on CoreOS with Sysdig Falco
 
How to Monitor Microservices
How to Monitor MicroservicesHow to Monitor Microservices
How to Monitor Microservices
 
WTF my container just spawned a shell!
WTF my container just spawned a shell!WTF my container just spawned a shell!
WTF my container just spawned a shell!
 
Trace everything, when APM meets SysAdmins
Trace everything, when APM meets SysAdminsTrace everything, when APM meets SysAdmins
Trace everything, when APM meets SysAdmins
 
You're monitoring Kubernetes Wrong
You're monitoring Kubernetes WrongYou're monitoring Kubernetes Wrong
You're monitoring Kubernetes Wrong
 
The Dark Art of Container Monitoring - Spanish
The Dark Art of Container Monitoring - SpanishThe Dark Art of Container Monitoring - Spanish
The Dark Art of Container Monitoring - Spanish
 
Lions, Tigers and Deers: What building zoos can teach us about securing micro...
Lions, Tigers and Deers: What building zoos can teach us about securing micro...Lions, Tigers and Deers: What building zoos can teach us about securing micro...
Lions, Tigers and Deers: What building zoos can teach us about securing micro...
 
Building Trustworthy Containers
Building Trustworthy ContainersBuilding Trustworthy Containers
Building Trustworthy Containers
 
A brief history of system calls
A brief history of system callsA brief history of system calls
A brief history of system calls
 
Designing Tracing Tools
Designing Tracing ToolsDesigning Tracing Tools
Designing Tracing Tools
 
Extending Sysdig with Chisel
Extending Sysdig with ChiselExtending Sysdig with Chisel
Extending Sysdig with Chisel
 
Intro to sysdig in 15 minutes
Intro to sysdig in 15 minutesIntro to sysdig in 15 minutes
Intro to sysdig in 15 minutes
 
Troubleshooting Kubernetes
Troubleshooting KubernetesTroubleshooting Kubernetes
Troubleshooting Kubernetes
 

Último

Past, Present and Future of Generative AI
Past, Present and Future of Generative AIPast, Present and Future of Generative AI
Past, Present and Future of Generative AIabhishek36461
 
Sachpazis Costas: Geotechnical Engineering: A student's Perspective Introduction
Sachpazis Costas: Geotechnical Engineering: A student's Perspective IntroductionSachpazis Costas: Geotechnical Engineering: A student's Perspective Introduction
Sachpazis Costas: Geotechnical Engineering: A student's Perspective IntroductionDr.Costas Sachpazis
 
Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...
Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...
Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...srsj9000
 
complete construction, environmental and economics information of biomass com...
complete construction, environmental and economics information of biomass com...complete construction, environmental and economics information of biomass com...
complete construction, environmental and economics information of biomass com...asadnawaz62
 
Artificial-Intelligence-in-Electronics (K).pptx
Artificial-Intelligence-in-Electronics (K).pptxArtificial-Intelligence-in-Electronics (K).pptx
Artificial-Intelligence-in-Electronics (K).pptxbritheesh05
 
Call Us ≽ 8377877756 ≼ Call Girls In Shastri Nagar (Delhi)
Call Us ≽ 8377877756 ≼ Call Girls In Shastri Nagar (Delhi)Call Us ≽ 8377877756 ≼ Call Girls In Shastri Nagar (Delhi)
Call Us ≽ 8377877756 ≼ Call Girls In Shastri Nagar (Delhi)dollysharma2066
 
8251 universal synchronous asynchronous receiver transmitter
8251 universal synchronous asynchronous receiver transmitter8251 universal synchronous asynchronous receiver transmitter
8251 universal synchronous asynchronous receiver transmitterShivangiSharma879191
 
Concrete Mix Design - IS 10262-2019 - .pptx
Concrete Mix Design - IS 10262-2019 - .pptxConcrete Mix Design - IS 10262-2019 - .pptx
Concrete Mix Design - IS 10262-2019 - .pptxKartikeyaDwivedi3
 
Churning of Butter, Factors affecting .
Churning of Butter, Factors affecting  .Churning of Butter, Factors affecting  .
Churning of Butter, Factors affecting .Satyam Kumar
 
CCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdf
CCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdfCCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdf
CCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdfAsst.prof M.Gokilavani
 
IVE Industry Focused Event - Defence Sector 2024
IVE Industry Focused Event - Defence Sector 2024IVE Industry Focused Event - Defence Sector 2024
IVE Industry Focused Event - Defence Sector 2024Mark Billinghurst
 
Software and Systems Engineering Standards: Verification and Validation of Sy...
Software and Systems Engineering Standards: Verification and Validation of Sy...Software and Systems Engineering Standards: Verification and Validation of Sy...
Software and Systems Engineering Standards: Verification and Validation of Sy...VICTOR MAESTRE RAMIREZ
 
Introduction to Machine Learning Unit-3 for II MECH
Introduction to Machine Learning Unit-3 for II MECHIntroduction to Machine Learning Unit-3 for II MECH
Introduction to Machine Learning Unit-3 for II MECHC Sai Kiran
 
UNIT III ANALOG ELECTRONICS (BASIC ELECTRONICS)
UNIT III ANALOG ELECTRONICS (BASIC ELECTRONICS)UNIT III ANALOG ELECTRONICS (BASIC ELECTRONICS)
UNIT III ANALOG ELECTRONICS (BASIC ELECTRONICS)Dr SOUNDIRARAJ N
 
Application of Residue Theorem to evaluate real integrations.pptx
Application of Residue Theorem to evaluate real integrations.pptxApplication of Residue Theorem to evaluate real integrations.pptx
Application of Residue Theorem to evaluate real integrations.pptx959SahilShah
 
Why does (not) Kafka need fsync: Eliminating tail latency spikes caused by fsync
Why does (not) Kafka need fsync: Eliminating tail latency spikes caused by fsyncWhy does (not) Kafka need fsync: Eliminating tail latency spikes caused by fsync
Why does (not) Kafka need fsync: Eliminating tail latency spikes caused by fsyncssuser2ae721
 
computer application and construction management
computer application and construction managementcomputer application and construction management
computer application and construction managementMariconPadriquez1
 

Último (20)

Past, Present and Future of Generative AI
Past, Present and Future of Generative AIPast, Present and Future of Generative AI
Past, Present and Future of Generative AI
 
Sachpazis Costas: Geotechnical Engineering: A student's Perspective Introduction
Sachpazis Costas: Geotechnical Engineering: A student's Perspective IntroductionSachpazis Costas: Geotechnical Engineering: A student's Perspective Introduction
Sachpazis Costas: Geotechnical Engineering: A student's Perspective Introduction
 
Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...
Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...
Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...
 
complete construction, environmental and economics information of biomass com...
complete construction, environmental and economics information of biomass com...complete construction, environmental and economics information of biomass com...
complete construction, environmental and economics information of biomass com...
 
Exploring_Network_Security_with_JA3_by_Rakesh Seal.pptx
Exploring_Network_Security_with_JA3_by_Rakesh Seal.pptxExploring_Network_Security_with_JA3_by_Rakesh Seal.pptx
Exploring_Network_Security_with_JA3_by_Rakesh Seal.pptx
 
Artificial-Intelligence-in-Electronics (K).pptx
Artificial-Intelligence-in-Electronics (K).pptxArtificial-Intelligence-in-Electronics (K).pptx
Artificial-Intelligence-in-Electronics (K).pptx
 
Call Us ≽ 8377877756 ≼ Call Girls In Shastri Nagar (Delhi)
Call Us ≽ 8377877756 ≼ Call Girls In Shastri Nagar (Delhi)Call Us ≽ 8377877756 ≼ Call Girls In Shastri Nagar (Delhi)
Call Us ≽ 8377877756 ≼ Call Girls In Shastri Nagar (Delhi)
 
8251 universal synchronous asynchronous receiver transmitter
8251 universal synchronous asynchronous receiver transmitter8251 universal synchronous asynchronous receiver transmitter
8251 universal synchronous asynchronous receiver transmitter
 
Concrete Mix Design - IS 10262-2019 - .pptx
Concrete Mix Design - IS 10262-2019 - .pptxConcrete Mix Design - IS 10262-2019 - .pptx
Concrete Mix Design - IS 10262-2019 - .pptx
 
Churning of Butter, Factors affecting .
Churning of Butter, Factors affecting  .Churning of Butter, Factors affecting  .
Churning of Butter, Factors affecting .
 
Call Us -/9953056974- Call Girls In Vikaspuri-/- Delhi NCR
Call Us -/9953056974- Call Girls In Vikaspuri-/- Delhi NCRCall Us -/9953056974- Call Girls In Vikaspuri-/- Delhi NCR
Call Us -/9953056974- Call Girls In Vikaspuri-/- Delhi NCR
 
CCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdf
CCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdfCCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdf
CCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdf
 
IVE Industry Focused Event - Defence Sector 2024
IVE Industry Focused Event - Defence Sector 2024IVE Industry Focused Event - Defence Sector 2024
IVE Industry Focused Event - Defence Sector 2024
 
Software and Systems Engineering Standards: Verification and Validation of Sy...
Software and Systems Engineering Standards: Verification and Validation of Sy...Software and Systems Engineering Standards: Verification and Validation of Sy...
Software and Systems Engineering Standards: Verification and Validation of Sy...
 
Introduction to Machine Learning Unit-3 for II MECH
Introduction to Machine Learning Unit-3 for II MECHIntroduction to Machine Learning Unit-3 for II MECH
Introduction to Machine Learning Unit-3 for II MECH
 
UNIT III ANALOG ELECTRONICS (BASIC ELECTRONICS)
UNIT III ANALOG ELECTRONICS (BASIC ELECTRONICS)UNIT III ANALOG ELECTRONICS (BASIC ELECTRONICS)
UNIT III ANALOG ELECTRONICS (BASIC ELECTRONICS)
 
Application of Residue Theorem to evaluate real integrations.pptx
Application of Residue Theorem to evaluate real integrations.pptxApplication of Residue Theorem to evaluate real integrations.pptx
Application of Residue Theorem to evaluate real integrations.pptx
 
Why does (not) Kafka need fsync: Eliminating tail latency spikes caused by fsync
Why does (not) Kafka need fsync: Eliminating tail latency spikes caused by fsyncWhy does (not) Kafka need fsync: Eliminating tail latency spikes caused by fsync
Why does (not) Kafka need fsync: Eliminating tail latency spikes caused by fsync
 
9953056974 Call Girls In South Ex, Escorts (Delhi) NCR.pdf
9953056974 Call Girls In South Ex, Escorts (Delhi) NCR.pdf9953056974 Call Girls In South Ex, Escorts (Delhi) NCR.pdf
9953056974 Call Girls In South Ex, Escorts (Delhi) NCR.pdf
 
computer application and construction management
computer application and construction managementcomputer application and construction management
computer application and construction management
 

15 kubernetes failure points you should watch

  • 1. you should watch Jorge Salamero - @bencerillo 15 Kubernetes failure points
  • 2. Jorge Salamero Tech Marketing aka container gamer @ Sysdig github.com/bencer @bencerillo OSS fan Monitoring, containers, IoT/home-automation, cars About me
  • 3. Monitoring & Security Platform for Containers
  • 4. Monitoring 15 Kubernetes failure points - Apps - Hosts - Orchestration - Containers - Yourself https://sysdig.com/blog/monitoring-kubernetes-with-sysdig-cloud/ https://sysdig.com/blog/alerting-kubernetes/
  • 5. The holy service metrics - KPI / biz metrics / synthetic monitoring / user metrics - Google SRE book: “The Four Golden Signals” Latency+Traffic+Errors+Saturation
  • 6. USE method - Utilization (how busy we are, close to 100% bottleneck) - Saturation (amount of work waiting on the queue) - Errors
  • 7. RED method - Request Rate - Request Errors - Request Duration
  • 8. The holy service metrics - Code instrumentation (statsd, JMX or Prometheus metrics): var httpDurationsHistogram := prometheus.NewHistogramVec(prometheus.HistogramOpts{ Name: "http_durations_histogram_seconds", Help: "Seconds spent serving HTTP requests.", Buckets: prometheus.DefBuckets, }, []string{"method", "route", "status_code"}) prometheus.MustRegister(httpDurationsHistogram) - or Sysdig autodiscovery ;-)
  • 9. 1. connections per second net.request.count 2. response time net.response.time 3. errors net.request.error.count
  • 14. Kubernetes metadata: labels Pod app: shopping tier: api Pod app: shopping tier: db Pod app: social tier: api role: search Pod app: social tier: api role: search
  • 17. Health vs state monitoring - Health: - CPU, memory, disk - connections, response time, errors
  • 18. Health vs state monitoring - State (orchestration): - Are containers up and running properly?
  • 19. Health vs state monitoring - kube-state-metrics https://github.com/kubernetes/kube-state-metrics https://sysdig.com/blog/introducing-kube-state-metrics/ calculate new metrics based on the state of Kubernetes resources
  • 20. Container scheduling - Need to deploy a container: - given the requirements, where can we run it? and let’s ignore affinity, taints and tolerations: https://sysdig.com/blog/kubernetes-scheduler/ - capacity planning
  • 21. 4. node availability Based on the host or the kubelet component status: kube_node_status_condition{condition="Ready",status="true"} == 0 count(kube_node_status_condition{condition="Ready",status="true"} == 0) > 1 and (count(kube_node_status_condition{condition="Ready",status="true"} == 0) / count(kube_node_status_condition{condition="Ready",status="true"})) > 0.2 count(up{job="kubelet"} == 0) / count(up{job="kubelet"}) * 100 > 3 kube_node_status_condition: kube_node_status_ready, kube_node_status_out_of_disk, kube_node_status_memory_pressure, kube_node_status_disk_pressure, and kube_node_status_network_unavailable
  • 23. Container resource requirements resources: requests: memory: "256Mi" cpu: "250m" limits: memory: "512Mi" cpu: "500m" https://github.com/kubernetes-incubator/cluster-capacity
  • 24. 5. CPU resources 6. memory resources kube_node_status_capacity_pods kube_node_status_allocatable_pods kube_node_status_capacity_cpu_cores kube_node_status_capacity_memory_bytes kube_node_status_allocatable_cpu_cores kube_node_status_allocatable_memory_bytes capacity - used (by OS and kube services) = allocatable
  • 25. Container disk requirements here things get more complicated... - ephemeral disk usage - persistent volumes claims
  • 26. 7. disk resources predict_linear(node_filesystem_free[30m], 3600 * 2) < 0 kube_node_status_condition: kube_node_status_out_of_disk but within containers this is still WIP, at least Kubernetes 1.8: container_fs_* doesn’t work with PV https://github.com/kubernetes/kubernetes/pull/59170 https://github.com/kubernetes/kubernetes/pull/51553 https://kubernetes.io/docs/concepts/cluster-administration/controller-metrics/
  • 27. Container orchestration - ReplicationController - ReplicaSet - Deployment - DaemonSet - StatefulSet
  • 28. Kubernetes deployments Is Kubernetes doing what is supposed to to? Orchestration needs monitoring too.
  • 30. 9. desired instances ((kube_deployment_status_replicas_updated != kube_deployment_spec_replicas) or (kube_deployment_status_replicas_available != kube_deployment_spec_replicas))
  • 31. 10. deployment updates glitches kube_deployment_status_observed_generation != kube_deployment_metadata_generation kube_deployment_spec_paused kube_deployment_spec_strategy_rollingupdate_max_unavailable
  • 33. Liveness probes To know when to restart a container: livenessProbe: httpGet: path: /healthz port: 8080 httpHeaders: - name: X-Custom-Header value: Awesome initialDelaySeconds: 3 periodSeconds: 3
  • 34. Ready-ness probes To know when a container is ready to start accepting traffic: readinessProbe: exec: command: - cat - /tmp/healthy initialDelaySeconds: 5 periodSeconds: 5
  • 35. 11. pod status kube_pod_status_phase: Pending|Running|Succeeded|Failed|Unknown kube_pod_status_ready kube_pod_status_scheduled kube_pod_container_status_waiting kube_pod_container_status_running kube_pod_container_status_terminated kube_pod_container_status_ready
  • 36. 12. pod restarts You can look at this as a metric or as an event: ALERT PodRestartingTooMuch IF rate(k8s_pod_status_restartCount[1m]) > 1/(5*60) FOR 1h LABELS { severity="warning" } ANNOTATIONS { summary = "Pod {{$labels.namespace}}/{{$label.name}} restarting too much.", description = "Pod {{$labels.namespace}}/{{$label.name}} restarting too much.", }
  • 39. Kubernetes internals - APIserver - KubeDNS / Istio - container registry - any other piece of Kubernetes https://sysdig.com/blog/monitor-etcd/
  • 40. 13. APIserver rate(apiserver_request_count{code=~"^(?:5..)$"}[5m]) / rate(apiserver_request_count[5m])* 100 > 5 apiserver_latency_seconds:quantile{quantile="0.99",subresource!="log",verb! ~"^(?:WATCH|WATCHLIST|PROXY|CONNECT)$"}> 4 Or just do Golden signals on APIserver endpoint too :-)
  • 41. 14. KubeDNS / Istio histogram_quantile(0.95, sum(rate(kubedns_probe_kubedns_latency_ms_bucket[1m])) BY (le, kubernetes_pod_name)) > 1000 All export native metrics in Prometheus format, just scrape them! https://sysdig.com/blog/monitor-istio/
  • 42. What are we deploying? - CI/CD and commits - Manual deploys You need to validate what you tell Kubernetes too!
  • 43. 15. monitor your commands kubeval: validates YAML and JSON config files https://github.com/garethr/kubeval kube-diff: show differences between running state and version controlled configuration https://github.com/weaveworks/kubediff Configuration reconciliation discussion: https://github.com/kubernetes/kubernetes/issues/1702 Although this is getting automated too: https://sysdig.com/blog/kubernetes-scaler/
  • 44. Recap 1. connections per second 2. response time 3. errors 4. node availability 5. CPU resources 6. memory resources 7. disk and external resources
  • 45. Recap (2) 8. running instances 9. desired instances 10. deployment updates glitches
  • 46. Recap (3) 11. pod status 12. pod restarts 13. APIserver health 14. KubeDNS / Istio health 15. monitor your commands
  • 47. Grazie! Jorge Salamero - @bencerillo https://sysdig.com/blog/