6. Why you need Kubernetes and what it can do
Self-healing
Kubernetes restarts containers that fail, replaces containers, kills
containers that don’t respond to your user-defined health check, and
doesn’t advertise them to clients until they are ready to serve.
What is Kubernetes? – kubernetes.io
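As a rough illustration of the behavior quoted above, the following Python sketch models containers with a liveness check and a readiness check. The names `Container`, `supervise`, and `endpoints` are hypothetical and are not part of any Kubernetes API; the real kubelet and endpoint handling are considerably more involved.

```python
from dataclasses import dataclass


@dataclass
class Container:
    """Toy model of a container; not a Kubernetes API object."""
    name: str
    alive: bool = True    # outcome of a (hypothetical) liveness check
    ready: bool = False   # outcome of a (hypothetical) readiness check
    restarts: int = 0


def supervise(containers):
    """Restart every container whose liveness check has failed."""
    for c in containers:
        if not c.alive:
            # A restarted container is alive again but not yet ready.
            c.alive = True
            c.ready = False
            c.restarts += 1


def endpoints(containers):
    """Advertise only containers that are alive and ready to serve."""
    return [c.name for c in containers if c.alive and c.ready]


pods = [Container("web-0", ready=True), Container("web-1", ready=True)]
pods[1].alive = False     # simulate a failed health check
supervise(pods)
after_restart = endpoints(pods)   # web-1 was restarted and is not ready yet
pods[1].ready = True      # the readiness check passes later
after_ready = endpoints(pods)
print(after_restart, after_ready)
```

The point of the sketch is the separation of concerns: restarting is driven by liveness, while advertising to clients is driven by readiness.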
8. What Kubernetes is not
Does not provide nor adopt any comprehensive machine
configuration, maintenance, management, or self-healing systems.
What is Kubernetes? – kubernetes.io
14. Chaos Engineering
A technique for "randomly injecting faults to discover something"?
Random strategies waste resources testing “uninteresting” faults,
while programmer-guided approaches are only as good as the
intuition of a programmer and only scale with human effort.
Automating Failure Testing Research at Internet Scale. Alvaro, P et al.
ACM Conference Proceedings 2016 pp: 17-28
15. 1. Start by defining ‘steady state’ as some measurable output of a system that indicates
normal behavior.
2. Hypothesize that this steady state will continue in both the control group and the
experimental group.
3. Introduce variables that reflect real world events like servers that crash, hard drives that
malfunction, network connections that are severed, etc.
4. Try to disprove the hypothesis by looking for a difference in steady state between the
control group and the experimental group.
CHAOS IN PRACTICE - PRINCIPLES OF CHAOS
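The four steps above can be sketched as a small simulation. This is a minimal sketch under stated assumptions: the steady-state metric is a request success rate, a fault is modeled simply as a higher failure probability in the experimental group, and `success_rate`, `run_experiment`, the baseline of 0.01, and the tolerance of 0.05 are all hypothetical choices made for illustration.

```python
import random


def success_rate(requests, failure_probability, rng):
    """Step 1: the steady-state metric, the fraction of successful requests."""
    ok = sum(1 for _ in range(requests) if rng.random() >= failure_probability)
    return ok / requests


def run_experiment(fault_failure_probability, tolerance=0.05, seed=0):
    rng = random.Random(seed)
    baseline = 0.01  # assumed failure probability without an injected fault

    # Step 2: hypothesize that both groups stay within `tolerance` of each other.
    control = success_rate(1000, baseline, rng)

    # Step 3: introduce a real-world event (modeled as extra request failures).
    experimental = success_rate(
        1000, max(baseline, fault_failure_probability), rng
    )

    # Step 4: try to disprove the hypothesis by comparing the two groups.
    difference = abs(control - experimental)
    return {
        "control": control,
        "experimental": experimental,
        "hypothesis_holds": difference <= tolerance,
    }


resilient = run_experiment(fault_failure_probability=0.01)  # fault is absorbed
fragile = run_experiment(fault_failure_probability=0.5)     # fault leaks through
print(resilient["hypothesis_holds"], fragile["hypothesis_holds"])
```

When the hypothesis is disproved, as in the `fragile` case, the experiment has found a weakness worth investigating.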
26. STPA Control Structure
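[Figure: STPA control structure. Two Controllers sit above a Controlled Process, ordered from high authority (top) to low authority (bottom). Each Controller contains a Control Algorithm and a Process Model. Control Actions and Feedback connect adjacent levels.]
Control Algorithm: the controller's decision-making process (e.g., program logic)
Process Model: what the controller believes (e.g., the state of the controlled process)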
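The relationship between the control algorithm, the process model, control actions, and feedback can be sketched in Python as follows. This is only an illustrative model: the classes `Controller` and `ControlledProcess` and the use of a replica count as the process state are assumptions made here, not something taken from the slides. It shows how a stale process model (lost feedback) leads the controller to withhold a control action that is actually needed.

```python
from dataclasses import dataclass, field


@dataclass
class ControlledProcess:
    replicas: int = 0

    def apply(self, control_action):
        """Receive a control action from the controller."""
        self.replicas += control_action

    def feedback(self):
        """Report the actual state back to the controller."""
        return self.replicas


@dataclass
class Controller:
    target: int
    # Process model: what the controller believes about the process.
    process_model: dict = field(default_factory=lambda: {"replicas": 0})

    def update_model(self, feedback):
        self.process_model["replicas"] = feedback

    def control_algorithm(self):
        """Decide a control action from the process model, not from reality."""
        return self.target - self.process_model["replicas"]


process = ControlledProcess(replicas=3)
controller = Controller(target=3)
controller.update_model(process.feedback())

process.replicas = 1                            # two replicas crash, feedback is lost
stale_action = controller.control_algorithm()   # the model still says 3

controller.update_model(process.feedback())     # feedback finally arrives
fresh_action = controller.control_algorithm()
process.apply(fresh_action)
print(stale_action, fresh_action, process.replicas)
```

The controller behaves correctly with respect to what it believes; the hazard comes from the gap between the process model and the actual process.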
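33. It is bound to be complex, so split it into subsystems
[Figure: the control structure split into a Control Plane Subsystem and an App/Data Plane Subsystem. Elements: Developer/Admin, API Server, Controllers (Control Plane), Controllers (App/Data Plane), Controlled Process. Edge labels: CA, F, LW.]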
34. App/Data Plane Subsystem
[Figure: control structure of the App/Data Plane Subsystem. Elements: Developer/Admin, API Server, kubelet, kube-proxy, Container Runtime, Node Features / Configs, Pods (incl. system Pods, e.g. DaemonSet, Operator), Infra. API. Edge labels: CA1 to CA12, each paired with F or LW (LW5, LW6, LW12; F for the rest).]
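46. CAST (Causal Analysis using System Theory)
STPA is before the fact; CAST is after the fact
Underlying theory: STAMP (System-Theoretic Accident Model and Processes)
Methodologies:
STPA (System-Theoretic Process Analysis): analyzes hazard scenarios from the structure (before the fact)
CAST (Causal Analysis using System Theory): analyzes causes from an accident (after the fact)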
50. Story # | Title
1 | DNS issues in Kubernetes. Public postmortem #1
2 | CPU limits and aggressive throttling in Kubernetes
3 | When GKE ran out of IP addresses
4 | Sailing with the Istio through the shallow water
5 | Kubernetes made my latency 10x higher
6 | A Kubernetes crime story
7 | New K8s workers unable to join cluster
8 | How a simple admission webhook lead to a cluster outage
9 | Post Mortem: Kubernetes Node OOM
10 | Kubernetes' dirty endpoint secret and Ingress
11 | How a Production Outage Was Caused Using Kubernetes Pod Priorities
12 | Moving to Kubernetes: the Bad and the Ugly
13 | Kubernetes Failure Stories, or: How to Crash Your Cluster
14 | Build Errors of Continuous Delivery Platform
15 | 10 ways to shoot yourself in the foot with kubernetes
16 | Keynote: How Spotify Accidentally Deleted All its Kube Clusters with No User Impact
17 | Oh Sh*t! The Config Changed!
18 | Misunderstanding the behaviour of one templating line — and the pain it caused our k8s clusters
19 | How to kill the Algolia dashboard during Black Friday
20 | Outage post-mortem
21 | The shipwreck of GKE Cluster Upgrade
22 | Breaking Kubernetes: How We Broke and Fixed our K8s Cluster
23 | Maximize learnings from a Kubernetes cluster failure
24 | Kubernetes Load Balancer Configuration – Beware when draining nodes
25 | On Infrastructure at Scale: A Cascading Failure of Distributed System
26 | A Perfect DNS Storm
27 | Kubernetes and the Menace ELB, the tale of an outage
28 | Moving the entire stack to k8s within a year – lessons learned
29 | Anatomy of a Production Kubernetes Outage
30 | “Break and Recover” Kubernetes Cluster
31 | 101 Ways to Crash Your Cluster
32 | Search and Reporting Outage
33 | SaleMove US System Issue
56. Infrastructure as Data
Zhang Lei, Staff Engineer of Alibaba Cloud, CNCF Ambassador, Co-chair of
CNCF SIG App Delivery, and maintainer of Kubernetes.
Fundamentals of Declarative Application Management in Kubernetes
57. Infrastructure as Data centered on the Kubernetes API Server
[Figure: Infrastructure as Data around the API Server. Elements: Developer/Admin, Resource, Custom Resource, API Server + etcd, Validating Webhook, Mutating Webhook, Controller, App/Data Plane Agent, Pods. Labels: Validation/Mutation, Trigger, Desired State, Current State.]
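The elements in this diagram can be tied together with a small Python sketch: a resource is mutated and validated on submission, stored as desired state, and a controller then drives the current state toward it. Everything here is a simplified stand-in: `ApiServer`, `mutate`, `validate`, `reconcile`, and the `replicas` field are hypothetical names for illustration, and the in-memory dictionary merely plays the role of etcd.

```python
def mutate(resource):
    """Stand-in for a mutating webhook: fill in a default replica count."""
    mutated = dict(resource)
    mutated.setdefault("replicas", 1)
    return mutated


def validate(resource):
    """Stand-in for a validating webhook: reject an invalid desired state."""
    if resource["replicas"] < 0:
        raise ValueError("replicas must be non-negative")


class ApiServer:
    def __init__(self):
        self.store = {}  # plays the role of etcd: the desired state

    def submit(self, name, resource):
        resource = mutate(resource)   # mutation runs before validation
        validate(resource)
        self.store[name] = resource


def reconcile(api, current_pods):
    """Controller: drive the current state toward the desired state."""
    for name, resource in api.store.items():
        pods = current_pods.setdefault(name, [])
        while len(pods) < resource["replicas"]:
            pods.append(f"{name}-{len(pods)}")   # create a missing Pod
        while len(pods) > resource["replicas"]:
            pods.pop()                           # remove a surplus Pod
    return current_pods


api = ApiServer()
api.submit("web", {"replicas": 3})
api.submit("job", {})                 # replicas is defaulted by mutate()
state = reconcile(api, {})
scaled_up = {name: list(pods) for name, pods in state.items()}

api.submit("web", {"replicas": 1})    # the user only changes the data
state = reconcile(api, state)
print(scaled_up, state)
```

Note that the user never issues imperative commands: only the stored data changes, and the controller derives the required actions from the difference between desired and current state.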