5. “Open-source platform for automating deployment, scaling, and operations of application containers across clusters of hosts, providing container-centric infrastructure”
- Kubernetes Documentation
18. etcd is designed for “large scale distributed systems… that never tolerate split brain behavior and are willing to sacrifice availability” to achieve it
- etcd Documentation
37. What happens if a follower doesn’t get a heartbeat?
Election time!
38. In the game of Raft leadership elections, you win or you lose.
39. 1. Write goes to leader
2. Leader appends command to its log
3. Leader tells other servers via RPC to append it to their logs (followers will say no if they’re behind)
4. Once a majority have appended, the leader commits
5. Leader tells nodes in subsequent messages about the last committed entry
6. Nodes commit
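Here’s a rough Go sketch of the leader’s side of that flow - the types and the RPC shape are my own simplification for illustration, not etcd’s actual implementation:

```go
package raft

// Hypothetical, simplified types - not etcd's real code.
type LogEntry struct {
	Term    int    // election term when the entry was created
	Command string // state machine command to apply
}

type Peer interface {
	// AppendEntries returns false if the follower's log is behind.
	AppendEntries(term int, entries []LogEntry) bool
}

type Leader struct {
	term        int
	log         []LogEntry
	commitIndex int // highest log index known to be committed
	peers       []Peer
}

// Write walks through steps 1-5 above for a single client command.
func (l *Leader) Write(cmd string) {
	// Step 2: append the command to the leader's own log.
	l.log = append(l.log, LogEntry{Term: l.term, Command: cmd})
	acks := 1 // the leader counts itself
	for _, p := range l.peers {
		// Step 3: replicate via RPC; followers say no if they're behind.
		if p.AppendEntries(l.term, l.log) {
			acks++
		}
	}
	// Step 4: commit once a majority have appended the entry.
	if acks > (len(l.peers)+1)/2 {
		l.commitIndex = len(l.log) - 1
		// Step 5: the new commitIndex rides along on subsequent
		// messages, and followers commit when they see it (step 6).
	}
}
```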
55. What We Expect
1. We create deployment
2. Deployment creates a replica set
3. Replica set creates three pods
4. Our scheduler schedules those three pods
5. Kubelet will run scheduled pods
83. Things We’ve Done
● Looked at Kubernetes components
● Shown how it handles distributed state
● Dove into how we reconcile state and schedule pods
● Traced a deployment through the system
These are pretty cool things, and most talks about Kubernetes concentrate on them.
I’d like to do something a little different.
Let’s all agree that Kubernetes can do these things, and that they’re cool. I want to peek behind the curtain.
https://jvns.ca/blog/2016/10/10/what-even-is-a-container/
Cgroups - specify limits for how much memory and CPU a process can use
Namespaces - stop a process from interfering with other processes
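For a feel of what that means in practice, here’s a minimal Linux-only Go sketch in the spirit of that post - it drops a shell into fresh UTS, PID, and mount namespaces (cgroup limits omitted; needs root):

```go
package main

import (
	"os"
	"os/exec"
	"syscall"
)

func main() {
	// Run a shell in its own namespaces so it can't see or touch
	// other processes' hostname, process tree, or mounts.
	cmd := exec.Command("/bin/sh")
	cmd.Stdin, cmd.Stdout, cmd.Stderr = os.Stdin, os.Stdout, os.Stderr
	cmd.SysProcAttr = &syscall.SysProcAttr{
		Cloneflags: syscall.CLONE_NEWUTS | // own hostname
			syscall.CLONE_NEWPID | // own PID numbering
			syscall.CLONE_NEWNS, // own mount table
	}
	if err := cmd.Run(); err != nil {
		panic(err)
	}
}
```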
Containerized app examples - nginx deployment, microservice, Redis instance
1 or more containers that share a unique IP; guaranteed to have all containers running on the same host
can share disk volumes (ex. App that logs to file, sidecar to forward logs)
Declarative way of managing pods
What containers we want to run and how many instances
Uses replica sets under the hood - they make sure the # running in the cluster = the # desired
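The reconciliation loop at the heart of a replica set is conceptually tiny - something like this sketch (types are made up for illustration):

```go
package kube

// Hypothetical types standing in for the real API objects.
type Pod struct{ Name string }

type Cluster interface {
	CreatePod()
	DeletePod(Pod)
}

// reconcile nudges the actual pod count toward the desired count.
func reconcile(desired int, running []Pod, c Cluster) {
	diff := desired - len(running)
	switch {
	case diff > 0: // too few pods: create the difference
		for i := 0; i < diff; i++ {
			c.CreatePod()
		}
	case diff < 0: // too many pods: delete the extras
		for _, p := range running[:-diff] {
			c.DeletePod(p)
		}
	}
}
```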
Etcd - distributed k/v store that serves as Kubernetes database
API server - controls access to etcd - interacted with via REST calls
Controller manager - background threads that handle routine cluster tasks such as making sure a deployment has enough pods deployed and triggers the scheduling of new ones if needed
Scheduler - schedules unscheduled pods; whenever there’s an unscheduled pod, the scheduler determines the appropriate home for it
These components collectively make up the control plane and run on the master hosts in the cluster.
Kubelet - watches for pods that have been assigned to its node and runs them; constantly polls the API server + local config
cAdvisor - metrics collector
Kube proxy - handles service networking on each node, routing traffic for service IPs to the right pods
https://www.comp.nus.edu.sg/~gilbert/pubs/BrewersConjecture-SigAct.pdf
Consistency - always reading the most recently written value
Availability - every request to a non-failing node receives a response
Partition-tolerance - system can handle the dropping of messages between nodes
Etcd is consistent/partition-tolerant
If, in a 5 node cluster, 3 nodes take a dive, the other 2 stop responding to requests until quorum is restored
If your goal is to have a system for running containerized applications, you need something that plays nice with clustered systems
CAP theorem reference - etcd is going to sacrifice availability to preserve consistency
Clarification -> Etcd unavailability doesn’t take down k8s, just the ability to mutate cluster state
Typical CRUD operations through rest calls or command line tool - etcdctl
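For example, with etcd’s Go client (the endpoint is an assumption; same effect as `etcdctl put foo bar` followed by `etcdctl get foo`):

```go
package main

import (
	"context"
	"fmt"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	// Connect to a local etcd member (endpoint is an assumption).
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"localhost:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()

	// Write a key, then read it back.
	if _, err := cli.Put(ctx, "foo", "bar"); err != nil {
		panic(err)
	}
	resp, err := cli.Get(ctx, "foo")
	if err != nil {
		panic(err)
	}
	for _, kv := range resp.Kvs {
		fmt.Printf("%s = %s\n", kv.Key, kv.Value)
	}
}
```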
How do the nodes agree on what a value is for a given key?
If not done smartly, things can go awry pretty quickly
How does etcd agree on the value for a given key while still upholding its consistency guarantee?
Method of achieving distributed consensus - having multiple distributed servers agree on what the value is for a given key in a fault tolerant manner
Trivial way to solve - the value is always 3
Imagine a system that does NOT implement the Raft algorithm
Three nodes, A, B, C that each store a single value
When a client wants to read a value it contacts any node and gets its value
When a client wants to write a value, it contacts any node, and that node tells all the other nodes of the new value
Now contemplate what happens if three clients update the value in three different ways at the same time.
At time t=0, 3 clients write 3 different values to our 3 nodes simultaneously
At time t=5, after all updates have been applied, what’s the value, assuming no other writes have occurred?
No idea!
In fact, it’s possible for each node to have a different value depending on what order messages were received in
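A toy simulation makes the point - same three writes, three delivery orders, three different “final” values:

```go
package main

import "fmt"

// Each naive node just keeps the last value it heard.
func apply(writes []string) string {
	value := ""
	for _, w := range writes {
		value = w
	}
	return value
}

func main() {
	// The same three concurrent writes arrive in different orders.
	a := apply([]string{"X", "Y", "Z"})
	b := apply([]string{"Z", "X", "Y"})
	c := apply([]string{"Y", "Z", "X"})
	fmt.Println(a, b, c) // prints: Z Y X - three nodes, three answers
}
```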
Now imagine a network partition: node C is cut off, and all of the update messages sent to it are lost forever
After the network recovers, the value at C is still going to be M until someone updates the value.
Seemingly simple systems can fail in all sorts of Fun ways when exposed to concurrent operations
In order for consensus to be achieved, we’re going to require greater coordination between our nodes
This is where Raft comes into play
Replicated log - series of commands executed in order by a state machine
We want each log to have the same commands in the same order so that each machine has the same state
The leader accepts entries from clients, replicates them on other nodes, and says when it’s safe to actually apply them
If leader fails, a new one is elected
Followers just hang out, respond to requests from leaders
Leader handles all the requests and does all of the fun coordination
Candidate is the state used to elect a new leader
Part of dealing with ordered statements is dealing with time
Raft uses an incrementing integer timestamp called a term, which is tied to an election
Terms last until there’s a new leader
Serves as a logical clock and aids in detecting obsolete info (like stale leaders)
Leaders send heartbeat messages to the follower nodes, saying “Hey I’m still alive”
If a follower doesn’t get a heartbeat in time, it increments the term and declares itself a candidate
Sends an RPC request out for votes to all of the other servers
Servers vote for the first candidate who asks for their votes, assuming the candidate is at least as up to date as the voter.
If the voter has entries that the candidate doesn’t, then the voter isn’t going to vote for that candidate.
If the candidate gets a majority of the votes, it wins!
It sends out a new set of heartbeat messages to notify others of its victory, and life goes on
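In sketch form, the follower/candidate side looks roughly like this (heavily simplified - real Raft randomizes the timeout per node and piggybacks more state on the RPCs):

```go
package raft

import "time"

// Simplified stand-ins for the real Raft state and RPCs.
type VotePeer interface {
	RequestVote(term, lastLogIndex, lastLogTerm int) bool
}

type Node struct {
	term            int
	lastLogIndex    int
	lastLogTerm     int
	electionTimeout time.Duration
	heartbeats      chan struct{}
	peers           []VotePeer
	isLeader        bool
}

func (n *Node) runFollower() {
	for {
		select {
		case <-n.heartbeats:
			// Leader is alive; stay a follower.
		case <-time.After(n.electionTimeout):
			// No heartbeat: increment the term, become a candidate.
			n.term++
			votes := 1 // vote for ourselves
			for _, p := range n.peers {
				// Voters grant a vote only to the first candidate to
				// ask this term, and only if the candidate's log is at
				// least as up to date as their own.
				if p.RequestVote(n.term, n.lastLogIndex, n.lastLogTerm) {
					votes++
				}
			}
			if votes > (len(n.peers)+1)/2 {
				n.isLeader = true // majority! heartbeats announce the win
				return
			}
		}
	}
}
```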
Simultaneous write scenario solved by requirement that all writes go to leader
Leader receives each write, and each write is a new entry with a new index -- can’t occur simultaneously
Network partition scenario - followers reject requests if they’re behind and will eventually be caught up
also, requests go to the leader and our dead node can’t become leader because it’s behind - effect is minimized
Elections require a majority of the nodes to agree to elect a leader
All writes require a majority of the nodes to replicate the transaction to be successful
If we lose more than half the nodes, then we’re unable to do anything
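The majority math is just this:

```go
package raft

// Quorum size for an n-node cluster: a strict majority.
func quorum(n int) int { return n/2 + 1 }

// quorum(3) == 2: survives 1 node failure
// quorum(5) == 3: survives 2 failures - lose 3 and the cluster
// stops accepting changes until quorum is restored
```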
Kubernetes actually suggests you don’t interact with replica sets directly; instead, we use the deployment controller to manage them
It knows which pods to manage through labels: we define labels on the pods we create and tell the replica set which labels to look for
Provides a declarative way of managing pods and uses that to roll out your desired changes in a friendly way
Handles rolling deploys, rolling back your application, and autoscaling it out as well
What it’s looking for specifically are pods without a node name
Constantly querying the API server (and by extension, etcd) looking for those unscheduled pods
HostName - compares the node’s name to the node name specified in the pod spec (if any), excludes every node that doesn’t match, lets us schedule directly to that node, assuming it passes other predicates
MatchNodeSelector - variant of HostName; Kubernetes lets you provide label selectors to associate resources using labels, and this predicate checks that the node selector supplied in the pod spec is valid for a given node
Scheduler also checks the resources of each node
PodFitsHostPort - whether or not the host port (a hard-coded port specified in the pod spec) is available for that node, if not available -> filtered out
PodFitsResources - another resource-aware check, pods can request a given amount of CPU and memory; this predicate checks whether the node is capable of satisfying that request
CheckNodeMemoryPressure and CheckNodeDiskPressure won’t schedule onto nodes whose memory usage or disk usage is too high
Also volume checks - make sure we’re not using more volumes than allowed (for cloud providers) and that there are no conflicts on volume claims
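Conceptually the predicate phase is just a filter over nodes - a sketch with made-up types, not the actual scheduler code:

```go
package sched

// Made-up types for illustration.
type Pod struct{ Name string }
type Node struct{ Name string }

// A Predicate says whether a pod can run on a node
// (PodFitsResources, PodFitsHostPort, etc. have this shape).
type Predicate func(pod Pod, node Node) bool

// filterNodes keeps only the nodes that pass every predicate.
func filterNodes(pod Pod, nodes []Node, preds []Predicate) []Node {
	var feasible []Node
	for _, n := range nodes {
		ok := true
		for _, p := range preds {
			if !p(pod, n) {
				ok = false
				break
			}
		}
		if ok {
			feasible = append(feasible, n)
		}
	}
	return feasible
}
```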
Ties are broken by randomly picking a winner
Most obvious strategy is to put our pod on the least used node - LeastRequestedPriority - calculates how much CPU + memory (equally weighted and added together) would be left after scheduling the pod to that node - helps with balancing resource usage across the cluster
BalancedResourceAllocation - attempts to prevent nodes from being largely weighted towards CPU or memory usage - try to avoid nodes with 95% CPU usage and 5% memory usage
Checks to see how the pod affects the balance of resource usage, favoring nodes who would have CPU utilization closer to memory utilization after scheduling
Last one I want to look at is SelectorSpreadPriority
Lose a great chunk of the benefits of having multiple copies of an application if they’re all sitting on the same node.
If the node goes down, you’ll lose every copy of the app that’s running on the node (insert metaphor about eggs and baskets)
This function minimizes the number of pods from the same service or replica set on the same node, causing the pod currently being scheduled to favor nodes without its managed siblings
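The scoring side can be sketched the same way - here’s a LeastRequestedPriority-flavored example (made-up types, equal CPU/memory weighting as described above):

```go
package sched

// Made-up node bookkeeping for illustration.
type NodeInfo struct {
	Name             string
	FreeCPU, FreeMem int64 // capacity minus what's already requested
}

// Score a node by how much CPU + memory (equally weighted) would
// remain after placing the pod - more left over is better.
func leastRequestedScore(podCPU, podMem int64, n NodeInfo) int64 {
	return (n.FreeCPU - podCPU) + (n.FreeMem - podMem)
}

// pickNode returns the highest-scoring node; the real scheduler
// sums several priority functions and breaks ties randomly.
func pickNode(podCPU, podMem int64, nodes []NodeInfo) NodeInfo {
	best := nodes[0]
	for _, n := range nodes[1:] {
		if leastRequestedScore(podCPU, podMem, n) >
			leastRequestedScore(podCPU, podMem, best) {
			best = n
		}
	}
	return best
}
```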
Specify we want three instances
We’re going to label it with the tag ‘app:hello-world’
Going to launch the hello-world container, when we hit port 80 it’ll return a hello world!
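As a sketch, the same spec written against the client-go types (the image name and port are my assumptions about the demo app):

```go
package demo

import (
	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// Three replicas of the hello-world container, labeled app: hello-world.
func helloWorldDeployment() *appsv1.Deployment {
	replicas := int32(3)
	labels := map[string]string{"app": "hello-world"}
	return &appsv1.Deployment{
		ObjectMeta: metav1.ObjectMeta{Name: "hello-world"},
		Spec: appsv1.DeploymentSpec{
			Replicas: &replicas,
			Selector: &metav1.LabelSelector{MatchLabels: labels},
			Template: corev1.PodTemplateSpec{
				ObjectMeta: metav1.ObjectMeta{Labels: labels},
				Spec: corev1.PodSpec{
					Containers: []corev1.Container{{
						Name:  "hello-world",
						Image: "hello-world:latest", // assumed image name
						Ports: []corev1.ContainerPort{{ContainerPort: 80}},
					}},
				},
			},
		},
	}
}
```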
We use kubectl to actually get our app running in kubernetes
Kubectl is how we interact with the cluster - can view resources as well as modify them
Gives a wonderful view of what’s going on with our cluster
We can check the status of our pods, our deployments, and importantly for us right now, use it to construct a timeline of events
Though they’ve been formatted to fit the screen
A service defines a logical set of pods and a policy by which to access them - load balancer, port exposed on the host, external IP
Defines a service that “selects” pods with the app tag with a value of hello-world
Exposes the service as a port on the node
Can’t do a load balancer because that requires a cloud provider; otherwise we could create an Amazon ELB or Google Cloud’s equivalent
We’ll also create this via kubectl
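And a sketch of that service with the client-go types (NodePort since we have no cloud provider; port numbers are assumptions):

```go
package demo

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

// NodePort service that selects pods labeled app: hello-world and
// routes a port on every node to container port 80.
func helloWorldService() *corev1.Service {
	return &corev1.Service{
		ObjectMeta: metav1.ObjectMeta{Name: "hello-world"},
		Spec: corev1.ServiceSpec{
			Type:     corev1.ServiceTypeNodePort,
			Selector: map[string]string{"app": "hello-world"},
			Ports: []corev1.ServicePort{{
				Port:       80,
				TargetPort: intstr.FromInt(80),
			}},
		},
	}
}
```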