Is advanced scheduling in Kubernetes achievable? Yes. But how do you properly accommodate every real-life scenario that a Kubernetes user might encounter, and how do you leverage advanced scheduling techniques to shape and describe each scenario with easy-to-use rules and configurations?
Oleg Chunikhin addressed those questions and demonstrated techniques for implementing advanced scheduling: for example, using spot instances and other cost-effective resources on AWS, coupled with the ability to deliver a minimum set of functionality that covers the majority of needs without configuration complexity. You’ll get a run-down of the pitfalls and things to keep in mind along the way.
3. What to Look For
• Kubernetes overview
• Scheduling algorithm
• Scheduling controls
• Advanced scheduling techniques
• Examples, use cases, and recommendations
7. Kubernetes | Nodes and Pods
[Diagram: Node 1 runs Pod A-1 (10.0.0.3) and Pod B-1 (10.0.0.8); Node 2 runs Pod A-2 (10.0.1.5); each pod has its own IP address and contains one or more containers]
8. Kubernetes | Container Orchestration
[Diagram: the user works through the K8S master API; K8S scheduler(s) and controller(s) also operate through the API; each node runs a kubelet and Docker; Pods A and B run on Node 1, Pod C on Node 2]
9. Kubernetes | Container Orchestration
[Diagram: user, K8S master API, scheduler(s), controller(s), and an empty Node 1 with kubelet and Docker] It all starts empty.
10. Kubernetes | Container Orchestration
[Diagram: same components] The kubelet registers a node object in the master.
14. Kubernetes | Container Orchestration
[Diagram: Node 1 and Node 2 registered; Pods A, B, and C created but not yet assigned] The scheduler …identifies the best node to run them on…
15. Kubernetes | Container Orchestration
[Diagram: Pods A, B, and C assigned across Node 1 and Node 2] …and marks the pods as scheduled on the corresponding nodes.
16. Kubernetes | Container Orchestration
[Diagram: same layout] Kubelets notice pods scheduled to their nodes…
17. Kubernetes | Container Orchestration
[Diagram: Pods A and B now running on Node 1] …and start the pods’ containers.
18. Kubernetes | Container Orchestration
[Diagram: full picture – pods scheduled and running across Node 1 and Node 2] Scheduler finds the best node to run pods. HOW?
19. Kubernetes | Scheduling Algorithm
For each pod that needs scheduling:
1. Filter nodes
2. Calculate node priorities
3. Schedule the pod if possible
20. Kubernetes | Scheduling Algorithm
Volume filters
• Do the zones of the pod’s requested volumes fit the node’s zone?
• Can the node attach the volumes?
• Are there conflicts with already-mounted volumes?
• Are there additional volume topology constraints?
[Sidebar: scheduling pipeline – Volume filters → Resource filters → Topology filters → Prioritization]
21. Kubernetes | Scheduling Algorithm
Resource filters
• Do the pod’s requested resources (CPU, RAM, GPU, etc.) fit the node’s available resources?
• Can the pod’s requested ports be opened on the node?
• Is the node free of memory and disk pressure?
22. Kubernetes | Scheduling Algorithm
Topology filters
• Is the pod requested to run on this node?
• Are inter-pod affinity constraints satisfied?
• Does the node match the pod’s node selector?
• Can the pod tolerate the node’s taints?
23. Kubernetes | Scheduling Algorithm
Prioritize with weights for:
• Pod replica distribution
• Least (or most) node utilization
• Balanced resource usage
• Inter-pod affinity priority
• Node affinity priority
• Taint toleration priority
24. Scheduling | Controlling Pods Destination
• Specify resource requirements
• Be aware of volumes
• Use node constraints
• Use affinity and anti-affinity
• Scheduler configuration
• Custom / multiple schedulers
25. Scheduling Controlled | Resources
• CPU, RAM, other (GPU)
• Requests and limits
• Reserved resources

kind: Node
status:
  allocatable:
    cpu: "4"
    memory: 8070796Ki
    pods: "110"
  capacity:
    cpu: "4"
    memory: 8Gi
    pods: "110"

kind: Pod
spec:
  containers:
  - name: main
    resources:
      requests:
        cpu: 100m
        memory: 1Gi
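The slide bullets mention limits as well as requests; here is a minimal sketch of a pod that sets both (values are illustrative; note that keeping requests == limits for non-elastic resources like memory is recommended later in this deck):

kind: Pod
spec:
  containers:
  - name: main
    resources:
      requests:
        cpu: 100m
        memory: 1Gi   # request == limit for memory
      limits:
        cpu: 200m
        memory: 1Gi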
26. Scheduling Controlled | Volumes
• Request volumes in the right zones
• Make sure the node can attach enough volumes
• Avoid volume location conflicts
• Use volume topology constraints (alpha in 1.7)

[Diagram: Pod C requests a volume in Zone B, but the available nodes are in Zone A, so Pod C is unschedulable]
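For context on the zone rule, volume-to-zone matching in this era of Kubernetes relied on failure-domain labels on volumes and nodes; a sketch, assuming the beta label names in use around Kubernetes 1.7 (the volume name is hypothetical):

kind: PersistentVolume
metadata:
  name: volume-zone-b   # hypothetical name
  labels:
    # zone/region labels the scheduler matches against node labels
    failure-domain.beta.kubernetes.io/zone: us-east-1b
    failure-domain.beta.kubernetes.io/region: us-east-1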
27. Scheduling Controlled | Volumes
[Diagram: Node 1 with Pod A using Volume 1 and Pod B using Volume 2; Pod C requests an additional volume on the same node – the node must be able to attach enough volumes]
28. Scheduling Controlled | Volumes
[Diagram: Volume 1 attached to Pod A on Node 1, Volume 2 attached to Pod B on Node 2; Pod C is placed so as to avoid volume location conflicts]
29. Scheduling Controlled | Volumes
Volume topology constraints (alpha in 1.7) are expressed via an annotation:

annotations:
  "volume.alpha.kubernetes.io/node-affinity": '{
    "requiredDuringSchedulingIgnoredDuringExecution": {
      "nodeSelectorTerms": [{
        "matchExpressions": [{
          "key": "kubernetes.io/hostname",
          "operator": "In",
          "values": ["docker03"]
        }]
      }]
    }
  }'
30. Scheduling Controlled | Node Constraints
• Host constraints
• Labels and node selectors
• Taints and tolerations

[Diagram: Pod A assigned directly to Node 1]

kind: Pod
spec:
  nodeName: node1

kind: Node
metadata:
  name: node1
31. Scheduling Controlled | Node Constraints
• Host constraints
• Labels and node selectors
• Taints and tolerations

[Diagram: Pod A lands on Node 1, the only node labeled tier: backend; Nodes 2 and 3 do not match]

kind: Node
metadata:
  labels:
    tier: backend

kind: Pod
spec:
  nodeSelector:
    tier: backend
32. Scheduling Controlled | Node Constraints
• Host constraints
• Labels and node selectors
• Taints and tolerations

[Diagram: Node 1 is tainted; Pod A tolerates the taint and can run there, Pod B cannot]

kind: Node
spec:
  taints:
  - effect: NoExecute
    key: error
    value: disk
    timeAdded: null

kind: Pod
spec:
  tolerations:
  - key: error
    value: disk
    operator: Equal
    effect: NoExecute
    tolerationSeconds: 60
33. Scheduling Controlled | Taints
Taints communicate node conditions:
• Key – condition category
• Value – specific condition
• Operator – value wildcard
  • Equal
  • Exists
• Effect
  • NoSchedule – filter at scheduling time
  • PreferNoSchedule – prioritize at scheduling time
  • NoExecute – filter at scheduling time, evict if executing
• TolerationSeconds – time to tolerate a “NoExecute” taint

kind: Pod
spec:
  tolerations:
  - key: <taint key>
    value: <taint value>
    operator: <match operator>
    effect: <taint effect>
    tolerationSeconds: 60
40. Scheduling Controlled | Affinity Example
A simplified inter-pod affinity rule (abbreviated on the slide; in a full pod spec it nests under podAffinity.requiredDuringSchedulingIgnoredDuringExecution, as in the use cases below):

affinity:
  topologyKey: tier
  labelSelector:
    matchLabels:
      group: a

[Diagram: nodes labeled tier: a and tier: b; pods labeled group: a end up within the same tier topology domain]
48. Scheduling Controlled | Custom Scheduler
Naive implementation
• In an infinite loop:
  • Get the list of Nodes: /api/v1/nodes
  • Get the list of Pods: /api/v1/pods
  • Select Pods with status.phase == Pending and spec.schedulerName == our-name
  • For each such pod:
    • Calculate the target Node
    • Create a new Binding object: POST /api/v1/bindings

apiVersion: v1
kind: Binding
metadata:
  namespace: default
  name: pod1
target:
  apiVersion: v1
  kind: Node
  name: node1
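Pods opt into a custom scheduler via spec.schedulerName; a minimal sketch (pod name and image are illustrative, "our-name" is the scheduler name used in the slides):

apiVersion: v1
kind: Pod
metadata:
  name: pod1
spec:
  schedulerName: our-name   # handled by the custom scheduler, not kube-scheduler
  containers:
  - name: main
    image: nginx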
49. Scheduling Controlled | Custom Scheduler
Better implementation
• Watch Pods: /api/v1/pods
• On each Pod event:
  • Process the Pod if status.phase == Pending and spec.schedulerName == our-name
  • Get the list of Nodes: /api/v1/nodes
  • Calculate the target Node
  • Create a new Binding object (same as above): POST /api/v1/bindings
50. Scheduling Controlled | Custom Scheduler
Even better implementation
• Watch Nodes: /api/v1/nodes
• On each Node event:
  • Update the Node cache
• Watch Pods: /api/v1/pods
• On each Pod event:
  • Process the Pod if status.phase == Pending and spec.schedulerName == our-name
  • Calculate the target Node from the cache
  • Create a new Binding object (same as above): POST /api/v1/bindings
51. Custom Scheduler | Standard Filters
• Minimal set of filters
• kube-scheduler:
  • Extend it
  • Re-implement it

GitHub kubernetes/kubernetes
plugin/pkg/scheduler/scheduler.go
plugin/pkg/scheduler/algorithm/predicates/predicates.go
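One way to tweak the stock kube-scheduler of that era, short of re-implementing it, was a policy file selecting which predicates and priorities to apply; a rough sketch, assuming the v1 Policy format (the specific predicate and priority names varied by release; treat these as illustrative):

kind: Policy
apiVersion: v1
predicates:
- name: PodFitsResources        # resource filter
- name: MatchNodeSelector       # topology filter
- name: NoVolumeZoneConflict    # volume filter
priorities:
- name: LeastRequestedPriority
  weight: 1
- name: BalancedResourceAllocation
  weight: 1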
52. Use Case | Distributed Pods

apiVersion: v1
kind: Pod
metadata:
  name: db-replica-3
  labels:
    component: db
spec:
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - topologyKey: kubernetes.io/hostname
        labelSelector:
          matchExpressions:
          - key: component
            operator: In
            values: [ "db" ]

[Diagram: db-replica-1, db-replica-2, and db-replica-3 each land on a different node]
53. Use Case | Co-located Pods

apiVersion: v1
kind: Pod
metadata:
  name: app-replica-1
  labels:
    component: web
spec:
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - topologyKey: kubernetes.io/hostname
        labelSelector:
          matchExpressions:
          - key: component
            operator: In
            values: [ "db" ]

[Diagram: app-replica-1 lands on the same node as db-replica-1]
54. Use Case | Reliable Service on Spot Nodes
• “fixed” node group
  Expensive, more reliable, fixed size
  Tagged with label nodeGroup: fixed
• “spot” node group
  Inexpensive, unreliable, auto-scaled
  Tagged with label nodeGroup: spot
• Scheduling rules:
  • At least two pods on “fixed” nodes
  • All other pods favor “spot” nodes
• Requires a custom scheduler (see the sketch below for the half that standard controls can express)
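The “favor spot” half can be approximated with standard preferred node affinity; a minimal sketch using the nodeGroup labels above (the “at least two pods on fixed nodes” rule is what still needs the custom scheduler):

kind: Pod
spec:
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100               # soft preference, not a hard requirement
        preference:
          matchExpressions:
          - key: nodeGroup
            operator: In
            values: [ "spot" ]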
55. Scheduling | Dos and Don’ts
DO
• Use resource-based scheduling instead of node-based scheduling
• Specify resource requests
• Keep requests == limits
  • Especially for non-elastic resources
  • Memory is non-elastic!
• Safeguard against missing resource specs
  • Namespace default limits (see the LimitRange sketch below)
  • Admission controllers
• Plan the architecture of localized volumes (EBS, local)
• Use inter-pod affinity/anti-affinity where possible
DON’T
• … assign pods to nodes directly
• … use pods with no resource requests
• … rely on node capacity rather than on resource requests
• … use node affinity or direct node assignment when it can be avoided
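As a sketch of the “namespace default limits” safeguard, a LimitRange that fills in defaults for containers that omit resource specs (object name and values are illustrative):

apiVersion: v1
kind: LimitRange
metadata:
  name: default-resources
  namespace: default
spec:
  limits:
  - type: Container
    defaultRequest:     # applied when a container omits requests
      cpu: 100m
      memory: 256Mi
    default:            # applied when a container omits limits
      cpu: 200m
      memory: 256Mi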
56. Scheduling | Key Takeaways
• Scheduling filters and priorities
• Resource requests and availability
• Inter-pod affinity/anti-affinity
• Volume localization (availability zones)
• Node labels and selectors
• Node affinity/anti-affinity
• Node taints and tolerations
• Scheduler tweaking and customization
Thank you for coming to see my presentation.
Oleg Chunikhin
CTO at Kublr
Chief Software Architect at EastBanc Technologies
At Kublr we develop an enterprise Kubernetes management platform.
We see quite often that the rich and powerful scheduling controls Kubernetes provides are underutilized, and essentially manual scheduling is used instead.
We prepared this scheduling overview presentation to explain how cloud-native applications can be made better by utilizing the full power of Kubernetes scheduling.
I will spend a few minutes reintroducing Docker and Kubernetes architecture concepts before we dig into Kubernetes scheduling.
Talking about scheduling, I’ll try to explain the capabilities, the controls available to cluster users and administrators, and the extension points.
We’ll also look at a couple of examples and some recommendations.
Kubernetes can schedule other types of containers, e.g. rkt.
Docker containers can be managed through other orchestration technologies, such as Mesos, Docker Swarm, and HashiCorp Nomad.
Docker plus Kubernetes is still arguably the most common combination, and we will be talking specifically about it today. The architecture and concepts are largely shared with the other technologies.
Docker provides distribution, configuration, and isolation.
The image repository may be public or private; signed images are supported.
An overlay network is not required.
Docker uses various Linux process isolation technologies: namespaces, control groups (cgroups), etc.
Master:
• API
• Metadata database
• Can run in HA mode (1, 3, or 5 instances)
Nodes:
• K8s agents (kubelet), Docker, system containers, and application containers
• After initialization and setup, nodes are fully controlled by the master:
  • Registering nodes with the master
  • Assignment of pods to nodes
Pods:
• Pod IP addresses are allocated from the overlay-network address pool assigned to the node at registration
• Containers in a pod are launched together
• Containers in a pod share the network address space and ports, and can share data volumes
• A pod and its containers share a common life cycle
• The pod life cycle is very simple: a pod cannot be moved or changed; it must be re-created
The master API maintains the general picture: the desired state and the current known state.
The master relies on other components – controllers and kubelets – to update the current known state.
The user modifies the to-be (desired) state and reads the current state.
Controllers “clarify” the to-be state.
Kubelets perform actions to achieve the to-be state and report the current state.
The scheduler is just one of the controllers, responsible for assigning unassigned pods to specific nodes.
First there was nothing
If a pod requests new volumes, can they be created in a zone where they can be attached to the node?
If the requested volumes already exist, can they be attached to the node?
If the volumes are already attached or mounted elsewhere, can they be mounted on this node?
Are there any other user-specified constraints?
Zone conflicts most often happen on AWS, where an EBS volume can only be attached to instances in the same AZ in which the volume is located.
This pod should be co-located (affinity) or not co-located (anti-affinity) with the pods matching the labelSelector in the specified namespaces, where co-located is defined as running on a node whose value of the label with key topologyKey matches that of any node on which any of the selected pods is running.
Empty topologyKey:
• For PreferredDuringScheduling pod anti-affinity, an empty topologyKey is interpreted as “all topologies” (“all topologies” here means all the topologyKeys indicated by the scheduler command-line argument --failure-domains).
• For affinity and for RequiredDuringScheduling pod anti-affinity, an empty topologyKey is not allowed.