Is advanced scheduling in Kubernetes achievable? Yes. But how do you properly accommodate every real-life scenario that a Kubernetes user might encounter, and how do you leverage advanced scheduling techniques to shape and describe each scenario with easy-to-use rules and configurations?
Oleg Chunikhin addressed those questions and demonstrated techniques for implementing advanced scheduling: for example, using spot instances and other cost-effective resources on AWS, coupled with the ability to deliver a minimum set of functionality that covers the majority of needs without configuration complexity. You’ll get a run-down of the pitfalls and things to keep in mind along the way.
3. What to Look For
• Kubernetes overview
• Scheduling algorithm
• Scheduling controls
• Advanced scheduling techniques
• Examples, use cases, and recommendations
7. Kubernetes | Nodes and Pods
[Diagram: Node 1 runs Pod A-1 (10.0.0.3) and Pod B-1 (10.0.0.8); Node 2 runs Pod A-2 (10.0.1.5); each pod has its own IP address and contains one or more containers]
8. Kubernetes | Container Orchestration
[Diagram: the user works through the K8S master API; K8S scheduler(s) and controller(s) also operate through the API; each node runs a kubelet and Docker; Pods A and B run on Node 1, Pod C on Node 2]
9. Kubernetes | Container Orchestration
[Diagram: user, K8S master API, scheduler(s), controller(s), and an empty Node 1 with kubelet and Docker] It all starts empty.
10. Kubernetes | Container Orchestration
[Diagram: same components] The kubelet registers a node object in the master.
14. Kubernetes | Container Orchestration
[Diagram: Node 1 and Node 2 registered; Pods A, B, and C created but not yet assigned] The scheduler …identifies the best node to run them on…
15. Kubernetes | Container Orchestration
[Diagram: Pods A, B, and C assigned across Node 1 and Node 2] …and marks the pods as scheduled on the corresponding nodes.
16. Kubernetes | Container Orchestration
[Diagram: same layout] Kubelets notice pods scheduled to their nodes…
17. Kubernetes | Container Orchestration
[Diagram: Pods A and B now running on Node 1] …and start the pods’ containers.
18. Kubernetes | Container Orchestration
[Diagram: full picture – pods scheduled and running across Node 1 and Node 2] Scheduler finds the best node to run pods. HOW?
19. Kubernetes | Scheduling Algorithm
For each pod that needs scheduling:
1. Filter nodes
2. Calculate node priorities
3. Schedule the pod if possible
20. Kubernetes | Scheduling Algorithm
Volume filters
• Do the zones of the pod’s requested volumes fit the node’s zone?
• Can the node attach the volumes?
• Are there conflicts with already-mounted volumes?
• Are there additional volume topology constraints?
[Sidebar: scheduling pipeline – Volume filters → Resource filters → Topology filters → Prioritization]
21. Kubernetes | Scheduling Algorithm
Resource filters
• Do the pod’s requested resources (CPU, RAM, GPU, etc.) fit the node’s available resources?
• Can the pod’s requested ports be opened on the node?
• Is the node free of memory and disk pressure?
22. Kubernetes | Scheduling Algorithm
Topology filters
• Is the pod requested to run on this node?
• Are inter-pod affinity constraints satisfied?
• Does the node match the pod’s node selector?
• Can the pod tolerate the node’s taints?
23. Kubernetes | Scheduling Algorithm
Prioritize with weights for:
• Pod replica distribution
• Least (or most) node utilization
• Balanced resource usage
• Inter-pod affinity priority
• Node affinity priority
• Taint toleration priority
24. Scheduling | Controlling Pods Destination
• Specify resource requirements
• Be aware of volumes
• Use node constraints
• Use affinity and anti-affinity
• Scheduler configuration
• Custom / multiple schedulers
25. Scheduling Controlled | Resources
• CPU, RAM, other (GPU)
• Requests and limits
• Reserved resources

kind: Node
status:
  allocatable:
    cpu: "4"
    memory: 8070796Ki
    pods: "110"
  capacity:
    cpu: "4"
    memory: 8Gi
    pods: "110"

kind: Pod
spec:
  containers:
  - name: main
    resources:
      requests:
        cpu: 100m
        memory: 1Gi
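The slide bullets mention limits as well as requests; here is a minimal sketch of a pod that sets both (values are illustrative; note that keeping requests == limits for non-elastic resources like memory is recommended later in this deck):

kind: Pod
spec:
  containers:
  - name: main
    resources:
      requests:
        cpu: 100m
        memory: 1Gi   # request == limit for memory
      limits:
        cpu: 200m
        memory: 1Gi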
26. Scheduling Controlled | Volumes
• Request volumes in the right zones
• Make sure the node can attach enough volumes
• Avoid volume location conflicts
• Use volume topology constraints (alpha in 1.7)

[Diagram: Pod C requests a volume in Zone B, but the available nodes are in Zone A, so Pod C is unschedulable]
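For context on the zone rule, volume-to-zone matching in this era of Kubernetes relied on failure-domain labels on volumes and nodes; a sketch, assuming the beta label names in use around Kubernetes 1.7 (the volume name is hypothetical):

kind: PersistentVolume
metadata:
  name: volume-zone-b   # hypothetical name
  labels:
    # zone/region labels the scheduler matches against node labels
    failure-domain.beta.kubernetes.io/zone: us-east-1b
    failure-domain.beta.kubernetes.io/region: us-east-1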
27. Scheduling Controlled | Volumes
[Diagram: Node 1 with Pod A using Volume 1 and Pod B using Volume 2; Pod C requests an additional volume on the same node – the node must be able to attach enough volumes]
28. Scheduling Controlled | Volumes
[Diagram: Volume 1 attached to Pod A on Node 1, Volume 2 attached to Pod B on Node 2; Pod C is placed so as to avoid volume location conflicts]
29. Scheduling Controlled | Volumes
Volume topology constraints (alpha in 1.7) are expressed via an annotation:

annotations:
  "volume.alpha.kubernetes.io/node-affinity": '{
    "requiredDuringSchedulingIgnoredDuringExecution": {
      "nodeSelectorTerms": [{
        "matchExpressions": [{
          "key": "kubernetes.io/hostname",
          "operator": "In",
          "values": ["docker03"]
        }]
      }]
    }
  }'
30. Scheduling Controlled | Node Constraints
• Host constraints
• Labels and node selectors
• Taints and tolerations

[Diagram: Pod A assigned directly to Node 1]

kind: Pod
spec:
  nodeName: node1

kind: Node
metadata:
  name: node1
31. Scheduling Controlled | Node Constraints
• Host constraints
• Labels and node selectors
• Taints and tolerations

[Diagram: Pod A lands on Node 1, the only node labeled tier: backend; Nodes 2 and 3 do not match]

kind: Node
metadata:
  labels:
    tier: backend

kind: Pod
spec:
  nodeSelector:
    tier: backend
32. Scheduling Controlled | Node Constraints
• Host constraints
• Labels and node selectors
• Taints and tolerations

[Diagram: Node 1 is tainted; Pod A tolerates the taint and can run there, Pod B cannot]

kind: Node
spec:
  taints:
  - effect: NoExecute
    key: error
    value: disk
    timeAdded: null

kind: Pod
spec:
  tolerations:
  - key: error
    value: disk
    operator: Equal
    effect: NoExecute
    tolerationSeconds: 60
33. Scheduling Controlled | Taints
Taints communicate node conditions:
• Key – condition category
• Value – specific condition
• Operator – value wildcard
  • Equal
  • Exists
• Effect
  • NoSchedule – filter at scheduling time
  • PreferNoSchedule – prioritize at scheduling time
  • NoExecute – filter at scheduling time, evict if executing
• TolerationSeconds – time to tolerate a “NoExecute” taint

kind: Pod
spec:
  tolerations:
  - key: <taint key>
    value: <taint value>
    operator: <match operator>
    effect: <taint effect>
    tolerationSeconds: 60
40. Scheduling Controlled | Affinity Example
A simplified inter-pod affinity rule (abbreviated on the slide; in a full pod spec it nests under podAffinity.requiredDuringSchedulingIgnoredDuringExecution, as in the use cases below):

affinity:
  topologyKey: tier
  labelSelector:
    matchLabels:
      group: a

[Diagram: nodes labeled tier: a and tier: b; pods labeled group: a end up within the same tier topology domain]
48. Scheduling Controlled | Custom Scheduler
Naive implementation
• In an infinite loop:
  • Get the list of Nodes: /api/v1/nodes
  • Get the list of Pods: /api/v1/pods
  • Select Pods with status.phase == Pending and spec.schedulerName == our-name
  • For each such pod:
    • Calculate the target Node
    • Create a new Binding object: POST /api/v1/bindings

apiVersion: v1
kind: Binding
metadata:
  namespace: default
  name: pod1
target:
  apiVersion: v1
  kind: Node
  name: node1
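Pods opt into a custom scheduler via spec.schedulerName; a minimal sketch (pod name and image are illustrative, "our-name" is the scheduler name used in the slides):

apiVersion: v1
kind: Pod
metadata:
  name: pod1
spec:
  schedulerName: our-name   # handled by the custom scheduler, not kube-scheduler
  containers:
  - name: main
    image: nginx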
49. Scheduling Controlled | Custom Scheduler
Better implementation
• Watch Pods: /api/v1/pods
• On each Pod event:
  • Process the Pod if status.phase == Pending and spec.schedulerName == our-name
  • Get the list of Nodes: /api/v1/nodes
  • Calculate the target Node
  • Create a new Binding object (same as above): POST /api/v1/bindings
50. Scheduling Controlled | Custom Scheduler
Even better implementation
• Watch Nodes: /api/v1/nodes
• On each Node event:
  • Update the Node cache
• Watch Pods: /api/v1/pods
• On each Pod event:
  • Process the Pod if status.phase == Pending and spec.schedulerName == our-name
  • Calculate the target Node from the cache
  • Create a new Binding object (same as above): POST /api/v1/bindings
51. Custom Scheduler | Standard Filters
• Minimal set of filters
• kube-scheduler:
  • Extend it
  • Re-implement it

GitHub kubernetes/kubernetes
plugin/pkg/scheduler/scheduler.go
plugin/pkg/scheduler/algorithm/predicates/predicates.go
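One way to tweak the stock kube-scheduler of that era, short of re-implementing it, was a policy file selecting which predicates and priorities to apply; a rough sketch, assuming the v1 Policy format (the specific predicate and priority names varied by release; treat these as illustrative):

kind: Policy
apiVersion: v1
predicates:
- name: PodFitsResources        # resource filter
- name: MatchNodeSelector       # topology filter
- name: NoVolumeZoneConflict    # volume filter
priorities:
- name: LeastRequestedPriority
  weight: 1
- name: BalancedResourceAllocation
  weight: 1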
52. Use Case | Distributed Pods

apiVersion: v1
kind: Pod
metadata:
  name: db-replica-3
  labels:
    component: db
spec:
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - topologyKey: kubernetes.io/hostname
        labelSelector:
          matchExpressions:
          - key: component
            operator: In
            values: [ "db" ]

[Diagram: db-replica-1, db-replica-2, and db-replica-3 each land on a different node]
53. Use Case | Co-located Pods

apiVersion: v1
kind: Pod
metadata:
  name: app-replica-1
  labels:
    component: web
spec:
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - topologyKey: kubernetes.io/hostname
        labelSelector:
          matchExpressions:
          - key: component
            operator: In
            values: [ "db" ]

[Diagram: app-replica-1 lands on the same node as db-replica-1]
54. Use Case | Reliable Service on Spot Nodes
• “fixed” node group
  Expensive, more reliable, fixed size
  Tagged with label nodeGroup: fixed
• “spot” node group
  Inexpensive, unreliable, auto-scaled
  Tagged with label nodeGroup: spot
• Scheduling rules:
  • At least two pods on “fixed” nodes
  • All other pods favor “spot” nodes
• Requires a custom scheduler (see the sketch below for the half that standard controls can express)
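The “favor spot” half can be approximated with standard preferred node affinity; a minimal sketch using the nodeGroup labels above (the “at least two pods on fixed nodes” rule is what still needs the custom scheduler):

kind: Pod
spec:
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100               # soft preference, not a hard requirement
        preference:
          matchExpressions:
          - key: nodeGroup
            operator: In
            values: [ "spot" ]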
55. Scheduling | Dos and Don’ts
DO
• Use resource-based scheduling instead of node-based scheduling
• Specify resource requests
• Keep requests == limits
  • Especially for non-elastic resources
  • Memory is non-elastic!
• Safeguard against missing resource specs
  • Namespace default limits (see the LimitRange sketch below)
  • Admission controllers
• Plan the architecture of localized volumes (EBS, local)
• Use inter-pod affinity/anti-affinity where possible
DON’T
• … assign pods to nodes directly
• … use pods with no resource requests
• … rely on node capacity rather than on resource requests
• … use node affinity or direct node assignment when it can be avoided
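As a sketch of the “namespace default limits” safeguard, a LimitRange that fills in defaults for containers that omit resource specs (object name and values are illustrative):

apiVersion: v1
kind: LimitRange
metadata:
  name: default-resources
  namespace: default
spec:
  limits:
  - type: Container
    defaultRequest:     # applied when a container omits requests
      cpu: 100m
      memory: 256Mi
    default:            # applied when a container omits limits
      cpu: 200m
      memory: 256Mi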
56. Scheduling | Key Takeaways
• Scheduling filters and priorities
• Resource requests and availability
• Inter-pod affinity/anti-affinity
• Volume localization (availability zones)
• Node labels and selectors
• Node affinity/anti-affinity
• Node taints and tolerations
• Scheduler tweaking and customization
Thank you for coming to see my presentation.
Oleg Chunikhin
CTO at Kublr
Chief Software Architect at EastBanc Technologies
At Kublr we develop an enterprise Kubernetes management platform.
We see quite often that the rich and powerful scheduling controls Kubernetes provides are underutilized, and essentially manual scheduling is used instead.
We prepared this scheduling overview presentation to explain how cloud-native applications can be made better by utilizing the full power of Kubernetes scheduling.
I will spend a few minutes reintroducing Docker and Kubernetes architecture concepts before we dig into Kubernetes scheduling.
Talking about scheduling, I’ll try to explain the capabilities, the controls available to cluster users and administrators, and the extension points.
We’ll also look at a couple of examples and some recommendations.
Kubernetes can schedule other types of containers, e.g. rkt.
Docker containers can be managed through other orchestration technologies, such as Mesos, Docker Swarm, and HashiCorp Nomad.
Docker plus Kubernetes is still arguably the most common combination, and we will be talking specifically about it today. The architecture and concepts are largely shared with the other technologies.
Docker provides distribution, configuration, and isolation.
The image repository may be public or private; signed images are supported.
An overlay network is not required.
Docker uses various Linux process isolation technologies: namespaces, control groups (cgroups), etc.
Master:
• API
• Metadata database
• Can run in HA mode (1, 3, or 5 instances)
Nodes:
• K8s agents (kubelet), Docker, system containers, and application containers
• After initialization and setup, nodes are fully controlled by the master:
  • Registering nodes with the master
  • Assignment of pods to nodes
Pods:
• Pod IP addresses are allocated from the overlay-network address pool assigned to the node at registration
• Containers in a pod are launched together
• Containers in a pod share the network address space and ports, and can share data volumes
• A pod and its containers share a common life cycle
• The pod life cycle is very simple: a pod cannot be moved or changed; it must be re-created
The master API maintains the general picture: the desired state and the current known state.
The master relies on other components – controllers and kubelets – to update the current known state.
The user modifies the to-be (desired) state and reads the current state.
Controllers “clarify” the to-be state.
Kubelets perform actions to achieve the to-be state and report the current state.
The scheduler is just one of the controllers, responsible for assigning unassigned pods to specific nodes.
First there was nothing
If a pod requests new volumes, can they be created in a zone where they can be attached to the node?
If the requested volumes already exist, can they be attached to the node?
If the volumes are already attached or mounted elsewhere, can they be mounted on this node?
Are there any other user-specified constraints?
Zone conflicts most often happen on AWS, where an EBS volume can only be attached to instances in the same AZ in which the volume is located.
This pod should be co-located (affinity) or not co-located (anti-affinity) with the pods matching the labelSelector in the specified namespaces, where co-located is defined as running on a node whose value of the label with key topologyKey matches that of any node on which any of the selected pods is running.
Empty topologyKey:
• For PreferredDuringScheduling pod anti-affinity, an empty topologyKey is interpreted as “all topologies” (“all topologies” here means all the topologyKeys indicated by the scheduler command-line argument --failure-domains).
• For affinity and for RequiredDuringScheduling pod anti-affinity, an empty topologyKey is not allowed.