Process Scheduler and Balancer
in Kernel
Haifeng Li
2014-3-3
Process Scheduler
Outline
• Introduction of scheduler
• Scheduler History
– Round-Robin Scheduler
– O(N)
– O(1)

• Completely Fair Scheduler
• Real Time Scheduler

3
Introduction of Scheduler
• Scheduler
– Determines which process runs when there are
multiple runnable processes.

• Linux Scheduler history

Linux Version       Scheduler
Before 2.4          Round Robin Scheduler
2.4                 O(N)
2.5.17 ~ 2.6.23     O(1)
2.6.23 ~ now        Completely Fair Scheduler

4
Round Robin Scheduler
struct task_struct {
long counter;
long priority;
…
}

• Algorithm
– Init: p->counter = current->counter >> 1
– At each tick: current->counter--
– When current->counter == 0, the system picks the thread with the
highest counter to run.
– When every thread's counter is 0, reset the counters:
p->counter = (p->counter >> 1) + p->priority
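The steps above can be sketched in plain C. This is a minimal user-space illustration, not the historical kernel code: the `rr_task` type and the `rr_*` function names are hypothetical, and only runnable tasks are modelled.

```c
#include <stddef.h>

/* Hypothetical stand-in for the two task_struct fields on the slide. */
struct rr_task {
    long counter;   /* ticks left in the current epoch */
    long priority;  /* static priority, added back on each reset */
};

/* Timer tick: charge one tick to the running task. */
static void rr_tick(struct rr_task *current)
{
    if (current->counter > 0)
        current->counter--;
}

/* Reset step, applied once every task has used up its counter. */
static void rr_recalculate(struct rr_task *tasks, size_t n)
{
    for (size_t i = 0; i < n; i++)
        tasks[i].counter = (tasks[i].counter >> 1) + tasks[i].priority;
}

/* Pick the task with the highest counter; if all counters are 0,
 * recalculate once and scan again. */
static struct rr_task *rr_pick_next(struct rr_task *tasks, size_t n)
{
    struct rr_task *best = NULL;

    for (int pass = 0; pass < 2 && !best; pass++) {
        for (size_t i = 0; i < n; i++)
            if (tasks[i].counter > 0 &&
                (!best || tasks[i].counter > best->counter))
                best = &tasks[i];
        if (!best && pass == 0)
            rr_recalculate(tasks, n);
    }
    return best;    /* NULL only if no task can get a positive counter */
}
```

Every pick scans all tasks, which is the cost the later O(1) design removes.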
5
O(N) Scheduler
• Algorithm

struct task_struct{
…
long counter;
long nice;
…
}

– All runnable tasks are kept in one global list.
– The time slice depends on the priority and CONFIG_HZ.
– To pick the next task, scan the list and choose the task with the
largest weight among those with p->counter != 0, where
weight = p->counter + (20 - nice).
– After all tasks have used up their time slices, recalculate the
counters.
#if HZ < 200
#define TICK_SCALE(x) ((x) >> 2)
#elif HZ < 400
#define TICK_SCALE(x) ((x) >> 1)
…
#endif
#define NICE_TO_TICKS(nice) (TICK_SCALE(20-(nice))+1)
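A minimal sketch of the selection and recalculation steps, assuming HZ = 100 so that only the `HZ < 200` branch of `TICK_SCALE` is needed. The `on_task` type and the `on_*` function names are hypothetical; the recalculation formula follows the counter reset of the Round Robin slide, with `NICE_TO_TICKS` in place of `priority`.

```c
#include <stddef.h>

/* Assumption: HZ < 200, so only that TICK_SCALE branch is used here. */
#define TICK_SCALE(x)        ((x) >> 2)
#define NICE_TO_TICKS(nice)  (TICK_SCALE(20 - (nice)) + 1)

/* Hypothetical stand-in for the task_struct fields on the slide. */
struct on_task {
    long counter;   /* remaining ticks */
    long nice;      /* -20 (highest priority) .. 19 (lowest) */
};

/* Weight of a task; a task with an exhausted counter is not eligible. */
static long on_weight(const struct on_task *p)
{
    if (p->counter == 0)
        return 0;
    return p->counter + (20 - p->nice);
}

/* O(N): every pick walks the whole global list. */
static struct on_task *on_pick_next(struct on_task *list, size_t n)
{
    struct on_task *best = NULL;
    long best_weight = 0;

    for (size_t i = 0; i < n; i++) {
        long w = on_weight(&list[i]);
        if (w > best_weight) {
            best_weight = w;
            best = &list[i];
        }
    }
    return best;    /* NULL means all counters are 0: recalculate */
}

/* Give every task a fresh time slice derived from its nice value. */
static void on_recalculate(struct on_task *list, size_t n)
{
    for (size_t i = 0; i < n; i++)
        list[i].counter = (list[i].counter >> 1) + NICE_TO_TICKS(list[i].nice);
}
```

With these definitions a nice 0 task receives 6 ticks per epoch, a nice -20 task 11 ticks, and a nice 19 task 1 tick.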

6
O(1) Scheduler (1)
• This scheduler uses two priority arrays (active and expired) per
processor to keep track of the processor's ready
tasks
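A simplified sketch of the two-array idea follows. It assumes 140 priority levels and replaces the per-priority task lists with plain counters; the `o1_*` names are hypothetical. The point is that picking the next priority costs a bounded bitmap scan, independent of the number of tasks.

```c
#include <stdint.h>
#include <string.h>

#define O1_MAX_PRIO 140     /* assumption: priorities 0..139 */
#define O1_WORDS    3       /* 3 * 64 bits cover 140 priorities */

struct o1_prio_array {
    unsigned nr_active;               /* tasks queued in this array */
    uint64_t bitmap[O1_WORDS];        /* bit p set: priority p is non-empty */
    unsigned queued[O1_MAX_PRIO];     /* stand-in for per-priority lists */
};

struct o1_runqueue {
    struct o1_prio_array *active, *expired;
    struct o1_prio_array arrays[2];
};

static void o1_init(struct o1_runqueue *rq)
{
    memset(rq, 0, sizeof(*rq));
    rq->active = &rq->arrays[0];
    rq->expired = &rq->arrays[1];
}

static void o1_enqueue(struct o1_prio_array *a, int prio)
{
    a->queued[prio]++;
    a->bitmap[prio / 64] |= UINT64_C(1) << (prio % 64);
    a->nr_active++;
}

static void o1_dequeue(struct o1_prio_array *a, int prio)
{
    if (--a->queued[prio] == 0)
        a->bitmap[prio / 64] &= ~(UINT64_C(1) << (prio % 64));
    a->nr_active--;
}

/* Lowest-numbered (highest) ready priority, or -1 if nothing is ready.
 * When the active array is empty, the two arrays swap roles. */
static int o1_pick_prio(struct o1_runqueue *rq)
{
    if (rq->active->nr_active == 0) {
        struct o1_prio_array *tmp = rq->active;
        rq->active = rq->expired;
        rq->expired = tmp;
    }
    for (int w = 0; w < O1_WORDS; w++)
        for (int b = 0; b < 64 && rq->active->bitmap[w]; b++)
            if (rq->active->bitmap[w] & (UINT64_C(1) << b))
                return w * 64 + b;
    return -1;
}
```

A task that exhausts its time slice would be dequeued from `active` and enqueued on `expired`, so the swap starts a new round without touching individual tasks.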

7
O(1) Scheduler (2)
• Time Slice

#define SCALE_PRIO(x, prio) \
max(x * (MAX_PRIO - prio) / (MAX_USER_PRIO/2), MIN_TIMESLICE)

static unsigned int task_timeslice(task_t *p)
{
if (p->static_prio < NICE_TO_PRIO(0))
return SCALE_PRIO(DEF_TIMESLICE*4, p->static_prio);
else
return SCALE_PRIO(DEF_TIMESLICE, p->static_prio);
}

• Dynamic Priority
• max(100, min(static_priority - bonus + 5, 139))
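To make the time-slice rule of task_timeslice() concrete, here is a small worked sketch. The constant values are assumptions modelled on the 2.6 kernel and expressed in milliseconds (MAX_PRIO = 140, MAX_USER_PRIO = 40, DEF_TIMESLICE = 100, MIN_TIMESLICE = 5); `scale_prio` and `timeslice_ms` are hypothetical helper names.

```c
/* Assumed constants; the kernel expresses the time values in jiffies. */
#define MAX_RT_PRIO         100
#define MAX_PRIO            140
#define MAX_USER_PRIO       (MAX_PRIO - MAX_RT_PRIO)
#define NICE_TO_PRIO(nice)  (MAX_RT_PRIO + (nice) + 20)
#define DEF_TIMESLICE       100     /* ms */
#define MIN_TIMESLICE       5       /* ms */

/* Function form of SCALE_PRIO from the slide. */
static unsigned int scale_prio(unsigned int x, int prio)
{
    unsigned int slice = x * (MAX_PRIO - prio) / (MAX_USER_PRIO / 2);

    return slice > MIN_TIMESLICE ? slice : MIN_TIMESLICE;
}

/* Time slice in ms for a nice value in the range -20..19. */
static unsigned int timeslice_ms(int nice)
{
    int static_prio = NICE_TO_PRIO(nice);

    if (static_prio < NICE_TO_PRIO(0))
        return scale_prio(DEF_TIMESLICE * 4, static_prio);
    return scale_prio(DEF_TIMESLICE, static_prio);
}
```

Under these assumptions nice -20 yields 800 ms, nice 0 yields 100 ms, and nice 19 yields the 5 ms minimum.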
8
O(1) Scheduler (3)
• The bonus is derived from the task's sleep time.

• MAX_SLEEP_AVG is 1000 ms; MAX_BONUS is 10
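A sketch of how the bonus could feed the dynamic priority formula max(100, min(static_priority - bonus + 5, 139)). The linear mapping from average sleep time to bonus is an assumption modelled on the 2.6 kernel; `sleep_bonus` and `dynamic_prio` are hypothetical names.

```c
#define MAX_RT_PRIO       100
#define MAX_PRIO          140
#define MAX_BONUS         10
#define MAX_SLEEP_AVG_MS  1000

/* Assumption: the bonus grows linearly with the average sleep time. */
static int sleep_bonus(unsigned int sleep_avg_ms)
{
    if (sleep_avg_ms > MAX_SLEEP_AVG_MS)
        sleep_avg_ms = MAX_SLEEP_AVG_MS;
    return (int)(sleep_avg_ms * MAX_BONUS / MAX_SLEEP_AVG_MS);
}

/* max(100, min(static_prio - bonus + 5, 139)) */
static int dynamic_prio(int static_prio, unsigned int sleep_avg_ms)
{
    int prio = static_prio - sleep_bonus(sleep_avg_ms) + MAX_BONUS / 2;

    if (prio < MAX_RT_PRIO)
        prio = MAX_RT_PRIO;
    if (prio > MAX_PRIO - 1)
        prio = MAX_PRIO - 1;
    return prio;
}
```

A nice 0 task (static priority 120) that sleeps most of the time is boosted to 115, while one that never sleeps is penalized to 125.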

9
CFS Scheduler: Concept(1)
• "Ideal multi-tasking CPU" is a (non-existent :-))
CPU which can run each task at precise equal
speed and equal share.[1]

[1].Documentation/scheduler/sched-design-CFS.txt

10
CFS Scheduler: Concept(2)
• In reality, only one task runs on a CPU at any moment, which is obviously not fair:

• So, the concept of “virtual runtime” is
introduced.
Picture is from: Completely Fair Scheduler, Linux Journal, Issue #184, August 2009

11
CFS Scheduler: Virtual Runtime (1)
• The virtual runtime of a task specifies when its
next time slice would start execution on the ideal
multi-tasking CPU.[1]
• CFS tries to maintain an equal virtual runtime for
each task in a CPU’s run_queue at all times.
– Reason: tasks would execute simultaneously and no
task would ever get "out of balance" from the "ideal"
share of CPU time.[1]

• CFS always tries to run the task with the smallest
virtual runtime value.
[1].Documentation/scheduler/sched-design-CFS.txt

12
CFS scheduler: Virtual Runtime (2)
• One period time for all tasks
(1)

• Time slice for a task on real Processor
(2)

• Virtual Runtime
(3)

According to (2) and (3), get:
(4)
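Relations (1) to (4) can be sketched as follows, assuming the usual CFS definitions rather than the slide's exact notation. All symbols are assumptions: L is the targeted scheduling latency, g the minimum granularity, n the number of runnable tasks, w_i the weight of task i, w_0 the weight of a nice 0 task, and Δt_i the real time task i has just run.

```latex
\begin{align}
P &= \begin{cases} L, & n \le L / g \\ n \cdot g, & \text{otherwise} \end{cases} \tag{1} \\
\text{slice}_i &= P \cdot \frac{w_i}{\sum_j w_j} \tag{2} \\
\Delta\text{vruntime}_i &= \Delta t_i \cdot \frac{w_0}{w_i} \tag{3} \\
\Delta\text{vruntime}_i &= P \cdot \frac{w_i}{\sum_j w_j} \cdot \frac{w_0}{w_i}
  = P \cdot \frac{w_0}{\sum_j w_j} \tag{4}
\end{align}
```

Relation (4) substitutes Δt_i = slice_i from (2) into (3). The result no longer depends on i, so every task advances by the same virtual runtime in one period, whatever its weight.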
13
A demo: understanding virtual runtime
• Thread 1: weight 2 /Thread 2: weight 5
• Period Clock: P=10ms(HZ:100)
Clock Sequence    Virtual Runtime 1    Virtual Runtime 2
0                 0                    0
1                 ½ * P                0
2                 ½ * P                1/5 * P
3                 ½ * P                2/5 * P
4                 ½ * P                3/5 * P
5                 1 * P                3/5 * P
6                 1 * P                4/5 * P
7                 1 * P                1 * P
…                 …                    …
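The table can be reproduced with a few lines of C. This sketch makes two assumptions: virtual runtime is counted in units of P/10 so that the arithmetic stays in integers (both weights, 2 and 5, divide 10), and ties go to the lower-numbered thread. The `vr_*` names are hypothetical.

```c
#include <stddef.h>

#define VR_UNITS_PER_PERIOD 10u   /* assumption: vruntime counted in P/10 */

struct vr_task {
    unsigned weight;      /* 2 for thread 1, 5 for thread 2 */
    unsigned vruntime;    /* in units of P/10 */
};

/* CFS rule: run the task with the smallest virtual runtime. */
static size_t vr_pick(const struct vr_task *t, size_t n)
{
    size_t best = 0;

    for (size_t i = 1; i < n; i++)
        if (t[i].vruntime < t[best].vruntime)
            best = i;
    return best;
}

/* One clock tick: run the picked task and charge it P / weight. */
static size_t vr_tick(struct vr_task *t, size_t n)
{
    size_t i = vr_pick(t, n);

    t[i].vruntime += VR_UNITS_PER_PERIOD / t[i].weight;
    return i;
}
```

Starting from `{2, 0}` and `{5, 0}`, seven calls to `vr_tick` run the threads in the order 1, 2, 2, 2, 1, 2, 2 and leave both at a virtual runtime of 1 * P, matching the last row of the table. The thread with weight 5 received 5 of the 7 ticks.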
14
CFS Scheduler: Priority & Weight
static const int prio_to_weight[40] = {
/* -20 */ 88761, 71755, 56483, 46273, 36291,
/* -15 */ 29154, 23254, 18705, 14949, 11916,
/* -10 */  9548,  7620,  6100,  4904,  3906,
/*  -5 */  3121,  2501,  1991,  1586,  1277,
/*   0 */  1024,   820,   655,   526,   423,
/*   5 */   335,   272,   215,   172,   137,
/*  10 */   110,    87,    70,    56,    45,
/*  15 */    36,    29,    23,    18,    15,
};

[Figure: Weight (0 to 100000) plotted against Nice Value (-20 to 19)]
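To show what the weights mean in practice, the following sketch computes the CPU share of one task when it competes with exactly one other task on the same CPU. The table values are copied from the slide; `nice_to_weight` and `cpu_share_permille` are hypothetical helpers.

```c
static const int prio_to_weight[40] = {
    /* -20 */ 88761, 71755, 56483, 46273, 36291,
    /* -15 */ 29154, 23254, 18705, 14949, 11916,
    /* -10 */  9548,  7620,  6100,  4904,  3906,
    /*  -5 */  3121,  2501,  1991,  1586,  1277,
    /*   0 */  1024,   820,   655,   526,   423,
    /*   5 */   335,   272,   215,   172,   137,
    /*  10 */   110,    87,    70,    56,    45,
    /*  15 */    36,    29,    23,    18,    15,
};

/* Map a nice value in -20..19 to its weight. */
static int nice_to_weight(int nice)
{
    return prio_to_weight[nice + 20];
}

/* Share of task A, in parts per thousand (rounded down), when A and B
 * are the only runnable tasks: w_a / (w_a + w_b). */
static int cpu_share_permille(int nice_a, int nice_b)
{
    long wa = nice_to_weight(nice_a);
    long wb = nice_to_weight(nice_b);

    return (int)(wa * 1000 / (wa + wb));
}
```

Adjacent entries differ by a factor of about 1.25, so a nice 0 task competing with a nice 1 task gets roughly 55% of the CPU (`cpu_share_permille(0, 1)` is 555).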
15
CFS scheduler: implementation (1)
• CFS uses a virtual runtime-ordered red-black
tree to build a "timeline" of future task
execution.

16
CFS scheduler: implementation (2)

More: http://www.ibm.com/developerworks/library/l-completely-fair-scheduler/

17
Real Time Scheduler
• The real-time scheduler has to ensure systemwide strict real-time priority scheduling (SWSRPS)
• Only the N highest-priority tasks should be running at
any given point in time, where N is the number of
CPUs.
• Frequent task balancing can introduce cache
thrashing and contention for global data (such as
runqueue locks) and can degrade throughput.
• Two policies
– SCHED_RR
– SCHED_FIFO
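From user space, a process selects one of the two policies through the POSIX scheduling interface. The sketch below uses `sched_setscheduler`, `sched_get_priority_min`, and `sched_get_priority_max` from `<sched.h>`; the helper names `clamp_rt_priority` and `make_realtime` are hypothetical, and the call normally fails with EPERM unless the process has the required privileges.

```c
#include <sched.h>

/* Clamp a requested priority into the valid range of the given policy. */
static int clamp_rt_priority(int policy, int prio)
{
    int lo = sched_get_priority_min(policy);
    int hi = sched_get_priority_max(policy);

    if (prio < lo)
        return lo;
    if (prio > hi)
        return hi;
    return prio;
}

/* Switch the calling process to SCHED_FIFO or SCHED_RR.
 * Returns 0 on success, -1 with errno set on failure. */
static int make_realtime(int policy, int prio)
{
    struct sched_param sp = {
        .sched_priority = clamp_rt_priority(policy, prio),
    };

    return sched_setscheduler(0, policy, &sp);  /* pid 0: this process */
}
```

A SCHED_FIFO task runs until it blocks, yields, or is preempted by a higher-priority task; SCHED_RR additionally rotates among tasks of equal priority after a time quantum.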
18
Key Structures
struct cpupri_vec {
    atomic_t count;
    cpumask_var_t mask;
};
struct cpupri {
    struct cpupri_vec pri_to_cpu[CPUPRI_NR_PRIORITIES];
    int cpu_to_pri[NR_CPUS];
};
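A simplified user-space sketch of how these two structures can be used: each priority level keeps a mask of the CPUs currently running at that level, so finding a CPU that runs something less important than a waking task is a scan over priority levels rather than over CPUs. The `demo_*` names, the 64-bit mask, and the sizes are assumptions; the kernel version uses atomics and cpumasks.

```c
#include <stdint.h>
#include <string.h>

#define DEMO_NR_CPUS        8
#define DEMO_NR_PRIORITIES  102    /* assumption: idle, normal, 100 RT levels */
#define DEMO_PRI_INVALID    (-1)

struct demo_cpupri_vec {
    int count;         /* number of CPUs at this priority */
    uint64_t mask;     /* bit c set: CPU c runs at this priority */
};

struct demo_cpupri {
    struct demo_cpupri_vec pri_to_cpu[DEMO_NR_PRIORITIES];
    int cpu_to_pri[DEMO_NR_CPUS];
};

static void demo_cpupri_init(struct demo_cpupri *cp)
{
    memset(cp, 0, sizeof(*cp));
    for (int i = 0; i < DEMO_NR_CPUS; i++)
        cp->cpu_to_pri[i] = DEMO_PRI_INVALID;
}

/* Record that `cpu` now runs at priority index `newpri`. */
static void demo_cpupri_set(struct demo_cpupri *cp, int cpu, int newpri)
{
    int oldpri = cp->cpu_to_pri[cpu];
    uint64_t bit = UINT64_C(1) << cpu;

    if (oldpri == newpri)
        return;
    if (oldpri != DEMO_PRI_INVALID) {
        cp->pri_to_cpu[oldpri].mask &= ~bit;
        cp->pri_to_cpu[oldpri].count--;
    }
    if (newpri != DEMO_PRI_INVALID) {
        cp->pri_to_cpu[newpri].mask |= bit;
        cp->pri_to_cpu[newpri].count++;
    }
    cp->cpu_to_pri[cpu] = newpri;
}

/* Mask of allowed CPUs at the lowest priority level below task_pri,
 * or 0 if no allowed CPU runs below the task's priority. */
static uint64_t demo_cpupri_find(const struct demo_cpupri *cp,
                                 int task_pri, uint64_t allowed)
{
    for (int idx = 0; idx < task_pri && idx < DEMO_NR_PRIORITIES; idx++) {
        const struct demo_cpupri_vec *vec = &cp->pri_to_cpu[idx];

        if (vec->count && (vec->mask & allowed))
            return vec->mask & allowed;
    }
    return 0;
}
```

Here a higher index means a higher priority, so the first non-empty level found is the least important work in the system that the task is allowed to displace.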

19
Overview of RT scheduler Algorithm
• The scheduler has to address several scenarios:
– Where to place a task optimally on wakeup (that is,
pre-balance).
– What to do with a lower-priority task when it wakes
up but is on a runqueue running a task of higher
priority.
– What to do with a low-priority task when a higher-priority
task on the same runqueue wakes up and
preempts it.
– What to do when a task lowers its priority and thereby
causes a previously lower-priority task to have the
higher priority.
More: http://www.linuxjournal.com/magazine/real-time-linux-kernel-scheduler

20
A Demo of RT scheduler Algorithm

21
Scheduler decision
1. Start with the top scheduler class.
2. Is a runnable task available in this class?
– No: pick the next scheduler class and repeat step 2.
– Yes: pick the next task of this scheduler class.
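The decision flow can be sketched as a walk over a linked list of scheduler classes ordered from highest to lowest precedence. The `demo_*` types and the three classes shown (real-time, fair, idle) are simplifications assumed for illustration; the idle class always has a task, so the walk always terminates with a result.

```c
#include <stddef.h>

struct demo_task { const char *name; };

struct demo_rq {
    struct demo_task *rt_task;     /* NULL if no RT task is runnable */
    struct demo_task *fair_task;   /* NULL if no CFS task is runnable */
    struct demo_task idle_task;    /* always available */
};

struct demo_sched_class {
    const struct demo_sched_class *next;   /* next lower class */
    struct demo_task *(*pick_next_task)(struct demo_rq *rq);
};

static struct demo_task *pick_rt(struct demo_rq *rq)   { return rq->rt_task; }
static struct demo_task *pick_fair(struct demo_rq *rq) { return rq->fair_task; }
static struct demo_task *pick_idle(struct demo_rq *rq) { return &rq->idle_task; }

static const struct demo_sched_class demo_idle_class = { NULL, pick_idle };
static const struct demo_sched_class demo_fair_class = { &demo_idle_class, pick_fair };
static const struct demo_sched_class demo_rt_class   = { &demo_fair_class, pick_rt };

/* Start with the top class; fall through to the next class
 * whenever the current one has no runnable task. */
static struct demo_task *demo_pick_next_task(struct demo_rq *rq)
{
    const struct demo_sched_class *class;

    for (class = &demo_rt_class; class; class = class->next) {
        struct demo_task *p = class->pick_next_task(rq);

        if (p)
            return p;
    }
    return NULL;    /* not reached: the idle class always returns a task */
}
```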

22
CFS Load Balancer
Outline
• Objective
• How to balance among cores
– Hierarchy & Key Data Structures
• Scenarios of balance

24
Objective
1. Prevent processors from being idle while other
processors still have tasks waiting to execute[1]
2. Keep the difference in numbers of ready tasks
on all processors as small as possible[1]
In addition: try to save power when the load is light.[2]

[1] Chun-Yu Lai, Performance Evaluation of Linux Kernel Load Balancing Mechanisms , 2006
[2] Suresh Siddha, Chip Multi Processing aware Linux Kernel Scheduler , 2006 Linux Symposium
25
Hierarchy
• Scheduling Domain: Each scheduling domain
spans a number of CPUs.
• Scheduling Group: Each scheduling domain
must have one or more CPU groups which are
organized as a circular one way linked list.
• Balancing within a scheduling domain occurs
between groups.

More information: http://lwn.net/Articles/80911/

26
A Demo of Hierarchy

27
Key members of sched_domain
struct sched_domain {
    /* These fields must be setup */
    struct sched_domain *parent;    /* top domain must be null terminated */
    struct sched_domain *child;     /* bottom domain must be null terminated */
    struct sched_group *groups;     /* the balancing groups of the domain */
    …
    unsigned int busy_factor;       /* less balancing by factor if busy */
    unsigned int imbalance_pct;     /* No balance until over watermark */
    …
    int flags;                      /* See SD_* */
    …
    unsigned long last_balance;     /* init to jiffies. units in jiffies */
    unsigned int balance_interval;  /* initialise to 1. units in ms. */
    unsigned int span_weight;
    unsigned long span[0];
};
/* sched-domains (multiprocessor balancing) declarations: */
#ifdef CONFIG_SMP
#define SD_LOAD_BALANCE     0x0001  /* Do load balancing on this domain. */
#define SD_BALANCE_NEWIDLE  0x0002  /* Balance when about to become idle */
#define SD_BALANCE_EXEC     0x0004  /* Balance on exec */
#define SD_BALANCE_FORK     0x0008  /* Balance on fork, clone */
#define SD_BALANCE_WAKE     0x0010  /* Balance on wakeup */
#define SD_WAKE_AFFINE      0x0020  /* Wake task to waking CPU */
#define SD_SHARE_CPUPOWER   0x0080  /* Domain members share cpu power */
28
Key members of sched_group
struct sched_group {
    struct sched_group *next;   /* Must be a circular list */
    …
    unsigned int group_weight;
    struct sched_group_power *sgp;
    …
    unsigned long cpumask[0];
};
struct sched_group_power {
    …
    unsigned int power;
    …
};

29
An example of sched_domain
#define SD_CPU_INIT (struct sched_domain) {     \
    .busy_factor      = 64,                     \
    .imbalance_pct    = 125,                    \
    .flags            = 1*SD_LOAD_BALANCE       \
                      | 1*SD_BALANCE_NEWIDLE    \
                      | 1*SD_BALANCE_EXEC       \
                      | 1*SD_BALANCE_FORK       \
                      | 0*SD_BALANCE_WAKE       \
                      | 1*SD_WAKE_AFFINE        \
                      ,                         \
    .last_balance     = jiffies,                \
    .balance_interval = 1,                      \
}
30
CFS Load Balancing: How to
• load_balance offloads tasks from the
busiest runqueue of the busiest group (the one with the
most runnable tasks), preferring tasks that are:
– inactive (likely to be cache cold)
– high priority

• load_balance skips tasks that are:
– Currently running on a CPU
– Not allowed to run on the current CPU (as indicated by
the cpus_allowed bitmask in the task_struct)
– Still cache warm on their current CPU
31
How busiest is the busiest group?
• Within the current domain level, the group with the
highest average load is the busiest group.
– If the current processor is idle, the busiest group
must also have more running threads than
it has cores.
– Else

• If a busiest group is found, this domain is
unbalanced.
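A sketch of the group search over the circular group list. The watermark test shown here (the busiest group's average load must exceed the local group's by more than `imbalance_pct` percent) is an assumption based on the `imbalance_pct` field ("No balance until over watermark"); it is not necessarily the condition elided after "Else" above, and the idle-CPU case is left out. All `demo_*` names are hypothetical.

```c
#include <stddef.h>

#define DEMO_POWER_SCALE 1024UL

struct demo_group {
    struct demo_group *next;    /* circular, singly linked */
    unsigned long load;         /* sum of the runqueue loads in the group */
    unsigned long power;        /* CPU power of the group */
};

/* Load normalized by the group's CPU power. */
static unsigned long demo_avg_load(const struct demo_group *g)
{
    return g->load * DEMO_POWER_SCALE / g->power;
}

/* Returns the busiest group other than `local`, or NULL if the domain
 * is considered balanced. */
static struct demo_group *
demo_find_busiest_group(struct demo_group *groups, struct demo_group *local,
                        unsigned int imbalance_pct)
{
    struct demo_group *sg = groups, *busiest = NULL;
    unsigned long max_load = 0;

    do {    /* walk the circular list exactly once */
        unsigned long avg = demo_avg_load(sg);

        if (sg != local && avg > max_load) {
            max_load = avg;
            busiest = sg;
        }
        sg = sg->next;
    } while (sg != groups);

    if (!busiest)
        return NULL;
    /* Assumed watermark: ignore differences within imbalance_pct. */
    if (100 * max_load <= imbalance_pct * demo_avg_load(local))
        return NULL;
    return busiest;
}
```

With `imbalance_pct = 125`, a remote group must carry more than 1.25 times the local group's average load before the domain counts as unbalanced, which keeps small differences from causing task migrations.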
32
Restore balance
• How much load to actually move to equalize the
imbalance:
(1)
(2)
(3)

• Offload min(imbalance_x) from the busiest
runqueue in the busiest group to restore balance.
• The busiest runqueue is the one with the largest load weight in
the busiest group.
33
Load Balancing: idle balancing
• Idle balancing
– In schedule(), if this CPU is about to become idle,
it attempts to pull a task from the busiest CPU.
for_each_domain(this_cpu, sd) {
if (!(sd->flags & SD_LOAD_BALANCE))
continue;
pulled_task = load_balance(this_cpu, this_rq,
sd, CPU_NEWLY_IDLE, &balance);

if (pulled_task)
break;
}

34
Load Balancing: Periodic balancing
• In the timer tick, if the current time is after rq->next_balance, trigger
SCHED_SOFTIRQ.
• The current processor starts from the lowest-level
scheduling domain and walks up the domain hierarchy
to decide whether rebalancing is needed:
– Current time > sd->last_balance + interval, where
    interval = sd->balance_interval;
    if (idle != CPU_IDLE)
        interval *= sd->busy_factor;
– The current domain is unbalanced
– If rebalancing is needed, pull tasks from the busiest runqueue to
the current runqueue.

• After one round of periodic balancing, rq->next_balance is
updated to current time + highest-level interval.
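The interval check can be sketched in user-space C. The `demo_*` names are hypothetical, and the tick rate `hz` is passed in explicitly because `balance_interval` is in milliseconds while `last_balance` is in jiffies. The comparison uses a signed difference so that it stays correct when the jiffies counter wraps.

```c
#include <stdbool.h>

struct demo_sd {
    unsigned long last_balance;      /* jiffies of the last balance */
    unsigned int balance_interval;   /* ms */
    unsigned int busy_factor;        /* multiplier applied when not idle */
};

/* Convert ms to jiffies, rounding up and never returning 0. */
static unsigned long demo_msecs_to_jiffies(unsigned long ms, unsigned long hz)
{
    unsigned long j = (ms * hz + 999) / 1000;

    return j ? j : 1;
}

/* Is this domain due for a periodic balance at time `now` (jiffies)? */
static bool demo_balance_due(const struct demo_sd *sd, unsigned long now,
                             bool cpu_idle, unsigned long hz)
{
    unsigned long interval = sd->balance_interval;

    if (!cpu_idle)
        interval *= sd->busy_factor;     /* busy CPUs balance less often */
    interval = demo_msecs_to_jiffies(interval, hz);

    /* Wrap-safe form of: now >= last_balance + interval */
    return (long)(now - (sd->last_balance + interval)) >= 0;
}
```

With the SD_CPU_INIT values (`balance_interval = 1`, `busy_factor = 64`), an idle CPU checks this domain every millisecond, while a busy CPU checks it only every 64 ms.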
35
Other Methods to keep Balance
• Exec balancing (SD_BALANCE_EXEC)
– Where to put a new task

• Fork balancing (SD_BALANCE_FORK)
– Where to put a newly spawned thread

• Wake balancing (SD_BALANCE_WAKE)
– Where to put the wakee thread

• ILB (idle load balancer) balancing

36
Exec balancing
• Search for the idlest group, from the highest-level
scheduling domain down to the lowest-level domain.
– The idlest group is the one with the minimum avg_load
– Meet

• Search for the idlest CPU in the idlest group.
– The idlest CPU is the one with the minimum avg_load in the idlest group

• Pack the task into a work item and add it to the
&per_cpu(cpu_stopper, cpu) list.
• Wake up the stopper->thread running on the idlest
CPU
37
Fork Balancing
• In do_fork, select the idlest CPU and insert the new
thread into the runqueue of that CPU.

38
Wake Balancing
• The waker is currently running on CPU X; the wakee
last ran on CPU Y.
• If this_cpu_load + wakee_weight <= prev_cpu_load, the target CPU is close
to X; otherwise it is close to Y.
• From the last-level-cache domain of the target, choose an
idle CPU. If there is no idle CPU, choose X or Y.
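The two steps can be sketched as two small functions. CPU sets are modelled as 64-bit masks, and the function names and parameters are hypothetical. The first function applies the load comparison from the slide; the second looks for an idle CPU among those sharing the last-level cache with the chosen target.

```c
#include <stdint.h>

/* Step 1: choose between the waker's CPU (X) and the wakee's
 * previous CPU (Y) using the load comparison from the slide. */
static int demo_wake_affine_target(int cpu_x, int cpu_y,
                                   unsigned long this_cpu_load,
                                   unsigned long prev_cpu_load,
                                   unsigned long wakee_weight)
{
    return (this_cpu_load + wakee_weight <= prev_cpu_load) ? cpu_x : cpu_y;
}

/* Step 2: prefer an idle CPU that shares the last-level cache with
 * the target; fall back to the target itself.
 * llc_mask:  bit c set if CPU c shares the cache with the target.
 * idle_mask: bit c set if CPU c is currently idle. */
static int demo_select_wake_cpu(int target, uint64_t llc_mask,
                                uint64_t idle_mask)
{
    uint64_t candidates = llc_mask & idle_mask;

    if (candidates & (UINT64_C(1) << target))
        return target;              /* the target itself is idle */
    for (int cpu = 0; cpu < 64; cpu++)
        if (candidates & (UINT64_C(1) << cpu))
            return cpu;             /* lowest-numbered idle sibling */
    return target;                  /* no idle CPU: stay with X or Y */
}
```

Staying within the last-level-cache domain keeps the wakee close to data that either it or the waker touched recently.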

39
Idle Load Balance(1)
• When one of the busy CPUs notices that idle
rebalancing may be needed, it kicks
the idle load balancer, which then does
idle load balancing on behalf of all the idle CPUs.
The kick conditions are:
– Now >= nohz.next_balance
– Number of running tasks > 2
– nohz.nr_cpus is not zero.

40
Idle Load Balance(2)
• Routine
– Find an idle load balancer CPU (the "ilber") and send an IPI_RESCHEDULE IPI to it
– After the ilber wakes up from the IPI:
• Do idle balancing for itself
• Help the other idle processors do load balancing.
• If it pulls tasks for another processor, send IPI_RESCHEDULE to that processor.

– Update nohz.next_balance to the ilber’s
next_balance

41

Más contenido relacionado

La actualidad más candente

Linux scheduling and input and output
Linux scheduling and input and outputLinux scheduling and input and output
Linux scheduling and input and outputSanidhya Chugh
 
Linux Internals - Kernel/Core
Linux Internals - Kernel/CoreLinux Internals - Kernel/Core
Linux Internals - Kernel/CoreShay Cohen
 
Linux Interrupts
Linux InterruptsLinux Interrupts
Linux InterruptsKernel TLV
 
eBPF Trace from Kernel to Userspace
eBPF Trace from Kernel to UserspaceeBPF Trace from Kernel to Userspace
eBPF Trace from Kernel to UserspaceSUSE Labs Taipei
 
DPDK: Multi Architecture High Performance Packet Processing
DPDK: Multi Architecture High Performance Packet ProcessingDPDK: Multi Architecture High Performance Packet Processing
DPDK: Multi Architecture High Performance Packet ProcessingMichelle Holley
 
Scheduling in Android
Scheduling in AndroidScheduling in Android
Scheduling in AndroidOpersys inc.
 
Linux Kernel Module - For NLKB
Linux Kernel Module - For NLKBLinux Kernel Module - For NLKB
Linux Kernel Module - For NLKBshimosawa
 
Jagan Teki - U-boot from scratch
Jagan Teki - U-boot from scratchJagan Teki - U-boot from scratch
Jagan Teki - U-boot from scratchlinuxlab_conf
 
Network Programming: Data Plane Development Kit (DPDK)
Network Programming: Data Plane Development Kit (DPDK)Network Programming: Data Plane Development Kit (DPDK)
Network Programming: Data Plane Development Kit (DPDK)Andriy Berestovskyy
 
Introduction to Linux Kernel by Quontra Solutions
Introduction to Linux Kernel by Quontra SolutionsIntroduction to Linux Kernel by Quontra Solutions
Introduction to Linux Kernel by Quontra SolutionsQUONTRASOLUTIONS
 
Intel DPDK Step by Step instructions
Intel DPDK Step by Step instructionsIntel DPDK Step by Step instructions
Intel DPDK Step by Step instructionsHisaki Ohara
 
BPF Internals (eBPF)
BPF Internals (eBPF)BPF Internals (eBPF)
BPF Internals (eBPF)Brendan Gregg
 
LAS16-105: Walkthrough of the EAS kernel adaptation to the Android Common Kernel
LAS16-105: Walkthrough of the EAS kernel adaptation to the Android Common KernelLAS16-105: Walkthrough of the EAS kernel adaptation to the Android Common Kernel
LAS16-105: Walkthrough of the EAS kernel adaptation to the Android Common KernelLinaro
 
WALT vs PELT : Redux - SFO17-307
WALT vs PELT : Redux  - SFO17-307WALT vs PELT : Redux  - SFO17-307
WALT vs PELT : Redux - SFO17-307Linaro
 
Performance Wins with BPF: Getting Started
Performance Wins with BPF: Getting StartedPerformance Wins with BPF: Getting Started
Performance Wins with BPF: Getting StartedBrendan Gregg
 
ACPI Debugging from Linux Kernel
ACPI Debugging from Linux KernelACPI Debugging from Linux Kernel
ACPI Debugging from Linux KernelSUSE Labs Taipei
 
Linux Preempt-RT Internals
Linux Preempt-RT InternalsLinux Preempt-RT Internals
Linux Preempt-RT Internals哲豪 康哲豪
 

La actualidad más candente (20)

Linux scheduling and input and output
Linux scheduling and input and outputLinux scheduling and input and output
Linux scheduling and input and output
 
Linux Internals - Kernel/Core
Linux Internals - Kernel/CoreLinux Internals - Kernel/Core
Linux Internals - Kernel/Core
 
Making Linux do Hard Real-time
Making Linux do Hard Real-timeMaking Linux do Hard Real-time
Making Linux do Hard Real-time
 
Linux Interrupts
Linux InterruptsLinux Interrupts
Linux Interrupts
 
eBPF Trace from Kernel to Userspace
eBPF Trace from Kernel to UserspaceeBPF Trace from Kernel to Userspace
eBPF Trace from Kernel to Userspace
 
DPDK: Multi Architecture High Performance Packet Processing
DPDK: Multi Architecture High Performance Packet ProcessingDPDK: Multi Architecture High Performance Packet Processing
DPDK: Multi Architecture High Performance Packet Processing
 
Scheduling in Android
Scheduling in AndroidScheduling in Android
Scheduling in Android
 
Linux Kernel Module - For NLKB
Linux Kernel Module - For NLKBLinux Kernel Module - For NLKB
Linux Kernel Module - For NLKB
 
Jagan Teki - U-boot from scratch
Jagan Teki - U-boot from scratchJagan Teki - U-boot from scratch
Jagan Teki - U-boot from scratch
 
Network Programming: Data Plane Development Kit (DPDK)
Network Programming: Data Plane Development Kit (DPDK)Network Programming: Data Plane Development Kit (DPDK)
Network Programming: Data Plane Development Kit (DPDK)
 
Introduction to Linux Kernel by Quontra Solutions
Introduction to Linux Kernel by Quontra SolutionsIntroduction to Linux Kernel by Quontra Solutions
Introduction to Linux Kernel by Quontra Solutions
 
Intel DPDK Step by Step instructions
Intel DPDK Step by Step instructionsIntel DPDK Step by Step instructions
Intel DPDK Step by Step instructions
 
Introduction to Modern U-Boot
Introduction to Modern U-BootIntroduction to Modern U-Boot
Introduction to Modern U-Boot
 
BPF Internals (eBPF)
BPF Internals (eBPF)BPF Internals (eBPF)
BPF Internals (eBPF)
 
LAS16-105: Walkthrough of the EAS kernel adaptation to the Android Common Kernel
LAS16-105: Walkthrough of the EAS kernel adaptation to the Android Common KernelLAS16-105: Walkthrough of the EAS kernel adaptation to the Android Common Kernel
LAS16-105: Walkthrough of the EAS kernel adaptation to the Android Common Kernel
 
WALT vs PELT : Redux - SFO17-307
WALT vs PELT : Redux  - SFO17-307WALT vs PELT : Redux  - SFO17-307
WALT vs PELT : Redux - SFO17-307
 
Performance Wins with BPF: Getting Started
Performance Wins with BPF: Getting StartedPerformance Wins with BPF: Getting Started
Performance Wins with BPF: Getting Started
 
ACPI Debugging from Linux Kernel
ACPI Debugging from Linux KernelACPI Debugging from Linux Kernel
ACPI Debugging from Linux Kernel
 
Linux Device Tree
Linux Device TreeLinux Device Tree
Linux Device Tree
 
Linux Preempt-RT Internals
Linux Preempt-RT InternalsLinux Preempt-RT Internals
Linux Preempt-RT Internals
 

Similar a Process Scheduler and Balancer in Linux Kernel

Introduction to ARM big.LITTLE technology
Introduction to ARM big.LITTLE technologyIntroduction to ARM big.LITTLE technology
Introduction to ARM big.LITTLE technology義洋 顏
 
Scheduling in Linux and Web Servers
Scheduling in Linux and Web ServersScheduling in Linux and Web Servers
Scheduling in Linux and Web ServersDavid Evans
 
Embedded_ PPT_4-5 unit_Dr Monika-edited.pptx
Embedded_ PPT_4-5 unit_Dr Monika-edited.pptxEmbedded_ PPT_4-5 unit_Dr Monika-edited.pptx
Embedded_ PPT_4-5 unit_Dr Monika-edited.pptxProfMonikaJain
 
RTOS Material hfffffffffffffffffffffffffffffffffffff
RTOS Material hfffffffffffffffffffffffffffffffffffffRTOS Material hfffffffffffffffffffffffffffffffffffff
RTOS Material hfffffffffffffffffffffffffffffffffffffadugnanegero
 
OS Process and Thread Concepts
OS Process and Thread ConceptsOS Process and Thread Concepts
OS Process and Thread Conceptssgpraju
 
Real Time Kernels
Real Time KernelsReal Time Kernels
Real Time KernelsArnav Soni
 
Process scheduling
Process schedulingProcess scheduling
Process schedulingHao-Ran Liu
 
load-balancing-method-for-embedded-rt-system-20120711-0940
load-balancing-method-for-embedded-rt-system-20120711-0940load-balancing-method-for-embedded-rt-system-20120711-0940
load-balancing-method-for-embedded-rt-system-20120711-0940Samsung Electronics
 
Scheduling in Android
Scheduling in AndroidScheduling in Android
Scheduling in AndroidOpersys inc.
 
XPDDS19: Core Scheduling in Xen - Jürgen Groß, SUSE
XPDDS19: Core Scheduling in Xen - Jürgen Groß, SUSEXPDDS19: Core Scheduling in Xen - Jürgen Groß, SUSE
XPDDS19: Core Scheduling in Xen - Jürgen Groß, SUSEThe Linux Foundation
 
Process scheduling &amp; time
Process scheduling &amp; timeProcess scheduling &amp; time
Process scheduling &amp; timeYojana Nanaware
 
POLITEKNIK MALAYSIA
POLITEKNIK MALAYSIAPOLITEKNIK MALAYSIA
POLITEKNIK MALAYSIAAiman Hud
 
Fast boot
Fast bootFast boot
Fast bootSZ Lin
 
Linux kernel development ch4
Linux kernel development   ch4Linux kernel development   ch4
Linux kernel development ch4huangachou
 
Build an High-Performance and High-Durable Block Storage Service Based on Ceph
Build an High-Performance and High-Durable Block Storage Service Based on CephBuild an High-Performance and High-Durable Block Storage Service Based on Ceph
Build an High-Performance and High-Durable Block Storage Service Based on CephRongze Zhu
 

Similar a Process Scheduler and Balancer in Linux Kernel (20)

Introduction to ARM big.LITTLE technology
Introduction to ARM big.LITTLE technologyIntroduction to ARM big.LITTLE technology
Introduction to ARM big.LITTLE technology
 
Scheduling in Linux and Web Servers
Scheduling in Linux and Web ServersScheduling in Linux and Web Servers
Scheduling in Linux and Web Servers
 
Os2
Os2Os2
Os2
 
Embedded_ PPT_4-5 unit_Dr Monika-edited.pptx
Embedded_ PPT_4-5 unit_Dr Monika-edited.pptxEmbedded_ PPT_4-5 unit_Dr Monika-edited.pptx
Embedded_ PPT_4-5 unit_Dr Monika-edited.pptx
 
RTOS Material hfffffffffffffffffffffffffffffffffffff
RTOS Material hfffffffffffffffffffffffffffffffffffffRTOS Material hfffffffffffffffffffffffffffffffffffff
RTOS Material hfffffffffffffffffffffffffffffffffffff
 
OS Process and Thread Concepts
OS Process and Thread ConceptsOS Process and Thread Concepts
OS Process and Thread Concepts
 
Real Time Kernels
Real Time KernelsReal Time Kernels
Real Time Kernels
 
Mastering Real-time Linux
Mastering Real-time LinuxMastering Real-time Linux
Mastering Real-time Linux
 
Process scheduling
Process schedulingProcess scheduling
Process scheduling
 
load-balancing-method-for-embedded-rt-system-20120711-0940
load-balancing-method-for-embedded-rt-system-20120711-0940load-balancing-method-for-embedded-rt-system-20120711-0940
load-balancing-method-for-embedded-rt-system-20120711-0940
 
Scheduling in Android
Scheduling in AndroidScheduling in Android
Scheduling in Android
 
Section05 scheduling
Section05 schedulingSection05 scheduling
Section05 scheduling
 
Ch6 cpu scheduling
Ch6   cpu schedulingCh6   cpu scheduling
Ch6 cpu scheduling
 
XPDDS19: Core Scheduling in Xen - Jürgen Groß, SUSE
XPDDS19: Core Scheduling in Xen - Jürgen Groß, SUSEXPDDS19: Core Scheduling in Xen - Jürgen Groß, SUSE
XPDDS19: Core Scheduling in Xen - Jürgen Groß, SUSE
 
Process scheduling &amp; time
Process scheduling &amp; timeProcess scheduling &amp; time
Process scheduling &amp; time
 
POLITEKNIK MALAYSIA
POLITEKNIK MALAYSIAPOLITEKNIK MALAYSIA
POLITEKNIK MALAYSIA
 
General Purpose GPU Computing
General Purpose GPU ComputingGeneral Purpose GPU Computing
General Purpose GPU Computing
 
Fast boot
Fast bootFast boot
Fast boot
 
Linux kernel development ch4
Linux kernel development   ch4Linux kernel development   ch4
Linux kernel development ch4
 
Build an High-Performance and High-Durable Block Storage Service Based on Ceph
Build an High-Performance and High-Durable Block Storage Service Based on CephBuild an High-Performance and High-Durable Block Storage Service Based on Ceph
Build an High-Performance and High-Durable Block Storage Service Based on Ceph
 

Último

Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodJuan lago vázquez
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...apidays
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoffsammart93
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdfSandro Moreira
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businesspanagenda
 
Six Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal OntologySix Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal Ontologyjohnbeverley2021
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FMESafe Software
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century educationjfdjdjcjdnsjd
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...Zilliz
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyKhushali Kathiriya
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistandanishmna97
 
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot ModelMcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot ModelDeepika Singh
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesrafiqahmad00786416
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...DianaGray10
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024The Digital Insurer
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Victor Rentea
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FMESafe Software
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsNanddeep Nachan
 

Último (20)

Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..
 
Six Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal OntologySix Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal Ontology
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 

Process Scheduler and Balancer in Linux Kernel

  • 1. Process Scheduler and Balancer in Kernel Haifeng Li 2014-3-3
  • 3. Outline
      • Introduction of scheduler
      • Scheduler history
          – Round-Robin Scheduler
          – O(N)
          – O(1)
      • Completely Fair Scheduler
      • Real Time Scheduler
  • 4. Introduction of Scheduler
      • Scheduler
          – Determines which process runs when there are multiple runnable processes.
      • Linux scheduler history

          Linux version        Scheduler
          Before 2.4           Round Robin Scheduler
          2.4                  O(N)
          2.5.17 to 2.6.23     O(1)
          2.6.23 to now        Completely Fair Scheduler
  • 5. Round Robin Scheduler
          struct task_struct {
              long counter;
              long priority;
              …
          }
      • Algorithm
          – Init: p->counter = current->counter >> 1
          – At each tick: current->counter--
          – When current->counter == 0, the system picks the thread with the highest counter to run.
          – When every thread's counter is 0, reset the counters: p->counter = (p->counter >> 1) + p->priority
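The three steps of the round-robin algorithm above can be sketched in a few lines of C. This is a minimal illustration, not the historical kernel code: `struct rr_task` and the `rr_*` function names are hypothetical, and the run list is simplified to a plain array.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the two task_struct fields shown on the slide. */
struct rr_task {
    long counter;   /* ticks left in the current epoch */
    long priority;  /* ticks granted at each recalculation */
};

/* Timer tick: charge one tick to the running task.
 * Returns 1 when the task has used up its counter and must yield. */
static int rr_tick(struct rr_task *current)
{
    assert(current != NULL);
    if (current->counter > 0)
        current->counter--;
    return current->counter == 0;
}

/* Pick the task with the largest remaining counter; NULL if all are 0. */
static struct rr_task *rr_pick_next(struct rr_task *tasks, size_t n)
{
    struct rr_task *best = NULL;
    for (size_t i = 0; i < n; i++)
        if (tasks[i].counter > 0 && (!best || tasks[i].counter > best->counter))
            best = &tasks[i];
    return best;
}

/* New epoch: keep half of any unused counter and add the priority. */
static void rr_recalc(struct rr_task *tasks, size_t n)
{
    for (size_t i = 0; i < n; i++)
        tasks[i].counter = (tasks[i].counter >> 1) + tasks[i].priority;
}
```

When `rr_pick_next()` returns NULL, every counter is 0 and the caller would run `rr_recalc()` before picking again. A task that still had ticks left (for example because it slept) keeps half of them, which is how this scheme favours interactive tasks.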
  • 6. O(N) Scheduler
          struct task_struct {
              …
              long counter;
              long nice;
              …
          }
      • Algorithm
          – All runnable tasks are kept in one global list.
          – The time slice depends on priority and CONFIG_HZ.
          – To pick the next task, choose the task with the largest weight among those with p->counter != 0, where weight = p->counter + (20 - nice).
          – After all tasks have used up their time slices, recalculate the counters.

          #if HZ < 200
          #define TICK_SCALE(x) ((x) >> 2)
          #elif HZ < 400
          #define TICK_SCALE(x) ((x) >> 1)
          …
          #endif
          #define NICE_TO_TICKS(nice) (TICK_SCALE(20-(nice))+1)
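The selection rule on this slide can be sketched as a linear scan. The sketch assumes only the weight formula given above; `struct on_task`, `on_weight()` and `on_pick_next()` are illustrative names, and the global list is simplified to an array.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the task_struct fields shown on the slide. */
struct on_task {
    long counter;  /* remaining time slice in ticks */
    long nice;     /* -20 (highest priority) to 19 (lowest) */
};

/* Weight from the slide: remaining slice plus a priority term. */
static long on_weight(const struct on_task *p)
{
    assert(p->nice >= -20 && p->nice <= 19);
    return p->counter + (20 - p->nice);
}

/* Walk the whole run list. This scan over every runnable task is what
 * makes each scheduling decision O(N). */
static const struct on_task *on_pick_next(const struct on_task *list, size_t n)
{
    const struct on_task *best = NULL;
    for (size_t i = 0; i < n; i++) {
        if (list[i].counter == 0)
            continue;  /* slice used up; wait for the recalculation */
        if (!best || on_weight(&list[i]) > on_weight(best))
            best = &list[i];
    }
    return best;
}
```

A NULL result means every runnable task has exhausted its slice, which is the point at which the counters are recalculated.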
  • 7. O(1) Scheduler (1)
      • This scheduler uses two priority arrays per processor to keep track of the processor's ready tasks.
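A rough sketch of the two-array idea follows. The slide does not name the arrays; the `active`/`expired` naming follows the 2.6 kernel, while everything else is simplified and hypothetical (per-priority task lists are reduced to counters, and the bitmap search is a plain loop rather than an optimized find-first-bit).

```c
#include <assert.h>
#include <string.h>

#define MAX_PRIO      140
#define BITMAP_WORDS  ((MAX_PRIO + 31) / 32)

struct prio_array {
    unsigned int nr_active;              /* tasks queued in this array */
    unsigned int bitmap[BITMAP_WORDS];   /* bit p set: level p is non-empty */
    unsigned int nr_queued[MAX_PRIO];    /* stand-in for per-priority lists */
};

struct runqueue {
    struct prio_array *active;   /* tasks that still have time slice left */
    struct prio_array *expired;  /* tasks that used up their time slice */
    struct prio_array arrays[2];
};

static void rq_init(struct runqueue *rq)
{
    memset(rq, 0, sizeof(*rq));
    rq->active = &rq->arrays[0];
    rq->expired = &rq->arrays[1];
}

static void array_enqueue(struct prio_array *a, int prio)
{
    assert(prio >= 0 && prio < MAX_PRIO);
    a->nr_queued[prio]++;
    a->bitmap[prio / 32] |= 1u << (prio % 32);
    a->nr_active++;
}

/* Lowest set bit is the highest priority. The cost is bounded by the
 * bitmap size, not by the number of tasks, hence O(1). */
static int array_first_prio(const struct prio_array *a)
{
    for (int w = 0; w < BITMAP_WORDS; w++) {
        if (!a->bitmap[w])
            continue;
        for (int b = 0; b < 32; b++)
            if (a->bitmap[w] & (1u << b))
                return w * 32 + b;
    }
    return -1;
}

/* When the active array drains, swap the two pointers instead of
 * recalculating every task's slice in one pass. */
static int rq_next_prio(struct runqueue *rq)
{
    if (rq->active->nr_active == 0) {
        struct prio_array *tmp = rq->active;
        rq->active = rq->expired;
        rq->expired = tmp;
    }
    return array_first_prio(rq->active);
}
```

The pointer swap is the design point: it replaces the O(N) scheduler's "recalculate all counters" pass with a constant-time operation.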
  • 8. O(1) Scheduler (2)
      • Time slice
          #define SCALE_PRIO(x, prio) \
              max(x * (MAX_PRIO - prio) / (MAX_USER_PRIO/2), MIN_TIMESLICE)

          static unsigned int task_timeslice(task_t *p)
          {
              if (p->static_prio < NICE_TO_PRIO(0))
                  return SCALE_PRIO(DEF_TIMESLICE*4, p->static_prio);
              else
                  return SCALE_PRIO(DEF_TIMESLICE, p->static_prio);
          }
      • Dynamic priority
          – max(100, min(static_priority - bonus + 5, 139))
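To see what time slices the macro above produces, here is a standalone version. The constant values are assumptions taken from the 2.6 kernel and are expressed in milliseconds (the kernel itself works in jiffies); `task_timeslice_ms()` takes a nice value directly, which is a simplification for illustration.

```c
#include <assert.h>

/* Assumed constants (2.6-era O(1) scheduler), in milliseconds. */
#define MAX_RT_PRIO     100
#define MAX_PRIO        140
#define MAX_USER_PRIO   (MAX_PRIO - MAX_RT_PRIO)   /* 40 nice levels */
#define MIN_TIMESLICE   5
#define DEF_TIMESLICE   100
#define NICE_TO_PRIO(nice)  (MAX_RT_PRIO + (nice) + 20)

static int max_int(int a, int b)
{
    return a > b ? a : b;
}

/* Function form of SCALE_PRIO from the slide. */
static int scale_prio(int x, int prio)
{
    return max_int(x * (MAX_PRIO - prio) / (MAX_USER_PRIO / 2), MIN_TIMESLICE);
}

/* Time slice in ms for a given nice value. */
static int task_timeslice_ms(int nice)
{
    assert(nice >= -20 && nice <= 19);
    int static_prio = NICE_TO_PRIO(nice);

    if (static_prio < NICE_TO_PRIO(0))          /* negative nice: 4x base */
        return scale_prio(DEF_TIMESLICE * 4, static_prio);
    return scale_prio(DEF_TIMESLICE, static_prio);
}
```

Under these assumptions nice -20 gets 800 ms, nice 0 gets 100 ms and nice 19 gets the 5 ms minimum.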
  • 9. O(1) Scheduler (3)
      • The bonus is derived from sleep time.
      • MAX_SLEEP_AVG is 1000 ms; MAX_BONUS is 10.
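Combining this slide with the dynamic-priority formula max(100, min(static_priority - bonus + 5, 139)) gives the sketch below. The linear mapping from average sleep time to bonus is an assumption modelled on the 2.6 kernel's CURRENT_BONUS macro; the slide itself only states that the bonus comes from sleep time. Function names are illustrative.

```c
#include <assert.h>

#define MAX_RT_PRIO    100
#define MAX_PRIO       140
#define MAX_BONUS      10
#define MAX_SLEEP_AVG  1000   /* ms */

/* Assumed mapping: bonus grows linearly from 0 to MAX_BONUS
 * with the task's average sleep time. */
static int current_bonus(int sleep_avg_ms)
{
    assert(sleep_avg_ms >= 0 && sleep_avg_ms <= MAX_SLEEP_AVG);
    return sleep_avg_ms * MAX_BONUS / MAX_SLEEP_AVG;
}

/* max(100, min(static_prio - bonus + 5, 139)) */
static int effective_prio(int static_prio, int sleep_avg_ms)
{
    int prio = static_prio - current_bonus(sleep_avg_ms) + MAX_BONUS / 2;

    if (prio < MAX_RT_PRIO)
        prio = MAX_RT_PRIO;
    if (prio > MAX_PRIO - 1)
        prio = MAX_PRIO - 1;
    return prio;
}
```

A nice-0 task (static priority 120) that sleeps most of the time is boosted to 115, while one that never sleeps is penalized to 125, so the adjustment spans five levels either way.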
  • 10. CFS Scheduler: Concept (1)
      • "Ideal multi-tasking CPU" is a (non-existent :-)) CPU which can run each task at precise equal speed and equal share. [1]
      [1] Documentation/scheduler/sched-design-CFS.txt
  • 11. CFS Scheduler: Concept (2)
      • The actual situation looks like this, which is obviously not fair:
          [Figure]
      • So the concept of "virtual runtime" is introduced.
      Picture from: Completely Fair Scheduler, Linux Journal, Issue #184, August 2009
  • 12. CFS Scheduler: Virtual Runtime (1)
      • The virtual runtime of a task specifies when its next time slice would start execution on the ideal multi-tasking CPU. [1]
      • CFS tries to maintain an equal virtual runtime for each task in a CPU's run_queue at all times.
          – Reason: tasks would execute simultaneously and no task would ever get "out of balance" from the "ideal" share of CPU time. [1]
      • CFS always tries to run the task with the smallest virtual runtime value.
      [1] Documentation/scheduler/sched-design-CFS.txt
  • 13. CFS Scheduler: Virtual Runtime (2)
      • One period time for all tasks: (1)
      • Time slice for a task on the real processor: (2)
      • Virtual runtime: (3)
      • According to (2) and (3), we get: (4)
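The four equations on this slide are not present in the transcript. The block below is a reconstruction, not the slide's own notation: it is based on how the kernel's `__sched_period()`, `sched_slice()` and `calc_delta_fair()` compute these quantities. The symbols are assumptions introduced here: $n$ is the number of runnable tasks, $L$ the target latency (`sysctl_sched_latency`), $g_{\min}$ the minimum granularity, $n_{\text{latency}} = L / g_{\min}$, $w_i$ the weight of task $i$, and $w_0$ the weight of a nice-0 task (`NICE_0_LOAD`, 1024).

```latex
\begin{align}
P &= \begin{cases}
       L, & n \le n_{\text{latency}} \\
       n \cdot g_{\min}, & \text{otherwise}
     \end{cases} \tag{1} \\
\text{slice}_i &= P \cdot \frac{w_i}{\sum_j w_j} \tag{2} \\
\Delta v_i &= \Delta t_i \cdot \frac{w_0}{w_i} \tag{3} \\
\Delta v_i \Big|_{\Delta t_i = \text{slice}_i} &= P \cdot \frac{w_0}{\sum_j w_j} \tag{4}
\end{align}
```

Substituting (2) into (3) cancels $w_i$, so in (4) every task's virtual runtime advances by the same amount over one period regardless of its weight. That is why keeping virtual runtimes equal yields weight-proportional CPU shares.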
  • 14. A demo: understanding virtual runtime
      • Thread 1: weight 2; Thread 2: weight 5
      • Period clock: P = 10 ms (HZ: 100)

          Clock sequence    Virtual runtime 1    Virtual runtime 2
          0                 0                    0
          1                 1/2 * P              0
          2                 1/2 * P              1/5 * P
          3                 1/2 * P              2/5 * P
          4                 1/2 * P              3/5 * P
          5                 1 * P                3/5 * P
          6                 1 * P                4/5 * P
          7                 1 * P                1 * P
          …                 …                    …
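The table can be reproduced with a short simulation. It assumes, as the table implies, that each tick adds P/weight to the running thread's virtual runtime and that ties go to thread 1. Virtual runtime is stored in units of P/10 so that 1/2 * P and 1/5 * P stay integral; `vruntime_demo()` is an illustrative name.

```c
#include <assert.h>

#define NTHREADS 2

/* Replay the demo for `ticks` clock ticks.
 * On return, vruntime[i] holds thread i+1's virtual runtime in units of P/10. */
static void vruntime_demo(int ticks, long vruntime[NTHREADS])
{
    /* Per-tick increment P/weight: 10/2 for weight 2, 10/5 for weight 5. */
    static const long step[NTHREADS] = { 5, 2 };

    assert(ticks >= 0);
    vruntime[0] = 0;
    vruntime[1] = 0;
    for (int t = 0; t < ticks; t++) {
        /* Run the thread with the smallest vruntime; ties go to thread 1. */
        int next = (vruntime[1] < vruntime[0]) ? 1 : 0;
        vruntime[next] += step[next];
    }
}
```

After 7 ticks both threads sit at 1 * P: thread 1 ran for 2 ticks and thread 2 for 5, matching their 2:5 weights.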
  • 15. CFS Scheduler: Priority & Weight
          static const int prio_to_weight[40] = {
           /* -20 */     88761,     71755,     56483,     46273,     36291,
           /* -15 */     29154,     23254,     18705,     14949,     11916,
           /* -10 */      9548,      7620,      6100,      4904,      3906,
           /*  -5 */      3121,      2501,      1991,      1586,      1277,
           /*   0 */      1024,       820,       655,       526,       423,
           /*   5 */       335,       272,       215,       172,       137,
           /*  10 */       110,        87,        70,        56,        45,
           /*  15 */        36,        29,        23,        18,        15,
          };
      [Chart: weight (0 to 100000) plotted against nice value (-20 to 19)]
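The weights are spaced by a factor of roughly 1.25 per nice level, so moving a task one nice level shifts about 10% of CPU time between two competing tasks. The sketch below makes that concrete for the simple case of two always-runnable tasks on one CPU; `nice_to_weight()` and `cpu_share_permille()` are hypothetical helpers, not kernel functions.

```c
#include <assert.h>

static const int prio_to_weight[40] = {
 /* -20 */ 88761, 71755, 56483, 46273, 36291,
 /* -15 */ 29154, 23254, 18705, 14949, 11916,
 /* -10 */  9548,  7620,  6100,  4904,  3906,
 /*  -5 */  3121,  2501,  1991,  1586,  1277,
 /*   0 */  1024,   820,   655,   526,   423,
 /*   5 */   335,   272,   215,   172,   137,
 /*  10 */   110,    87,    70,    56,    45,
 /*  15 */    36,    29,    23,    18,    15,
};

static int nice_to_weight(int nice)
{
    assert(nice >= -20 && nice <= 19);
    return prio_to_weight[nice + 20];
}

/* CPU share of task A, in thousandths, when it competes with task B only. */
static int cpu_share_permille(int nice_a, int nice_b)
{
    long wa = nice_to_weight(nice_a);
    long wb = nice_to_weight(nice_b);
    return (int)(wa * 1000 / (wa + wb));
}
```

Two nice-0 tasks split the CPU evenly; a nice-0 task competing with a nice-1 task gets about 55.5%, and against a nice-5 task about 75.3%.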
  • 16. CFS Scheduler: Implementation (1)
      • CFS uses a virtual runtime-ordered red-black tree to build a "timeline" of future task execution.
  • 17. CFS Scheduler: Implementation (2)
      More: http://www.ibm.com/developerworks/library/l-completely-fair-scheduler/
  • 18. Real Time Scheduler
      • The real-time scheduler has to ensure system-wide strict real-time priority scheduling (SWSRPS).
      • Only the N highest-priority tasks should be running at any given point in time, where N is the number of CPUs.
      • Frequent task balancing can introduce cache thrashing and contention for global data (such as runqueue locks) and can degrade throughput.
      • Two policies
          – SCHED_RR
          – SCHED_FIFO
  • 19. Key Structures
          struct cpupri_vec {
              atomic_t count;
              cpumask_var_t mask;
          };

          struct cpupri {
              struct cpupri_vec pri_to_cpu[CPUPRI_NR_PRIORITIES];
              int cpu_to_pri[NR_CPUS];
          };
  • 20. Overview of RT Scheduler Algorithm
      • The scheduler has to address several scenarios:
          – Where to place a task optimally on wakeup (that is, pre-balance).
          – What to do with a lower-priority task when it wakes up but is on a runqueue running a task of higher priority.
          – What to do with a low-priority task when a higher-priority task on the same runqueue wakes up and preempts it.
          – What to do when a task lowers its priority and thereby causes a previously lower-priority task to have the higher priority.
      More: http://www.linuxjournal.com/magazine/real-time-linux-kernel-scheduler
  • 21. A Demo of RT Scheduler Algorithm
  • 22. Scheduler decision
      1. Start with the top scheduler class.
      2. Runnable task available?
          – No: pick the next scheduler class and repeat step 2.
          – Yes: pick the next task of this scheduler class.
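The decision flow above amounts to a loop over a priority-ordered chain of scheduler classes. The sketch below is a simplification with hypothetical types: the class order (real-time, then CFS, then idle) and the `rq_sketch` fields are assumptions for illustration, and each class's pick function is reduced to returning a preselected task.

```c
#include <assert.h>
#include <stddef.h>

struct task { const char *name; };

/* Hypothetical runqueue: at most one candidate task per class. */
struct rq_sketch {
    struct task *rt;    /* runnable real-time task, or NULL */
    struct task *fair;  /* runnable CFS task, or NULL */
    struct task idle;   /* the idle task is always available */
};

struct sched_class_sketch {
    const struct sched_class_sketch *next;                 /* next lower class */
    struct task *(*pick_next_task)(struct rq_sketch *rq);  /* NULL if none */
};

static struct task *pick_rt(struct rq_sketch *rq)   { return rq->rt; }
static struct task *pick_fair(struct rq_sketch *rq) { return rq->fair; }
static struct task *pick_idle(struct rq_sketch *rq) { return &rq->idle; }

static const struct sched_class_sketch idle_class = { NULL, pick_idle };
static const struct sched_class_sketch fair_class = { &idle_class, pick_fair };
static const struct sched_class_sketch rt_class   = { &fair_class, pick_rt };

/* Start with the top class; fall through to the next class whenever
 * the current one has nothing runnable. */
static struct task *pick_next_task(struct rq_sketch *rq)
{
    assert(rq != NULL);
    for (const struct sched_class_sketch *c = &rt_class; c; c = c->next) {
        struct task *p = c->pick_next_task(rq);
        if (p)
            return p;
    }
    return NULL;  /* not reached while the idle class is last in the chain */
}
```

Because the idle class always returns a task, the loop is guaranteed to terminate with something to run.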
  • 24. Outline
      • Objective
      • How to balance among cores
          – Hierarchy & key data structures
      • Scenarios of balance
  • 25. Objective
      1. Prevent processors from being idle while other processors still have tasks waiting to execute. [1]
      2. Keep the difference in the number of ready tasks across all processors as small as possible. [1]
      Addition: try to save power while the load is light. [2]
      [1] Chun-Yu Lai, Performance Evaluation of Linux Kernel Load Balancing Mechanisms, 2006
      [2] Suresh Siddha, Chip Multi Processing aware Linux Kernel Scheduler, 2006 Linux Symposium
  • 26. Hierarchy
      • Scheduling domain: each scheduling domain spans a number of CPUs.
      • Scheduling group: each scheduling domain must have one or more CPU groups, which are organized as a circular one-way linked list.
      • Balancing within a scheduling domain occurs between groups.
      More information: http://lwn.net/Articles/80911/
  • 27. A Demo of Hierarchy
  • 28. Key members of sched_domain
          struct sched_domain {
              /* These fields must be setup */
              struct sched_domain *parent;    /* top domain must be null terminated */
              struct sched_domain *child;     /* bottom domain must be null terminated */
              struct sched_group *groups;     /* the balancing groups of the domain */
              …
              unsigned int busy_factor;       /* less balancing by factor if busy */
              unsigned int imbalance_pct;     /* No balance until over watermark */
              …
              int flags;                      /* See SD_* */
              …
              unsigned long last_balance;     /* init to jiffies. units in jiffies */
              unsigned int balance_interval;  /* initialise to 1. units in ms. */
              unsigned int span_weight;
              unsigned long span[0];
          };

          /*
           * sched-domains (multiprocessor balancing) declarations:
           */
          #ifdef CONFIG_SMP
          #define SD_LOAD_BALANCE     0x0001  /* Do load balancing on this domain. */
          #define SD_BALANCE_NEWIDLE  0x0002  /* Balance when about to become idle */
          #define SD_BALANCE_EXEC     0x0004  /* Balance on exec */
          #define SD_BALANCE_FORK     0x0008  /* Balance on fork, clone */
          #define SD_BALANCE_WAKE     0x0010  /* Balance on wakeup */
          #define SD_WAKE_AFFINE      0x0020  /* Wake task to waking CPU */
          #define SD_SHARE_CPUPOWER   0x0080  /* Domain members share cpu power */
  • 29. Key members of sched_group
          struct sched_group {
              struct sched_group *next;   /* Must be a circular list */
              …
              unsigned int group_weight;
              struct sched_group_power *sgp;
              …
              unsigned long cpumask[0];
          };

          struct sched_group_power {
              …
              unsigned int power;
              …
          };
  • 30. An example of sched_domain
          #define SD_CPU_INIT (struct sched_domain) {    \
              .busy_factor      = 64,                    \
              .imbalance_pct    = 125,                   \
              .flags            = 1*SD_LOAD_BALANCE      \
                                | 1*SD_BALANCE_NEWIDLE   \
                                | 1*SD_BALANCE_EXEC      \
                                | 1*SD_BALANCE_FORK      \
                                | 0*SD_BALANCE_WAKE      \
                                | 1*SD_WAKE_AFFINE       \
                                ,                        \
              .last_balance     = jiffies,               \
              .balance_interval = 1,                     \
          }
  • 31. CFS Load Balancing: How to
      • load_balance offloads tasks from the busiest runqueue of the busiest group (the one with the most runnable tasks), choosing tasks that are:
          – inactive (likely to be cache cold)
          – high priority
      • load_balance skips tasks that are:
          – currently running on a CPU
          – not allowed to run on the current CPU (as indicated by the cpus_allowed bitmask in the task_struct)
          – still cache warm on their current CPU
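The three skip rules can be expressed as a single predicate. This is a sketch under stated assumptions: `struct lb_task` is hypothetical, CPU affinity is reduced to a 64-bit mask, and "cache warm" is approximated as "ran within the last `cache_hot_ns` nanoseconds", with the threshold supplied by the caller.

```c
#include <assert.h>

/* Hypothetical reduction of a task to the fields the filter needs. */
struct lb_task {
    int running;                      /* currently executing on a CPU */
    unsigned long long cpus_allowed;  /* bit n set: task may run on CPU n */
    unsigned long long last_ran_ns;   /* when the task last started running */
};

/* Returns 1 if the task may be pulled to dst_cpu, 0 if it must be skipped. */
static int can_migrate_task(const struct lb_task *p, int dst_cpu,
                            unsigned long long now_ns,
                            unsigned long long cache_hot_ns)
{
    assert(dst_cpu >= 0 && dst_cpu < 64);

    if (!(p->cpus_allowed & (1ULL << dst_cpu)))
        return 0;  /* not allowed to run on the pulling CPU */
    if (p->running)
        return 0;  /* currently running on its CPU */
    if (now_ns - p->last_ran_ns < cache_hot_ns)
        return 0;  /* ran recently: probably still cache warm */
    return 1;
}
```

The order of the checks does not change the result; cheap tests come first so that the time comparison runs only for otherwise eligible tasks.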
  • 32. How busiest is the busiest group?
      • In the current domain level, the group with the largest average load is the busiest group.
          – If the current processor is idle, the busiest group must also have more running threads than it has cores.
          – Else
      • If a busiest group is found, this domain is unbalanced.
  • 33. Restore balance
      • How much load to actually move to equalize the imbalance: (1) (2) (3)
      • Offload min(imbalance_x) from the busiest runqueue in the busiest group to restore balance.
      • The busiest runqueue is the one with the maximum load weight in the busiest group.
  • 34. Load Balancing: Idle balancing
      • Idle balancing
          – In schedule(), if this CPU is about to become idle, it attempts to pull one task from the busiest CPUs.

          for_each_domain(this_cpu, sd) {
              if (!(sd->flags & SD_LOAD_BALANCE))
                  continue;
              pulled_task = load_balance(this_cpu, this_rq, sd,
                                         CPU_NEWLY_IDLE, &balance);
              if (pulled_task)
                  break;
          }
  • 35. Load Balancing: Periodic balancing
      • In the timer tick, if the current time is after rq->next_balance, trigger SCHED_SOFTIRQ.
      • The current processor starts from the lowest-level scheduling domain and walks up the domain hierarchy to decide whether rebalancing is needed:
          – Current time > sd->last_balance + interval, where
                interval = sd->balance_interval;
                if (idle != CPU_IDLE)
                    interval *= sd->busy_factor;
          – The current domain is unbalanced.
          – If rebalancing is needed, pull tasks from the busiest runqueue to the current runqueue.
      • After one round of periodic balancing, rq->next_balance is updated to current time + highest-level interval.
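The timing condition on this slide can be sketched as follows. Since `balance_interval` is in milliseconds and `last_balance` in jiffies, a conversion is needed; the tick rate `HZ_SKETCH`, the helper `ms_to_jiffies()` and `struct sd_sketch` are assumptions made for this example, and the comparison is written to survive jiffies wrap-around.

```c
#include <assert.h>

enum cpu_idle_type { CPU_IDLE, CPU_NOT_IDLE };

/* Hypothetical reduction of sched_domain to the timing fields. */
struct sd_sketch {
    unsigned long last_balance;      /* jiffies */
    unsigned int  balance_interval;  /* ms */
    unsigned int  busy_factor;
};

#define HZ_SKETCH 100  /* assumed tick rate: 1 jiffy = 10 ms */

/* Round up so that a non-zero interval never becomes 0 jiffies. */
static unsigned long ms_to_jiffies(unsigned long ms)
{
    return (ms * HZ_SKETCH + 999) / 1000;
}

/* Returns 1 when `now` is after last_balance + interval. */
static int balance_due(const struct sd_sketch *sd, enum cpu_idle_type idle,
                       unsigned long now)
{
    unsigned long interval = sd->balance_interval;

    assert(sd->busy_factor > 0);
    if (idle != CPU_IDLE)
        interval *= sd->busy_factor;  /* busy CPUs balance less often */
    interval = ms_to_jiffies(interval);
    if (interval < 1)
        interval = 1;

    /* Wrap-safe form of: now > last_balance + interval */
    return (long)(sd->last_balance + interval - now) < 0;
}
```

With `balance_interval = 1` and `busy_factor = 64`, an idle CPU rebalances this domain about every jiffy, while a busy CPU waits 64 ms (7 jiffies at the assumed tick rate).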
  • 36. Other Methods to Keep Balance
      • Exec balancing (SD_BALANCE_EXEC)
          – Where to put a new task
      • Fork balancing (SD_BALANCE_FORK)
          – Where to put a newly spawned thread
      • Wake balancing (SD_BALANCE_WAKE)
          – Where to put the wakee thread
      • ILB balancing
  • 37. Exec balancing
      • Search for the idlest group, from the highest-level scheduling domain down to the lowest-level domain.
          – The idlest group is the one with the minimum avg_load
          – Meet
      • Search for the idlest CPU in the idlest group.
          – The idlest CPU is the one with the minimum avg_load in the idlest group
      • Pack the task into a work item and add it to the &per_cpu(cpu_stopper, cpu) list.
      • Wake up the stopper->thread, which is running on the idlest CPU.
  • 38. Fork Balancing
      • In do_fork, select the idlest CPU and insert the new thread into that CPU's runqueue.
  • 39. Wake Balancing
      • The waker is currently running on CPU X; the wakee last ran on CPU Y.
      • If this_cpu_load + wakee_weight <= prev_cpu_load, the target CPU is close to X; else it is close to Y.
      • From the last-level cache domain, choose an idle CPU. If there is no idle CPU, choose X or Y.
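The two-step choice on this slide can be sketched as below. Both functions are illustrative only: the load values, the CPU numbering and the representation of the last-level cache domain as parallel arrays are assumptions, and picking the first idle CPU found is a simplification.

```c
#include <assert.h>
#include <stddef.h>

/* Step 1: decide whether the wakee should run near the waker's CPU (X)
 * or near the CPU it last ran on (Y). */
static int wake_affine_target(unsigned long this_cpu_load,
                              unsigned long prev_cpu_load,
                              unsigned long wakee_weight,
                              int cpu_x, int cpu_y)
{
    return (this_cpu_load + wakee_weight <= prev_cpu_load) ? cpu_x : cpu_y;
}

/* Step 2: among the CPUs sharing the last-level cache with the target,
 * prefer an idle one; otherwise fall back to the target itself. */
static int select_idle_in_llc(int target, const int *llc_cpus,
                              const int *is_idle, size_t n)
{
    assert(llc_cpus != NULL && is_idle != NULL);
    for (size_t i = 0; i < n; i++)
        if (is_idle[i])
            return llc_cpus[i];
    return target;
}
```

Staying inside the last-level cache domain keeps the wakee close to data that either the waker or its own previous run left in the shared cache.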
  • 40. Idle Load Balance (1)
      • When one of the busy CPUs notices that idle rebalancing may be needed, it kicks the idle load balancer, which then does idle load balancing for all the idle CPUs. Conditions:
          – Now >= nohz.next_balance
          – Number of running tasks > 2
          – nohz.nr_cpus is nonzero
  • 41. Idle Load Balance (2)
      • Routine
          – Find an ilber (idle load balancer) and send an IPI_RESCHEDULE IPI to it.
          – After the ilber wakes up from the IPI:
              • Do idle balancing for itself.
              • Help the other idle processors do load balancing.
              • If it pulls tasks for another processor, send IPI_RESCHEDULE to that processor.
          – Update nohz.next_balance to the ilber's next_balance.