Integrating CPU idle and frequency scaling with the Linux scheduler
1. Wed 5 March, 11:15am, Daniel Lezcano, Mike Turquette
LCA14-306: CPUidle & CPUfreq integration with the scheduler
2. Introduction
● Power-aware scheduling discussion
● "Small task packing" patchset
− Some information is shared between CPUidle and the
scheduler
− https://lwn.net/Articles/520857/
● "Line in the sand" drawn by Ingo Molnar
− First integrate CPUidle and CPUfreq with the scheduler
− http://lwn.net/Articles/552885/
4. Idle time measurement
● From the scheduler:
− The duration for which the idle task is running
− Includes the interrupt processing time
● From CPUidle:
− The duration between interrupts
● CPUidle code runs with local interrupts disabled
● T(idle task) = Σ T(CPUidle) + Σ T(irqs)
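The relationship above can be sketched with made-up numbers: the scheduler's idle-task time decomposes into the CPUidle residencies plus the interrupt-processing time in between. All names and values below are illustrative, not kernel APIs.

```c
#include <assert.h>

/* Hypothetical trace of one idle period, in microseconds: each
 * CPUidle residency is terminated by an interrupt. */
static const unsigned int cpuidle_us[] = { 1200, 800, 2500 }; /* T(CPUidle) */
static const unsigned int irq_us[]     = { 15, 40, 10 };      /* T(irqs)    */

/* Idle time as CPUidle sees it: only the time between interrupts,
 * since CPUidle code runs with local interrupts disabled. */
static unsigned int cpuidle_total(void)
{
	unsigned int sum = 0;
	for (int i = 0; i < 3; i++)
		sum += cpuidle_us[i];
	return sum;
}

/* Idle time as the scheduler sees it: the whole run of the idle
 * task, which includes interrupt processing. */
static unsigned int idle_task_total(void)
{
	unsigned int sum = 0;
	for (int i = 0; i < 3; i++)
		sum += cpuidle_us[i] + irq_us[i];
	return sum;
}
```

The difference between the two totals is exactly the interrupt-handling time, which is why the two sub-systems currently disagree on "idle time".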
6. Idle time measurement unification
● What is the impact of returning to the
scheduler each time an interrupt occurs?
− The scheduler will choose the idle task again if there is
nothing to do
− Mainloop code is simplified
− Idle time is measured nearly identically by the
scheduler and CPUidle
− Probably a negative performance impact to fix
7. Load balance
● Deciding to balance a task when going idle
■ Uses avg_idle
● Does not take into account how long the CPU will sleep
■ The idle state should be selected beforehand
■ CPUidle should report the state the CPU will be in
● Balancing a task to the idlest CPU
■ Does not take into account the CPU's exit latency
■ CPUidle should report the state the CPU is in
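If CPUidle reported each idle CPU's current state, the balancer could weigh exit latency when choosing a target. A minimal sketch, under the assumption (not from the slides) that ties on load are broken by the cheapest wakeup:

```c
#include <assert.h>

/* Illustrative per-CPU view: load plus the exit latency (us) of the
 * idle state the CPU currently sits in, as CPUidle could report it. */
struct cpu_info {
	unsigned int load;
	unsigned int exit_latency_us;
};

/* Pick the target CPU for balancing: among the least-loaded CPUs,
 * prefer the one that is cheapest to wake up. */
static int pick_target(const struct cpu_info *cpus, int nr)
{
	int best = 0;
	for (int i = 1; i < nr; i++) {
		if (cpus[i].load < cpus[best].load ||
		    (cpus[i].load == cpus[best].load &&
		     cpus[i].exit_latency_us < cpus[best].exit_latency_us))
			best = i;
	}
	return best;
}
```

With three equally idle CPUs, this avoids waking the one parked in a deep state when a shallower one is available.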
8. CPUidle main function
● Reduce the distance between the scheduler
and the CPUidle framework
− Move the idle task to kernel/sched
− Move the cpuidle_idle function into the idle task code
− Integrate the idle mainloop and cpuidle_idle_call
● Allows access to the scheduler's private
structure definitions
9. Menu governor split
● The events can be classified into three
categories:
1. Predictable → timers
2. Repetitive → IOs
3. Random → key stroke, incoming packet
● Category 2 could be integrated into the
scheduler
10. IO latency tracking
● IOs are repetitive within a reasonable interval,
so they can be assumed predictable enough
11. IO latency tracking
● Measurement from the scheduler
− io_schedule
− io_schedule_timeout
● Track the IO latency per task
− Task migration moves the IO history along, unlike the
current governor
− Latency constraint for the task
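A per-task IO latency count could look like the sketch below: a statistic updated around io_schedule() that travels with the task. The structure name and the moving-average formula are assumptions for illustration, not the algorithm from the patches.

```c
#include <assert.h>

/* Illustrative per-task IO statistics; in the real patches this
 * would live alongside the task and be updated when the task
 * returns from io_schedule()/io_schedule_timeout(). */
struct task_io_stats {
	unsigned long avg_latency_us;	/* running average */
	unsigned long nr_samples;
};

/* Fold one measured IO latency into the task's history.  A simple
 * weighted average (new = 3/4 old + 1/4 sample) is assumed here. */
static void io_latency_update(struct task_io_stats *s, unsigned long sample_us)
{
	if (s->nr_samples == 0)
		s->avg_latency_us = sample_us;
	else
		s->avg_latency_us = (3 * s->avg_latency_us + sample_us) / 4;
	s->nr_samples++;
}
```

Because the history is attached to the task rather than to a CPU, migrating the task keeps its IO prediction intact, unlike the per-CPU statistics of the current menu governor.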
12. Combine information
● Move the predictable-event framework into the
scheduler
● Information combined between the scheduler
and the menu governor will be more accurate
− Idle balance decisions based on the idle state a CPU
is in or about to enter
− Load tracking from tasks for idle state exit latency
− CPU compute capacity and topology
− DVFS strategies for a boost on idle state exit
13. Scheduler + CPUidle
● The scheduler should have all the information
to tell CPUidle:
− How long the CPU will sleep
− What the latency constraint is
● CPUidle should use the information
provided by the scheduler to:
− Select an idle state
− Use the backend driver's idle callback
− No more heuristics
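With both numbers supplied by the scheduler, state selection reduces to a table lookup rather than a prediction. A sketch, with a made-up state table; the field names mirror the target-residency/exit-latency notions of the CPUidle framework:

```c
#include <assert.h>

/* Simplified CPUidle state table, ordered shallow to deep. */
struct idle_state {
	unsigned int exit_latency_us;
	unsigned int target_residency_us;
};

static const struct idle_state states[] = {
	{ 1,   1    },	/* WFI-like */
	{ 50,  200  },	/* retention */
	{ 500, 2000 },	/* power down */
};

/* When the predicted sleep length and the latency constraint both
 * come from the scheduler, no heuristic is needed: pick the deepest
 * state whose target residency fits the sleep and whose exit
 * latency fits the constraint. */
static int select_state(unsigned int sleep_us, unsigned int latency_req_us)
{
	int chosen = 0;
	for (int i = 1; i < 3; i++) {
		if (states[i].target_residency_us <= sleep_us &&
		    states[i].exit_latency_us <= latency_req_us)
			chosen = i;
	}
	return chosen;
}
```

A tight latency constraint keeps the CPU out of the power-down state even when a long sleep is predicted, and a short sleep keeps it in the shallow state regardless of the constraint.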
14. Status
● A lot of cleanups around the idle mainloop
● CPUidle main function moved inside the idle mainloop
− Code distance reduced; the scheduler and CPUidle
share structures
− Communication between the sub-systems made easier
15. Work in progress
● First iteration of IO latency tracking
implemented
− Validation in progress
● Simple governor for CPUidle
− Selects a state
● Idle time unification experimentation
16. CPUfreq + scheduler
The title is misleading … CPUfreq may completely
disappear in the future.
Goal is to initiate CPU dynamic voltage & frequency
scaling (DVFS) from the Linux scheduler
Nobody knows what this will look like, so please ask
questions and raise suggestions
19. CPUfreq today
• Polling workqueue
• E.g. ondemand
• Based on idle time / busyness
• No relation to decisions taken by the scheduler
• Task may be run at any time
• No relation to the idle task
• In fact, the task will not wake up during idle
20. Event-driven behavior
• Replace the polling loop with event-driven action
• The scheduler already takes actions which affect available
compute capacity
• Load balance
• Migrating tasks to and from CPUs of different compute capacity
• DVFS transitions are a natural fit
21. Lots of work ahead
• Method to initiate CPU DVFS transitions from the
scheduler
• Identify call sites to initiate those transitions
• Enqueue/dequeue task
• Load balance
• Idle entry/exit
• Aggressively scheduled deadline tasks
• Maybe others
• Define the interface between the scheduler & the DVFS
thingy
• Currently a power driver in Morten's RFC
• Remove the CPUfreq governor layer from the power driver completely?
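One of those call sites could be sketched as below: on enqueue, where the scheduler already updates the tracked load, request a higher frequency if the utilization no longer fits. Every name and number here is illustrative; none of them come from the actual patches.

```c
#include <assert.h>

/* Illustrative capacity state the scheduler could manage from its
 * existing call sites (enqueue/dequeue, load balance, idle entry). */
static int cur_capacity = 256;	/* current compute capacity */
#define MAX_CAPACITY 1024

/* Event-driven check at an enqueue-style call site: raise capacity
 * when the CPU's utilization exceeds what it can currently deliver.
 * Returns 1 if a DVFS transition was requested. */
static int capacity_check(int cpu_util)
{
	if (cpu_util > cur_capacity && cur_capacity < MAX_CAPACITY) {
		cur_capacity = cur_capacity * 2 > MAX_CAPACITY ?
			       MAX_CAPACITY : cur_capacity * 2;
		return 1;
	}
	return 0;	/* no polling loop needed: nothing to do */
}
```

The point of the event-driven model is visible here: the decision happens exactly when the load changes, instead of on a periodic sampling timer as in ondemand.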
22. Lots of work ahead, part 2
• Experiment with policy
• When and where to evaluate whether the frequency should be changed
• What metrics are important to the algorithm?
• DVFS versus race-to-idle
• Integrate with the power model
• Benchmark performance & power
• Performance regressions
• Does it save power?
• Make it work with non-CPUfreq things like PSCI and
ACPI for changing the CPU P-state
23. Morten's power-aware scheduling RFC
• https://lkml.org/lkml/2013/10/11/547
• Replaces the polling loop in the CPUfreq governor with
scheduler event-driven action
• CPUfreq machine drivers are re-used initially
• The CPUfreq governor becomes a shim layer over the power
driver
24. Nitty-gritty details
• The DVFS task is itself scheduled on a workqueue
• It might not run for some time after the scheduler determines that a
DVFS transition should happen
• Kworker threads are filtered out
• Prevents infinite reentrancy into the scheduler
• CPU capacity is not changed when enqueuing and dequeuing these
tasks
25. include/linux/sched/power.h

struct power_driver {
	/*
	 * Power driver calls may happen from scheduler context with irq
	 * disabled and rq locks held. This must be taken into account in
	 * the power driver.
	 */

	/* cpu already at max capacity? */
	int (*at_max_capacity)(int cpu);

	/* Increase cpu capacity hint */
	int (*go_faster)(int cpu, int hint);

	/* Decrease cpu capacity hint */
	int (*go_slower)(int cpu, int hint);

	/* Best cpu to wake up */
	int (*best_wake_cpu)(void);

	/* Scheduler call-back without rq lock held and with irq enabled */
	void (*late_callback)(int cpu);
};
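To make the interface concrete, here is a toy userspace driver filling in those operations. The struct layout matches the slide; the P-state backing store and the "toy_" implementations are invented for illustration and say nothing about how a real power driver behaves.

```c
#include <assert.h>

/* The power_driver interface from Morten's RFC (comments elided). */
struct power_driver {
	int (*at_max_capacity)(int cpu);
	int (*go_faster)(int cpu, int hint);
	int (*go_slower)(int cpu, int hint);
	int (*best_wake_cpu)(void);
	void (*late_callback)(int cpu);
};

/* Toy backing state: one P-state index per CPU, 0 = slowest. */
#define NR_PSTATES 3
static int pstate[2];

static int toy_at_max_capacity(int cpu)
{
	return pstate[cpu] == NR_PSTATES - 1;
}

static int toy_go_faster(int cpu, int hint)
{
	(void)hint;		/* toy driver ignores the hint */
	if (pstate[cpu] < NR_PSTATES - 1)
		pstate[cpu]++;
	return pstate[cpu];
}

static int toy_go_slower(int cpu, int hint)
{
	(void)hint;
	if (pstate[cpu] > 0)
		pstate[cpu]--;
	return pstate[cpu];
}

/* Prefer waking the CPU already running at the higher P-state. */
static int toy_best_wake_cpu(void)
{
	return pstate[0] >= pstate[1] ? 0 : 1;
}

/* Called with rq locks dropped and irqs enabled: a real driver
 * would do its sleepable work (e.g. an i2c regulator write) here. */
static void toy_late_callback(int cpu)
{
	(void)cpu;
}

static struct power_driver toy_driver = {
	.at_max_capacity = toy_at_max_capacity,
	.go_faster	 = toy_go_faster,
	.go_slower	 = toy_go_slower,
	.best_wake_cpu	 = toy_best_wake_cpu,
	.late_callback	 = toy_late_callback,
};
```

The split between the fast callbacks and late_callback reflects the header comment: most calls arrive from scheduler context with irqs disabled and rq locks held, so anything that might sleep has to be deferred to the late callback.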
26. Incremental changes on top
• https://github.com/mturquette/linux/commits/sched-cpufreq
• Replaced the workqueue method with a per-CPU kthread
• This allows removal of the kworker filter
• Please commence bikeshedding over the name of this kthread
• Uses the SCHED_FIFO policy for the task
• Will be run before the normal work (right?)
• These patches were just validated yesterday
• Bugs
• Holes in logic
• Misunderstandings
• Voided warranties
27. What's next?
• Gather more opinions on the power driver interface
• Is go_faster/go_slower the right approach?
• Spoiler alert: probably not.
• When else might we want to evaluate the CPU frequency?
• Idle entry/exit, as mentioned by Daniel
• Cluster-level considerations
• Sched domains
• Not just per-core
• Four Cortex-A9s with a single CPU clock
• Coordinate with the power model work
29. More about Linaro Connect: http://connect.linaro.org
More about Linaro: http://www.linaro.org/about/
More about Linaro engineering: http://www.linaro.org/engineering/
Linaro members: www.linaro.org/members