LCE12: big.LITTLE TC2 update

1
Update on big.LITTLE on TC2
Morten Rasmussen
Technology Researcher

2
Agenda
 big.LITTLE Software solutions overview
 ARM's Test Chip 2 overview
 Benchmarking Methodology and Use Cases
 IKS status update
 big.LITTLE MP status update

3
big.LITTLE overview
 Performance and power efficiency in one system:

Benchmark     Performance               Energy efficiency
              (Cortex-A15 vs Cortex-A7) (Cortex-A7 vs Cortex-A15)
Dhrystone     1.9x                      3.5x
FDCT          2.3x                      3.8x
IMDCT         3.0x                      3.0x
MemCopy L1    1.9x                      2.3x
MemCopy L2    1.9x                      3.4x

4
IKS solution – Basics
 In-Kernel Switcher (IKS):
 Targeted first generation big.LITTLE products.
[Diagram: Task 1, Task 2; kernel scheduler; IKS; logical CPU (?); Cortex-A7; Cortex-A15]

5
MP solution
[Diagram: Task 1, Task 2; kernel scheduler; (?); Cortex-A7; Cortex-A15]

6
ARM’s Test Chip 2 (TC#2): An Overview
 A Versatile Express core tile, publicly available
 Capabilities
 2 x A15 (r2p1) @ up to 1.2 GHz
 3 x A7 (r0p1) @ up to 1 GHz
 CCI/DMC/GIC/ADB (r0p0)
 DMA (PL330)
 2 GB external DDR2 memory @ 400 MHz
 64k internal SRAM
 CoreSight debug (including JTAG and ITM trace but no STM)
 No GPU
 cpufreq support: independent for each cluster, with limited voltage scaling
 cpuidle support: cluster power gating
[Figure: TC2]

7
Benchmarking Methodology
 Automated system for running user workloads on target device
 CSV config:
 Use case
 Scheduling model
 Number of cores to use
 Scaling governors
 Configurable:
 CCI
 ftrace
 streamline
 Choose workload
 Choose CPU mode: Cortex-A7, Cortex-A15, Migration (cluster or CPU), or MP
 Choose active cores in each cluster (TC2: 1-2 big, 1-3 LITTLE)
 Choose DVFS governor: interactive, performance, powersave, ondemand
 Extensible: parameterisation
 Results:
 Performance
 Power

8
IKS solution
 Targeted first generation big.LITTLE products.
[Diagram: Task 1, Task 2; kernel scheduler; IKS; logical CPU (?); Cortex-A7; Cortex-A15]

9
IKS: CPU Migration
 big.LITTLE extends DVFS
 DVFS algorithm monitors load on each CPU
 When load is low it can be handled on a LITTLE processor
 When load is high the context is transferred to a big processor
 The unused processor can be powered down
 When all processors in a cluster are inactive the cluster and its L2 cache can be powered down

11
IKS: OPP mapping to A7 / A15 on TC2
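 Virtual Frequency maps OPPs to big or LITTLE cores

Cluster  Virtual OPP  Physical OPP A7  Physical OPP A15  Voltage
A7       350000       350000           -                 V1
A7       400000       400000           -                 V1
A7       ...          ...              -                 V1
A7       800000       800000           -                 V1
A7       900000       900000           -                 V2
A7       1000000      1000000          -                 V3
A15      1200000      -                600000            V1
A15      1400000      -                700000            V1
A15      ...          -                ...               V1
A15      2000000      -                1000000           V1
A15      2200000      -                1100000           V2
A15      2400000      -                1200000           V3

On the A7 the virtual OPP equals the physical OPP; on the A15 the virtual OPP is twice the physical OPP.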
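The mapping in the table can be sketched as a small lookup function. This is only an illustration under stated assumptions, not the IKS driver code: the names `virt_to_phys`, `struct phys_opp` and `enum cluster` are hypothetical, the values are assumed to be in kHz (the usual cpufreq unit), and the range limits are simply the first and last rows of each half of the table.

```c
#include <stdbool.h>

/* Hypothetical types for illustration; not taken from the IKS sources. */
enum cluster { CLUSTER_A7, CLUSTER_A15 };

struct phys_opp {
    enum cluster cluster;   /* cluster that serves this virtual OPP */
    unsigned int freq_khz;  /* physical frequency, assumed to be in kHz */
};

/*
 * Map a virtual OPP to a physical one, following the TC2 table:
 *   350000..1000000  -> A7 at the same frequency
 *   1200000..2400000 -> A15 at half the virtual frequency
 * Returns false for values outside both ranges.
 */
bool virt_to_phys(unsigned int virt_khz, struct phys_opp *out)
{
    if (virt_khz >= 350000 && virt_khz <= 1000000) {
        out->cluster = CLUSTER_A7;
        out->freq_khz = virt_khz;
        return true;
    }
    if (virt_khz >= 1200000 && virt_khz <= 2400000) {
        out->cluster = CLUSTER_A15;
        out->freq_khz = virt_khz / 2;
        return true;
    }
    return false;
}
```

With this sketch, a cpufreq governor sees a single frequency range from 350000 to 2400000, and a request above the A7 range implies a switch to the A15.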

12
IKS: Results for Audio on TC2
 Power compared to executing the use case on A15
 IKS does not use A15s during Audio run
70% saving
TC2:
A15 up to 1.2 GHz
A7 up to 1 GHz
Better results expected on
representative silicon.

13
IKS: Results for BBench + Audio on TC2
 Performance is measured from BBench page loading times
 Results are normalised to the power and performance of the same use case run on A15 only
BBench page + Audio

15
IKS: Interactive governor on TC2
if (cpu_load >= go_hispeed_load) {
    ...
    new_freq = max_freq * cpu_load / 100;
    ...
} else {
    ...
    new_freq = hispeed_freq * cpu_load / 100;
    ...
}
 For A15 on TC2 with go_hispeed_load at 85% (default), this algorithm only uses the overdrive section of the A15
 Approach is to introduce a second point of inflection: highspeed2
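The frequency selection above can be written as a standalone function to try out numbers. This is a sketch of the two branches as shown on the slide only; the elided parts (`...`) are left out, the function name `next_freq` is hypothetical, and frequencies are assumed to be virtual OPP values in kHz.

```c
/*
 * Sketch of the two-branch frequency choice shown above.
 * cpu_load and go_hispeed_load are percentages (0..100).
 * The elided steps of the real governor are not modelled here.
 */
unsigned int next_freq(unsigned int cpu_load, unsigned int go_hispeed_load,
                       unsigned int max_freq, unsigned int hispeed_freq)
{
    /* Use 64-bit intermediates so freq * load cannot overflow. */
    if (cpu_load >= go_hispeed_load)
        return (unsigned int)((unsigned long long)max_freq * cpu_load / 100);

    return (unsigned int)((unsigned long long)hispeed_freq * cpu_load / 100);
}
```

For example, with `max_freq` = 2400000 and `go_hispeed_load` = 85, the lowest request from the first branch is 2400000 * 85 / 100 = 2040000, which is above the 2000000 entry, the last A15 row at voltage V1 in the TC2 OPP table. The `hispeed_freq` value used in any experiment is an assumption, since the slide does not state it.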

17
IKS: Results: BBench + Audio
 Power improves with no performance cost
BBench page + Audio

18
MP solution
[Diagram: Task 1, Task 2; kernel scheduler; (?); Cortex-A7; Cortex-A15]

19
MP solution – more details
 Scheduler modifications:
 Treat big and LITTLE CPUs as separate scheduling domains.
 Use PJT's load-tracking patches to track individual task load.
 Migrate tasks between the big and the LITTLE domains based on task load.
 Patch set available through Linaro.
[Diagram: LITTLE (L, L) and big (B, B) domains, each with its own load balance, connected by load-based task migration]
[Plot: task load against task state (executing, sleep), showing load decay]

20
MP: Experimental Implementation
 Scheduler modifications:
 Apply PJT's load-tracking patch set.
 Set up big and LITTLE sched_domains with no load balancing between them.
 select_task_rq_fair() checks task load history to select an appropriate target CPU for tasks waking up.
 Add a forced migration mechanism to push the currently running task to a big core, similar to the existing active load balancing mechanism.
 Periodically check (in run_rebalance_domains()) the current task on each LITTLE runqueue for tasks that need to be forced to migrate to a big core.
[Diagram: LITTLE (L, L) and big (B, B) domains, each with its own load_balance, connected by select_task_rq_fair() / forced migration]
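The load-based choice between the two domains can be illustrated with a small decision function. This is a minimal sketch and not the patch set's code: `pick_domain`, `enum cpu_domain` and the two thresholds are hypothetical names, and the use of separate up and down thresholds is an assumption made to avoid bouncing a task between domains.

```c
/* Hypothetical names for illustration; not taken from the MP patch set. */
enum cpu_domain { DOMAIN_LITTLE, DOMAIN_BIG };

/*
 * Choose the domain for a task from its tracked load.
 * task_load and both thresholds share the same scale (assumed 0..1023).
 * A task on LITTLE moves up once its load exceeds up_threshold;
 * a task on big moves down once its load drops below down_threshold;
 * otherwise it stays where it is.
 */
enum cpu_domain pick_domain(enum cpu_domain current_domain,
                            unsigned int task_load,
                            unsigned int up_threshold,
                            unsigned int down_threshold)
{
    if (current_domain == DOMAIN_LITTLE && task_load > up_threshold)
        return DOMAIN_BIG;

    if (current_domain == DOMAIN_BIG && task_load < down_threshold)
        return DOMAIN_LITTLE;

    return current_domain;
}
```

In the experimental implementation such a check would run both at wake-up, in select_task_rq_fair(), and periodically for the task currently running on a LITTLE CPU, where a positive result triggers the forced migration.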

21
MP: ARM TC2: Audio
 Workload: Audio (mp3 playback)
 Performance/Energy target:
 A7 energy
 Status:
 Audio-related tasks do not use the A15s, but the power consumption is still significantly higher than on A7 alone.
 MP not as power efficient as IKS yet
 Todo:
 Target spurious wake-ups on A15. All the extra power comes from the A15s, which shouldn't be used at all.
[Chart: Audio energy (0 to 100 scale) for A15, A7 2CPU, IKS and MP. Labelled values: A7 30.79%, MP 39.86%.]

22
MP: Audio workload analysis
 Where is the extra energy spent with MP?
 Need to look at why the A15s consume power when they are not needed.
[Chart: Audio energy breakdown for A7 and MP, split into A15 cluster and A7 cluster; y-axis Energy, 0 to 1.6]

hrtimer functions    cpu0  cpu1  cpu2  cpu3  cpu4
hrtimer_wakeup          2     2  1212   417   190
tick_sched_timer      404    58   483   507   779

WQ functions         cpu0  cpu1  cpu2  cpu3  cpu4
vmstat_update          30     2    27    25    28
cache_reap             15     2    14    13    14
phy_state_machine      31     0     0     0     0

Enter idle           cpu0  cpu1  cpu2  cpu3  cpu4
0                       6     2  2379   260   423
1                     801   807  8316  9373  9652

23
Scale invariant load
 Load accumulation rate does not scale with available compute capacity (frequency, big/LITTLE cpu)
 Currently, there is no link between cpufreq and the scheduler
 Tasks may be migrated away from a cpu at low frequency by the scheduler before cpufreq has increased the frequency to match the cpu load.
 Scaling the tracked load accumulation to match the current frequency mitigates this issue.
 Tasks cannot accumulate enough load at low frequency to trigger migration and must wait for cpufreq to react first.
[Diagram: load accumulation at Freq = x and at Freq = 2x]
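One way to picture the frequency scaling of load accumulation is to scale each new load contribution by the ratio of the current frequency to the maximum frequency. The sketch below assumes exactly that; the function name `scale_load_contrib` and the choice of the maximum frequency as the reference are assumptions, not details taken from the patches.

```c
#include <stdint.h>

/*
 * Scale a raw load contribution by curr_freq / max_freq, so that running
 * for a given time at half the maximum frequency accumulates half the load.
 * Frequencies are assumed to be in kHz. Returns delta unchanged if
 * max_freq_khz is zero.
 */
uint32_t scale_load_contrib(uint32_t delta, uint32_t curr_freq_khz,
                            uint32_t max_freq_khz)
{
    if (max_freq_khz == 0)
        return delta;

    /* 64-bit intermediate avoids overflow of delta * curr_freq_khz. */
    return (uint32_t)(((uint64_t)delta * curr_freq_khz) / max_freq_khz);
}
```

With this scaling, a contribution of 1024 at 500000 kHz on a CPU whose maximum is 1000000 kHz counts as 512, so a task running at a low frequency cannot reach a migration threshold before cpufreq has raised the frequency.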

24
Scale invariant load
[Plots: tracked load (0 to 1000) over time; panels: Original, Frequency invariant]

25
Load accumulation rate
 For some workloads tracked load saturates too fast and leads
to unnecessary task migrations.
 Extending the tracked load history reduces tracked load
variations due to sudden changes in the load characteristics.
 Increasing the y factor in the load expression decreases the
load accumulation and decay rates.
$$\text{load} = \frac{u_0 + u_1 \cdot y + u_2 \cdot y^2 + \dots + u_n \cdot y^n}{1024 + y + y^2 + \dots + y^n + 1}, \qquad y < 1,\ 0 \le u < 1024$$

[Plot: y = 0.9785; x-axis Time [ms], 1 to 151; y-axis 0 to 1]
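The effect of y on the length of the tracked history can be made concrete by counting how many periods it takes for the weight y^n of a past contribution to fall to one half. The sketch below does that by repeated multiplication; the function name `decay_half_life` is hypothetical, and treating one period as roughly one millisecond is an assumption.

```c
/*
 * Return the smallest n such that y^n <= 0.5, i.e. the number of periods
 * after which a past contribution counts for at most half its original
 * weight. Returns 0 if y is not strictly between 0 and 1.
 */
unsigned int decay_half_life(double y)
{
    double weight = 1.0;
    unsigned int n = 0;

    if (y <= 0.0 || y >= 1.0)
        return 0;

    while (weight > 0.5) {
        weight *= y;
        n++;
    }
    return n;
}
```

For y = 0.9785 this gives 32 periods; for y = 0.9844 and y = 0.9922 it gives 45 and 89 periods, so a larger y keeps history for longer and makes both accumulation and decay slower.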

26
 Increasing y leads to a more conservative tracked load
 Should lead to fewer up/down migrations
 Increases the up/down migration delay for tasks that need to be migrated.
[Plot: tracked load against Time [ms] for a task, with y = 0.9785, y = 0.9844 and y = 0.9922]

27
MP – Top Issues
 Spurious wakeups
 A15s are woken up by scheduler ticks (mainly)
 Workqueues
 Timers
 RCU
 cpu wakeup prioritisation
 Pick the cheapest target cpu
 Global balancing
 Spread load to A7s when A15s are overloaded
 Pack vs. spread
 Cluster aware cpufreq governors
