This presentation is about system-wide energy optimization for multiple DVS components and real-time tasks. It describes a realistic energy model for embedded systems and presents a method to optimize energy consumption. It was presented at ECRTS 2010.
System-wide Energy Optimization for Multiple DVS Components and Real-time Tasks
1. System-wide Energy Optimization for Multiple DVS Components and Real-time Tasks. Heechul Yun, Po-Liang Wu, Anshu Arya, Tarek Abdelzaher, Cheolgi Kim, and Lui Sha
2. DVS in Real-time Systems. The goal: minimize energy consumption by adjusting frequency and voltage while still meeting the deadline. Most schemes consider the CPU only and assume execution time depends on CPU frequency. But memory and bus are also important: they affect execution time (e.g., a memory-intensive application slows down if the memory or bus is slow), consume considerable energy (the same order of magnitude as the CPU), and are DVS-capable in many recent embedded processors.
3. Motivation. Memxfer5b: memory benchmark program. At half the CPU clock, execution time increased by only 3% and energy was reduced by 30%.
4. Motivation. Dhrystone: CPU benchmark program. At half the memory clock, execution time increased by only 0.05% and energy was reduced by 10%.
5. Contents. Motivation. Energy Model: considers CPU, bus, and memory and task characteristics; evaluation (model validation). Energy Optimization of Real-time Tasks: static multi-DVS problem and solution; evaluation. Conclusion.
6. Task Model. [Figure: power versus time for a task, split into a computation block and a memory fetch (cache stall) block.] Task = Computation + Memory fetch
7. Task Model (2). [Figure: power versus time plots of blocks C and M, with panels labeled "Lower CPU freq" and "Lower MEM freq".] C: computation. M: off-chip memory fetch (cache-stall cycles).
8. Task Model (3). Execution time of a task: e = C/fc + M/fm, where C: CPU cycles of a given task; M: memory cycles of a given task; fc: CPU clock frequency; fm: memory clock frequency.
9. Power Model. Power of a component (e.g., the CPU): power = k·f·V^2 + R, where k: capacitance constant; f: frequency of the component; V: supply voltage; R: leakage power. Different k for different modes: kactive: active-mode capacitance constant; kstandby: standby-mode capacitance constant.
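10. Energy Model. [Figure: power versus time over one period P, showing the pure computation block, the memory fetch (cache stall) block, and the idle block; e is the execution time.] Total system energy is the sum of the energies of these three blocks.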
11. Pure Computation Block. [Figure: pure computation block highlighted; CPU active, bus and memory standby, on top of system static power.] kca: capacitance constant for the active CPU; kbs: capacitance constant for the standby bus; kms: capacitance constant for the standby memory; R: system-wide static power consumption.
12. Memory Fetch Block. [Figure: memory fetch block highlighted; CPU standby, bus and memory active, on top of system static power.] kcs: capacitance constant for the standby CPU; kba: capacitance constant for the active bus; kma: capacitance constant for the active memory.
13. Idle Block. [Figure: idle block highlighted; CPU, bus, and memory idle.] I: idle-mode power consumption. e: execution time (C/fc + M/fm).
14. Energy Model Summary. [Figure: power versus time over period P with three blocks: pure execution block (CPU active; bus and memory standby), memory fetch block (CPU standby; bus and memory active), and idle block (CPU, bus, and memory idle), on top of system static power; energies labeled Ecpu, Emem, and Eidle.] System-wide energy model: considers CPU, bus, and memory power consumption; considers active, standby, and idle modes; other components are assumed to be static (included in R).
15. Energy Equation. [Equation with three terms: CPU block, memory block, idle block.] System-wide energy consumption of a task during period P.
16. Evaluation Platform. [Figure: board block diagram with these labeled parts: power supply, multimeter, STMP3650 SoC, ARM926 (8K-I, 8K-D), PSRAM (256KB), system bus, external peripherals (flash, LCD, external DRAM, …).]
17. Evaluation Platform (2). ARM9-based SoC. CPU: up to 200 MHz; bus: up to 100 MHz. CPU and bus are synchronous (BUS = CPU/N). Memory (PSRAM) frequency equals the system bus frequency (fb = fm). CPU, bus, and memory all share a common voltage Vdd: 1.504V ~ 1.804V (0.32V step). Energy equation terms: V, the shared voltage for CPU, bus, and memory; a combined active bus-and-memory constant; a combined standby bus-and-memory constant.
18. Validation Methodology. 4 synthetic programs with different cache stall ratios (0%, 10%, 25%, 55%); 8 clock configurations (fc, fm) for each program. Performed a nonlinear least-squares analysis of all 32 data points against the energy equation.
19. Energy Model Fitting. Coefficient of determination (R^2) is 99.97% (100% is a perfect fit).
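As a reminder of what the coefficient of determination measures, here is a minimal sketch in Python. The function name `r_squared` and its inputs are illustrative assumptions; the slides do not show how the authors computed the value.

```python
def r_squared(measured, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot.

    `measured` holds observed energies and `predicted` holds the values the
    fitted energy equation gives for the same configurations (hypothetical
    inputs; the 32 measured data points are not listed in the slides).
    """
    mean = sum(measured) / len(measured)
    ss_res = sum((m - p) ** 2 for m, p in zip(measured, predicted))
    ss_tot = sum((m - mean) ** 2 for m in measured)
    return 1.0 - ss_res / ss_tot
```

A value of 1.0 means the model reproduces every measurement exactly; a model that always predicts the mean of the measurements scores 0.0.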
20. Energy Equation for Our Platform. Obtained coefficients in the energy equation.
21. Contents. Motivation. Energy Model: considers CPU, bus, and memory and task characteristics; evaluation. Energy Optimization of Real-time Tasks: static multi-DVS problem and optimal solution; evaluation. Conclusion.
22. Static Multi-DVS Problem. Given a set of periodic real-time tasks (T1, …, Tn), where each task invocation requires at most Ci CPU cycles and at most Mi memory cycles in the worst case, find the energy-optimal static frequencies for multiple DVS-capable components (CPU, bus, and memory).
23. Problem Formulation. Minimize [objective] subject to [constraint], where H: hyperperiod; ei: execution time of task i; Ecomp,i: computation block energy of task i; Emem,i: cache stall block energy of task i; Eidle: idle block energy.
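The formulas on this slide did not survive in the transcript. One plausible reading, built only from the symbols defined on the slide, is sketched below. It is an assumption, not the authors' exact formulation: P_i (the period of task i) is a symbol introduced here, and the constraint assumes that schedulability reduces to total utilization not exceeding 1.

```latex
\begin{aligned}
\min_{f_c,\, f_m}\quad & \sum_{i=1}^{n} \frac{H}{P_i}\left(E_{comp,i} + E_{mem,i}\right) + E_{idle} \\
\text{subject to}\quad & \sum_{i=1}^{n} \frac{H}{P_i}\, e_i \le H, \\
\text{where}\quad & e_i = \frac{C_i}{f_c} + \frac{M_i}{f_m}
\end{aligned}
```

Here H/P_i is the number of invocations of task i in one hyperperiod, so the objective sums the energy of every invocation plus the idle energy.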
24. Optimal Solution. Intuitive procedure: find an unconstrained minimum over fc and fm (fb = fm), then check boundary conditions due to system-specific constraints (e.g., minimum and maximum clock range). Details are in the paper.
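The slide's procedure is analytical and its details are in the paper. As a stand-in, the sketch below uses a brute-force grid search over candidate (fc, fm) pairs and keeps the cheapest pair that meets the deadline. Everything numeric in it is an assumption made for illustration: the coefficients are not the fitted values, the voltage is held fixed, the frequency grids are arbitrary, and the platform's synchronous-divider constraint (BUS = CPU/N) is ignored. Only the task set (CH = 140*10^6, MH = 30*10^6, H = 3 s) comes from the energy plot slide.

```python
from itertools import product

# Hypothetical coefficients for illustration only; these are NOT the
# fitted values from the slides.
V = 1.8                        # shared supply voltage (V), held fixed here
K_CA, K_CS = 1.0e-9, 0.3e-9    # active / standby CPU constants
K_MA, K_MS = 2.0e-9, 0.5e-9    # active / standby combined bus+memory constants
R_STATIC = 0.05                # system-wide static power (W)
P_IDLE = 0.02                  # idle-mode power (W)


def hyperperiod_energy(c_h, m_h, fc, fm, h):
    """Energy (J) over one hyperperiod, or None if the deadline is missed."""
    t_comp, t_mem = c_h / fc, m_h / fm
    if t_comp + t_mem > h:
        return None
    p_comp = (K_CA * fc + K_MS * fm) * V**2 + R_STATIC  # CPU active
    p_mem = (K_CS * fc + K_MA * fm) * V**2 + R_STATIC   # CPU standby
    return t_comp * p_comp + t_mem * p_mem + (h - t_comp - t_mem) * P_IDLE


def best_static_frequencies(c_h, m_h, h, fc_grid, fm_grid):
    """Return (energy, fc, fm) with the lowest energy among feasible pairs."""
    best = None
    for fc, fm in product(fc_grid, fm_grid):
        energy = hyperperiod_energy(c_h, m_h, fc, fm, h)
        if energy is not None and (best is None or energy < best[0]):
            best = (energy, fc, fm)
    return best


# Arbitrary candidate grids (Hz), bounded by the platform maxima on the slides.
FC_GRID = [40e6, 80e6, 120e6, 160e6, 200e6]
FM_GRID = [20e6, 40e6, 60e6, 80e6, 100e6]
result = best_static_frequencies(140e6, 30e6, 3.0, FC_GRID, FM_GRID)
```

A grid search is slower than the closed-form approach but makes the deadline boundary explicit: any pair with CH/fc + MH/fm > H is simply discarded.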
25. Energy Plot. [Figure: energy as a function of fc (MHz) and fm (MHz), with the deadline boundary marked; blue: less energy, red: more energy.] Task set: CH = 140*10^6, MH = 30*10^6, H = 3 s
26. Evaluation. Compare the following schemes. MAX: CPU and memory are both set to maximum. CPU-only static DVS: memory frequency is set to maximum. Baseline static multi-DVS: CPU and memory frequencies change proportionally. Optimal static multi-DVS: the proposed scheme. Optimal dynamic multi-DVS: can change frequencies at each task schedule; brute-force search among all possible combinations. Simulation setup: uses the energy equation obtained from measurements on our real hardware platform.
27. Energy vs Utilization. [Figure: normalized average power consumption versus utilization.] Task set cache stall ratio (MH/(CH+MH)): 0.3
28. Energy vs Cache Stall Ratio. [Figure: normalized average power consumption versus cache stall ratio.] Task set utilization ratio (eH/H): 0.5
29. Effect of Diversity of Cache Stall Ratio. [Figure: normalized energy consumption versus diversity, from homogeneous to diverse.] Task set cache stall ratio = 0.45; task set utilization ratio (eH/H): 0.5
30. Conclusion. Energy model: considers multiple DVS-capable components and task characteristics; validated on a real hardware platform. Static multi-DVS problem: assigns energy-optimal static frequencies to multiple DVS components for periodic real-time tasks. The optimal solution (static multi-DVS scheme) saves more energy than CPU-only DVS.
33. CPU-only DVS. [Figure: energy (mJ) versus fc (MHz), with the valid range (~200 MHz) marked.] Not effective in the allowed range. (*) Based on the energy equation for our h/w platform; the memory clock was set to max.
34. Power Distribution. [Figure: two power breakdowns at (cpu, bus) = (80, 80 MHz), one for cache stall ratio = 55% and one for cache stall ratio = 10%.] (*) Based on the energy equation for our h/w platform. E = Ecpu + Emem + Estatic
35. Active and Idle. [Figure: two plots of energy (mJ), one versus fc (MHz) and one versus fm (MHz).] (*) Actual measurement result.
Editor's notes
DVS has been studied extensively in real-time systems. The goal is to minimize energy consumption while still meeting task deadlines. Most DVS schemes consider only the CPU and assume that execution time depends on CPU frequency. In reality, other components such as memory and bus are similarly important: they affect both execution time and energy consumption. Moreover, unlike desktop processors, many embedded processors let us control the bus and memory clocks as well as the CPU clock.
Here is one real example. We ran a memory-intensive benchmark on our hardware platform and measured the power consumption of the entire system. When we halve the CPU clock (click), the execution time does not change much (click): it increases by only 3% instead of doubling. This is because the task spends most of its time fetching data from memory, which is independent of the CPU clock speed. So, by reducing the CPU clock, we save 30% of the energy with only a 3% increase in execution time.
In a second example, we ran a CPU benchmark program called Dhrystone. Of course, if we halve the CPU clock, the execution time doubles. However, if we halve the memory clock, the execution time hardly changes, because this program rarely fetches data from memory, yet we save 10% of the energy. The amount of energy saved is smaller than in the previous experiment, but it is still significant. These experiments show that CPU-only DVS is not enough, and they motivated us to develop a realistic model that can explain these behaviors.
Mathematically, the execution time of a task is defined by this equation. Here C is the number of CPU cycles, M is the number of memory cycles, fc is the CPU clock frequency, and fm is the memory clock frequency. The execution time e is then C over fc plus M over fm.
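The execution-time formula can be checked with a few lines of Python. The cycle counts below are hypothetical, chosen only to show the qualitative effect from the motivation slides: for a memory-bound task, halving the CPU clock barely changes the execution time.

```python
def execution_time(C, M, fc, fm):
    """Execution time in seconds: e = C/fc + M/fm.

    C, M are CPU and memory cycle counts; fc, fm are clock frequencies in Hz.
    """
    return C / fc + M / fm


# Hypothetical memory-bound task (not a measured workload from the slides).
C, M = 5e6, 95e6
base = execution_time(C, M, fc=200e6, fm=100e6)      # CPU at full speed
half_cpu = execution_time(C, M, fc=100e6, fm=100e6)  # CPU clock halved
slowdown = half_cpu / base - 1.0                     # relative increase
```

Because the M/fm term dominates for this task, the relative slowdown from halving fc stays below 3%, even though the C/fc term itself doubles.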
This is the standard power equation. Here k is the capacitance constant of the component, f is its frequency, V is its voltage, and R is its leakage power. That is, the power consumption equals k f V squared plus R. It is important to understand that k differs between operating modes. For example, a busy CPU consumes much more power than a CPU that is doing nothing, even when the operating frequency is the same. This is because low-power hardware design techniques such as clock gating reduce the number of active parts of the component, which lowers its aggregate capacitance constant.
Now, let's consider the energy consumption of a task over a given period P. Task execution can be divided into two blocks: pure computation and cache stall, which is the time spent fetching data from off-chip memory to fill a cache line of the processor core. Although memory fetch operations are scattered throughout the execution, we aggregate them into a single block. This is accurate for in-order processors, where the processor must wait for the data whenever a cache miss occurs, so CPU execution and memory fetch do not overlap. It is not strictly true for out-of-order processors, where some overlap occurs, but the overlap is relatively small compared to the long memory fetch time. When the task completes, the CPU, bus, and memory are all idle. The energy consumption can then be expressed as the sum of the energy consumption of these three blocks.
Now, let's look at the first block, the pure computation block. As I said, we have three major components to consider: CPU, memory, and bus. Each component is a single term in this equation. The CPU … The bus and memory … It is important to note that the components are in different modes of operation. In this block, the CPU is actively executing instructions without any delay, but the bus and memory are not doing anything. When a component is in a different mode of operation, its capacitance constant can be quite different even if its operating frequency remains the same, because of the various power-saving techniques, such as clock gating, used in recent hardware designs. Therefore we use different capacitance constants for active and standby modes. In this equation, kca is the capacitance constant for the active CPU, kbs for the standby bus, and kms for the standby memory. R represents the sum of all static power components in the system. The execution time of this portion of the task is C over fc. That completes the description of this block.
The next block is the cache stall block. The difference from the pure computation block is the mode of operation of each component. In this block, the CPU does nothing but wait while data is fetched from memory. Therefore the CPU is in standby while the bus and memory are active. Again, we use a different capacitance constant for each component in each mode of operation. Here, kcs is for the standby CPU, kba for the active bus, and kma for the active memory. The execution time of this portion is M over fm.
The final block is the idle block. Since the task has finished, all components (CPU, bus, memory) are in idle mode. We assume the system has a special mode that saves power more aggressively, as found in many recent embedded processors. Therefore, we use a separate term I to represent the power consumption in that special mode. The duration of this block is the period minus the execution time of the task.
And this is the equation we derived, which describes the energy consumption of the entire system running a periodic task.
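The equation itself is not preserved in these notes, so the following Python sketch reassembles it from the three blocks described above. It rests on stated assumptions: a single shared voltage V and fb = fm (as on the evaluation platform), static power R present in both execution blocks, and I taken as the total power in idle mode. The function name and the dictionary layout for the constants are illustrative, not the authors'.

```python
def task_energy(C, M, fc, fm, period, k, V, R, I):
    """Energy (J) of one periodic task over one period, three-block model.

    k maps mode names to capacitance constants:
      'ca'/'cs' active/standby CPU, 'ba'/'bs' active/standby bus,
      'ma'/'ms' active/standby memory.
    """
    fb = fm                  # assumption: bus clock equals memory clock
    t_comp = C / fc          # pure computation block duration
    t_mem = M / fm           # memory fetch (cache stall) block duration
    e = t_comp + t_mem
    if e > period:
        raise ValueError("task does not finish within its period")
    v2 = V ** 2
    # Pure computation: CPU active, bus and memory standby.
    p_comp = (k['ca'] * fc + k['bs'] * fb + k['ms'] * fm) * v2 + R
    # Memory fetch: CPU standby, bus and memory active.
    p_mem = (k['cs'] * fc + k['ba'] * fb + k['ma'] * fm) * v2 + R
    # Idle: the whole system sits in its low-power mode.
    return t_comp * p_comp + t_mem * p_mem + (period - e) * I
```

Setting every capacitance constant to zero reduces the result to R·e + I·(P − e), which is a quick sanity check on the block durations.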
On this hardware platform, we used 4 synthetic programs with different cache stall ratios (0-55%), and for each program we used 8 different clock configurations and measured the energy consumption. After measuring a total of 32 data points, we performed non-linear-…