In the past decade, high-performance cluster computing platforms have been widely used to solve challenging and rigorous engineering problems in industry and scientific applications. Due to extremely high energy costs, reducing energy consumption has become a major
concern in designing economical and environmentally friendly cluster computing
infrastructures for many high-performance applications. The primary focus of this talk is to illustrate how to improve the energy efficiency of clusters and storage systems without significantly degrading performance. In this talk, we will first describe a general architecture
for building energy-efficient cluster computing platforms. Then, we will outline several energy-efficient scheduling algorithms designed for high-performance clusters and large-scale storage systems. Experimental results using both synthetic and real-world applications
show that energy dissipation in clusters can be reduced with only a marginal degradation of system performance.
Energy-efficient resource management for high-performance clusters
1. Energy Efficient Scheduling for High-Performance Clusters. Ziliang Zong, Texas State University; Adam Manzanares, Los Alamos National Lab; Xiao Qin, Auburn University
2. Where is Auburn University? Ph.D. '04, U. of Nebraska-Lincoln; 2004-07, New Mexico Tech; 2007-now, Auburn University
7. Investigators. Ziliang Zong, Ph.D., Assistant Professor, Texas State University; Adam Manzanares, Ph.D. Candidate, Los Alamos National Lab; Xiao Qin, Ph.D., Associate Professor, Auburn University. 2011/6/22
18. Motivational Example: an example of duplication. Gantt charts (not recoverable from the extracted text) compare three schedules of tasks T1-T4: Linear Schedule, time 39 s; No Duplication Schedule (NDS), time 32 s; Task Duplication Schedule (TDS), time 29 s.
19. Motivational Example (cont.): the same three schedules annotated with per-task (time, energy) pairs, assuming CPU_Energy = 6 W and Network_Energy = 1 W. Linear Schedule: time 39 s, energy 234 J. No Duplication Schedule (MCP): time 32 s, energy 242 J. Task Duplication Schedule (TDS): time 29 s, energy 284 J.
20. Motivational Example (cont.): the energy cost of duplicating T1 is 48 J on the CPU side and -6 J on the network side, 42 J in total. The performance benefit of duplicating T1 is 6 s, so the energy-performance tradeoff ratio is 42/6 = 7. With a threshold of 10, EAD does not duplicate T1 (42 J > 10 J) while PEBD does (ratio 7 <= 10). EAD: time 32 s, energy 242 J. PEBD: time 29 s, energy 284 J.
21. Basic Steps of Energy-Aware Scheduling Algorithm Implementation. Step 1: DAG Generation. Task description: task set {T1, T2, ..., T9, T10}; T1 is the entry task and T10 is the exit task; T2, T3, and T4 cannot start until T1 finishes; T5 and T6 cannot start until T2 finishes; T7 cannot start until both T3 and T4 finish; T8 cannot start until both T5 and T6 finish; T9 cannot start until both T6 and T7 finish; T10 cannot start until both T8 and T9 finish.
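As a minimal sketch (not the authors' implementation), the task description above can be encoded as a predecessor map, from which the DAG's precedence constraints follow directly:

```python
# Sketch of Step 1: the example task set as a predecessor map.
# A task may start only after all of its predecessors have finished.
predecessors = {
    "T1": [],             # entry task
    "T2": ["T1"],
    "T3": ["T1"],
    "T4": ["T1"],
    "T5": ["T2"],
    "T6": ["T2"],
    "T7": ["T3", "T4"],
    "T8": ["T5", "T6"],
    "T9": ["T6", "T7"],
    "T10": ["T8", "T9"],  # exit task
}

def ready_tasks(finished):
    """Tasks not yet finished whose predecessors have all finished."""
    return [t for t, preds in predecessors.items()
            if t not in finished and all(p in finished for p in preds)]

print(ready_tasks({"T1"}))  # ['T2', 'T3', 'T4']
```

For instance, once T1 finishes, exactly T2, T3, and T4 become ready, matching the description.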
22. Basic Steps of Energy-Aware Scheduling Algorithm Implementation. Step 2: Parameters Calculation. The parameters are: total execution time from the current task to the exit task (the level); earliest start time; earliest completion time; latest allowable start time; latest allowable completion time; favorite predecessor.
25. The EAD and PEBD Algorithms (flowchart). Both algorithms: generate the DAG of the given task sets; find all critical paths in the DAG; generate the scheduling queue by level in ascending order; select the unscheduled task with the lowest level as the starting task; and, for each task on the same critical path as the starting task, check whether it is already scheduled and whether duplicating it would save time. Tasks on the same critical path are allocated to the same processor, and the walk stops once the entry task is met. The two algorithms differ only in the duplication test: EAD duplicates a task if the energy increase is at most the threshold, while PEBD computes the ratio energy increase / time decrease and duplicates if the ratio is at most the threshold.
29. Impact of CPU Power Dissipation. Impact of CPU types: 19.4% vs. 3.7% energy savings. Figures: energy consumption for different processors (Gaussian, CCR = 0.4) and for different processors (FFT, CCR = 0.4).
30. Impact of Interconnect Power Dissipation. Impact of interconnection types: 16.7% and 13.3% (Myrinet) vs. 5% and 3.1% (Infiniband). Figures: energy consumption (Robot Control, Myrinet) and energy consumption (Robot Control, Infiniband).
31. Parallelism Degrees. Impact of application parallelism: 17% and 15.8% (Robot Control) vs. 6.9% and 5.4% (Sparse Matrix). Figures: energy consumption of Sparse Matrix (Myrinet) and of Robot Control (Myrinet).
High-performance computing platforms have been widely deployed for intensive data processing and data storage. Their impact can be found in almost every domain: financial services, scientific computing, bioinformatics, computational chemistry, and weather forecasting.
This slide shows a typical high-performance computing platform, built by Google in Oregon. There is no doubt that these platforms have significantly changed our lives, and we all benefit from the great services they provide. However, these giant machines consume a huge amount of energy.
This figure comes from the Environmental Protection Agency's report submitted to Congress last year. According to the report, the total power usage of servers and data centers in the United States was 61.4 billion kWh in 2006, more than double the energy used for the same purpose in 2000. Looking at the trend from 2000 to 2006, the energy consumed by servers and data centers rapidly increased from 28.2 billion kWh all the way up to 61.4 billion kWh.
Even worse, the EPA predicts that the power usage of servers and data centers will double again within 5 years if historical trends continue. Even if we follow the current efficiency trends, the power usage will exceed 100 billion kWh in 2011. This is a huge amount of energy.
However, most previous research focused primarily on the performance, security, and reliability of high-performance computing platforms; the energy consumption issue was largely ignored. The energy problem has now become so serious that I believe it is time for us to highlight energy-efficiency research for high-performance computing platforms.
In our architecture, we have four layers: the application layer, middleware layer, resource layer, and network layer. In each layer, we can incorporate energy-aware techniques. For example, in the application layer, we can reduce unnecessary hardware accesses when writing code. In the middleware layer, we can schedule parallel tasks in more energy-efficient ways. In the resource and network layers, we can perform energy-aware resource management.
This slide shows some typical hardware in the resource and network layers, such as CPUs, main boards, storage disks, network adapters, switches, and routers.
One thing I would like to emphasize here is that energy-oriented research should not sacrifice other important characteristics like performance, reliability, or security. Although there will be some tradeoff once we introduce energy-aware techniques, we do not want to see significant degradation in these other characteristics. In other words, we would like our research to remain compatible with existing techniques. My research mainly focuses on the tradeoff between performance and energy.
Before we talk about the algorithms, let's look at cluster systems first. In a cluster, we have a master node and slave nodes. The master node is responsible for scheduling tasks and allocating them to slave nodes for parallel execution. All slave nodes are connected by a high-speed interconnect, and they communicate with each other through message passing.
The parallel tasks running on clusters are represented using a Directed Acyclic Graph, or DAG for short. Usually, a DAG has one entry task and one or multiple exit tasks. The DAG shows the task number and the execution time of each task. It also shows the dependences and communication times among tasks. Explain a little bit…
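As a hedged sketch of this representation (the numbers here are illustrative, not the talk's benchmark data), a DAG can be stored as a node-weight map for execution times plus an edge-weight map for communication times:

```python
# Illustrative weighted DAG: node weight = execution time (s),
# edge weight = communication time (s) between dependent tasks.
# All numbers are made up for this sketch.
exec_time = {"T1": 8, "T2": 15, "T3": 10, "T4": 6}
comm_time = {  # (parent, child) -> message transfer time
    ("T1", "T2"): 6,
    ("T1", "T3"): 2,
    ("T2", "T4"): 4,
    ("T3", "T4"): 2,
}

# The successor list of each task follows from the edge set:
children = {}
for (u, v) in comm_time:
    children.setdefault(u, []).append(v)

print(children["T1"])  # ['T2', 'T3']
```

Here T1 is the entry task (no incoming edges) and T4 is the exit task (no outgoing edges); the scheduler reads both maps when computing schedule lengths.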
Weakness 1: existing approaches do not consider energy conservation in memory. Weakness 2: energy cannot be conserved even when the network interconnects are idle. In order to improve performance, we use a duplication strategy. This slide shows why duplication can improve performance. Here we have 4 tasks represented by the DAG on the left side. If we use linear scheduling, all four tasks will be allocated to 1 CPU and the execution time will be 39 s. However, we noticed that we can schedule task 2 on the 2nd CPU so that we do not need to wait for the completion of task 3. In that way, the total time is shortened to 32 s. We also noticed that 6 s are wasted on the 2nd CPU because task 2 has to wait for the message from task 1. If we duplicate task 1 on the 2nd CPU, we can further shorten the schedule length to 29 s. Obviously, duplication can improve performance.
However, if we calculate the energy, we will find that duplication may consume more energy. For example, if we set the power consumption of the CPU and network to 6 W and 1 W, the total energy consumption with duplication will be 42 J more than NDS and 50 J more than the linear schedule. That is mainly because task 1 is executed twice. Here I would like to mention that I will use NDS (MCP) to represent the no-duplication schedule and TDS to represent the task-duplication schedule. You will see them a lot in the simulation results.
So we have to consider the tradeoff between performance and energy consumption. We propose two algorithms for this tradeoff. One is called energy-aware duplication, or EAD for short. The other is called performance-energy balanced duplication, or PEBD for short. In EAD, we only calculate the energy cost of duplicating a task. For example, if we duplicate T1, we pay a 48 J energy cost on the CPU side because we have to execute T1 twice. At the same time, we save 6 J of energy on the network side because we do not need to send a message from T1 to T2. So the total cost will be 42 J. In PEBD, we also calculate the performance benefit. If we duplicate T1, we can shorten the schedule length by 6 s at most. So the ratio between energy and performance will be 7. If we set the duplication threshold to 10, EAD will not duplicate while PEBD will.
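The two duplication tests can be sketched as follows, using the numbers from the example (the function names and signatures are mine, not the authors'):

```python
# Hedged sketch of the EAD and PEBD duplication tests. In the example,
# duplicating T1 costs 48 J of extra CPU energy, saves 6 J of network
# energy, and shortens the schedule by 6 s.

def ead_duplicate(cpu_cost_j, net_saving_j, threshold_j):
    """EAD: duplicate only if the net energy increase is within the threshold."""
    return (cpu_cost_j - net_saving_j) <= threshold_j

def pebd_duplicate(cpu_cost_j, net_saving_j, time_saved_s, threshold):
    """PEBD: duplicate if the energy increase per second saved is within the threshold."""
    ratio = (cpu_cost_j - net_saving_j) / time_saved_s
    return ratio <= threshold

print(ead_duplicate(48, 6, 10))      # False: 42 J > 10 J, EAD says no
print(pebd_duplicate(48, 6, 6, 10))  # True: 42/6 = 7 <= 10, PEBD says yes
```

This reproduces the decision from the motivational example: with a threshold of 10, EAD refuses the duplication while PEBD accepts it.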
Now let's look at how to implement the algorithms using a concrete example. In Step 1, we generate the DAG based on the task description, which should be provided by users.
Next, we are going to calculate the important parameters based on equations 14-19 in Chapter 4. The level means…
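A minimal sketch of the level computation, taking "level" as the total execution time from a task to the exit task along its longest path (the execution times and edges below are made up for illustration):

```python
from functools import lru_cache

# Sketch of the "level" parameter: the longest execution-time path
# from a task down to the exit task. Times and edges are illustrative.
exec_time = {"T1": 8, "T2": 15, "T3": 10, "T4": 6}
children = {"T1": ["T2", "T3"], "T2": ["T4"], "T3": ["T4"], "T4": []}

@lru_cache(maxsize=None)
def level(task):
    succ = children[task]
    if not succ:                  # exit task: level is its own time
        return exec_time[task]
    return exec_time[task] + max(level(c) for c in succ)

print(level("T1"))  # 8 + max(15 + 6, 10 + 6) = 29
```

Sorting tasks by this value in ascending order yields the scheduling queue mentioned in the algorithm steps.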
Once we have these parameters, we can obtain the original task list by sorting the tasks by level in ascending order. We start from the first unscheduled task in the list, which is task 10, and follow the favorite predecessors back to the entry task. All tasks on this path form a critical path; here the first critical path is 10->9->7->3->1. These tasks are then marked as scheduled. In the next iteration, the algorithm picks the next unscheduled task as the start task and forms the second critical path, then the third one and the fourth one. The algorithm does not terminate until all tasks have been scheduled.
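The critical-path walk can be sketched as below; the favorite-predecessor map here is only the fragment needed to reproduce the example's first path (10 -> 9 -> 7 -> 3 -> 1), not a full schedule:

```python
# Sketch of critical-path formation: from a start task, follow the
# favorite predecessor back to the entry task. Only the entries for
# the example's first critical path are shown; None marks the entry task.
favorite_pred = {10: 9, 9: 7, 7: 3, 3: 1, 1: None}

def critical_path(start):
    path, t = [], start
    while t is not None:
        path.append(t)
        t = favorite_pred[t]
    return path

print(critical_path(10))  # [10, 9, 7, 3, 1]
```

After this path is formed, its tasks are marked as scheduled, and the next iteration starts from the next unscheduled task in the level-sorted list.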
The algorithms also have to make the duplication decision. Explain…
This diagram summarizes the steps we just talked about, so I will skip it.
Now we are going to discuss the simulation results. We implemented our own simulator in C under Linux. The CPU power-consumption parameters come from xbitlabs. We simulate 4 different CPUs; 3 of them are AMD and one is Intel.
This slide shows the structure of two small task sets. The left one is the Fast Fourier Transform and the right one is Gaussian Elimination.
The slide shows the DAG structure of two real-world applications. The left one is Robot Control and the right one is Sparse Matrix Solver.
This slide shows the impact of CPU types. Recall that I simulate 4 different CPUs, represented in 4 different colors. We found that the CPU in blue can save more energy compared with the other 3 CPUs. For example, we can save 19.4% energy using the blue CPU, while we can save only 3.7% for the purple CPU. The reason behind this is that these 4 CPUs have different gaps between CPU_busy and CPU_idle power. This table summarizes the difference: the gap for the blue CPU is 89 W, but the gap for the purple CPU is only 18 W. So our observation is…
This slide shows the impact of interconnects. The left one shows the simulation results for Myrinet and the right one for Infiniband. We can save 16.7% and 13.3% energy when CCR is 0.1 and 0.5 respectively using Myrinet. However, the numbers drop to 5% and 3.1% for Infiniband. The only difference between these two simulation sets is the network power-consumption rate: Myrinet is 33.6 W and Infiniband is 65 W. So our observation is that…
We also observed the impact of application parallelism. The left figure shows the experimental results for Robot Control and the right one shows the results for Sparse Matrix Solver. We noticed that we can save 17% and 15.8% energy for Robot Control but only 6.9% and 5.4% for Sparse when CCR is the same. That is because the parallelism of Robot Control is lower than that of Sparse. So our observation is…
This slide shows our observation on the impact of CCR. Read...
This group of simulation results shows the impact on performance. The left one is for Gaussian and the right one is for Sparse. This table summarizes the overall performance degradation: EAD and PEBD degrade performance by 5.7% and 2.2% compared with TDS for Gaussian. For Sparse, the numbers are 2.92% and 2.02%. Our observation is…
For example, we designed a mapping matrix to represent the execution time of tasks on different processors. As you can see, for the same task T1, the execution times are 6.7, 3.9, and 2.0 respectively. If a task cannot be executed on a processor, we put an infinity sign.
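A sketch of such a mapping matrix is shown below. T1's row uses the times quoted above; the second row and the helper function are hypothetical additions for illustration:

```python
import math

# Sketch of the task-to-processor mapping matrix: each row is a task,
# each column a processor, and each entry an execution time in seconds.
# math.inf marks a processor that cannot execute the task.
# T1's times come from the talk; T2's row is made up.
exec_matrix = {
    "T1": [6.7, 3.9, 2.0],
    "T2": [5.0, math.inf, 3.1],  # T2 cannot run on the second processor
}

def best_processor(task):
    """Index of the processor with the lowest execution time for the task."""
    times = exec_matrix[task]
    return min(range(len(times)), key=lambda p: times[p])

print(best_processor("T1"))  # 2: the third processor runs T1 fastest
```

Because infeasible entries are infinite, they can never be selected as the minimum, which keeps the scheduling logic uniform in heterogeneous environments.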
We compared our HEADUS algorithm with 4 other algorithms and found that HEADUS obtains the best overall energy savings in all 4 different environments.
We also observed that HEADUS can save more energy under environments 2 and 4.