Power consumption has become a major concern for almost every digital system: from the smallest embedded devices to the biggest data centers, energy and power budgets constantly constrain the performance of the system. Moreover, the actual power consumption of these systems is strongly affected by their current “working regime” (e.g., from idle to heavy-load conditions, and every level in between), which depends on the guest applications they host, as well as on the external interactions these are subject to. It is therefore difficult to make accurate predictions of the power consumed by the whole system over time, when it is subject to constantly changing operating conditions: a self-aware and goal-oriented approach to resource allocation may improve the instantaneous performance of the system, but the definition of energy-saving policies remains non-trivial as long as the system is unable to learn from experience in real-world scenarios.
In this context, this thesis proposes a holistic power modeling framework that a wide range of energy- and power-constrained systems can use to profile their energy and power consumption. Starting from the preliminary experience developed on power consumption models for mobile devices during my M.Sc. thesis, I designed a general methodology that can be tailored to the actual system's features, extracting a specific power model able to describe and predict the future behavior of the observed entity. This methodology is meant to be provided in an “as-a-service” fashion: at first, the target system is instrumented to collect power metrics and workload statistics in its real usage context; then, the collected measurements are sent to a remote server, where the data is processed using well-known techniques (e.g., Principal Component Analysis, Markov Decision Processes, ARX models, etc.); finally, an accurate power model is built as a function of the metrics monitored on the instrumented system. The generalized approach has been validated in the context of power consumption models for multi-tenant virtualized infrastructures, outperforming results from the state of the art. Finally, the experience developed on power consumption models for server infrastructures led me to the design of a power-aware and QoS-aware orchestrator for multi-tenant systems. On the one hand, I propose a performance-aware power capping orchestrator for a virtualized environment, which aims at maximizing performance under a power cap. On the other hand, I bring the same concepts into a different approach to multi-tenancy, i.e., containerization, thus taking the first steps towards power-awareness for Docker container orchestration and laying the basis for further research work.
Full thesis: https://www.politesi.polimi.it/handle/10589/132112
[February 2017 - Ph.D. Final Dissertation] Enabling Power-awareness For Multi-tenant Systems
1. Enabling Power-Awareness
For Multi-Tenant Systems
Candidate: Matteo FERRONI
Advisor: Marco D. Santambrogio
Tutor: Donatella Sciuto
Ph.D. Cycle: XXIX
Ph.D. in Information Technology: Final Dissertation
Politecnico di Milano, February 17th, 2017
3. The battery of your smartphone does not last a day.
Credits: http://www.mobileworld.it/2016/01/07/smartphone-ricarica-camminata-62171/
4. A data center needs to deal with power grid limits.
Credits: https://resources.workable.com/systems-engineer-job-description
5. Context definition
Common features
(1) hardware heterogeneity
(2) software multi-tenancy
(3) input variability
Key facts:
• Energy budgets and power caps constrain the performance of the system
• The actual power consumption is affected by a plethora of different actors
(0) A bird's eye view
6. Problem definition and proposed approach
• Problem definition:
A. How much power is a system going to consume, given certain working conditions?
B. How to control a system to consume less power, still satisfying its requirements?
• Assumption: the system will behave as it did in the past
• High-level approach:
1. Observe the behavior of the system during its real working conditions
2. Build accurate models to describe and predict it
3. Use them to refine decisions and meet goals efficiently
Idea: learn from experience
(0) A bird's eye view
7. Pragmatic methodology
Data-driven power-awareness through a holistic approach
We start from raw data (power measurements, load traces, system stats, etc.)
We are not interested in the physical components of the system: it is a black box
We help users and systems to learn and predict their power needs
This should be done automatically throughout the whole lifetime of the system
(0) A bird's eye view
8. Outline
1. A first case study: power models for Android devices
2. Generalization: Model and Analysis of Resource Consumption (MARC)
3. Virtual guests monitoring: towards power-awareness for Xen
4. Modeling power consumption in multi-tenant virtualized systems
5. Maximizing performance under a power cap: a hybrid approach
6. Moving forward: containerization, challenges and opportunities
7. Conclusion and future work
9. The need for a model
• We need to observe and model the phenomenon
[Figure: a power model maps the energy budget (%) and the observed energy behavior over time into a predicted Time-To-Live (s) from now]
(1) A first case study: power models for Android devices
10. Model as-a-Service
• Requirements:
• no monitoring and modeling overheads on the power-constrained system itself
• adapt to different systems/users, as well as to changes over time
• Proposed solution: Model-as-a-Service
a. send raw traces to a remote server
b. compute power models
c. send back predictions and model parameters
(1) A first case study: power models for Android devices
11. Pragmatic approach
• Modeling approach: “divide et impera”
We observed a piecewise linear behavior and tried to attribute it to domain-specific features
[Figure: working regimes A, B and C, connected by actions on controllable variables and by uncontrollable exogenous inputs]
(1) A first case study: power models for Android devices
12. Prediction performance w.r.t. SoA approaches
• Baseline
• Android L and Battery Historian (early 2015)
• Makes use of power models to estimate TTLs
• Performance reported for different models
• SM - one model for the user behavior for the whole day
• HM - one model for the user behavior for every hour of the day
• DM - subset of HM, merging similar hours of the day
• I% - Improvements w.r.t. Android L (AL)
(1) A first case study: power models for Android devices
average error values are reported ± standard deviations
13. Outline
1. A first case study: power models for Android devices
2. Generalization: Model and Analysis of Resource Consumption (MARC)
3. Virtual guests monitoring: towards power-awareness for Xen
4. Modeling power consumption in multi-tenant virtualized systems
5. Maximizing performance under a power cap: a hybrid approach
6. Moving forward: containerization, challenges and opportunities
7. Conclusion and future work
14. A general methodology: the MARC approach
• MARC (Model and Analysis of Resource Consumption) is a REST platform that is able to build resource consumption models in an “as-a-service” fashion
• PHASE 1: Data Conditioning
• PHASE 2A: Signal Models; PHASE 2B: Markov Models; PHASE 2C: ARX Models
• PHASE 3: Integration
[Figure: traced battery level over time (32,000s to 40,000s) with its linear battery-discharge approximation]
[Figure: traced power and energy consumption over a 1200s run, with two linear approximations separated by a sudden slope change]
[Figure: the same power/energy trace with a linear approximation per working regime (IDLE, I/O, MEM, CPU)]
(2) Generalization: Model and Analysis of Resource Consumption (MARC)
15. A model for each configuration
• PHASE 2C: Autoregressive Models with Exogenous Variables (ARX)
• FOR EACH WORKING REGIME, a model is computed to characterize the process
[Figure: traced power and energy consumption over a 1200s run, with a linear approximation per working regime (IDLE, I/O, MEM, CPU)]
(2) Generalization: Model and Analysis of Resource Consumption (MARC)
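A minimal sketch of what such a per-regime ARX fit can look like, with plain least squares; the orders (na, nb), the single exogenous input and the toy trace are illustrative assumptions, not MARC's actual code:

```python
# Hedged sketch: fitting a per-regime ARX model by least squares.
import numpy as np

def fit_arx(y, u, na=2, nb=2):
    """Fit y(k) ~ sum_i a_i*y(k-i) + sum_j b_j*u(k-j), i=1..na, j=1..nb.
    y: output trace (e.g., power), u: exogenous input (e.g., a PMC rate)."""
    lag = max(na, nb)
    X = [[y[k - i] for i in range(1, na + 1)] + [u[k - j] for j in range(1, nb + 1)]
         for k in range(lag, len(y))]
    theta, *_ = np.linalg.lstsq(np.asarray(X), np.asarray(y[lag:]), rcond=None)
    return theta[:na], theta[na:]   # AR coefficients, exogenous coefficients

# Toy trace: an autoregressive process driven by a synthetic input.
rng = np.random.default_rng(0)
u = rng.uniform(0, 1, 500)
y = np.zeros(500)
for k in range(2, 500):
    y[k] = 0.6 * y[k-1] + 0.2 * y[k-2] + 1.5 * u[k-1] + rng.normal(0, 0.01)
a, b = fit_arx(y, u)
print(a, b)   # expected close to [0.6, 0.2] and [1.5, ~0]
```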
16. Predicting configuration switches
• PHASE 2B: Hidden Markov Models
• BY OBSERVING PERIODICITY, a predictive configuration-switching model is computed
[Figure: traced power and energy consumption over time; a sudden slope change marks the switch between two linear approximations]
(2) Generalization: Model and Analysis of Resource Consumption (MARC)
17. Tackling the residual non-linearity
• PHASE 2A: Signal Models and Time Series Analysis
• WITHIN EACH WORKING REGIME, the residual non-linearity is addressed by exploiting time-series analyses
[Figure: traced battery level over time (32,000s to 40,000s) with its linear battery-discharge approximation]
(2) Generalization: Model and Analysis of Resource Consumption (MARC)
18. Outline
1. A first case study: power models for Android devices
2. Generalization: Model and Analysis of Resource Consumption (MARC)
3. Virtual guests monitoring: towards power-awareness for Xen
4. Modeling power consumption in multi-tenant virtualized systems
5. Maximizing performance under a power cap: a hybrid approach
6. Moving forward: containerization, challenges and opportunities
7. Conclusion and future work
19. Use case: Power consumption models for Xen domains
• Question: “how much is a virtual tenant consuming?”
• THE XEN HYPERVISOR: a Type 1 hypervisor currently employed in many production environments
[Figure: Xen architecture: hardware (CPU, memory, I/O); the Xen hypervisor (config, scheduler, MMU, timers, interrupts); Dom0 with kernel, drivers, PV backends and the toolstack; paravirtualized guest domains (Dom1, Dom2, ..., DomU), each running a guest OS and applications]
(3) Virtual guests monitoring: towards power-awareness for Xen
20. Use case: Power consumption models for Xen domains
• ASSUMPTION: “The power consumption of a system depends on what the hardware is doing”
• Proposed solution: model virtual tenants' power consumption exploiting hardware event traces, collected and attributed to each one of them
[Figure: the same Xen architecture diagram as the previous slide]
(3) Virtual guests monitoring: towards power-awareness for Xen
21. Tracing the Domains’ behavior
XEMPOWER: collect and account hardware events to virtual tenants in two steps:
1. In the Xen scheduler (kernel-level)
• At every context switch, trace the interesting hardware events
• e.g., INST_RET, UNHALTED_CLOCK_CYCLES, LLC_REF, LLC_MISS
2. In Domain 0 (privileged tenant)
• Periodically acquire the event traces and aggregate them on a domain basis
[Figure: per-core timelines of domains being context-switched; the Xen kernel traces hardware events per core and energy per socket, collected by the XeMPowerDaemon and XeMPowerCLI in Dom0]
(3) Virtual guests monitoring: towards power-awareness for Xen
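A minimal sketch of the Dom0-side aggregation step, assuming per-context-switch records of (domain, event, delta); the record layout is an illustrative placeholder, not XeMPower's actual trace format:

```python
# Hedged sketch: fold per-context-switch PMC samples into per-domain totals
# over a reporting window.
from collections import defaultdict

def aggregate_window(samples):
    """samples: iterable of (domain_id, event_name, delta_count) tuples,
    one per context switch, as read from the kernel-level trace buffer."""
    per_domain = defaultdict(lambda: defaultdict(int))
    for domain_id, event, delta in samples:
        per_domain[domain_id][event] += delta
    return per_domain

window = [(1, "INST_RET", 120_000), (2, "INST_RET", 80_000),
          (1, "LLC_MISS", 3_000), (1, "INST_RET", 95_000)]
print(dict(aggregate_window(window)[1]))  # domain 1 totals for this window
```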
22. Outline
1. A first case study: power models for Android devices
2. Generalization: Model and Analysis of Resource Consumption (MARC)
3. Virtual guests monitoring: towards power-awareness for Xen
4. Modeling power consumption in multi-tenant virtualized systems
5. Maximizing performance under a power cap: a hybrid approach
6. Moving forward: containerization, challenges and opportunities
7. Conclusion and future work
23. Power models: state-of-the-art approaches
Workload classes:
(a) idle
(b) weak I/O intensive
(c) memory intensive
(d) CPU intensive
(e) strong I/O intensive
Use a single power model, built on different hardware events:
A. INST_RET, UNHALTED_CLOCK_CYCLES, LLC_REF, LLC_MISS
B. INST_RET, UNHALTED_CLOCK_CYCLES, LLC_REF
C. UNHALTED_CLOCK_CYCLES, LLC_REF
Configuration | Model A (RMSE / rel. error) | Model B (RMSE / rel. error) | Model C (RMSE / rel. error)
(a) | ±17.63 W / 35.56% | ±16.44 W / 32% | ±17.68 W / 35%
(b) | ±4.7 W / 9.4% | ±5.86 W / 11.7% | ±7.17 W / 14%
(c) | ±19.11 W / 38% | ±34.54 W / 70% | ±18.7 W / 37%
(d) | ±0.44 W / 0.08% | ±0.6 W / 1.2% | ±0.42 W / 0.08%
(e) | ±2.98 W / 5.9% | ±38.57 W / 77% | ±3.29 W / 6.5%
average | ±8.97 W / 17.79% | ±19.20 W / 38.38% | ±9.45 W / 18.52%
Table 6.9: the modelling errors (Root MSE and mean relative error) obtained with state-of-the-art models
• The best average model is the worst on a single configuration
• No model is better than the others consistently w.r.t. all the configurations
(4) Modeling power consumption in multi-tenant virtualized systems
24. Power modeling flow
(4) Modeling power consumption in multi-tenant virtualized systems
[Figure: the power modeling flow, from the collected traces to the models' exploitation]
25. Experimental evaluation
• Goals of the experiments:
A. assess the precision of the modeling methodology
B. explore model portability on different hardware platforms
C. evaluate colocation of different tenants
• Benchmarks
– Apache Spark (SVM and PageRank)
– Redis (memory-intensive)
– MySQL and Cassandra (I/O-intensive)
– FFmpeg (CPU-intensive)
• Experimental setup
– A. WRK: Intel Core i7 @ 3.40GHz, 8GB DDR3 RAM
– B. SRV1: Intel Xeon @ 2.80GHz, 16GB DDR3 RAM
– C. SRV2: two Intel Xeon @ 2.30GHz, 128GB DDR4 RAM
(4) Modeling power consumption in multi-tenant virtualized systems
26. (4) Modeling power consumption in multi-tenant virtualized systems
• RMSE around 1W on average,
under 2W in almost all the cases;
• only three results present a
worse behavior (still under 5W)
• Relative error around 2% on
average, under 4% in almost all
the cases
• only three results present a
worse behavior (still under 10%)
Results generally outperform the
works in literature [1,2,3], even in
the worst cases
[1] Anton Beloglazov, Rajkumar Buyya, Young Choon Lee, Albert Zomaya, et al. A taxonomy and survey of energy-efficient data centers and cloud computing systems. Advances in Computers, 82(2):47–111, 2011.
[2] W. Lloyd Bircher and Lizy K. John. Complete system power estimation: A trickle-down approach based on performance events. In IEEE International Symposium on Performance Analysis of Systems & Software (ISPASS), pages 158–168. IEEE, 2007.
[3] Hailong Yang, Qi Zhao, Zhongzhi Luan, and Depei Qian. iMeter: An integrated VM power model based on performance profiling. Future Generation Computer Systems, 36:267–286, 2014.
27. Outline
1. A first case study: power models for Android devices
2. Generalization: Model and Analysis of Resource Consumption (MARC)
3. Virtual guests monitoring: towards power-awareness for Xen
4. Modeling power consumption in multi-tenant virtualized systems
5. Maximizing performance under a power cap: a hybrid approach
6. Moving forward: containerization, challenges and opportunities
7. Conclusion and future work
28. Problem definition
(5) Maximizing performance under a power cap: a hybrid approach
• Two points of view:
A. minimize power consumption given a minimum performance requirement
B. maximize performance given a limit on the maximum power consumption
• Requirements:
– work in a virtualized environment
– avoid instrumentation of the guest workloads
• Steps towards the goal:
1. identify a performance metric for all the hosted tenants
2. define a resource allocation policy to deal with the requirements
3. extend the hypervisor to provide the right knobs
29. Power capping approaches
• SOFTWARE APPROACH: ✓ efficiency, ✖ timeliness
• HARDWARE APPROACH: ✖ efficiency, ✓ timeliness
[Figure: spectrum of power capping techniques between the two extremes: model-based monitoring [3], thread migration [2], resource management and CPU quota on the software side; DVFS [4] and RAPL [1] on the hardware side]
(5) Maximizing performance under a power cap: a hybrid approach
[1] H. David, E. Gorbatov, U. R. Hanebutte, R. Khanna, and C. Le. Rapl: Memory power estimation and capping. In International Symposium on Low Power Electronics and Design (ISPLED), 2010.
[2] R. Cochran, C. Hankendi, A. K. Coskun, and S. Reda. Pack & cap: adaptive dvfs and thread packing under power caps. In International Symposium on Microarchitecture (MICRO), 2011.
[3] M. Ferroni, A. Cazzola, D. Matteo, A. A. Nacci, D. Sciuto, and M. D. Santambrogio. MPower: gain back your android battery life! In Proceedings of the 2013 ACM conference on Pervasive and ubiquitous computing adjunct publication, pages 171–174. ACM, 2013.
[4] T. Horvath, T. Abdelzaher, K. Skadron, and X. Liu. Dynamic voltage scaling in multitier web servers with end-to-end delay control. In Computers, IEEE Transactions. IEEE, 2007.
30. Power capping approaches
• HYBRID APPROACH [5]: ✓ efficiency, ✓ timeliness
[Figure: the same spectrum of software and hardware techniques as the previous slide, now bridged by the hybrid approach]
[5] H. Zhang and H. Hoffmann. Maximizing performance under a power cap: A comparison of hardware, software, and hybrid techniques. In International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2016.
(5) Maximizing performance under a power cap: a hybrid approach
32. System design
• The workloads run in paravirtualized domains
(5) Maximizing performance under a power cap: a hybrid approach
33. System design
• XeMPUPiL spans all the layers
(5) Maximizing performance under a power cap: a hybrid approach
34. System design
• Instruction Retired (IR) metric gathered and accounted to each domain, thanks to XeMPower
• The aggregation is done over a time window of 1 second
(5) Maximizing performance under a power cap: a hybrid approach
35. System design
• Observation of both hardware events (i.e., IR) and power consumption (whole CPU socket)
(5) Maximizing performance under a power cap: a hybrid approach
36. System design
– given a workload with M virtual resources and an assignment of N physical resources, to each pCPUi we assign: [formula on slide]
(5) Maximizing performance under a power cap: a hybrid approach
37. System design
• Hybrid actuation:
– enforce power cap via RAPL
– define a CPU pool for the workload and pin the workload’s vCPUs over pCPUs
(5) Maximizing performance under a power cap: a hybrid approach
40. Experimental evaluation
• Goals of the experiments:
A. how do different workloads perform under a power cap?
B. can we achieve higher efficiency w.r.t. the RAPL power cap?
• Benchmarks
– Embarrassingly Parallel (EP)
– IOzone
– cachebench
– Block Tri-diagonal solver (BT)
• Three power caps explored: 40W, 30W and 20W
• Results are normalized w.r.t. the performance obtained with no power cap
• Experimental setup
– 2.8-GHz quad-core Intel Xeon
– 32GB of RAM
– Xen hypervisor version 4.4
(5) Maximizing performance under a power cap: a hybrid approach
41. Preliminary evaluation: how do the benchmarks perform under a power cap?
[Figure: normalized performance (0 to 1.0) of EP, cachebench, IOzone and BT with no cap (NO RAPL) and under RAPL 40, RAPL 30 and RAPL 20]
(5) Maximizing performance under a power cap: a hybrid approach
42. Preliminary evaluation (cont.)
• For CPU-bound benchmarks (i.e., EP and BT), the differences between power caps are significant
[Figure: the same normalized-performance chart as the previous slide]
(5) Maximizing performance under a power cap: a hybrid approach
43. Preliminary evaluation (cont.)
• With I/O- and/or memory-bound workloads, the performance degradation between different power caps is less significant
[Figure: the same normalized-performance chart as the previous slide]
(5) Maximizing performance under a power cap: a hybrid approach
44. Performance of the workloads with XeMPUPiL, for different power caps:
– higher performance than RAPL, in general
– not always true on a pure CPU-bound benchmark (i.e., EP)
[Figure: normalized performance of EP, cachebench, IOzone and BT under PUPiL vs. RAPL, at 40W, 30W and 20W caps]
(5) Maximizing performance under a power cap: a hybrid approach
47. Outline
1. A first case study: power models for Android devices
2. Generalization: Model and Analysis of Resource Consumption (MARC)
3. Virtual guests monitoring: towards power-awareness for Xen
4. Modeling power consumption in multi-tenant virtualized systems
5. Maximizing performance under a power cap: a hybrid approach
6. Moving forward: containerization, challenges and opportunities
7. Conclusion and future work
48. Containerization: opportunities and challenges
(6) Moving forward: containerization, challenges and opportunities
A different road to multi-tenancy
• Group the application and all its dependencies in a single container
• The host operating system sees a container as a group of processes
Proposed solution
• A power-aware orchestrator for Docker containers
Manage resources to meet the power consumption goal
• A policy-based system
Guarantee performance of the containers while staying under the power cap
54. Experimental evaluation
• Goals of the experiments:
A. is the software-level power cap stable and precise?
B. are we able to meet the performance requirements of the containers?
• Benchmarks
– fluidanimate (fluid simulation)
– x264 (video encoding)
– dedup (compression)
• Three power caps explored: 40W, 30W and 20W
• All the benchmark containers run simultaneously on the same node
• Baseline: Intel RAPL power capping solution
• Experimental setup
– 2.8-GHz quad-core Intel Xeon
– 32GB of RAM
– Docker 1.11.2
(6) Moving forward: containerization, challenges and opportunities
55. Performance: Fair Partitioning vs. RAPL
• Comparison between performance-agnostic approaches: Fair partitioning policy vs. RAPL
• Performance metric: Time To Completion (lower is better)
• Comparable performance, better results on lower power caps
[Figure: Time To Completion of dedup, fluidanimate and x264 under 40W, 30W and 20W power caps]
(6) Moving forward: containerization, challenges and opportunities
56. Performance: all policies
• Comparing fair and performance-aware approaches
• Performance metric: Time To Completion (lower is better)
[Figure: Time To Completion of dedup, fluidanimate and x264 under 40W, 30W and 20W power caps, for all partitioning policies]
(6) Moving forward: containerization, challenges and opportunities
58. Performance: all policies (cont.)
• Comparing fair and performance-aware approaches
• fluidanimate is set to High Priority with an SLO of 400s
• Performance metric: Time To Completion (lower is better)
[Figure: Time To Completion of dedup, fluidanimate and x264 under 40W, 30W and 20W power caps, for all partitioning policies]
(6) Moving forward: containerization, challenges and opportunities
59. Conclusion
1. A first case study: power models for Android devices
Better performance w.r.t. Android L predictions
2. Generalization: Model and Analysis of Resource Consumption (MARC)
Modeling pipeline has been generalized and provided “as-a-service”
3. Virtual guests monitoring: towards power-awareness for Xen
HW events are traced with negligible overhead on the system
4. Modeling power consumption in multi-tenant virtualized systems
Better performance w.r.t. SoA approaches
5. Maximizing performance under a power cap: a hybrid approach
Better performance w.r.t. standard RAPL power cap
6. Moving forward: containerization, challenges and opportunities
Promising results towards a performance-aware and power-aware orchestration
60. Future Work
• We want to validate the modeling methodology on different resources
• Time-to-Completion of Hadoop jobs
• We want to exploit these models to:
• detect anomalies in a distributed microservice infrastructure
• perform better resource allocation and consolidation
64. Working Regime identification
A single model is not enough: we explored the MARC approach
Question: What is a working regime in this case study?
Identified a posteriori, by looking at the different slopes on the trace graph
[Figure: traced power (25W to 75W) and cumulative energy consumption over a 4000s run; the different slopes of the energy trace mark the working regimes]
(4) Modeling power consumption in multi-tenant virtualized systems
65. Working Regime identification: how many are they?
KERNEL DENSITY ESTIMATION (KDE)
By observing the local minima of the reconstructed distribution of power consumption, we identify the points where a Working Regime change happens
LINEAR RANGES
0: [0W, 42W)
1: [42W, 57W)
2: [57W, +∞)
[Figure: the traced power/energy run, and the KDE-reconstructed probability density of power consumption (10W to 90W), whose local minima at 42W and 57W separate the three linear ranges]
(4) Modeling power consumption in multi-tenant virtualized systems
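A minimal sketch of this KDE step; the bandwidth (scipy's default) and the toy trace are illustrative assumptions:

```python
# Hedged sketch: reconstruct the power distribution with KDE and use its
# local minima as boundaries between working regimes.
import numpy as np
from scipy.stats import gaussian_kde
from scipy.signal import argrelextrema

def regime_boundaries(power_samples, grid_points=512):
    kde = gaussian_kde(power_samples)            # density over the power trace
    grid = np.linspace(power_samples.min(), power_samples.max(), grid_points)
    density = kde(grid)
    minima = argrelextrema(density, np.less)[0]  # indices of local minima
    return grid[minima]                          # power values splitting regimes

# Toy trace mixing three "regimes" around 35W, 50W and 65W:
rng = np.random.default_rng(0)
trace = np.concatenate([rng.normal(35, 2, 1000),
                        rng.normal(50, 2, 1000),
                        rng.normal(65, 2, 1000)])
print(regime_boundaries(trace))  # expected: boundaries near 42W and 57W
```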
66. From hardware events to Working Regimes (1)
RELIEFF + KDE
1. ReliefF is used to identify which features best induce the Working Regime classification identified before
[Figure: ReliefF weights (0 to 0.16) assigned to each of the 32 candidate features]
(4) Modeling power consumption in multi-tenant virtualized systems
67. From hardware events to Working Regimes (2)
RELIEFF + KDE
2. For each Working Regime, the distribution of the values of that feature is reconstructed using KDE
3. The distributions are compared to obtain discriminant values
[Figure: per-class (CLASS 0, CLASS 1, CLASS 2) KDE densities of the selected feature's PMC values, from 0 to 8×10⁹]
(4) Modeling power consumption in multi-tenant virtualized systems
68. From hardware events to Working Regimes (3)
RELIEFF + KDE
RESULT: a Working Regime classifier that is able to determine in which Working Regime the system is, starting from the sampled features
[Figure: per-class KDE densities of the INST_RET values and the resulting discriminant ranges, e.g., [0, 1.235e9] for regime 0, [3.61e9, 5.58e9) for regime 1 and [5.58e9, +∞) for regime 2, with (1.235e9, 3.61e9) left as an uncertain zone]
(4) Modeling power consumption in multi-tenant virtualized systems
69. From hardware events to Working Regimes (4)
RELIEFF + KDE
• In case of uncertainty, repeat from ReliefF:
• eliminating the already selected features
• eliminating all the data that are not part of the uncertain zone
[Figure: the discriminant table extended with a second feature (L1_HIT), whose ranges [0, 2.36362e8], (2.36362e8, 5.672e8) and [5.672e8, +∞) resolve the uncertain INST_RET zone (1.235e9, 3.61e9)]
(4) Modeling power consumption in multi-tenant virtualized systems
76. Resource control
With the feedback-control-loop logic, we find the allocation of resources that ensures the power cap
[Figure: the CPU quota cap carved out of the 100% available resource]
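A minimal sketch of such a feedback loop; the proportional gain, the starting quota and the simulated plant are illustrative placeholders for the real RAPL readings and `docker update` calls:

```python
# Hedged sketch: adjust the total CPU quota until measured power settles
# under the cap.
CAP_W = 30.0       # power cap in Watts
KP = 5_000         # proportional gain: quota units per Watt of error
quota = 600_000    # total CPU quota (us of CPU time per 100ms cgroup period)

def read_socket_power(q):
    # Placeholder plant: real code would derive Watts from RAPL energy counters.
    return 10.0 + q / 20_000

for _ in range(20):                            # one iteration per control period
    error = CAP_W - read_socket_power(quota)   # >0: headroom, <0: over the cap
    quota = max(10_000, quota + int(KP * error))
    # here: partition `quota` across containers and apply via `docker update`
print(quota, read_socket_power(quota))         # settles near the 30W cap
```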
78. Resource partitioning
We explore three different partitioning policies:
• Fair resource partitioning
• Priority-aware resource partitioning
• Throughput-aware resource partitioning
[Figure: the CPU quota cap, to be partitioned across containers C1, C2, C3, C4]
79. 1. Fair resource partitioning
• The quota Q is evenly partitioned across all the containers
• No control over the throughput of a single container
[Figure: quota Q split as Q/4 to each of the containers C1 to C4]
80. 2. Priority-aware partitioning
• The quota Q is partitioned following the priority of each container
• The quota of the single container is estimated through a weighted mean, where every priority has its own associated weight
[Figure: container C1 (HIGH priority) and containers C2, C3, C4 (LOW priority) receiving weighted shares of Q]
81. 3. Throughput-aware resource partitioning
• The quota Q is partitioned following the priority of each container and its Service Level Objectives (SLO)
• SLO is here defined as the Time-To-Completion (TTC) of the task
[Figure: high-priority containers C1 and C2 with SLO1 and SLO2, low-priority containers C3 and C4 served best-effort (BE)]
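A minimal sketch of the three policies; the priority weights and the SLO-driven correction are illustrative choices, not necessarily DockerCap's exact formulas:

```python
# Hedged sketch: three ways of partitioning a total quota Q across containers.
def fair(q, containers):
    return {c: q / len(containers) for c in containers}

def priority_aware(q, priorities, weights=None):
    """priorities: {container: 'HIGH' | 'LOW'}; weighted split of the quota."""
    weights = weights or {"HIGH": 3, "LOW": 1}
    total = sum(weights[p] for p in priorities.values())
    return {c: q * weights[p] / total for c, p in priorities.items()}

def throughput_aware(q, slo_ttc, predicted_ttc, best_effort):
    """Scale each high-priority container's fair share by how far it is from
    its SLO (predicted TTC / SLO TTC), then split the rest best-effort."""
    n = len(slo_ttc) + len(best_effort)
    shares, remaining = {}, q
    for c, slo in slo_ttc.items():
        share = min(remaining, (q / n) * predicted_ttc[c] / slo)
        shares[c] = share
        remaining -= share
    for c in best_effort:
        shares[c] = remaining / len(best_effort)
    return shares

print(priority_aware(400_000, {"C1": "HIGH", "C2": "LOW", "C3": "LOW", "C4": "LOW"}))
```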
82. Experimental setup
All the benchmark containers run simultaneously on the same node
HW: Intel Xeon E5-1410, 32GB RAM | OS: Ubuntu 14.04, Linux 3.19.0-42 | CONTAINER ENGINE: Docker 1.11.2 | RUNTIME: Python 2.7.6
BENCHMARK CONTAINERS (PARSEC):
• fluidanimate: fluid dynamics simulation (generic CPU-bound)
• x264: video streaming encoding (e.g., video surveillance)
• dedup: compression (cloud-fog communication)
83. Goals of the experiments
• PRECISION OF THE POWER CAPPING: manage machine power consumption
• PERFORMANCE OF THE CONTAINERS: allocate resources to meet containers’ requirements
The comparison is done with the state-of-the-art power capping solution RAPL by Intel [1]
84. Precision of the power capping
• Comparable results in terms of average power consumption under the power cap
• As expected, RAPL provides a more stable power capping
[Figure: measured power over time under the Fair, Priority-aware and Throughput-aware policies vs. RAPL]
85. Performance: Fair Partitioning vs. RAPL
• Comparison between the performance-agnostic approaches
• Performance metric: Time To Completion (lower is better)
[Figure: Time To Completion of dedup, fluidanimate and x264 under 40W, 30W and 20W power caps]
86. Performance: all policies
• Comparison with the performance-aware approaches
• fluidanimate is set to High Priority with an SLO of 400s
• Performance metric: Time To Completion (lower is better)
[Figure: Time To Completion of dedup, fluidanimate and x264 under 40W, 30W and 20W power caps, for all policies]
87. Conclusion and future work
✓ We presented DockerCap, a power-aware orchestrator that manages containers’ resources
✓ We showed how DockerCap is able to limit the power consumption of the machine
✓ We discussed three distinct partitioning policies and compared their impact on containers’ SLO
FUTURE DIRECTIONS
• Exploit both HW and SW power capping
• Improve the precision of the power capping with
more refined modeling techniques [2]
• Compute the right allocation of resources online by
observing the performance of the containers
[2] Andrea Corna and Andrea Damiani. A scalable framework for resource consumption modelling: the MARC approach.
Master’s thesis. Politecnico di Milano, 2016.
89. Experimental settings and benchmarks

| | “XARC1” Dell OptiPlex 990 | “SANDY” Dell PowerEdge T320 |
| Processor | Intel Core i7-2600 @ 3.40GHz | Intel Xeon CPU E5-1410 @ 2.80GHz |
| Memory | 4 banks of Synchronous 2GB DIMM DDR3 RAM @ 1.33GHz | 2 banks of Synchronous 16GB DIMM DDR3 RAM @ 1.60GHz |
| Storage | Seagate 250GB 7200rpm 8MB cache SATA 3.5” HDD | Western Digital 250GB 7200rpm 16MB cache SATA 3.5” HDD |
| Network | Intel 82579LM Gigabit Network Connection | Broadcom NetXtreme BCM5720 Gigabit Ethernet PCIe |

TRAIN SET: Micro Benchmarks [1]
• NAS Parallel Benchmarks (CPU/memory features)
• Cachebench (cache hierarchy)
• IOzone (disk I/O operations)

TEST SET: Realistic Benchmarks
• Redis Server (non-relational DBMS interrogations)
• MySQL Server (relational DBMS queries)
• FFMPEG (audio/video transcoding and compression)

[1] YANG, Hailong, et al. iMeter: An integrated VM power model based on performance profiling. Future Generation Computer Systems, 2014, 36: 267-286.
90. Power models: the MARC approach (1)
LOWER BOUND IN THE STATE OF THE ART: 5% of relative error [1]
TRAIN AND TEST ON THE SAME PHYSICAL MACHINE:

SANDY | RMSE | Relative error | Coverage
Redis | ±0.58W | 1.10% | 100.00%
MySQL | ±1.94W | 3.80% | 100.00%
FFMPEG | ±0.51W | 1.00% | 100.00%

XARC1 | RMSE | Relative error | Coverage
Redis | ±2.07W | 4.14% | 100.00%
MySQL | ±9.27W | 18.5% | 100.00%
FFMPEG | ±1.32W | 2.64% | 99.90%

[1] YANG, Hailong, et al. iMeter: An integrated VM power model based on performance profiling. Future Generation Computer Systems, 2014, 36: 267-286.
91. Power models: the MARC approach (2)
TRAIN AND TEST ON THE SAME PHYSICAL MACHINE:

SANDY | RMSE | Relative error | Coverage
Redis | ±0.58W | 1.10% | 100.00%
MySQL | ±1.94W | 3.80% | 100.00%
FFMPEG | ±0.51W | 1.00% | 100.00%

TRAIN ON XARC1, TEST ON SANDY:

XARC1 | RMSE | Relative error | Coverage
Redis | ±0.61W | 1.23% | 99.70%
MySQL | ±1.97W | 3.86% | 100.00%
FFMPEG | ±0.63W | 1.26% | 100.00%
93. The concept of Working Regime
• Domain-specific feature: hardware modules currently used
• We defined the concept of working regime:
“Given the controllable hardware modules on a device, a working regime is a combination of their internal state”
[Figure: working regimes A, B and C]
(1) A first case study: power models for Android devices
94. MISO Model for every configuration
• We tackle the problem of power model estimation in a fixed configuration with a linear Multiple Input Single Output (MISO) model
• The battery prediction is computed from the previous battery levels and the exogenous input values, through the model parameters
(1) A first case study: power models for Android devices
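A hedged sketch of such a linear MISO (ARX) structure, with illustrative orders n_a, n_b and m exogenous inputs (the exact regressors used in the thesis may differ):

```latex
% \hat{y}(k): predicted battery level, y(k-i): previous battery levels,
% u_j(k-l): exogenous inputs, a_i and b_{j,l}: model parameters
\hat{y}(k) = \sum_{i=1}^{n_a} a_i \, y(k-i) + \sum_{j=1}^{m} \sum_{l=1}^{n_b} b_{j,l} \, u_j(k-l)
```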
95. Actions on controllable variables
• They are determined by the user’s behavior
• We model the evolution of the smartphone’s configuration as a Markov Decision Process
• A state for every configuration
• Transitions’ weights represent the probability to go from one configuration to another
[Figure: configurations A, B and C as states, connected by weighted transitions]
(1) A first case study: power models for Android devices
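A minimal sketch of estimating those transition weights from an observed configuration trace; this is a plain row-normalized Markov-chain estimate, on top of which the thesis' MDP formulation adds the actions on controllable variables:

```python
# Hedged sketch: transition probabilities as normalized transition counts.
from collections import Counter

def transition_matrix(states):
    pair_counts = Counter(zip(states, states[1:]))   # (from, to) occurrences
    out_counts = Counter(states[:-1])                # outgoing transitions per state
    labels = sorted(set(states))
    return {s: {t: pair_counts[(s, t)] / (out_counts[s] or 1) for t in labels}
            for s in labels}

trace = ["A", "A", "B", "A", "C", "C", "A", "B", "B", "A"]
for s, row in transition_matrix(trace).items():
    print(s, {t: round(p, 2) for t, p in row.items()})
```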
97. Proposed Approach
• At each context switch, start counting the hardware events of interest
• The configured PMC registers store the counts associated with the domain that is about to run
[Figure: per-core timeline in the Xen kernel; counting starts (A markers) for the domains being scheduled]
98. Proposed Approach
• At the next context switch, read and store PMC values, accounted to the domain that was running
• Counters are then cleared
[Figure: per-core timeline; counts are read back (B markers) at each context switch]
99. Proposed Approach
• Steps A and B are performed at every context switch on every system CPU (i.e., physical core or hardware thread)
• The reason is that each domain may have multiple virtual CPUs (VCPUs)
[Figure: per-core timelines (Core 0 to Core N) with A/B markers at every context switch]
100. Proposed Approach
• Finally, the PMC values are aggregated by domain and reported or used for other estimations
• Expose the collected data to a higher level: how?
[Figure: the per-core traces flow from the Xen kernel to the XeMPowerDaemon in Dom0]
101. Proposed Approach
xentrace:
• a lightweight trace-capturing facility present in Xen
• we tag every trace record with the ID of the scheduled domain and its current VCPU
• a timestamp is kept to later reconstruct the trace flow
[Figure: the XeMPowerDaemon in Dom0 collecting hardware events per core and energy per socket from the Xen kernel]
102. Use Case: Power Consumption Attribution
Use case:
• Enable real-time attribution of CPU power consumption to each guest
• Socket-level energy measurements are also read (via the Intel RAPL interface) at each context switch
[Figure: the full XeMPower stack: kernel-level tracing, XeMPowerDaemon and XeMPowerCLI in Dom0]
103. Use Case: Power Consumption Attribution
Power models from PMC traces:
• High correlation between hardware events and power consumption [28]
• The non-halted cycles count is the best metric to correlate power consumption (linear correlation coefficient above 0.95)
• Such correlation suggests that the higher the rate of non-halted cycles for a domain is, the more CPU power the domain consumes
[Figure: the full XeMPower stack, as in the previous slide]
104. Use Case: Power Consumption Attribution
(same correlation observations as the previous slide)
Idea:
• Split system-level power consumption and account it to virtual guests
[Figure: the full XeMPower stack, as in the previous slide]
105. Use Case: Power Consumption Attribution
Proposed approach to account power:
1. For each tumbling window, the XeMPower daemon calculates the total number of non-halted cycles (one of the PMCs traced)
2. We estimate the percentage of non-halted cycles for each domain over the total number of non-halted cycles; this represents the contribution of each domain to the whole CPU power consumption
3. Finally, we split the socket power consumption proportionally to the estimated contributions of each domain
[Figure: the full XeMPower stack, as in the previous slide]
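A minimal sketch of step 3, the proportional split (the window data is illustrative):

```python
# Hedged sketch: split the measured socket power across domains according to
# their share of non-halted cycles in the tumbling window.
def attribute_power(socket_power_w, nonhalted_cycles_by_domain):
    total = sum(nonhalted_cycles_by_domain.values())
    if total == 0:
        return {d: 0.0 for d in nonhalted_cycles_by_domain}
    return {d: socket_power_w * c / total
            for d, c in nonhalted_cycles_by_domain.items()}

# One tumbling window: Dom0 plus two guests, 40W measured at the socket.
print(attribute_power(40.0, {"Dom0": 1.2e9, "Dom1": 2.4e9, "Dom2": 0.4e9}))
# -> Dom1 gets 60% of 40W = 24W, etc.
```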
106. Experimental evaluation
• Back to the XeMPower requirements:
1. provide precise attribution of hardware events to virtual tenants ✓
2. agnostic to the mapping between virtual and physical resources, hosted applications and scheduling policies ✓
3. add negligible overhead
• Goals of the experimental evaluation:
– show how XeMPower monitoring components incur very low overhead under different configurations and workload conditions
107. Experimental evaluation
• Overhead metric:
– the difference in the system’s power consumption
while using XeMPower versus an off-the-shelf Xen 4.6
installation
• Experimental setup:
– 2.8 GHz quad-core Intel Xeon E5-1410 processor (4
hardware threads)
– a Watts up? PRO meter to monitor the entire
machine’s power consumption
– Each guest repeatedly runs a multi-threaded
compute-bound microbenchmark on three VCPUs
and uses a stripped-down Linux 3.14 as the guest OS
108. Experimental evaluation
• Three system configurations:
1. the baseline configuration uses off-the-shelf Xen 4.4
2. the patched configuration introduces the kernel-level
instrumentation without the XeMPower daemon
3. the monitoring configuration is the patched one, with the XeMPower daemon running and reporting statistics
• Four running scenarios:
– an idle scenario in which the system only runs Dom0
– 3 running-n scenarios, where n = {1, 2, 3} indicates the number of
guest domains in addition to Dom0
• The idea is to stress the system with an increasing number of
CPU-intensive tenant applications
• This increases the amount of data traced and collected by
XeMPower
109. Experimental Results
• Mean power consumption (μ), in Watts, for scenarios idle and running-{1,2,3}, and configurations baseline (b), patched (p), and monitoring (m)
• Mean power values are reported with their 95% confidence interval
• At a glance, we can see how measurements are pretty close
[Table on slide: mean power for the pinned-VCPU and unpinned-VCPU cases]
110. Experimental Results
• We estimate an upper bound ϵ for the maximum overhead using a
hypothesis test:
• A rejection of the null hypothesis means that there is strong statistical
evidence that the power consumption overhead is lower than ϵ
• We compute ϵ for the considered test cases and scenarios, ensuring
average values of power consumption (μ) with confidence: α = 5%
• We want to compare the overhead with the one measured for XenMon, a
performance monitoring tool for Xen
• unlike XeMPower, XenMon does not collect PMC reads
• it is still a reference design in the context of runtime monitoring for
the Xen ecosystem
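A minimal sketch of such a one-sided test, here done with a Welch t-test over a grid of candidate ε values on synthetic samples; the thesis' exact test statistic and search may differ:

```python
# Hedged sketch: find the smallest eps such that H0 (overhead >= eps) is
# rejected at alpha = 5%, i.e., evidence that the overhead is below eps.
import numpy as np
from scipy import stats

def overhead_upper_bound(baseline_w, monitoring_w, alpha=0.05):
    for eps in np.arange(0.05, 5.0, 0.05):
        # One-sided Welch test: is mean(monitoring - eps) < mean(baseline)?
        _, p = stats.ttest_ind(monitoring_w - eps, baseline_w,
                               equal_var=False, alternative="less")
        if p < alpha:
            return eps
    return None

rng = np.random.default_rng(1)
base = rng.normal(74.0, 0.3, 200)   # baseline power samples (W)
mon = rng.normal(74.5, 0.3, 200)    # monitoring power samples (W)
print(overhead_upper_bound(base, mon))  # eps slightly above the 0.5W true overhead
```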
111. Experimental Results
• Estimated upper bound ϵ for the power consumption overhead, in Watts
• Parenthetical values are the overheads w.r.t. mean power consumption
• XeMPower introduces an overhead not greater than 1.18W (1.58%),
observed for the [unpinned-VCPU, running-3, patched] case
• In all the other cases, the overhead is less than 1W (and less than 1%)
• This result is satisfactory compared to an overhead of 1-2% observed for
XenMon, the reference implementation for XeMPower
113. Related work: PUPiL [5]
[5] H. Zhang and H. Hoffmann. Maximizing performance under a power cap: A comparison of hardware, software, and hybrid techniques. In International Conference on Architectural Support for
Programming Languages and Operating Systems (ASPLOS), 2016.
• PUPiL, a hybrid framework that aims at achieving both timeliness and efficiency
• Proposed approach:
– both hardware (i.e., the Intel RAPL interface [10]) and software (i.e.,
resource partitioning and allocation) techniques
– exploits a canonical ODA control loop, one of the main building blocks of
self-aware computing
• Limitations
– the applications running on the system need to be instrumented with the Heartbeat framework, to provide a uniform metric of throughput
– applications running bare-metal on Linux
• These conditions might not hold in the context of a multi-tenant
virtualized environment
114. The Xen Hypervisor
Slides from: http://www.slideshare.net/xen_com_mgr/xpds16-porting-xen-on-arm-to-a-new-soc-julien-grall-arm
115. 1. Performance metric identification
• Hardware event counters as low level metrics of
performance
• We exploit the Intel Performance Monitoring Unit (PMU)
to monitor the number of Instruction Retired (IR)
accounted to each domain in a certain time window
– an insight on how many instructions were completely executed (i.e., that successfully reached the end of the pipeline)
– it represents a reasonable indicator of performance, as the manufacturer itself suggests [6]
[6] Clockticks per instructions retired (cpi). https://software.intel.com/en-us/node/544403. Accessed: 2016-06-01.
116. 2. Decision phase and virtualization
• Evaluation criterion: the average IR rate over a certain time
window
– the time window allows the workload to adapt to the actual
configuration
– the comparison of IR rates of different configurations highlights
which one makes the workload perform better
• Resource allocation granularity: core-level
– each domain owns a set virtual CPUs (vCPUs)
– a set of physical CPUs (pCPU) present on the machine
– each vCPU can be mapped on a pCPU for a certain amount of
time, while multiple vCPUs can be mapped on the same pCPU
• We wanted our allocation to cover the whole set of pCPUs, if
possible
116
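A minimal sketch of this decide step, with injected apply/observe callbacks; names and the window length are illustrative, and real code would read the rates from the XeMPower traces:

```python
# Hedged sketch: try candidate resource configurations, observe the average
# IR rate over a time window for each, and keep the best one.
import time

def decide(candidates, apply_config, read_ir_rate, window_s=1.0):
    best_config, best_rate = None, float("-inf")
    for config in candidates:
        apply_config(config)       # e.g., resize the workload's CPU pool
        time.sleep(window_s)       # let the workload adapt to the configuration
        rate = read_ir_rate()      # avg Instructions Retired per second
        if rate > best_rate:
            best_config, best_rate = config, rate
    return best_config

# Toy usage: IR rate saturates once the workload has 3 pCPUs.
rates = {1: 1.0e9, 2: 1.9e9, 3: 2.5e9, 4: 2.5e9}
state = {}
print(decide(list(rates), lambda n: state.update(n=n),
             lambda: rates[state["n"]], window_s=0.0))   # -> 3
```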
117. 3. Extending the hypervisor - RAPL
• Working with the Intel RAPL interface:
– harshly cutting the frequency and the voltage of the whole CPU socket
• On a bare-metal operating system:
– reading and writing data into the right Model Specific Register (MSR)
• MSR_RAPL_POWER_UNIT: read processor-specific time, energy and power
units, used to scale each value read or written
• MSR_PKG_RAPL_POWER_LIMIT: write to set a limit on the power
consumption of the whole socket
• In a virtualized environment:
– the Xen hypervisor does not natively support the RAPL interface
– we developed custom hypercalls, with kernel callback functions and
memory buffers
– we developed a CLI tool that performs some checks on the input parameters, as well as instantiating and invoking the Xen command interface to launch the hypercalls
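For the bare-metal path described above, a minimal sketch of the MSR access through Linux's `msr` module (requires root; the power-limit bit packing for MSR_PKG_RAPL_POWER_LIMIT is omitted):

```python
# Hedged sketch: read RAPL MSRs via /dev/cpu/N/msr.
import os, struct

MSR_RAPL_POWER_UNIT = 0x606      # scaling units for power/energy/time
MSR_PKG_ENERGY_STATUS = 0x611    # cumulative package energy counter

def read_msr(register, cpu=0):
    fd = os.open(f"/dev/cpu/{cpu}/msr", os.O_RDONLY)
    try:
        return struct.unpack("<Q", os.pread(fd, 8, register))[0]
    finally:
        os.close(fd)

units = read_msr(MSR_RAPL_POWER_UNIT)
energy_unit_j = 0.5 ** ((units >> 8) & 0x1F)   # Joules per energy-counter tick
energy_j = read_msr(MSR_PKG_ENERGY_STATUS) * energy_unit_j
print(f"package energy so far: {energy_j:.2f} J")
```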
118. 3. Extending the hypervisor - Resources
• cpupool tool:
– allows clustering the physical CPUs in different pools
– the pool scheduler will schedule the domain’s vCPUs only
on the pCPUs that are part of that cluster
– as a new resource allocation is chosen by the decide phase,
we increase or decrease the number of pCPUs in the pool
– pin the domain’s vCPUs to these, to increase workload
stability
• NO xenpm:
– set a maximum and minimum frequency for each pCPU
– it may interfere with the actuation made by RAPL
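A minimal sketch of this actuation through the `xl` toolstack; pool and domain names are illustrative, and it assumes the pool already exists and that this runs as root in Dom0:

```python
# Hedged sketch: resize a Xen CPU pool and pin a domain's vCPUs onto its pCPUs.
import subprocess

def xl(*args):
    subprocess.run(["xl", *args], check=True)

def resize_pool(pool, pool_cpus, target_size, free_cpus):
    """Add/remove pCPUs until the pool holds target_size of them."""
    while len(pool_cpus) > target_size:
        xl("cpupool-cpu-remove", pool, str(pool_cpus.pop()))
    while len(pool_cpus) < target_size and free_cpus:
        cpu = free_cpus.pop()
        xl("cpupool-cpu-add", pool, str(cpu))
        pool_cpus.append(cpu)

def pin_domain(domain, n_vcpus, pool_cpus):
    """Pin each vCPU to one pool pCPU (round-robin), to increase stability."""
    for vcpu in range(n_vcpus):
        xl("vcpu-pin", domain, str(vcpu), str(pool_cpus[vcpu % len(pool_cpus)]))
```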
121. Motivation - Modeling approaches (2)
OFF-LINE MODELING
• PROS: controllable environment; ad-hoc instrumentation; relies on reasonable simulations
• CONS: does not evolve with the target; requires ex-novo modeling for new targets
ON-LINE MODELING
• PROS: intrinsic ability to evolve with the target; tackles new targets; does not require in-lab phases
• CONS: noisy real-world environment
129. 4. MARC PLATFORM: Scalability
SCALE-IN: INTRA-MODULE PARALLELISM
[Figure: a load balancer dispatching to communication actors, each wrapping the module-specific functional logic]
Technologies: Scala - Akka
130. 4. MARC PLATFORM: Scalability
SCALE-OUT: MODULE DISTRIBUTION
[Figure: the same load-balanced actor module, packaged as a Docker container for distribution]
Technologies: Scala - Akka - Docker