2. My background
RobusT2Scale/FQL4KE [PhD]
✓ Engineering / technical
✓ Auto-scaling in cloud (RobusT2Scale)
✓ Self-learning controller for cloud auto-scaling (FQL4KE)
BO4CO/TL4CO [Postdoc1@Imperial]
✓ Mathematical modeling
✓ Configuration optimization for big data (BO4CO)
✓ Performance-aware DevOps (TL4CO)
Transfer Learning [Postdoc2@CMU]
✓ Empirical
✓ Learns accurate and reliable models from “related” sources
✓ Reuses learning across environmental changes
Software industry [2003-2010]: Pre-PhD
Close collaborations with Intel and Microsoft [PhD]
3 EU projects: MODAClouds (cloud), DICE (big data), Human Brain (clustering) [Postdoc1@Imperial]
1 DARPA project: BRASS (robotics) [Postdoc2@CMU]
4. Cloud applications
• 82% of end-users give up on a lost payment transaction*
• 25% of end-users leave if load time > 4s**
• 1% reduced sales per 100ms of load time**
• 20% reduced income if load time is 0.5s longer***
* JupiterResearch  ** Amazon  *** Google
Causes: flash-crowds, failures, capacity shortage, slow application
[Credit to Cristian Klein, Brownout]
6. Common characteristics of the systems
• Modern systems are increasingly configurable
• Modern systems are deployed in dynamic and uncertain environments
• Modern systems can be adapted on the fly
Hey, You Have Given Me Too Many Knobs!
Understanding and Dealing with Over-Designed Configuration in System Software
Tianyin Xu*, Long Jin*, Xuepeng Fan*‡, Yuanyuan Zhou*,
Shankar Pasupathy† and Rukma Talwadker†
*University of California San Diego, ‡Huazhong Univ. of Science & Technology, †NetApp, Inc
{tixu, longjin, xuf001, yyzhou}@cs.ucsd.edu
{Shankar.Pasupathy, Rukma.Talwadker}@netapp.com
ABSTRACT
Configuration problems are not only prevalent, but also severely impair the reliability of today’s system software. One fundamental reason is the ever-increasing complexity of configuration, reflected by the large number of configuration parameters (“knobs”). With hundreds of knobs, configuring system software to ensure high reliability and performance becomes a daunting, error-prone task.
This paper makes a first step in understanding a fundamental question of configuration design: “do users really need so many knobs?” To provide the quantitative answer, we study the configuration settings of real-world users, including thousands of customers of a commercial storage system (Storage-A), and hundreds of users of two widely-used open-source system software projects. Our study reveals a series of interesting findings to motivate software architects and developers to be more cautious and disciplined in configuration design. Motivated by these findings, we provide a few concrete, practical guidelines which can significantly reduce the configuration space. Take Storage-A as an example, the guidelines can remove 51.9% of its parameters and simplify 19.7% of the remaining ones with little impact on existing users. Also, we study the existing configuration navigation methods in the context of “too many knobs” to understand their effectiveness in dealing with the over-designed configuration, and to provide practices for building navigation support in system software.
Figure 1: The increasing number of configuration parameters with software evolution, shown over release history for Storage-A (2006–2014), MySQL (1999–2014), Apache (1998–2014), and Hadoop MapReduce/HDFS (2006–2014). Storage-A is a commercial storage system from a major storage company in the U.S.
… all the customer-support cases in a major storage company in the U.S., and were the most significant contributor (31%) among all …
[Credit to Tianyin Xu, Too Many Knobs]
11. An example of an auto-scaling rule
These values are required to be determined by users, which:
⇒ requires deep knowledge of the application (CPU, memory)
⇒ requires performance modeling expertise (how to scale)
⇒ requires a unified opinion of the user(s)
Examples: Amazon Auto Scaling, Microsoft Azure Watch, Microsoft Azure Auto-scaling Application Block
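The kind of rule such services ask users to write can be sketched as code. The thresholds, instance bounds, and averaging below are hypothetical values a user would have to pick, which is exactly the expertise burden pointed out above.

```python
# Minimal sketch of a threshold-based auto-scaling rule (hypothetical values;
# real services such as Amazon Auto Scaling express the same idea as config).
def scaling_decision(cpu_samples, instances, min_instances=1, max_instances=10,
                     upper=0.70, lower=0.30):
    """Return the new instance count given recent CPU utilization samples."""
    avg_cpu = sum(cpu_samples) / len(cpu_samples)
    if avg_cpu > upper and instances < max_instances:
        return instances + 1   # scale out: sustained high utilization
    if avg_cpu < lower and instances > min_instances:
        return instances - 1   # scale in: sustained low utilization
    return instances           # otherwise keep capacity unchanged
```

Even this toy version shows the problem: the right values for `upper`, `lower`, and the bounds depend on the application and workload, and different stakeholders may disagree on them.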
36. GP for modeling black box response function
Legend: true function, GP mean, GP variance, observation, selected point, true minimum
… the motivation for Bayesian Optimization here is that it offers a framework where reasoning can be based not only on mean estimates but also the variance, providing more informative decision making. The other reason is that all the computations in this framework are based on tractable linear algebra.
In our previous work [21], we proposed BO4CO that exploits single-task GPs (no transfer learning) for prediction of the posterior distribution of response functions. A GP model is composed by its prior mean (μ(·) : X → ℝ) and a covariance function (k(·,·) : X × X → ℝ) [41]:

y = f(x) ~ GP(μ(x), k(x, x')),   (2)

where covariance k(x, x') defines the distance between x and x'. Let us assume S_{1:t} = {(x_{1:t}, y_{1:t}) | y_i := f(x_i)} be the collection of t experimental data (observations). In this framework, we treat f(x) as a random variable, conditioned on observations S_{1:t}, which is normally distributed with the following posterior mean and variance functions [41]:

μ_t(x) = μ(x) + k(x)^T (K + σ²I)^{-1} (y − μ),   (3)
σ_t²(x) = k(x, x) + σ²I − k(x)^T (K + σ²I)^{-1} k(x),   (4)

where y := y_{1:t}, k(x)^T = [k(x, x_1) k(x, x_2) … k(x, x_t)], μ := μ(x_{1:t}), K := k(x_i, x_j) and I is the identity matrix. The shortcoming of BO4CO is that it cannot exploit observations regarding other versions of the system and therefore cannot be applied in DevOps.
3.2 TL4CO: an extension to multi-tasks
TL4CO uses MTGPs that exploit observations from previous versions of the system under test. Algorithm 1 defines the internal details of TL4CO. As Figure 4 shows, TL4CO is an iterative algorithm that uses the learning from other system versions. In a high-level overview, TL4CO: (i) selects the most informative past observations (details in Section 3.3); (ii) fits a model to existing data based on kernel learning (details in Section 3.4), and (iii) selects the next configuration based on the model (details in Section 3.5).
In the multi-task framework, we use historical data to fit a better GP providing more accurate predictions. Before that, we measure a few sample points based on Latin Hypercube Design…
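The posterior computation in the excerpt can be sketched in a few lines of NumPy. This is a minimal illustration, assuming a zero prior mean μ(x) = 0 and a squared-exponential kernel, and omitting the σ²I noise term on the predictive variance; it is not the BO4CO implementation itself.

```python
import numpy as np

def rbf_kernel(X1, X2, length_scale=1.0):
    """Squared-exponential kernel used as the covariance function k(x, x')."""
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / length_scale**2)

def gp_posterior(X_train, y_train, X_test, noise=0.1, length_scale=1.0):
    """Posterior mean and variance, following Eq. (3)-(4) with mu(x) = 0."""
    K = rbf_kernel(X_train, X_train, length_scale)       # K := k(x_i, x_j)
    K_noisy = K + noise**2 * np.eye(len(X_train))        # K + sigma^2 I
    k_star = rbf_kernel(X_test, X_train, length_scale)   # rows are k(x)^T
    alpha = np.linalg.solve(K_noisy, y_train)            # (K + sigma^2 I)^{-1} y
    mu = k_star @ alpha                                  # posterior mean
    v = np.linalg.solve(K_noisy, k_star.T)
    var = rbf_kernel(X_test, X_test, length_scale).diagonal() \
          - np.sum(k_star * v.T, axis=1)                 # posterior variance
    return mu, var
```

Near observed configurations the variance shrinks toward zero; far from the data it reverts to the prior, which is what makes the variance useful for deciding where to sample next.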
Motivations:
1- mean estimates + variance
2- all computations are linear algebra
3- good estimates even with few data
… internal details about the system; the learning process can be applied in a black-box fashion using the sampled performance measurements. In the GP framework, it is also possible to incorporate domain knowledge as prior, if available, which can enhance the model accuracy [20].
In order to describe the technical details of our transfer learning methodology, let us briefly describe an overview of GP model regression; a more detailed description can be found elsewhere [35]. GP models assume that the function f̂(x) can be interpreted as a probability distribution over functions:

y = f̂(x) ~ GP(μ(x), k(x, x')),   (4)

where μ : X → ℝ is the mean function and k : X × X → ℝ is the covariance function (kernel function) which describes the relationship between response values, y, according to the distance of the input values x, x'. The mean and variance of the GP model predictions can be derived analytically [35]:

μ_t(x) = μ(x) + k(x)^T (K + σ²I)^{-1} (y − μ),   (5)
σ_t²(x) = k(x, x) + σ²I − k(x)^T (K + σ²I)^{-1} k(x),   (6)

where k(x)^T = [k(x, x_1) k(x, x_2) … k(x, x_t)], I is the identity matrix and

K := | k(x_1, x_1) … k(x_1, x_t) |
     |      ⋮        ⋱       ⋮    |   (7)
     | k(x_t, x_1) … k(x_t, x_t) |

GP models have shown to be effective for performance predictions in data-scarce domains [20]. However, as we have demonstrated in Figure 2, they may become inaccurate when the samples do not cover the space uniformly. For highly configurable systems, we require a large number of observations to cover the space uniformly, making GP models ineffective in such situations.
C. Model prediction using transfer learning
In transfer learning, the key question is how to make accurate predictions for the target environment using observations from other sources, Ds. We need a measure of relatedness not only between input configurations but between the sources as well. The relationship between input configurations was captured in the GP models using the covariance matrix that was defined based on the kernel function in Eq. (7). More specifically, a kernel is a function that computes a dot product (a measure of “similarity”) between two input configurations. So the kernel helps to get accurate predictions for similar configurations. We now need to exploit the relationship between the source and target functions, g, f, using the current observations Ds, Dt to build the predictive model f̂. To capture the relationship, we define the following kernel function:

k(f, g, x, x') = kt(f, g) × kxx(x, x'),   (8)

where the kernel kt represents the correlation between source and target function, while kxx is the covariance function for inputs. Typically, kxx is parameterized and its parameters are learned by maximizing the marginal likelihood of the model using a standard method [35]. After learning the parameters of kxx, we construct the covariance matrix exactly the same way as in Eq. (7) and derive the mean and variance of predictions using Eq. (5), (6) with the new K. The main essence of transfer learning is, therefore, the kernel that captures the source and target relationship and provides more accurate predictions using the additional knowledge we can gain via the relationship between source and target.
D. Transfer learning in a self-adaptation loop
Now that we have described the idea of transfer learning for providing more accurate predictions, the question is whether such an idea can be applied at runtime and how self-adaptive systems can benefit from it. More specifically, we now describe the idea of model learning and transfer learning in the context of self-optimization, where the system adapts its configuration to meet performance requirements at runtime. The difference to traditional configurable systems is that we learn the performance model online in a feedback loop under time and resource constraints. Such performance reasoning is done more frequently for self-adaptation purposes.
An overview of a self-optimization solution is depicted in Figure 3 following the well-known MAPE-K framework [9], [23]. We consider the GP model as the K (knowledge) component of this framework that acts as an interface to which other components can query the performance under specific configurations or update the model given a new observation. We use transfer learning to make the knowledge more accurate using observations that are taken from a simulator or any other cheap sources. For deciding how many observations and from what source to transfer, we use the cost model that we have introduced earlier. At runtime, the managed system is Monitored by pulling the end-to-end performance metrics (e.g., latency, throughput) from the corresponding sensors. Then, the retrieved performance data needs to be Analysed, and the mean performance associated with a specific setting of the system will be stored in a data repository. Next, the GP model needs to be updated taking into account the new performance observation. Having updated the GP model, a new configuration may be Planned to replace the current configuration. Finally, the new configuration will be enacted by Executing appropriate platform-specific operations. This enables model-based knowledge evolution using machine learning [2], [21]. The underlying GP model can now be updated not only when a new observation is available but also by transferring the learning from other related sources. So at each adaptation cycle, we can update our belief about the correct response given data from the managed system and other related sources, accelerating the learning process.
IV. EXPERIMENTAL RESULTS
We evaluate the effectiveness and applicability of our transfer learning approach for learning models for highly-configurable systems, in particular, compared to conventional non-transfer learning. Specifically, we aim to answer the following three research questions:
RQ1: How much does transfer learning improve the prediction accuracy?
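The product kernel of Eq. (8) can be sketched as a Kronecker-structured joint covariance over (task, configuration) pairs. The inter-task matrix `Kt` and the input kernel below are illustrative stand-ins, not the learned quantities from the paper.

```python
import numpy as np

def input_kernel(X1, X2, length_scale=1.0):
    """kxx: squared-exponential covariance over input configurations."""
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / length_scale**2)

def multitask_covariance(Kt, X):
    """Joint covariance for Eq. (8): the entry for (task i, x) vs (task j, x')
    is kt(i, j) * kxx(x, x'), i.e. the Kronecker product Kt (x) Kxx."""
    return np.kron(Kt, input_kernel(X, X))
```

With two tasks (source and target) and t inputs this yields a 2t × 2t matrix; conditioning on the stacked source-and-target observations with this joint K then proceeds exactly as in Eqs. (5)–(6).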
44. Transfer learning improves optimization
(a) 3 sample response functions (1), (2), (3) over the configuration domain, with observations
(b) GP fit for (1) ignoring observations for (2), (3): LCB not informative
(c) multi-task GP fit for (1) by transfer learning from (2), (3): highly informative
Legend: GP prediction mean, GP prediction variance, probability distribution of the minimizers
65. Transferable knowledge
Source (Given) -> Target (Learn): extract transferable knowledge from the source’s data and models, and reuse it in the target.
II. INTUITION
Understanding the performance behavior of configurable software systems can enable (i) performance debugging, (ii) performance tuning, (iii) design-time evolution, or (iv) runtime adaptation [11]. We lack empirical understanding of how the performance behavior of a system will vary when the environment of the system changes. Such empirical understanding will provide important insights to develop faster and more accurate learning techniques that allow us to make predictions and optimizations of performance for highly configurable systems in changing environments [10]. For instance, we can learn performance behavior of a system on a cheap hardware in a controlled lab environment and use that to understand the performance behavior of the system on a production server before shipping to the end user. More specifically, we would like to know what the relationship is between the performance of a system in a specific environment (characterized by software configuration, hardware, workload, and system version) and the one where we vary its environmental conditions.
In this research, we aim for an empirical understanding of…
A. Preliminary concepts
In this section, we provide formal definitions of the concepts that we use throughout this study. The formal notations enable us to concisely convey the concepts throughout the paper.
1) Configuration and environment space: Let Fi indicate the i-th feature of a configurable system A, which is either enabled or disabled and one of them holds by default. The configuration space is mathematically a Cartesian product of all the features C = Dom(F1) × ··· × Dom(Fd), where Dom(Fi) = {0, 1}. A configuration of a system is then a member of the configuration space (feature space) where all the parameters are assigned to a specific value in their range (i.e., complete instantiations of the system’s parameters). We also describe an environment instance by 3 variables e = [w, h, v] drawn from a given environment space E = W × H × V, where they respectively represent sets of possible values for workload, hardware and system version.
2) Performance model: Given a software system A with configuration space F and environmental instances E, a performance model is a black-box function f : F × E → ℝ given some observations of the system performance for each combination of system’s features x ∈ F in an environment e ∈ E. To construct a performance model for a system A with configuration space F, we run A in environment instance e ∈ E on various combinations of configurations xi ∈ F, and record the resulting performance values yi = f(xi) + εi, xi ∈ F, where εi ~ N(0, σi). The training data for our regression models is then simply Dtr = {(xi, yi)} for i = 1..n. In other words, a response function is simply a mapping from the input space to a measurable performance metric that produces interval-scaled data (here we assume it produces real numbers).
3) Performance distribution: For the performance model, we measured and associated the performance response to each configuration; now let us introduce another concept where we vary the environment and we measure the performance. An empirical performance distribution is a stochastic process, pd : E → Δ(ℝ), that defines a probability distribution over performance measures for each environmental condition.
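The definitions above can be made concrete with a small sketch: enumerating the binary configuration space C and recording noisy measurements to form the training set Dtr. The response function here is a made-up stand-in with one feature interaction, not a measured system.

```python
import itertools
import random

def configuration_space(d):
    """C = Dom(F1) x ... x Dom(Fd) with Dom(Fi) = {0, 1}."""
    return list(itertools.product([0, 1], repeat=d))

def build_training_data(f, configs, n, noise=0.0, seed=0):
    """Sample n configurations and record y_i = f(x_i) + eps_i,
    giving Dtr = {(x_i, y_i)} for regression."""
    rng = random.Random(seed)
    sample = rng.sample(configs, n)
    return [(x, f(x) + rng.gauss(0, noise)) for x in sample]

# Hypothetical response function: options x1 and x3 interact.
def response(x):
    return 10 + 5 * x[0] + 3 * x[2] + 7 * x[0] * x[2]
```

Even this toy example shows why sampling matters: with d options there are 2^d configurations, so Dtr can only ever cover a small fraction of C for realistic d.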
Extract f(·, e_s) -> Reuse f(·, e_t)
We hypothesize that we can learn and transfer different forms of knowledge across environments, while so far only simple transfers have been attempted!
68. Subject systems
TABLE I: Overview of the real-world subject systems.
System  Domain         d   |C|     |H|  |W|  |V|
SPEAR   SAT solver     14  16,384  3    4    2
x264    Video encoder  16  4,000   2    3    3
SQLite  Database       14  1,000   2    14   2
SaC     Compiler       50  71,267  1    10   1
d: configuration options; C: configurations; H: hardware environments; W: analyzed workloads; V: analyzed versions.
IV. PERFORMANCE BEHAVIOR CONSISTENCY (RQ1)
Here, we investigate the relatedness of the source and target
environments in the entire configuration space. We start by
testing the strongest assumptions about the relatedness of
69. Level of relatedness between source and target is important
Fig. 6: Prediction accuracy (absolute percentage error [%]) of the model learned with samples from different sources of different relatedness to the target. GP is the model without transfer learning.
Source       s      s1     s2     s3     s4     s5     s6
noise-level  0      5      10     15     20     25     30
corr. coeff. 0.98   0.95   0.89   0.75   0.54   0.34   0.19
μ(pe)        15.34  14.14  17.09  18.71  33.06  40.93  46.75
• Model becomes more accurate when the source is more related to the target
• Even learning from a source with a small correlation is better than no transfer
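A rough way to reproduce the “corr. coeff.” row of Fig. 6 is to correlate source and target responses measured on the same configurations; the synthetic responses below are purely illustrative, not the paper’s data.

```python
import numpy as np

def relatedness(source_y, target_y):
    """Pearson correlation of source/target responses measured on the
    same configurations - the 'corr. coeff.' row in Fig. 6."""
    return np.corrcoef(source_y, target_y)[0, 1]

# Illustrative: a mildly perturbed source stays highly correlated with
# the target, while heavy noise drives the correlation toward zero.
rng = np.random.default_rng(1)
target = rng.normal(size=200)
close_source = target + rng.normal(scale=0.1, size=200)
far_source = target + rng.normal(scale=3.0, size=200)
```

This mirrors the slide’s takeaway: the more related (higher-correlation) the source, the lower the prediction error of the transferred model.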
70. [Figure: six heatmaps (a)–(f) of CPU usage [%] over the number of particles × number of refinements configuration space]
Less related -> less accurate; more related -> more accurate
79. Implications for transfer learning
• When and why TL works for performance modeling:
• Small environmental changes -> performance behavior is consistent
• A linear transformation of performance models provides a good approximation
• Large environmental changes -> individual options and interactions may stay consistent
• A non-linear mapping between performance behavior across environments
• Severe environmental changes -> transferable knowledge can still be found
• Invalid configurations provide opportunities for avoiding measurements
• Intuitive judgments about the transferability of knowledge
• Without deep knowledge about the configuration or implementation
81. Future research opportunities
• Sampling strategies
• More informative samples
• Exploiting the importance of specific regions or avoiding invalid regions
• Learning mechanisms
• Learning either linear or non-linear associations
• Performance testing and debugging
• Transferring interesting test cases that cover interactions between options
• Performance tuning and optimization
• Identifying the interacting options
• Importance sampling exploiting feature interactions
82. Selecting from multiple sources
Source Simulator -> Target Simulator; Source Robot -> Target Robot (transfer costs C1, C2, C3)
- Different costs are associated with the sources
- The problem is to take samples from the appropriate source to gain the most information given a limited budget
86. Recap of my previous work
RobusT2Scale/FQL4KE [PhD]
✓ Engineering / technical
✓ Maintains application responsiveness
✓ Handles environmental uncertainties
✓ Enables knowledge evolution
✗ Learns slowly when the situation changes
BO4CO/TL4CO [Postdoc1@Imperial]
✓ Mathematical modeling
✓ Finds optimal configuration given a measurement budget
✓ Step-wise active learning
✗ Doesn’t scale well to high dimensions
✗ Expensive to learn
Transfer Learning [Postdoc2@CMU]
✓ Empirical
✓ Learns accurate and reliable models from “related” sources
✓ Reuses learning across environmental changes
✗ For severe environmental changes, transfer is limited, but possible!
Goal
✔ Industry-relevant research
✔ Learn accurate/reliable/cheap performance models
✔ Learned model is used for performance tuning/debugging/optimization/runtime adaptation
[SEAMS14,QoSA16,CCGrid17, TAAS] [MASCOTS16,WICSA16] [SEAMS17]