Transfer Learning for Optimal Configuration of Big Data Software
Pooyan Jamshidi, Giuliano Casale, Salvatore Dipietro
Imperial College London
p.jamshidi@imperial.ac.uk
Department of Computing
Research Associate Symposium
14th June 2016
Motivation
1- Many different parameters =>
- large state space
- interactions
2- Defaults are typically used =>
- poor performance
Motivation
[Figure: histograms of observed average read latency (µs) across configurations, (a) cass-20 and (b) cass-10, with the best and worst configurations marked]
Experiments on Apache Cassandra:
- 6 parameters, 1024 configurations
- Average read latency
- 10 million records (cass-10)
- 20 million records (cass-20)
Goal!
Each Xi is a configuration parameter, which ranges in a finite domain Dom(Xi). In general, Xi may either indicate (i) an integer variable, such as the level of parallelism, or (ii) a categorical variable, such as the messaging framework, or a Boolean variable. We use the terms parameter and factor interchangeably; also, with the term option we refer to the possible values that can be assigned to a parameter.

We assume that each configuration x ∈ X in the configuration space X = Dom(X1) × ··· × Dom(Xd) is valid, i.e., the system accepts this configuration and the corresponding test results in a stable performance behavior. The response with configuration x is denoted by f(x). Throughout, we assume f(·) is latency; however, other metrics for response may be used. We here consider the problem of finding an optimal configuration x* that globally minimizes f(·) over X:

x* = arg min_{x ∈ X} f(x)    (1)

In fact, the response function f(·) is usually unknown or partially known, i.e., yi = f(xi), xi ⊂ X. In practice, such measurements may contain noise, i.e., yi = f(xi) + ε. The determination of the optimal configuration is thus a black-box optimization program subject to noise [27, 33], which is considerably harder than deterministic optimization. A popular solution is based on sampling that starts with a number of sampled configurations. The performance of the experiments associated with these initial samples can deliver a better understanding of f(·) and guide the generation of the next round of samples. If properly guided, the process of sample generation-evaluation-feedback-regeneration will eventually converge and the optimal configuration will be located. However, a sampling-based approach of this kind can be highly expensive in terms of time or cost (e.g., rental of resources), considering that a function evaluation could be costly and the optimization process may require hundreds of samples to converge.
Partially known
Measurements contain noise
Configuration space
Finding the optimum configuration is difficult!
[Figure: latency (ms) vs. number of counters for splitters=2 and splitters=3, and the corresponding response surface of latency (ms) over number of splitters and number of counters; cubic interpolation over a finer grid]
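To make Eq. (1) concrete, here is a minimal sketch of the problem it states: a small hypothetical configuration space, a synthetic noisy response standing in for real latency measurements, and the exhaustive search that Eq. (1) asks for (and that is exactly what becomes too expensive at realistic scale):

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical configuration space: X = Dom(X1) x Dom(X2) (e.g., splitters, counters).
space = list(itertools.product(range(1, 7), range(1, 19)))

def f(x):
    """Synthetic stand-in for the unknown latency response; the real f(x) is a measurement."""
    s, c = x
    return 100 + 20 * (s - 3) ** 2 + 5 * (c - 12) ** 2 / c

def measure(x):
    """y_i = f(x_i) + eps: every observation carries measurement noise."""
    return f(x) + rng.normal(0, 5.0)

# x* = arg min_{x in X} f(x): exhaustive search over all configurations,
# averaging repeated noisy measurements; feasible only for tiny spaces.
means = {x: np.mean([measure(x) for _ in range(5)]) for x in space}
x_star = min(means, key=means.get)
print("estimated optimum:", x_star, "mean latency:", round(means[x_star], 1))
```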
Related work. Several approaches have attempted to address the above problem. Recursive Random Sampling (RRS) [43] integrates a restart mechanism into the random sampling to achieve high search efficiency. Smart Hill Climbing (SHC) [42] integrates importance sampling with Latin Hypercube Design. […] As a result, the learning is limited to the current observations and it still requires hundreds of sample evaluations. In this paper, we propose to adopt a transfer learning method to deal with the search efficiency in configuration tuning. Rather than starting the search from scratch, the approach transfers the learned knowledge coming from similar versions of the software to accelerate the sampling process in the current version. This idea is inspired by several observations arising in real software engineering practice [2, 3]. For instance, (i) in DevOps, different versions of a system are delivered continuously; (ii) Big Data systems are developed using similar frameworks (e.g., Apache Hadoop, Spark, Kafka) and run on similar platforms (e.g., cloud clusters); and (iii) different versions of a system often share a similar business logic.

To the best of our knowledge, only one study [9] explores the possibility of transfer learning in system configuration. The authors learn a Bayesian network in the tuning process of a system and reuse this model for tuning other similar systems. However, the learning is limited to the structure of the Bayesian network. In this paper, we introduce a method that reuses not only a model that has been learned previously but also the valuable raw data. Therefore, we are not limited to the accuracy of the learned model. Moreover, we do not consider Bayesian networks and instead focus on MTGPs.

2.4 Motivation
A motivating example. We now illustrate the previous points on an example. WordCount (cf. Figure 1) is a popular benchmark [12]. WordCount features a three-layer architecture that counts the number of words in the incoming stream. A Processing Element (PE) of type Spout reads the input messages from a data source and pushes them to the system. A PE of type Bolt named Splitter is responsible for splitting sentences, which are then counted by the Counter.

Figure 1: WordCount architecture. A Kafka Spout streams sentences from a Kafka topic (fed from a file) to the Splitter Bolt, which emits words to the Counter Bolt, producing counts such as [paintings, 3], [poems, 60], [letter, 75].
Response surface is:
- Non-linear
- Non-convex
- Multi-modal
- Measurements subject to noise
Bayesian Optimization for Configuration Optimization (BO4CO)
Code: https://github.com/dice-project/DICE-Configuration-BO4CO
Data: https://zenodo.org/record/56238
Paper: P. Jamshidi, G. Casale, "An Uncertainty-Aware Approach to Optimal Configuration of Stream Processing Systems", MASCOTS 2016. https://arxiv.org/abs/1606.06543
GP for modeling black-box response function
[Figure: a 1D GP model, showing the true function, GP mean, GP variance, observations, the selected point, and the true minimum]
In previous work [21], we proposed BO4CO, which exploits single-task GPs (no transfer learning) for prediction of the distribution of response functions. A GP model is composed by its prior mean (µ(·) : X → ℝ) and a covariance function (k(·,·) : X × X → ℝ) [41]:

y = f(x) ~ GP(µ(x), k(x, x')),    (2)

where the covariance k(x, x') defines the distance between x and x'. Let S_1:t = {(x_1:t, y_1:t) | y_i := f(x_i)} be the collection of t experimental data (observations). In this framework, we treat f(x) as a random variable, conditioned on observations S_1:t, which is normally distributed with the following posterior mean and variance functions [41]:

µ_t(x) = µ(x) + k(x)ᵀ(K + σ²I)⁻¹(y − µ)    (3)
σ²_t(x) = k(x, x) + σ²I − k(x)ᵀ(K + σ²I)⁻¹k(x)    (4)

where y := y_1:t, k(x)ᵀ = [k(x, x_1) k(x, x_2) … k(x, x_t)], µ := µ(x_1:t), K := k(x_i, x_j), and I is the identity matrix. The shortcoming of BO4CO is that it cannot exploit the observations regarding other versions of the system and therefore cannot be applied in DevOps.

3.2 TL4CO: an extension to multi-tasks
TL4CO uses MTGPs that exploit observations from other, previous versions of the system under test. Algorithm 1 defines the internal details of TL4CO. As Figure 4 shows, TL4CO is an iterative algorithm that uses the learning from other system versions. In a high-level overview, TL4CO: (i) selects the most informative past observations (details in Section 3.3); (ii) fits a model to existing data based on kernel learning (details in Section 3.4); and (iii) selects the next …
Motivations:
1- mean estimates + variance
2- all computations are linear algebra
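Both estimates reduce to linear algebra, as the slide notes. Here is a minimal numpy sketch of the posterior mean and variance in Eqs. (3) and (4), with a squared-exponential kernel and a zero prior mean as illustrative choices rather than the paper's exact settings:

```python
import numpy as np

def k_se(a, b, length=0.5, amp=1.0):
    """Squared-exponential covariance k(x, x'): an illustrative kernel choice."""
    return amp * np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / length ** 2)

def gp_posterior(x_train, y_train, x_test, noise=0.1):
    """Eqs. (3)-(4): posterior mean and variance with prior mean mu(x) := 0."""
    K = k_se(x_train, x_train) + noise ** 2 * np.eye(len(x_train))
    k_s = k_se(x_train, x_test)              # k(x) for every test point
    alpha = np.linalg.solve(K, y_train)      # (K + s^2 I)^{-1} (y - mu)
    mu = k_s.T @ alpha                       # Eq. (3)
    v = np.linalg.solve(K, k_s)
    var = np.diag(k_se(x_test, x_test)) + noise ** 2 - np.sum(k_s * v, axis=0)  # Eq. (4)
    return mu, var

x = np.array([-1.0, -0.3, 0.4, 1.2])
y = np.sin(3 * x)                            # toy observations
mu, var = gp_posterior(x, y, np.linspace(-1.5, 1.5, 7))
print(np.round(mu, 2), np.round(np.sqrt(var), 2))
```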
[Fig. 5: An example of 1D GP model: GPs provide mean estimates as well as the uncertainty in estimations, i.e., variance. Legend: true function, GP surrogate mean estimate, and observations x1..x4.]
[Figure: BO4CO architecture. A Configuration Optimisation Tool holds the GP model and a performance repository; a data broker (Kafka) connects it to an experimental suite (deployment service, monitoring, data preparation) running on a testbed; a workload generator and tester drive the system under test (Storm, Cassandra, Spark) through a technology interface, given the configuration parameters, experiment time, and polling interval.]
Algorithm 1: BO4CO
Input: configuration space X, maximum budget N_max, response function f, kernel function K_θ, hyper-parameters θ, design sample size n, learning cycle N_l
Output: optimal configuration x* and learned model M
1: choose an initial sparse design (lhd) to find initial design samples D = {x_1, …, x_n}
2: obtain performance measurements of the initial design, y_i ← f(x_i) + ε_i, ∀x_i ∈ D
3: S_1:n ← {(x_i, y_i)}_{i=1..n}; t ← n + 1
4: M(x|S_1:n, θ) ← fit a GP model to the design    ▷ Eq. (3)
5: while t ≤ N_max do
6:   if (t mod N_l = 0), θ ← learn the kernel hyper-parameters by maximizing the likelihood
7:   find the next configuration x_t by optimizing the selection criteria over the estimated response surface given the data, x_t ← arg max_x u(x|M, S_1:t−1)    ▷ Eq. (9)
8:   obtain performance for the new configuration x_t, y_t ← f(x_t) + ε_t
9:   augment the configuration S_1:t = {S_1:t−1, (x_t, y_t)}
10:  M(x|S_1:t, θ) ← re-fit a new GP model    ▷ Eq. (7)
11:  t ← t + 1
12: end while
13: (x*, y*) = min S_1:N_max
14: M(x)
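A compact, self-contained sketch of Algorithm 1's loop on a 1D toy problem (our re-implementation, not the authors' Matlab code): a random initial design stands in for the Latin Hypercube step, a fixed-hyper-parameter GP stands in for the kernel learning of step 6, and LCB drives the selection of step 7:

```python
import numpy as np

rng = np.random.default_rng(1)
X = np.linspace(0, 1, 101)                           # discrete configuration space
f = lambda x: (6 * x - 2) ** 2 * np.sin(12 * x - 4)  # hidden response (Forrester function)
measure = lambda x: f(x) + rng.normal(0, 0.1)        # y = f(x) + eps

def gp(xs, ys, xq, noise=0.1, length=0.15):
    k = lambda a, b: np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / length ** 2)
    K = k(xs, xs) + noise ** 2 * np.eye(len(xs))
    ks = k(xs, xq)
    mu = ks.T @ np.linalg.solve(K, ys)
    var = 1.0 + noise ** 2 - np.sum(ks * np.linalg.solve(K, ks), axis=0)
    return mu, np.sqrt(np.maximum(var, 1e-12))

# Steps 1-3: initial design and measurements (random here instead of lhd).
xs = rng.choice(X, size=4, replace=False)
ys = np.array([measure(x) for x in xs])

for t in range(20):                                  # steps 5-12: budget N_max = 20
    mu, sd = gp(xs, ys, X)
    kappa = 2.0                                      # fixed exploration weight (illustrative)
    x_next = X[np.argmin(mu - kappa * sd)]           # step 7: LCB selection, Eq. (10)
    xs = np.append(xs, x_next)                       # steps 8-9: measure and augment
    ys = np.append(ys, measure(x_next))

best = np.argmin(ys)                                 # step 13: (x*, y*) = min S
print("best configuration:", round(xs[best], 3), "observed:", round(ys[best], 3))
```

With 4 initial samples plus 20 iterations, this sketch typically lands near the Forrester minimum at x ≈ 0.757, using far fewer evaluations than the full 101-point sweep.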
The covariance k(x, x') defines the distance between x and x'. Let S_1:t = {(x_1:t, y_1:t) | y_i := f(x_i)} be the collection of observations. The function values are drawn from a multivariate Gaussian distribution N(µ, K), where µ := µ(x_1:t) and

K := [ k(x_1, x_1) … k(x_1, x_t); … ; k(x_t, x_1) … k(x_t, x_t) ]    (4)

In the while loop in BO4CO, given the observations we have accumulated so far, we intend to fit a new GP model:

[ f_1:t ; f_t+1 ] ~ N( µ, [ K + σ²I, k ; kᵀ, k(x_t+1, x_t+1) ] ),    (5)

where k(x)ᵀ = [k(x, x_1) k(x, x_2) … k(x, x_t)] and I is the identity matrix. Given Eq. (5), the new GP model can be drawn from this new Gaussian distribution:

Pr(f_t+1 | S_1:t, x_t+1) = N(µ_t(x_t+1), σ²_t(x_t+1)),    (6)
µ_t(x) = µ(x) + k(x)ᵀ(K + σ²I)⁻¹(y − µ)    (7)
σ²_t(x) = k(x, x) + σ²I − k(x)ᵀ(K + σ²I)⁻¹k(x)    (8)

These posterior functions are used to select the next point x_t+1, as detailed in Section III-C.

Configuration selection criteria. The selection criterion is defined as a function u : X → ℝ that selects the x ∈ X at which f(·) should be evaluated next (step 7):
[Fig. 7: Change of the κ value over time: it begins with a small value to exploit the mean estimates and increases over time in order to explore.]
BO4CO uses the Lower Confidence Bound (LCB) [25]. LCB minimizes the distance to the optimum by balancing exploitation of the mean against exploration via the variance:

u_LCB(x|M, S_1:n) = arg min_{x ∈ X} µ_t(x) − κσ_t(x),    (10)

where κ can be set according to the objectives. For instance, if we require to find a near-optimal configuration quickly, we set a low value for κ to take the most out of the initial design knowledge. However, if we are looking for a globally optimal configuration, we can set a high value in order to skip local minima. Furthermore, κ can be adapted over time to benefit …
… different scales {θ_ℓ, ℓ = 1…d} on different dimensions, as suggested in [31], [25]; this technique is called Automatic Relevance Determination (ARD). After learning the hyper-parameters (step 6), if the ℓ-th dimension turns out to be irrelevant, then θ_ℓ will be a small value and the dimension will be discarded. This is particularly helpful in high-dimensional spaces, where it is difficult to find the optimal configuration.

2) Prior mean function: While the kernel controls the structure of the estimated function, the prior mean µ(·) : X → ℝ provides a possible offset for our estimation. By default, this function is set to a constant µ(x) := µ, which is inferred from the observations [25]. However, the prior mean function is a way of incorporating expert knowledge, if it is available. Since we have collected extensive experimental data, based on our datasets (cf. Table I) we observe that, for Big Data systems, there is a significant distance between the minimum and the maximum of each function (cf. Figure 2). Therefore, a linear mean function µ(x) allows for more flexible structures and provides a better fit to the data than a constant mean. We only need to learn a slope for each dimension and an offset.

3) Learning parameters: marginal likelihood … gives greater flexibility in modeling such functions [31]. The following is a variation of the Matérn kernel for ν = 1/2:

k_{ν=1/2}(x_i, x_j) = θ₀² exp(−r),    (11)

where r²(x_i, x_j) = (x_i − x_j)ᵀ Λ (x_i − x_j) for some positive semidefinite matrix Λ. For categorical variables, we implemented the following [12]:

k_θ(x_i, x_j) = exp(Σ_{ℓ=1..d} (−θ_ℓ δ(x_iℓ ≠ x_jℓ))),    (12)

From an implementation perspective, BO4CO consists of three major components: (i) an optimization component (left part of Figure 6) and (ii) an experimental suite (right part of Figure 6), integrated via (iii) a data broker. The first component implements the model (re-)fitting (3) and optimization (9) steps in Algorithm 1 and is implemented in Matlab 2015b. The experimental suite component provides facilities for automated deployment of topologies, performance measurements, and data preparation, and is …
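A sketch of the two kernels as we read Eqs. (11) and (12): a Matérn ν = 1/2 kernel with a diagonal (ARD-style) Λ for numeric parameters, and the Kronecker-delta kernel for categorical ones. This is illustrative code, not the authors' Matlab implementation:

```python
import numpy as np

def matern12(xi, xj, theta0=1.0, scales=None):
    """Eq. (11): k(xi, xj) = theta0^2 * exp(-r), r^2 = (xi-xj)^T Lambda (xi-xj)."""
    d = np.asarray(xi, float) - np.asarray(xj, float)
    lam = np.ones(d.size) if scales is None else np.asarray(scales)  # diag(Lambda), ARD scales
    r = np.sqrt(d @ (lam * d))
    return theta0 ** 2 * np.exp(-r)

def categorical(xi, xj, theta):
    """Eq. (12): exp(sum_l -theta_l * delta(x_il != x_jl)); a tiny theta_l marks an irrelevant dimension."""
    mismatch = np.array([a != b for a, b in zip(xi, xj)], float)  # Kronecker delta
    return np.exp(-np.sum(np.asarray(theta) * mismatch))

print(matern12([4, 8], [6, 8], scales=[0.5, 0.1]))
print(categorical(["kafka", "ack_on"], ["rabbitmq", "ack_on"], theta=[1.2, 0.3]))
```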
Acquisition function:
Kernel function:
[Figure: one iteration over a 1D configuration domain, showing the true response function and the current GP fit; the selection criteria evaluated over the domain and the new selected point; and the new GP fit after measuring it.]
Correlations across different versions
[Figure: latency (ms) response surfaces over number of splitters and number of counters for (a) WordCount v1, (b) WordCount v2, (c) WordCount v3, and (d) WordCount v4; (e) Pearson correlation coefficients; (f) Spearman correlation coefficients; (g) measurement noise across WordCount versions, showing latency shifts under hardware and software changes]
Table 1: Pearson correlation coefficients (upper triangle) and p-values (lower triangle) across WordCount versions

      v1        v2     v3        v4
v1    1         0.41   -0.46     -0.50
v2    7.36E-06  1      -0.20     -0.18
v3    6.92E-07  0.04   1         0.94
v4    2.54E-08  0.07   1.16E-52  1

Table 2: Spearman correlation coefficients (upper triangle) and p-values (lower triangle) across WordCount versions

      v1        v2     v3        v4
v1    1         0.49   -0.51     -0.51
v2    5.50E-08  1      -0.2793   -0.24
v3    1.30E-08  0.003  1         0.88
v4    1.40E-08  0.01   8.30E-36  1

Table 3: Measurement noise across WordCount versions

ver.  µ
v1    516.59   7.96    64.88
v2    584.94   2.58    226.32
v3    654.89   13.56   48.30
v4    1125.81  16.92   66.56
- Different correlations
- Different optimal configurations
- Different noise levels
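Coefficients like those in Tables 1 and 2 can be computed directly from the per-configuration latencies of two versions. A sketch with SciPy; the latency arrays here are hypothetical stand-ins for the measured responses:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

# Hypothetical: latency of the same 1024 configurations under versions v1 and v2.
v1 = rng.gamma(shape=4.0, scale=50.0, size=1024)
v2 = 0.4 * v1 + rng.normal(0, 40.0, size=1024)   # a partially correlated version

r, p_r = stats.pearsonr(v1, v2)       # linear correlation (as in Table 1)
rho, p_rho = stats.spearmanr(v1, v2)  # rank correlation (as in Table 2)
print(f"Pearson r={r:.2f} (p={p_r:.1e}), Spearman rho={rho:.2f} (p={p_rho:.1e})")
```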
DevOps
- Different versions are continuously delivered (on a daily basis).
- Big Data systems are developed using similar frameworks (Apache Storm, Spark, Hadoop, Kafka, etc.).
- Different versions share similar business logic.
Solution: Transfer Learning for Configuration Optimization
[Figure: TL4CO workflow. Each version runs the loop initial design → model fit → next experiment → model update until the budget is finished; performance measurements of a previous version (j=M) are stored in a performance repository, then filtered and selected as training data, together with the GP model hyper-parameters, for optimizing the current version (j=N).]
Transfer Learning for Configuration Optimization (TL4CO)
Code: https://github.com/dice-project/DICE-Configuration-TL4CO
[Figure 3: An example of 1-dimensional GP model. Legend: true function, GP mean, GP variance, observation, selected point, true minimum.]
Algorithm 1: TL4CO
Input: configuration space X, number of historical configuration optimization datasets T, maximum budget N_max, response function f, kernel function K, hyper-parameters θ_j, the current dataset S^{j=1}, datasets belonging to other versions S^{j=2:T}, diagonal noise matrix Σ, design sample size n, learning cycle N_l
Output: optimal configuration x* and learned model M
1: D = {x_1, …, x_n} ← create an initial sparse design (lhd)
2: obtain the performance ∀x_i ∈ D, y_i ← f(x_i) + ε_i
3: S^{j=2:T} ← select m points from other versions (j = 2:T) that reduce entropy    ▷ Eq. (8)
4: S¹_1:n ← S^{j=2:T} ∪ {(x_i, y_i)}_{i=1..n}; t ← n + 1
5: M(x|S¹_1:n, θ_j) ← fit a multi-task GP to D    ▷ Eq. (6)
6: while t ≤ N_max do
7:   if (t mod N_l = 0) then [θ ← learn the kernel hyper-parameters by maximizing the likelihood]
8:   determine the next configuration x_t by optimizing LCB over M, x_t ← arg max_x u(x|M, S¹_1:t−1)    ▷ Eq. (11)
9:   obtain the performance of x_t, y_t ← f(x_t) + ε_t
10:  augment the configuration S¹_1:t = S¹_1:t−1 ∪ {(x_t, y_t)}
11:  M(x|S¹_1:t, θ) ← re-fit a new GP model    ▷ Eq. (6)
12:  t ← t + 1
13: end while
14: (x*, y*) = min[S¹_1:N_max \ S^{j=2:T}]    ▷ locate the optimum configuration by discarding the data of other system versions
15: M(x)    ▷ store the learned model
… (x^j_1:t), where x_i are the configurations of different versions for which we observed the response y^j_i. The version of the system (task) is defined as j, which contains t_j observations. Without loss of generality, we always assume j = 1 is the current version under test. In order to associate the observations (x^j_i, y^j_i) to task j, a label l_j = j is used. To transfer the learning from previous tasks to the current one, the MTGP framework allows us to use the multiplication of two independent kernels [5]:

k_TL4CO(l, l', x, x') = k_t(l, l') × k_xx(x, x'),    (5)

where the kernels k_t and k_xx represent the correlation between different versions and the covariance function for a particular version, respectively. Note that k_t depends only on the labels pointing to the versions for which we have some observations, and k_xx depends only on the configuration x. This allows, in particular, fast learning of the hyper-parameters by exploiting the similarity between measurements. Similarly to (3), (4), the mean and variance of the MTGP are [34]:

µ_t(x) = µ(x) + kᵀ(K(X, X) + Σ)⁻¹(y − µ)    (6)
σ²_t(x) = k(x, x) + σ² − kᵀ(K(X, X) + Σ)⁻¹k,    (7)

where y := [f_1, …, f_T]ᵀ is the measured performance regarding all tasks, X = [X_1, …, X_T] are the corresponding configurations, Σ = diag[σ²_1, …, σ²_T] is a diagonal noise matrix where each noise term is repeated as many times as the number of historical observations represented by t_1, …, t_T, and k := k(X, x). This is a conditional estimate at a desired configuration given the historical datasets for previous system versions and their respective GP hyper-parameters. An illustration of a multi-task GP versus a single-task GP, and its effect on providing a more accurate model, is given in Figure 5.
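To show how Eqs. (5)-(7) fit together, here is a small numpy sketch of a two-task GP: the product kernel k_t(l, l') × k_xx(x, x') over (label, configuration) pairs, and the conditional mean and variance at a configuration of the current task. The task-correlation matrix, data, and noise levels are all hypothetical, and we fix them rather than learning them by maximizing the likelihood as the paper does:

```python
import numpy as np

Kt = np.array([[1.0, 0.8],     # k_t: assumed correlation between version 1 (current)
               [0.8, 1.0]])    #      and version 2 (historical)

def kxx(a, b, length=0.2):
    """k_xx: covariance over configurations (squared-exponential, illustrative)."""
    return np.exp(-0.5 * (a - b) ** 2 / length ** 2)

def k_tl4co(l1, x1, l2, x2):
    """Eq. (5): k((l, x), (l', x')) = k_t(l, l') * k_xx(x, x')."""
    return Kt[l1, l2] * kxx(x1, x2)

# Historical observations from version 2 (label 1) plus a few from the current version (label 0).
labels = np.array([1, 1, 1, 1, 0, 0])
xs = np.array([0.1, 0.35, 0.6, 0.9, 0.2, 0.8])
ys = np.array([1.3, 0.4, 0.7, 1.6, 1.1, 1.5])
noise = np.array([0.2, 0.2, 0.2, 0.2, 0.1, 0.1])  # per-task noise: the Sigma diagonal

K = np.array([[k_tl4co(li, xi, lj, xj) for lj, xj in zip(labels, xs)]
              for li, xi in zip(labels, xs)]) + np.diag(noise ** 2)

def predict(x):
    """Eqs. (6)-(7) with prior mean mu := 0 (noise-free predictive variance)."""
    k = np.array([k_tl4co(0, x, lj, xj) for lj, xj in zip(labels, xs)])
    return k @ np.linalg.solve(K, ys), k_tl4co(0, x, 0, x) - k @ np.linalg.solve(K, k)

mu, var = predict(0.5)
print(f"mu={mu:.3f}, sigma={np.sqrt(var):.3f}")
```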
Multi-task GP vs single-task GP
[Figure 5: (a) three sample response functions over the configuration domain; (b) GP fit for (1) ignoring the observations for (2) and (3): the LCB is not informative; (c) multi-task GP fit for (1) by transfer learning from (2) and (3): highly informative. Shown are the GP prediction mean, the GP prediction variance, and the probability distribution of the minimizers.]
Exploitation vs exploration
[Figure: κ over iterations for TL4CO and BO4CO with ε = 1, 0.1, and 0.01.]
… the noise variance of the individual datasets.

Configuration selection criteria. TL4CO requires a selection criterion (step 8 in Algorithm 1) to decide the next configuration to measure. Intuitively, we want to select the minimum response. This is done using a utility function u : X → ℝ that determines which x_t+1 ∈ X should be evaluated next by f(·):

x_t+1 = arg max_{x ∈ X} u(x|M, S¹_1:t)    (11)

The selection criterion depends on the MTGP model M through its predictive mean µ_t(x_t) and variance σ²_t(x_t), conditioned on the observations S¹_1:t. TL4CO uses the Lower Confidence Bound (LCB) [24]:

u_LCB(x|M, S¹_1:t) = arg min_{x ∈ X} µ_t(x) − κσ_t(x),    (12)

where κ is an exploitation-exploration parameter. For instance, if we require to find a near-optimal configuration quickly, we set a low value for κ to take the most out of the predictive mean. However, if we are looking for a globally optimal one, we can set a high value in order to skip local minima. Furthermore, κ can be adapted over time [22] to perform more exploration. Figure 6 shows that in TL4CO, κ can start with a relatively lower value at the early iterations compared to BO4CO, since the former provides a better estimate of the mean and therefore contains more information at the early stages.

TL4CO output. Once the N different configurations of …
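The time-varying κ described above, small at first to exploit the better-informed mean and growing to explore later, can be realized with any slowly increasing schedule. A sketch; the √log growth and the constants are our illustrative choices, not the schedule from [22]:

```python
import numpy as np

def kappa(t, kappa0=0.3):
    """Illustrative exploration schedule: starts near kappa0 and grows slowly with t."""
    return kappa0 * np.sqrt(2.0 * np.log(t + 1.0))

def next_config(mu, sigma, t):
    """Eq. (12): pick argmin_x mu_t(x) - kappa * sigma_t(x) over a discrete space."""
    return np.argmin(mu - kappa(t) * sigma)

mu = np.array([1.0, 0.8, 0.9, 1.4])        # hypothetical posterior means
sigma = np.array([0.05, 0.05, 0.3, 0.5])   # hypothetical posterior std devs
for t in (1, 50, 5000):
    print(t, next_config(mu, sigma, t))    # shifts from exploiting the mean to exploring
```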
TL4CO architecture
[Figure: TL4CO architecture. The Configuration Optimisation Tool (TL4CO) holds the GP model and a performance repository with historical data from previous versions V1, V2, …, Vn; a data broker (Kafka) connects it to an experimental suite (deployment service, monitoring, data preparation) on a testbed; a workload generator and tester drive the system under test (Storm, Cassandra, Spark) through a technology interface, given the configuration parameters, experiment time, and polling interval.]
Experimental results
[Figure: absolute error vs. iteration (log scale) for TL4CO, BO4CO, SA, GA, HILL, PS, and Drift on (a) WC(5D) and (b) WC(3D); each iteration takes 8 minutes]
- 30 runs, we report average performance
- Yes, we did full factorial measurements, so we know where the global minimum is… :)
Experimental results (cont.)
[Figure: absolute error vs. iteration (log scale) for TL4CO, BO4CO, SA, GA, HILL, PS, and Drift on WC(6D); each iteration takes 8 minutes]
Experimental results (cont.)
[Figure: absolute error vs. iteration (log scale) for TL4CO, BO4CO, SA, GA, HILL, PS, and Drift on (a) SOL(6D) and (b) RS(6D); each iteration takes 8 minutes]
Comparison with default and expert prescription
[Figure: average read latency (µs) and average write latency (µs) vs. throughput (ops/sec) for the configurations found by TL4CO and BO4CO after 20 and 100 iterations, compared against the default configuration and the configuration recommended by an expert]
Model accuracy
[Figure: absolute percentage error (%) of the models learned by TL4CO vs. polynomial fits (polyfit1-polyfit5), and TL4CO vs. BO4CO, M5Tree, R-Tree, M5Rules, MARS, and PRIM]
Prediction accuracy over time
[Figure: prediction error (RMSE) vs. iteration; (a) TL4CO with T=2, m=100/200/300 and T=3, m=100; (b) TL4CO vs. polyfit1, polyfit2, polyfit4, polyfit5, M5Tree, M5Rules, and PRIM]
Entropy of the density function of the minimizers
[Figure 19: (a) entropy of the minimizer distribution for Branin, Dixon, Hartmann, WC(3D), WC(5D), WC(6D), SOL(6D), RS(6D), and cass-20 under T=1 (BO4CO) and TL4CO with T=2, m=100/200/300/400 and T=3, m=100; (b) entropy vs. iteration for BO4CO and TL4CO]
The knowledge about the location of the optimum configuration is summarized by the approximation of the conditional probability density function of the response function minimizers, i.e., X* = Pr(x*|f(x)), where f(·) is drawn from the MTGP model (cf. the solid red line in Figure 5(b,c)). The entropy of the density function in Figure 5(b) is 6.39, higher than that of Figure 5(c), so we know more about the latter.
The results in Figure 19 confirm that, for all the datasets (synthetic and real), the minimizer distributions under the models provided by TL4CO contain significantly more information. This demonstrates that the main reason for the quick convergence compared with the baselines is that TL4CO employs a more effective model. The results in Figure 19(b) show the change of the entropy of X* over time for the WC(5D) dataset. First, in TL4CO the entropy decreases sharply, whereas the overall decrease of entropy for BO4CO is slow.
Knowledge about the location of the minimizer
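The entropy numbers above summarize the distribution of minimizers X* = Pr(x*|f(x)), with f drawn from the (MT)GP model. A sketch of one way to approximate such an entropy, using Monte-Carlo draws over a discrete grid from a hypothetical posterior (independent pointwise draws for brevity; a faithful version would sample jointly from the full covariance):

```python
import numpy as np
from scipy.stats import entropy

rng = np.random.default_rng(3)
grid = np.linspace(0, 1, 50)

# Hypothetical GP posterior over the grid: a mean with two basins, and a pointwise std dev.
mu = np.cos(6 * grid) + 0.3 * grid
sd_wide, sd_tight = 0.8, 0.1            # a vague model vs. a well-informed one

def minimizer_entropy(sd, draws=20000):
    """Sample f ~ posterior, record the argmin, and compute H of the empirical Pr(x*)."""
    samples = mu + sd * rng.standard_normal((draws, grid.size))
    counts = np.bincount(samples.argmin(axis=1), minlength=grid.size)
    return entropy(counts / draws)

print("vague model :", round(minimizer_entropy(sd_wide), 2))   # high entropy: little knowledge
print("tight model :", round(minimizer_entropy(sd_tight), 2))  # low entropy: minimizer pinned down
```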
Effect of ARD
[Figure: absolute error (ms) vs. iteration for TL4CO with ARD on vs. ARD off, panels (a) and (b)]
We implemented the following kernel function to support integer and categorical variables (cf. Section 2.1):

k_xx(x_i, x_j) = exp(Σ_{ℓ=1..d} (−θ_ℓ δ(x_iℓ ≠ x_jℓ))),    (9)

where d is the number of dimensions (i.e., the number of configuration parameters), θ_ℓ adjusts the scale along the ℓ-th dimension, and δ is a function that gives the distance between two categorical variables using the Kronecker delta [20, 34]. TL4CO uses different scales {θ_ℓ, ℓ = 1…d} on different dimensions, as suggested in [41, 34]; this technique is called Automatic Relevance Determination (ARD). After learning the hyper-parameters (step 7 in Algorithm 1), if the ℓ-th dimension (parameter) turns out to be irrelevant, then θ_ℓ will be a small value, and the dimension will therefore be automatically discarded. This is particularly helpful in high-dimensional spaces, where it is difficult to find the optimal configuration.

3.4.2 Prior mean function
While the kernel controls the structure of the estimated function, the prior mean µ(x) : X → ℝ provides a possible offset for our estimation. By default, this function is set to a constant µ(x) := µ, which is inferred from the observations …
Effect of correlation between tasks
[Figure: absolute error vs. iteration for TL4CO trained on perturbed historical data, f(x)+randn×f(x) and f(x)+2·randn×f(x), compared with BO4CO]
Runtime overhead
[Figure: (a) TL4CO per-iteration elapsed time (s) for WordCount (3D), WordCount (5D), WordCount (6D), SOL (6D), and RollingSort (6D); (b) TL4CO runtime for different sizes of historical observations (T=2, m=100/200/400; T=3, m=100/200)]
Key takeaways
- A principled way to leverage prior knowledge gained from searches over previous versions of the system.
- Multi-task GPs can be used to capture correlations between related tuning tasks.
- MTGPs are more accurate than STGPs and other regression models.
- Applicable to stream processing, batch, and NoSQL systems.
- Leads to better performance in practice.
Acknowledgement: BO4CO and TL4CO are now integrated with other DevOps tools in the delivery pipeline for Big Data in the H2020 DICE project.
[Figure: DICE project overview. A DICE IDE with a profile and plugins (Sim, Ver, Opt) follows the DPIM/DTSM/DDSM-to-TOSCA methodology; deployment, configuration, and testing tools (Deploy, Config, Test) are complemented by monitoring, anomaly detection, trace checking, iterative enhancement, continuous integration, and fault injection, organized across work packages WP1-WP6 and targeting Data-Intensive Applications (DIA) on Big Data technologies and private/public clouds.]
Tool Demo: http://www.slideshare.net/pooyanjamshidi/configuration-optimization-tool

Más contenido relacionado

Destacado

Refinery meet
Refinery meet Refinery meet
Refinery meet rajeshvari
 
How to Get My Paper Accepted at Top Software Engineering Conferences
How to Get My Paper Accepted at Top Software Engineering ConferencesHow to Get My Paper Accepted at Top Software Engineering Conferences
How to Get My Paper Accepted at Top Software Engineering ConferencesAlex Orso
 
Transfer Learning for Improving Model Predictions in Highly Configurable Soft...
Transfer Learning for Improving Model Predictions in Highly Configurable Soft...Transfer Learning for Improving Model Predictions in Highly Configurable Soft...
Transfer Learning for Improving Model Predictions in Highly Configurable Soft...Pooyan Jamshidi
 
Microservices Architecture Enables DevOps: Migration to a Cloud-Native Archit...
Microservices Architecture Enables DevOps: Migration to a Cloud-Native Archit...Microservices Architecture Enables DevOps: Migration to a Cloud-Native Archit...
Microservices Architecture Enables DevOps: Migration to a Cloud-Native Archit...Pooyan Jamshidi
 
Cloud Migration Patterns: A Multi-Cloud Architectural Perspective
Cloud Migration Patterns: A Multi-Cloud Architectural PerspectiveCloud Migration Patterns: A Multi-Cloud Architectural Perspective
Cloud Migration Patterns: A Multi-Cloud Architectural PerspectivePooyan Jamshidi
 
Machine Learning meets DevOps
Machine Learning meets DevOpsMachine Learning meets DevOps
Machine Learning meets DevOpsPooyan Jamshidi
 

Destacado (6)

Refinery meet
Refinery meet Refinery meet
Refinery meet
 
How to Get My Paper Accepted at Top Software Engineering Conferences
How to Get My Paper Accepted at Top Software Engineering ConferencesHow to Get My Paper Accepted at Top Software Engineering Conferences
How to Get My Paper Accepted at Top Software Engineering Conferences
 
Transfer Learning for Improving Model Predictions in Highly Configurable Soft...
Transfer Learning for Improving Model Predictions in Highly Configurable Soft...Transfer Learning for Improving Model Predictions in Highly Configurable Soft...
Transfer Learning for Improving Model Predictions in Highly Configurable Soft...
 
Microservices Architecture Enables DevOps: Migration to a Cloud-Native Archit...
Microservices Architecture Enables DevOps: Migration to a Cloud-Native Archit...Microservices Architecture Enables DevOps: Migration to a Cloud-Native Archit...
Microservices Architecture Enables DevOps: Migration to a Cloud-Native Archit...
 
Cloud Migration Patterns: A Multi-Cloud Architectural Perspective
Cloud Migration Patterns: A Multi-Cloud Architectural PerspectiveCloud Migration Patterns: A Multi-Cloud Architectural Perspective
Cloud Migration Patterns: A Multi-Cloud Architectural Perspective
 
Machine Learning meets DevOps
Machine Learning meets DevOpsMachine Learning meets DevOps
Machine Learning meets DevOps
 

Más de Pooyan Jamshidi

Learning LWF Chain Graphs: A Markov Blanket Discovery Approach
Learning LWF Chain Graphs: A Markov Blanket Discovery ApproachLearning LWF Chain Graphs: A Markov Blanket Discovery Approach
Learning LWF Chain Graphs: A Markov Blanket Discovery ApproachPooyan Jamshidi
 
A Framework for Robust Control of Uncertainty in Self-Adaptive Software Conn...
 A Framework for Robust Control of Uncertainty in Self-Adaptive Software Conn... A Framework for Robust Control of Uncertainty in Self-Adaptive Software Conn...
A Framework for Robust Control of Uncertainty in Self-Adaptive Software Conn...Pooyan Jamshidi
 
Machine Learning Meets Quantitative Planning: Enabling Self-Adaptation in Aut...
Machine Learning Meets Quantitative Planning: Enabling Self-Adaptation in Aut...Machine Learning Meets Quantitative Planning: Enabling Self-Adaptation in Aut...
Machine Learning Meets Quantitative Planning: Enabling Self-Adaptation in Aut...Pooyan Jamshidi
 
Ensembles of Many Diverse Weak Defenses can be Strong: Defending Deep Neural ...
Ensembles of Many Diverse Weak Defenses can be Strong: Defending Deep Neural ...Ensembles of Many Diverse Weak Defenses can be Strong: Defending Deep Neural ...
Ensembles of Many Diverse Weak Defenses can be Strong: Defending Deep Neural ...Pooyan Jamshidi
 
Transfer Learning for Performance Analysis of Machine Learning Systems
Transfer Learning for Performance Analysis of Machine Learning SystemsTransfer Learning for Performance Analysis of Machine Learning Systems
Transfer Learning for Performance Analysis of Machine Learning SystemsPooyan Jamshidi
 
Transfer Learning for Performance Analysis of Configurable Systems: A Causal ...
Transfer Learning for Performance Analysis of Configurable Systems:A Causal ...Transfer Learning for Performance Analysis of Configurable Systems:A Causal ...
Transfer Learning for Performance Analysis of Configurable Systems: A Causal ...Pooyan Jamshidi
 
Machine Learning meets DevOps
Machine Learning meets DevOpsMachine Learning meets DevOps
Machine Learning meets DevOpsPooyan Jamshidi
 
Integrated Model Discovery and Self-Adaptation of Robots
Integrated Model Discovery and Self-Adaptation of RobotsIntegrated Model Discovery and Self-Adaptation of Robots
Integrated Model Discovery and Self-Adaptation of RobotsPooyan Jamshidi
 
Transfer Learning for Performance Analysis of Highly-Configurable Software
Transfer Learning for Performance Analysis of Highly-Configurable SoftwareTransfer Learning for Performance Analysis of Highly-Configurable Software
Transfer Learning for Performance Analysis of Highly-Configurable SoftwarePooyan Jamshidi
 
Architectural Tradeoff in Learning-Based Software
Architectural Tradeoff in Learning-Based SoftwareArchitectural Tradeoff in Learning-Based Software
Architectural Tradeoff in Learning-Based SoftwarePooyan Jamshidi
 
Production-Ready Machine Learning for the Software Architect
Production-Ready Machine Learning for the Software ArchitectProduction-Ready Machine Learning for the Software Architect
Production-Ready Machine Learning for the Software ArchitectPooyan Jamshidi
 
Transfer Learning for Software Performance Analysis: An Exploratory Analysis
Transfer Learning for Software Performance Analysis: An Exploratory AnalysisTransfer Learning for Software Performance Analysis: An Exploratory Analysis
Transfer Learning for Software Performance Analysis: An Exploratory AnalysisPooyan Jamshidi
 
Learning Software Performance Models for Dynamic and Uncertain Environments
Learning Software Performance Models for Dynamic and Uncertain EnvironmentsLearning Software Performance Models for Dynamic and Uncertain Environments
Learning Software Performance Models for Dynamic and Uncertain EnvironmentsPooyan Jamshidi
 
Fuzzy Self-Learning Controllers for Elasticity Management in Dynamic Cloud Ar...
Fuzzy Self-Learning Controllers for Elasticity Management in Dynamic Cloud Ar...Fuzzy Self-Learning Controllers for Elasticity Management in Dynamic Cloud Ar...
Fuzzy Self-Learning Controllers for Elasticity Management in Dynamic Cloud Ar...Pooyan Jamshidi
 
Configuration Optimization for Big Data Software
Configuration Optimization for Big Data SoftwareConfiguration Optimization for Big Data Software
Configuration Optimization for Big Data SoftwarePooyan Jamshidi
 
Self learning cloud controllers
Self learning cloud controllersSelf learning cloud controllers
Self learning cloud controllersPooyan Jamshidi
 
Workload Patterns for Quality-driven Dynamic Cloud Service Configuration and...
Workload Patterns for Quality-driven Dynamic Cloud Service Configuration and...Workload Patterns for Quality-driven Dynamic Cloud Service Configuration and...
Workload Patterns for Quality-driven Dynamic Cloud Service Configuration and...Pooyan Jamshidi
 
Fuzzy Control meets Software Engineering
Fuzzy Control meets Software EngineeringFuzzy Control meets Software Engineering
Fuzzy Control meets Software EngineeringPooyan Jamshidi
 

Más de Pooyan Jamshidi (20)

Learning LWF Chain Graphs: A Markov Blanket Discovery Approach
Learning LWF Chain Graphs: A Markov Blanket Discovery ApproachLearning LWF Chain Graphs: A Markov Blanket Discovery Approach
Learning LWF Chain Graphs: A Markov Blanket Discovery Approach
 
A Framework for Robust Control of Uncertainty in Self-Adaptive Software Conn...
 A Framework for Robust Control of Uncertainty in Self-Adaptive Software Conn... A Framework for Robust Control of Uncertainty in Self-Adaptive Software Conn...
A Framework for Robust Control of Uncertainty in Self-Adaptive Software Conn...
 
Machine Learning Meets Quantitative Planning: Enabling Self-Adaptation in Aut...
Machine Learning Meets Quantitative Planning: Enabling Self-Adaptation in Aut...Machine Learning Meets Quantitative Planning: Enabling Self-Adaptation in Aut...
Machine Learning Meets Quantitative Planning: Enabling Self-Adaptation in Aut...
 
Ensembles of Many Diverse Weak Defenses can be Strong: Defending Deep Neural ...
Ensembles of Many Diverse Weak Defenses can be Strong: Defending Deep Neural ...Ensembles of Many Diverse Weak Defenses can be Strong: Defending Deep Neural ...
Ensembles of Many Diverse Weak Defenses can be Strong: Defending Deep Neural ...
 
Transfer Learning for Performance Analysis of Machine Learning Systems
Transfer Learning for Performance Analysis of Machine Learning SystemsTransfer Learning for Performance Analysis of Machine Learning Systems
Transfer Learning for Performance Analysis of Machine Learning Systems
 
Transfer Learning for Performance Analysis of Configurable Systems: A Causal ...
Transfer Learning for Performance Analysis of Configurable Systems:A Causal ...Transfer Learning for Performance Analysis of Configurable Systems:A Causal ...
Transfer Learning for Performance Analysis of Configurable Systems: A Causal ...
 
Machine Learning meets DevOps
Machine Learning meets DevOpsMachine Learning meets DevOps
Machine Learning meets DevOps
 
Learning to Sample
Learning to SampleLearning to Sample
Learning to Sample
 
Integrated Model Discovery and Self-Adaptation of Robots
Integrated Model Discovery and Self-Adaptation of RobotsIntegrated Model Discovery and Self-Adaptation of Robots
Integrated Model Discovery and Self-Adaptation of Robots
 
Transfer Learning for Performance Analysis of Highly-Configurable Software
Transfer Learning for Performance Analysis of Highly-Configurable SoftwareTransfer Learning for Performance Analysis of Highly-Configurable Software
Transfer Learning for Performance Analysis of Highly-Configurable Software
 
Architectural Tradeoff in Learning-Based Software
Architectural Tradeoff in Learning-Based SoftwareArchitectural Tradeoff in Learning-Based Software
Architectural Tradeoff in Learning-Based Software
 
Production-Ready Machine Learning for the Software Architect
Production-Ready Machine Learning for the Software ArchitectProduction-Ready Machine Learning for the Software Architect
Production-Ready Machine Learning for the Software Architect
 
Transfer Learning for Software Performance Analysis: An Exploratory Analysis
Transfer Learning for Software Performance Analysis: An Exploratory AnalysisTransfer Learning for Software Performance Analysis: An Exploratory Analysis
Transfer Learning for Software Performance Analysis: An Exploratory Analysis
 
Architecting for Scale
Architecting for ScaleArchitecting for Scale
Architecting for Scale
 
Learning Software Performance Models for Dynamic and Uncertain Environments
Learning Software Performance Models for Dynamic and Uncertain EnvironmentsLearning Software Performance Models for Dynamic and Uncertain Environments
Learning Software Performance Models for Dynamic and Uncertain Environments
 
Fuzzy Self-Learning Controllers for Elasticity Management in Dynamic Cloud Ar...
Fuzzy Self-Learning Controllers for Elasticity Management in Dynamic Cloud Ar...Fuzzy Self-Learning Controllers for Elasticity Management in Dynamic Cloud Ar...
Fuzzy Self-Learning Controllers for Elasticity Management in Dynamic Cloud Ar...
 
Configuration Optimization for Big Data Software
Configuration Optimization for Big Data SoftwareConfiguration Optimization for Big Data Software
Configuration Optimization for Big Data Software
 
Self learning cloud controllers
Self learning cloud controllersSelf learning cloud controllers
Self learning cloud controllers
 
Workload Patterns for Quality-driven Dynamic Cloud Service Configuration and...
Workload Patterns for Quality-driven Dynamic Cloud Service Configuration and...Workload Patterns for Quality-driven Dynamic Cloud Service Configuration and...
Workload Patterns for Quality-driven Dynamic Cloud Service Configuration and...
 
Fuzzy Control meets Software Engineering
Fuzzy Control meets Software EngineeringFuzzy Control meets Software Engineering
Fuzzy Control meets Software Engineering
 


Transfer Learning for Optimal Configuration of Big Data Software

  • 1. Transfer Learning for Optimal Configuration of Big Data Software Pooyan Jamshidi, Giuliano Casale, Salvatore Dipietro Imperial College London p.jamshidi@imperial.ac.uk Department of Computing Research Associate Symposium 14th June 2016
  • 2. Motivation. 1- Many different parameters => large state space, interactions. 2- Defaults are typically used => poor performance.
  • 3. Motivation. [Histograms of observation counts vs. average read latency (µs): (a) cass-20, (b) cass-10, with the best and worst configurations marked.] Experiments on Apache Cassandra: 6 parameters, 1024 configurations; metric: average read latency; 10 million records (cass-10) and 20 million records (cass-20).
  • 4. Goal! We consider the problem of finding an optimal configuration x* that globally minimizes the response function f(·) over the configuration space X = Dom(X1) × ··· × Dom(Xd):

x* = arg min_{x ∈ X} f(x)    (1)

Throughout, f(·) is latency, though other response metrics may be used. In practice, the response function is unknown or only partially known, i.e., y_i = f(x_i), x_i ⊂ X, and measurements contain noise, y_i = f(x_i) + ε. Determining the optimal configuration is thus a black-box optimization program subject to noise [27, 33], which is considerably harder than deterministic optimization. A popular solution is based on sampling: it starts with a number of sampled configurations, and the performance of the experiments associated to these initial samples delivers a better understanding of f(·) and guides the generation of the next round of samples. If properly guided, the process of sample generation-evaluation-feedback-regeneration eventually converges to the optimal configuration. [Surface plot: latency (ms) over number of splitters × number of counters, cubic interpolation over a finer grid. Annotations: partially known; measurements contain noise; configuration space.]
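To make the setup in Eq. (1) concrete, here is a toy stand-in: a finite configuration space and a noisy black-box latency function. The quadratic surface and the noise level are illustrative assumptions, not the real response of any benchmark in the slides; the grid size merely mirrors the 1024-configuration Cassandra study.

```python
# Toy problem setup for Eq. (1): minimize a noisy black-box f over a finite
# configuration space. Surface shape and noise are assumptions for illustration.
import numpy as np

rng = np.random.default_rng(0)

# 32 x 32 = 1024 configurations over two integer parameters (e.g. splitters x counters)
candidates = np.array([[s, c] for s in range(1, 33) for c in range(1, 33)], dtype=float)

def latency(x, noise=2.0):
    """y = f(x) + eps: unknown true response plus measurement noise."""
    f = 100.0 + 0.5 * (x[0] - 6.0) ** 2 + 0.3 * (x[1] - 14.0) ** 2
    return f + rng.normal(0.0, noise)
```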
  • 5. Finding the optimum configuration is difficult! [Plots: latency (ms) vs. number of counters for splitters=2 and splitters=3; latency surface over number of splitters × number of counters.] A sampling-based approach of this kind can be prohibitively expensive in terms of time or cost (e.g., rental of resources), since a single function evaluation can be costly and the optimization may need hundreds of samples to converge. Prior approaches include Recursive Random Sampling (RRS) [43], which integrates a restart mechanism into random sampling, and Smart Hill Climbing (SHC) [42], which integrates importance sampling with Latin Hypercube Design; their learning is limited to the current observations and they still require hundreds of sample evaluations. We instead adopt transfer learning: rather than starting the search from scratch, the approach transfers knowledge learned from similar versions of the software to accelerate sampling in the current version. This idea is inspired by real software engineering practice [2, 3]: (i) in DevOps, different versions of a system are delivered continuously; (ii) Big Data systems are developed using similar frameworks (e.g., Apache Hadoop, Spark, Kafka) and run on similar platforms (e.g., cloud clusters); (iii) different versions of a system often share a similar business logic. To the best of our knowledge, only one study [9] explores transfer learning in system configuration: it learns a Bayesian network while tuning one system and reuses the structure of that network to tune similar systems. We instead reuse not only the previously learned model but also the valuable raw data, so we are not limited to the accuracy of the learned model, and we focus on MTGPs rather than Bayesian networks. Motivating example: WordCount (cf. Figure 1), a popular benchmark [12] with a three-layer architecture that counts the words in an incoming stream. A Processing Element (PE) of type Spout reads input messages from a data source (Kafka) and pushes them to the system; a PE of type Bolt named Splitter splits sentences into words, which are then counted by the Counter Bolt. [Figure 1: WordCount architecture.] The response surface is: non-linear, non-convex, multi-modal, and measurements are subject to noise.
  • 7. GP for modeling the black-box response function. [Illustration: true function, GP mean, GP variance, observations, selected point, true minimum.] A GP model is composed of its prior mean (µ(·): X → R) and a covariance function (k(·,·): X × X → R) [41]:

y = f(x) ~ GP(µ(x), k(x, x'))    (2)

where the covariance k(x, x') defines the distance between x and x'. Let S_{1:t} = {(x_{1:t}, y_{1:t}) | y_i := f(x_i)} be the collection of t experimental observations. In this framework, we treat f(x) as a random variable, conditioned on the observations S_{1:t}, which is normally distributed with the following posterior mean and variance functions [41]:

µ_t(x) = µ(x) + k(x)ᵀ (K + σ²I)⁻¹ (y − µ)    (3)
σ_t²(x) = k(x, x) + σ²I − k(x)ᵀ (K + σ²I)⁻¹ k(x)    (4)

where y := y_{1:t}, k(x)ᵀ = [k(x, x_1) k(x, x_2) ... k(x, x_t)], µ := µ(x_{1:t}), K := k(x_i, x_j), and I is the identity matrix. Motivations: 1- mean estimates plus variance; 2- all computations are linear algebra. The shortcoming of BO4CO is that it cannot exploit observations of other versions of the system and therefore cannot be applied in DevOps.
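As a concrete illustration of Eqs. (3)-(4), below is a minimal numpy sketch of the GP posterior. The squared-exponential kernel and the zero prior mean are illustrative assumptions; the tool itself uses a Matérn-family kernel with ARD and a learned mean function.

```python
# Minimal GP posterior, Eqs. (3)-(4), with mu(x) = 0 and an assumed RBF kernel.
import numpy as np

def rbf(A, B, length=4.0, var=1.0):
    """Squared-exponential covariance between rows of A and rows of B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return var * np.exp(-0.5 * d2 / length ** 2)

def gp_posterior(X, y, Xs, noise=1e-2):
    """Posterior mean and covariance at test points Xs given data (X, y)."""
    K = rbf(X, X) + noise * np.eye(len(X))        # K + sigma^2 I
    Ks = rbf(X, Xs)                               # k(x) for each test point
    Kinv_y = np.linalg.solve(K, y)                # (K + sigma^2 I)^-1 y
    mu = Ks.T @ Kinv_y                            # Eq. (3)
    cov = rbf(Xs, Xs) + noise * np.eye(len(Xs)) \
        - Ks.T @ np.linalg.solve(K, Ks)           # Eq. (4)
    return mu, cov
```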
  • 8. [Figure 5: an example of a 1D GP model; GPs provide mean estimates as well as the uncertainty in estimations, i.e., variance. Figure 6: tool architecture with the configuration optimisation tool, performance repository, data broker (Kafka), experimental suite, testbed, and technology interfaces for Storm, Cassandra, and Spark.]

Algorithm 1: BO4CO
Input: configuration space X, maximum budget N_max, response function f, kernel function K_θ, hyper-parameters θ, design sample size n, learning cycle N_l. Output: optimal configuration x* and learned model M.
1: choose an initial sparse design (lhd) to find initial design samples D = {x_1, ..., x_n}
2: obtain performance measurements of the initial design, y_i ← f(x_i) + ε_i, ∀x_i ∈ D
3: S_{1:n} ← {(x_i, y_i)}_{i=1}^n; t ← n + 1
4: M(x|S_{1:n}, θ) ← fit a GP model to the design (Eq. 3)
5: while t ≤ N_max do
6:   if (t mod N_l = 0): θ ← learn the kernel hyper-parameters by maximizing the likelihood
7:   find the next configuration x_t by optimizing the selection criterion over the estimated response surface given the data, x_t ← arg max_x u(x|M, S_{1:t−1}) (Eq. 9)
8:   obtain performance for the new configuration, y_t ← f(x_t) + ε_t
9:   augment the data, S_{1:t} = {S_{1:t−1}, (x_t, y_t)}
10:  M(x|S_{1:t}, θ) ← re-fit a new GP model (Eq. 7)
11:  t ← t + 1
12: end while
13: (x*, y*) = min S_{1:N_max}
14: M(x)

Acquisition function: BO4CO uses the Lower Confidence Bound (LCB) [25], which minimizes the distance to the optimum by balancing exploitation of the mean and exploration via the variance:

u_LCB(x|M, S_{1:n}) = arg min_{x ∈ X} µ_t(x) − κ σ_t(x)    (10)

where κ is set according to the objectives: a low κ exploits the knowledge of the initial design to find a near-optimal configuration quickly, while a high κ skips local minima when a globally optimal configuration is sought; κ can also be adapted over time, beginning small to exploit the mean estimates and increasing in order to explore [Figure 7: κ vs. iteration].

Kernel function: a variation of the Matérn kernel for ν = 1/2, k_{ν=1/2}(x_i, x_j) = θ_0² exp(−r), with r²(x_i, x_j) = (x_i − x_j)ᵀ Λ (x_i − x_j) for some positive semidefinite matrix Λ (11). For categorical variables, we implemented k_θ(x_i, x_j) = exp(Σ_{l=1}^d (−θ_l δ(x_{i,l} ≠ x_{j,l}))) (12). With Automatic Relevance Determination (ARD) hyper-parameters (step 6), if the l-th dimension turns out to be irrelevant, θ_l will be small and the dimension is effectively discarded; this is particularly helpful in high-dimensional spaces, where it is difficult to find the optimum. The prior mean function µ(x): X → R provides a possible offset for the estimation: by default a constant µ inferred from the observations [25], but since our datasets (cf. Table I) show a significant difference between the minimum and maximum of each function, a linear mean function allows more flexible structures and fits the data better than a constant mean. From an implementation perspective, the tool comprises (i) an optimization component implementing the model (re-)fitting and criterion-optimization steps of Algorithm 1 (in Matlab 2015b), (ii) an experimental suite providing facilities for automated deployment of topologies, performance measurement, and data preparation, integrated via (iii) a data broker.
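Putting Algorithm 1 together with the posterior sketch above, a compact version of the loop might look as follows. The random initial design (instead of Latin Hypercube), the fixed κ, and the omission of periodic hyper-parameter learning (step 6) are simplifications of this sketch.

```python
# A simplified BO4CO loop over a finite candidate set, reusing gp_posterior and
# the toy latency/candidates defined earlier.
import numpy as np

def bo4co(f, candidates, n_init=10, n_max=50, kappa=2.0, noise=1e-2, seed=1):
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(candidates), size=n_init, replace=False)  # initial design
    X = candidates[idx]
    y = np.array([f(x) for x in X])                                # noisy measurements
    for _ in range(n_max - n_init):
        mu, cov = gp_posterior(X, y, candidates, noise)
        sigma = np.sqrt(np.maximum(np.diag(cov), 0.0))
        lcb = mu - kappa * sigma                                   # Eq. (10)
        x_next = candidates[np.argmin(lcb)]                        # step 7
        X = np.vstack([X, x_next])                                 # steps 8-9
        y = np.append(y, f(x_next))
    best = int(np.argmin(y))                                       # step 13
    return X[best], y[best]

x_star, y_star = bo4co(latency, candidates)
```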
  • 9. [Illustration of successive BO iterations over the configuration domain: the true response function and the current GP fit; the criteria evaluation and the newly selected point; and the new GP fit after the measurement is added.]
  • 10. Correlations across different versions. [Latency (ms) surfaces over number of splitters × number of counters for (a) WordCount v1, (b) v2, (c) v3, (d) v4; (g) measurement noise across WordCount versions, spanning a hardware change and a software change.]

(e) Pearson correlation coefficients (upper triangle) and p-values (lower triangle):
      v1        v2     v3        v4
v1    1         0.41   -0.46     -0.50
v2    7.36E-06  1      -0.20     -0.18
v3    6.92E-07  0.04   1         0.94
v4    2.54E-08  0.07   1.16E-52  1

(f) Spearman correlation coefficients (upper triangle) and p-values (lower triangle):
      v1        v2     v3        v4
v1    1         0.49   -0.51     -0.51
v2    5.50E-08  1      -0.2793   -0.24
v3    1.30E-08  0.003  1         0.88
v4    1.40E-08  0.01   8.30E-36  1

Measurement noise per version (mean µ, noise σ, ratio µ/σ):
ver   µ        σ       µ/σ
v1    516.59   7.96    64.88
v2    584.94   2.58    226.32
v3    654.89   13.56   48.30
v4    1125.81  16.92   66.56

Takeaways: different correlations, different optimum configurations, different noise levels.
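As a hedged sketch of how such version-to-version correlations can be computed, the snippet below runs Pearson and Spearman tests over latencies measured at the same configurations in two versions. The arrays are placeholders, not the slide's actual measurements.

```python
# Correlation between two versions' responses at identical configurations.
# The generated latencies are placeholder data for illustration only.
import numpy as np
from scipy.stats import pearsonr, spearmanr

latency_v3 = np.random.default_rng(1).gamma(5.0, 100.0, size=100)
latency_v4 = 1.7 * latency_v3 + np.random.default_rng(2).normal(0.0, 30.0, size=100)

r, p_r = pearsonr(latency_v3, latency_v4)        # linear correlation
rho, p_rho = spearmanr(latency_v3, latency_v4)   # rank correlation
print(f"Pearson r={r:.2f} (p={p_r:.1e}), Spearman rho={rho:.2f} (p={p_rho:.1e})")
```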
  • 12. Solution: Transfer Learning for Configuration Optimization. [Workflow: each configuration optimization run (for versions j = N and j = M) cycles through Initial Design → Model Fit → Next Experiment → Model Update until the budget is finished; performance measurements and GP model hyper-parameters are stored in a performance repository, and data from previous versions is filtered and selected for training the model of the current version.]
  • 14. [Figure 3: an example of a 1-dimensional GP model, showing the true function, GP mean, GP variance, observations, selected point, and true minimum.]

Algorithm 1: TL4CO
Input: configuration space X, number of historical configuration optimization datasets T, maximum budget N_max, response function f, kernel function K, hyper-parameters θ_j, current dataset S^{j=1}, datasets belonging to other versions S^{j=2:T}, diagonal noise matrix Σ, design sample size n, learning cycle N_l. Output: optimal configuration x* and learned model M.
1: D = {x_1, ..., x_n} ← create an initial sparse design (lhd)
2: obtain the performance, y_i ← f(x_i) + ε_i, ∀x_i ∈ D
3: S^{j=2:T} ← select m points from other versions (j = 2:T) that reduce entropy (Eq. 8)
4: S¹_{1:n} ← S^{j=2:T} ∪ {(x_i, y_i)}_{i=1}^n; t ← n + 1
5: M(x|S¹_{1:n}, θ_j) ← fit a multi-task GP to D (Eq. 6)
6: while t ≤ N_max do
7:   if (t mod N_l = 0) then θ ← learn the kernel hyper-parameters by maximizing the likelihood
8:   determine the next configuration x_t by optimizing LCB over M, x_t ← arg max_x u(x|M, S¹_{1:t−1}) (Eq. 11)
9:   obtain the performance of x_t, y_t ← f(x_t) + ε_t
10:  augment the data, S¹_{1:t} = S¹_{1:t−1} ∪ {(x_t, y_t)}
11:  M(x|S¹_{1:t}, θ) ← re-fit a new GP model (Eq. 6)
12:  t ← t + 1
13: end while
14: (x*, y*) = min[S¹_{1:N_max} \ S^{j=2:T}], locating the optimum configuration after discarding the data of other system versions
15: M(x) ← store the learned model

Each observation (x_i^j, y_i^j) is associated with a version (task) label l_j = j; without loss of generality, j = 1 is always the current version under test. To transfer the learning from previous tasks to the current one, the MTGP framework allows us to use the multiplication of two independent kernels [5]:

k_TL4CO(l, l', x, x') = k_t(l, l') × k_xx(x, x')    (5)

where k_t represents the correlation between different versions (it depends only on the labels pointing to the versions for which we have observations) and k_xx is the covariance function for configurations within a particular version (it depends only on the configuration x). Similarly to Eqs. (3)-(4), the MTGP mean and variance are [34]:

µ_t(x) = µ(x) + kᵀ (K(X, X) + Σ)⁻¹ (y − µ)    (6)
σ_t²(x) = k(x, x) + σ² − kᵀ (K(X, X) + Σ)⁻¹ k    (7)

where y := [f_1, ..., f_T]ᵀ is the measured performance over all tasks, X = [X_1, ..., X_T] are the corresponding configurations, Σ = diag[σ_1², ..., σ_T²] is a diagonal noise matrix in which each noise term is repeated as many times as there are historical observations t_1, ..., t_T, and k := k(X, x). This is a conditional estimate at a desired configuration given the historical datasets of previous system versions and their respective GP hyper-parameters; a further benefit is fast learning of the hyper-parameters by exploiting the similarity between measurements.
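The following is a minimal sketch of the multiplicative multi-task kernel in Eq. (5). The free-form task-similarity matrix B (playing the role of k_t) and the squared-exponential configuration kernel are illustrative assumptions.

```python
# Multi-task kernel, Eq. (5): k((l, x), (l', x')) = k_t(l, l') * k_xx(x, x').
import numpy as np

def multitask_kernel(labels_a, Xa, labels_b, Xb, B, length=4.0):
    """labels_* are integer version indices; B is a T x T positive
    semidefinite task-similarity matrix (k_t); k_xx here is RBF."""
    kt = B[np.ix_(labels_a, labels_b)]                     # k_t(l, l')
    d2 = ((Xa[:, None, :] - Xb[None, :, :]) ** 2).sum(-1)
    kxx = np.exp(-0.5 * d2 / length ** 2)                  # k_xx(x, x')
    return kt * kxx                                        # Eq. (5)
```

With B parameterized as L @ L.T (a Cholesky-style factorization), the joint covariance over current and historical observations stays positive semidefinite, and Eqs. (6)-(7) follow by the same linear algebra as in the single-task GP.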
  • 15. Multi-task GP vs. single-task GP. [(a) Three sample response functions (1), (2), (3) over the configuration domain, with observations; (b) single-task GP fit for (1) ignoring the observations for (2) and (3): the LCB is not informative; (c) multi-task GP fit for (1) with transfer learning from (2) and (3): highly informative, showing the GP prediction mean, the prediction variance, and the probability distribution of the minimizers.]
  • 16. Exploitation vs. exploration. TL4CO requires a selection criterion (step 8 in Algorithm 1) to decide the next configuration to measure. Intuitively, we want to select the minimum response; this is done using a utility function u: X → R that determines which x_{t+1} ∈ X should have f(·) evaluated next:

x_{t+1} = arg max_{x ∈ X} u(x|M, S¹_{1:t})    (11)

The selection criterion depends on the MTGP model M through its predictive mean µ_t(x_t) and variance σ_t²(x_t) conditioned on the observations S¹_{1:t}. TL4CO uses the Lower Confidence Bound (LCB) [24]:

u_LCB(x|M, S¹_{1:t}) = arg min_{x ∈ X} µ_t(x) − κ σ_t(x)    (12)

where κ is an exploitation-exploration parameter: to find a near-optimal configuration quickly, we set a low κ and take the most out of the predictive mean; for a globally optimal one, we set a high κ in order to skip local minima. Furthermore, κ can be adapted over time [22] to perform more exploration. [Plot: κ vs. iteration for TL4CO and BO4CO with ε = 1, 0.1, 0.01.] In TL4CO, κ can start with a relatively lower value in the early iterations than in BO4CO, because the former provides a better estimate of the mean and therefore contains more information at the early stages.
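For concreteness, one common way to adapt κ over iterations is the sqrt-log schedule from the GP-UCB literature; this is an illustrative choice, not necessarily the exact rule behind the plot above.

```python
# An assumed GP-UCB-style time-varying kappa: small early (exploit the mean,
# especially when it is informed by transferred data), growing slowly so that
# later iterations explore more.
import numpy as np

def kappa_schedule(t, n_candidates, delta=0.1):
    beta_t = 2.0 * np.log(n_candidates * t ** 2 * np.pi ** 2 / (6.0 * delta))
    return np.sqrt(beta_t)
```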
  • 17. TL4CO architecture. [Diagram: the Configuration Optimisation Tool (TL4CO) maintains the GP model, draws historical data for previous versions V1...Vn from a performance repository, and exchanges configuration parameter values with an experimental suite through a data broker (Kafka); the suite (tester, deployment service, workload generator, monitoring, data preparation) drives the system under test on the testbed via technology interfaces for Storm, Cassandra, and Spark, parameterized by experiment time and polling interval.]
  • 18. Experimental results. [Absolute error vs. iteration for TL4CO, BO4CO, SA, GA, HILL, PS, Drift: (a) WC(5D), (b) WC(3D) (8 minutes each).] 30 runs; average performance reported. Yes, we did full factorial measurements, so we know where the global minimum is… :)
  • 19. Experimental results (cont.). [Absolute error vs. iteration for TL4CO, BO4CO, SA, GA, HILL, PS, Drift: (b) WC(6D) (8 minutes each).]
  • 20. Experimental results (cont.). [Absolute error vs. iteration for TL4CO, BO4CO, SA, GA, HILL, PS, Drift: (a) SOL(6D), (b) RS(6D) (8 minutes each).]
  • 21. Comparison with default and expert prescription. [Scatter plots vs. throughput (ops/sec) of average read latency (µs) and average write latency (µs): TL4CO and BO4CO after 20 and 100 iterations, against the default configuration and the configuration recommended by an expert.]
  • 22. Model accuracy. [Absolute percentage error (%): TL4CO vs. polyfit1 through polyfit5; TL4CO and BO4CO vs. M5Tree, R-Tree, M5Rules, MARS, PRIM.]
  • 23. Prediction accuracy over time. [Prediction error (RMSE) vs. iteration: (a) TL4CO with T=2, m=100/200/300 and T=3, m=100; (b) TL4CO vs. polyfit1, polyfit2, polyfit4, polyfit5, M5Tree, M5Rules, PRIM.]
  • 24. Entropy of the density function of the minimizers. [Plots: entropy vs. iteration for T=1 (BO4CO) and TL4CO with T=2, m=100/200/300/400 and T=3, m=100; entropy of BO4CO vs. TL4CO on Branin, Hartmann, Dixon, WC(3D), WC(5D), WC(6D), SOL(6D), RS(6D), cass-20.] The knowledge about the location of the optimum configuration is summarized by an approximation of the conditional probability density function of the response-function minimizers, X* = Pr(x*|f(x)), where f(·) is drawn from the MTGP model (cf. the solid red line in Figure 5(b,c); the two density functions there have different entropies, so one carries more information about the minimizer than the other). The results in Figure 19 confirm that, by the entropy measure, the minimizer models provided by TL4CO contain significantly more information on all datasets (synthetic and real); this more effective model is the main reason for the quick convergence compared with the baselines. Figure 19(b) shows the change in the entropy of X* over time on the WC(5D) dataset: in TL4CO the entropy decreases sharply, whereas the overall decrease for BO4CO is slow.
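One standard Monte Carlo way to estimate the entropy of Pr(x*|f) over a finite candidate set, sketched below, is to draw joint samples from the GP posterior, record where each sample attains its minimum, and take the entropy of the resulting histogram; the slides do not spell out their exact estimator, so treat this as an assumption.

```python
# Entropy of the approximate minimizer distribution Pr(x*) over a finite grid,
# estimated by sampling functions from the (MT)GP posterior.
import numpy as np

def minimizer_entropy(mu, cov, n_samples=2000, seed=0):
    rng = np.random.default_rng(seed)
    f = rng.multivariate_normal(mu, cov, size=n_samples)  # posterior function draws
    counts = np.bincount(f.argmin(axis=1), minlength=len(mu))
    p = counts / counts.sum()                             # approximate Pr(x*)
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())                  # Shannon entropy
```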
  • 25. Effect of ARD. [Absolute error (ms) vs. iteration for TL4CO with ARD on vs. ARD off, on two datasets.] We implemented the following kernel function to support integer and categorical variables (cf. Section 2.1):

k_xx(x_i, x_j) = exp(Σ_{l=1}^d (−θ_l δ(x_{i,l} ≠ x_{j,l})))    (9)

where d is the number of dimensions (i.e., the number of configuration parameters), θ_l adjusts the scale along each dimension, and δ is a function that gives the distance between two categorical values using the Kronecker delta [20, 34]. TL4CO uses different scales {θ_l, l = 1...d} on different dimensions, as suggested in [41, 34]; this technique is called Automatic Relevance Determination (ARD). After learning the hyper-parameters (step 7 in Algorithm 1), if the l-th dimension (parameter) turns out to be irrelevant, θ_l will be small, and the dimension will therefore be automatically discarded. This is particularly helpful in high-dimensional spaces, where it is difficult to find the optimal configuration. Prior mean function: while the kernel controls the structure of the estimated function, the prior mean µ(x): X → R provides a possible offset for the estimation; by default, it is set to a constant µ(x) := µ inferred from the observations.
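A minimal sketch of the categorical/integer kernel in Eq. (9) with ARD scales follows; learning θ by likelihood maximization (step 7) is omitted here.

```python
# Categorical kernel with ARD, Eq. (9): k(x, x') = exp(sum_l -theta_l * [x_l != x'_l]).
import numpy as np

def categorical_ard_kernel(Xa, Xb, theta):
    """Xa, Xb: (n, d) and (m, d) arrays of option values; theta: (d,) scales."""
    mismatch = (Xa[:, None, :] != Xb[None, :, :]).astype(float)  # delta(x_l != x'_l)
    return np.exp(-(mismatch * theta).sum(axis=-1))

# A large learned theta_l makes a mismatch on dimension l decorrelate two
# configurations strongly; a near-zero theta_l lets the kernel ignore that
# dimension entirely, which is exactly the ARD pruning effect described above.
```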
  • 26. Effect of correlation between tasks. [Absolute error vs. iteration: TL4CO with a source task f(x) + randn × f(x), TL4CO with a more weakly correlated source task f(x) + 2·randn × f(x), and BO4CO.]
  • 27. Runtime overhead. [Elapsed time (s) vs. iteration: (a) TL4CO runtime for different datasets (WordCount 3D/5D/6D, SOL 6D, RollingSort 6D); (b) TL4CO runtime for different sizes of historical observations (T=2 with m=100/200/400; T=3 with m=100/200).]
  • 28. Key takeaways - A principled way to leverage prior knowledge gained from searches over previous versions of the system. - Multi-task GPs capture the correlation between related tuning tasks. - MTGPs are more accurate than single-task GPs and other regression models. - Applicable to stream processing systems (SPSs), batch, and NoSQL systems. - Leads to better performance in practice.
  • 29. Acknowledgement: BO4CO and TL4CO are now integrated with other DevOps tools in the delivery pipeline for Big Data in the H2020 DICE project. [DICE architecture diagram: the DICE IDE with profile and plugins (Sim, Ver, Opt) spans the DPIM/DTSM/DDSM and TOSCA methodology layers, connecting Deploy, Config, Test, Monitoring, Anomaly detection, Trace checking, Iterative Enhancement, Continuous Integration, and Fault Injection tools for Data-Intensive Applications (DIAs) on private/public cloud Big Data technologies, organized across work packages WP1-WP6 (including demonstrators).] Tool demo: http://www.slideshare.net/pooyanjamshidi/configuration-optimization-tool