SlideShare una empresa de Scribd logo
1 de 192
Descargar para leer sin conexión
Data Science,
Empirical SE and
Software
Analytics: It’s a
Family Affair!
1
Bram Adams

http://mcis.polymtl.ca
Who is
Bram Adams?
3
Wolfgang De Meuter
Vrije Universiteit Brussel
Herman Tromp
Ghent University
3
Ahmed E. Hassan
Queen's University
3
M
C IS
(Lab on Maintenance,
Construction and Intelligence
of Software)
3
Who is MCIS?
!4
http://mcis.polymtl.ca
Who is MCIS?
!4
Parisa
Doriane
Shivashree
Isabella
me Yutaro
M
C I
S
Javier
Armstrong
http://mcis.polymtl.ca
!5
https://conf.researchr.org/home/msr-2019
!5
https://conf.researchr.org/home/msr-2019
deadline mid
January 2019
Data Science,
Empirical SE and
Software
Analytics: It’s a
Family Affair!
6
Bram Adams

http://mcis.polymtl.ca
Dear Bram;

As per our conversation during ICSE, we are pleased to invite you as a speaker in IASESE
2018 co-located with ESEM conference. We are planning to have a discussion about
similarities, differences, and synergies between empirical
software engineering, data science, and mining
software repositories under the umbrella of "Empirical Software
Engineering and Data science: Old Wine in a New Bottle". we are looking into a talk around
45 - 60 minutes and a round table discussion.  

Attached please find the agenda for IASESE. Please don't hesitate to contact us in case of
any questions.

We hope to hear back from you soon;

Best Regards;

Silvia Abrahão, Maleknaz Nayebi
Data Science has Overtaken Other
Major Technologies in Popularity
https://trends.google.com/trends/explore?date=today%205-y,today%205-y,today%205-y,today%205-
y&geo=,,,&q=%22big%20data%22,%22data%20science%22,%22cloud%20computing%22,%22empirical%20research%22
Data Science has Overtaken Other
Major Technologies in Popularity
https://trends.google.com/trends/explore?date=today%205-y,today%205-y,today%205-y,today%205-
y&geo=,,,&q=%22big%20data%22,%22data%20science%22,%22cloud%20computing%22,%22empirical%20research%22
Data Science has Overtaken Other
Major Technologies in Popularity
big data
https://trends.google.com/trends/explore?date=today%205-y,today%205-y,today%205-y,today%205-
y&geo=,,,&q=%22big%20data%22,%22data%20science%22,%22cloud%20computing%22,%22empirical%20research%22
Data Science has Overtaken Other
Major Technologies in Popularity
big datacloud computing
https://trends.google.com/trends/explore?date=today%205-y,today%205-y,today%205-y,today%205-
y&geo=,,,&q=%22big%20data%22,%22data%20science%22,%22cloud%20computing%22,%22empirical%20research%22
Data Science has Overtaken Other
Major Technologies in Popularity
big datacloud computing
data science
https://trends.google.com/trends/explore?date=today%205-y,today%205-y,today%205-y,today%205-
y&geo=,,,&q=%22big%20data%22,%22data%20science%22,%22cloud%20computing%22,%22empirical%20research%22
Data Science has Overtaken Other
Major Technologies in Popularity
big datacloud computing
data scienceempirical research :-(
https://trends.google.com/trends/explore?date=today%205-y,today%205-y,today%205-y,today%205-
y&geo=,,,&q=%22big%20data%22,%22data%20science%22,%22cloud%20computing%22,%22empirical%20research%22
https://www.forbes.com/sites/louiscolumbus/2018/01/29/data-scientist-is-the-best-job-in-america-according-
glassdoors-2018-rankings/#c2d2ae55357e
https://elitedatascience.com/books
!11
!11
Software Data ScienceSciM
SoDaSciM 2018
would such a
renaming
make sense?
why (not)?
Part I: Data Science
!13
!13
!13
!13
https://www.forbes.com/sites/gilpress/2013/05/28/a-very-short-history-of-data-science/#539fcc0c55cf
!14
John Tukey, “The Future of Data Analysis”,	 Ann. Math. Statist., 33(1), pp. 1-67, 1962.
https://www.forbes.com/sites/gilpress/2013/05/28/a-very-short-history-of-data-science/#539fcc0c55cf
I have come to feel that my central
interest is in data analysis... Data
analysis, and the parts of statistics
which adhere to it, must...take on
the characteristics of science
rather than those of mathematics...
data analysis is intrinsically an
empirical science

[John Tukey, 1962]
https://en.wikipedia.org/w/index.php?curid=17099473 !14
Empirical Science?
Empirical Science?
testable hypothesis
Empirical Science?
testable hypothesis
validation
Empirical Science?
testable hypothesis
validation
validated
hypothesis
Empirical Science?
testable hypothesis
validation
validated
hypothesis
Empirical Science?
testable hypothesis
validation
rejected
hypothesis
validated
hypothesis
Empirical Science?
testable hypothesis
validation
rejected
hypothesis
validated
hypothesis
Empirical Science?
testable hypothesis
validation
rejected
hypothesis
validated
hypothesis
observation/
experience
Fayyad, Piatetsky-Shapiro, Smyth, "From Data Mining to Knowledge Discovery: An
Overview", in “Advances in Knowledge Discovery and Data Mining”, 1996, pp.1-34
1989: KDD
Workshop
& Process
!16
Fayyad, Piatetsky-Shapiro, Smyth, "From Data Mining to Knowledge Discovery: An
Overview", in “Advances in Knowledge Discovery and Data Mining”, 1996, pp.1-34
1989: KDD
Workshop
& Process
raw
data
!16
Fayyad, Piatetsky-Shapiro, Smyth, "From Data Mining to Knowledge Discovery: An
Overview", in “Advances in Knowledge Discovery and Data Mining”, 1996, pp.1-34
1989: KDD
Workshop
& Process
raw
data
knowledge
!16
Fayyad, Piatetsky-Shapiro, Smyth, "From Data Mining to Knowledge Discovery: An
Overview", in “Advances in Knowledge Discovery and Data Mining”, 1996, pp.1-34
1989: KDD
Workshop
& Process
raw
data
select
target data
knowledge
!16
Fayyad, Piatetsky-Shapiro, Smyth, "From Data Mining to Knowledge Discovery: An
Overview", in “Advances in Knowledge Discovery and Data Mining”, 1996, pp.1-34
1989: KDD
Workshop
& Process
raw
data
select
target data
knowledge
preprocess
clean data
!16
Fayyad, Piatetsky-Shapiro, Smyth, "From Data Mining to Knowledge Discovery: An
Overview", in “Advances in Knowledge Discovery and Data Mining”, 1996, pp.1-34
1989: KDD
Workshop
& Process
raw
data
select
target data
knowledge
transform
a b
preprocess
clean data
!16
Fayyad, Piatetsky-Shapiro, Smyth, "From Data Mining to Knowledge Discovery: An
Overview", in “Advances in Knowledge Discovery and Data Mining”, 1996, pp.1-34
1989: KDD
Workshop
& Process
raw
data
select
target data
knowledge
data mining
transform
a b
preprocess
clean data
!16
Fayyad, Piatetsky-Shapiro, Smyth, "From Data Mining to Knowledge Discovery: An
Overview", in “Advances in Knowledge Discovery and Data Mining”, 1996, pp.1-34
1989: KDD
Workshop
& Process
raw
data
interpret
select
target data
knowledge
data mining
transform
a b
preprocess
clean data
!16
Fayyad, Piatetsky-Shapiro, Smyth, "From Data Mining to Knowledge Discovery: An
Overview", in “Advances in Knowledge Discovery and Data Mining”, 1996, pp.1-34
1989: KDD
Workshop
& Process
raw
data
interpret
select
target data
knowledge
iterate
data mining
transform
a b
preprocess
clean data
!16
Fayyad, Piatetsky-Shapiro, Smyth, "From Data Mining to Knowledge Discovery: An
Overview", in “Advances in Knowledge Discovery and Data Mining”, 1996, pp.1-34
1989: KDD
Workshop
& Process
raw
data
interpret
select
target data
knowledge
iterate
data mining
transform
a b
preprocess
clean data
!16
1996/1998
!17
Data Science is not only a synthetic concept to unify
statistics, data analysis and their related methods but
also comprises its results. It includes three phases, design
for data, collection of data, and analysis on data.
1996/1998
!17
Data Science is not only a synthetic concept to unify
statistics, data analysis and their related methods but
also comprises its results. It includes three phases, design
for data, collection of data, and analysis on data.
1996/1998
!17
Data Science is not only a synthetic concept to unify
statistics, data analysis and their related methods but
also comprises its results. It includes three phases, design
for data, collection of data, and analysis on data.
1996/1998
!17
Hilary Mason & Chris Wiggins, http://www.dataists.com/2010/09/a-taxonomy-of-data-science/
2010: OSEMN
Process for
Data Science
!18
Hilary Mason & Chris Wiggins, http://www.dataists.com/2010/09/a-taxonomy-of-data-science/
2010: OSEMN
Process for
Data Science
!18
raw
data
Hilary Mason & Chris Wiggins, http://www.dataists.com/2010/09/a-taxonomy-of-data-science/
2010: OSEMN
Process for
Data Science
!18
raw
data
knowledge
Hilary Mason & Chris Wiggins, http://www.dataists.com/2010/09/a-taxonomy-of-data-science/
2010: OSEMN
Process for
Data Science
!18
Obtain
metrics
raw
data
knowledge
Hilary Mason & Chris Wiggins, http://www.dataists.com/2010/09/a-taxonomy-of-data-science/
2010: OSEMN
Process for
Data Science
!18
Obtain
metrics
raw
data
knowledge
Scrub
a b
Hilary Mason & Chris Wiggins, http://www.dataists.com/2010/09/a-taxonomy-of-data-science/
2010: OSEMN
Process for
Data Science
!18
Obtain
metrics
raw
data
knowledge
Scrub
a b
Explore
Hilary Mason & Chris Wiggins, http://www.dataists.com/2010/09/a-taxonomy-of-data-science/
2010: OSEMN
Process for
Data Science
!18
Obtain
metrics
raw
data
knowledge
Scrub
a b
Model
Explore
Hilary Mason & Chris Wiggins, http://www.dataists.com/2010/09/a-taxonomy-of-data-science/
2010: OSEMN
Process for
Data Science
!18
Obtain
metrics
raw
data
knowledge
Scrub
a b
iNterpret Model
Explore
Hilary Mason & Chris Wiggins, http://www.dataists.com/2010/09/a-taxonomy-of-data-science/
2010: OSEMN
Process for
Data Science
!18
Obtain
metrics
raw
data
knowledge
iterate
Scrub
a b
iNterpret Model
Explore
Hilary Mason & Chris Wiggins, http://www.dataists.com/2010/09/a-taxonomy-of-data-science/
2010: OSEMN
Process for
Data Science
!18
Obtain
metrics
raw
data
knowledge
iterate
Scrub
a b
iNterpret Model
Explore
Hilary Mason & Chris Wiggins, http://www.dataists.com/2010/09/a-taxonomy-of-data-science/
2010: OSEMN
Process for
Data Science
!18
Obtain
metrics
raw
data
knowledge
iterate
Scrub
a b
iNterpret Model
Explore
Hilary Mason & Chris Wiggins, http://www.dataists.com/2010/09/a-taxonomy-of-data-science/
raw
data
Obtain Scrub
ModeliNterpret
metrics
knowledge
iterate
a b
!19
Explore
Hilary Mason & Chris Wiggins, http://www.dataists.com/2010/09/a-taxonomy-of-data-science/
raw
data
Obtain Scrub
ModeliNterpret
metrics
knowledge
iterate
a b
!19
Explore
Exercise: Fairness of
Eurovision Song Contest
https://www.imdb.com/title/tt0800959/mediaviewer/rm623579392
Hilary Mason & Chris Wiggins, http://www.dataists.com/2010/09/a-taxonomy-of-data-science/
2010: OSEMN
Process for
Data Science
!21
Obtain
metrics
raw
data
knowledge
iterate
Scrub
a b
iNterpret Model
Explore
Why (Not) Data Science?
!22
customize software
tools and data
analysis techniques
to problem at hand
Why (Not) Data Science?
!22
exploit explosion in
computing power
(cloud!) to scale with
volume and velocity
of available data
sources
customize software
tools and data
analysis techniques
to problem at hand
Why (Not) Data Science?
!22
exploit explosion in
computing power
(cloud!) to scale with
volume and velocity
of available data
sources
lacking guidance on the
planning/problem
formulation side (due to
focus on exploratory/
bottom-up research)
customize software
tools and data
analysis techniques
to problem at hand
Why (Not) Data Science?
!22
Part II: Empirical SE
Where does “Raw Data”
Come from?
!24
Where does “Raw Data”
Come from?
any data
domain
!24
Where does “Raw Data”
Come from?
SE Development Process
any data
domain
!24
… and How does it Relate to
High-Level Questions?
!25
… and How does it Relate to
High-Level Questions?
What kinds of patches are
more likely to be accepted
by OSS projects?
!25
… and How does it Relate to
High-Level Questions?
What kinds of patches are
more likely to be accepted
by OSS projects?
Does our new dashboard for
code review improve
reviewers’ effectiveness?
!25
… and How does it Relate to
High-Level Questions?
What kinds of patches are
more likely to be accepted
by OSS projects?
Does our new dashboard for
code review improve
reviewers’ effectiveness?
Commit. Why do developers prefer
short commit messages?
!25
These Questions All Require
Empirical Evidence
• Why? SE development process is managed by humans,
hence no formal laws, rules, etc. exist that “prove” the
answer to these questions

• How? through “a posteriori” knowledge “observed” or
“experienced” through:

• quantitative analysis: how large is effect of given
manipulation or activity?

• qualitative analysis: why does something happen, from
perspective/point of view of subjects?
!26
Common Strategies to
Obtain Empirical Evidence
!27
Common Strategies to
Obtain Empirical Evidence
case study
(observational
study, no control)
!27
Common Strategies to
Obtain Empirical Evidence
case study
(observational
study, no control)
experiment
(assigning
treatments,
with control)
!27
Common Strategies to
Obtain Empirical Evidence
case study
(observational
study, no control)
experiment
(assigning
treatments,
with control)
survey
(interviews/
questionnaire)
!27
Claes Wohlin et al., “Experimentation in Software Engineering - An Introduction”, 2000.
Traditional
Empirical SE
Process
!28
Claes Wohlin et al., “Experimentation in Software Engineering - An Introduction”, 2000.
Traditional
Empirical SE
Process
!28
Claes Wohlin et al., “Experimentation in Software Engineering - An Introduction”, 2000.
Traditional
Empirical SE
Process
define
!28
Claes Wohlin et al., “Experimentation in Software Engineering - An Introduction”, 2000.
Traditional
Empirical SE
Process
define
plan
!28
Claes Wohlin et al., “Experimentation in Software Engineering - An Introduction”, 2000.
Traditional
Empirical SE
Process
define
plan
iterate
!28
Claes Wohlin et al., “Experimentation in Software Engineering - An Introduction”, 2000.
Traditional
Empirical SE
Process
define
plan
raw
dataoperate (case
study, experiment
or survey)
iterate
!28
Claes Wohlin et al., “Experimentation in Software Engineering - An Introduction”, 2000.
Traditional
Empirical SE
Process
define
plan
raw
dataoperate (case
study, experiment
or survey)
analyze/
interpret stats
iterate
!28
Claes Wohlin et al., “Experimentation in Software Engineering - An Introduction”, 2000.
Traditional
Empirical SE
Process
define
plan
raw
dataoperate (case
study, experiment
or survey)
analyze/
interpret stats
knowledge
present/
package
iterate
!28
Claes Wohlin et al., “Experimentation in Software Engineering - An Introduction”, 2000.
Traditional
Empirical SE
Process
define
plan
raw
dataoperate (case
study, experiment
or survey)
analyze/
interpret stats
knowledge
present/
package
iterate
iterate
!28
Claes Wohlin et al., “Experimentation in Software Engineering - An Introduction”, 2000.
Traditional
Empirical SE
Process
define
plan
raw
dataoperate (case
study, experiment
or survey)
analyze/
interpret stats
knowledge
present/
package
iterate
iterate
!28
Enhanced
Empirical SE
Process
define
plan
iterate
operate (case
study, experiment
or survey)
ESE
!29
Enhanced
Empirical SE
Process
define
plan
iterate
operate (case
study, experiment
or survey)
empirical SE
data science
raw
data
Obtain Scrub
Explore
Model
iNterpret
iterate
a b
OSEMN
ESE
!29
Why this Hybrid
Process?
!30
strong empirical
definition/planning/
operation base
Why this Hybrid
Process?
!30
strong empirical
definition/planning/
operation base
Why this Hybrid
Process?
!30
exploit explosion in
computing power
(cloud!) to scale with
volume and velocity of
available data sources
strong empirical
definition/planning/
operation base
Why this Hybrid
Process?
!30
exploit explosion in
computing power
(cloud!) to scale with
volume and velocity of
available data sources
customize software
tools and data
analysis techniques to
problem at hand
strong empirical
definition/planning/
operation base
Why this Hybrid
Process?
experiments, surveys and
even case studies require
large effort, depending on
available data sources
!30
exploit explosion in
computing power
(cloud!) to scale with
volume and velocity of
available data sources
customize software
tools and data
analysis techniques to
problem at hand
effort
empirical
strategy
case study experiment survey
!31
effort
empirical
strategy
case study experiment survey
!31
effort
empirical
strategy
case study experiment survey
!31
effort
empirical
strategy
case study experiment survey
!31
What if Accurate SE Process
Data would be Available?
effort
empirical
strategy
case study experiment survey
!31
What if Accurate SE Process
Data would be Available?
effort
empirical
strategy
case study experiment survey
!31
What if Accurate SE Process
Data would be Available?
effort
empirical
strategy
case study experiment survey
!31
What if Accurate SE Process
Data would be Available?
effort
empirical
strategy
case study experiment survey
!31
What if Accurate SE Process
Data would be Available?
effort
empirical
strategy
case study experiment survey
!31
What if Accurate SE Process
Data would be Available?
effort
empirical
strategy
case study experiment survey
enables more case studies!
!31
reduces effort!
Part III: Software
Analytics (aka MSR)
Mining Version Histories to Guide Software Changes
Thomas Zimmermann
tz@acm.org
Peter Weißgerber
weissger@st.cs.uni-sb.de
Stephan Diehl
diehl@acm.org
Andreas Zeller
zeller@acm.org
Saarland University, Saarbr¨ucken, Germany
Abstract
We apply data mining to version histories in order to
guide programmers along related changes: “Programmers
who changed these functions also changed...”. Given a
set of existing changes, such rules (a) suggest and predict
likely further changes, (b) show up item coupling that is in-
detectable by program analysis, and (c) prevent errors due
to incomplete changes. After an initial change, our ROSE
prototype can correctly predict 26% of further files to be
changed—and 15% of the precise functions or variables.
The topmost three suggestions contain a correct location
with a likelihood of 64%.
1. Introduction
Shopping for a book at Amazon.com, you may have come
across a section that reads “Customers who bought this
book also bought...”, listing other books that were typi-
cally included in the same purchase. Such information is
gathered by data mining— the automated extraction of hid-
den predictive information from large data sets. In this pa-
per, we apply data mining to version histories: “Program-
mers who changed these functions also changed. ..”. Just
like the Amazon.com feature helps the customer browsing
along related items, our ROSE tool guides the programmer
along related changes, with the following aims:
each time some programmer extended the fKeys[] ar-
ray, she also extended the function that sets the pref-
erence default values. If the programmer now wanted
to commit her changes without altering the suggested
location, ROSE would issue a warning.
Detect coupling indetectable by program analysis. As
ROSE operates uniquely on the version history, it is
able to detect coupling between items that cannot be
detected by program analysis—including coupling
between items that are not even programs. In Figure 1,
position 3 on the list is an ECLIPSE HTML documen-
tation file with a confidence of 0.75—suggesting that
after adding the new preference, the documentation
should be updated, too.
ROSE is not the first tool to leverage version histories. In
earlier work (Section 7), researchers have used history data
to understand programs and their evolution [3], to detect
evolutionary coupling between files [8] or classes [4], or to
support navigation in the source code [6]. In contrast to this
state of the art, the present work
• uses full-fledged data mining techniques to obtain as-
sociation rules from version histories,
• detects coupling between fine-grained program enti-
ties such as functions or variables (rather than, say,
classes), thus increasing precision and integrating with
program analysis,
ZimmermannT,WeißgerberP,DiehlS,ZellerA.,“Miningversion
historiestoguidesoftwarechanges”,26thInternationalConferenceon
SoftwareEngineering(ICSE),pp.563–572,2004.
!34
Mining Version Histories to Guide Software Changes
Thomas Zimmermann
tz@acm.org
Peter Weißgerber
weissger@st.cs.uni-sb.de
Stephan Diehl
diehl@acm.org
Andreas Zeller
zeller@acm.org
Saarland University, Saarbr¨ucken, Germany
Abstract
We apply data mining to version histories in order to
guide programmers along related changes: “Programmers
who changed these functions also changed...”. Given a
set of existing changes, such rules (a) suggest and predict
likely further changes, (b) show up item coupling that is in-
detectable by program analysis, and (c) prevent errors due
to incomplete changes. After an initial change, our ROSE
prototype can correctly predict 26% of further files to be
changed—and 15% of the precise functions or variables.
The topmost three suggestions contain a correct location
with a likelihood of 64%.
1. Introduction
Shopping for a book at Amazon.com, you may have come
across a section that reads “Customers who bought this
book also bought...”, listing other books that were typi-
cally included in the same purchase. Such information is
gathered by data mining— the automated extraction of hid-
den predictive information from large data sets. In this pa-
per, we apply data mining to version histories: “Program-
mers who changed these functions also changed. ..”. Just
like the Amazon.com feature helps the customer browsing
along related items, our ROSE tool guides the programmer
along related changes, with the following aims:
each time some programmer extended the fKeys[] ar-
ray, she also extended the function that sets the pref-
erence default values. If the programmer now wanted
to commit her changes without altering the suggested
location, ROSE would issue a warning.
Detect coupling indetectable by program analysis. As
ROSE operates uniquely on the version history, it is
able to detect coupling between items that cannot be
detected by program analysis—including coupling
between items that are not even programs. In Figure 1,
position 3 on the list is an ECLIPSE HTML documen-
tation file with a confidence of 0.75—suggesting that
after adding the new preference, the documentation
should be updated, too.
ROSE is not the first tool to leverage version histories. In
earlier work (Section 7), researchers have used history data
to understand programs and their evolution [3], to detect
evolutionary coupling between files [8] or classes [4], or to
support navigation in the source code [6]. In contrast to this
state of the art, the present work
• uses full-fledged data mining techniques to obtain as-
sociation rules from version histories,
• detects coupling between fine-grained program enti-
ties such as functions or variables (rather than, say,
classes), thus increasing precision and integrating with
program analysis,
ZimmermannT,WeißgerberP,DiehlS,ZellerA.,“Miningversion
historiestoguidesoftwarechanges”,26thInternationalConferenceon
SoftwareEngineering(ICSE),pp.563–572,2004.
what other functions
should I change for
this task?
similar to: “what did
other people buy (in
the past)?”
!34
Mining Version Histories to Guide Software Changes
Thomas Zimmermann
tz@acm.org
Peter Weißgerber
weissger@st.cs.uni-sb.de
Stephan Diehl
diehl@acm.org
Andreas Zeller
zeller@acm.org
Saarland University, Saarbr¨ucken, Germany
Abstract
We apply data mining to version histories in order to
guide programmers along related changes: “Programmers
who changed these functions also changed...”. Given a
set of existing changes, such rules (a) suggest and predict
likely further changes, (b) show up item coupling that is in-
detectable by program analysis, and (c) prevent errors due
to incomplete changes. After an initial change, our ROSE
prototype can correctly predict 26% of further files to be
changed—and 15% of the precise functions or variables.
The topmost three suggestions contain a correct location
with a likelihood of 64%.
1. Introduction
Shopping for a book at Amazon.com, you may have come
across a section that reads “Customers who bought this
book also bought...”, listing other books that were typi-
cally included in the same purchase. Such information is
gathered by data mining— the automated extraction of hid-
den predictive information from large data sets. In this pa-
per, we apply data mining to version histories: “Program-
mers who changed these functions also changed. ..”. Just
like the Amazon.com feature helps the customer browsing
along related items, our ROSE tool guides the programmer
along related changes, with the following aims:
each time some programmer extended the fKeys[] ar-
ray, she also extended the function that sets the pref-
erence default values. If the programmer now wanted
to commit her changes without altering the suggested
location, ROSE would issue a warning.
Detect coupling indetectable by program analysis. As
ROSE operates uniquely on the version history, it is
able to detect coupling between items that cannot be
detected by program analysis—including coupling
between items that are not even programs. In Figure 1,
position 3 on the list is an ECLIPSE HTML documen-
tation file with a confidence of 0.75—suggesting that
after adding the new preference, the documentation
should be updated, too.
ROSE is not the first tool to leverage version histories. In
earlier work (Section 7), researchers have used history data
to understand programs and their evolution [3], to detect
evolutionary coupling between files [8] or classes [4], or to
support navigation in the source code [6]. In contrast to this
state of the art, the present work
• uses full-fledged data mining techniques to obtain as-
sociation rules from version histories,
• detects coupling between fine-grained program enti-
ties such as functions or variables (rather than, say,
classes), thus increasing precision and integrating with
program analysis,
ZimmermannT,WeißgerberP,DiehlS,ZellerA.,“Miningversion
historiestoguidesoftwarechanges”,26thInternationalConferenceon
SoftwareEngineering(ICSE),pp.563–572,2004.
what other functions
should I change for
this task?
similar to: “what did
other people buy (in
the past)?”
so: mine version control system tounderstand which functions werechanged together in the past, inorder to predict the future
!34
!35
Ahmed E. Hassan, “The road ahead for mining software repositories”, Frontiers of Software
Maintenance, pp. 48-57, 2008.
The Mining Software Repositories
(MSR) field analyzes and
cross-links the rich data
available in software
repositories to uncover
interesting and actionable
information about software
systems and projects

[Ahmed E. Hassan, 2008]
!36
What Repositories
are we Talking About?
!37
What Repositories
are we Talking About?
!37
What Repositories
are we Talking About?
!37
What Repositories
are we Talking About?
!37
What Repositories
are we Talking About?
!37
What Repositories
are we Talking About?
!37
What Repositories
are we Talking About?
!37
What Repositories
are we Talking About?
!37
!38
!38
!38
!38
recording
traces of
activities
actionable
insights
!38
recording
traces of
activities
#detailed
insight
#analyzed
repositories
How Many Repositories and
Projects are Required?
!39
#detailed
insight
#analyzed
repositories
How Many Repositories and
Projects are Required?
!39
#detailed
insight
#analyzed
repositories
How Many Repositories and
Projects are Required?
deep qualitative
analysis
possible, lack of
generalization
!39
#detailed
insight
#analyzed
repositories
How Many Repositories and
Projects are Required?
deep qualitative
analysis
possible, lack of
generalization
generalization of shallow
findings due to large-
scale quantitative
analysis
!39
#detailed
insight
#analyzed
repositories
How Many Repositories and
Projects are Required?
deep qualitative
analysis
possible, lack of
generalization
generalization of shallow
findings due to large-
scale quantitative
analysis
!39
#detailed
insight
#analyzed
repositories
How Many Repositories and
Projects are Required?
deep qualitative
analysis
possible, lack of
generalization
generalization of shallow
findings due to large-
scale quantitative
analysis
mix of
both
!39
Hold On,
What about
that “Cross-
Linking”?
!40
!41
!41
how did
reviewers react
to this commit?
!42
!43
!43
does this
commit fix a
bug or add a
new feature?
!44
Is it
Always
that Easy?
!45
Is it
Always
that Easy?
NO!
!45
Linking Repositories is
Challenging!
!46
lack of access to confidential/
closed repositories
not enough data left
after linking
imprecise heuristics
based on email
addresses, …
Linking Repositories is
Challenging!
no recorded links due
to lack of discipline
developers
ambiguous or
incorrect links
invalid identifiers,
e.g., after migration
to newer repository
technology
!46
OSEMN
Enhanced
Empirical SE
Process
define
plan
iterate
raw
data
Obtain Scrub
Explore
Model
iNterpret
iterate
operate (case
study, experiment
or survey)
empirical SE
data science
a b
ESE
!47
OSEMN
Enhanced
Empirical SE
Process
define
plan
iterate
raw
data
Obtain Scrub
Explore
Model
iNterpret
iterate
operate (case
study, experiment
or survey)
empirical SE
data science
a b
how to
incorporate
software
analytics?
ESE
!47
Today’s Empirical
SE Process
define
plan
iterate
Scrub
Explore
Model
iNterpret
iterate
operate (case
study, experiment
or survey)
empirical SE
data science
a b
LONSEMN
ESE
!48
Today’s Empirical
SE Process
define
plan
iterate
Scrub
Explore
Model
iNterpret
iterate
operate (case
study, experiment
or survey)
empirical SE
data science
software analytics
a b
LONSEMN
ESE
!48
Today’s Empirical
SE Process
define
plan
iterate
Scrub
Explore
Model
iNterpret
iterate
operate (case
study, experiment
or survey)
empirical SE
data science
software analytics
Link
a b
LONSEMN
ESE
!48
Today’s Empirical
SE Process
define
plan
iterate
Scrub
Explore
Model
iNterpret
iterate
operate (case
study, experiment
or survey)
empirical SE
data science
software analytics
Link
a b
ObtaiN
LONSEMN
ESE
!48
ObtaiN Activity should be
Time-aware!
!49
ObtaiN Activity should be
Time-aware!
!49
ObtaiN Activity should be
Time-aware!
check out snapshot of code
!49
ObtaiN Activity should be
Time-aware!
check out snapshot of code
apply time-
unaware
metric
extraction
!49
ObtaiN Activity should be
Time-aware!
check out snapshot of code
apply time-
unaware
metric
extraction
!49
ObtaiN Activity should be
Time-aware!
check out snapshot of code
apply time-
unaware
metric
extraction
!49
ObtaiN Activity should be
Time-aware!
check out snapshot of code
apply time-
unaware
metric
extraction
!49
ObtaiN Activity should be
Time-aware!
check out snapshot of code
apply time-
unaware
metric
extraction
!49
ObtaiN Activity should be
Time-aware!
check out snapshot of code
apply time-
unaware
metric
extraction
!49
ObtaiN Activity should be
Time-aware!
check out snapshot of code
apply time-
unaware
metric
extraction
!49
ObtaiN Activity should be
Time-aware!
check out snapshot of code
apply time-
unaware
metric
extraction
!49
what if
snapshot does
not compile?
ObtaiN Activity should be
Time-aware!
check out snapshot of code
apply time-
unaware
metric
extraction
!49
what if
snapshot does
not compile?
ObtaiN Activity should be
Time-aware!
check out snapshot of code
apply time-
unaware
metric
extraction
what if analysis is
time-intensive?
!49
what if
snapshot does
not compile?
ObtaiN Activity should be
Time-aware!
check out snapshot of code
apply time-
unaware
metric
extraction
what if more than
one branch?
what if analysis is
time-intensive?
!49
what if
snapshot does
not compile?
ObtaiN Activity should be
Time-aware!
check out snapshot of code
apply time-
unaware
metric
extraction
what if more than
one branch?
what if results of
prior snapshots
needed?
what if analysis is
time-intensive?
!49
Exercise: Predicting Bug-
introducing Commits (1)
!50
Today’s Empirical
SE Process
define
plan
iterate Link
Scrub
Explore
Model
iNterpret
iterate
operate (case
study, experiment
or survey)
ObtaiN
empirical SE
data science
software analytics
a b
LONSEMN
ESE
!51
Part IV: Now What?
!53
!53
The 3 concepts
are in fact
complementary!
… so can you just
submit your work
to venues of either
domain?
!53
The 3 concepts
are in fact
complementary!
Yes, you can!
(Depending on the
contributions you
want to stress)
!54
Yes, you can!
(Depending on the
contributions you
want to stress)
NOTE: “stress” !=
“include” (i.e., all activities
will be included, but with
different importance)
!54
empirical venues
(ESEM, EMSE, …)
define
plan
iterate
operate (case
study, experiment
or survey)
!55
empirical venues
(ESEM, EMSE, …)
define
plan
iterate
operate (case
study, experiment
or survey)
software analytics venues
(MSR, PROMISE, …)
Link
ObtaiN
!55
empirical venues
(ESEM, EMSE, …)
define
plan
iterate
operate (case
study, experiment
or survey)
software analytics venues
(MSR, PROMISE, …)
Link
ObtaiN
data science/general SE
venues (KDD, ICDM,
ICSE, TSE, …)
Scrub
Explore
Model
iNterpret
iterate
a b
!55
Exercise: Predicting Bug-
introducing Commits (2)
!56
Exercise: Predicting Bug-
introducing Commits (2)
How to sell this
work as an
ESEM paper?
!56
Exercise: Predicting Bug-
introducing Commits (2)
How to sell this
work as an
ESEM paper?
How to sell this
work as an MSR
paper?
!56
Exercise: Predicting Bug-
introducing Commits (2)
How to sell this
work as an
ESEM paper?
How to sell this
work as an MSR
paper?
How to sell this
work as an ICSE
paper?
!56
empirical venues
(ESEM, EMSE, …)
data science/general SE
venues (KDD, ICDM,
ICSE, TSE, …)
software analytics venues
(MSR, PROMISE, …)
Link
ObtaiN
define
plan
iterate
operate (case
study, experiment
or survey)
Scrub
Explore
Model
iNterpret
iterate
a b
!57
How Will Tomorrow’s ESE
Process Look Like?
!58
How Will Tomorrow’s ESE
Process Look Like?
hard to predict …
!58
How Will Tomorrow’s ESE
Process Look Like?
hard to predict …
… but very likely
an evolution of
today’s process!
!58
Empirical Science?
testable hypothesis
validation
rejected
hypothesis
validated
hypothesis
observation/
experience
Empirical Science?
testable hypothesis
validation
rejected
hypothesis
validated
hypothesis
observation/
experience
Claes Wohlin et al., “Experimentation in Software Engineering - An Introduction”, 2000.
Traditional
Empirical SE
Process
define
plan
raw
dataoperate (case
study, experiment
or survey)
iterate
!28
Empirical Science?
testable hypothesis
validation
rejected
hypothesis
validated
hypothesis
observation/
experience
Claes Wohlin et al., “Experimentation in Software Engineering - An Introduction”, 2000.
Traditional
Empirical SE
Process
define
plan
raw
dataoperate (case
study, experiment
or survey)
iterate
!28
actionable
insights
!37
recording
traces of
activities
Today’s Empirical
SE Process
define
plan
iterate Link
Scrub
Explore
Model
iNterpret
iterate
operate (case
study, experiment
or survey)
ObtaiN
empirical SE
data science
software analytics
a b
LONSEMN
ESE
!50
Empirical Science?
testable hypothesis
validation
rejected
hypothesis
validated
hypothesis
observation/
experience
Claes Wohlin et al., “Experimentation in Software Engineering - An Introduction”, 2000.
Traditional
Empirical SE
Process
define
plan
raw
dataoperate (case
study, experiment
or survey)
iterate
!28
actionable
insights
!37
recording
traces of
activities

Más contenido relacionado

Similar a Data Science, Empirical SE and Software Analytics: It’s a Family Affair!

Introduction to Data Mining and technologies .ppt
Introduction to Data Mining and technologies .pptIntroduction to Data Mining and technologies .ppt
Introduction to Data Mining and technologies .pptSangrangBargayary3
 
Machine Learning, Data Mining, and
Machine Learning, Data Mining, and Machine Learning, Data Mining, and
Machine Learning, Data Mining, and butest
 
data mining
data miningdata mining
data mininguoitc
 
Data Science and AI in Biomedicine: The World has Changed
Data Science and AI in Biomedicine: The World has ChangedData Science and AI in Biomedicine: The World has Changed
Data Science and AI in Biomedicine: The World has ChangedPhilip Bourne
 
Presentación Ciro Cattuto, ISI Foundation en VI Summit País Digital 2018
Presentación Ciro Cattuto, ISI Foundation en VI Summit País Digital 2018Presentación Ciro Cattuto, ISI Foundation en VI Summit País Digital 2018
Presentación Ciro Cattuto, ISI Foundation en VI Summit País Digital 2018PAÍS DIGITAL
 
Application of-statistics-in-CSE
Application of-statistics-in-CSEApplication of-statistics-in-CSE
Application of-statistics-in-CSEMashudRana9
 
Application of statistics in cse
Application of statistics in cseApplication of statistics in cse
Application of statistics in cseKrishno Dey
 
Demystifying Big Data, Data Science and Statistics, along with Machine Intell...
Demystifying Big Data, Data Science and Statistics, along with Machine Intell...Demystifying Big Data, Data Science and Statistics, along with Machine Intell...
Demystifying Big Data, Data Science and Statistics, along with Machine Intell...Prof. Dr. Diego Kuonen
 
Big Data, Data Science, Machine Intelligence and Learning: Demystification, C...
Big Data, Data Science, Machine Intelligence and Learning: Demystification, C...Big Data, Data Science, Machine Intelligence and Learning: Demystification, C...
Big Data, Data Science, Machine Intelligence and Learning: Demystification, C...Prof. Dr. Diego Kuonen
 
Big Data as the Fuel and Visual Analytics as the Engine Mount of the Digital ...
Big Data as the Fuel and Visual Analytics as the Engine Mount of the Digital ...Big Data as the Fuel and Visual Analytics as the Engine Mount of the Digital ...
Big Data as the Fuel and Visual Analytics as the Engine Mount of the Digital ...Prof. Dr. Diego Kuonen
 
Data Mining and Knowledge Discovery in Business Databases
Data Mining and Knowledge Discovery in Business DatabasesData Mining and Knowledge Discovery in Business Databases
Data Mining and Knowledge Discovery in Business Databasesbutest
 
"Data as the Fuel and Analytics as the Engine of the Digital Transformation -...
"Data as the Fuel and Analytics as the Engine of the Digital Transformation -..."Data as the Fuel and Analytics as the Engine of the Digital Transformation -...
"Data as the Fuel and Analytics as the Engine of the Digital Transformation -...Prof. Dr. Diego Kuonen
 
Introduction to Data Science 1115.pptx
Introduction to Data Science 1115.pptxIntroduction to Data Science 1115.pptx
Introduction to Data Science 1115.pptxmark828
 
Introduction to Data Science 1117.pptx
Introduction to Data Science 1117.pptxIntroduction to Data Science 1117.pptx
Introduction to Data Science 1117.pptxmark828
 
Introduction to Data Science 1116.pptx
Introduction to Data Science 1116.pptxIntroduction to Data Science 1116.pptx
Introduction to Data Science 1116.pptxmark828
 
Dia sds2015 web version
Dia sds2015 web versionDia sds2015 web version
Dia sds2015 web versionMichael Brodie
 
Data science innovations
Data science innovations Data science innovations
Data science innovations suresh sood
 

Similar a Data Science, Empirical SE and Software Analytics: It’s a Family Affair! (20)

Introduction to Data Mining and technologies .ppt
Introduction to Data Mining and technologies .pptIntroduction to Data Mining and technologies .ppt
Introduction to Data Mining and technologies .ppt
 
Machine Learning, Data Mining, and
Machine Learning, Data Mining, and Machine Learning, Data Mining, and
Machine Learning, Data Mining, and
 
Why Data Science is a Science
Why Data Science is a ScienceWhy Data Science is a Science
Why Data Science is a Science
 
data mining
data miningdata mining
data mining
 
Data Science and AI in Biomedicine: The World has Changed
Data Science and AI in Biomedicine: The World has ChangedData Science and AI in Biomedicine: The World has Changed
Data Science and AI in Biomedicine: The World has Changed
 
Presentación Ciro Cattuto, ISI Foundation en VI Summit País Digital 2018
Presentación Ciro Cattuto, ISI Foundation en VI Summit País Digital 2018Presentación Ciro Cattuto, ISI Foundation en VI Summit País Digital 2018
Presentación Ciro Cattuto, ISI Foundation en VI Summit País Digital 2018
 
Application of-statistics-in-CSE
Application of-statistics-in-CSEApplication of-statistics-in-CSE
Application of-statistics-in-CSE
 
Application of statistics in cse
Application of statistics in cseApplication of statistics in cse
Application of statistics in cse
 
Demystifying Big Data, Data Science and Statistics, along with Machine Intell...
Demystifying Big Data, Data Science and Statistics, along with Machine Intell...Demystifying Big Data, Data Science and Statistics, along with Machine Intell...
Demystifying Big Data, Data Science and Statistics, along with Machine Intell...
 
Big Data, Data Science, Machine Intelligence and Learning: Demystification, C...
Big Data, Data Science, Machine Intelligence and Learning: Demystification, C...Big Data, Data Science, Machine Intelligence and Learning: Demystification, C...
Big Data, Data Science, Machine Intelligence and Learning: Demystification, C...
 
Big Data as the Fuel and Visual Analytics as the Engine Mount of the Digital ...
Big Data as the Fuel and Visual Analytics as the Engine Mount of the Digital ...Big Data as the Fuel and Visual Analytics as the Engine Mount of the Digital ...
Big Data as the Fuel and Visual Analytics as the Engine Mount of the Digital ...
 
Data Mining and Knowledge Discovery in Business Databases
Data Mining and Knowledge Discovery in Business DatabasesData Mining and Knowledge Discovery in Business Databases
Data Mining and Knowledge Discovery in Business Databases
 
Ayasdi Case Study
Ayasdi Case StudyAyasdi Case Study
Ayasdi Case Study
 
Ayasdi: Demystifying the Unknown
Ayasdi: Demystifying the UnknownAyasdi: Demystifying the Unknown
Ayasdi: Demystifying the Unknown
 
"Data as the Fuel and Analytics as the Engine of the Digital Transformation -...
"Data as the Fuel and Analytics as the Engine of the Digital Transformation -..."Data as the Fuel and Analytics as the Engine of the Digital Transformation -...
"Data as the Fuel and Analytics as the Engine of the Digital Transformation -...
 
Introduction to Data Science 1115.pptx
Introduction to Data Science 1115.pptxIntroduction to Data Science 1115.pptx
Introduction to Data Science 1115.pptx
 
Introduction to Data Science 1117.pptx
Introduction to Data Science 1117.pptxIntroduction to Data Science 1117.pptx
Introduction to Data Science 1117.pptx
 
Introduction to Data Science 1116.pptx
Introduction to Data Science 1116.pptxIntroduction to Data Science 1116.pptx
Introduction to Data Science 1116.pptx
 
Dia sds2015 web version
Dia sds2015 web versionDia sds2015 web version
Dia sds2015 web version
 
Data science innovations
Data science innovations Data science innovations
Data science innovations
 

Más de Bram Adams

Modern Release Engineering in a Nutshell - Why Researchers should Care!
Modern Release Engineering in a Nutshell - Why Researchers should Care!Modern Release Engineering in a Nutshell - Why Researchers should Care!
Modern Release Engineering in a Nutshell - Why Researchers should Care!Bram Adams
 
The Bash Dashboard (Or: How to Use Bash for Data Analysis)
The Bash Dashboard (Or: How to Use Bash for Data Analysis)The Bash Dashboard (Or: How to Use Bash for Data Analysis)
The Bash Dashboard (Or: How to Use Bash for Data Analysis)Bram Adams
 
Why do Automated Builds Break? An Empirical Study (ICSME 2014)
Why do Automated Builds Break? An Empirical Study (ICSME 2014)Why do Automated Builds Break? An Empirical Study (ICSME 2014)
Why do Automated Builds Break? An Empirical Study (ICSME 2014)Bram Adams
 
The Evolution of the R Software Ecosystem (CSMR 2013)
The Evolution of the R Software Ecosystem (CSMR 2013)The Evolution of the R Software Ecosystem (CSMR 2013)
The Evolution of the R Software Ecosystem (CSMR 2013)Bram Adams
 
An Empirical Study of Build System Migrations in Practice (ICSM 2012)
An Empirical Study of Build System Migrations in Practice (ICSM 2012)An Empirical Study of Build System Migrations in Practice (ICSM 2012)
An Empirical Study of Build System Migrations in Practice (ICSM 2012)Bram Adams
 
On Software Release Engineering (Bram Adams)
On Software Release Engineering (Bram Adams)On Software Release Engineering (Bram Adams)
On Software Release Engineering (Bram Adams)Bram Adams
 
A Qualitative Study on Performance Bugs (MSR 2012)
A Qualitative Study on Performance Bugs (MSR 2012)A Qualitative Study on Performance Bugs (MSR 2012)
A Qualitative Study on Performance Bugs (MSR 2012)Bram Adams
 

Más de Bram Adams (7)

Modern Release Engineering in a Nutshell - Why Researchers should Care!
Modern Release Engineering in a Nutshell - Why Researchers should Care!Modern Release Engineering in a Nutshell - Why Researchers should Care!
Modern Release Engineering in a Nutshell - Why Researchers should Care!
 
The Bash Dashboard (Or: How to Use Bash for Data Analysis)
The Bash Dashboard (Or: How to Use Bash for Data Analysis)The Bash Dashboard (Or: How to Use Bash for Data Analysis)
The Bash Dashboard (Or: How to Use Bash for Data Analysis)
 
Why do Automated Builds Break? An Empirical Study (ICSME 2014)
Why do Automated Builds Break? An Empirical Study (ICSME 2014)Why do Automated Builds Break? An Empirical Study (ICSME 2014)
Why do Automated Builds Break? An Empirical Study (ICSME 2014)
 
The Evolution of the R Software Ecosystem (CSMR 2013)
The Evolution of the R Software Ecosystem (CSMR 2013)The Evolution of the R Software Ecosystem (CSMR 2013)
The Evolution of the R Software Ecosystem (CSMR 2013)
 
An Empirical Study of Build System Migrations in Practice (ICSM 2012)
An Empirical Study of Build System Migrations in Practice (ICSM 2012)An Empirical Study of Build System Migrations in Practice (ICSM 2012)
An Empirical Study of Build System Migrations in Practice (ICSM 2012)
 
On Software Release Engineering (Bram Adams)
On Software Release Engineering (Bram Adams)On Software Release Engineering (Bram Adams)
On Software Release Engineering (Bram Adams)
 
A Qualitative Study on Performance Bugs (MSR 2012)
A Qualitative Study on Performance Bugs (MSR 2012)A Qualitative Study on Performance Bugs (MSR 2012)
A Qualitative Study on Performance Bugs (MSR 2012)
 

Último

High Profile 🔝 8250077686 📞 Call Girls Service in GTB Nagar🍑
High Profile 🔝 8250077686 📞 Call Girls Service in GTB Nagar🍑High Profile 🔝 8250077686 📞 Call Girls Service in GTB Nagar🍑
High Profile 🔝 8250077686 📞 Call Girls Service in GTB Nagar🍑Damini Dixit
 
Chemistry 4th semester series (krishna).pdf
Chemistry 4th semester series (krishna).pdfChemistry 4th semester series (krishna).pdf
Chemistry 4th semester series (krishna).pdfSumit Kumar yadav
 
Botany 4th semester file By Sumit Kumar yadav.pdf
Botany 4th semester file By Sumit Kumar yadav.pdfBotany 4th semester file By Sumit Kumar yadav.pdf
Botany 4th semester file By Sumit Kumar yadav.pdfSumit Kumar yadav
 
Factory Acceptance Test( FAT).pptx .
Factory Acceptance Test( FAT).pptx       .Factory Acceptance Test( FAT).pptx       .
Factory Acceptance Test( FAT).pptx .Poonam Aher Patil
 
Zoology 4th semester series (krishna).pdf
Zoology 4th semester series (krishna).pdfZoology 4th semester series (krishna).pdf
Zoology 4th semester series (krishna).pdfSumit Kumar yadav
 
Call Girls Alandi Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Alandi Call Me 7737669865 Budget Friendly No Advance BookingCall Girls Alandi Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Alandi Call Me 7737669865 Budget Friendly No Advance Bookingroncy bisnoi
 
Formation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disksFormation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disksSérgio Sacani
 
SAMASTIPUR CALL GIRL 7857803690 LOW PRICE ESCORT SERVICE
SAMASTIPUR CALL GIRL 7857803690  LOW PRICE  ESCORT SERVICESAMASTIPUR CALL GIRL 7857803690  LOW PRICE  ESCORT SERVICE
SAMASTIPUR CALL GIRL 7857803690 LOW PRICE ESCORT SERVICEayushi9330
 
GBSN - Biochemistry (Unit 1)
GBSN - Biochemistry (Unit 1)GBSN - Biochemistry (Unit 1)
GBSN - Biochemistry (Unit 1)Areesha Ahmad
 
American Type Culture Collection (ATCC).pptx
American Type Culture Collection (ATCC).pptxAmerican Type Culture Collection (ATCC).pptx
American Type Culture Collection (ATCC).pptxabhishekdhamu51
 
Justdial Call Girls In Indirapuram, Ghaziabad, 8800357707 Escorts Service
Justdial Call Girls In Indirapuram, Ghaziabad, 8800357707 Escorts ServiceJustdial Call Girls In Indirapuram, Ghaziabad, 8800357707 Escorts Service
Justdial Call Girls In Indirapuram, Ghaziabad, 8800357707 Escorts Servicemonikaservice1
 
GBSN - Microbiology (Unit 3)
GBSN - Microbiology (Unit 3)GBSN - Microbiology (Unit 3)
GBSN - Microbiology (Unit 3)Areesha Ahmad
 
Hire 💕 9907093804 Hooghly Call Girls Service Call Girls Agency
Hire 💕 9907093804 Hooghly Call Girls Service Call Girls AgencyHire 💕 9907093804 Hooghly Call Girls Service Call Girls Agency
Hire 💕 9907093804 Hooghly Call Girls Service Call Girls AgencySheetal Arora
 
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdfPests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdfPirithiRaju
 
Chemical Tests; flame test, positive and negative ions test Edexcel Internati...
Chemical Tests; flame test, positive and negative ions test Edexcel Internati...Chemical Tests; flame test, positive and negative ions test Edexcel Internati...
Chemical Tests; flame test, positive and negative ions test Edexcel Internati...ssuser79fe74
 
Conjugation, transduction and transformation
Conjugation, transduction and transformationConjugation, transduction and transformation
Conjugation, transduction and transformationAreesha Ahmad
 
Pests of mustard_Identification_Management_Dr.UPR.pdf
Pests of mustard_Identification_Management_Dr.UPR.pdfPests of mustard_Identification_Management_Dr.UPR.pdf
Pests of mustard_Identification_Management_Dr.UPR.pdfPirithiRaju
 
Bacterial Identification and Classifications
Bacterial Identification and ClassificationsBacterial Identification and Classifications
Bacterial Identification and ClassificationsAreesha Ahmad
 
Feature-aligned N-BEATS with Sinkhorn divergence (ICLR '24)
Feature-aligned N-BEATS with Sinkhorn divergence (ICLR '24)Feature-aligned N-BEATS with Sinkhorn divergence (ICLR '24)
Feature-aligned N-BEATS with Sinkhorn divergence (ICLR '24)Joonhun Lee
 

Último (20)

High Profile 🔝 8250077686 📞 Call Girls Service in GTB Nagar🍑
High Profile 🔝 8250077686 📞 Call Girls Service in GTB Nagar🍑High Profile 🔝 8250077686 📞 Call Girls Service in GTB Nagar🍑
High Profile 🔝 8250077686 📞 Call Girls Service in GTB Nagar🍑
 
Chemistry 4th semester series (krishna).pdf
Chemistry 4th semester series (krishna).pdfChemistry 4th semester series (krishna).pdf
Chemistry 4th semester series (krishna).pdf
 
Botany 4th semester file By Sumit Kumar yadav.pdf
Botany 4th semester file By Sumit Kumar yadav.pdfBotany 4th semester file By Sumit Kumar yadav.pdf
Botany 4th semester file By Sumit Kumar yadav.pdf
 
Factory Acceptance Test( FAT).pptx .
Factory Acceptance Test( FAT).pptx       .Factory Acceptance Test( FAT).pptx       .
Factory Acceptance Test( FAT).pptx .
 
Zoology 4th semester series (krishna).pdf
Zoology 4th semester series (krishna).pdfZoology 4th semester series (krishna).pdf
Zoology 4th semester series (krishna).pdf
 
Call Girls Alandi Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Alandi Call Me 7737669865 Budget Friendly No Advance BookingCall Girls Alandi Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Alandi Call Me 7737669865 Budget Friendly No Advance Booking
 
Formation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disksFormation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disks
 
SAMASTIPUR CALL GIRL 7857803690 LOW PRICE ESCORT SERVICE
SAMASTIPUR CALL GIRL 7857803690  LOW PRICE  ESCORT SERVICESAMASTIPUR CALL GIRL 7857803690  LOW PRICE  ESCORT SERVICE
SAMASTIPUR CALL GIRL 7857803690 LOW PRICE ESCORT SERVICE
 
GBSN - Biochemistry (Unit 1)
GBSN - Biochemistry (Unit 1)GBSN - Biochemistry (Unit 1)
GBSN - Biochemistry (Unit 1)
 
American Type Culture Collection (ATCC).pptx
American Type Culture Collection (ATCC).pptxAmerican Type Culture Collection (ATCC).pptx
American Type Culture Collection (ATCC).pptx
 
Justdial Call Girls In Indirapuram, Ghaziabad, 8800357707 Escorts Service
Justdial Call Girls In Indirapuram, Ghaziabad, 8800357707 Escorts ServiceJustdial Call Girls In Indirapuram, Ghaziabad, 8800357707 Escorts Service
Justdial Call Girls In Indirapuram, Ghaziabad, 8800357707 Escorts Service
 
GBSN - Microbiology (Unit 3)
GBSN - Microbiology (Unit 3)GBSN - Microbiology (Unit 3)
GBSN - Microbiology (Unit 3)
 
Hire 💕 9907093804 Hooghly Call Girls Service Call Girls Agency
Hire 💕 9907093804 Hooghly Call Girls Service Call Girls AgencyHire 💕 9907093804 Hooghly Call Girls Service Call Girls Agency
Hire 💕 9907093804 Hooghly Call Girls Service Call Girls Agency
 
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdfPests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
 
Chemical Tests; flame test, positive and negative ions test Edexcel Internati...
Chemical Tests; flame test, positive and negative ions test Edexcel Internati...Chemical Tests; flame test, positive and negative ions test Edexcel Internati...
Chemical Tests; flame test, positive and negative ions test Edexcel Internati...
 
Conjugation, transduction and transformation
Conjugation, transduction and transformationConjugation, transduction and transformation
Conjugation, transduction and transformation
 
Clean In Place(CIP).pptx .
Clean In Place(CIP).pptx                 .Clean In Place(CIP).pptx                 .
Clean In Place(CIP).pptx .
 
Pests of mustard_Identification_Management_Dr.UPR.pdf
Pests of mustard_Identification_Management_Dr.UPR.pdfPests of mustard_Identification_Management_Dr.UPR.pdf
Pests of mustard_Identification_Management_Dr.UPR.pdf
 
Bacterial Identification and Classifications
Bacterial Identification and ClassificationsBacterial Identification and Classifications
Bacterial Identification and Classifications
 
Feature-aligned N-BEATS with Sinkhorn divergence (ICLR '24)
Feature-aligned N-BEATS with Sinkhorn divergence (ICLR '24)Feature-aligned N-BEATS with Sinkhorn divergence (ICLR '24)
Feature-aligned N-BEATS with Sinkhorn divergence (ICLR '24)
 

Data Science, Empirical SE and Software Analytics: It’s a Family Affair!

  • 1. Data Science, Empirical SE and Software Analytics: It’s a Family Affair! 1 Bram Adams http://mcis.polymtl.ca
  • 3. 3
  • 4. Wolfgang De Meuter Vrije Universiteit Brussel Herman Tromp Ghent University 3
  • 5. Ahmed E. Hassan Queen's University 3
  • 6. M C IS (Lab on Maintenance, Construction and Intelligence of Software) 3
  • 8. Who is MCIS? !4 Parisa Doriane Shivashree Isabella me Yutaro M C I S Javier Armstrong http://mcis.polymtl.ca
  • 11. Data Science, Empirical SE and Software Analytics: It’s a Family Affair! 6 Bram Adams http://mcis.polymtl.ca
  • 12. Dear Bram; As per our conversation during ICSE, we are pleased to invite you as a speaker in IASESE 2018 co-located with ESEM conference. We are planning to have a discussion about similarities, differences, and synergies between empirical software engineering, data science, and mining software repositories under the umbrella of "Empirical Software Engineering and Data science: Old Wine in a New Bottle". we are looking into a talk around 45 - 60 minutes and a round table discussion.  Attached please find the agenda for IASESE. Please don't hesitate to contact us in case of any questions. We hope to hear back from you soon; Best Regards; Silvia Abrahão, Maleknaz Nayebi
  • 13. Data Science has Overtaken Other Major Technologies in Popularity https://trends.google.com/trends/explore?date=today%205-y,today%205-y,today%205-y,today%205- y&geo=,,,&q=%22big%20data%22,%22data%20science%22,%22cloud%20computing%22,%22empirical%20research%22
  • 14. Data Science has Overtaken Other Major Technologies in Popularity https://trends.google.com/trends/explore?date=today%205-y,today%205-y,today%205-y,today%205- y&geo=,,,&q=%22big%20data%22,%22data%20science%22,%22cloud%20computing%22,%22empirical%20research%22
  • 15. Data Science has Overtaken Other Major Technologies in Popularity big data https://trends.google.com/trends/explore?date=today%205-y,today%205-y,today%205-y,today%205- y&geo=,,,&q=%22big%20data%22,%22data%20science%22,%22cloud%20computing%22,%22empirical%20research%22
  • 16. Data Science has Overtaken Other Major Technologies in Popularity big datacloud computing https://trends.google.com/trends/explore?date=today%205-y,today%205-y,today%205-y,today%205- y&geo=,,,&q=%22big%20data%22,%22data%20science%22,%22cloud%20computing%22,%22empirical%20research%22
  • 17. Data Science has Overtaken Other Major Technologies in Popularity big datacloud computing data science https://trends.google.com/trends/explore?date=today%205-y,today%205-y,today%205-y,today%205- y&geo=,,,&q=%22big%20data%22,%22data%20science%22,%22cloud%20computing%22,%22empirical%20research%22
  • 18. Data Science has Overtaken Other Major Technologies in Popularity big datacloud computing data scienceempirical research :-( https://trends.google.com/trends/explore?date=today%205-y,today%205-y,today%205-y,today%205- y&geo=,,,&q=%22big%20data%22,%22data%20science%22,%22cloud%20computing%22,%22empirical%20research%22
  • 21. !11
  • 22. !11 Software Data ScienceSciM SoDaSciM 2018 would such a renaming make sense? why (not)?
  • 23. Part I: Data Science
  • 24. !13
  • 25. !13
  • 26. !13
  • 27. !13
  • 29. John Tukey, “The Future of Data Analysis”, Ann. Math. Statist., 33(1), pp. 1-67, 1962. https://www.forbes.com/sites/gilpress/2013/05/28/a-very-short-history-of-data-science/#539fcc0c55cf I have come to feel that my central interest is in data analysis... Data analysis, and the parts of statistics which adhere to it, must...take on the characteristics of science rather than those of mathematics... data analysis is intrinsically an empirical science [John Tukey, 1962] https://en.wikipedia.org/w/index.php?curid=17099473 !14
  • 38. Fayyad, Piatetsky-Shapiro, Smyth, "From Data Mining to Knowledge Discovery: An Overview", in “Advances in Knowledge Discovery and Data Mining”, 1996, pp.1-34 1989: KDD Workshop & Process !16
  • 39. Fayyad, Piatetsky-Shapiro, Smyth, "From Data Mining to Knowledge Discovery: An Overview", in “Advances in Knowledge Discovery and Data Mining”, 1996, pp.1-34 1989: KDD Workshop & Process raw data !16
  • 40. Fayyad, Piatetsky-Shapiro, Smyth, "From Data Mining to Knowledge Discovery: An Overview", in “Advances in Knowledge Discovery and Data Mining”, 1996, pp.1-34 1989: KDD Workshop & Process raw data knowledge !16
  • 41. Fayyad, Piatetsky-Shapiro, Smyth, "From Data Mining to Knowledge Discovery: An Overview", in “Advances in Knowledge Discovery and Data Mining”, 1996, pp.1-34 1989: KDD Workshop & Process raw data select target data knowledge !16
  • 42. Fayyad, Piatetsky-Shapiro, Smyth, "From Data Mining to Knowledge Discovery: An Overview", in “Advances in Knowledge Discovery and Data Mining”, 1996, pp.1-34 1989: KDD Workshop & Process raw data select target data knowledge preprocess clean data !16
  • 43. Fayyad, Piatetsky-Shapiro, Smyth, "From Data Mining to Knowledge Discovery: An Overview", in “Advances in Knowledge Discovery and Data Mining”, 1996, pp.1-34 1989: KDD Workshop & Process raw data select target data knowledge transform a b preprocess clean data !16
  • 44. Fayyad, Piatetsky-Shapiro, Smyth, "From Data Mining to Knowledge Discovery: An Overview", in “Advances in Knowledge Discovery and Data Mining”, 1996, pp.1-34 1989: KDD Workshop & Process raw data select target data knowledge data mining transform a b preprocess clean data !16
  • 45. Fayyad, Piatetsky-Shapiro, Smyth, "From Data Mining to Knowledge Discovery: An Overview", in “Advances in Knowledge Discovery and Data Mining”, 1996, pp.1-34 1989: KDD Workshop & Process raw data interpret select target data knowledge data mining transform a b preprocess clean data !16
  • 46. Fayyad, Piatetsky-Shapiro, Smyth, "From Data Mining to Knowledge Discovery: An Overview", in “Advances in Knowledge Discovery and Data Mining”, 1996, pp.1-34 1989: KDD Workshop & Process raw data interpret select target data knowledge iterate data mining transform a b preprocess clean data !16
  • 47. Fayyad, Piatetsky-Shapiro, Smyth, "From Data Mining to Knowledge Discovery: An Overview", in “Advances in Knowledge Discovery and Data Mining”, 1996, pp.1-34 1989: KDD Workshop & Process raw data interpret select target data knowledge iterate data mining transform a b preprocess clean data !16
  • 49. Data Science is not only a synthetic concept to unify statistics, data analysis and their related methods but also comprises its results. It includes three phases, design for data, collection of data, and analysis on data. 1996/1998 !17
  • 50. Data Science is not only a synthetic concept to unify statistics, data analysis and their related methods but also comprises its results. It includes three phases, design for data, collection of data, and analysis on data. 1996/1998 !17
  • 51. Data Science is not only a synthetic concept to unify statistics, data analysis and their related methods but also comprises its results. It includes three phases, design for data, collection of data, and analysis on data. 1996/1998 !17
  • 52. Hilary Mason & Chris Wiggins, http://www.dataists.com/2010/09/a-taxonomy-of-data-science/ 2010: OSEMN Process for Data Science !18
  • 53. Hilary Mason & Chris Wiggins, http://www.dataists.com/2010/09/a-taxonomy-of-data-science/ 2010: OSEMN Process for Data Science !18 raw data
  • 54. Hilary Mason & Chris Wiggins, http://www.dataists.com/2010/09/a-taxonomy-of-data-science/ 2010: OSEMN Process for Data Science !18 raw data knowledge
  • 55. Hilary Mason & Chris Wiggins, http://www.dataists.com/2010/09/a-taxonomy-of-data-science/ 2010: OSEMN Process for Data Science !18 Obtain metrics raw data knowledge
  • 56. Hilary Mason & Chris Wiggins, http://www.dataists.com/2010/09/a-taxonomy-of-data-science/ 2010: OSEMN Process for Data Science !18 Obtain metrics raw data knowledge Scrub a b
  • 57. Hilary Mason & Chris Wiggins, http://www.dataists.com/2010/09/a-taxonomy-of-data-science/ 2010: OSEMN Process for Data Science !18 Obtain metrics raw data knowledge Scrub a b Explore
  • 58. Hilary Mason & Chris Wiggins, http://www.dataists.com/2010/09/a-taxonomy-of-data-science/ 2010: OSEMN Process for Data Science !18 Obtain metrics raw data knowledge Scrub a b Model Explore
  • 59. Hilary Mason & Chris Wiggins, http://www.dataists.com/2010/09/a-taxonomy-of-data-science/ 2010: OSEMN Process for Data Science !18 Obtain metrics raw data knowledge Scrub a b iNterpret Model Explore
  • 60. Hilary Mason & Chris Wiggins, http://www.dataists.com/2010/09/a-taxonomy-of-data-science/ 2010: OSEMN Process for Data Science !18 Obtain metrics raw data knowledge iterate Scrub a b iNterpret Model Explore
  • 61. Hilary Mason & Chris Wiggins, http://www.dataists.com/2010/09/a-taxonomy-of-data-science/ 2010: OSEMN Process for Data Science !18 Obtain metrics raw data knowledge iterate Scrub a b iNterpret Model Explore
  • 62. Hilary Mason & Chris Wiggins, http://www.dataists.com/2010/09/a-taxonomy-of-data-science/ 2010: OSEMN Process for Data Science !18 Obtain metrics raw data knowledge iterate Scrub a b iNterpret Model Explore
  • 63. Hilary Mason & Chris Wiggins, http://www.dataists.com/2010/09/a-taxonomy-of-data-science/ raw data Obtain Scrub ModeliNterpret metrics knowledge iterate a b !19 Explore
  • 64. Hilary Mason & Chris Wiggins, http://www.dataists.com/2010/09/a-taxonomy-of-data-science/ raw data Obtain Scrub ModeliNterpret metrics knowledge iterate a b !19 Explore
  • 65. Exercise: Fairness of Eurovision Song Contest https://www.imdb.com/title/tt0800959/mediaviewer/rm623579392
  • 66. Hilary Mason & Chris Wiggins, http://www.dataists.com/2010/09/a-taxonomy-of-data-science/ 2010: OSEMN Process for Data Science !21 Obtain metrics raw data knowledge iterate Scrub a b iNterpret Model Explore
  • 67. Why (Not) Data Science? !22
  • 68. customize software tools and data analysis techniques to problem at hand Why (Not) Data Science? !22
  • 69. exploit explosion in computing power (cloud!) to scale with volume and velocity of available data sources customize software tools and data analysis techniques to problem at hand Why (Not) Data Science? !22
  • 70. exploit explosion in computing power (cloud!) to scale with volume and velocity of available data sources lacking guidance on the planning/problem formulation side (due to focus on exploratory/ bottom-up research) customize software tools and data analysis techniques to problem at hand Why (Not) Data Science? !22
  • 72. Where does “Raw Data” Come from? !24
  • 73. Where does “Raw Data” Come from? any data domain !24
  • 74. Where does “Raw Data” Come from? SE Development Process any data domain !24
  • 75. … and How does it Relate to High-Level Questions? !25
  • 76. … and How does it Relate to High-Level Questions? What kinds of patches are more likely to be accepted by OSS projects? !25
  • 77. … and How does it Relate to High-Level Questions? What kinds of patches are more likely to be accepted by OSS projects? Does our new dashboard for code review improve reviewers’ effectiveness? !25
  • 78. … and How does it Relate to High-Level Questions? What kinds of patches are more likely to be accepted by OSS projects? Does our new dashboard for code review improve reviewers’ effectiveness? Commit. Why do developers prefer short commit messages? !25
  • 79. These Questions All Require Empirical Evidence • Why? SE development process is managed by humans, hence no formal laws, rules, etc. exist that “prove” the answer to these questions • How? through “a posteriori” knowledge “observed” or “experienced” through: • quantitative analysis: how large is effect of given manipulation or activity? • qualitative analysis: why does something happen, from perspective/point of view of subjects? !26
  • 80. Common Strategies to Obtain Empirical Evidence !27
  • 81. Common Strategies to Obtain Empirical Evidence case study (observational study, no control) !27
  • 82. Common Strategies to Obtain Empirical Evidence case study (observational study, no control) experiment (assigning treatments, with control) !27
  • 83. Common Strategies to Obtain Empirical Evidence case study (observational study, no control) experiment (assigning treatments, with control) survey (interviews/ questionnaire) !27
  • 84. Claes Wohlin et al., “Experimentation in Software Engineering - An Introduction”, 2000. Traditional Empirical SE Process !28
  • 85. Claes Wohlin et al., “Experimentation in Software Engineering - An Introduction”, 2000. Traditional Empirical SE Process !28
  • 86. Claes Wohlin et al., “Experimentation in Software Engineering - An Introduction”, 2000. Traditional Empirical SE Process define !28
  • 87. Claes Wohlin et al., “Experimentation in Software Engineering - An Introduction”, 2000. Traditional Empirical SE Process define plan !28
  • 88. Claes Wohlin et al., “Experimentation in Software Engineering - An Introduction”, 2000. Traditional Empirical SE Process define plan iterate !28
  • 89. Claes Wohlin et al., “Experimentation in Software Engineering - An Introduction”, 2000. Traditional Empirical SE Process define plan raw dataoperate (case study, experiment or survey) iterate !28
  • 90. Claes Wohlin et al., “Experimentation in Software Engineering - An Introduction”, 2000. Traditional Empirical SE Process define plan raw dataoperate (case study, experiment or survey) analyze/ interpret stats iterate !28
  • 91. Claes Wohlin et al., “Experimentation in Software Engineering - An Introduction”, 2000. Traditional Empirical SE Process define plan raw dataoperate (case study, experiment or survey) analyze/ interpret stats knowledge present/ package iterate !28
  • 92. Claes Wohlin et al., “Experimentation in Software Engineering - An Introduction”, 2000. Traditional Empirical SE Process define plan raw dataoperate (case study, experiment or survey) analyze/ interpret stats knowledge present/ package iterate iterate !28
  • 93. Claes Wohlin et al., “Experimentation in Software Engineering - An Introduction”, 2000. Traditional Empirical SE Process define plan raw dataoperate (case study, experiment or survey) analyze/ interpret stats knowledge present/ package iterate iterate !28
  • 95. Enhanced Empirical SE Process define plan iterate operate (case study, experiment or survey) empirical SE data science raw data Obtain Scrub Explore Model iNterpret iterate a b OSEMN ESE !29
  • 98. strong empirical definition/planning/ operation base Why this Hybrid Process? !30 exploit explosion in computing power (cloud!) to scale with volume and velocity of available data sources
  • 99. strong empirical definition/planning/ operation base Why this Hybrid Process? !30 exploit explosion in computing power (cloud!) to scale with volume and velocity of available data sources customize software tools and data analysis techniques to problem at hand
  • 100. strong empirical definition/planning/ operation base Why this Hybrid Process? experiments, surveys and even case studies require large effort, depending on available data sources !30 exploit explosion in computing power (cloud!) to scale with volume and velocity of available data sources customize software tools and data analysis techniques to problem at hand
  • 105. What if Accurate SE Process Data would be Available? effort empirical strategy case study experiment survey !31
  • 106. What if Accurate SE Process Data would be Available? effort empirical strategy case study experiment survey !31
  • 107. What if Accurate SE Process Data would be Available? effort empirical strategy case study experiment survey !31
  • 108. What if Accurate SE Process Data would be Available? effort empirical strategy case study experiment survey !31
  • 109. What if Accurate SE Process Data would be Available? effort empirical strategy case study experiment survey !31
  • 110. What if Accurate SE Process Data would be Available? effort empirical strategy case study experiment survey enables more case studies! !31 reduces effort!
  • 112.
  • 113. Mining Version Histories to Guide Software Changes Thomas Zimmermann tz@acm.org Peter Weißgerber weissger@st.cs.uni-sb.de Stephan Diehl diehl@acm.org Andreas Zeller zeller@acm.org Saarland University, Saarbr¨ucken, Germany Abstract We apply data mining to version histories in order to guide programmers along related changes: “Programmers who changed these functions also changed...”. Given a set of existing changes, such rules (a) suggest and predict likely further changes, (b) show up item coupling that is in- detectable by program analysis, and (c) prevent errors due to incomplete changes. After an initial change, our ROSE prototype can correctly predict 26% of further files to be changed—and 15% of the precise functions or variables. The topmost three suggestions contain a correct location with a likelihood of 64%. 1. Introduction Shopping for a book at Amazon.com, you may have come across a section that reads “Customers who bought this book also bought...”, listing other books that were typi- cally included in the same purchase. Such information is gathered by data mining— the automated extraction of hid- den predictive information from large data sets. In this pa- per, we apply data mining to version histories: “Program- mers who changed these functions also changed. ..”. Just like the Amazon.com feature helps the customer browsing along related items, our ROSE tool guides the programmer along related changes, with the following aims: each time some programmer extended the fKeys[] ar- ray, she also extended the function that sets the pref- erence default values. If the programmer now wanted to commit her changes without altering the suggested location, ROSE would issue a warning. Detect coupling indetectable by program analysis. As ROSE operates uniquely on the version history, it is able to detect coupling between items that cannot be detected by program analysis—including coupling between items that are not even programs. In Figure 1, position 3 on the list is an ECLIPSE HTML documen- tation file with a confidence of 0.75—suggesting that after adding the new preference, the documentation should be updated, too. ROSE is not the first tool to leverage version histories. In earlier work (Section 7), researchers have used history data to understand programs and their evolution [3], to detect evolutionary coupling between files [8] or classes [4], or to support navigation in the source code [6]. In contrast to this state of the art, the present work • uses full-fledged data mining techniques to obtain as- sociation rules from version histories, • detects coupling between fine-grained program enti- ties such as functions or variables (rather than, say, classes), thus increasing precision and integrating with program analysis, ZimmermannT,WeißgerberP,DiehlS,ZellerA.,“Miningversion historiestoguidesoftwarechanges”,26thInternationalConferenceon SoftwareEngineering(ICSE),pp.563–572,2004. !34
  • 114. Mining Version Histories to Guide Software Changes Thomas Zimmermann tz@acm.org Peter Weißgerber weissger@st.cs.uni-sb.de Stephan Diehl diehl@acm.org Andreas Zeller zeller@acm.org Saarland University, Saarbr¨ucken, Germany Abstract We apply data mining to version histories in order to guide programmers along related changes: “Programmers who changed these functions also changed...”. Given a set of existing changes, such rules (a) suggest and predict likely further changes, (b) show up item coupling that is in- detectable by program analysis, and (c) prevent errors due to incomplete changes. After an initial change, our ROSE prototype can correctly predict 26% of further files to be changed—and 15% of the precise functions or variables. The topmost three suggestions contain a correct location with a likelihood of 64%. 1. Introduction Shopping for a book at Amazon.com, you may have come across a section that reads “Customers who bought this book also bought...”, listing other books that were typi- cally included in the same purchase. Such information is gathered by data mining— the automated extraction of hid- den predictive information from large data sets. In this pa- per, we apply data mining to version histories: “Program- mers who changed these functions also changed. ..”. Just like the Amazon.com feature helps the customer browsing along related items, our ROSE tool guides the programmer along related changes, with the following aims: each time some programmer extended the fKeys[] ar- ray, she also extended the function that sets the pref- erence default values. If the programmer now wanted to commit her changes without altering the suggested location, ROSE would issue a warning. Detect coupling indetectable by program analysis. As ROSE operates uniquely on the version history, it is able to detect coupling between items that cannot be detected by program analysis—including coupling between items that are not even programs. In Figure 1, position 3 on the list is an ECLIPSE HTML documen- tation file with a confidence of 0.75—suggesting that after adding the new preference, the documentation should be updated, too. ROSE is not the first tool to leverage version histories. In earlier work (Section 7), researchers have used history data to understand programs and their evolution [3], to detect evolutionary coupling between files [8] or classes [4], or to support navigation in the source code [6]. In contrast to this state of the art, the present work • uses full-fledged data mining techniques to obtain as- sociation rules from version histories, • detects coupling between fine-grained program enti- ties such as functions or variables (rather than, say, classes), thus increasing precision and integrating with program analysis, ZimmermannT,WeißgerberP,DiehlS,ZellerA.,“Miningversion historiestoguidesoftwarechanges”,26thInternationalConferenceon SoftwareEngineering(ICSE),pp.563–572,2004. what other functions should I change for this task? similar to: “what did other people buy (in the past)?” !34
  • 115. Mining Version Histories to Guide Software Changes Thomas Zimmermann tz@acm.org Peter Weißgerber weissger@st.cs.uni-sb.de Stephan Diehl diehl@acm.org Andreas Zeller zeller@acm.org Saarland University, Saarbr¨ucken, Germany Abstract We apply data mining to version histories in order to guide programmers along related changes: “Programmers who changed these functions also changed...”. Given a set of existing changes, such rules (a) suggest and predict likely further changes, (b) show up item coupling that is in- detectable by program analysis, and (c) prevent errors due to incomplete changes. After an initial change, our ROSE prototype can correctly predict 26% of further files to be changed—and 15% of the precise functions or variables. The topmost three suggestions contain a correct location with a likelihood of 64%. 1. Introduction Shopping for a book at Amazon.com, you may have come across a section that reads “Customers who bought this book also bought...”, listing other books that were typi- cally included in the same purchase. Such information is gathered by data mining— the automated extraction of hid- den predictive information from large data sets. In this pa- per, we apply data mining to version histories: “Program- mers who changed these functions also changed. ..”. Just like the Amazon.com feature helps the customer browsing along related items, our ROSE tool guides the programmer along related changes, with the following aims: each time some programmer extended the fKeys[] ar- ray, she also extended the function that sets the pref- erence default values. If the programmer now wanted to commit her changes without altering the suggested location, ROSE would issue a warning. Detect coupling indetectable by program analysis. As ROSE operates uniquely on the version history, it is able to detect coupling between items that cannot be detected by program analysis—including coupling between items that are not even programs. In Figure 1, position 3 on the list is an ECLIPSE HTML documen- tation file with a confidence of 0.75—suggesting that after adding the new preference, the documentation should be updated, too. ROSE is not the first tool to leverage version histories. In earlier work (Section 7), researchers have used history data to understand programs and their evolution [3], to detect evolutionary coupling between files [8] or classes [4], or to support navigation in the source code [6]. In contrast to this state of the art, the present work • uses full-fledged data mining techniques to obtain as- sociation rules from version histories, • detects coupling between fine-grained program enti- ties such as functions or variables (rather than, say, classes), thus increasing precision and integrating with program analysis, ZimmermannT,WeißgerberP,DiehlS,ZellerA.,“Miningversion historiestoguidesoftwarechanges”,26thInternationalConferenceon SoftwareEngineering(ICSE),pp.563–572,2004. what other functions should I change for this task? similar to: “what did other people buy (in the past)?” so: mine version control system tounderstand which functions werechanged together in the past, inorder to predict the future !34
  • 116. !35
  • 117. Ahmed E. Hassan, “The road ahead for mining software repositories”, Frontiers of Software Maintenance, pp. 48-57, 2008. The Mining Software Repositories (MSR) field analyzes and cross-links the rich data available in software repositories to uncover interesting and actionable information about software systems and projects [Ahmed E. Hassan, 2008] !36
  • 118. What Repositories are we Talking About? !37
  • 119. What Repositories are we Talking About? !37
  • 120. What Repositories are we Talking About? !37
  • 121. What Repositories are we Talking About? !37
  • 122. What Repositories are we Talking About? !37
  • 123. What Repositories are we Talking About? !37
  • 124. What Repositories are we Talking About? !37
  • 125. What Repositories are we Talking About? !37
  • 126. !38
  • 127. !38
  • 128. !38
  • 133. #detailed insight #analyzed repositories How Many Repositories and Projects are Required? deep qualitative analysis possible, lack of generalization !39
  • 134. #detailed insight #analyzed repositories How Many Repositories and Projects are Required? deep qualitative analysis possible, lack of generalization generalization of shallow findings due to large- scale quantitative analysis !39
  • 135. #detailed insight #analyzed repositories How Many Repositories and Projects are Required? deep qualitative analysis possible, lack of generalization generalization of shallow findings due to large- scale quantitative analysis !39
  • 136. #detailed insight #analyzed repositories How Many Repositories and Projects are Required? deep qualitative analysis possible, lack of generalization generalization of shallow findings due to large- scale quantitative analysis mix of both !39
  • 137. Hold On, What about that “Cross- Linking”? !40
  • 138. !41
  • 140. !42
  • 141. !43
  • 142. !43 does this commit fix a bug or add a new feature?
  • 143. !44
  • 147. lack of access to confidential/ closed repositories not enough data left after linking imprecise heuristics based on email addresses, … Linking Repositories is Challenging! no recorded links due to lack of discipline developers ambiguous or incorrect links invalid identifiers, e.g., after migration to newer repository technology !46
  • 149. OSEMN Enhanced Empirical SE Process define plan iterate raw data Obtain Scrub Explore Model iNterpret iterate operate (case study, experiment or survey) empirical SE data science a b how to incorporate software analytics? ESE !47
  • 150. Today’s Empirical SE Process define plan iterate Scrub Explore Model iNterpret iterate operate (case study, experiment or survey) empirical SE data science a b LONSEMN ESE !48
  • 151. Today’s Empirical SE Process define plan iterate Scrub Explore Model iNterpret iterate operate (case study, experiment or survey) empirical SE data science software analytics a b LONSEMN ESE !48
  • 152. Today’s Empirical SE Process define plan iterate Scrub Explore Model iNterpret iterate operate (case study, experiment or survey) empirical SE data science software analytics Link a b LONSEMN ESE !48
  • 153. Today’s Empirical SE Process define plan iterate Scrub Explore Model iNterpret iterate operate (case study, experiment or survey) empirical SE data science software analytics Link a b ObtaiN LONSEMN ESE !48
  • 154. ObtaiN Activity should be Time-aware! !49
  • 155. ObtaiN Activity should be Time-aware! !49
  • 156. ObtaiN Activity should be Time-aware! check out snapshot of code !49
  • 157. ObtaiN Activity should be Time-aware! check out snapshot of code apply time- unaware metric extraction !49
  • 158. ObtaiN Activity should be Time-aware! check out snapshot of code apply time- unaware metric extraction !49
  • 159. ObtaiN Activity should be Time-aware! check out snapshot of code apply time- unaware metric extraction !49
  • 160. ObtaiN Activity should be Time-aware! check out snapshot of code apply time- unaware metric extraction !49
  • 161. ObtaiN Activity should be Time-aware! check out snapshot of code apply time- unaware metric extraction !49
  • 162. ObtaiN Activity should be Time-aware! check out snapshot of code apply time- unaware metric extraction !49
  • 163. ObtaiN Activity should be Time-aware! check out snapshot of code apply time- unaware metric extraction !49
  • 164. ObtaiN Activity should be Time-aware! check out snapshot of code apply time- unaware metric extraction !49
  • 165. what if snapshot does not compile? ObtaiN Activity should be Time-aware! check out snapshot of code apply time- unaware metric extraction !49
  • 166. what if snapshot does not compile? ObtaiN Activity should be Time-aware! check out snapshot of code apply time- unaware metric extraction what if analysis is time-intensive? !49
  • 167. what if snapshot does not compile? ObtaiN Activity should be Time-aware! check out snapshot of code apply time- unaware metric extraction what if more than one branch? what if analysis is time-intensive? !49
  • 168. what if snapshot does not compile? ObtaiN Activity should be Time-aware! check out snapshot of code apply time- unaware metric extraction what if more than one branch? what if results of prior snapshots needed? what if analysis is time-intensive? !49
  • 170. Today’s Empirical SE Process define plan iterate Link Scrub Explore Model iNterpret iterate operate (case study, experiment or survey) ObtaiN empirical SE data science software analytics a b LONSEMN ESE !51
  • 171. Part IV: Now What?
  • 172. !53
  • 173. !53 The 3 concepts are in fact complementary!
  • 174. … so can you just submit your work to venues of either domain? !53 The 3 concepts are in fact complementary!
  • 175. Yes, you can! (Depending on the contributions you want to stress) !54
  • 176. Yes, you can! (Depending on the contributions you want to stress) NOTE: “stress” != “include” (i.e., all activities will be included, but with different importance) !54
  • 177. empirical venues (ESEM, EMSE, …) define plan iterate operate (case study, experiment or survey) !55
  • 178. empirical venues (ESEM, EMSE, …) define plan iterate operate (case study, experiment or survey) software analytics venues (MSR, PROMISE, …) Link ObtaiN !55
  • 179. empirical venues (ESEM, EMSE, …) define plan iterate operate (case study, experiment or survey) software analytics venues (MSR, PROMISE, …) Link ObtaiN data science/general SE venues (KDD, ICDM, ICSE, TSE, …) Scrub Explore Model iNterpret iterate a b !55
  • 181. Exercise: Predicting Bug- introducing Commits (2) How to sell this work as an ESEM paper? !56
  • 182. Exercise: Predicting Bug- introducing Commits (2) How to sell this work as an ESEM paper? How to sell this work as an MSR paper? !56
  • 183. Exercise: Predicting Bug- introducing Commits (2) How to sell this work as an ESEM paper? How to sell this work as an MSR paper? How to sell this work as an ICSE paper? !56
  • 184. empirical venues (ESEM, EMSE, …) data science/general SE venues (KDD, ICDM, ICSE, TSE, …) software analytics venues (MSR, PROMISE, …) Link ObtaiN define plan iterate operate (case study, experiment or survey) Scrub Explore Model iNterpret iterate a b !57
  • 185. How Will Tomorrow’s ESE Process Look Like? !58
  • 186. How Will Tomorrow’s ESE Process Look Like? hard to predict … !58
  • 187. How Will Tomorrow’s ESE Process Look Like? hard to predict … … but very likely an evolution of today’s process! !58
  • 188.
  • 190. Empirical Science? testable hypothesis validation rejected hypothesis validated hypothesis observation/ experience Claes Wohlin et al., “Experimentation in Software Engineering - An Introduction”, 2000. Traditional Empirical SE Process define plan raw dataoperate (case study, experiment or survey) iterate !28
  • 191. Empirical Science? testable hypothesis validation rejected hypothesis validated hypothesis observation/ experience Claes Wohlin et al., “Experimentation in Software Engineering - An Introduction”, 2000. Traditional Empirical SE Process define plan raw dataoperate (case study, experiment or survey) iterate !28 actionable insights !37 recording traces of activities
  • 192. Today’s Empirical SE Process define plan iterate Link Scrub Explore Model iNterpret iterate operate (case study, experiment or survey) ObtaiN empirical SE data science software analytics a b LONSEMN ESE !50 Empirical Science? testable hypothesis validation rejected hypothesis validated hypothesis observation/ experience Claes Wohlin et al., “Experimentation in Software Engineering - An Introduction”, 2000. Traditional Empirical SE Process define plan raw dataoperate (case study, experiment or survey) iterate !28 actionable insights !37 recording traces of activities