Software engineering research often requires analyzing
multiple revisions of several software projects, be it to make and
test predictions or to observe and identify patterns in how software evolves. However, code analysis tools are almost exclusively designed for the analysis of one specific version of the code, and the time and resources requirements grow linearly with each additional revision to be analyzed. Thus, code studies often observe a relatively small number of revisions and projects. Furthermore, each programming ecosystem provides dedicated tools, hence researchers typically only analyze code of one language, even when researching topics that should generalize
to other ecosystems. To alleviate these issues, frameworks and models have been developed to combine analysis tools or automate the analysis of multiple revisions, but little research has gone into actually removing redundancies in multi-revision, multi-language code analysis. We present a novel end-to-end approach that systematically avoids redundancies every step of the way: when reading sources from version control, during parsing, in the internal code representation, and during the actual analysis. We evaluate our open-source implementation, LISA, on the full
history of 300 projects, written in 3 different programming languages, computing basic code metrics for over 1.1 million program revisions. When analyzing many revisions, LISA requires less than a second on average to compute basic code metrics for all files in a single revision, even for projects consisting of millions of lines of code.
Reducing Redundancies in Multi-Revision Code Analysis
1. Carol V. Alexandru, Sebastiano Panichella, Harald C. Gall
Software Evolution and Architecture Lab
University of Zurich, Switzerland
{alexandru,panichella,gall}@ifi.uzh.ch
22.02.2017
SANER’17
Klagenfurt, Austria
6. A Typical Analysis Process
www
clone
checkout
select project
select revision
3
7. A Typical Analysis Process
www
clone
checkout analysis tool
Res
select project
select revision
apply
tool
store
analysis
results
3
8. A Typical Analysis Process
www
clone
checkout analysis tool
Res
select project
select revision
apply
tool
store
analysis
results
more revisions?
3
9. A Typical Analysis Process
www
clone
checkout analysis tool
Res
select project
select revision
apply
tool
store
analysis
results
more revisions?
more projects?
3
11. Redundancies all over...
Redundancies in
historical code analysis
Few files change
Only small parts of
a file change
Across Revisions
Impact on Code
Study Tools
4
12. Redundancies all over...
Redundancies in
historical code analysis
Few files change
Only small parts of
a file change
Across Revisions
Impact on Code
Study Tools
Repeated analysis
of "known" code
4
13. Redundancies all over...
Redundancies in
historical code analysis
Few files change
Only small parts of
a file change
Changes may not
even affect results
Across Revisions
Impact on Code
Study Tools
Repeated analysis
of "known" code
Storing redundant
results
4
14. Redundancies all over...
4
Redundancies in
historical code analysis
Few files change
Across Languages
Only small parts of
a file change
Changes may not
even affect results
Each language has
their own toolchain
Yet they share
many metrics
Across Revisions
Impact on Code
Study Tools
Repeated analysis
of "known" code
Storing redundant
results
15. Redundancies all over...
Redundancies in
historical code analysis
Few files change
Across Languages
Only small parts of
a file change
Changes may not
even affect results
Each language has
their own toolchain
Yet they share
many metrics
Across Revisions
Impact on Code
Study Tools
Repeated analysis
of "known" code
Storing redundant
results
Re-implementing
identical analyses
Generalizability is
expensive
4
16. Redundancies all over...
Redundancies in
historical code analysis
Few files change
Across Languages
Only small parts of
a file change
Changes may not
even affect results
Each language has
their own toolchain
Yet they share
many metrics
Across Revisions
Impact on Code
Study Tools
Repeated analysis
of "known" code
Storing redundant
results
Re-implementing
identical analyses
Generalizability is
expensive
5
22. Avoid checkouts
7
clone
checkout
read
For every file: 2 read ops + 1 write op
Checkout includes irrelevant files
Need 1 CWD for every revision to be analyzed in parallel
read write
analyze
25. Avoid checkouts
8
clone analyze
Only read relevant files in a single read op
No write ops
No overhead for parallization
Git
Analysis Tool
File Abstraction Layer
read
26. Avoid checkouts
8
clone analyze
Only read relevant files in a single read op
No write ops
No overhead for parallization
Git
Analysis Tool
File Abstraction Layer
E.g. for the JDK Compiler:
class JavaSourceFromCharrArray(name: String, val code: CharBuffer)
extends SimpleJavaFileObject(URI.create("string:///" + name), Kind.SOURCE) {
override def getCharContent(): CharSequence = code
}
read
27. Avoid checkouts
clone analyze
Only read relevant files in a single read op
No write ops
No overhead for parallization
Git
Analysis Tool
File Abstraction Layer
E.g. for the JDK Compiler:
class JavaSourceFromCharrArray(name: String, val code: CharBuffer)
extends SimpleJavaFileObject(URI.create("string:///" + name), Kind.SOURCE) {
override def getCharContent(): CharSequence = code
}
read
9
28. #2: Use a multi-revision representation
of your sources
39. #3: Store AST nodes only if
they're needed for analysis
40. public class Demo {
public void run() {
for (int i = 1; i< 100; i++) {
if (i % 3 == 0 || i % 5 == 0) {
System.out.println(i)
}
}
}
What's the complexity (1+#forks)
and name for each method and
class?
20
41. public class Demo {
public void run() {
for (int i = 1; i< 100; i++) {
if (i % 3 == 0 || i % 5 == 0) {
System.out.println(i)
}
}
}
140 AST nodes
(using ANTLR)
parse
What's the complexity (1+#forks)
and name for each method and
class?
20
42. public class Demo {
public void run() {
for (int i = 1; i< 100; i++) {
if (i % 3 == 0 || i % 5 == 0) {
System.out.println(i)
}
}
}
140 AST nodes
(using ANTLR)
parse
CompilationUnit
TypeDeclaration
Modifiers
public
Members
Method
Modifiers
public
Name
run
Name
Demo
Parameters ReturnType
PrimitiveType
VOID
Body
Statements
...
...
What's the complexity (1+#forks)
and name for each method and
class?
20
43. public class Demo {
public void run() {
for (int i = 1; i< 100; i++) {
if (i % 3 == 0 || i % 5 == 0) {
System.out.println(i)
}
}
}
What's the complexity (1+#forks)
and name for each method and
class?
140 AST nodes
(using ANTLR)
parse filtered parse
TypeDeclaration
Method
Name
run
Name
DemoForStatement
IfStatement
ConditionalExpression
7 AST nodes
(using ANTLR)
21
44. public class Demo {
public void run() {
for (int i = 1; i< 100; i++) {
if (i % 3 == 0 || i % 5 == 0) {
System.out.println(i)
}
}
}
What's the complexity (1+#forks)
and name for eachmethod and
class?
140 AST nodes
(using ANTLR)
parse filtered parse
TypeDeclaration
Method
Name
run
Name
DemoForStatement
IfStatement
ConditionalExpression
7 AST nodes
(using ANTLR)
22
45. public class Demo {
public void run() {
for (int i = 1; i< 100; i++) {
if (i % 3 == 0 || i % 5 == 0) {
System.out.println(i)
}
}
}
What's the complexity (1+#forks)
and name for eachmethod and
class?
140 AST nodes
(using ANTLR)
parse filtered parse
TypeDeclaration
Method
Name
run
Name
DemoForStatement
IfStatement
ConditionalExpression
7 AST nodes
(using ANTLR)
23
51. LISA also does:
#5: Parallel Parsing
#6: Asynchronous graph computation
#7: Generic graph computations
applying to ASTs from compatible
languages
26
54. The LISA Analysis Process
27
www
clone
parallel parse
into merged
graph
select project
Language Mappings
(Grammar to Analysis)
determines which AST
nodes are loaded
ANTLRv4
Grammar used by
Generates Parser
55. The LISA Analysis Process
27
www
clone
parallel parse
into merged
graph
Res
select project
store
analysis
results
Analysis formulated as
Graph Computation
Language Mappings
(Grammar to Analysis) used by
Async. compute
determines which AST
nodes are loaded
determines which
data is persisted
ANTLRv4
Grammar used by
Generates Parser
runs
on graph
56. The LISA Analysis Process
27
www
clone
parallel parse
into merged
graph
Res
select project
store
analysis
results
more projects?
Analysis formulated as
Graph Computation
Language Mappings
(Grammar to Analysis) used by
Async. compute
determines which AST
nodes are loaded
determines which
data is persisted
ANTLRv4
Grammar used by
Generates Parser
runs
on graph
61. The (not so) minor stuff
• Must implement analyses from scratch
• No help from a compiler
• Non-file-local analyses need some effort
30
62. The (not so) minor stuff
• Must implement analyses from scratch
• No help from a compiler
• Non-file-local analyses need some effort
• Moved files/methods etc. add overhead
• Uniquely identifying files/entities is hard
• (No impact on results, though)
30
65. Thank you for your attention
SANER ‘17, Klagenfurt, 22.02.2017
Read the paper: http://t.uzh.ch/Fj
Try the tool: http://t.uzh.ch/Fk
Get the slides: http://t.uzh.ch/Fm
Contact me: alexandru@ifi.uzh.ch