"Machine Learning for Software Maintainability":
Slides presented at the Joint Workshop on Intelligent Methods for Software System Engineering (JIMSE) 2012
2. SOFTWARE MAINTENANCE
"A software system must be continuously adapted during its overall life cycle or it progressively becomes less satisfactory"
(Lehman's Laws of Software Evolution)
• Software Maintenance is one of the most expensive and time-consuming phases of the whole life cycle
• Anticipating maintenance operations reduces the cost
• 85%-90% of the total cost is related to the effort necessary to comprehend the system and its source code [Erlikh, 2000]
3. Software Artifacts
[Diagram: layered architecture: Presentation (UI Components, UI Process Components), Services (Service Interfaces, Messages, Interfaces), Business (Application Facade, Business Workflows, Business Components), Data (Data Access Components, Data Helpers / Utilities), Cross-cutting (Security, Operational Management, Communications)]
• Provide models and views representing the relationships among different software artifacts
• Clustering of Software Artifacts
• Advantages:
  • To aid comprehension
  • To reduce maintenance effort
SOFTWARE ARCHITECTURE
4. SOFTWARE ARCHITECTURE
[Diagram: clusters of software artifacts: External Systems / Service Consumers; Services (Service Interfaces, Messages, Interfaces); Cross-Cutting (Security, Operational Management, Communications); Data (Data Access Components, Data Helpers / Utilities); Presentation (UI Components, UI Process Components); Business (Application Facade, Business Workflows, Business Components)]
Clusters of Software Artifacts
6. Software Artifacts may be analyzed at different levels of abstraction
SOFTWARE ARTIFACTS
8. Software Artifacts may be analyzed at different levels of abstraction
The different levels of abstraction lead to different analysis tasks:
• Identification of functional modules and their hierarchical arrangement
  • i.e., clustering of software classes
• Identification of Code Clones
  • i.e., clustering of duplicated code fragments (blocks, ...)
SOFTWARE ARTIFACTS
9. • Mine information directly from the source code:
  • Exploit the syntactic/lexical information provided in the source code text
  • Exploit the relational information between artifacts
    • e.g., program dependencies
Problem: definition of a proper similarity measure for the clustering analysis, one able to exploit the chosen representation of software artifacts
SOFTWARE ARTIFACTS CLUSTERING
10. • Analysis of large and complex systems
• Solutions and algorithms must be able to scale efficiently (in the large and in the many)
MINING LARGE REPOSITORIES
11. Idea: definition of Machine Learning techniques to mine information from the source code
• Combine different kinds of information (lexical and structural)
• Application of Kernel Methods to software artifacts
• Provide flexible and computationally effective solutions to analyze large data sets
ADVANCED MACHINE LEARNING FOR SOFTWARE MAINTENANCE
12. Advanced Machine Learning:
• Learning with syntactic/semantic information (Natural Language Processing)
• Learning in relational domains (structured-output learning, logic learning, Statistical Relational Learning)
ADVANCED MACHINE LEARNING FOR SOFTWARE MAINTENANCE
13. KERNEL METHODS FOR STRUCTURED DATA
• A kernel is a function defined on (arbitrary) pairs of entities
  • It can be seen as a kind of similarity measure
• Based on the idea that structured objects can be described in terms of their constituent parts
• Generalizes the computation of the dot product to arbitrary domains
• Can be easily tailored to specific domains
  • Tree Kernels
  • Graph Kernels
  • ...
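To make the "similarity over constituent parts" idea concrete, here is a minimal illustrative sketch (not from the slides): a convolution-style kernel that counts matching parts, which is exactly a dot product between part-count feature vectors.

```python
from collections import Counter

def parts_kernel(parts_x, parts_y):
    """Convolution-style kernel: count of exact matches over all
    pairs of parts. Equivalent to the dot product between the
    part-count feature vectors of the two objects."""
    cx, cy = Counter(parts_x), Counter(parts_y)
    return sum(cx[p] * cy[p] for p in cx if p in cy)

def bigrams(s):
    """Decompose a string into character-bigram 'parts'."""
    return [s[i:i + 2] for i in range(len(s) - 1)]

k = parts_kernel(bigrams("while"), bigrams("whale"))  # shared: "wh", "le" -> 2
```

The same recipe generalizes to trees and graphs by changing what counts as a "part".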
15. • Parse trees represent the syntactic structure of a sentence
• Tree Kernels can be used to measure the similarity between parse trees
KERNELS FOR LANGUAGES
16. • Abstract Syntax Trees (ASTs) represent the syntactic structure of a piece of code
• Research on Tree Kernels for NLP carries over to ASTs (with adjustments)
KERNELS FOR SOURCE CODE
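As a toy illustration of carrying the NLP idea over to code, the sketch below parses Python snippets with the standard `ast` module and compares bags of AST node types. This is only a crude stand-in for a real tree kernel, which compares (sub)tree structures rather than bags.

```python
import ast
from collections import Counter

def ast_node_types(code):
    """Multiset of AST node type names appearing in a code snippet."""
    return Counter(type(n).__name__ for n in ast.walk(ast.parse(code)))

def node_type_kernel(code_a, code_b):
    """Crude AST similarity: dot product of node-type counts.
    Real tree kernels match (sub)tree structures, not bags."""
    ca, cb = ast_node_types(code_a), ast_node_types(code_b)
    return sum(ca[t] * cb[t] for t in ca if t in cb)
```

For example, `node_type_kernel("x = 1", "y = 2")` scores two assignments as structurally identical even though their identifiers differ.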
19. • Supervised Learning
  • Binary Classification
  • Multi-class Classification
  • Ranking
• Unsupervised Learning
  • Clustering
  • Anomaly Detection
Idea: any learning algorithm relying on a similarity measure can be used
KERNEL MACHINES
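The point that any similarity-based learner applies can be illustrated with a kernel perceptron, a classic kernel machine that only ever touches the data through the kernel function (a toy sketch, not the method used in the slides):

```python
def kernel_perceptron(train, kernel, epochs=10):
    """Train a kernel perceptron: the model is a weight alpha per
    training example, so ANY kernel can be plugged in."""
    alpha = [0.0] * len(train)
    for _ in range(epochs):
        for i, (x, y) in enumerate(train):
            score = sum(a * yj * kernel(xj, x)
                        for a, (xj, yj) in zip(alpha, train) if a)
            if y * score <= 0:          # mistake-driven update
                alpha[i] += 1.0
    def predict(x):
        s = sum(a * yj * kernel(xj, x)
                for a, (xj, yj) in zip(alpha, train) if a)
        return 1 if s >= 0 else -1
    return predict

# Toy 1-D data, affine kernel so a bias term is implicitly available.
lin = lambda a, b: a * b + 1
train = [(-2.0, -1), (-1.0, -1), (1.0, 1), (2.0, 1)]
predict = kernel_perceptron(train, lin)
```

Swapping `lin` for a tree or graph kernel turns the same learner into a classifier over software artifacts.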
20. KERNEL MACHINES FOR CLONE DETECTION
• Supervised Learning
  • Pairwise classifier: predict whether a pair of fragments is a clone pair
• Unsupervised Learning
  • Clustering: cluster together all candidate clones
22. KERNEL LEARNING
• Construct a number of candidate kernels with different characteristics
  • e.g., ignore variable names or not
• Employ kernel learning approaches that learn a weighted combination of the candidate kernels
  • Useless/harmful kernels will get zero weight and will be discarded in the final model
LEARNING SIMILARITIES
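A minimal sketch of the weighted-combination idea. The candidate kernels and the fixed weights below are hypothetical; actual kernel learning would fit the weights from data (e.g., by multiple kernel learning), driving useless kernels to zero.

```python
def combine_kernels(kernels, weights):
    """A non-negative weighted sum of kernels is itself a valid
    kernel; a zero weight effectively discards that kernel."""
    assert len(kernels) == len(weights) and all(w >= 0 for w in weights)
    def k(x, y):
        return sum(w * kf(x, y) for w, kf in zip(weights, kernels) if w)
    return k

# Two toy candidate kernels on strings: one sensitive to the exact
# identifier, one that ignores names and compares lengths only.
k_exact = lambda a, b: float(a == b)
k_len   = lambda a, b: float(len(a) == len(b))

k = combine_kernels([k_exact, k_len], [0.0, 1.0])  # k_exact discarded
```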
23. Supervised Clustering
• Exploit information on already annotated pieces of software
• Training examples are software projects/portions annotated with existing clones (a clustering)
• A learning model uses the training examples to refine the similarity measure so that novel examples are clustered correctly
STRUCTURED-OUTPUT LEARNING
24. • Software has a rich structure and heterogeneous information
• Advanced Machine Learning approaches are promising for exploiting such information
• Kernel Methods are natural candidates
  • e.g., see the analogy between NLP parse trees and ASTs
• Many applications: architecture recovery, code clone detection, vulnerability detection, ...
SUMMARY
26. • Goal: "Identify and group all duplicated code fragments/functions"
• Copy&Paste programming
• Taxonomy of 4 different types of clones
  • Program-text similarities and functional similarities
• Clones affect the reliability and the maintainability of a software system
CODE CLONE DETECTION
27. Kernels for Structured Data: the source code can be represented by many different data structures
• Abstract Syntax Tree (AST)
  • Tree structure representing the syntactic structure of the different instructions of a program (function)
• Program Dependence Graph (PDG)
  • (Directed) graph structure representing the relationships among the different statements of a program
KERNELS FOR CLONES: CODE STRUCTURES
37. CODE STRUCTURES: PDG NODES AND EDGES
• Two types of nodes:
  • Control nodes (the dashed ones), e.g., if, for, while, function calls, ...
  • Data nodes, e.g., expressions, parameters, ...
[Diagram: PDG fragment with while, call-site, arg and expr nodes]
38. • Two types of edges (i.e., dependencies):
  • Control edges (the dashed ones)
  • Data edges
39. DEFINING KERNELS FOR STRUCTURED DATA
The definition of a new kernel for a structured object requires the definition of:
• a set of features to annotate each part of the object
• a kernel function to measure the similarity of the smallest parts of the object
  • e.g., the nodes of ASTs and graphs
• a kernel function to apply the computation to the different (sub)parts of the structured object
KERNELS FOR CODE STRUCTURES
40. • Features: each node is characterized by a set of 4 features
  • Instruction Class
    • e.g., LOOP, CONDITIONAL_STATEMENT, CALL
  • Instruction
    • e.g., FOR, IF, WHILE, RETURN
  • Context
    • i.e., the Instruction Class of the closest statement node
  • Lexemes
    • Lexical information gathered (recursively) from the leaves
KERNELS FOR CODE STRUCTURES: AST
TREE KERNELS FOR AST
[Diagram: AST fragment with FOR, FOR-INIT and FOR-BODY nodes]
41. Example: a FOR node is annotated with
  Instruction Class = LOOP
  Instruction = FOR
  Context = (e.g., LOOP)
  Lexemes = (e.g., names of the variables in FOR-INIT, ...)
[Diagram: the same AST fragment, with the FOR node annotated with the features above]
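The four node features above can be turned into a small node-level similarity. The sketch below is an illustrative assumption (exact matches on the categorical features, Jaccard overlap on the lexemes), not the authors' exact kernel:

```python
def node_kernel(n1, n2, w_lex=0.5):
    """Similarity of two AST nodes from the four slide features.
    Weighting is an illustrative assumption."""
    if n1["instr_class"] != n2["instr_class"]:
        return 0.0  # e.g., never match a LOOP against a CALL
    score = 1.0
    score += float(n1["instruction"] == n2["instruction"])
    score += float(n1["context"] == n2["context"])
    lex1, lex2 = set(n1["lexemes"]), set(n2["lexemes"])
    if lex1 | lex2:  # Jaccard overlap of the lexeme sets
        score += w_lex * len(lex1 & lex2) / len(lex1 | lex2)
    return score

for_node = {"instr_class": "LOOP", "instruction": "FOR",
            "context": "LOOP", "lexemes": {"i", "n"}}
while_node = {"instr_class": "LOOP", "instruction": "WHILE",
              "context": "LOOP", "lexemes": {"i"}}
```

Here `node_kernel(for_node, while_node)` rewards the shared LOOP class and context while discounting the differing instruction and lexemes.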
42. • Goal: identify the maximum isomorphic tree/subtree
• Comparison of blocks to each other
  • Blocks: the atomic unit for the (sub)trees considered
KERNELS FOR CODE STRUCTURES: AST
TREE KERNELS FOR AST
[Diagram: two BLOCK subtrees being compared: assignments and a print over x, y vs. assignments and a print over s, p]
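A simplified sketch of block-wise subtree matching, with trees given as `(label, children)` tuples. For brevity it compares children position-wise; a full tree kernel also considers child subsequences:

```python
def tree_kernel(t1, t2):
    """Count matching (sub)structures of two trees given as
    (label, [children]) tuples -- a simplified subtree-matching
    kernel, not the authors' exact formulation."""
    (l1, c1), (l2, c2) = t1, t2
    if l1 != l2:
        return 0.0
    score = 1.0
    # Simplification: compare children pairwise, in order.
    for a, b in zip(c1, c2):
        score += tree_kernel(a, b)
    return score

# Two hypothetical BLOCK subtrees, as in the clone example above.
block_a = ("BLOCK", [("=", [("x", []), ("1.0", [])]),
                     ("print", [("y", [])])])
block_b = ("BLOCK", [("=", [("s", []), ("0.0", [])]),
                     ("print", [("p", [])])])
```

The two blocks match on their statement skeleton (BLOCK, =, print) while the leaf identifiers differ, which is exactly the near-miss clone situation.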
44. • Features of nodes:
  • Node Label
    • e.g., WHILE, CALL-SITE, EXPR, ...
  • Node Type
    • i.e., data node or control node
• Features of edges:
  • Edge Type
    • i.e., data edge or control edge
KERNELS FOR CODE STRUCTURES: PDG
GRAPH KERNELS FOR PDG
[Diagram: PDG fragment with while, call-site, arg and expr nodes]
45. Example: in the PDG fragment, the while node is annotated with
  Node Label = WHILE
  Node Type = Control Node
and each edge is labeled as a control edge or a data edge
[Diagram: the same PDG fragment, with the while node and its control/data edges annotated]
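One possible encoding of these node and edge features in code (an assumed encoding, mirroring the features listed above), together with the compatibility test used when comparing nodes:

```python
# A tiny PDG with typed nodes and typed edges (hypothetical encoding).
pdg = {
    "nodes": {
        1: {"label": "WHILE",     "type": "control"},
        2: {"label": "CALL-SITE", "type": "control"},
        3: {"label": "EXPR",      "type": "data"},
    },
    "edges": [
        (1, 2, "control"),  # while -> call-site (control dependence)
        (1, 3, "data"),     # while -> expr (data dependence)
    ],
}

def compatible(n1, n2):
    """Only nodes of the same type are worth comparing further."""
    return n1["type"] == n2["type"]
```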
46. • Goal: identify common subgraphs
• Selectors: compare nodes to each other and explore the subgraphs of only "compatible" nodes (i.e., nodes of the same type)
• Context: the subgraph of a node (with paths whose lengths are at most L, to avoid loops)
KERNELS FOR CODE STRUCTURES: PDG
GRAPH KERNELS FOR PDG
[Diagram: two PDG fragments being compared, each with while, call-site, arg and expr nodes]
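The depth-bounded "context" of a node can be sketched as a breadth-first expansion that stops after L steps; the bound keeps the computation finite even though a PDG may contain cycles (illustrative encoding, not the authors' implementation):

```python
def context(pdg_edges, start, max_len):
    """Nodes reachable from `start` via paths of length <= max_len.
    The bound avoids following PDG cycles forever."""
    adj = {}
    for src, dst, _etype in pdg_edges:
        adj.setdefault(src, []).append(dst)
    frontier, seen = {start}, {start}
    for _ in range(max_len):
        frontier = {d for n in frontier for d in adj.get(n, [])} - seen
        if not frontier:
            break
        seen |= frontier
    return seen

# A tiny cyclic PDG: while -> call-site -> expr -> while.
edges = [(1, 2, "control"), (2, 3, "data"), (3, 1, "data")]
```

With `max_len=1` only the immediate neighborhood of the while node is explored; raising the bound grows the context until the whole cycle is covered.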
50. EVALUATION PROTOCOL
• Comparison of results with two other clone detector tools:
  • an AST-based clone detector
  • a PDG-based clone detector
• No publicly available clone detection dataset
• No unique set of analyzed open source systems
• Usually clone results are not available
• Two possible strategies:
  • automatically modify an existing system with randomly generated clones
  • manually classify the candidate results
EMPIRICAL EVALUATION
57. BENCHMARKS AND DATASET
  Project         Size (KLOC)   # PDGs
  Apache-2.2.14       343        3017
  Python-2.5.1        435        5091
• Comparison with another graph-based clone detector
  • MeCC (ICSE 2011)
• Baseline Dataset
  • Results provided by MeCC
• Extended Dataset
  • Extension of the clone results by manual evaluation of candidate clones
  • Agreement rate calculated between the evaluators
EMPIRICAL EVALUATION
58. EMPIRICAL EVALUATION OF TREE KERNELS FOR AST
• Comparison with another (pure) AST-based clone detector
  • Clone Digger, http://clonedigger.sourceforge.net/
• Comparison on a system with randomly seeded clones
Results refer to clones where code fragments have been modified by adding, removing or changing code statements
EVALUATION: TREE KERNELS FOR AST
59. PRECISION, RECALL AND F1 PLOT
[Plot: precision, recall and F1 of the clone results at similarity thresholds from 0.60 to 0.98]
EVALUATION: TREE KERNELS FOR AST
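The metrics in the plot can be computed as below; the clone-pair sets are hypothetical stand-ins for the detector output and the oracle at one threshold:

```python
def prf1(reported, oracle):
    """Precision, recall and F1 of a reported clone set against an
    oracle, both given as sets of clone-pair identifiers."""
    tp = len(reported & oracle)                      # true positives
    p = tp / len(reported) if reported else 0.0
    r = tp / len(oracle) if oracle else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

oracle = {"a", "b", "c", "d"}      # hypothetical annotated clone pairs
reported = {"a", "b", "x"}         # hypothetical detector output
p, r, f1 = prf1(reported, oracle)
```

Sweeping the similarity threshold changes `reported`, tracing out the precision/recall curves shown in the evaluation.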
60. RESULTS WITH APACHE 2.2.14
  Threshold   #Clones in the Baseline   #Clones in the Extended Dataset
  1.00                 874                        1089
  0.99                 874                        1514
[Plots: Apache 2.2.14 precision and recall with the extended oracle, baseline vs. extended, at similarity thresholds 0.99 and 1.00]
EVALUATION: GRAPH KERNELS FOR PDG
61. RESULTS WITH PYTHON 2.5.1
  Threshold   #Clones in the Baseline   #Clones in the Extended Dataset
  1.00                 858                        1066
  0.99                 858                        2119
[Plots: Python 2.5.1 precision and recall with the extended oracle, baseline vs. extended, at similarity thresholds 0.99 and 1.00]
EVALUATION: GRAPH KERNELS FOR PDG
62. CHALLENGES AND OPPORTUNITIES
• Learning kernel functions from the data set
• Kernel Methods advantages:
  • flexible solutions that can be tailored to a specific domain
  • efficient solutions that are easy to parallelize
  • combinations of multiple kernels
• Provide a publicly available data set