130614 sebastiano panichella - mining source code descriptions from developers communications

Mining Source Code Descriptions
from Developer Communications

Sebastiano
Panichella

Jairo
Aponte

Massimiliano
Di Penta

Andrian
Marcus

Gerardo
Canfora

Context: Software Project

Sequence
diagram

Documentation

Source Code

Class
diagram

Developer

Program
Comprehension

Maintenance
Tasks


Sequence
diagram

Documentation

Source Code

Class
diagram

Difficult
understanding

Developer

Program
Comprehension

Maintenance
Tasks

describes

Sequence
diagram

Documentation

Source Code

Class
diagram

Difficult
understanding

understanding

Developer

Program
Comprehension

Maintenance
Tasks

Coming back
to the reality...

Program
Comprehension

Source Code
Maintenance
Tasks

Difficult
understanding

Developer

Idea
We argue that messages exchanged among contributors/developers
are a useful source of information to help understanding source code.

In such situations developers need to infer knowledge from,
the source code itself

Developer

source code descriptions in external
artifacts.

..................................................

When call the method IndexSplitter.split(File
destDir, String[] segs) from the Lucene cotrib
directory(contrib/misc/src/java/org/apache/luc
ene/index) it creates an index with segments
descriptor file with among contributors/developers
We argue that messages exchanged wrong data. Namely wrong
are a useful source of information to help understanding of segment
is the number representing the name source code.
that would be created next in this index.

Idea

..................................................

In such situations developers need to infer knowledge from,

CLASS: IndexSplitter
the source code itself

Developer

METHOD: split

source code descriptions in external
artifacts.

A Five Step-Approach for Mining
Method Descriptions

Developer

Step 1: Downloading emails/bugs reports and tracing them
onto classes
Two heuristics

Developer
Discussion

The discussion contains a fully-qualified class name (e.g.,
org.apache.lucene.analysis.MappingCharFilter); or the email contains a
file name (e.g., MappingCharFilter.java)
For bug reports, we complement the above heuristic by matching the
bug ID of each closed bug to the commit notes, therefore tracing the bug
report to the files changed in that commit

When call the method IndexSplitter .split(File destDir, String[] segs) from the
Lucene cotrib directory (contrib/misc/src/java/org/apache/lucene/index) it creates
an index with segments descriptor file with wrong data. Namely wrong is the number
representing the name of segment that would be created next in this index.

public void split(File destDir, String[] segs) throws IOException {
destDir.mkdirs();
FSDirectory destFSDir = FSDirectory.open(destDir);
SegmentInfos destInfos = new SegmentInfos }

If some of the segments of the index already has this name this results either to
impossibility to create new segment or in crating of an corrupted segment.

Step 1: Downloading emails/bugs reports and tracing them
onto classes
Two heuristics

Developer
Discussion

The discussion contains a fully-qualified class name (e.g.,
org.apache.lucene.analysis.MappingCharFilter); or the email contains a
file name (e.g., MappingCharFilter.java)
For bug reports, we complement the above heuristic by matching the
bug ID of each closed bug to the commit notes, therefore tracing the bug
report to the files changed in that commit

When call the method IndexSplitter .split(File destDir, String[] segs) from the

destDir.mkdirs();


Step 2: Extracting paragraphs
We consider as paragraphs, text section separated by one or more

white lines
Two heuristics

We prune out paragraph description from source code fragments
and/or stack Traces "by using an approach inspired by the work of
Bacchelli et al.

Developer
Discussion
When call the method IndexSplitter.split(File destDir, String[] segs) from the
PAR
1

destDir.mkdirs();

PAR
2


PAR
3

Step 3: Tracing paragraphs onto methods
A) A valid paragraph must contain the keyword “method”
These paragraphs must
respect the following
two conditions:

Developer
Discussion

A)

B) and the method name must be followed by a open parenthesis—
i.e., we match “foo(”

B)

When call the method IndexSplitter.split(File destDir, String[] segs)
PAR
from the Lucene cotrib directory it creates an index with segments1
descriptor file with wrong data. Namely wrong is the number
representing the name of segment that would be created next in this
index.

METHOD: split(
......................................................................................
......................................................................................
......................................................................................
......................................................................................

Step 4: Heuristic based Filtering
We defined a set of heuristics to further filter the paragraphs associated with
methods that assign each paragraph a score:

..........................
Problem seems to come from
MainMethodeSearchEngine in
org.eclipse.jdt.internal.ui.launcher
The Method
searchMainMethods(IProgressMonitor,
IJavaSearchScope, boolean) ,there's
a call to addSubTypes(List,
IProgressMonitor, IJavaSearchScope)
Method if includesSubtypes flag is
ON. This method add all types subtypes as soon as the given scope
encloses them without testing if
sub-types have a main! After return
IType[] before the excecution
..........................

CLASS: MainMethodSearchEngine
METHOD: serachMainMethods
SCORE


..........................
The Method
..........................

% parameter = 100% -> s1= 1
SCORE

a) Method parameters:
% of parameters
s1= mentioned in the
paragraphs. Value
between 0 and 1

1 if the
method
does not
have
parameters


..........................
The Method
..........................

% parameter = 100% -> s1= 1
SCORE =

1+

% of parameters
paragraphs. Value
between 0 and 1

1 if the
method
does not
have
parameters

b) Syntactic descriptions (mentioning return values):
check whether the
Equal to one if
paragraph contains the
the method is
s2= keyword “return”. If YES
void.
Value equal 1, 0 otherwise


..........................
The Method
..........................

% parameter = 100% -> s1= 1
SCORE =

1+ 0+ 1 = 2

% of parameters
paragraphs. Value
between 0 and 1

1 if the
method
does not
have
parameters

check whether the
Equal to one if
the method is
void.

c) Overriding/Overloading:
1 if any of the “overload” or
s3=“override” keywords appears
in the paragraph, 0 otherwise
d) Method invocations:
1 if any of the “call” or
s4=“excecute” keywords appears


We selected paragraphs that have:
1. s1 ≥ thP = 0.5
2. s2 + s3 + s4 ≥ thH = 1

% of parameters
paragraphs. Value
between 0 and 1

check whether the
Equal to one if
the method is
void.

OK

% parameter = 100% -> s1= 1 ≥ 0.5
SCORE =

1+ 0+ 1 = 2 ≥ 1

1 if the
method
does not
have
parameters

c) Overriding/Overloading:
1 if any of the “overload” or
s3=“override” keywords appears
d) Method invocations:
1 if any of the “call” or
s4=“execute” keywords appears

Step 5: Similarity based Filtering
We rank filtered paragraphs through their textual similarity with the method they
are likely describing.
METHOD

PARAGRAPH

SCORE

Similarity

Method_3

Paragraph_4

2.5

96.1%

Method_1

Paragraph_1

2.5

95.6%

Method_2

Paragraph_2

1.5

97.4%

Method_3

Paragraph_3

1.5

86.2%

Method_1

Paragraph_3

1.5

79.0%

Method_3

Paragraph_2

1.5

77.5%

Method_2

Paragraph_4

1.5

64.3%

Method_2

Paragraph_3

1.3

83.2%

Method_3

Paragraph_1

1.3

73.9%

Method_2

Paragraph_1

1.3

68.7%

Method_1

Paragraph_4

1.3

53.6%

Removing:
- English stop words;
- Programming
language keywords
Using:
- Camel Case
splitting the on
remaining words
- Vector Space Model

Step 5: Similarity based Filtering
We rank filtered paragraphs through their textual similarity with the method they
are likely describing.
METHOD

PARAGRAPH

SCORE

Similarity

Method_3

Paragraph_4

2.5

96.1%

Method_1

Paragraph_1

2.5

95.6%

Method_2

Paragraph_2

1.5

97.4%

Method_3

Paragraph_3

1.5

86.2%

Method_1

Paragraph_3

1.5

79.0%

Method_3

Paragraph_2

1.5

77.5%

Method_2

Paragraph_4

1.5

64.3%

Method_2

Paragraph_3

1.3

83.2%

Method_3

Paragraph_1

1.3

73.9%

Method_2

Paragraph_1

1.3

68.7%

Method_1

Paragraph_4

1.3

53.6%

Removing:
- English stop words;
- Programming
language keywords
Using:
- Camel Case
splitting the on
remaining words
- Vector Space Model

th>=0.80

Empirical Study
• Goal: analyze source code descriptions in developer
discussions
• Purpose: investigating how developer discussions
describe methods of Java Source Code
• Quality focus: find good method description in
messages exchanged among contributors/developers
• Context: Bug-report and mailing lists of two Java
Project
 Apache Lucene and Eclipse

Research Questions
 RQ1 (method coverage): How many methods from
the analyzed software systems are described by the
paragraphs identified by the proposed approach?

 RQ2 (precision): How precise is the proposed approach
in identifying method descriptions?

 RQ3 (missing descriptions): How many potentially
good method descriptions are missed by the approach?

RQ1: How many methods from the analyzed software
systems are described by the paragraphs identified
by the proposed approach?

RQ2: How precise is the proposed approach in identifying
method descriptions?
We sampled 250 descriptions from each project

RQ3: How many potentially good method descriptions are
missed by the approach?
We sampled 100 descriptions from each project
TABLE III
The analysis of a sample of 100 paragraphs traced to methods,
but not satisfying the Step 4 heuristic

System

True Negatives

False Negatives

Eclipse

78

22

Apache Lucene

67

33

130614 sebastiano panichella - mining source code descriptions from developers communications

Recomendados

Recomendados

Más contenido relacionado

La actualidad más candente

La actualidad más candente (14)

Destacado

Destacado (6)

Similar a 130614 sebastiano panichella - mining source code descriptions from developers communications

Similar a 130614 sebastiano panichella - mining source code descriptions from developers communications (20)

Más de Ptidej Team

Más de Ptidej Team (20)

Último

Último (20)

130614 sebastiano panichella - mining source code descriptions from developers communications