SlideShare una empresa de Scribd logo
1 de 4
Descargar para leer sin conexión
The Last Line Effect
Moritz Beller, Andy Zaidman
Delft University of Technology,
The Netherlands
{m.m.beller,a.e.zaidman}@tudelft.nl
Andrey Karpov
OOO “Program Verification Systems,”
Tula, Russian Federation
karpov@viva64.com
Abstract—Micro-clones are tiny duplicated pieces of code; they
typically comprise only a few statements or lines. In this paper,
we expose the “last line effect,” the phenomenon that the last
line or statement in a micro-clone is much more likely to contain
an error than the previous lines or statements. We do this by
analyzing 208 open source projects and reporting on 202 faulty
micro-clones.
I. INTRODUCTION
Software developers oft need to repeat one particular line of
code several times in succession with only small alterations,
like in this example from TrinityCore:1
Example 1. TrinityCore
1 x += o t h e r . x ;
2 y += o t h e r . y ;
3 z += o t h e r . y ;
The 3D-coordinates of the other object are added onto
the member variables representing the coordinates x, y, z.
However, the last line in this block of three similar lines
contains an error, as it adds the y coordinate onto the z
coordinate. Instead, the last line should be
3 z += o t h e r . z ;
Another example from the popular web browser Chromium2
shows that this effect also occurs in similar statements within
one single line:
Example 2. Chromium
1 s t d : : s t r i n g host = . . . ;
2 s t d : : s t r i n g p o r t s t r = . . . ;
3 i f ( host != buzz : : STR EMPTY &&
host != buzz : : STR EMPTY)
Instead of comparing twice that host does not equal the
empty string, in the last position, port_str should have been
compared:
3 i f ( host != buzz : : STR EMPTY &&
p o r t s t r != buzz : : STR EMPTY)
Lines 1–3 from Example 1 are similar to each other, as
are the statements in the if clause in line 3 from Example 2.
We call such an extremely short block of almost identically
looking, repeated lines or statements a micro-clone. Through
our experience as software engineers and software quality
consultants, we had the intuition that the last line or statement
in a micro-clone is much more likely to contain an error than
1TrinityCore is a popular opensource framework for the creation of Mas-
sively Multiplayer Online Games (MMOGs), www.trinitycore.org.
2Chromium is the open-source part of Google Chrome, www.chromium.org
the previous lines or statements. The aim of this paper is to
verify whether our intuition is indeed true, leading to two
research questions:
RQ 1 Is the last line in a multi-line micro-clone more likely
to contain an error?
RQ 2 Is the last statement in a single-line micro-clone more
likely to contain an error?
As recurring micro-clones are common to most program-
ming languages, the presence of a last line effect can impact
almost any programmer. If we can prove that the last of a series
of similar statements is more likely to be faulty, programmers
and code reviewers know which code segments to give extra
attention to. This can increase software quality by reducing
the amount of errors in a program. When we posted a popular
science blog entry3
about the last line effect, it was picked up
quickly and excitedly in Internet fora.4
Many programmers
agreed to our observation, often assuming a psychological
reason to cause the effect, namely that programmers think they
finished a change one step too early.
A natural way to come up with code like in Examples 1
and 2 is to write the first part, copy-and-paste it the necessary
number of times, and then adapt the pasted fragments. Copy-
and-paste is one of the most widely used idioms in the
development of software [1]. It is easy, fast, hence cheap,
and the copied code is already known to work. Though often
considered harmful, sometimes (micro-)cloning is in fact the
only way to achieve a certain program behavior, like in the
examples above.
A number of clone detection tools have been developed to
find and possibly remove code clones [2], [3]. While these
automated clone detection tools have produced very strong
results down to the method level, they are ill-suited for rec-
ognizing micro-clones in practice because of an abundance of
false-positives. Therefore, we currently have no understanding
of the error proneness of a single line or statement in a micro-
clone, and, to the best of our knowledge, no similar study has
ever been conducted.
Our paper closes this gap by
• introducing the term micro-clone.
• introducing the static analysis tool PVS-Studio that is
able to reliably detect faulty micro-clones, which cannot
be found with traditional clone detection.
3www.viva64.com/en/b/0260
4www.reddit.com/r/programming/comments/270orx/the last line effect
TABLE I
ERROR TYPES FROM PVS-STUDIO AND THEIR DISTRIBUTION.
PVS #multiline #within Σ#
Error Description5
warnings line
Code warnings
V501 There are identical sub-expressions to the left
and to the right of the ‘foo’ operator.
93 88 181
V519 The ‘x’ variable is assigned values twice
successively. Perhaps this is a mistake.
10 0 10
V537 Consider reviewing the correctness of ‘X’
item’s usage.
8 0 8
V524 It is odd that the body of ‘Foo 1’ function is
fully equivalent to the body of ‘Foo 2’.
3 0 3
V525 The code containing the collection of similar
blocks. Check items X, Y, Z, ... in lines N1,
N2, N3, ...
0 1 1
• manually investigating the error proneness of 202 micro-
clones in 208 well-known open-source systems (OSS).
Our findings show that in micro-clones similar to Exam-
ples 1 and 2, the last line or statement is significantly more
likely to contain an error than any other preceding line or
statement.
II. METHODS
In this section, we shortly describe traditional clone detec-
tion, why it is not suitable for the recognition of micro-clones,
and how we circumvented this problem.
As Examples 1 and 2 demonstrate, the code blocks
that we study in this paper contain “syntactically identical
cop[ies]; only variable, type, or function identifiers have [...]
changed.” [4] This makes them extremely short type-2 clones
(usually shorter than 5 lines or statements, see tables II
and III), which we refer to as micro-clones.
A. Why Current Clone Detectors Are Not Suitable
Traditional code clone detection works with a token-, line-,
abstract syntax tree (AST), or graph-based comparison [4].
However, to reduce the number of false-positives, all ap-
proaches are in need of specifying a minimal code clone length
for their unit of measurement (be it tokens, statements, lines
or AST nodes) when applied in practice. This minimal clone
length is usually in the range of 5-10 units [2], [5], which
makes it too long to detect our micro-clones of length 2 to 4
units (see tables II and III).
B. How We Found Faulty Micro-Clones Instead
As clone detection is not able to reliably detect micro-clones
in practice, we devised a different strategy to find them. Our
RQs aim not at finding all possible micro-clones, but only the
ones which are faulty. With this additional constraint, we could
design and implement a handful of powerful analyses based on
simple textual identity. These are able to find faulty code that
is very likely the result of a micro-clone. Table I describes
the five analyses that found micro-clones in our study. For
example, the analysis V501 simply evaluates whether there are
identical expressions next to certain logical operators. If so,
these are at best redundant and therefore cause a maintenance
problem, or at worst present a real bug in the system.
5For a more detailed description, refer to www.viva64.com/en/d
III. STUDY SETUP
In this section, we describe how we set-up our empirical
study and which study objects we selected.
A. Study Design
Our study consists of four easily replicable steps.
1) Run the static analysis checker PVS-Studio on our study
objects, with all checks enabled. PVS-Studio is a com-
mercial static analysis tool developed by OOO Program
Verification System and incorporates dozens of static
analyses from clone detection to anti-usage-patterns of
C-specific library functions. For replication purposes, a
free trial of PVS-Studio is publicly available.6
2) Inspect the corpus of warnings from PVS-Studio and re-
move false-positives and warnings not related to micro-
cloning.
3) For each faulty micro-clone, count the total number of
lines (RQ1) or statements (RQ2) and denote which lines
or statements are faulty.
4) Naively, we assume each line has the same likelihood
1/n of containing an error (H0), independent of its
position in an n-line long faulty micro-clone. For ex-
ample, lines 1 and 2 in a 2-liner clone each have a
0.5 probability of containing an error. However, if we
can show that the error distribution per line from step
(3) significantly differs from such a uniform distribu-
tion on a σ = 0.05 significance level, we reject H0
and assume a non-uniform error distribution. For each
micro-clone length n, we use Pearson’s χ2
test with
n − 1 degrees of freedom to compare our empirical
distribution’s goodness-of-fit to a 1/n-equipartition.
B. Study Objects
To ensure the replicability and feasibility of our study, we
focused on well-known open-source systems. Among the 208
OSS, we found instances of defective micro clones in such
renowned projects as the music editing software Audacity (2
findings), the web browsers Chromium (9) and Firefox (9),
the XML library libxml (1), the databases MySQL (1) and
MongoDB (1), the C compiler clang (14), the egoshooter
Quake III (3), the rendering software Blender (4), the 3D
visualization toolkit VTK (7), the network protocols Samba
(3) and OpenSSL (1), the video editor VirtualDub (3), and the
programming language PHP (1).
IV. RESULTS
In this section, we deepen our understanding of faulty
micro-clones by example and statistical evaluation.
A. In-Depth Investigation of Findings
We ran the complete suite of all PVS-Studio analyses on our
208 OSS from mid-2011 to December 2014. Andrey Karpov,
a professional software consultant, manually sorted-out false
positives, so that 1,683 potentially interesting warnings with
148 different error codes remained.7
We manually investigated
6www.viva64.com/en/pvs-studio-download
7Our raw and analyzed data: dx.doi.org/10.6084/m9.figshare.1313697.
all 1,683 and found that 202 warnings with five distinct error
codes were related to micro-cloning. Table II shows the error-
per-line distribution of our 108 micro-clones consisting of
several lines, and table III that of our 94 micro-clones within
one single line. Cells with gray background are non-sensible
(i.e. for a micro-clone of 2 lines, no error can occur in line
3). The yellow diagonal highlights errors in the last line or
statement.
B. Warning Types By Example
In the following, we report on the details of 202 PVS-
Studio-generated warnings by providing representative exam-
ples to convey a better intuitive understanding of our findings.
1) V501: As table I shows, the majority of micro-clone
warnings are of type V501. A prime example for a V501-type
warning comes from Chromium:
Example 3. Chromium
1 return
2 ! p r o f i l e . GetFieldText ( A u t o f i l l T y p e (
NAME FIRST) ) . empty ( ) | |
3 ! p r o f i l e . GetFieldText ( A u t o f i l l T y p e (
NAME MIDDLE) ) . empty ( ) | |
4 ! p r o f i l e . GetFieldText ( A u t o f i l l T y p e (
NAME MIDDLE) ) . empty ( ) ;
In this one-liner micro-clone the second and third cloned
statement are lexicographically identical but connected with
the logical OR-operator (||), thus representing a tautology.
Instead, the boolean expression misses to take into account
the surname (NAME_LAST), an example of the last statement
effect in this tricolon.
2) V519: When setting the value of a variable twice in
succession, this is typically either unnecessary – and therefore
a maintenance problem because it makes the code harder to
understand as the first assignment is not effective –, or an
outright bug. In this V519 example from MTASA, m_ucRed
is assigned twice, but the developers forgot to set m_ucBlue.
Example 4. MTASA
1 m ucRed = ucRed ; m ucGreen = ucGreen ; m ucRed =
ucRed ;
3) V537: An illustrative example for a V537 finding comes
from Quake III, where PVS-Studio alerts us to review the use
of rectf.X:
Example 5. Quake III
1 rect −>X = roundr ( r e c t f .X) ;
2 rect −>Y = roundr ( r e c t f .X) ;
Indeed, the rectangle is falsely assigned the rounded value of
rectf.X in its y coordinate in the second (i.e. last) line of
this micro-clone.
4) Counterexample: However, not in all instances of an
erroneous micro-clone does the problem lie in the last line
or statement. Take this rare counterexample from Chromium,
which we counted to the 7 instances of an error in line 2 of a
three-liner micro-clone (see table II):
Example 6. Chromium
1 i f ( s t d : : abs ( data [M01] − data [M10 ] ) > e p s i l o n | |
2 s t d : : abs ( data [M02] − data [M02 ] ) > e p s i l o n | |
3 s t d : : abs ( data [M12] − data [M21 ] ) > e p s i l o n )
TABLE II
ERROR DISTRIBUTION FOR MICRO-CLONES WITH > 1 LINE.
#total lines
1 2 3 4 5 6 7 8 9 10 >10
1 5 0 0 2 0 0 0 0 0 ...
2 47 7 3 1 1 0 0 0 0 ...
3 12 3 0 1 1 0 0 0 ...
4 9 0 0 0 0 0 0 ...
5 3 0 0 1 0 0 ...
6 2 0 0 0 0 ...
7 1 0 0 0 ...
8 0 1 0 ...
9 2 1 ...
10 0 ...
Σ 0 52 19 15 6 4 2 1 3 1 5
ΣΣ 108
#errorsinline
χ2
33.9 11.5 11.4 5.7
p 5×10−9
0.003 0.009 0.225
TABLE III
ERROR DISTRIBUTION FOR MICRO-CLONES WITHIN ONE LINE.
#total statements
1 2 3 4 5 > 5
1 0 0 0 0 0
2 67 3 2 0 0
3 15 0 0 0
4 7 0 0
5 0 0
Σ 0 67 18 9 0 0
ΣΣ 94#errorsin
statement χ2
67 21 15
p 2×10−16
2×10−5
0.002
In line 2, the engineers deducted data_[M02] from itself.
Instead, they meant to write:
2 s t d : : abs ( data [M02] − data [M20 ] ) > e p s i l o n | |
C. Statistical Evaluation
For each column in tables II and III, we performed a
Pearson’s χ2
test on a p = 0.05 significance level to show
whether the individual distributions are non-uniform. Results,
reported in the last two rows, are only meaningful for micro-
clone lengths with enough empirical observations, which are
columns 2, 3 and 4 in both tables.
For RQ1 and RQ2, we got significant p-values for micro-
clones consisting of 2, 3, and 4 lines or statements, respectively
(p < 0.05). This means that we can reject the null hypothesis
that errors are uniformly distributed across statements or lines.
Instead, the distribution is significantly skewed towards the last
line or statement. We would expect similar findings for longer
micro-clones, but these were too rare to derive statistically
valid information.
We can summarize the results across micro-clone lengths
into the two events “error not in last line or statement”
and “error in last line or statement”, shown in table IV.
Our absolute counts show that in micro-clones similar to
Example 1, the last line is more than twice as likely to contain
a fault than all previous lines taken together. When looking at
the individual line length in table II, the last line effect is
even as high as a nine-fold increased error-proneness for the
oft-appearing clone lengths 2 and 4. The results for cloned
TABLE IV
SUMMARIZED RESULTS.
multi-line micro-clone one-line micro-clone
#errors in last line/stmt. 76 89
#errors not in last line/stmt. 32 5
Σ 108 94
statements in micro-clones within one line, like Example 2,
are stronger still: We found the last statement to be 17 times
more faulty than all other statements taken together. In fact,
for the 67 micro-clones consisting of two statements, the last
statement was always the faulty one.
In total, our findings confirm the presence of a strong last
line and last statement effect, accepting both RQ1 and RQ2.
We assume the effect to be caused by and large through copy-
and-pasting code, and that developers have a psychological
tendency to think changes of similar code blocks are finished
earlier than they really are. This way, they miss one critical
last modification.
D. Usefulness of Results
Having unveiled a large number of potential bugs in OSS,
we wanted to (a) help the OSS community and (b) see if
our findings represented bugs that would be worth fixing in
reality. We approached the OSS projects by creating issues
with our findings in their bug trackers. Many of our bug
reports lead to quality improvements in the projects, like
fixing the validation bug from Example 2 in Chromium.8
The search query pvs-studio bug | issue9
delivers
numerous successful bug fixes in Firefox, libxml, MySQL,
Clang and many other projects.
V. THREATS TO VALIDITY
In this section, we show internal and external threats to the
validity of our results, and how we mitigated them.
One internal threat concerns how to determine in which
line the error lies. Given Example 2, any of the two state-
ments could be counted as the one containing the duplication.
However, reading and writing source code typically happens
from top to bottom and from left to right [6]. Therefore, the
only natural assessment is to count the line as problematic
according to this strict left-right/top-down-reading order: In
Example 2, only when we read the second statement do we
know it is a duplicate of the first. We hence count the second
statement as the one containing the error. Moreover, in many
cases, as in Example 2, the program context around the micro-
clone (here the definition order of the variables host first
and then port_str) imposes a natural logical order for the
remainder of the program (first check host, then port_str
in line 3). Because counting lines is a well-defined task under
these circumstances, we are not concerned about interrater
reliability.
8codereview.chromium.org/7031055
9www.google.com/search?q=pvs-studio+bug+|+issue
An external factor that threatens the generalizability is that
PVS-Studio is specific to C(++). C is one of the most com-
monly used languages [7]. Therefore, even if our results were
not generalizable, they would at least be valuable to the large C
community. However, our findings typically contain language
features common to most programming languages, like the
variable assignments, if clauses, boolean expressions and array
use in Examples 1 to 5. Almost all programming languages
have these constructs. Thus, we expect to see analogous results
in at least imperative languages such as Java, JavaScript, C#,
PHP, Ruby, or Python. While our overall corpus of findings is
quite large, the average number of ~1 warning per project is
rather small. This could be because PVS-Studio’s analyses for
defective micro-clones are not exhaustive, and that the projects
we studied are stable, production systems with a mature code
base containing relatively little trivial errors.
VI. FUTURE WORK & CONCLUSION
In this section, we describe possible extensions of our study
and summarize our initial results.
Because our study focuses on faulty micro-clones, we
cannot make predictions about how many of all micro-clones
are affected by the last line effect. A promising future research
direction is to develop a clone detector that can reliably detect
all micro-clones, and then to see how many are actually
defective. This gives an indication of the scale of the problem
at hand. Moreover, while our investigation provides a first
theory about the last line effect, the assumed psychological
reasons for its existence open up another vast and fruitful field
for further study.
In 208 open source projects, we found 202 faulty micro-
clones. Our analysis shows that there is a strong tendency for
the last line, and an even stronger for the last statement to
be faulty, called the last line effect. Because of it, we advise
programmers be extra-careful when reading, modifying, code-
reviewing, or creating the last line and statement of a micro-
clone, especially when they copy-and-pasted it.
REFERENCES
[1] M. Kim, L. Bergman, T. Lau, and D. Notkin, “An ethnographic study of
copy and paste programming practices in oopl,” in Proc. Int’l Symp. on
Empirical Software Engineering (ISESE). IEEE, 2004, pp. 83–92.
[2] S. Bellon, R. Koschke, G. Antoniol, J. Krinke, and E. Merlo, “Comparison
and evaluation of clone detection tools,” IEEE Transactions on Software
Engineering, vol. 33, no. 9, pp. 577–591, 2007.
[3] C. Roy, J. Cordy, and R. Koschke, “Comparison and evaluation of code
clone detection techniques and tools: A qualitative approach,” Science of
Computer Programming, vol. 74, no. 7, pp. 470–495, 2009.
[4] R. Koschke, “Survey of research on software clones,” in Duplication,
Redundancy, and Similarity in Software, 2007.
[5] E. Juergens, F. Deissenboeck, B. Hummel, and S. Wagner, “Do code
clones matter?” in Proceedings of the International Conference on
Software Engineering (ICSE). IEEE, 2009, pp. 485–495.
[6] J. Siegmund, C. K¨astner, S. Apel, C. Parnin, A. Bethmann, T. Leich,
G. Saake, and A. Brechmann, “Understanding understanding source code
with functional magnetic resonance imaging,” in Proc. Int’l Conference
on Software Engineering (ICSE). ACM, 2014, pp. 378–389.
[7] L. Meyerovich and A. Rabkin, “Empirical analysis of programming
language adoption,” in ACM SIGPLAN Notices, vol. 48, no. 10. ACM,
2013, pp. 1–18.

Más contenido relacionado

La actualidad más candente

Mining Fix Patterns for FindBugs Violations
Mining Fix Patterns for FindBugs ViolationsMining Fix Patterns for FindBugs Violations
Mining Fix Patterns for FindBugs Violations
Dongsun Kim
 
Abstract Symbolic Automata: Mixed syntactic/semantic similarity analysis of e...
Abstract Symbolic Automata: Mixed syntactic/semantic similarity analysis of e...Abstract Symbolic Automata: Mixed syntactic/semantic similarity analysis of e...
Abstract Symbolic Automata: Mixed syntactic/semantic similarity analysis of e...
FACE
 
Plagiarism introduction
Plagiarism introductionPlagiarism introduction
Plagiarism introduction
Merin Paul
 
Unveiling Metamorphism by Abstract Interpretation of Code Properties
Unveiling Metamorphism by Abstract Interpretation of Code PropertiesUnveiling Metamorphism by Abstract Interpretation of Code Properties
Unveiling Metamorphism by Abstract Interpretation of Code Properties
FACE
 
A hybrid model to detect malicious executables
A hybrid model to detect malicious executablesA hybrid model to detect malicious executables
A hybrid model to detect malicious executables
UltraUploader
 

La actualidad más candente (18)

Mining Fix Patterns for FindBugs Violations
Mining Fix Patterns for FindBugs ViolationsMining Fix Patterns for FindBugs Violations
Mining Fix Patterns for FindBugs Violations
 
Learning to Spot and Refactor Inconsistent Method Names
Learning to Spot and Refactor Inconsistent Method NamesLearning to Spot and Refactor Inconsistent Method Names
Learning to Spot and Refactor Inconsistent Method Names
 
Message Embedded Cipher Using 2-D Chaotic Map
Message Embedded Cipher Using 2-D Chaotic MapMessage Embedded Cipher Using 2-D Chaotic Map
Message Embedded Cipher Using 2-D Chaotic Map
 
Black-Box attacks against Neural Networks - technical project presentation
Black-Box attacks against Neural Networks - technical project presentationBlack-Box attacks against Neural Networks - technical project presentation
Black-Box attacks against Neural Networks - technical project presentation
 
Abstract Symbolic Automata: Mixed syntactic/semantic similarity analysis of e...
Abstract Symbolic Automata: Mixed syntactic/semantic similarity analysis of e...Abstract Symbolic Automata: Mixed syntactic/semantic similarity analysis of e...
Abstract Symbolic Automata: Mixed syntactic/semantic similarity analysis of e...
 
Plagiarism introduction
Plagiarism introductionPlagiarism introduction
Plagiarism introduction
 
Unveiling Metamorphism by Abstract Interpretation of Code Properties
Unveiling Metamorphism by Abstract Interpretation of Code PropertiesUnveiling Metamorphism by Abstract Interpretation of Code Properties
Unveiling Metamorphism by Abstract Interpretation of Code Properties
 
Infections as Abstract Symbolic Finite Automata: Formal Model and Applications
Infections as Abstract Symbolic Finite Automata: Formal Model and ApplicationsInfections as Abstract Symbolic Finite Automata: Formal Model and Applications
Infections as Abstract Symbolic Finite Automata: Formal Model and Applications
 
Lifting variability from C to mbeddr-C
Lifting variability from C to mbeddr-CLifting variability from C to mbeddr-C
Lifting variability from C to mbeddr-C
 
IRJET- Code Cloning using Abstract Syntax Tree
IRJET- Code Cloning using Abstract Syntax TreeIRJET- Code Cloning using Abstract Syntax Tree
IRJET- Code Cloning using Abstract Syntax Tree
 
Language Interaction and Quality Issues: An Exploratory Study
Language Interaction and Quality Issues: An Exploratory StudyLanguage Interaction and Quality Issues: An Exploratory Study
Language Interaction and Quality Issues: An Exploratory Study
 
A hybrid model to detect malicious executables
A hybrid model to detect malicious executablesA hybrid model to detect malicious executables
A hybrid model to detect malicious executables
 
[PDF]
[PDF][PDF]
[PDF]
 
Ab4102211213
Ab4102211213Ab4102211213
Ab4102211213
 
A Novel Approach to Genetic Algorithm Based Cryptography
A Novel Approach to Genetic Algorithm Based Cryptography A Novel Approach to Genetic Algorithm Based Cryptography
A Novel Approach to Genetic Algorithm Based Cryptography
 
Behavioral Analysis for Detecting Code Clones
Behavioral Analysis for Detecting Code ClonesBehavioral Analysis for Detecting Code Clones
Behavioral Analysis for Detecting Code Clones
 
Interview questions
Interview questionsInterview questions
Interview questions
 
2 programming with c# i
2 programming with c# i2 programming with c# i
2 programming with c# i
 

Destacado

Статический анализ кода: современный взгляд
Статический анализ кода: современный взглядСтатический анализ кода: современный взгляд
Статический анализ кода: современный взгляд
Andrey Karpov
 

Destacado (18)

200 open source проектов спустя: опыт статического анализа исходного кода
200 open source проектов спустя: опыт статического анализа исходного кода200 open source проектов спустя: опыт статического анализа исходного кода
200 open source проектов спустя: опыт статического анализа исходного кода
 
Of Rowers and Programmers, or What's in Common Between Software Development B...
Of Rowers and Programmers, or What's in Common Between Software Development B...Of Rowers and Programmers, or What's in Common Between Software Development B...
Of Rowers and Programmers, or What's in Common Between Software Development B...
 
The Ultimate Question of Programming, Refactoring, and Everything
The Ultimate Question of Programming, Refactoring, and EverythingThe Ultimate Question of Programming, Refactoring, and Everything
The Ultimate Question of Programming, Refactoring, and Everything
 
PVS-Studio vs Clang
PVS-Studio vs ClangPVS-Studio vs Clang
PVS-Studio vs Clang
 
PVS-Studio, a static analyzer detecting errors in the source code of C/C++/C+...
PVS-Studio, a static analyzer detecting errors in the source code of C/C++/C+...PVS-Studio, a static analyzer detecting errors in the source code of C/C++/C+...
PVS-Studio, a static analyzer detecting errors in the source code of C/C++/C+...
 
Rechecking TortoiseSVN with the PVS-Studio Code Analyzer
Rechecking TortoiseSVN with the PVS-Studio Code AnalyzerRechecking TortoiseSVN with the PVS-Studio Code Analyzer
Rechecking TortoiseSVN with the PVS-Studio Code Analyzer
 
Цена ошибки
Цена ошибкиЦена ошибки
Цена ошибки
 
PVS-Studio team experience: checking various open source projects, or mistake...
PVS-Studio team experience: checking various open source projects, or mistake...PVS-Studio team experience: checking various open source projects, or mistake...
PVS-Studio team experience: checking various open source projects, or mistake...
 
200 Open Source Projects Later: Source Code Static Analysis Experience
200 Open Source Projects Later: Source Code Static Analysis Experience200 Open Source Projects Later: Source Code Static Analysis Experience
200 Open Source Projects Later: Source Code Static Analysis Experience
 
Price of an Error
Price of an ErrorPrice of an Error
Price of an Error
 
Программа для регрессионного тестирования анализаторов PVS-Studio, CppCat
Программа для регрессионного тестирования анализаторов PVS-Studio, CppCatПрограмма для регрессионного тестирования анализаторов PVS-Studio, CppCat
Программа для регрессионного тестирования анализаторов PVS-Studio, CppCat
 
Статический анализ кода: современный взгляд
Статический анализ кода: современный взглядСтатический анализ кода: современный взгляд
Статический анализ кода: современный взгляд
 
PVS-Studio 5.00, a solution for developers of modern resource-intensive appl...
PVS-Studio 5.00, a solution for developers of modern resource-intensive appl...PVS-Studio 5.00, a solution for developers of modern resource-intensive appl...
PVS-Studio 5.00, a solution for developers of modern resource-intensive appl...
 
Static analysis of C++ source code
Static analysis of C++ source codeStatic analysis of C++ source code
Static analysis of C++ source code
 
20 issues of porting C++ code on the 64-bit platform
20 issues of porting C++ code on the 64-bit platform20 issues of porting C++ code on the 64-bit platform
20 issues of porting C++ code on the 64-bit platform
 
Difficulties of comparing code analyzers, or don't forget about usability
Difficulties of comparing code analyzers, or don't forget about usabilityDifficulties of comparing code analyzers, or don't forget about usability
Difficulties of comparing code analyzers, or don't forget about usability
 
Regular use of static code analysis in team development
Regular use of static code analysis in team developmentRegular use of static code analysis in team development
Regular use of static code analysis in team development
 
Реклама PVS-Studio - статический анализ кода на языке Си и Си++
Реклама PVS-Studio - статический анализ кода на языке Си и Си++Реклама PVS-Studio - статический анализ кода на языке Си и Си++
Реклама PVS-Studio - статический анализ кода на языке Си и Си++
 

Similar a The Last Line Effect

An Effective Privacy-Preserving Data Coding in Peer-To-Peer Network
An Effective Privacy-Preserving Data Coding in Peer-To-Peer NetworkAn Effective Privacy-Preserving Data Coding in Peer-To-Peer Network
An Effective Privacy-Preserving Data Coding in Peer-To-Peer Network
IJCNCJournal
 
User_42751212015Module1and2pagestocompetework.pdf.docx
User_42751212015Module1and2pagestocompetework.pdf.docxUser_42751212015Module1and2pagestocompetework.pdf.docx
User_42751212015Module1and2pagestocompetework.pdf.docx
dickonsondorris
 
Least cost rumor blocking in social networks
Least cost rumor blocking in social networksLeast cost rumor blocking in social networks
Least cost rumor blocking in social networks
Ping Yabin
 
Wcre2009 bettenburg
Wcre2009 bettenburgWcre2009 bettenburg
Wcre2009 bettenburg
SAIL_QU
 

Similar a The Last Line Effect (20)

Accord.Net: Looking for a Bug that Could Help Machines Conquer Humankind
Accord.Net: Looking for a Bug that Could Help Machines Conquer HumankindAccord.Net: Looking for a Bug that Could Help Machines Conquer Humankind
Accord.Net: Looking for a Bug that Could Help Machines Conquer Humankind
 
An Effective Privacy-Preserving Data Coding in Peer-To-Peer Network
An Effective Privacy-Preserving Data Coding in Peer-To-Peer NetworkAn Effective Privacy-Preserving Data Coding in Peer-To-Peer Network
An Effective Privacy-Preserving Data Coding in Peer-To-Peer Network
 
Study on Different Code-Clone Detection Techniques & Approaches to MitigateCo...
Study on Different Code-Clone Detection Techniques & Approaches to MitigateCo...Study on Different Code-Clone Detection Techniques & Approaches to MitigateCo...
Study on Different Code-Clone Detection Techniques & Approaches to MitigateCo...
 
Study on Different Code-Clone Detection Techniques & Approaches to MitigateCo...
Study on Different Code-Clone Detection Techniques & Approaches to MitigateCo...Study on Different Code-Clone Detection Techniques & Approaches to MitigateCo...
Study on Different Code-Clone Detection Techniques & Approaches to MitigateCo...
 
Phenoflow: An Architecture for Computable Phenotypes
Phenoflow: An Architecture for Computable PhenotypesPhenoflow: An Architecture for Computable Phenotypes
Phenoflow: An Architecture for Computable Phenotypes
 
Exploring Models of Computation through Static Analysis
Exploring Models of Computation through Static AnalysisExploring Models of Computation through Static Analysis
Exploring Models of Computation through Static Analysis
 
Harmful interupts
Harmful interuptsHarmful interupts
Harmful interupts
 
Labreport
LabreportLabreport
Labreport
 
Mining Challenge 2011 by Yukinao Hirata
Mining Challenge 2011 by Yukinao HirataMining Challenge 2011 by Yukinao Hirata
Mining Challenge 2011 by Yukinao Hirata
 
1757 1761
1757 17611757 1761
1757 1761
 
1757 1761
1757 17611757 1761
1757 1761
 
User_42751212015Module1and2pagestocompetework.pdf.docx
User_42751212015Module1and2pagestocompetework.pdf.docxUser_42751212015Module1and2pagestocompetework.pdf.docx
User_42751212015Module1and2pagestocompetework.pdf.docx
 
Mobile Networking and Mobile Ad Hoc Routing Protocol Modeling
Mobile Networking and Mobile Ad Hoc Routing Protocol ModelingMobile Networking and Mobile Ad Hoc Routing Protocol Modeling
Mobile Networking and Mobile Ad Hoc Routing Protocol Modeling
 
Least cost rumor blocking in social networks
Least cost rumor blocking in social networksLeast cost rumor blocking in social networks
Least cost rumor blocking in social networks
 
Wcre2009 bettenburg
Wcre2009 bettenburgWcre2009 bettenburg
Wcre2009 bettenburg
 
Software architacture recovery
Software architacture recoverySoftware architacture recovery
Software architacture recovery
 
Ijnsa050213
Ijnsa050213Ijnsa050213
Ijnsa050213
 
Use of Cell Block As An Indent Space In Python
Use of Cell Block As An Indent Space In PythonUse of Cell Block As An Indent Space In Python
Use of Cell Block As An Indent Space In Python
 
Abstracting Strings For Model Checking Of C Programs
Abstracting Strings For Model Checking Of C ProgramsAbstracting Strings For Model Checking Of C Programs
Abstracting Strings For Model Checking Of C Programs
 
MESSAGE EMBEDDED CIPHER USING 2-D CHAOTIC MAP
MESSAGE EMBEDDED CIPHER USING 2-D CHAOTIC MAPMESSAGE EMBEDDED CIPHER USING 2-D CHAOTIC MAP
MESSAGE EMBEDDED CIPHER USING 2-D CHAOTIC MAP
 

Más de Andrey Karpov

Más de Andrey Karpov (20)

60 антипаттернов для С++ программиста
60 антипаттернов для С++ программиста60 антипаттернов для С++ программиста
60 антипаттернов для С++ программиста
 
60 terrible tips for a C++ developer
60 terrible tips for a C++ developer60 terrible tips for a C++ developer
60 terrible tips for a C++ developer
 
Ошибки, которые сложно заметить на code review, но которые находятся статичес...
Ошибки, которые сложно заметить на code review, но которые находятся статичес...Ошибки, которые сложно заметить на code review, но которые находятся статичес...
Ошибки, которые сложно заметить на code review, но которые находятся статичес...
 
PVS-Studio in 2021 - Error Examples
PVS-Studio in 2021 - Error ExamplesPVS-Studio in 2021 - Error Examples
PVS-Studio in 2021 - Error Examples
 
PVS-Studio in 2021 - Feature Overview
PVS-Studio in 2021 - Feature OverviewPVS-Studio in 2021 - Feature Overview
PVS-Studio in 2021 - Feature Overview
 
PVS-Studio в 2021 - Примеры ошибок
PVS-Studio в 2021 - Примеры ошибокPVS-Studio в 2021 - Примеры ошибок
PVS-Studio в 2021 - Примеры ошибок
 
PVS-Studio в 2021
PVS-Studio в 2021PVS-Studio в 2021
PVS-Studio в 2021
 
Make Your and Other Programmer’s Life Easier with Static Analysis (Unreal Eng...
Make Your and Other Programmer’s Life Easier with Static Analysis (Unreal Eng...Make Your and Other Programmer’s Life Easier with Static Analysis (Unreal Eng...
Make Your and Other Programmer’s Life Easier with Static Analysis (Unreal Eng...
 
Best Bugs from Games: Fellow Programmers' Mistakes
Best Bugs from Games: Fellow Programmers' MistakesBest Bugs from Games: Fellow Programmers' Mistakes
Best Bugs from Games: Fellow Programmers' Mistakes
 
Does static analysis need machine learning?
Does static analysis need machine learning?Does static analysis need machine learning?
Does static analysis need machine learning?
 
Typical errors in code on the example of C++, C#, and Java
Typical errors in code on the example of C++, C#, and JavaTypical errors in code on the example of C++, C#, and Java
Typical errors in code on the example of C++, C#, and Java
 
How to Fix Hundreds of Bugs in Legacy Code and Not Die (Unreal Engine 4)
How to Fix Hundreds of Bugs in Legacy Code and Not Die (Unreal Engine 4)How to Fix Hundreds of Bugs in Legacy Code and Not Die (Unreal Engine 4)
How to Fix Hundreds of Bugs in Legacy Code and Not Die (Unreal Engine 4)
 
Game Engine Code Quality: Is Everything Really That Bad?
Game Engine Code Quality: Is Everything Really That Bad?Game Engine Code Quality: Is Everything Really That Bad?
Game Engine Code Quality: Is Everything Really That Bad?
 
C++ Code as Seen by a Hypercritical Reviewer
C++ Code as Seen by a Hypercritical ReviewerC++ Code as Seen by a Hypercritical Reviewer
C++ Code as Seen by a Hypercritical Reviewer
 
The Use of Static Code Analysis When Teaching or Developing Open-Source Software
The Use of Static Code Analysis When Teaching or Developing Open-Source SoftwareThe Use of Static Code Analysis When Teaching or Developing Open-Source Software
The Use of Static Code Analysis When Teaching or Developing Open-Source Software
 
Static Code Analysis for Projects, Built on Unreal Engine
Static Code Analysis for Projects, Built on Unreal EngineStatic Code Analysis for Projects, Built on Unreal Engine
Static Code Analysis for Projects, Built on Unreal Engine
 
Safety on the Max: How to Write Reliable C/C++ Code for Embedded Systems
Safety on the Max: How to Write Reliable C/C++ Code for Embedded SystemsSafety on the Max: How to Write Reliable C/C++ Code for Embedded Systems
Safety on the Max: How to Write Reliable C/C++ Code for Embedded Systems
 
The Great and Mighty C++
The Great and Mighty C++The Great and Mighty C++
The Great and Mighty C++
 
Static code analysis: what? how? why?
Static code analysis: what? how? why?Static code analysis: what? how? why?
Static code analysis: what? how? why?
 
Zero, one, two, Freddy's coming for you
Zero, one, two, Freddy's coming for youZero, one, two, Freddy's coming for you
Zero, one, two, Freddy's coming for you
 

Último

Último (20)

TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
 
Python Notes for mca i year students osmania university.docx
Python Notes for mca i year students osmania university.docxPython Notes for mca i year students osmania university.docx
Python Notes for mca i year students osmania university.docx
 
NO1 Top Black Magic Specialist In Lahore Black magic In Pakistan Kala Ilam Ex...
NO1 Top Black Magic Specialist In Lahore Black magic In Pakistan Kala Ilam Ex...NO1 Top Black Magic Specialist In Lahore Black magic In Pakistan Kala Ilam Ex...
NO1 Top Black Magic Specialist In Lahore Black magic In Pakistan Kala Ilam Ex...
 
Graduate Outcomes Presentation Slides - English
Graduate Outcomes Presentation Slides - EnglishGraduate Outcomes Presentation Slides - English
Graduate Outcomes Presentation Slides - English
 
On National Teacher Day, meet the 2024-25 Kenan Fellows
On National Teacher Day, meet the 2024-25 Kenan FellowsOn National Teacher Day, meet the 2024-25 Kenan Fellows
On National Teacher Day, meet the 2024-25 Kenan Fellows
 
Sensory_Experience_and_Emotional_Resonance_in_Gabriel_Okaras_The_Piano_and_Th...
Sensory_Experience_and_Emotional_Resonance_in_Gabriel_Okaras_The_Piano_and_Th...Sensory_Experience_and_Emotional_Resonance_in_Gabriel_Okaras_The_Piano_and_Th...
Sensory_Experience_and_Emotional_Resonance_in_Gabriel_Okaras_The_Piano_and_Th...
 
Understanding Accommodations and Modifications
Understanding  Accommodations and ModificationsUnderstanding  Accommodations and Modifications
Understanding Accommodations and Modifications
 
SOC 101 Demonstration of Learning Presentation
SOC 101 Demonstration of Learning PresentationSOC 101 Demonstration of Learning Presentation
SOC 101 Demonstration of Learning Presentation
 
Basic Civil Engineering first year Notes- Chapter 4 Building.pptx
Basic Civil Engineering first year Notes- Chapter 4 Building.pptxBasic Civil Engineering first year Notes- Chapter 4 Building.pptx
Basic Civil Engineering first year Notes- Chapter 4 Building.pptx
 
Philosophy of china and it's charactistics
Philosophy of china and it's charactisticsPhilosophy of china and it's charactistics
Philosophy of china and it's charactistics
 
HMCS Max Bernays Pre-Deployment Brief (May 2024).pptx
HMCS Max Bernays Pre-Deployment Brief (May 2024).pptxHMCS Max Bernays Pre-Deployment Brief (May 2024).pptx
HMCS Max Bernays Pre-Deployment Brief (May 2024).pptx
 
Jamworks pilot and AI at Jisc (20/03/2024)
Jamworks pilot and AI at Jisc (20/03/2024)Jamworks pilot and AI at Jisc (20/03/2024)
Jamworks pilot and AI at Jisc (20/03/2024)
 
REMIFENTANIL: An Ultra short acting opioid.pptx
REMIFENTANIL: An Ultra short acting opioid.pptxREMIFENTANIL: An Ultra short acting opioid.pptx
REMIFENTANIL: An Ultra short acting opioid.pptx
 
Food safety_Challenges food safety laboratories_.pdf
Food safety_Challenges food safety laboratories_.pdfFood safety_Challenges food safety laboratories_.pdf
Food safety_Challenges food safety laboratories_.pdf
 
Accessible Digital Futures project (20/03/2024)
Accessible Digital Futures project (20/03/2024)Accessible Digital Futures project (20/03/2024)
Accessible Digital Futures project (20/03/2024)
 
Sociology 101 Demonstration of Learning Exhibit
Sociology 101 Demonstration of Learning ExhibitSociology 101 Demonstration of Learning Exhibit
Sociology 101 Demonstration of Learning Exhibit
 
How to Add New Custom Addons Path in Odoo 17
How to Add New Custom Addons Path in Odoo 17How to Add New Custom Addons Path in Odoo 17
How to Add New Custom Addons Path in Odoo 17
 
Fostering Friendships - Enhancing Social Bonds in the Classroom
Fostering Friendships - Enhancing Social Bonds  in the ClassroomFostering Friendships - Enhancing Social Bonds  in the Classroom
Fostering Friendships - Enhancing Social Bonds in the Classroom
 
On_Translating_a_Tamil_Poem_by_A_K_Ramanujan.pptx
On_Translating_a_Tamil_Poem_by_A_K_Ramanujan.pptxOn_Translating_a_Tamil_Poem_by_A_K_Ramanujan.pptx
On_Translating_a_Tamil_Poem_by_A_K_Ramanujan.pptx
 
Exploring_the_Narrative_Style_of_Amitav_Ghoshs_Gun_Island.pptx
Exploring_the_Narrative_Style_of_Amitav_Ghoshs_Gun_Island.pptxExploring_the_Narrative_Style_of_Amitav_Ghoshs_Gun_Island.pptx
Exploring_the_Narrative_Style_of_Amitav_Ghoshs_Gun_Island.pptx
 

The Last Line Effect

  • 1. The Last Line Effect Moritz Beller, Andy Zaidman Delft University of Technology, The Netherlands {m.m.beller,a.e.zaidman}@tudelft.nl Andrey Karpov OOO “Program Verification Systems,” Tula, Russian Federation karpov@viva64.com Abstract—Micro-clones are tiny duplicated pieces of code; they typically comprise only a few statements or lines. In this paper, we expose the “last line effect,” the phenomenon that the last line or statement in a micro-clone is much more likely to contain an error than the previous lines or statements. We do this by analyzing 208 open source projects and reporting on 202 faulty micro-clones. I. INTRODUCTION Software developers oft need to repeat one particular line of code several times in succession with only small alterations, like in this example from TrinityCore:1 Example 1. TrinityCore 1 x += o t h e r . x ; 2 y += o t h e r . y ; 3 z += o t h e r . y ; The 3D-coordinates of the other object are added onto the member variables representing the coordinates x, y, z. However, the last line in this block of three similar lines contains an error, as it adds the y coordinate onto the z coordinate. Instead, the last line should be 3 z += o t h e r . z ; Another example from the popular web browser Chromium2 shows that this effect also occurs in similar statements within one single line: Example 2. Chromium 1 s t d : : s t r i n g host = . . . ; 2 s t d : : s t r i n g p o r t s t r = . . . ; 3 i f ( host != buzz : : STR EMPTY && host != buzz : : STR EMPTY) Instead of comparing twice that host does not equal the empty string, in the last position, port_str should have been compared: 3 i f ( host != buzz : : STR EMPTY && p o r t s t r != buzz : : STR EMPTY) Lines 1–3 from Example 1 are similar to each other, as are the statements in the if clause in line 3 from Example 2. We call such an extremely short block of almost identically looking, repeated lines or statements a micro-clone. Through our experience as software engineers and software quality consultants, we had the intuition that the last line or statement in a micro-clone is much more likely to contain an error than 1TrinityCore is a popular opensource framework for the creation of Mas- sively Multiplayer Online Games (MMOGs), www.trinitycore.org. 2Chromium is the open-source part of Google Chrome, www.chromium.org the previous lines or statements. The aim of this paper is to verify whether our intuition is indeed true, leading to two research questions: RQ 1 Is the last line in a multi-line micro-clone more likely to contain an error? RQ 2 Is the last statement in a single-line micro-clone more likely to contain an error? As recurring micro-clones are common to most program- ming languages, the presence of a last line effect can impact almost any programmer. If we can prove that the last of a series of similar statements is more likely to be faulty, programmers and code reviewers know which code segments to give extra attention to. This can increase software quality by reducing the amount of errors in a program. When we posted a popular science blog entry3 about the last line effect, it was picked up quickly and excitedly in Internet fora.4 Many programmers agreed to our observation, often assuming a psychological reason to cause the effect, namely that programmers think they finished a change one step too early. A natural way to come up with code like in Examples 1 and 2 is to write the first part, copy-and-paste it the necessary number of times, and then adapt the pasted fragments. Copy- and-paste is one of the most widely used idioms in the development of software [1]. It is easy, fast, hence cheap, and the copied code is already known to work. Though often considered harmful, sometimes (micro-)cloning is in fact the only way to achieve a certain program behavior, like in the examples above. A number of clone detection tools have been developed to find and possibly remove code clones [2], [3]. While these automated clone detection tools have produced very strong results down to the method level, they are ill-suited for rec- ognizing micro-clones in practice because of an abundance of false-positives. Therefore, we currently have no understanding of the error proneness of a single line or statement in a micro- clone, and, to the best of our knowledge, no similar study has ever been conducted. Our paper closes this gap by • introducing the term micro-clone. • introducing the static analysis tool PVS-Studio that is able to reliably detect faulty micro-clones, which cannot be found with traditional clone detection. 3www.viva64.com/en/b/0260 4www.reddit.com/r/programming/comments/270orx/the last line effect
  • 2. TABLE I ERROR TYPES FROM PVS-STUDIO AND THEIR DISTRIBUTION. PVS #multiline #within Σ# Error Description5 warnings line Code warnings V501 There are identical sub-expressions to the left and to the right of the ‘foo’ operator. 93 88 181 V519 The ‘x’ variable is assigned values twice successively. Perhaps this is a mistake. 10 0 10 V537 Consider reviewing the correctness of ‘X’ item’s usage. 8 0 8 V524 It is odd that the body of ‘Foo 1’ function is fully equivalent to the body of ‘Foo 2’. 3 0 3 V525 The code containing the collection of similar blocks. Check items X, Y, Z, ... in lines N1, N2, N3, ... 0 1 1 • manually investigating the error proneness of 202 micro- clones in 208 well-known open-source systems (OSS). Our findings show that in micro-clones similar to Exam- ples 1 and 2, the last line or statement is significantly more likely to contain an error than any other preceding line or statement. II. METHODS In this section, we shortly describe traditional clone detec- tion, why it is not suitable for the recognition of micro-clones, and how we circumvented this problem. As Examples 1 and 2 demonstrate, the code blocks that we study in this paper contain “syntactically identical cop[ies]; only variable, type, or function identifiers have [...] changed.” [4] This makes them extremely short type-2 clones (usually shorter than 5 lines or statements, see tables II and III), which we refer to as micro-clones. A. Why Current Clone Detectors Are Not Suitable Traditional code clone detection works with a token-, line-, abstract syntax tree (AST), or graph-based comparison [4]. However, to reduce the number of false-positives, all ap- proaches are in need of specifying a minimal code clone length for their unit of measurement (be it tokens, statements, lines or AST nodes) when applied in practice. This minimal clone length is usually in the range of 5-10 units [2], [5], which makes it too long to detect our micro-clones of length 2 to 4 units (see tables II and III). B. How We Found Faulty Micro-Clones Instead As clone detection is not able to reliably detect micro-clones in practice, we devised a different strategy to find them. Our RQs aim not at finding all possible micro-clones, but only the ones which are faulty. With this additional constraint, we could design and implement a handful of powerful analyses based on simple textual identity. These are able to find faulty code that is very likely the result of a micro-clone. Table I describes the five analyses that found micro-clones in our study. For example, the analysis V501 simply evaluates whether there are identical expressions next to certain logical operators. If so, these are at best redundant and therefore cause a maintenance problem, or at worst present a real bug in the system. 5For a more detailed description, refer to www.viva64.com/en/d III. STUDY SETUP In this section, we describe how we set-up our empirical study and which study objects we selected. A. Study Design Our study consists of four easily replicable steps. 1) Run the static analysis checker PVS-Studio on our study objects, with all checks enabled. PVS-Studio is a com- mercial static analysis tool developed by OOO Program Verification System and incorporates dozens of static analyses from clone detection to anti-usage-patterns of C-specific library functions. For replication purposes, a free trial of PVS-Studio is publicly available.6 2) Inspect the corpus of warnings from PVS-Studio and re- move false-positives and warnings not related to micro- cloning. 3) For each faulty micro-clone, count the total number of lines (RQ1) or statements (RQ2) and denote which lines or statements are faulty. 4) Naively, we assume each line has the same likelihood 1/n of containing an error (H0), independent of its position in an n-line long faulty micro-clone. For ex- ample, lines 1 and 2 in a 2-liner clone each have a 0.5 probability of containing an error. However, if we can show that the error distribution per line from step (3) significantly differs from such a uniform distribu- tion on a σ = 0.05 significance level, we reject H0 and assume a non-uniform error distribution. For each micro-clone length n, we use Pearson’s χ2 test with n − 1 degrees of freedom to compare our empirical distribution’s goodness-of-fit to a 1/n-equipartition. B. Study Objects To ensure the replicability and feasibility of our study, we focused on well-known open-source systems. Among the 208 OSS, we found instances of defective micro clones in such renowned projects as the music editing software Audacity (2 findings), the web browsers Chromium (9) and Firefox (9), the XML library libxml (1), the databases MySQL (1) and MongoDB (1), the C compiler clang (14), the egoshooter Quake III (3), the rendering software Blender (4), the 3D visualization toolkit VTK (7), the network protocols Samba (3) and OpenSSL (1), the video editor VirtualDub (3), and the programming language PHP (1). IV. RESULTS In this section, we deepen our understanding of faulty micro-clones by example and statistical evaluation. A. In-Depth Investigation of Findings We ran the complete suite of all PVS-Studio analyses on our 208 OSS from mid-2011 to December 2014. Andrey Karpov, a professional software consultant, manually sorted-out false positives, so that 1,683 potentially interesting warnings with 148 different error codes remained.7 We manually investigated 6www.viva64.com/en/pvs-studio-download 7Our raw and analyzed data: dx.doi.org/10.6084/m9.figshare.1313697.
  • 3. all 1,683 and found that 202 warnings with five distinct error codes were related to micro-cloning. Table II shows the error- per-line distribution of our 108 micro-clones consisting of several lines, and table III that of our 94 micro-clones within one single line. Cells with gray background are non-sensible (i.e. for a micro-clone of 2 lines, no error can occur in line 3). The yellow diagonal highlights errors in the last line or statement. B. Warning Types By Example In the following, we report on the details of 202 PVS- Studio-generated warnings by providing representative exam- ples to convey a better intuitive understanding of our findings. 1) V501: As table I shows, the majority of micro-clone warnings are of type V501. A prime example for a V501-type warning comes from Chromium: Example 3. Chromium 1 return 2 ! p r o f i l e . GetFieldText ( A u t o f i l l T y p e ( NAME FIRST) ) . empty ( ) | | 3 ! p r o f i l e . GetFieldText ( A u t o f i l l T y p e ( NAME MIDDLE) ) . empty ( ) | | 4 ! p r o f i l e . GetFieldText ( A u t o f i l l T y p e ( NAME MIDDLE) ) . empty ( ) ; In this one-liner micro-clone the second and third cloned statement are lexicographically identical but connected with the logical OR-operator (||), thus representing a tautology. Instead, the boolean expression misses to take into account the surname (NAME_LAST), an example of the last statement effect in this tricolon. 2) V519: When setting the value of a variable twice in succession, this is typically either unnecessary – and therefore a maintenance problem because it makes the code harder to understand as the first assignment is not effective –, or an outright bug. In this V519 example from MTASA, m_ucRed is assigned twice, but the developers forgot to set m_ucBlue. Example 4. MTASA 1 m ucRed = ucRed ; m ucGreen = ucGreen ; m ucRed = ucRed ; 3) V537: An illustrative example for a V537 finding comes from Quake III, where PVS-Studio alerts us to review the use of rectf.X: Example 5. Quake III 1 rect −>X = roundr ( r e c t f .X) ; 2 rect −>Y = roundr ( r e c t f .X) ; Indeed, the rectangle is falsely assigned the rounded value of rectf.X in its y coordinate in the second (i.e. last) line of this micro-clone. 4) Counterexample: However, not in all instances of an erroneous micro-clone does the problem lie in the last line or statement. Take this rare counterexample from Chromium, which we counted to the 7 instances of an error in line 2 of a three-liner micro-clone (see table II): Example 6. Chromium 1 i f ( s t d : : abs ( data [M01] − data [M10 ] ) > e p s i l o n | | 2 s t d : : abs ( data [M02] − data [M02 ] ) > e p s i l o n | | 3 s t d : : abs ( data [M12] − data [M21 ] ) > e p s i l o n ) TABLE II ERROR DISTRIBUTION FOR MICRO-CLONES WITH > 1 LINE. #total lines 1 2 3 4 5 6 7 8 9 10 >10 1 5 0 0 2 0 0 0 0 0 ... 2 47 7 3 1 1 0 0 0 0 ... 3 12 3 0 1 1 0 0 0 ... 4 9 0 0 0 0 0 0 ... 5 3 0 0 1 0 0 ... 6 2 0 0 0 0 ... 7 1 0 0 0 ... 8 0 1 0 ... 9 2 1 ... 10 0 ... Σ 0 52 19 15 6 4 2 1 3 1 5 ΣΣ 108 #errorsinline χ2 33.9 11.5 11.4 5.7 p 5×10−9 0.003 0.009 0.225 TABLE III ERROR DISTRIBUTION FOR MICRO-CLONES WITHIN ONE LINE. #total statements 1 2 3 4 5 > 5 1 0 0 0 0 0 2 67 3 2 0 0 3 15 0 0 0 4 7 0 0 5 0 0 Σ 0 67 18 9 0 0 ΣΣ 94#errorsin statement χ2 67 21 15 p 2×10−16 2×10−5 0.002 In line 2, the engineers deducted data_[M02] from itself. Instead, they meant to write: 2 s t d : : abs ( data [M02] − data [M20 ] ) > e p s i l o n | | C. Statistical Evaluation For each column in tables II and III, we performed a Pearson’s χ2 test on a p = 0.05 significance level to show whether the individual distributions are non-uniform. Results, reported in the last two rows, are only meaningful for micro- clone lengths with enough empirical observations, which are columns 2, 3 and 4 in both tables. For RQ1 and RQ2, we got significant p-values for micro- clones consisting of 2, 3, and 4 lines or statements, respectively (p < 0.05). This means that we can reject the null hypothesis that errors are uniformly distributed across statements or lines. Instead, the distribution is significantly skewed towards the last line or statement. We would expect similar findings for longer micro-clones, but these were too rare to derive statistically valid information. We can summarize the results across micro-clone lengths into the two events “error not in last line or statement” and “error in last line or statement”, shown in table IV. Our absolute counts show that in micro-clones similar to Example 1, the last line is more than twice as likely to contain a fault than all previous lines taken together. When looking at the individual line length in table II, the last line effect is even as high as a nine-fold increased error-proneness for the oft-appearing clone lengths 2 and 4. The results for cloned
  • 4. TABLE IV SUMMARIZED RESULTS. multi-line micro-clone one-line micro-clone #errors in last line/stmt. 76 89 #errors not in last line/stmt. 32 5 Σ 108 94 statements in micro-clones within one line, like Example 2, are stronger still: We found the last statement to be 17 times more faulty than all other statements taken together. In fact, for the 67 micro-clones consisting of two statements, the last statement was always the faulty one. In total, our findings confirm the presence of a strong last line and last statement effect, accepting both RQ1 and RQ2. We assume the effect to be caused by and large through copy- and-pasting code, and that developers have a psychological tendency to think changes of similar code blocks are finished earlier than they really are. This way, they miss one critical last modification. D. Usefulness of Results Having unveiled a large number of potential bugs in OSS, we wanted to (a) help the OSS community and (b) see if our findings represented bugs that would be worth fixing in reality. We approached the OSS projects by creating issues with our findings in their bug trackers. Many of our bug reports lead to quality improvements in the projects, like fixing the validation bug from Example 2 in Chromium.8 The search query pvs-studio bug | issue9 delivers numerous successful bug fixes in Firefox, libxml, MySQL, Clang and many other projects. V. THREATS TO VALIDITY In this section, we show internal and external threats to the validity of our results, and how we mitigated them. One internal threat concerns how to determine in which line the error lies. Given Example 2, any of the two state- ments could be counted as the one containing the duplication. However, reading and writing source code typically happens from top to bottom and from left to right [6]. Therefore, the only natural assessment is to count the line as problematic according to this strict left-right/top-down-reading order: In Example 2, only when we read the second statement do we know it is a duplicate of the first. We hence count the second statement as the one containing the error. Moreover, in many cases, as in Example 2, the program context around the micro- clone (here the definition order of the variables host first and then port_str) imposes a natural logical order for the remainder of the program (first check host, then port_str in line 3). Because counting lines is a well-defined task under these circumstances, we are not concerned about interrater reliability. 8codereview.chromium.org/7031055 9www.google.com/search?q=pvs-studio+bug+|+issue An external factor that threatens the generalizability is that PVS-Studio is specific to C(++). C is one of the most com- monly used languages [7]. Therefore, even if our results were not generalizable, they would at least be valuable to the large C community. However, our findings typically contain language features common to most programming languages, like the variable assignments, if clauses, boolean expressions and array use in Examples 1 to 5. Almost all programming languages have these constructs. Thus, we expect to see analogous results in at least imperative languages such as Java, JavaScript, C#, PHP, Ruby, or Python. While our overall corpus of findings is quite large, the average number of ~1 warning per project is rather small. This could be because PVS-Studio’s analyses for defective micro-clones are not exhaustive, and that the projects we studied are stable, production systems with a mature code base containing relatively little trivial errors. VI. FUTURE WORK & CONCLUSION In this section, we describe possible extensions of our study and summarize our initial results. Because our study focuses on faulty micro-clones, we cannot make predictions about how many of all micro-clones are affected by the last line effect. A promising future research direction is to develop a clone detector that can reliably detect all micro-clones, and then to see how many are actually defective. This gives an indication of the scale of the problem at hand. Moreover, while our investigation provides a first theory about the last line effect, the assumed psychological reasons for its existence open up another vast and fruitful field for further study. In 208 open source projects, we found 202 faulty micro- clones. Our analysis shows that there is a strong tendency for the last line, and an even stronger for the last statement to be faulty, called the last line effect. Because of it, we advise programmers be extra-careful when reading, modifying, code- reviewing, or creating the last line and statement of a micro- clone, especially when they copy-and-pasted it. REFERENCES [1] M. Kim, L. Bergman, T. Lau, and D. Notkin, “An ethnographic study of copy and paste programming practices in oopl,” in Proc. Int’l Symp. on Empirical Software Engineering (ISESE). IEEE, 2004, pp. 83–92. [2] S. Bellon, R. Koschke, G. Antoniol, J. Krinke, and E. Merlo, “Comparison and evaluation of clone detection tools,” IEEE Transactions on Software Engineering, vol. 33, no. 9, pp. 577–591, 2007. [3] C. Roy, J. Cordy, and R. Koschke, “Comparison and evaluation of code clone detection techniques and tools: A qualitative approach,” Science of Computer Programming, vol. 74, no. 7, pp. 470–495, 2009. [4] R. Koschke, “Survey of research on software clones,” in Duplication, Redundancy, and Similarity in Software, 2007. [5] E. Juergens, F. Deissenboeck, B. Hummel, and S. Wagner, “Do code clones matter?” in Proceedings of the International Conference on Software Engineering (ICSE). IEEE, 2009, pp. 485–495. [6] J. Siegmund, C. K¨astner, S. Apel, C. Parnin, A. Bethmann, T. Leich, G. Saake, and A. Brechmann, “Understanding understanding source code with functional magnetic resonance imaging,” in Proc. Int’l Conference on Software Engineering (ICSE). ACM, 2014, pp. 378–389. [7] L. Meyerovich and A. Rabkin, “Empirical analysis of programming language adoption,” in ACM SIGPLAN Notices, vol. 48, no. 10. ACM, 2013, pp. 1–18.