CLARKSON UNIVERSITY




MANAGING THE COPY-AND-PASTE PROGRAMMING PRACTICE



                          A Dissertation

                                By

                        Patricia Deshane




                  Coulter School of Engineering

        Submitted in partial fulfillment of the requirements

                         for the degree of

           Doctor of Philosophy, Engineering Science




                          April 30, 2010




                                       Accepted by the Graduate School




                                  ______________, _____________________
                                       Date,         Dean
Copyright 2010, Patricia Deshane
The undersigned have examined the dissertation entitled “Managing the Copy-and-
Paste Programming Practice” presented by Patricia Deshane, a candidate for the
degree of Doctor of Philosophy (Engineering Science), and hereby certify that it is
worthy of acceptance.



  April 30, 2010                                ______________________________
Date                                            Dr. Daqing Hou,
                                                Advisor
                                                Electrical and Computer Engineering



                                                ______________________________
                                                Dr. Susan Conry,
                                                Examining Committee
                                                Electrical and Computer Engineering



                                                ______________________________
                                                Dr. Christopher Lynch,
                                                Examining Committee
                                                Mathematics & Computer Science



                                                ______________________________
                                                Dr. Robert Meyer,
                                                Examining Committee
                                                Electrical and Computer Engineering



                                                ______________________________
                                                Dr. Christino Tamon,
                                                Examining Committee
                                                Mathematics & Computer Science
Abstract
                               CLARKSON UNIVERSITY

                  Managing the Copy-and-Paste Programming Practice

                                   By: Patricia Deshane

                                  Advisor: Daqing Hou


       Programmers often copy and paste source code in order to reuse an existing
solution in the completion of a current task. Copying and pasting results in code clones
(similar code fragments) throughout a code base, which need to be properly maintained
over time. Forgetting the cloning information and correspondence relationships within a
piece of code can be problematic for the software maintainer. Furthermore, inconsistent
editing of clones can introduce undetected bugs, decreasing the quality of the software.

       This dissertation presents a suite of software tools, Eclipse plug-ins named CnP,
that aid the programmer during copy, paste, and modify programming. The purpose is to
provide tool support throughout a clone’s entire lifecycle, from its creation to its removal
from the system. More than just traditional clone detection and removal, these clone
tracking tools have a particular focus on clone editing. One CnP plug-in helps with
consistent identifier renaming within clones (CReN), another one renames substrings
consistently within clones (LexId), and a third plug-in in the CnP suite visualizes user
edits within a clone for better clone comparison (CSeR). A user study was conducted on
CnP’s basic visualization, CReN, and LexId features with analysis in terms of task
completion time, solution correctness, and method of completion.




To my wonderful husband, Todd Deshane.
Acknowledgments
Personal Reflections

       With the completion of this dissertation paper and defense, I feel like I have had
the full PhD experience, and I am finally personally ready to graduate! I have completed
the course work, the qualifying exam, and the research component of the degree program.
Over the years, I have given various seminar presentations in the computer science and
engineering departments at the university, I have presented at one conference per year
during the research phase of my PhD (OOPSLA 2007 in Montreal, FSE 2008 in Atlanta,
and CASCON 2009 in Toronto), and I have written many paper submissions and drafts.

       A conference trip was often a reward for having a successful paper submission.
All three conferences at which I presented were milestones during my PhD career, and
each trip was truly unforgettable. I really enjoyed traveling to these cities and learning
from other researchers in the software engineering discipline. As a presenter, I personally
got a lot of valuable feedback and advice from the conference attendees that I may not
have gotten otherwise. I believe that the conferences were a vital part of my growth as a
researcher as they helped me to “get out of the lab” and experience the rest of the world.

Special Thanks

       I would first like to thank my husband, Todd Deshane, for everything over the
past ten years that we have been together. I would not be where I am today without him.
When it seems like the only thing that is constant is change, it is comforting to know that
Todd is always there to help me through the tough times and to celebrate with me in the
joyous times. Todd, all of this hard work during our PhD years was worth it – soon we
will both officially be “computer doctors”. “To every thing there is a season, and a time
to every purpose under heaven” (Ecclesiastes 3:1). I look forward to beginning the next
chapter of our lives together.

       I would like to thank Professor Hou, my advisor, who took me in as his first PhD
student at a time when I had nowhere else to go. I thank him for having put up with me
for the past four years. I have learned so much from him; most importantly, I believe that
“instead of just being fed, I was taught how to fish”. For the first time, I was able to get
the guidance that I needed, with the independence that I wanted.

       Thanks also to the Software Engineering Research Laboratory (SERL) at
Clarkson. I am truly grateful for having the lab/office space and resources that allowed
me to work more productively. I thank the other graduate students in SERL, especially:
Cheng (Jerry) Wang, Chandan Rupakheti, Ferosh Jacob, Yuejiao (Gloria) Wang, Xiaojia
(Joanna) Yao, Dave Pletcher, and Lin Li. I really appreciate the friendship and support
from each of you as we all spent countless hours in the lab.

       I would also like to thank Jeanna Matthews for supporting me during the first year
and a half of my PhD. Thanks also go to Eli M. Dow, my mentor at IBM, who helped me
with early research and has continued to be supportive. Finally, thank you, my PhD
committee (Susan Conry, Christopher Lynch, Robert Meyer, and Christino Tamon), who
have given me early feedback during my PhD proposal and remarks on this dissertation.

       Personal thanks to all of my family – my parents and siblings – extended family,
and best friends both from college and from back home. Special thanks to my best buddy,
Wenjin Hu, for his never-ending kindness and friendship while here at school. I would
also like to mention my other “best friend”, my dog, Lady. I truly do miss her.




       I give special thanks to my grandfather, B. John Jablonski, who continually kept
me motivated during my college years. It is hard to believe that it has already been four
years since his death, but I know that it has been that long since I have gotten an email
letter from him. His letters and emails always meant so much to me, with his words of
encouragement when many others did not approve or understand why I was still in school,
especially without proper funding. He continues to be my inspiration.

       Finally, I thank God (literally). He is everything – my counselor, comforter, and
keeper. In particular, I would like to thank God for always keeping me grounded. One of
the toughest things that I experienced during my PhD is rejection (of paper submissions).
While I feel that the rejections may have at times hindered my research progress, I feel
that if all of my paper submissions were accepted, then I may have incorrectly assumed
that this process was very simple and I may have become too proud of my own successes.
During this whole ordeal, I have learned that there is always room for improvement even
if it is difficult for me to see on my own. As Randy Pausch said, “Experience is what you
get when you didn’t get what you wanted.” But, regardless of past rejections, I know that
I have always done my best and that “with God all things are possible” (Matthew 19:26).
Ultimately, my success is not measured by men’s approval, but by God’s*.



                                                                                    Patricia A Deshane

                                                                                  Clarkson University

                                                                                          January 2010


* “Study to show yourself approved to God, a workman that needs not to be ashamed.” (2 Timothy 2:15)
Clarkson University Motto
Disclaimer: The views and opinions expressed in this dissertation are solely those of the author and do not
necessarily represent the views and opinions of anyone else affiliated with Clarkson University.


Contents
LIST OF TABLES
LIST OF ILLUSTRATIONS
LIST OF PUBLICATIONS
CHAPTER 1 INTRODUCTION
    1.1. COPY, PASTE, AND MODIFY PROGRAMMING
    1.2. THE TRADITIONAL PERSPECTIVE: CLONES ARE BAD
    1.3. A NEW PERSPECTIVE: CLONES CAN BE GOOD
    1.4. RESEARCH CONTRIBUTIONS
    1.5. OUTLINE OF THIS DISSERTATION
CHAPTER 2 LITERATURE REVIEW
    2.1. CLONE DETECTION AND REMOVAL
    2.2. CLONE LIFECYCLE MANAGEMENT
        2.2.1. BRIEF CNP TOOL DESCRIPTIONS
        2.2.2. DEFINITIONS OF CLONE PROPERTIES
            2.2.2.1. CLONE SIMILARITY
            2.2.2.2. CLONE MODEL
            2.2.2.3. CLONE VISUALIZATION
            2.2.2.4. CLONE PERSISTENCE
            2.2.2.5. CLONE DOCUMENTATION AND CLONE ATTRIBUTES
        2.2.3. CLONE LIFECYCLE SUPPORT
            2.2.3.1. CLONE CREATION
            2.2.3.2. CLONE CAPTURE
            2.2.3.3. CLONE EDITING
            2.2.3.4. CLONE EXTINCTION
    2.3. PREVALENCE OF CLONES, RENAMING, AND RELATED ERRORS IN PRODUCTION CODE
CHAPTER 3 METHODOLOGY
    3.1. USER STUDY ON CNP’S VISUALIZATION, CREN, AND LEXID
        3.1.1. USER STUDY HYPOTHESES
        3.1.2. SUBJECT CHARACTERISTICS
        3.1.3. STUDY PROCEDURE
        3.1.4. TASK DESCRIPTIONS
            3.1.4.1. DEBUGGING AND MODIFYING WITHIN A CLONE
            3.1.4.2. RENAMING WITHIN A CLONE
CHAPTER 4 RESULTS
    4.1. TIME PER TASK
    4.2. SOLUTION CORRECTNESS
    4.3. METHOD OF COMPLETION
CHAPTER 5 DISCUSSION
    5.1. CONFOUNDING FACTORS FOR CLONE VISUALIZATION
    5.2. THREATS TO VALIDITY
    5.3. TOOL DESIGN
CHAPTER 6 CONCLUSION
    6.1. RESEARCH CONTRIBUTIONS
    6.2. FUTURE WORK
        6.2.1. THEORY ABOUT COPY-AND-PASTE AND ABSTRACTIONS
        6.2.2. OTHER APPLICATIONS OF THIS RESEARCH
REFERENCES
APPENDIX A IRB RECRUITMENT LETTER
APPENDIX B IRB CONSENT FORM
APPENDIX C IRB QUESTIONNAIRE




List of Tables

TABLE 1: SUMMARY OF CLONE TRACKING TOOLS WITH THEIR DEFINITIONS OF CLONE PROPERTIES.
TABLE 2: SUMMARY OF CLONE TRACKING TOOLS WITH THEIR CLONE LIFECYCLE SUPPORT.
TABLE 3: EXAMPLES OF WHAT LEXID CONSIDERS TO BE SUBSTRINGS.
TABLE 4: THREE EXAMPLES FROM LITERATURE THAT SHOW AN INCONSISTENT RENAMING OF IDENTIFIERS IN THE PASTED CODE FRAGMENT.
TABLE 5: HIGH-LEVEL DESCRIPTION OF THE TASKS IN THE USER STUDY.
TABLE 6: THE TIME (IN MINUTES) TO COMPLETE EACH PAIR OF TASKS.
TABLE 7: STATISTICAL HYPOTHESIS TESTING ON THE PAIRED TIME DATA.
TABLE 8: CORRECT STATES WHEN RUNNING THE PROGRAM OR WHEN FINISHED.
TABLE 9: NUMBER OF SUBJECTS WHO USED EACH LOCATION AND INSPECTION METHOD FOR DEBUGGING AND MODIFICATION TASKS.
TABLE 10: NUMBER OF TIMES EACH RENAMING METHOD WAS USED FOR RENAMING TASKS.




List of Illustrations

FIGURE 1: THE IDENTIFIER INSTANCES IN THE COPIED CODE ARE MATCHED WITH THEIR CORRESPONDING IDENTIFIER INSTANCES IN THE PASTED CODE.
FIGURE 2: THE IDENTIFIER INSTANCES IN THE COPIED AND PASTED CODE ARE PARTITIONED INTO GROUPS AND MAPPED TO EACH OTHER.
FIGURE 3: THE POSITION OF THE SOURCE CODE CHARACTERS AS REPRESENTED IN AN ASTNODE.
FIGURE 4: THE THREE CASES WHEN CAPTURING A RANGE OF SOURCE CODE USING THE ECLIPSE AST API.
FIGURE 5: CNP CLONE VISUALIZATION HAS DISTINCTION BETWEEN CLONE GROUPS AND THE CLONE ORIGIN AND ITS SUBSEQUENT PASTES.
FIGURE 6: CSER SHOWS THE CHANGES THAT WOULD BE MADE TO THE EXCLUSIONINCLUSIONDIALOG CLASS (HIGHLIGHTED CODE FOR INSERTS, DELETES, UPDATES, MOVES; AND HOVER INFORMATION FOR DELETES, UPDATES) TO MAKE THE SETFILTERWIZARDPAGE CLASS IN SETFILTERWIZARDPAGE’S FILE IN THE ECLIPSE EDITOR.
FIGURE 7: THE CLONE LIFECYCLE – CLONE CREATION, CLONE CAPTURE, CLONE EDITING, AND CLONE EXTINCTION.
FIGURE 8: CONSISTENT IDENTIFIER RENAMING WITHIN A CLONE USING CREN.
FIGURE 9: THE PROGRAMMER CAN CHOOSE TO RENAME AN INSTANCE SEPARATELY FROM THE OTHERS (NOTICE THAT ONE “I” IN THE PASTED LOOP ON LINE 33 IS NOT BEING RENAMED AS A “J” WITH THE OTHERS ANYMORE).
FIGURE 10: THE ABSTRACT SYNTAX TREE (AST) OF A FOR LOOP WITH THE IDENTIFIER GROUPS HIGHLIGHTED.
FIGURE 11: LEXID CHANGES THE SUBSTRINGS “LEFT” TO “RIGHT” WHEN ONE IS EDITED. IN THE FUTURE, LEXID CAN BE MADE TO AUTOMATICALLY INFER THE SUBSTRING “RIGHT” IN THE PASTED CODE BASED ON “LEFT” BY MAINTAINING A DATABASE OF COMMON NAMING PAIRS.
FIGURE 12: LEXID RENAMES A SUBSTRING “B” TO “Y” CONSISTENTLY IN PASTED CODE.
FIGURE 13: A NEW FEATURE OF LEXID CAN BE SUPPORT FOR AUTO-INCREMENTING TOKENS (LEFT) AS WELL AS LEXICAL PATTERNS IN IDENTIFIERS (RIGHT).
FIGURE 14: LEXID CAN BE MADE TO INFER THAT THE CONSTRUCTOR THAT IS CALLED WITHIN A COMMON METHOD SHOULD BE THE SAME AS THE CURRENT SUBCLASS’ NAME (“XXX”).
FIGURE 15: FIND & REPLACE CAN RENAME ALL INSTANCES OF “I” (AS A WHOLE WORD) TO “J” IN THE SELECTED LINES, BUT THIS NEEDS TO BE SPECIFIED BY THE PROGRAMMER AND IS SIMPLY A TEXT-BASED SEARCH.
FIGURE 16: RENAME REFACTORING DOES NOT WORK WITH CODE THAT DOES NOT TYPE CHECK (BINDING IS REQUIRED FOR IT TO WORK).
FIGURE 17: CREN WORKS WITH CODE THAT DOES NOT TYPE CHECK (BINDING IS NOT REQUIRED FOR IT TO WORK).
FIGURE 18: RENAME REFACTORING IS NOT LIMITED TO RENAMING WITHIN A CLONE (FOR EXAMPLE, ONLY IN THE PASTED FOR LOOP).
FIGURE 19: REFACTORING (TOP) VS. CREN (BOTTOM).
FIGURE 20: CREN WORKS ACROSS MULTIPLE FILES (FILE 1 IS ON TOP, FILE 2 IS ON THE BOTTOM).
FIGURE 21: LINKED RENAMING DOES NOT WORK WITH CODE THAT DOES NOT PARSE (NOTICE THE ADDED SEMI-COLON BETWEEN THE ++ ON LINE 33).
FIGURE 22: CREN WORKS WITH CODE THAT DOES NOT PARSE (NOTICE THE ADDED SEMI-COLON BETWEEN THE ++ ON LINE 33).
FIGURE 23: LINKED RENAMING IS NOT LIMITED TO RENAMING WITHIN A CLONE (FOR EXAMPLE, ONLY IN THE PASTED FOR LOOP).
FIGURE 24: THE CMU PAINT PROGRAM USED IN THE USER STUDY WITH WIDGETS ANNOTATED BY CORRESPONDING INSTANCE VARIABLES.
FIGURE 25: TASK 1 – RSLIDER SHOULD BE BSLIDER (ON LINE 120).
FIGURE 26: TASK 2 – COLORCHANGELISTENER SHOULD BE THICKNESSCHANGELISTENER (ON LINE 142).
FIGURE 27: TITLED BORDERS ARE SHOWN AROUND THE COLOR PANEL AND THE THICKNESS PANEL.
FIGURE 28: TASK 3 – ADD A TITLED BORDER TO COLORPANEL AND TO THICKNESSPANEL.
FIGURE 29: THE LABELS OF THE RED, GREEN, AND BLUE SLIDERS ARE SHOWN COLORED.
FIGURE 30: TASK 4 – ADD COLOR TO THE LABEL OF EACH COLOR SLIDER: RED, GREEN, AND BLUE.
FIGURE 31: TASK 5 – RENAME COLORPANEL TO THICKNESSPANEL.
FIGURE 32: TASK 6 – RENAME TOOLPANEL TO CLEARUNDOPANEL.
FIGURE 33: TASK 7 (PART 1) – RENAME RPANEL TO GPANEL AND RSLIDER TO GSLIDER IN THE GREEN SLIDER CLONE.
FIGURE 34: TASK 8 – RENAME BPANEL TO TPANEL AND BSLIDER TO TSLIDER IN THE THICKNESS SLIDER CLONE.




List of Publications
[1]   P. Jablonski and D. Hou, “Renaming Parts of Identifiers Consistently within Code
      Clones”, IEEE International Conference on Program Comprehension (ICPC),
      2010. (2 pages)

[2]   P. Jablonski and D. Hou, “Aiding Software Maintenance with Copy-and-Paste
      Clone-Awareness”, IEEE International Conference on Program Comprehension
      (ICPC), 2010. (10 pages)

[3]   F. Jacob, D. Hou, and P. Jablonski, “Actively Comparing Clones Inside The Code
      Editor”, International Workshop on Software Clones (IWSC), 2010. (8 pages)

[4]   D. Hou, F. Jacob, and P. Jablonski, “Exploring the Design Space of Proactive Tool
      Support for Copy-and-Paste Programming”, IBM Conference of the Centre for
      Advanced Studies on Collaborative Research (CASCON), 2009. (15 pages)

[5]   D. Hou, F. Jacob, and P. Jablonski, “Proactively Managing Copy-and-Paste
      Induced Code Clones”, IEEE International Conference on Software Maintenance
      (ICSM), 2009. (2 pages)

[6]   D. Hou, P. Jablonski, and F. Jacob, “CnP: Towards an Environment for the
      Proactive Management of Copy-and-Paste Programming”, IEEE International
      Conference on Program Comprehension (ICPC), 2009. (5 pages)

[7]   P. Jablonski, “Clone-Aware Editing with CnP”, ACM SIGSOFT International
      Symposium on the Foundations of Software Engineering (FSE), Student Research
      Forum, 2008. (poster)

[8]   P. Jablonski, “Techniques for Detecting and Preventing Copy-and-Paste Errors
      during Software Development”, Clarkson University, PhD Dissertation Proposal,
      2007. (21 pages)

[9]   P. Jablonski and D. Hou, “CReN: A Tool for Tracking Copy-and-Paste Code
      Clones and Renaming Identifiers Consistently in the IDE”, Eclipse Technology
      Exchange Workshop at OOPSLA (ETX), 2007. (5 pages)

[10] P. Jablonski, “Managing the Copy-and-Paste Programming Practice in Modern
     IDEs”, ACM SIGPLAN Conference on Object-Oriented Programming, Systems,
     Languages, and Applications (OOPSLA), 2007. (2 pages)




Chapter 1
Introduction

                                              Copy and paste is a design error.
                                              - David Parnas

                                              Copying all or parts of a program is as natural to
                                              a programmer as breathing, and as productive.
                                              - Richard Stallman



1.1. Copy, Paste, and Modify Programming

                                              All programming is maintenance programming,
                                              because you are rarely writing original code.
                                              - Dave Thomas



Copy and paste [236, 237, 238, 239, 240] – some people love it, others hate it. Why?

       Copying and pasting obviously provides some short-term benefits, such as saving
typing and avoiding the need to remember a name’s spelling. In a study on copy-and-paste
usage, approximately 74% of programmers copied very small pieces of code of less than a
single line (such as variable names, type names, or method names) [132], which indicates
that they were copying and pasting for these kinds of reasons.

       The same study also concluded that the programmers on average made four non-
trivial copy-and-pastes per hour [132]. It seems natural for programmers to copy and
paste larger code fragments (such as blocks, methods, or classes) when they see a similar
existing solution to their current task rather than write the new software solution entirely
from scratch. Not only can copying and pasting make programmers more productive in
this way, but it can be especially useful when working in an unfamiliar domain, for
instance, when learning a new programming language or framework. To help get started,
programmers can copy and paste examples from the framework’s documentation [28],
from a software repository consisting of past projects [87, 92, 217], or from an online
search engine (such as Google Code Search) [28, 86] to use as a base to work from.


Reusing Source Code Examples

        Example-based programming is a legitimate form of software reuse (unlike cases

of copying and pasting in order to plagiarize [176, 198, 218], which a variety of

plagiarism detection tools have been developed to help deter, including AntiPlagiarist,

CopyCatch, DOC Cop, Eve2, Glatt, GPlag, JPlag, MyDropBox, PAIRwise, SNITCH,

SPlaT, TurnItIn, and WCopyFind*). Research findings in the psychology and AI fields

verify that working with concrete examples can be advantageous [51, 209]. However,

though some software components are especially designed to be reused (such as libraries,

frameworks, APIs, and software product lines), not all examples that a programmer may

find were specifically made for reuse purposes. As such, the programmer must be careful

to extract only the functionality that is needed for reuse, while also dealing with

dependencies that this code fragment may have to other parts of the software. Tool

support has been developed to aid programmers in the whole process of pragmatic reuse

[85, 86, 88, 89, 90, 91], reengineering [64], and in the comparison of examples [42].


The Psychology of Software Reuse

        Novices generally copy and paste when they do not have a full understanding of

the programming task. Since they are new to programming or to a particular language,

they do not have the syntactic, semantic, and schematic knowledge that experts have in

order to craft a solution. Novices are not the only ones who copy and paste for reuse,

however. According to [51, 52], expert programmers have “schemas” (plans) that


*
 http://www.anticutandpaste.com/antiplagiarist/, http://www.copycatchgold.com/,
http://www.doccop.com/, http://www.canexus.com/, http://www.plagiarism.com/,
http://research.microsoft.com/apps/pubs/default.aspx?id=73093, https://www.ipd.uni-karlsruhe.de/jplag/,
http://www.mydropbox.com/, http://www.pairwise.cits.ucsb.edu/, http://actlab.csc.villanova.edu/simtools/,
http://splat.cs.arizona.edu/, http://www.turnitin.com/, http://plagiarism.phys.virginia.edu/Wsoftware.html


represent generic solutions kept in their memories specific to a programming domain that

they can retrieve and instantiate to solve a particular programming problem. In other

words, as experts become familiar with a problem domain, they develop domain-specific

schemas, representing their knowledge of certain types of problems [52], which they can

later recall to help them design a new program. Having prior knowledge and experience,

expert programmers can use their familiarity with the situation to gain efficiency and the

ability to solve more difficult tasks than if they had to design the solution entirely from

scratch. Routine tasks can even become impossible to do if every part is treated as new

[51, 52]. Humans naturally apply knowledge from prior experience to present tasks.

       The copy and paste of source code (both large and small) tends to be a natural

behavior that provides immediate benefits. The copy-and-paste operation is not bad by

itself, but the result of copying and pasting is what is considered bad, since the resulting

clones need to be consistently modified and maintained in the long-term (the “modify”

part of “copy, paste, and modify programming” [234]). Still, many people continue to

strongly dislike copy-and-paste itself and blame it as the culprit of the maintenance

problem of clones (which often leads to code inconsistencies) [182]. This and some other

perceived problems of code clones are discussed in the following section.


1.2. The Traditional Perspective: Clones are Bad

                                              So, copy-and-paste is not necessarily bad in the
                                              short run, if you are copying good code. But it is
                                              always bad in the long run.
                                                                               - Ralph Johnson


       Traditionally, code cloning was considered “harmful” to a system. Some problem

areas include software maintenance, evolution, quality, and code aesthetics or design.


Clones as a Software Maintenance Problem

       Copying and pasting within the same code base results in code duplication [243]

that needs to be properly managed and maintained. The clones are exactly the same when

initially copied and pasted, but start to differ as the newly pasted code is modified to fit

its task. At the time the copy and paste occurs, the programmer sees the similarity

between the clones (otherwise he or she would not have made an exact duplicate as a

base to work from) and he or she also has an idea of the differences that need to be made

for the new code to be properly adapted. A natural dependency exists between the clones,

which are assumed to have a certain level of similarity that must remain between them.

This invisible relationship between copied and pasted code fragments consists of the

correspondences and differences between the clones that must be maintained as the

software is updated, for example, with new features and bug fixes. It is important for the

software maintainer to remember the parts of the related clones that should remain

unchanged, parts that must change in the same way, and parts between the clones that are

meant to differ [72]. Identifying the locations of all clones in a system and remembering

their invisible relationships to one another can be extremely difficult over time.


Clones as a Software Evolution Problem

       As changes to the software (like new features or bug fixes) are required over time,

the clones in the system may also naturally change. In some cases, the programmer may

have copied and pasted in order to get a quick solution rather than taking the time to

create an abstraction such as a procedure, function, or method. If so, these clones are

likely to be replaced by an abstraction as the code matures. The issue here is that even




though the creation of the clones is avoidable to begin with and the clones will eventually

disappear anyway, there is still a time when the clones exist in which they need to be

properly maintained. Though these particular clones are only in the system temporarily

and their entire life may be short, there is still significant effort needed in refactoring the

code. On the other hand, perpetual clones are problematic in that they require continuous,

long-term maintenance.


Clones as a Software Quality Problem

       The increase in source code maintenance is not the only concern of opponents of

code cloning. The potential increase in the number of software bugs in the system is one

of the most widely cited reasons for avoiding clone creation. Some scenarios where bugs

are introduced into the system as a result of cloning include:

   •   The addition of a new feature: When the system needs to be updated to include a

       new feature, the software maintainer must know whether to apply this particular

       change to all related clones or only to some of them. If the maintainer fails to

       apply this change to all of the correct clones, a bug (inconsistency) is introduced.

   •   A bug is propagated and fixed: It is possible that the original code that was copied

       had an existing bug in it that has now been multiplied as it was pasted throughout

       the system. Once this bug has been noticed, it then needs to be fixed in every clone

       that contains it. If one of those instances is not fixed, there remains an inconsistency,

       which is actually a new bug introduced into the system!

   •   A clone is modified to fit its task: Changes are made to a single clone when it is

       being modified to fit its own individual task. The newly pasted code fragment

       typically has identifiers changed to a new name related to the current task. If

       not all identifier instances are renamed consistently within the code fragment,

       this will create an inconsistency (bug).

In all of these cases, the clone-related bugs can remain undetected. It may take a long

time for the absence of a new feature in a clone to be detected (especially if that part of

the software is not used often in practice). In the second case, since the existing bug was

not detected earlier, it is possible that the same bug might remain hidden somewhere else

in the code. Lastly, though a renaming inconsistency could be caught by the compiler,

there are cases when the unchanged identifier instance is still in scope (Section 2.3 –

Errors), which can remain undetected by both the compiler and programmer. All of these

clone-related bugs occur when the implicit rules in the cloning relationship are broken.
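
The third scenario can be illustrated with a small Java sketch. The class, method, and variable names are hypothetical and chosen only for illustration; they do not come from this research. A loop over `widths` is copied and adapted to `heights`, but one instance of the old identifier is missed. Because `widths` is still in scope in the pasted code, the compiler accepts it.

```java
// Hypothetical illustration of an inconsistent rename inside a pasted clone.
class RenameInconsistency {

    // Original fragment: sums all widths.
    static int totalWidth(int[] widths, int[] heights) {
        int total = 0;
        for (int i = 0; i < widths.length; i++) {
            total += widths[i];
        }
        return total;
    }

    // Pasted fragment, adapted to heights. One instance of "widths" was
    // missed; it is still in scope, so the code compiles without warning.
    static int totalHeightBuggy(int[] widths, int[] heights) {
        int total = 0;
        for (int i = 0; i < heights.length; i++) {
            total += widths[i]; // BUG: should be heights[i]
        }
        return total;
    }

    // The same clone with every identifier instance renamed consistently.
    static int totalHeightFixed(int[] widths, int[] heights) {
        int total = 0;
        for (int i = 0; i < heights.length; i++) {
            total += heights[i];
        }
        return total;
    }

    public static void main(String[] args) {
        int[] widths = {1, 2, 3};
        int[] heights = {10, 20, 30};
        System.out.println("buggy clone: " + totalHeightBuggy(widths, heights));
        System.out.println("fixed clone: " + totalHeightFixed(widths, heights));
    }
}
```

Both versions compile and run; only the computed value differs, which is why this kind of bug can go unnoticed by the compiler and the programmer alike.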


Clones as an Aesthetic or Design Problem


                                                 Number 1 in the stink parade is duplicated code.
                                                 If you see the same code structure in more than
                                                 one place, you can be sure that your program will
                                                 be better if you find a way to unify them.
                                                                   - Kent Beck and Martin Fowler


        In addition to the potential decrease in software quality, some people say that

clones in software just look bad and that their presence in the code might indicate an

underlying design problem. Clones can artificially increase the number of lines of code

by adding “unnecessary” lines that otherwise would be in the body of a single abstraction

[226]. Charles Simonyi, who introduced the concept of “intentional programming” [29,

222], is a proponent of programming with abstractions rather than with clones. He states

that “...it is still pretty easy to decide at a glance that the code is bad – by the identifiers,

by the juxtapositions, by the size of the expressions, or by evidences of code copying”




[221]. But he also says that a program can still be beautiful even if it is not strictly

structured, as long as the program has other redeeming features [159].

       Code clones are often labeled as a “code smell” [235], which is a hint that

something could be wrong with the code. This part of the code should be inspected

further to determine whether there is actually a problem that needs to be fixed or whether

the smell can simply be tolerated [179]. The term “clone smell” [13] was later coined to

describe an individual clone that appears problematic over time and should be examined.

       The existence of clones may indicate a design problem, since it could be that the

programmer did not fully think through the design of the software solution if abstractions

were not used wherever possible. Abstraction-supported programming languages are

designed so that programmers can take advantage of these powerful tools [29]. So, when

programmers do not use the abstractions (for whatever reason) [150], they are not getting

all of the benefits that the programming language has to offer and they may not be

properly utilizing the language as it was intended to be used by design. If the clones are

to be refactored out of the code later on anyway, it might be worth spending the effort

and time to design the abstractions correctly from the beginning. Martin Fowler sees a

connection between a code’s look and smell: “I wrote that about aesthetics in discussing

when you apply refactorings. To some extent, the situations I describe in the refactoring

guidelines are fairly vague notions of aesthetics. But I try to provide more guidance than

just saying, ‘Refactor when the code looks ugly.’ I say, for instance, that duplicated code

is a bad smell. I say that long methods are a bad smell. Big classes are a bad smell.”




1.3. A New Perspective: Clones can be Good


                                              If you have a procedure with ten parameters, you
                                              probably missed some. - Anonymous


       Duplicated or cloned code is often considered harmful to software quality;

however, it can also be a reasonable or beneficial design option. Cloning can be done with

“good intentions”, including when 1) it keeps the code clean and understandable rather

than introducing an unreadable, complicated abstraction, and 2) the programming

language lacks expressiveness, so a trusted solution is reused (for example, in COBOL)

[122, 125]. If a procedure would have too many parameters or if a programming language

does not support abstractions, then clones can be a viable alternative.

       There are times when it is advised to keep clones in the source code. An empirical

study of code clone genealogies that looked at clones over multiple versions of a program

[137] found that it may not be worth refactoring short-lived clones if they are likely to

diverge soon and that the long-living clones are often in the system due to shortcomings

of the programming language. As a result, limitations of the programming language

design may result in unavoidable duplicates in the code [132]. Research from Cordy claims

that making changes to clones (which includes refactoring them) can be considered risky

from a corporate standpoint, so to be safe, the clones should remain in the system [39].




Have People Been Led Astray?




                                               All we like sheep have gone astray. - Isaiah 53:6


       According to MythSE 2007, the statement that “clones are evil” is actually a myth

in software engineering [81]. Various facts are used to refute the myth, including [8, 39,

122, 125, 137, 151, 180, 191] with reasons explained on the website [81]. Godfrey says

that people may have been led astray like sheep, in their thinking as a group that cloning

is bad. He reiterates that cloning (or starting with the familiar) is both natural and good.

For example, he claims that in both arts and life, people explore new things by carefully

venturing away from the familiar and that humans find comfort in ritual, and more

importantly, repetition of trusted design elements is a part of engineering [74].

       Regardless of the outcome of the debate about the value of copy-and-paste and

cloning, this PhD research focused on the fact that code clones do exist and thus need to

be managed. Even if clones are made with good intentions or out of necessity, they can

still be problematic if not handled properly. One contribution of this work, the software

tool CnP, is a proactive clone management environment that tracks copy-and-paste-

induced clones upon creation. Based on the tracked cloning information, CnP provides

support for clone-related maintenance activities. This dissertation shows how CnP’s

support for copy-and-paste clone-awareness may be able to help programmers benefit

from this clone information during debugging and modification tasks, develop software

more efficiently, and prevent inconsistent identifier renaming within clones. A user study

was performed to measure the effects of this kind of clone-aware programming.




1.4. Research Contributions
The main contributions of this research included:

   •   The copy-and-paste (CnP) tool

           o Proactive tracking – CnP/CReN were the first known clone tracking

              tools published (in 2007), which took a more proactive approach to

              capturing clones upon creation (by detecting when a copy and paste occurs

              and gathering the initial clone and identifier information at that time when

              the clones are identical).

           o Intra-clone editing – CReN was the only known tool to support editing

              within a clone (all previous tools only supported between-clone editing).

              Intra-clone editing is done when programmers copy, paste, and modify the

              pasted code to fit the current task. The kind of modification that is made in

              these cases is often identifier renaming, which is what CReN supports.

           o AST-based – CnP makes use of the abstract syntax tree (AST)

              representation of the source code, which is a better approach than the text-

              based methods that cannot differentiate between source code and any other

              text. CSeR is one of the few differencing tools to take advantage of ASTs.

   •   Dimensions of clone tracking tool development – When comparing CnP with

       related clone tracking tools, a variety of clone properties were determined that

       these kinds of tools must explicitly define. Listing the properties can be useful in

       the creation of new tools or to help redefine a tool’s current property definitions.




    •   Definition of the clone lifecycle – The comparison of tools also led to a definition

        of the clone lifecycle stages, including some areas where there is current tool

        support and areas that need more support.

    •   Realization about clone visualization – After completing a user study on CnP,

        CnP’s clone visualization was not found to yield statistically significantly faster or

        more correct solutions than working without it. Observation and other analysis (in Section 5.1)

        helped better determine whether and when a programmer may exploit clone

        information. There is no other known similar analysis of the role of clone

        information in maintenance tasks, and thus the analysis in and of itself can be a

        contribution. The analysis can be used in the design of future experiments.


1.5. Outline of This Dissertation
        This dissertation first presents the traditional perspective on copying and pasting

and code cloning (Section 1.2), including the clone detection and removal approach

(Section 2.1). It then introduces the new perspective that states that even though cloning

can be problematic, clones can be reasonable and beneficial to a software system (Section

1.3). Furthermore, since these clones can be in the source code for any length of time, this

dissertation proposes that clones should be managed throughout their lifecycles until

extinction, that is, if they ever get to that stage (Section 2.2).

        As most of the problems with cloning revolve around the issue of software

maintenance, support for modification or editing is the main focus of the related clone

tracking tools (Section 2.2.3.3). An additional distinction between these clone tracking

tools is whether they are proactive or retroactive, that is, whether they start capturing

clone information upon the clone’s creation (via copy and paste) or whether they use


clone detection or clone selection by the programmer, which can start the clone tracking

much later in the clone’s life (Section 2.2.3.2). Each tool can also define the properties of

clones differently, with some tool designs and implementations preferred over others

(Section 2.2.2).

       Finally, this dissertation presents the design (Chapter 3) and results (Chapter 4) of

a user study that tested the CnP tool’s basic visualization and renaming features, followed

by a discussion related to this study (Chapter 5). The dissertation closes with a conclusion

and future work (Chapter 6).




Chapter 2
Literature Review

                                                     If something is worth doing once,
                                                     it's worth building a tool to do it.
                                                                - A Software Engineering Proverb

2.1. Clone Detection and Removal
                                                     Software entities are more complex for their size
                                                     than perhaps any other human construct because
                                                     no two parts are alike (at least above the
                                                     statement level). If they are, we make the two
                                                     similar parts into a subroutine – open or closed.
                                                     In this respect, software systems differ profoundly
                                                     from computers, buildings, or automobiles, where
                                                     repeated elements abound.
                                                                               - Frederick P. Brooks, Jr.

Clone Detection

        There is a wide variety of clone-related research [148, 149]. Traditionally, much

of the focus has been on clone detection [162, 211, 213, 214] and removal. In this field,

researchers often contribute a variety of clone detection techniques, including algorithms

[57, 60, 61, 69, 109, 110, 112, 113, 120, 193, 207, 215, 216], heuristics [17, 18] and

processes [158]. Many early algorithms made use of program dependence graphs (PDGs)

[20, 63, 93, 94, 144, 152] and program slicing [24, 145]. Early research dealt with

finding exact code duplicates, while later work expanded to detect “near-miss clones”

(code fragments that are not identical, but have some level of similarity) [10, 11, 21, 40,

212, 223]. Some algorithms were implemented as clone detection tools [22, 23] (such as

AntiCutAndPaste, CCFinderX, Clone Digger, CloneDR, Dup, Duplo, DupMan, Moss,

SDD, Simian, and SimScan*) whose purpose is to find code clones in pre-existing code.


*
  http://www.anticutandpaste.com/anticutandpaste/, http://www.ccfinder.net/ccfinderx.html,
http://clonedigger.sourceforge.net/, http://www.semdesigns.com/Products/Clone/,
http://cm.bell-labs.com/who/bsb/research.html, http://sourceforge.net/projects/duplo/,
http://sourceforge.net/projects/dupman/, http://theory.stanford.edu/~aiken/moss/,
http://wiki.eclipse.org/index.php/Duplicated_code_detection_tool_(SDD),
http://www.redhillconsulting.com.au/products/simian/, http://www.blue-edge.bg/download.html


       Clone detection tools are retroactive and, as a result, can yield a number of false

positives and false negatives that must be sorted through by the programmer. The fact

that humans need to go through a clone detection tool’s results to verify its accuracy in

returning actual clones of interest is a major disadvantage of these kinds of tools.


Clone Removal

       People who dislike copy-and-paste and code clones tend to want to solve the

problems of cloning by removing the clones from the system as soon as possible. The

main reason for clone detection has been for subsequent clone removal, that is, to get rid

of the clones in legacy systems (already existing source code). As previously mentioned,

this approach is retroactive and thus is not solving the problem as it happens. On the

other hand, one suggested form of proactive “clone prevention” [21] is simply to run a

clone detection tool on the code as it is being developed, so that the clones can be

removed immediately by the programmer. Others even suggest preventing the creation

of clones by disabling the copy and paste functionality in the programming editor! But

prevention is not enough, since some clones must or should remain in the source code.

       The most common method of clone removal is refactoring [67], which means to

restructure or change the source code without changing its external functional behavior.

One of the most common refactorings of clones is functional abstraction: the multiple,

similar code fragments are replaced with a single procedure [142, 143], which makes

maintenance easier since updates can be made in one place. The common portion

between the clones would be the function body and the differences would be handled by

the function parameters. Cloned classes can be refactored such that “a base class

encapsulates the commonalities and the derived classes specialize in the peculiarities”



[74]. Using generics [108] and templates for classes [19] can also add an acceptable form

of abstraction into the system thus eliminating class-level clones. Other forms of

refactored clones [74, 148] include: macros [3], design patterns [148], program slices

[71], and software product lines [68, 184, 185]. The process of code refactoring can be

error-prone when done manually [79], but there is some default refactoring support in the

IDE (like renaming and moving [252]) and separate refactoring tools (such as [78, 79, 82,

84, 227]), which can help the programmer determine how and where to refactor.
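
As a minimal sketch of the functional abstraction described above, consider two hypothetical clones that differ only in a label and a discount percentage (all names and values here are invented for illustration). The shared statements become the body of a single method, and the differences become its parameters.

```java
// Hypothetical example: two clones refactored into one parameterized method.
class CloneRefactoring {

    // Before: clone 1.
    static String memberPrice(int price) {
        int discounted = price - price * 10 / 100;
        return "member: " + discounted;
    }

    // Before: clone 2, copied from clone 1 with the label and rate changed.
    static String studentPrice(int price) {
        int discounted = price - price * 20 / 100;
        return "student: " + discounted;
    }

    // After: the common portion is the method body; the parts that were
    // meant to differ between the clones are passed in as parameters.
    static String discountedPrice(String label, int percent, int price) {
        int discounted = price - price * percent / 100;
        return label + ": " + discounted;
    }

    public static void main(String[] args) {
        System.out.println(memberPrice(200));
        System.out.println(discountedPrice("member", 10, 200));
        System.out.println(studentPrice(200));
        System.out.println(discountedPrice("student", 20, 200));
    }
}
```

After the refactoring, a change to the pricing logic is made once in `discountedPrice` instead of once per clone.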


When to Refactor
                                               The first time you do something, you just do it.
                                               The second time you do something similar, you
                                               wince at the duplication, but you do the duplicate
                                               thing anyway. The third time you do something
                                               similar, you refactor. - Don Roberts



       There are varying perspectives about when to refactor. Purists believe that all

code smells (including code clones) should be avoided with no exceptions [235]. They

agree with the “Don’t Repeat Yourself (DRY)” principle, which states that “every piece

of knowledge must have a single, unambiguous, authoritative representation within a

system” [242]. The Extreme Programming (XP) software development methodology calls

this “Once and Only Once” (that is, that “each and every declaration of behavior should

appear once and only once”) [244]. Followers of these rules would favor refactoring to

make a single abstraction as soon as possible. A common rule of thumb, called the

“Rule of Three”, states instead that the same code may be duplicated once, but that the

clones should be refactored when a third similar instance appears [246, 247]. In general, it

takes at least three applications of something for it to be considered a pattern [247], so it

seems that the “Rule of Three” would be what is more often done naturally in practice.



Despite the potential benefits of refactoring to make the code more maintainable

and less complex, refactoring can be done prematurely before it would happen naturally.

This could be problematic and require significant effort to fix. Also, creating an

abstraction can be difficult or impossible, for example, due to the programmer’s inability

to create the abstraction [76, 150] or due to language constraints. Furthermore, even

though there are rules about when to refactor, the rules can be broken, which would leave

clones in the system that need to be managed for a temporary or extended period of time.


2.2. Clone Lifecycle Management

                                             Cloning is a good strategy if you have the right
                                             tools in place. Let programmers copy and adjust,
                                             and then let tools factor out the differences with
                                             appropriate mechanisms. - Ira Baxter


       Since clones will continue to exist and some clones may even be intentionally

permanent, tool support is needed for all stages of the clone lifecycle. The term “clone

management” has been used to refer to “clone removal” [146, 147] and also one kind of

“clone editing” that links together clones for common changes to be made simultaneously

among them [54, 55, 189, 231]. Both “clone editing” and “clone removal” (in other

words, clone extinction) are parts of the clone lifecycle that can be managed with the aid

of software tools. This dissertation presents the dimensions of a software tool, CnP,

which provides copy-and-paste-induced clone management in the Eclipse IDE.


2.2.1.        Brief CnP Tool Descriptions
       The entire suite of Eclipse plug-ins from this research that supports copy, paste,

and modify programming is called CnP. At the time of this writing, the CnP project


consists of three plug-ins: CReN (for consistent identifier renaming), LexId (for

consistent substring renaming), and CSeR (for clone comparison). All CnP plug-ins

utilize the abstract syntax tree (AST) source code representation that is available in the

Eclipse framework. First, the tools track the cloning relationship right when the code is

copied and pasted before any changes are made. Each clone’s location is accurately

tracked according to its starting character position and length in number of characters

within a source code file. Only copied and pasted code that is fully contained within an

AST node is captured in this model. Related clones from the same copy and paste

sequence are also noted (Section 2.2.2.2 – Clone Model).
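
The location model described above (a starting character position plus a length) can be sketched as follows. This is not CnP's actual code: the class name `TrackedClone` and the rule for adjusting a location after an insertion are assumptions made for illustration, since this section does not specify how positions are updated as the file is edited.

```java
// Hypothetical sketch of a clone location stored as offset + length.
final class TrackedClone {
    final int offset; // starting character position within the source file
    final int length; // length of the clone in characters

    TrackedClone(int offset, int length) {
        this.offset = offset;
        this.length = length;
    }

    // Exclusive end position of the clone.
    int end() {
        return offset + length;
    }

    // Assumed update rule: returns the clone's location after insertedLength
    // characters are inserted at the given position in the file.
    TrackedClone afterInsert(int position, int insertedLength) {
        if (position <= offset) {
            // Insertion before the clone: the clone shifts right.
            return new TrackedClone(offset + insertedLength, length);
        }
        if (position < end()) {
            // Insertion inside the clone: the clone grows.
            return new TrackedClone(offset, length + insertedLength);
        }
        // Insertion after the clone: nothing changes.
        return this;
    }

    // Extracts the clone's text from the file contents.
    String textIn(String source) {
        return source.substring(offset, end());
    }

    public static void main(String[] args) {
        String before = "int a = 1; int b = 2;";
        TrackedClone clone = new TrackedClone(11, 10);
        System.out.println(clone.textIn(before));

        String comment = "/* x */ ";
        String after = comment + before;
        TrackedClone moved = clone.afterInsert(0, comment.length());
        System.out.println(moved.textIn(after));
    }
}
```

Storing only an offset and a length keeps the model simple, but it means every edit to the file must be observed so that the stored location stays accurate.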

       CnP’s basic visualization (used in CReN and LexId) consists of colored bars next

to the clone’s code fragment within the source code file. CSeR has its own unique

method of visualization that differentiates between inserts, deletes, updates, and moves,

highlighting each kind of user-made change with a different color (Section 2.2.2.3 –

Clone Visualization).

       In addition to clone tracking and visualization, CReN and LexId track identifiers

within these related clones. First, the identifier instance locations between the clones

(which are AST leaf nodes of type SimpleName) are matched, which represents the

correspondence relationship, as in Figure 1. (Note: this correspondence is not used by

CReN or LexId yet). Then, all of the same identifier instances are grouped together,

which are assumed to be renamed together consistently, as in Figure 2. This way when

the programmer edits any one of the identifier instances, all others of the same program

element or name are renamed with it automatically and consistently. All identifier




instances that are currently being edited within a clone are shown boxed, similar to

Eclipse’s Linked Renaming (Section 2.2.3.3 – Clone Editing).




    Figure 1: The identifier instances in the copied code are matched with their
               corresponding identifier instances in the pasted code.




Figure 2: The identifier instances in the copied and pasted code are partitioned into
                          groups and mapped to each other.
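
The grouping step can be approximated with a short, self-contained sketch. CReN itself works on AST leaf nodes of type SimpleName; the version below substitutes a regular expression for the AST so that it runs without Eclipse, which means it also picks up keywords and cannot tell apart two program elements that share a name. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical, regex-based stand-in for AST-based identifier grouping.
class IdentifierGroups {
    private static final Pattern IDENTIFIER =
            Pattern.compile("[A-Za-z_][A-Za-z0-9_]*");

    // Groups identifier instances in a clone by name; each group holds the
    // starting offsets of its instances within the fragment.
    static Map<String, List<Integer>> group(String fragment) {
        Map<String, List<Integer>> groups = new LinkedHashMap<>();
        Matcher m = IDENTIFIER.matcher(fragment);
        while (m.find()) {
            groups.computeIfAbsent(m.group(), k -> new ArrayList<>()).add(m.start());
        }
        return groups;
    }

    // Renames every instance in one group, as if the programmer had edited
    // a single instance and the others followed automatically.
    static String renameGroup(String fragment, String oldName, String newName) {
        List<Integer> starts = group(fragment).getOrDefault(oldName, new ArrayList<>());
        StringBuilder out = new StringBuilder(fragment);
        // Replace from right to left so that earlier offsets stay valid.
        for (int i = starts.size() - 1; i >= 0; i--) {
            int start = starts.get(i);
            out.replace(start, start + oldName.length(), newName);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String pasted = "total += widths[i]; count = widths.length;";
        System.out.println(group(pasted));
        System.out.println(renameGroup(pasted, "widths", "heights"));
    }
}
```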

       LexId further adds onto this default functionality of CReN by tracking and

grouping together common substrings between the different identifiers within a clone.

LexId tracks corresponding identifier pieces and renames these identical parts of

identifier names consistently together within copied and pasted code fragments. All

instances of a common substring among all identifiers within a clone are renamed

together when the programmer renames any one of them (Section 2.2.3.3 – Clone Editing).
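
A rough sketch of this substring renaming is given below. It is an illustration under simplifying assumptions, not LexId's implementation: identifiers are found with a regular expression rather than the AST, matching is case-sensitive, and any identifier containing the substring is affected (so renaming "name" would also change an identifier such as "rename"). The class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of renaming a substring shared by several identifiers.
class SubstringRename {
    private static final Pattern IDENTIFIER =
            Pattern.compile("[A-Za-z_][A-Za-z0-9_]*");

    // Replaces oldPart with newPart inside every identifier of the fragment,
    // leaving operators, literals, and punctuation untouched.
    static String renameSubstring(String fragment, String oldPart, String newPart) {
        Matcher m = IDENTIFIER.matcher(fragment);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String renamed = m.group().replace(oldPart, newPart);
            m.appendReplacement(out, Matcher.quoteReplacement(renamed));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String pasted = "nameLabel.setText(nameField.getText());";
        System.out.println(renameSubstring(pasted, "name", "address"));
    }
}
```

Here the identifiers `nameLabel` and `nameField` are different program elements, so identifier-level renaming would treat them separately; renaming their common substring updates both in one step.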


2.2.2.         Definitions of Clone Properties
       Certain properties of clones need to be explicitly defined when creating a software

tool that tracks code clones. CnP and related software tools can define each clone

property in different ways. The following subsections give a variety of definitions that are

used for clone similarity, clone model, clone visualization, clone persistence, and clone


documentation and clone attributes. Table 1 (on the next page) summarizes the design

and implementation details for each of the related clone tracking tools: Clonescape [38],

CPC [251], Codelink [231], LAPIS [189], and CloneTracker [54, 55], including CnP [95,

96, 97, 100, 101, 102] (and its parts: CReN consistent identifier renaming [103], LexId

consistent substring renaming [104], and CSeR clone comparison [106, 107])*, and it

specifically highlights the problems that the related tools did not address that CnP does.

The emphasis of these six tools, in particular, is in supporting the editing phase of the

lifecycle to avoid inconsistent modifications to clones.


2.2.2.1.         Clone Similarity

                                                     Software clones are segments of code that are
                                                     similar according to some definition of similarity.
                                                                                           - Ira Baxter


        As mentioned in Chapter 1, programmers often copy and paste (which creates

code clones) when they see a similarity between existing code and the current task at

hand. Research in the psychology field agrees that people’s minds work in this way –

new problems are often solved by using prior problems’ solutions [51, 52, 65, 73, 160,

170, 253]. People, even as children, recognize analogy and similarity when comparing

things and they know the correspondence relationship between the objects, whether the

object attributes are shared (similarity) or not (analogy) [73].




*
 http://s88387243.onlinehome.us/wiki/Clonescape/, http://cpc.anetwork.de/,
http://harmonia.cs.berkeley.edu/harmonia/projects/codelink/, http://www.cs.cmu.edu/~rcm/lapis/,
http://www.cs.mcgill.ca/~swevo/clonetracker/, http://www.clarkson.edu/~dhou/projects/CnP/


Table 1: Summary of Clone Tracking Tools with their Definitions of Clone Properties
In general, code clones are defined as “similar” code fragments in software, from

a few lines of code to whole files. The similarity relationship between clones is often

defined in terms of the characteristics of the code that make up the clones such as its text,

syntax, semantics, or pattern [148]. Four types of clones have been defined [23]:

   •   A Type 1 clone is an exact copy without modifications (except for white space

       and comments).

   •   A Type 2 clone is a syntactically identical copy in which only variable, type, or

       function identifiers were changed.

   •   A Type 3 clone is a copy with further modifications such that statements were

       changed, added, or removed.

   •   And a Type 4 clone is a semantically (or functionally) equivalent segment, which

       may differ significantly in terms of textual equivalence.

Clones that are a result of copying and pasting usually remain textually similar (Types 1-

3) [23] and are the kind of clones that most clone detection research has focused on.

Semantic clones (Type 4), however, can be very difficult [69] or nearly impossible to find

retroactively [23]. All clone detection tools rely on some notion of similarity in source

code in order to define clones and they return “sets of code blocks within a user-supplied

similarity threshold of each other” [223]. However, clone detection results are not perfect,

even for identical code, since other factors such as clone boundaries must also be considered.
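
To make the first three clone types concrete, the following Java sketch compares two fragments after normalizing them. It only illustrates the definitions above and is not the algorithm of any clone detection tool. The class name CloneTypeSketch, its methods, and the short keyword list are assumptions made for this example, and the normalization is deliberately crude (it ignores string literals and removes all white space).

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: a hypothetical helper, not taken from any clone detection tool.
final class CloneTypeSketch {
    // Small, incomplete keyword list (an assumption) so keywords are not treated as identifiers.
    private static final Set<String> KEYWORDS = new HashSet<>(Arrays.asList(
            "int", "long", "double", "boolean", "char", "void", "for", "while",
            "if", "else", "return", "new", "class", "static", "final"));
    private static final Pattern WORD = Pattern.compile("[A-Za-z_][A-Za-z0-9_]*");

    private CloneTypeSketch() {}

    /** Removes comments and all white space; optionally replaces each non-keyword word with "ID". */
    static String normalize(String code, boolean abstractIdentifiers) {
        String text = code.replaceAll("(?s)/\\*.*?\\*/", " ").replaceAll("//[^\\n]*", " ");
        if (abstractIdentifiers) {
            Matcher m = WORD.matcher(text);
            StringBuffer sb = new StringBuffer();
            while (m.find()) {
                String word = m.group();
                m.appendReplacement(sb, KEYWORDS.contains(word) ? word : "ID");
            }
            m.appendTail(sb);
            text = sb.toString();
        }
        return text.replaceAll("\\s+", "");
    }

    /** Type 1: identical except for white space and comments. */
    static boolean looksLikeType1(String a, String b) {
        return normalize(a, false).equals(normalize(b, false));
    }

    /** Type 2: not Type 1, but identical once identifiers are abstracted away. */
    static boolean looksLikeType2(String a, String b) {
        return !looksLikeType1(a, b) && normalize(a, true).equals(normalize(b, true));
    }
}
```

A pair that satisfies neither check but still shares most of its statements would be a Type 3 candidate. Type 4 clones cannot be recognized by textual normalization of this kind at all.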

       As with clone detection tools, determining the similarities and differences

between code fragments is also useful in managing clones. The next two subsections

explain some ways that clone tracking tools use similarity to define what a clone is and

how to manage these clones, respectively.



Defining Clones

       For the retroactive tools that rely on clone detection (CloneTracker), the clone

detection tool defines the level of similarity that existing code pieces must have to be

considered clones. For the retroactive tools that rely on the programmer’s selection

(Codelink and LAPIS), the programmer who selects the clones defines the initial level of

similarity. Either approach, if applied after the programmer has forgotten the cloning

relationships, can yield inaccurate clones. For proactive tools that capture copy-and-paste-

induced clones (CnP, Clonescape, and CPC), the new code fragment is guaranteed to be a

clone and is identical to the original when initially pasted. Because of this, proactive tools

only need to consider what happens to the similarity between clones as they evolve.


Managing Clones

       CnP’s approach to the definition of clone similarity can be characterized as being

constructive and extensional. For example, the consistent renaming (CReN) portion of

CnP manages similarity such that clones in the same clone group all have corresponding

identifiers, which must be renamed together in each clone. The corresponding identifier

groups need to be constructed ahead of time and tracked thereafter. This correspondence

between identifiers can thus be considered as part of the similarity between clones within

the same clone group. In addition to identifier extraction, LexId goes further by grouping

and tracking parts of identifiers (substrings) together. The CSeR correspondence map

currently tracks fields, methods, parameters, conditional expressions, method calls,

simple names, and literal constants between the clone and its origin. It also uses the

Levenshtein Distance (LD) to connect similar but not identical changes as an “update”.
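
The identifier groups that CReN constructs and tracks can be pictured as lists of character offsets that must be renamed together. The Java sketch below is a minimal illustration under that reading, not CnP's actual implementation. The class and method names are hypothetical, and it assumes that every occurrence in the group currently has the same length.

```java
import java.util.Arrays;

// Hypothetical sketch of renaming one tracked identifier group; not CnP's code.
final class IdentifierGroupRename {
    private IdentifierGroupRename() {}

    /**
     * Replaces the identifier occurrences that start at the given offsets
     * (each oldLength characters long) with newName and returns the new source.
     */
    static String renameAll(String source, int[] offsets, int oldLength, String newName) {
        int[] sorted = offsets.clone();
        Arrays.sort(sorted);
        StringBuilder sb = new StringBuilder(source);
        // Work from the last occurrence to the first so earlier offsets stay valid.
        for (int k = sorted.length - 1; k >= 0; k--) {
            sb.replace(sorted[k], sorted[k] + oldLength, newName);
        }
        return sb.toString();
    }
}
```

For example, renaming the group at offsets 0 and 6 (length 3) in "sum = sum + a[i];" to "total" changes both occurrences in one step. A real tool would also have to update the stored offsets of every other tracked group after the edit.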



Codelink uses the longest common subsequence (LCS) algorithm (like the one

implemented by the UNIX Diff utility) to determine the commonalities and differences of

clones within a clone group. The main shortcomings of the LCS algorithm include its

potentially long running time and lack of intuitive results [231].
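
As a rough illustration of what an LCS-based comparison computes, the Java sketch below finds the lines that two fragments have in common. It is the textbook dynamic-programming formulation, not the optimized variant used by UNIX Diff or the implementation in Codelink, and the class and method names are hypothetical. The table has one cell per pair of lines, which hints at why the running time can become long for large fragments.

```java
import java.util.ArrayList;
import java.util.List;

// Textbook LCS over lines; a sketch only, not the implementation of Diff or Codelink.
final class LineLcs {
    private LineLcs() {}

    /** Returns the lines that form one longest common subsequence of a and b. */
    static List<String> commonLines(String[] a, String[] b) {
        // table[i][j] = LCS length of the suffixes a[i..] and b[j..]
        int[][] table = new int[a.length + 1][b.length + 1];
        for (int i = a.length - 1; i >= 0; i--) {
            for (int j = b.length - 1; j >= 0; j--) {
                table[i][j] = a[i].equals(b[j])
                        ? table[i + 1][j + 1] + 1
                        : Math.max(table[i + 1][j], table[i][j + 1]);
            }
        }
        // Walk the table forward to recover the common lines.
        List<String> common = new ArrayList<>();
        int i = 0;
        int j = 0;
        while (i < a.length && j < b.length) {
            if (a[i].equals(b[j])) {
                common.add(a[i]);
                i++;
                j++;
            } else if (table[i + 1][j] >= table[i][j + 1]) {
                i++;
            } else {
                j++;
            }
        }
        return common;
    }
}
```

Lines that are not in the returned list are the differences. A line that was only slightly edited counts as entirely different, which is one way the results can feel unintuitive.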

       The most popular measure of code similarity in related work appears to be the

Levenshtein Distance (LD) (in Clonescape, CPC, CloneTracker, and CSeR), which is a

metric of the amount of editing (the edit distance) needed to make two strings the same.

CloneTracker performs its line mapping by calculating the LD for two lines of code

at a time. Unlike the constructive, extensional nature of CReN and LexId’s approach, the

code can be tokenized whenever LD needs to be calculated. Thus, LD is not calculated

ahead of time and there is no need to track the result of LD. Also, since the Levenshtein

Distance returns only a numerical value representing clone similarity, it does not reveal

additional information about similarity, such as which parts of each clone are different.

CReN and LexId’s notion of similarity, on the other hand, is purely syntax-based and

requires parsing to reveal the exact commonalities and differences among clones.
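
For reference, the Levenshtein Distance can be computed with a short dynamic program. The Java sketch below is a standard formulation over characters and is not taken from Clonescape, CPC, CloneTracker, or CSeR. The class and method names are hypothetical.

```java
// Standard Levenshtein Distance over characters; a sketch, not any tool's implementation.
final class Levenshtein {
    private Levenshtein() {}

    /** Returns the minimum number of single-character inserts, deletes, and substitutions that turn a into b. */
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning the empty prefix of a into b[0..j) takes j inserts
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning a[0..i) into the empty prefix of b takes i deletes
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }
}
```

The method returns a single number and nothing else, which is the limitation noted above: a distance of 1 between "l_stride" and "r_stride" says that the strings are close, but not where they differ.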


2.2.2.2.       Clone Model
       The following subsections describe the clone model for each tool in terms of

how both clone locations and clone relationships are represented.


Clone Location

       CnP and other clone-related tools that use a tree-based representation of the

source code specifically use the abstract syntax tree (AST) API provided in the Eclipse

JDT framework [157]. In Eclipse, an AST node (ASTNode) contains a part of the



program’s source code. The source code characters and their absolute position in the

source code file are captured in the AST. Each ASTNode has a starting position that

denotes the numeric position of the first character in the node’s content and an ending

position that denotes the numeric position of the last character in the node’s content. An

ASTNode node’s character starting position can be represented as StartPos, whose value

can be retrieved with the Java code: node.getStartPosition() and its character ending

position can be represented as EndPos, whose value can be calculated with the Java code:

node.getStartPosition() + node.getLength() - 1, as shown in Figure 3.




Figure 3: The position of the source code characters as represented in an ASTNode.
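
The StartPos and EndPos calculation can be tried without Eclipse using a small stand-in class. In the Java sketch below, NodeSpan is a hypothetical substitute for the Eclipse JDT ASTNode class. Only getStartPosition() and getLength() mirror real ASTNode methods, and everything else is an assumption made for illustration.

```java
// Hypothetical stand-in for org.eclipse.jdt.core.dom.ASTNode so the sketch compiles without Eclipse.
final class NodeSpan {
    private final int startPosition;
    private final int length;

    NodeSpan(int startPosition, int length) {
        this.startPosition = startPosition;
        this.length = length;
    }

    int getStartPosition() { return startPosition; }

    int getLength() { return length; }

    /** StartPos: offset of the first character of the node's content. */
    int startPos() { return getStartPosition(); }

    /** EndPos: offset of the last character of the node's content (inclusive). */
    int endPos() { return getStartPosition() + getLength() - 1; }

    /** The characters of the source file covered by this node. */
    String text(String source) { return source.substring(startPos(), endPos() + 1); }
}
```

For the source text "int x = 42;", a node for the literal 42 has StartPos 8 and length 2, so its EndPos is 9.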

       CnP maps the source code that is copied and pasted to the largest

contiguous set of whole AST nodes within the range. The beginning of the code fragment

(that is selected and copied-then-pasted) can be denoted as BegIntRange and the end of

the code fragment can be denoted as EndIntRange; together these define the range. CnP

supports the case in which the node lies entirely within the range (in other words, CnP

captures only the nodes that are fully contained within the copied-and-pasted code

fragment), which is case 1 in Figure 4. In this case, a node is captured when it satisfies:

if(BegIntRange <= StartPos && EndIntRange >= EndPos). Copied and pasted source

code that is only partially contained within an AST node is not captured in this




representation (CnP does not capture the node’s contents for cases 2 and 3 in Figure 4,

which is when the node is partly within the range or not within the range at all).




 Figure 4: The three cases when capturing a range of source code using the Eclipse
                                    AST API.
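
The three cases can be written down as a small classification over character positions. The Java sketch below is an illustration of the condition given in the text, not CnP's source code. The class, enum, and method names are hypothetical, and the numbering of cases 2 and 3 is assumed to follow the order in which the text mentions them.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the range test described above; not CnP's implementation.
final class RangeCapture {
    enum Case { FULLY_WITHIN, PARTLY_WITHIN, OUTSIDE }

    private RangeCapture() {}

    /** Classifies a node [startPos, endPos] against the copied range [begIntRange, endIntRange]. */
    static Case classify(int begIntRange, int endIntRange, int startPos, int endPos) {
        if (begIntRange <= startPos && endIntRange >= endPos) {
            return Case.FULLY_WITHIN;   // case 1: the only case that is captured
        }
        if (endPos < begIntRange || startPos > endIntRange) {
            return Case.OUTSIDE;        // assumed to be case 3: no overlap at all
        }
        return Case.PARTLY_WITHIN;      // assumed to be case 2: overlaps but is not contained
    }

    /** Returns the nodes (each given as {startPos, endPos}) that would be captured. */
    static List<int[]> captured(int begIntRange, int endIntRange, int[][] nodes) {
        List<int[]> result = new ArrayList<>();
        for (int[] node : nodes) {
            if (classify(begIntRange, endIntRange, node[0], node[1]) == Case.FULLY_WITHIN) {
                result.add(node);
            }
        }
        return result;
    }
}
```

Note that in this sketch a node that encloses the whole range is also classified as partly within, since it is not fully contained in the range.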

       Therefore, in general, CnP uses the character offset and length from the source

code to determine a clone’s location in a particular file. The source code that is

copied and pasted is mapped to the largest contiguous set of whole abstract syntax

tree (AST) nodes within the range. Although [231] does not state it explicitly, Codelink

probably also uses offsets, since it uses a token-oriented rather than a line-based

algorithm for similarity comparisons between clones. CPC does as well. LAPIS represents a text region as a

substring with a start offset and an end offset relative to the start of the file.

        Some clone detection and clone management tools, for example Clonescape, represent a

clone’s location by the name of the file that contains it together with its line range. The

problem with a line-based representation, however, is that it could give an imprecise

clone boundary because a single line may contain multiple statements. On the other hand,

the character offset representation would be able to pinpoint the exact range of all clones.
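
The difference between the two representations shows up as soon as a line holds more than one statement. The Java sketch below is a hypothetical illustration (the class and method names are assumptions) that extracts a clone's text once by line range and once by character offset and length.

```java
// Hypothetical comparison of line-based and offset-based clone locations.
final class CloneLocationDemo {
    private CloneLocationDemo() {}

    /** Text of lines firstLine..lastLine (1-based, inclusive) of the source. */
    static String byLineRange(String source, int firstLine, int lastLine) {
        String[] lines = source.split("\n", -1);
        StringBuilder sb = new StringBuilder();
        for (int i = firstLine; i <= lastLine; i++) {
            if (i > firstLine) {
                sb.append('\n');
            }
            sb.append(lines[i - 1]);
        }
        return sb.toString();
    }

    /** Text of the region that starts at the given character offset and spans length characters. */
    static String byOffset(String source, int offset, int length) {
        return source.substring(offset, offset + length);
    }
}
```

For the source "int a = 1; int b = 2;" followed by a second line, a clone consisting only of the second statement is located exactly by offset 11 and length 10, whereas the line range 1-1 also includes the first statement.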

        CloneTracker was the first to create a way to represent the location of clones

without using file name with character or line ranges. Instead, CloneTracker uses a

“clone region descriptor (CRD)”, which describes the clone’s relative location in the file

using syntactic, structural, and lexical information (for example, the clone’s alignment

with code blocks). It is possible to use a CRD calculated for a code clone in an early

release to locate the same clone in future releases. However, CRDs may fail to locate

clones when the assumptions that the approach relies on are broken. CnP is guaranteed to

always provide accurate clone locations.


Clone Relationship

        Much clone-related research, such as [54, 55, 111, 137, 251], including this dissertation,

refers to a set of similar clones as a “clone group”. Other research refers to a clone

group as a “region set” [189] or a “clone class” [13, 40, 123]. In all of these cases, the

related clones are viewed symmetrically, at the same level of group membership.

Clonescape, on the other hand, distinguishes the original as the parent and the duplicated



copy as the child. As a result, clones of the same parent can be called siblings. All related

clones form what they call a “clone family”. While it may be useful to know the clone’s

origin for comparison against the pasted code and for clone visualization, the origin

information could and should be separated from the basic clone model.


2.2.2.3.       Clone Visualization
       Clone visualization can be an effective means to make programmers aware of the

clones in a system.


Markers – Colored Bars and Highlights

       The latest version of CnP’s clone visualization feature was improved to

distinguish clone groups (related, similar clones that result from a series of copy and

pastes) by coloring all clones within the same group with the same color of bars. It

distinguishes between the origin and its pastes by slightly darkening the colored bar that

is next to each pasted region. For example, in Figure 5, the origin was the method

“more_variables” (shown in the back), which has a regular shade of yellow for its

visualization bar (since it is the original code fragment that was copied), while its pastes

(the newly modified and related methods “more_arrays” and “more_functions”) are

shown with slightly more grayed versions of the color yellow. These three clone

instances belong to the same clone group, hence they are displayed with variations of the

same color (yellow). A different code fragment that is copied and pasted (belonging to a

different clone group) would be represented with shades of a different color, such as the

color red.




Figure 5: CnP clone visualization distinguishes between clone groups and between the
                         clone origin and its subsequent pastes.

       Visualizing clones is a challenge that all clone-related tools must address.

Similar to CnP, CPC uses colored rulers to show the lines of each clone visually and

CloneTracker marks the lines of clones visually in the sidebar of Eclipse. Codelink

addresses the visualization issue by allowing similar parts of the clones to be hidden from

view (and indicating the commonalities between linked clones in blue and differences in

yellow). CSeR determines or infers each user-made change to clones as an insert, delete,

update, or move, and then highlights each kind of change with a different color.

Unchanged code within a clone is not highlighted. Mouse hover events reveal details

about the change, including what the updated code was before in the original and what



has been deleted from the original. A screenshot of CSeR’s highlights and hover

information is shown in Figure 6.




           Figure 6: CSeR shows the changes that would be made to the
   ExclusionInclusionDialog class (highlighted code for inserts, deletes, updates,
          moves; and hover information for deletes, updates) to make the
   SetFilterWizardPage class in SetFilterWizardPage’s file in the Eclipse editor.


The four kinds of user-made differences between related clones, according to CSeR, are:


   1. Insert – the addition of an AST (abstract syntax tree) node, highlighted in green.

   2. Delete – the removal of an existing AST node, highlighted in red.

   3. Update – the modification of an existing AST node, highlighted in yellow.

   4. Move – the difference between the matching statements of the clones is that they

       have different neighbors, highlighted in blue.




Differencing and Comparison Tools

         Some research looks at comparison [153] and its application, including comparing

source code examples [42]. Differencing tools must somehow show the differences

between files visually to the user. Though visualization is still a challenge to these tools,

most are very simple in how they display files’ differences, and the main distinguishing

feature of these related tools is the choice of differencing algorithm used.

         There are many text-based differencing tools available. Most make use of the diff

algorithm [99, 241] and are based on solving the LCS (Longest Common Subsequence)

problem. Since this approach was developed for text files, it has obvious disadvantages

when used for Java source code [106, 107]. Some differencing tools that are based on the

diff or LCS algorithm include UNIX Diff, Eclipse’s Compare Editor (which can be

invoked by right clicking selected file(s) in Eclipse’s Package Explorer view and then

choosing the “Compare With” menu option), Ldiff [32, 33], and Version Editor (ve) [7]*.

Ve provides tight integration of the revision history and the editor so it has the limitations

and disadvantages of the text-based tools and the version control system.

         There are a variety of graph-based differencing algorithms [5, 230, 233] and tools

such as Cdiff [25, 259], Jdiff [5], Semantic Diff [105], and Exas [193]*. The graph-based

approach has an advantage over the text-based tools, which focus only on syntax, since

it takes into account the program’s semantics as well. However, graph-based tools can be slower

and it is not always clear whether the extra analysis pays off.




*
  http://directory.fsf.org/project/diffutils/,
http://help.eclipse.org/help32/topic/org.eclipse.platform.doc.user/reference/ref-25.htm,
http://sourceforge.net/projects/ldiff/, http://ix.cs.uoregon.edu/~datkins/ve.html
*
  http://www.ece.iastate.edu/~nampham/projects/clone/Exas/


Many differencing tools are abstract syntax tree (AST)-based such as LaDiff [37],

Breakaway [41], Jigsaw [43, 44], ChangeDistiller [66], and Coogle [215, 216], including

CSeR [106, 107]*. These tools in general have the advantage of being able to obtain

structured information from the tree-based representation of the source code. CSeR

differs from these tools in terms of its purpose (clone differencing), its interactive and

incremental updating of correspondence rather than re-computing from scratch (in

contrast to what is done in Breakaway [41]), and the heuristics that it uses to infer change

categories (which differs from, for example, those of ChangeDistiller [66]).

         Another way of looking at program changes is to use mapping or origin analysis

as part of the differencing algorithm [138, 205] or the tool implementation such as Beagle

[75]. More recently, additional logic has been incorporated as well to get a better

understanding of the changes. The UMLDiff approach tracks the evolution of higher-

level program elements (at the level of UML models) over versions of systems [256, 257,

258] and other research utilizes a novel rule-based and combination algorithm (LSdiff)

[133, 136] to infer regular change patterns and overcome some of the disadvantages of

the other differencing approaches.


Capturing Program Structure and Edits

         There is a body of research that proposes structure-based editors and semantics-

preserving editing environments [16, 27, 80, 139, 140, 141, 188, 199, 200, 203, 204, 206,

219, 229, 250, 261] rather than traditional text-based editors. These structured editors and

IDEs can benefit programmers by letting them know exactly which edit operations are

being performed; however, these specialized, and often stand-alone, editors are not

*
 http://lsmr.cs.ucalgary.ca/projects/breakaway/, http://lsmr.cs.ucalgary.ca/projects/jigsaw/,
http://www.clarkson.edu/~dhou/projects/CnP/


commonly used in practice. Instead, other research focuses on determining and

presenting structural correspondence [41, 43, 44, 106, 107, 193] to programmers in the

IDEs that they already use, like Eclipse, by utilizing the tree-based representation of the

source code. Rather than bombarding the programmer with too much extra information,

CSeR defines a few general categories of possible user edits and incrementally infers which

category a sequence of edits belongs to. How to parse code efficiently is itself a research

problem [249]. For better performance, CSeR compares only the smallest

corresponding sub-trees that contain the positions where the programmer last edited.


Capturing Program Changes

                                              … the problem [with software projects] isn’t
                                              change, per se, because change is going to
                                              happen; the problem, rather, is the inability to
                                              cope with change when it comes. - Kent Beck


       Not only is it important to capture the current state of the clones in a system (by

continuously updating clone locations and contents as they are changed), but capturing

change information over time and presenting this to the programmer can be extremely

beneficial. CSeR captures and displays certain clone changes in the editor, which can

help programmers see the level of similarity between the clones better. Seeing update and

deletion information that otherwise is not shown in the file can also be very useful in

learning about the code [186, 187]. Related research in the area of software evolution

looks further into program changes and multi-version programs [31, 37, 66, 77, 129, 131,

133, 134, 136, 154, 205, 262], changeability [177, 178], and evolutionary history [256].

Specifically studying the evolution of clones over multiple versions of the program helps

determine whether these clones require frequent consistent changes or whether they




remain dormant and impose no significant maintenance challenges. It can also pinpoint at

what stage clones are refactored (when they are changed in form) and it can conclude

whether the clones need to be refactored at all [1, 135]. Seeing the code as it has evolved

over time in a version control system instead of just seeing the current version in the

editor can be extremely beneficial in learning how and why the program changes [14].


Using Change Information from Version Control Systems

        There is a large body of research that focuses on mining software repositories and

then analyzing the historical information from version control systems, such as SVN

(Subversion) or CVS (Concurrent Versions System), for a variety of reasons [6, 7, 15, 31,

33, 70, 82, 133, 134, 154, 262]. Clone-related tools that use version control system

information include Cleman and ClemanX [194, 195, 196], Clever [197], Clone

Detection Toolbox [190], Clone Smell Extractor [13], and Vaci [119]*. However, this

approach is limited since the information obtained is only from snapshots of when the

program’s source code was checked-in or checked-out and it often requires additional

analysis and inferences to be useful. Furthermore, the program histories may contain a lot

of irrelevant information that is not clone-related. Given program version changes, people

would need to sort through them to detect likely copied-and-pasted code and eliminate extra

information. Also, although people might be able to obtain information about specific

changes made to a particular file, they would not automatically have correspondence

information (between files) from the histories alone.




*
 http://www.ece.iastate.edu/~nampham/projects/clone/Cleman/,
http://www.ece.iastate.edu/~nampham/projects/clever/, http://www.ccfinder.net/vaci.html


Warnings – Error Prevention or Detection

        Not only is support for clone management important, but the prevention or

detection of clone-related errors (also called bugs [13, 111, 171, 172, 174],

inconsistencies [111, 171, 172], or anomalies [251]) should also be provided. CnP

contains features that may either prevent errors (like CReN does) or detect potential

errors (warnings) in the tracked clones. CnP issues a warning if any identifier in the

pasted code binds to a declaration in the context where it is pasted (external identifier

scoping) [95, 96, 97]. For example, when a method is copied and pasted within the same

class, CnP can provide a warning for each identifier within the method that is defined at

the class level (outside of the method, but within the class). These warnings will alert the

programmer that these particular identifier instances within the clone (method) may need

to be renamed. This is useful, since it is common for programmers to copy and paste a

code fragment that contains references to external identifiers that are intended only for the

original fragment. The programmer can then use CReN to rename the identifier instances

in the pasted location, if desired.

        There are a number of software quality tools [26, 30, 50], including Axivion,

CloneDetective [116], ConQAT, and PMD,* and clone bug detection/prevention tools

such as CP-Miner [171, 172], CPC [251], DECKARD-based tool [111], and FixWizard.*

The famous Alice software prevents syntax errors by providing a drag-and-drop

programming system to aid novice programmers [127], who tend to make errors by

misunderstanding program constructs [260] and breaking implied system rules [58]. Bug

*
  http://www.axivion.com/index-en.html, http://conqat.cs.tum.edu/index.php/CloneDetective,
http://conqat.cs.tum.edu/index.php/ConQAT, http://pmd.sourceforge.net/
*
  http://opera.cs.uiuc.edu/Projects/ARTS/CP-Miner.htm, http://cpc.anetwork.de/,
http://wwwcsif.cs.ucdavis.edu/~jiangl/research.html,
http://www.ece.iastate.edu/~nampham/projects/fixwizard/


detection, on the other hand, (rather than prevention) is often done by finding

inconsistencies between the clones [118, 128, 255] when changes are made [13, 161],

especially inconsistencies in identifiers [111, 171, 172], and spelling errors [45, 98, 192].

       CP-Miner uses identifier mapping such that an identifier is considered consistent

when it always maps to the same identifier (which could be a different name) in the other

fragment and it is inconsistent when it maps to multiple identifiers [171, 172]. For

Example 1 in Table 4 of Section 2.3, the identifier “prom_phys_total” in the copied code

fragment maps to both “prom_prom_taken” and “prom_phys_total” in the pasted code.

Because “prom_phys_total” does not map only to “prom_prom_taken” in all instances,

for example, CP-Miner would detect it as an inconsistency. The DECKARD-based tool

claims that an inconsistency exists if the two code fragments contain different numbers of

unique identifiers [111]. For Example 3 in Table 4 of Section 2.3, the DECKARD-based

tool would count two instances of the identifier “l_stride” in the copied code fragment,

but only one instance in the pasted code fragment. Since only one of the two instances of

“l_stride” was renamed to “r_stride” in the pasted code, the DECKARD-based tool was

able to find this inconsistency. However, both CP-Miner and the DECKARD-based tool

produce false positives, which need to be inspected manually in order to verify the

existence of an actual bug. Similarly, the clone smell detection tool [13] also requires

human intervention to determine if the detected “unusual” changes are, in fact, bugs.

Instead of being retroactive in terms of bug prevention and detection, CnP provides a

form of automatic bug prevention (with its CReN and LexId renaming tools) and can give

warnings on code as it is being edited.
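
The two inconsistency checks described above can be sketched in a few lines of Java. The sketch follows only the descriptions given in this section and is not the implementation of CP-Miner or of the DECKARD-based tool. The class and method names are hypothetical, and the identifiers of the copied and pasted fragments are assumed to be given as two lists that correspond position by position.

```java
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch of the two checks as described in the text; not the tools' actual code.
final class IdentifierConsistency {
    private IdentifierConsistency() {}

    /** Mapping check: identifiers of the copied fragment that map to more than one identifier in the paste. */
    static Set<String> inconsistentlyMapped(List<String> copied, List<String> pasted) {
        if (copied.size() != pasted.size()) {
            throw new IllegalArgumentException("identifier lists must be aligned");
        }
        Map<String, Set<String>> targets = new LinkedHashMap<>();
        for (int i = 0; i < copied.size(); i++) {
            targets.computeIfAbsent(copied.get(i), k -> new HashSet<>()).add(pasted.get(i));
        }
        Set<String> result = new TreeSet<>();
        for (Map.Entry<String, Set<String>> entry : targets.entrySet()) {
            if (entry.getValue().size() > 1) {
                result.add(entry.getKey());
            }
        }
        return result;
    }

    /** Counting check: true if the fragments contain different numbers of unique identifiers. */
    static boolean uniqueCountsDiffer(List<String> copied, List<String> pasted) {
        return new HashSet<>(copied).size() != new HashSet<>(pasted).size();
    }
}
```

Both checks only flag candidates. A consistent partial rename that the programmer intended would be reported in the same way, which is why the results of such tools still need manual inspection.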




Alerts – Clone Modification Notification

       Clone modification notification is a new feature found in clone-related tools.

Clonescape alerts programmers when they edit a clone by showing a red status line

message. CloneTracker uses notifications to alert programmers when tracked clones are

being modified (for example, so that they can choose to turn on the simultaneous editing

feature). CPC uses notifications to warn the programmer about possible update

anomalies. Clones can be marked as “ignored”, meaning that no more notifications will

be generated for this particular clone. CnP lets the programmer know via visualization

that a clone is being or was edited (boxes with CReN/LexId, and highlights with CSeR).


Views and Graphs

       Research in the area of clone detection visualizes clones with graphs [83, 114,

232, 228] and views [14, 208, 228]. CnP provides two views: one view to list the clone

detection tool results that are reported and one view to list the clones being tracked by

CnP [96]. Clones that are being tracked by CnP can be either clones that have been

automatically tracked since they were copied and pasted in the IDE or clones that started

being tracked after they were manually imported from the clone detection tool. The

LAPIS editor suggests three possible views for future work, including a bird’s-eye view,

an abbreviated context view, and an “unusual matches” view. CloneTracker uses a view

to list clones and clone groups. Clonescape proposes a multi-view approach, where only

the one or two views of interest are automatically shown to the programmer at one time, a

technique known as fisheye view, but these are unimplemented. CPC contains a few main

views, including a clone list view, tree clone view, and a clone replay view. The use of

graphs and views, like markers, is an issue that all clone-related tools face. The challenge



is to find an alternative to the separate views that programmers need to invoke and the

relatively complex graphs that they need to learn and understand.


2.2.2.4.       Clone Persistence
       While all software tools make use of data structures (such as vectors and maps)

that store information in the system’s memory while the tool is currently being run, this

information must be recorded in some way so that it can be accessed and updated when

the programmer works on the source code again at a later time. Storing the clone

information between programming sessions is what is called clone persistence.

       CnP persists the information about the tracked clones between sessions in a flat

database (simple text file). Specifically, it stores each clone’s location (the file name that

contains the clone with the clone’s starting character position in the file and its length in

number of characters) within each clone group. In addition, as part of the information

needed for consistent identifier renaming within code fragments, each identifier’s

location (the identifier’s starting character position in the file and its length in number of

characters) within each identifier group is also stored. The information gets saved

automatically whenever Eclipse quits, and loaded automatically when Eclipse starts up.

This single file covers the whole workspace, not just individual projects.
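
A flat text file of this kind could, for instance, hold one record per clone. The Java sketch below is a guess at such a layout for illustration only. The tab-separated record format, the class names, and the method names are all assumptions and do not describe CnP's actual file. File reading and writing are omitted, so the sketch works on the file's text.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical record layout: one line per clone, "groupId<TAB>fileName<TAB>offset<TAB>length".
final class CloneStoreSketch {
    static final class TrackedClone {
        final int groupId;
        final String fileName;
        final int offset;
        final int length;

        TrackedClone(int groupId, String fileName, int offset, int length) {
            this.groupId = groupId;
            this.fileName = fileName;
            this.offset = offset;
            this.length = length;
        }

        String toLine() {
            return groupId + "\t" + fileName + "\t" + offset + "\t" + length;
        }

        static TrackedClone fromLine(String line) {
            String[] fields = line.split("\t");
            if (fields.length != 4) {
                throw new IllegalArgumentException("Malformed record: " + line);
            }
            return new TrackedClone(Integer.parseInt(fields[0]), fields[1],
                    Integer.parseInt(fields[2]), Integer.parseInt(fields[3]));
        }
    }

    private CloneStoreSketch() {}

    /** Serializes the tracked clones, one record per line. */
    static String save(List<TrackedClone> clones) {
        StringBuilder sb = new StringBuilder();
        for (TrackedClone clone : clones) {
            sb.append(clone.toLine()).append('\n');
        }
        return sb.toString();
    }

    /** Parses the text produced by save back into tracked clones. */
    static List<TrackedClone> load(String text) {
        List<TrackedClone> clones = new ArrayList<>();
        for (String line : text.split("\n")) {
            if (!line.isEmpty()) {
                clones.add(TrackedClone.fromLine(line));
            }
        }
        return clones;
    }
}
```

Identifier groups could be stored with the same kind of record (start position and length per identifier), since the text describes their locations in the same terms.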

       CPC also persists clone information. Codelink saves the links between clones as

file meta-data, making the links persistent between sessions. However, the persistent

links are not robust to edits. The latest version of CloneTracker persists the clone

information that it tracks for the current project. A unique feature of CPC is that it also

gathers information about the copying and pasting activities in general, and it persists the

full modification history of each clone in relation to its clone group.


2.2.2.5.       Clone Documentation and Clone Attributes
       Some people believe that clone tracking and visualization act as a form of source

code documentation by themselves. Though many tools claim to “document” the clones

that they are tracking or managing in this way, clone documentation is actually defined as

support for additional information to be written about the clone (which forms the clone’s

external attributes). Clone documentation, such as why the clone was created (for

example, for hardware variation or a bug workaround [38]), generally cannot be retrieved

by the system and must be added by the programmer. Clonescape and CPC define “clone

classification”; however, their approaches include documenting the structural information

about clones, which does not fully fit into the previous definition. Similarly, other

research in the topic of clone classification [121, 123, 124] groups clones by the region

that they occur in (the level of abstraction involved and the location of the clones in a

file), which also does not require user intervention. Instead, the “reasoning” type of clone

classification described by the authors of Clonescape [38] is consistent with the definition

provided here. In addition, the clone attribute “severity” that is set at low, medium, or

high by the programmer depending on whether the clone should be removed from the

system, is an example of a resulting clone attribute according to the above definition.


2.2.3.         Clone Lifecycle Support
       Proactive clone management must be actively done at all times during software

development and maintenance, throughout a clone’s lifecycle. When designing CnP and

reviewing related work, various definitions of clone properties (in the previous section)

and a variety of tool support (and lack of tool support) for each stage of the clone




lifecycle (in this section) were identified. In this dissertation, the clone lifecycle is explicitly

defined (shown in Figure 7) from clone creation, through clone capture and clone editing,

to clone extinction. The following subsections present a variety of tool support for the

phases of a clone’s life. Table 2 (on a following page) summarizes the design and

implementation details for each of the related clone tracking tools: Clonescape [38], CPC

[251], Codelink [231], LAPIS [189], and CloneTracker [54, 55], including CnP [95, 96,

97, 100, 101, 102] (and its parts: CReN consistent identifier renaming [103], LexId

consistent substring renaming [104], and CSeR clone comparison [106, 107]), and it

specifically highlights the problems that CnP addresses but the related tools do not.

The emphasis of these six tools, in particular, is on supporting the editing phase of the

lifecycle to avoid inconsistent modifications to clones.
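
The four lifecycle phases can be sketched as a small state machine. This is a hypothetical illustration rather than CnP's implementation: the enum name ClonePhase, the method canMoveTo, and the permitted transitions (editing may repeat, and a captured clone may be removed without ever being edited) are assumptions made for this example.

```java
/** Hypothetical model of the clone lifecycle phases; not taken from CnP. */
enum ClonePhase {
    CREATION, CAPTURE, EDITING, EXTINCTION;

    /**
     * Assumed transition rules: phases advance in order, editing may repeat,
     * and a captured clone may become extinct without being edited.
     */
    boolean canMoveTo(ClonePhase next) {
        switch (this) {
            case CREATION:
                return next == CAPTURE;
            case CAPTURE:
                return next == EDITING || next == EXTINCTION;
            case EDITING:
                return next == EDITING || next == EXTINCTION;
            default: // EXTINCTION is terminal
                return false;
        }
    }
}
```

Under these assumptions, a tool that supports the whole lifecycle has to handle every transition that canMoveTo permits, not only the final removal step.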




Figure 7: The Clone Lifecycle – Clone Creation, Clone Capture, Clone Editing, and
                                 Clone Extinction.



Table 2: Summary of Clone Tracking Tools with their Clone Lifecycle Support
2.2.3.1.       Clone Creation
       When the concept of clone creation is considered, two questions come to mind:

how were the clones created, and why were they created? The answers to these two

questions determine whether a particular clone is tracked and supported by the

clone tracking tool.


How were the clones created?

       Code clones can be created in a number of ways [115], but many, if not most,

clones are undoubtedly created via copying and pasting, since duplication is very easy to

do with either a simple menu selection (Edit - Copy, Edit - Paste) or a keyboard shortcut

(Ctrl+C, Ctrl+V). As a result, the software tool CnP, which supports copy-and-paste-

induced code clones upon creation, essentially captures one of the most common kinds of

clones made and it guarantees 100% accuracy in clone “detection”, since the copying and

pasting is known exactly as it happens.
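
A small sketch shows why capturing clones at paste time gives exact clone locations. Everything in it is hypothetical: the classes Region, ClonePair, and PasteRecorder and the hooks onCopy and onPaste are names invented for this example, and they are neither the Eclipse API nor CnP's code. The recorder remembers the copied region and pairs it with the pasted region, so no similarity analysis is involved.

```java
import java.util.ArrayList;
import java.util.List;

/** A span of text in a file, identified by file name, start offset, and length. */
final class Region {
    final String file;
    final int offset;
    final int length;

    Region(String file, int offset, int length) {
        this.file = file;
        this.offset = offset;
        this.length = length;
    }
}

/** An origin/copy pair recorded at the moment of pasting. */
final class ClonePair {
    final Region origin;
    final Region copy;

    ClonePair(Region origin, Region copy) {
        this.origin = origin;
        this.copy = copy;
    }
}

/** Hypothetical recorder: editor hooks would call onCopy and onPaste. */
final class PasteRecorder {
    private Region lastCopied;                         // most recent copy source, if any
    private final List<ClonePair> pairs = new ArrayList<>();

    void onCopy(String file, int offset, int length) {
        lastCopied = new Region(file, offset, length);
    }

    /** Pairs the pasted text with its source; the pasted length equals the copied length. */
    void onPaste(String file, int offset) {
        if (lastCopied == null) {
            return;                                    // nothing was copied inside the editor
        }
        pairs.add(new ClonePair(lastCopied, new Region(file, offset, lastCopied.length)));
    }

    List<ClonePair> pairs() {
        return new ArrayList<>(pairs);                 // defensive copy
    }
}
```

The sketch only records clones that originate from a copy made inside the editor; clones created in any other way would still require a separate detection step.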


Why were the clones created?

       Another key distinction between clones is the reason for their creation. Existing

research distinguishes between intentional clones (code that the programmer intended to

reuse) [212, 213] and accidental clones (code that is similar due to a protocol

requirement) [2]. This work, like most other clone-related research, focuses on intentional

clones while acknowledging that accidental clones do exist. To address accidental clones, tools often allow

some form of user control such as allowing the programmer to remove certain clones

from those that are automatically being tracked.
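
The user control mentioned above could take a form like the following sketch, in which the programmer excludes a clone from tracking. The class TrackedClones, its methods, and the use of string identifiers for clones are illustrative assumptions, not the interface of any tool surveyed here.

```java
import java.util.LinkedHashSet;
import java.util.Set;

/** Hypothetical registry of the clones that a tool is currently tracking. */
final class TrackedClones {
    private final Set<String> tracked = new LinkedHashSet<>();

    /** Called by the tool when it starts tracking a clone automatically. */
    void track(String cloneId) {
        tracked.add(cloneId);
    }

    /**
     * Called at the programmer's request, for example for an accidental clone.
     * Returns true if the clone was being tracked.
     */
    boolean untrack(String cloneId) {
        return tracked.remove(cloneId);
    }

    boolean isTracked(String cloneId) {
        return tracked.contains(cloneId);
    }
}
```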




Patricia Deshane - PhD Dissertation

Patricia Deshane - PhD Dissertation

  • 1. CLARKSON UNIVERSITY MANAGING THE COPY-AND-PASTE PROGRAMMING PRACTICE A Dissertation By Patricia Deshane Coulter School of Engineering Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy, Engineering Science April 30, 2010 Accepted by the Graduate School ______________, _____________________ Date, Dean
  • 3. The undersigned have examined the dissertation entitled “Managing the Copy-and- Paste Programming Practice” presented by Patricia Deshane, a candidate for the degree of Doctor of Philosophy (Engineering Science), and hereby certify that it is worthy of acceptance. April 30, 2010 ______________________________ Date Dr. Daqing Hou, Advisor Electrical and Computer Engineering ______________________________ Dr. Susan Conry, Examining Committee Electrical and Computer Engineering ______________________________ Dr. Christopher Lynch, Examining Committee Mathematics & Computer Science ______________________________ Dr. Robert Meyer, Examining Committee Electrical and Computer Engineering ______________________________ Dr. Christino Tamon, Examining Committee Mathematics & Computer Science
  • 4. Abstract CLARKSON UNIVERSITY Managing the Copy-and-Paste Programming Practice By: Patricia Deshane Advisor: Daqing Hou Programmers often copy and paste source code in order to reuse an existing solution in the completion of a current task. Copying and pasting results in code clones (similar code fragments) throughout a code base, which need to be properly maintained over time. Forgetting the cloning information and correspondence relationships within a piece of code can be problematic for the software maintainer. Furthermore, inconsistent editing to clones can introduce undetected bugs, decreasing the quality of the software. This dissertation presents a suite of software tools, Eclipse plug-ins named CnP, that aid the programmer during copy, paste, and modify programming. The purpose is to provide tool support throughout a clone’s entire lifecycle, from its creation to its removal from the system. More than just traditional clone detection and removal, these clone tracking tools have a particular focus on clone editing. One CnP plug-in helps with consistent identifier renaming within clones (CReN), another one renames substrings consistently within clones (LexId), and a third plug-in in the CnP suite visualizes user edits within a clone for better clone comparison (CSeR). A user study was conducted on CnP’s basic visualization, CReN, and LexId features with analysis in terms of task completion time, solution correctness, and method of completion. iv
  • 5. To my wonderful husband, Todd Deshane.
  • 6. Acknowledgments Personal Reflections With the completion of this dissertation paper and defense, I feel like I have had the full PhD experience, and I am finally personally ready to graduate! I have completed the course work, the qualifying exam, and the research component of the degree program. Over the years, I have given various seminar presentations in the computer science and engineering departments at the university, I have presented at one conference per year during the research phase of my PhD (OOPSLA 2007 in Montreal, FSE 2008 in Atlanta, and CASCON 2009 in Toronto), and I have written many paper submissions and drafts. A conference trip was often a reward for having a successful paper submission. All three conference trips that I presented at were milestones during my PhD career and each trip was truly unforgettable. I really enjoyed traveling to these cities and learning from other researchers in the software engineering discipline. As a presenter, I personally got a lot of valuable feedback and advice from the conference attendees that I may not have gotten otherwise. I believe that the conferences were a vital part of my growth as a researcher as they helped me to “get out of the lab” and experience the rest of the world. Special Thanks I would first like to thank my husband, Todd Deshane, for everything over the past ten years that we have been together. I would not be where I am today without him. When it seems like the only thing that is constant is change, it is comforting to know that Todd is always there to help me through the tough times and to celebrate with me in the joyous times. Todd, all of this hard work during our PhD years was worth it – soon we vi
  • 7. will both officially be “computer doctors”. “To every thing there is a season, and a time to every purpose under heaven” (Ecclesiastes 3:1). I look forward to beginning the next chapter of our lives together. I would like to thank Professor Hou, my advisor, who took me in as his first PhD student at a time when I had nowhere else to go. I thank him for having put up with me for the past four years. I have learned so much from him, most importantly, I believe that “instead of just being fed, I was taught how to fish”. For the first time, I was able to get the guidance that I needed, with the independence that I wanted. Thanks also to the Software Engineering Research Laboratory (SERL) at Clarkson. I am truly grateful for having the lab/office space and resources that allowed me to work more productively. I thank the other graduate students in SERL, especially: Cheng (Jerry) Wang, Chandan Rupakheti, Ferosh Jacob, Yuejiao (Gloria) Wang, Xiaojia (Joanna) Yao, Dave Pletcher, and Lin Li. I really appreciate the friendship and support from each of you as we all spent countless hours in the lab. I would also like to thank Jeanna Matthews for supporting me during the first year and a half of my PhD. Thanks also go to Eli M. Dow, my mentor at IBM, who helped me with early research and has continued to be supportive. Finally, thank you, my PhD committee (Susan Conry, Christopher Lynch, Robert Meyer, and Christino Tamon), who have given me early feedback during my PhD proposal and remarks on this dissertation. Personal thanks to all of my family – my parents and siblings – extended family, and best friends both from college and from back home. Special thanks to my best buddy, Wenjin Hu, for his never-ending kindness and friendship, while here at school. I would also like to mention my other “best friend”, my dog, Lady. I truly do miss her. vii
  • 8. I give special thanks to my grandfather, B. John Jablonski, who continually kept me motivated during my college years. It is hard to believe that it has already been four years since his death, but I know that it has been that long since I have gotten an email letter from him. His letters and emails always meant so much to me, with his words of encouragement when many others did not approve or understand why I was still in school especially without proper funding. He continues to be my inspiration. Finally, I thank God (literally). He is everything – my counselor, comforter, and keeper. In particular, I would like to thank God for always keeping me grounded. One of the toughest things that I experienced during my PhD is rejection (of paper submissions). While I feel that the rejections may have at times hindered my research progress, I feel that if all of my paper submissions were accepted, then I may have incorrectly assumed that this process was very simple and I may have become too proud of my own successes. During this whole ordeal, I have learned that there is always room for improvement even if it is difficult for me to see on my own. As Randy Pausch said, “Experience is what you get when you didn’t get what you wanted.” But, regardless of past rejections, I know that I have always done my best and that “with God all things are possible” (Matthew 19:26). Ultimately, my success is not measured by men’s approval, but by God’s*. Patricia A Deshane Clarkson University January 2010 * “Study to show yourself approved to God, a workman that needs not to be ashamed.” (2 Timothy 2:15) Clarkson University Motto Disclaimer: The views and opinions expressed in this dissertation are solely those of the author and do not necessarily represent the views and opinions of anyone else affiliated with Clarkson University. viii
  • 9. Contents
LIST OF TABLES
LIST OF ILLUSTRATIONS
LIST OF PUBLICATIONS
CHAPTER 1 INTRODUCTION
    1.1. COPY, PASTE, AND MODIFY PROGRAMMING
    1.2. THE TRADITIONAL PERSPECTIVE: CLONES ARE BAD
    1.3. A NEW PERSPECTIVE: CLONES CAN BE GOOD
    1.4. RESEARCH CONTRIBUTIONS
    1.5. OUTLINE OF THIS DISSERTATION
CHAPTER 2 LITERATURE REVIEW
    2.1. CLONE DETECTION AND REMOVAL
    2.2. CLONE LIFECYCLE MANAGEMENT
        2.2.1. BRIEF CNP TOOL DESCRIPTIONS
        2.2.2. DEFINITIONS OF CLONE PROPERTIES
            2.2.2.1. CLONE SIMILARITY
            2.2.2.2. CLONE MODEL
            2.2.2.3. CLONE VISUALIZATION
            2.2.2.4. CLONE PERSISTENCE
            2.2.2.5. CLONE DOCUMENTATION AND CLONE ATTRIBUTES
        2.2.3. CLONE LIFECYCLE SUPPORT
            2.2.3.1. CLONE CREATION
            2.2.3.2. CLONE CAPTURE
            2.2.3.3. CLONE EDITING
            2.2.3.4. CLONE EXTINCTION
    2.3. PREVALENCE OF CLONES, RENAMING, AND RELATED ERRORS IN PRODUCTION CODE
CHAPTER 3 METHODOLOGY
    3.1. USER STUDY ON CNP’S VISUALIZATION, CREN, AND LEXID
        3.1.1. USER STUDY HYPOTHESES
        3.1.2. SUBJECT CHARACTERISTICS
        3.1.3. STUDY PROCEDURE
        3.1.4. TASK DESCRIPTIONS
            3.1.4.1. DEBUGGING AND MODIFYING WITHIN A CLONE
            3.1.4.2. RENAMING WITHIN A CLONE
CHAPTER 4 RESULTS
    4.1. TIME PER TASK
    4.2. SOLUTION CORRECTNESS
    4.3. METHOD OF COMPLETION
CHAPTER 5 DISCUSSION
    5.1. CONFOUNDING FACTORS FOR CLONE VISUALIZATION
    5.2. THREATS TO VALIDITY
    5.3. TOOL DESIGN
CHAPTER 6 CONCLUSION
    6.1. RESEARCH CONTRIBUTIONS
  • 10.
    6.2. FUTURE WORK
        6.2.1. THEORY ABOUT COPY-AND-PASTE AND ABSTRACTIONS
        6.2.2. OTHER APPLICATIONS OF THIS RESEARCH
REFERENCES
APPENDIX A IRB RECRUITMENT LETTER
APPENDIX B IRB CONSENT FORM
APPENDIX C IRB QUESTIONNAIRE
  • 11. List of Tables
TABLE 1: SUMMARY OF CLONE TRACKING TOOLS WITH THEIR DEFINITIONS OF CLONE PROPERTIES
TABLE 2: SUMMARY OF CLONE TRACKING TOOLS WITH THEIR CLONE LIFECYCLE SUPPORT
TABLE 3: EXAMPLES OF WHAT LEXID CONSIDERS TO BE SUBSTRINGS.
TABLE 4: THREE EXAMPLES FROM LITERATURE THAT SHOW AN INCONSISTENT RENAMING OF IDENTIFIERS IN THE PASTED CODE FRAGMENT.
TABLE 5: HIGH-LEVEL DESCRIPTION OF THE TASKS IN THE USER STUDY
TABLE 6: THE TIME (IN MINUTES) TO COMPLETE EACH PAIR OF TASKS
TABLE 7: STATISTICAL HYPOTHESIS TESTING ON THE PAIRED TIME DATA
TABLE 8: CORRECT STATES WHEN RUNNING THE PROGRAM OR WHEN FINISHED
TABLE 9: NUMBER OF SUBJECTS WHO USED EACH LOCATION AND INSPECTION METHOD FOR DEBUGGING AND MODIFICATION TASKS.
TABLE 10: NUMBER OF TIMES EACH RENAMING METHOD WAS USED FOR RENAMING TASKS
  • 12. List of Illustrations
FIGURE 1: THE IDENTIFIER INSTANCES IN THE COPIED CODE ARE MATCHED WITH THEIR CORRESPONDING IDENTIFIER INSTANCES IN THE PASTED CODE.
FIGURE 2: THE IDENTIFIER INSTANCES IN THE COPIED AND PASTED CODE ARE PARTITIONED INTO GROUPS AND MAPPED TO EACH OTHER.
FIGURE 3: THE POSITION OF THE SOURCE CODE CHARACTERS AS REPRESENTED IN AN ASTNODE
FIGURE 4: THE THREE CASES WHEN CAPTURING A RANGE OF SOURCE CODE USING THE ECLIPSE AST API.
FIGURE 5: CNP CLONE VISUALIZATION HAS DISTINCTION BETWEEN CLONE GROUPS AND THE CLONE ORIGIN AND ITS SUBSEQUENT PASTES.
FIGURE 6: CSER SHOWS THE CHANGES THAT WOULD BE MADE TO THE EXCLUSIONINCLUSIONDIALOG CLASS (HIGHLIGHTED CODE FOR INSERTS, DELETES, UPDATES, MOVES; AND HOVER INFORMATION FOR DELETES, UPDATES) TO MAKE THE SETFILTERWIZARDPAGE CLASS IN SETFILTERWIZARDPAGE’S FILE IN THE ECLIPSE EDITOR
FIGURE 7: THE CLONE LIFECYCLE – CLONE CREATION, CLONE CAPTURE, CLONE EDITING, AND CLONE EXTINCTION.
FIGURE 8: CONSISTENT IDENTIFIER RENAMING WITHIN A CLONE USING CREN
FIGURE 9: THE PROGRAMMER CAN CHOOSE TO RENAME AN INSTANCE SEPARATELY FROM THE OTHERS (NOTICE THAT ONE “I” IN THE PASTED LOOP ON LINE 33 IS NOT BEING RENAMED AS A “J” WITH THE OTHERS ANYMORE)
FIGURE 10: THE ABSTRACT SYNTAX TREE (AST) OF A FOR LOOP WITH THE IDENTIFIER GROUPS HIGHLIGHTED.
FIGURE 11: LEXID CHANGES THE SUBSTRINGS “LEFT” TO “RIGHT” WHEN ONE IS EDITED. IN THE FUTURE, LEXID CAN BE MADE TO AUTOMATICALLY INFER THE SUBSTRING “RIGHT” IN THE PASTED CODE BASED ON “LEFT” BY MAINTAINING A DATABASE OF COMMON NAMING PAIRS.
FIGURE 12: LEXID RENAMES A SUBSTRING “B” TO “Y” CONSISTENTLY IN PASTED CODE.
FIGURE 13: A NEW FEATURE OF LEXID CAN BE SUPPORT FOR AUTO-INCREMENTING TOKENS (LEFT) AS WELL AS LEXICAL PATTERNS IN IDENTIFIERS (RIGHT)
FIGURE 14: LEXID CAN BE MADE TO INFER THAT THE CONSTRUCTOR THAT IS CALLED WITHIN A COMMON METHOD SHOULD BE THE SAME AS THE CURRENT SUBCLASS’ NAME (“XXX”).
FIGURE 15: FIND & REPLACE CAN RENAME ALL INSTANCES OF “I” (AS A WHOLE WORD) TO “J” IN THE SELECTED LINES, BUT THIS NEEDS TO BE SPECIFIED BY THE PROGRAMMER AND IS SIMPLY A TEXT-BASED SEARCH.
FIGURE 16: RENAME REFACTORING DOES NOT WORK WITH CODE THAT DOES NOT TYPE CHECK (BINDING IS REQUIRED FOR IT TO WORK)
  • 13.
FIGURE 17: CREN WORKS WITH CODE THAT DOES NOT TYPE CHECK (BINDING IS NOT REQUIRED FOR IT TO WORK).
FIGURE 18: RENAME REFACTORING IS NOT LIMITED TO RENAMING WITHIN A CLONE (FOR EXAMPLE, ONLY IN THE PASTED FOR LOOP)
FIGURE 19: REFACTORING (TOP) VS. CREN (BOTTOM).
FIGURE 20: CREN WORKS ACROSS MULTIPLE FILES (FILE 1 IS ON TOP, FILE 2 IS ON THE BOTTOM)
FIGURE 21: LINKED RENAMING DOES NOT WORK WITH CODE THAT DOES NOT PARSE (NOTICE THE ADDED SEMI-COLON BETWEEN THE ++ ON LINE 33)
FIGURE 22: CREN WORKS WITH CODE THAT DOES NOT PARSE (NOTICE THE ADDED SEMI-COLON BETWEEN THE ++ ON LINE 33).
FIGURE 23: LINKED RENAMING IS NOT LIMITED TO RENAMING WITHIN A CLONE (FOR EXAMPLE, ONLY IN THE PASTED FOR LOOP)
FIGURE 24: THE CMU PAINT PROGRAM USED IN THE USER STUDY WITH WIDGETS ANNOTATED BY CORRESPONDING INSTANCE VARIABLES.
FIGURE 25: TASK 1 – RSLIDER SHOULD BE BSLIDER (ON LINE 120)
FIGURE 26: TASK 2 – COLORCHANGELISTENER SHOULD BE THICKNESSCHANGELISTENER (ON LINE 142).
FIGURE 27: TITLED BORDERS ARE SHOWN AROUND THE COLOR PANEL AND THE THICKNESS PANEL
FIGURE 28: TASK 3 – ADD A TITLED BORDER TO COLORPANEL AND TO THICKNESSPANEL
FIGURE 29: THE LABELS OF THE RED, GREEN, AND BLUE SLIDERS ARE SHOWN COLORED
FIGURE 30: TASK 4 – ADD COLOR TO THE LABEL OF EACH COLOR SLIDER: RED, GREEN, AND BLUE.
FIGURE 31: TASK 5 – RENAME COLORPANEL TO THICKNESSPANEL.
FIGURE 32: TASK 6 – RENAME TOOLPANEL TO CLEARUNDOPANEL.
FIGURE 33: TASK 7 (PART 1) – RENAME RPANEL TO GPANEL AND RSLIDER TO GSLIDER IN THE GREEN SLIDER CLONE
FIGURE 34: TASK 8 – RENAME BPANEL TO TPANEL AND BSLIDER TO TSLIDER IN THE THICKNESS SLIDER CLONE
  • 14. List of Publications
[1] P. Jablonski and D. Hou, “Renaming Parts of Identifiers Consistently within Code Clones”, IEEE International Conference on Program Comprehension (ICPC), 2010. (2 pages)
[2] P. Jablonski and D. Hou, “Aiding Software Maintenance with Copy-and-Paste Clone-Awareness”, IEEE International Conference on Program Comprehension (ICPC), 2010. (10 pages)
[3] F. Jacob, D. Hou, and P. Jablonski, “Actively Comparing Clones Inside The Code Editor”, International Workshop on Software Clones (IWSC), 2010. (8 pages)
[4] D. Hou, F. Jacob, and P. Jablonski, “Exploring the Design Space of Proactive Tool Support for Copy-and-Paste Programming”, IBM Conference of the Centre for Advanced Studies on Collaborative Research (CASCON), 2009. (15 pages)
[5] D. Hou, F. Jacob, and P. Jablonski, “Proactively Managing Copy-and-Paste Induced Code Clones”, IEEE International Conference on Software Maintenance (ICSM), 2009. (2 pages)
[6] D. Hou, P. Jablonski, and F. Jacob, “CnP: Towards an Environment for the Proactive Management of Copy-and-Paste Programming”, IEEE International Conference on Program Comprehension (ICPC), 2009. (5 pages)
[7] P. Jablonski, “Clone-Aware Editing with CnP”, ACM SIGSOFT International Symposium on the Foundations of Software Engineering (FSE), Student Research Forum, 2008. (poster)
[8] P. Jablonski, “Techniques for Detecting and Preventing Copy-and-Paste Errors during Software Development”, Clarkson University, PhD Dissertation Proposal, 2007. (21 pages)
[9] P. Jablonski and D. Hou, “CReN: A Tool for Tracking Copy-and-Paste Code Clones and Renaming Identifiers Consistently in the IDE”, Eclipse Technology Exchange Workshop at OOPSLA (ETX), 2007. (5 pages)
[10] P. Jablonski, “Managing the Copy-and-Paste Programming Practice in Modern IDEs”, ACM SIGPLAN Conference on Object-Oriented Programming, Systems, Languages, and Applications (OOPSLA), 2007. (2 pages)
  • 15. Chapter 1 Introduction
Copy and paste is a design error. - David Parnas
Copying all or parts of a program is as natural to a programmer as breathing, and as productive. - Richard Stallman

1.1. Copy, Paste, and Modify Programming
All programming is maintenance programming, because you are rarely writing original code. - Dave Thomas

Copy and paste [236, 237, 238, 239, 240] – some people love it, others hate it. Why? Copying and pasting obviously provides some short-term benefits such as saving typing and remembering a name’s spelling. In a study on copy-and-paste usage, approximately 74% of programmers copied very small pieces of code of less than a single line (such as variable names, type names, or method names) [132], which indicates that they were copying and pasting for these kinds of reasons. The same study also concluded that the programmers on average made four non-trivial copy-and-pastes per hour [132]. It seems natural for programmers to copy and paste larger code fragments (such as blocks, methods, or classes) when they see a similar existing solution to their current task rather than write the new software solution entirely from scratch.

Not only can copying and pasting make programmers more productive in this way, but it can be especially useful when working in an unfamiliar domain, for instance, when learning a new programming language or framework. To help get started, programmers can copy and paste examples from the framework’s documentation [28], from a software repository consisting of past projects [87, 92, 217], or from an online search engine (such as Google Code Search) [28, 86] to use as a base to work from.
  • 16. Reusing Source Code Examples
Example-based programming is a legitimate form of software reuse (unlike cases of copying and pasting in order to plagiarize [176, 198, 218], which a variety of plagiarism detection tools have been developed to help deter, including AntiPlagiarist, CopyCatch, DOC Cop, Eve2, Glatt, GPlag, JPlag, MyDropBox, PAIRwise, SNITCH, SPlaT, TurnItIn, and WCopyFind*). Research findings in the psychology and AI fields verify that working with concrete examples can be advantageous [51, 209]. However, though some software components are especially designed to be reused (such as libraries, frameworks, APIs, and software product lines), not all examples that a programmer may find were specifically made for reuse purposes. As such, the programmer must be careful to extract only the functionality that is needed for reuse, while also dealing with dependencies that this code fragment may have to other parts of the software. Tool support has been developed to aid programmers in the whole process of pragmatic reuse [85, 86, 88, 89, 90, 91], reengineering [64], and in the comparison of examples [42].

The Psychology of Software Reuse
Novices generally copy and paste when they do not have a full understanding of the programming task. Since they are new to programming or to a particular language, they do not have the syntactic, semantic, and schematic knowledge that experts have in order to craft a solution. Novices are not the only ones who copy and paste for reuse, however.
* http://www.anticutandpaste.com/antiplagiarist/, http://www.copycatchgold.com/, http://www.doccop.com/, http://www.canexus.com/, http://www.plagiarism.com/, http://research.microsoft.com/apps/pubs/default.aspx?id=73093, https://www.ipd.uni-karlsruhe.de/jplag/, http://www.mydropbox.com/, http://www.pairwise.cits.ucsb.edu/, http://actlab.csc.villanova.edu/simtools/, http://splat.cs.arizona.edu/, http://www.turnitin.com/, http://plagiarism.phys.virginia.edu/Wsoftware.html
  • 17. According to [51, 52], expert programmers have “schemas” (plans): generic solutions, specific to a programming domain, that they keep in memory and can retrieve and instantiate to solve a particular programming problem. In other words, as experts become familiar with a problem domain, they develop domain-specific schemas, representing their knowledge of certain types of problems [52], which they can later recall to help them design a new program. Having prior knowledge and experience, expert programmers can use their familiarity with the situation to gain efficiency and the ability to solve more difficult tasks than if they had to design the solution entirely from scratch. Routine tasks can even become impossible to do if every part is treated as new [51, 52]. Humans naturally reuse knowledge from prior experience.

The copy and paste of source code (both large and small) tends to be a natural behavior that provides immediate benefits. The copy-and-paste operation is not bad by itself; it is the result of copying and pasting that is considered bad, since the resulting clones need to be consistently modified and maintained in the long term (the “modify” part of “copy, paste, and modify programming” [234]). Still, many people continue to strongly dislike copy-and-paste itself and blame it as the culprit of the maintenance problem of clones (which often leads to code inconsistencies) [182]. This and some other perceived problems of code clones are discussed in the following section.

1.2. The Traditional Perspective: Clones are Bad
So, copy-and-paste is not necessarily bad in the short run, if you are copying good code. But it is always bad in the long run. - Ralph Johnson

Traditionally, code cloning was considered “harmful” to a system. Problem areas include software maintenance, evolution, quality, and code aesthetics or design.
  • 18. Clones as a Software Maintenance Problem
Copying and pasting within the same code base results in code duplication [243] that needs to be properly managed and maintained. The clones are exactly the same when initially copied and pasted, but start to differ as the newly pasted code is modified to fit its task. At the time the copy and paste occurs, the programmer sees the similarity between the clones (otherwise he or she would not have made an exact duplicate as a base to work from) and also has an idea of the differences that need to be made for the new code to be properly adapted. A natural dependency exists between the clones, which are assumed to have a certain level of similarity that must remain between them. This invisible relationship between copied and pasted code fragments consists of the correspondences and differences between the clones that must be maintained as the software is updated, for example, with new features and bug fixes. It is important for the software maintainer to remember the parts of the related clones that should remain unchanged, parts that must change in the same way, and parts between the clones that are meant to differ [72]. Identifying the locations of all clones in a system and remembering their invisible relationships to one another can be extremely difficult over time.

Clones as a Software Evolution Problem
As changes to the software (like new features or bug fixes) are required over time, the clones in the system may also naturally change. In some cases, the programmer may have copied and pasted in order to get a quick solution rather than taking the time to create an abstraction such as a procedure, function, or method. If so, these clones are likely to be replaced by an abstraction as the code matures.
  • 19. The issue here is that even though the creation of the clones is avoidable to begin with and the clones will eventually disappear anyway, there is still a period during which the clones exist and need to be properly maintained. Though these particular clones are only in the system temporarily and their entire life may be short, significant effort is still needed to refactor the code. On the other hand, perpetual clones are problematic in that they require continuous, long-term maintenance.

Clones as a Software Quality Problem
The increase in source code maintenance is not the only concern of opponents of code cloning. The potential increase in the number of software bugs in the system is one of the most widely cited reasons for avoiding clone creation. Some scenarios where bugs are introduced into the system as a result of cloning include:
    • The addition of a new feature: When the system needs to be updated to include a new feature, the software maintainer must know whether to apply this particular change to all related clones or only to some of them. If the maintainer fails to apply this change to all of the correct clones, a bug (inconsistency) is introduced.
    • A bug is propagated and fixed: It is possible that the original code that was copied had an existing bug in it that has now been multiplied as it was pasted throughout the system. Once this bug has been noticed, it then needs to be fixed in all clones that it is in. If one of those copies is not fixed, there remains an inconsistency, which is actually a new bug introduced into the system!
    • A clone is modified to fit its task: Changes are made to a single clone when it is being modified to fit its own individual task. The newly pasted code fragment typically has identifiers changed to a new name related to the current task. If not all identifier instances are renamed consistently within the code fragment, this will create an inconsistency (bug).
  • 20. In all of these cases, the clone-related bugs can remain undetected. It may take a long time for the absence of a new feature in a clone to be detected (especially if that part of the software is not used often in practice). In the second case, since the existing bug was not detected earlier, it is possible that the same bug might remain hidden somewhere else in the code. Lastly, though a renaming inconsistency could be caught by the compiler, there are cases when the unchanged identifier instance is still in scope (Section 2.3 – Errors), so the inconsistency can remain undetected by both the compiler and the programmer. All of these clone-related bugs occur when the implicit rules in the cloning relationship are broken.

Clones as an Aesthetic or Design Problem
Number 1 in the stink parade is duplicated code. If you see the same code structure in more than one place, you can be sure that your program will be better if you find a way to unify them. - Kent Beck and Martin Fowler

In addition to the potential decrease in software quality, some people say that clones in software just look bad and that their presence in the code might indicate an underlying design problem. Clones can artificially increase the number of lines of code by adding “unnecessary” lines that otherwise would be in the body of a single abstraction [226]. Charles Simonyi, who introduced the concept of “intentional programming” [29, 222], is a proponent of programming with abstractions rather than with clones. He states that “...it is still pretty easy to decide at a glance that the code is bad – by the identifiers, by the juxtapositions, by the size of the expressions, or by evidences of code copying” [221].
  • 21. But he also says that a program can still be beautiful even if it is not strictly structured, as long as the program has other redeeming features [159].

Code clones are often labeled as a “code smell” [235], which is a hint that something could be wrong with the code. This part of the code should be inspected further to determine whether there is actually a problem that needs to be fixed or whether the smell can just be tolerated [179]. The term “clone smell” [13] was later coined to describe an individual clone that appears to be problematic over time and should be looked at.

The existence of clones may indicate a design problem, since it could be that the programmer did not fully think through the design of the software solution if abstractions were not used wherever possible. Abstraction-supported programming languages are designed so that programmers can take advantage of these powerful tools [29]. So, when programmers do not use the abstractions (for whatever reason) [150], they are not getting all of the benefits that the programming language has to offer and they may not be using the language as its designers intended. If the clones are to be refactored out of the code later on anyway, it might be worth spending the effort and time to design the abstractions correctly from the beginning.

Martin Fowler sees a connection between a code’s look and smell: “I wrote that about aesthetics in discussing when you apply refactorings. To some extent, the situations I describe in the refactoring guidelines are fairly vague notions of aesthetics. But I try to provide more guidance than just saying, ‘Refactor when the code looks ugly.’ I say, for instance, that duplicated code is a bad smell. I say that long methods are a bad smell. Big classes are a bad smell.”
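The inconsistent-renaming scenario described earlier in this section (a pasted fragment in which one identifier instance is left unrenamed but is still in scope) can be illustrated with a small sketch. The function and variable names below are hypothetical and are not taken from the dissertation or its user study; the point is only that the missed rename raises no error and silently produces the wrong result.

```python
def sum_left(left, right):
    """Original (copied) fragment: sums the values in `left`."""
    total = 0
    for i in range(len(left)):
        total += left[i]
    return total


def sum_right_inconsistent(left, right):
    """Pasted fragment in which only one of the two `left` instances was renamed."""
    total = 0
    for i in range(len(right)):   # renamed: left -> right
        total += left[i]          # missed rename: `left` is still in scope, so no error
    return total


def sum_right_consistent(left, right):
    """Pasted fragment with every instance of `left` renamed to `right`."""
    total = 0
    for i in range(len(right)):
        total += right[i]
    return total


left, right = [1, 2, 3], [10, 20, 30]
print(sum_right_inconsistent(left, right))  # 6, silently wrong
print(sum_right_consistent(left, right))    # 60
```

Because both lists exist in the enclosing scope, neither a compiler nor an interpreter flags the inconsistent version; only the wrong value reveals the bug.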
  • 22. 1.3. A New Perspective: Clones can be Good
If you have a procedure with ten parameters, you probably missed some. - Anonymous

Duplicated or cloned code is often considered harmful to software quality; however, it can also be a reasonable or beneficial design option. Cloning can be done with “good intentions”, including when 1) it keeps the code clean and understandable rather than introducing an unreadable, complicated abstraction, and 2) the programming language lacks expressiveness, so a trusted solution is reused (for example, in COBOL) [122, 125]. If a procedure would have too many parameters or if a programming language does not support abstractions, then clones can be a viable alternative.

There are times when it is advised to keep clones in the source code. An empirical study of code clone genealogies that looked at clones over multiple versions of a program [137] found that it may not be worth refactoring short-lived clones if they are likely to diverge soon and that the long-living clones are often in the system due to shortcomings of the programming language. In other words, limitations of the programming language design may result in unavoidable duplicates in code [132]. Research from Cordy claims that making changes to clones (which includes refactoring them) can be considered risky from a corporate standpoint, so to be safe, the clones should remain in the system [39].
  • 23. Have People Been Led Astray?
All we like sheep have gone astray. - Isaiah 53:6

According to MythSE 2007, the statement that “clones are evil” is actually a myth in software engineering [81]. Various facts are used to refute the myth, including [8, 39, 122, 125, 137, 151, 180, 191] with reasons explained on the website [81]. Godfrey says that people may have been led astray like sheep, in their thinking as a group that cloning is bad. He reiterates that cloning (or starting with the familiar) is both natural and good. For example, he claims that in both arts and life, people explore new things by carefully venturing away from the familiar and that humans find comfort in ritual, and more importantly, repetition of trusted design elements is a part of engineering [74].

Regardless of the outcome of the debate about the value of copy-and-paste and cloning, this PhD research focused on the fact that code clones do exist and thus need to be managed. Even if clones are made with good intentions or out of necessity, they can still be problematic if not handled properly. One contribution of this work, the software tool CnP, is a proactive clone management environment that tracks copy-and-paste-induced clones upon creation. Based on the tracked cloning information, CnP provides support for clone-related maintenance activities. This dissertation shows how CnP’s support for copy-and-paste clone-awareness may be able to help programmers benefit from this clone information during debugging and modification tasks, develop software more efficiently, and prevent inconsistent identifier renaming within clones. A user study was performed to measure the effects of this kind of clone-aware programming.
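As a rough illustration of what renaming an identifier consistently within a tracked clone might involve, the sketch below renames every instance of one identifier inside a given line range and nowhere else. It is an assumption-laden analogy, not CnP's or CReN's implementation: those tools work on Java code inside Eclipse, whereas this sketch uses Python's standard `tokenize` module, and the function name `rename_in_clone` and the line-range representation of a clone are hypothetical.

```python
import io
import tokenize


def rename_in_clone(source, first_line, last_line, old, new):
    """Rename identifier `old` to `new`, but only on lines first_line..last_line.

    Working on tokens (rather than raw text) means that comments, strings, and
    longer names that merely contain `old` are left untouched.
    """
    lines = source.splitlines(keepends=True)
    hits = [
        tok
        for tok in tokenize.generate_tokens(io.StringIO(source).readline)
        if tok.type == tokenize.NAME
        and tok.string == old
        and first_line <= tok.start[0] <= last_line
    ]
    # Replace from the last hit to the first so earlier column offsets stay valid.
    for tok in reversed(hits):
        row, col = tok.start
        line = lines[row - 1]
        lines[row - 1] = line[:col] + new + line[col + len(old):]
    return "".join(lines)


source = (
    "for i in range(n):\n"
    "    print(items[i])\n"
    "for i in range(n):\n"                    # pasted clone starts here (line 3)
    "    print(items[i])  # i is the index\n"
)
print(rename_in_clone(source, 3, 4, "i", "j"))
```

In this example only the two `i` identifiers in the pasted loop (lines 3 and 4) become `j`; the original loop, the comment text, and names such as `in`, `print`, and `items` are unchanged, which a plain text search for the letter "i" would not guarantee.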
  • 24. 1.4. Research Contributions
The main contributions of this research included:
    • The copy-and-paste (CnP) tool
        o Proactive tracking – CnP/CReN were the first known clone tracking tools published (in 2007), which took a more proactive approach to capturing clones upon creation (by detecting when a copy and paste occurs and gathering the initial clone and identifier information at that time when the clones are identical).
        o Intra-clone editing – CReN was the only known tool to support editing within a clone (all previous tools only supported between-clone editing). Intra-clone editing is done when programmers copy, paste, and modify the pasted code to fit the current task. The kind of modification that is made in these cases is often identifier renaming, which is what CReN supports.
        o AST-based – CnP makes use of the abstract syntax tree (AST) representation of the source code, which is a better approach than the text-based methods that cannot differentiate between source code and any other text. CSeR is one of the few differencing tools to take advantage of ASTs.
    • Dimensions of clone tracking tool development – When comparing CnP with related clone tracking tools, a variety of clone properties were determined that these kinds of tools must explicitly define. Listing the properties can be useful in the creation of new tools or to help redefine a tool’s current property definitions.
  • 25. Definition of the clone lifecycle – The comparison of tools also led to a definition of the clone lifecycle stages, including some areas where there is current tool support and areas that need more support. • Realization about clone visualization – After completing a user study on CnP, CnP’s clone visualization was not found to provide statistically quicker and correct solutions than without it. Observation and other analysis (in Section 5.1) helped better determine whether and when a programmer may exploit clone information. There is no other known similar analysis of the role of clone information in maintenance tasks, and, thus the analysis in and of itself can be a contribution. The analysis can be used in the design of future experiments. 1.5. Outline of This Dissertation This dissertation first presents the traditional perspective on copying and pasting and code cloning (Section 1.2), including the clone detection and removal approach (Section 2.1). It then introduces the new perspective that states that even though cloning can be problematic, clones can be reasonable and beneficial to a software system (Section 1.3). Furthermore, since these clones can be in the source code for any length of time, this dissertation proposes that clones should be managed throughout their lifecycles until extinction, that is, if they ever get to that stage (Section 2.2). As most of the problems with cloning revolve around the issue of software maintenance, support for modification or editing is the main focus of the related clone tracking tools (Section 2.2.3.3). An additional distinction between these clone tracking tools is whether they are proactive or retroactive, that is, whether they start capturing clone information upon the clone’s creation (via copy and paste) or whether they use 11
clone detection or clone selection by the programmer, which can start the clone tracking much later in the clone’s life (Section 2.2.3.2). Each tool can also define the properties of clones differently, with some tool designs and implementations preferred over others (Section 2.2.2). Finally, this dissertation presents the design (Chapter 3) and results (Chapter 4) of a user study that tested the CnP tool’s basic visualization and renaming features, followed by a discussion related to this study (Chapter 5) and a conclusion with future work (Chapter 6).
Chapter 2
Literature Review

If something is worth doing once, it's worth building a tool to do it.
- A Software Engineering Proverb

2.1. Clone Detection and Removal

Software entities are more complex for their size than perhaps any other human construct because no two parts are alike (at least above the statement level). If they are, we make the two similar parts into a subroutine – open or closed. In this respect, software systems differ profoundly from computers, buildings, or automobiles, where repeated elements abound.
- Frederick P. Brooks, Jr.

Clone Detection

There is a wide variety of clone-related research [148, 149]. Traditionally, much of the focus has been on clone detection [162, 211, 213, 214] and removal. In this field, researchers often contribute a variety of clone detection techniques, including algorithms [57, 60, 61, 69, 109, 110, 112, 113, 120, 193, 207, 215, 216], heuristics [17, 18] and processes [158]. Many early algorithms made use of program dependence graphs (PDGs) [20, 63, 93, 94, 144, 152] and program slicing [24, 145]. Early research dealt with finding exact code duplicates, while later work expanded to detect “near-miss clones” (code fragments that are not identical, but have some level of similarity) [10, 11, 21, 40, 212, 223]. Some algorithms were implemented as clone detection tools [22, 23] (such as AntiCutAndPaste, CCFinderX, Clone Digger, CloneDR, Dup, Duplo, DupMan, Moss, SDD, Simian, and SimScan*) whose purpose is to find code clones in pre-existing code.
* http://www.anticutandpaste.com/anticutandpaste/, http://www.ccfinder.net/ccfinderx.html, http://clonedigger.sourceforge.net/, http://www.semdesigns.com/Products/Clone/, http://cm.bell-labs.com/who/bsb/research.html, http://sourceforge.net/projects/duplo/, http://sourceforge.net/projects/dupman/, http://theory.stanford.edu/~aiken/moss/, http://wiki.eclipse.org/index.php/Duplicated_code_detection_tool_(SDD), http://www.redhillconsulting.com.au/products/simian/, http://www.blue-edge.bg/download.html
  • 28. Clone detection tools are retroactive and as a result, can reveal a number of false positives and false negatives that must be sorted through by the programmer. The fact that humans need to go through a clone detection tool’s results to verify its accuracy in returning actual clones of interest is a major disadvantage of these kinds of tools. Clone Removal People who dislike copy-and-paste and code clones tend to want to solve the problems of cloning by removing the clones from the system as soon as possible. The main reason for clone detection has been for subsequent clone removal, that is, to get rid of the clones in legacy systems (already existing source code). As previously mentioned, this approach is retroactive and thus is not solving the problem as it happens. On the other hand, one way of proactive “clone prevention” [21] that is suggested is to simply run a clone detection tool on the code as it is being developed, so that the clones can be removed instantaneously by the programmer. Others even suggest preventing the creation of clones by disabling the copy and paste functionality in the programming editor! But, prevention is not enough, since some clones must or should remain in the source code. The most common method of clone removal is refactoring [67], which means to restructure or change the source code without changing its external functional behavior. One of the most common forms of refactored clones is as a functional abstraction – to replace the multiple, similar code fragments with a single procedure [142, 143] to make maintenance easier since updates could be made in one spot. The common portion between the clones would be the function body and the differences would be handled by the function parameters. Cloned classes can be refactored such that “a base class encapsulates the commonalities and the derived classes specialize in the peculiarities” 14
[74]. Using generics [108] and templates for classes [19] can also add an acceptable form of abstraction into the system, thus eliminating class-level clones. Other forms of refactored clones [74, 148] include: macros [3], design patterns [148], program slices [71], and software product lines [68, 184, 185]. The process of code refactoring can be error-prone when done manually [79], but there is some default refactoring support in the IDE (like renaming and moving [252]) and separate refactoring tools (such as [78, 79, 82, 84, 227]), which can help the programmer determine how and where to refactor.

When to Refactor

The first time you do something, you just do it. The second time you do something similar, you wince at the duplication, but you do the duplicate thing anyway. The third time you do something similar, you refactor.
- Don Roberts

There are varying perspectives about when to refactor. Purists believe that all code smells (including code clones) should be avoided with no exceptions [235]. They agree with the “Don’t Repeat Yourself (DRY)” principle, which states that “every piece of knowledge must have a single, unambiguous, authoritative representation within a system” [242]. The Extreme Programming (XP) software development methodology calls this “Once and Only Once” (that is, that “each and every declaration of behavior should appear once and only once”) [244]. Followers of these rules would favor refactoring to make a single abstraction as soon as possible. The “rule of thumb” of when to refactor, called the “Rule of Three”, instead tolerates duplication until the third occurrence of similar code, at which point the clones should be refactored [246, 247]. In general, it takes at least three applications of something for it to be considered a pattern [247], so it seems that the “Rule of Three” would be what is more often done naturally in practice.
Despite the potential benefits of refactoring to make the code more maintainable and less complex, refactoring can be done prematurely before it would happen naturally. This could be problematic and require significant effort to fix. Also, creating an abstraction can be difficult or impossible, for example, due to the programmer’s inability to create the abstraction [76, 150] or due to language constraints. Furthermore, even though there are rules about when to refactor, the rules can be broken, which would leave clones in the system that need to be managed for a temporary or extended period of time.

2.2. Clone Lifecycle Management

Cloning is a good strategy if you have the right tools in place. Let programmers copy and adjust, and then let tools factor out the differences with appropriate mechanisms.
- Ira Baxter

Since clones will continue to exist and some clones may even be intentionally permanent, tool support is needed for all stages of the clone lifecycle. The term “clone management” has been used to refer to “clone removal” [146, 147] and also to one kind of “clone editing” that links together clones for common changes to be made simultaneously among them [54, 55, 189, 231]. Both “clone editing” and “clone removal” (in other words, clone extinction) are parts of the clone lifecycle that can be managed with the aid of software tools. This dissertation presents the dimensions of a software tool, CnP, which provides copy-and-paste-induced clone management in the Eclipse IDE.

2.2.1. Brief CnP Tool Descriptions

The entire suite of Eclipse plug-ins from this research that support copy, paste, and modify programming is called CnP. At the time of this writing, the CnP project
  • 31. consists of three plug-ins: CReN (for consistent identifier renaming), LexId (for consistent substring renaming), and CSeR (for clone comparison). All CnP plug-ins utilize the abstract syntax tree (AST) source code representation that is available in the Eclipse framework. First, the tools track the cloning relationship right when the code is copied and pasted before any changes are made. Each clone’s location is accurately tracked according to its starting character position and length in number of characters within a source code file. Only copied and pasted code that is fully contained within an AST node is captured in this model. Related clones from the same copy and paste sequence are also noted (Section 2.2.2.2 – Clone Model). CnP’s basic visualization (used in CReN and LexId) consists of colored bars next to the clone’s code fragment within the source code file. CSeR has its own unique method of visualization that differentiates between inserts, deletes, updates, and moves, highlighting each kind of user-made change with a different color (Section 2.2.2.3 – Clone Visualization). In addition to clone tracking and visualization, CReN and LexId track identifiers within these related clones. First, the identifier instance locations between the clones (which are AST leaf nodes of type SimpleName) are matched, which represents the correspondence relationship, as in Figure 1. (Note: this correspondence is not used by CReN or LexId yet). Then, all of the same identifier instances are grouped together, which are assumed to be renamed together consistently, as in Figure 2. This way when the programmer edits any one of the identifier instances, all others of the same program element or name are renamed with it automatically and consistently. All identifier 17
instances that are currently being edited within a clone are shown boxed, similar to Eclipse’s Linked Renaming (Section 2.2.3.3 – Clone Editing).

Figure 1: The identifier instances in the copied code are matched with their corresponding identifier instances in the pasted code.
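The grouping of identifier instances described above can be illustrated with a minimal sketch. It is not CReN's implementation: CReN works on SimpleName leaf nodes of the Eclipse AST, whereas this sketch assumes a regular expression and a small hand-picked keyword list as a stand-in, and the class and method names (IdentifierGroups, group, rename) are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class IdentifierGroups {
    // Assumption: a regex stands in for the AST's SimpleName leaves.
    private static final Pattern IDENT = Pattern.compile("[A-Za-z_$][A-Za-z0-9_$]*");
    // Assumption: only the keywords needed for the example are excluded.
    private static final Set<String> KEYWORDS =
            new HashSet<>(Arrays.asList("int", "for", "if", "else", "while", "return"));

    /** Maps each identifier name to the character offsets of all its instances in the clone. */
    public static Map<String, List<Integer>> group(String clone) {
        Map<String, List<Integer>> groups = new LinkedHashMap<>();
        Matcher m = IDENT.matcher(clone);
        while (m.find()) {
            if (!KEYWORDS.contains(m.group())) {
                groups.computeIfAbsent(m.group(), k -> new ArrayList<>()).add(m.start());
            }
        }
        return groups;
    }

    /** Renames every instance in one identifier group, leaving all other text untouched. */
    public static String rename(String clone, String oldName, String newName) {
        List<Integer> offsets = group(clone).get(oldName);
        if (offsets == null) {
            return clone; // no such identifier in this clone
        }
        StringBuilder sb = new StringBuilder(clone);
        // Replace from the last offset backwards so that earlier offsets stay valid.
        for (int i = offsets.size() - 1; i >= 0; i--) {
            int start = offsets.get(i);
            sb.replace(start, start + oldName.length(), newName);
        }
        return sb.toString();
    }
}
```

Because the edit is confined to the offsets recorded for one group, renaming i in a pasted loop changes all four instances of i but leaves the keyword int and every other identifier as they were. A regex would also match names inside string literals and comments, which is one reason an AST-based approach is preferable in a real tool.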
  • 33. Figure 2: The identifier instances in the copied and pasted code are partitioned into groups and mapped to each other. LexId further adds onto this default functionality of CReN by tracking and grouping together common substrings between the different identifiers within a clone. LexId tracks corresponding identifier pieces and renames these identical parts of identifier names consistently together within copied and pasted code fragments. All instances of a common substring between all identifiers within a clone are renamed together as one of those is renamed by the programmer (Section 2.2.3.3 – Clone Editing). 2.2.2. Definitions of Clone Properties Certain properties of clones need to be explicitly defined when creating a software tool that tracks code clones. CnP and related software tools can define each clone property in different ways. The following subsections give a variety of definitions that are used for clone similarity, clone model, clone visualization, clone persistence, and clone 19
  • 34. documentation and clone attributes. Table 1 (on the next page) summarizes the design and implementation details for each of the related clone tracking tools: Clonescape [38], CPC [251], Codelink [231], LAPIS [189], and CloneTracker [54, 55], including CnP [95, 96, 97, 100, 101, 102] (and its parts: CReN consistent identifier renaming [103], LexId consistent substring renaming [104], and CSeR clone comparison [106, 107])*, and it specifically highlights the problems that the related tools did not address that CnP does. The emphasis of these six tools, in particular, is in supporting the editing phase of the lifecycle to avoid inconsistent modifications to clones. 2.2.2.1. Clone Similarity Software clones are segments of code that are similar according to some definition of similarity. - Ira Baxter As mentioned in Chapter 1, programmers often copy and paste (which creates code clones) when they see a similarity between existing code and the current task at hand. Research in the psychology field agrees that people’s minds work in this way – new problems are often solved by using prior problems’ solutions [51, 52, 65, 73, 160, 170, 253]. People, even as children, recognize analogy and similarity when comparing things and they know the correspondence relationship between the objects, whether the object attributes are shared (similarity) or not (analogy) [73]. * http://s88387243.onlinehome.us/wiki/Clonescape/, http://cpc.anetwork.de/, http://harmonia.cs.berkeley.edu/harmonia/projects/codelink/, http://www.cs.cmu.edu/~rcm/lapis/, http://www.cs.mcgill.ca/~swevo/clonetracker/, http://www.clarkson.edu/~dhou/projects/CnP/ 20
Table 1: Summary of Clone Tracking Tools with their Definitions of Clone Properties
In general, code clones are defined as “similar” code fragments in software, from a few lines of code to whole files. The similarity relationship between clones is often defined in terms of the characteristics of the code that make up the clones such as its text, syntax, semantics, or pattern [148]. Four types of clones have been defined [23]:
• A Type 1 clone is an exact copy without modifications (except for white space and comments).
• A Type 2 clone is a syntactically identical copy in which only variable, type, or function identifiers were changed.
• A Type 3 clone is a copy with further modifications such that statements were changed, added, or removed.
• A Type 4 clone is a semantically (or functionally) equivalent segment, which may differ significantly in terms of textual equivalence.

Clones that are a result of copying and pasting usually remain textually similar (Types 1-3) [23] and are the kind of clones that most clone detection research has focused on. Semantic clones (Type 4), however, can be very difficult [69] or nearly impossible to find retroactively [23].

All clone detection tools rely on some notion of similarity in source code in order to define clones and they return “sets of code blocks within a user-supplied similarity threshold of each other” [223]. But clone detection tool results are not perfect, even for identical code, since other things like clone boundaries need to be considered. As with clone detection tools, determining the similarities and differences between code fragments is also useful in managing clones. The next two subsections explain some ways that clone tracking tools use similarity to define what a clone is and how to manage these clones, respectively.
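The four clone types can be made concrete with a small illustration. The methods below are invented for this purpose (they do not come from any studied system), and each is labeled with the clone type it has relative to sumOriginal.

```java
import java.util.Arrays;

public class CloneTypes {
    // The original fragment.
    public static int sumOriginal(int[] values) {
        int total = 0;
        for (int i = 0; i < values.length; i++) {
            total += values[i];
        }
        return total;
    }

    // Type 1: identical apart from white space and comments.
    public static int sumType1(int[] values) {
        int total = 0; // running total
        for (int i = 0; i < values.length; i++) { total += values[i]; }
        return total;
    }

    // Type 2: syntactically identical, only identifiers renamed.
    public static int sumType2(int[] prices) {
        int cost = 0;
        for (int k = 0; k < prices.length; k++) {
            cost += prices[k];
        }
        return cost;
    }

    // Type 3: a Type 2 copy with a statement added, so behavior may diverge.
    public static int sumType3(int[] prices) {
        int cost = 0;
        for (int k = 0; k < prices.length; k++) {
            if (prices[k] < 0) continue; // added statement: skip negative entries
            cost += prices[k];
        }
        return cost;
    }

    // Type 4: functionally equivalent to the original but textually unrelated.
    public static int sumType4(int[] values) {
        return Arrays.stream(values).sum();
    }
}
```

Types 1 and 2 are what a paste followed by renaming typically produces, while Type 3 arises once the pasted code is adapted further. Type 4 shows why semantic clones are hard to find retroactively: nothing in the text links it to the original.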
Defining Clones

For the retroactive tools that rely on clone detection (CloneTracker), there is a level of similarity that must exist for existing code pieces to be considered clones that is defined by the clone detection tool. For the retroactive tools that rely on the programmer’s selection (Codelink and LAPIS), the initial level of similarity is defined by the programmer who is selecting the clones. Either selecting clones or using the clone detection tool, if done after the cloning relationships have been forgotten by the programmer, can yield inaccurate clones. For proactive tools that capture copy-and-paste-induced clones (CnP, Clonescape, and CPC), the new code fragment is guaranteed to be a clone and is identical to the original when initially pasted. Because of this, proactive tools only need to consider what happens to the similarity between clones as they evolve.

Managing Clones

CnP’s approach to the definition of clone similarity can be characterized as being constructive and extensional. For example, the consistent renaming (CReN) portion of CnP manages similarity such that clones in the same clone group all have corresponding identifiers, which must be renamed together in each clone. The corresponding identifier groups need to be constructed ahead of time and tracked thereafter. This correspondence between identifiers can thus be considered as part of the similarity between clones within the same clone group. In addition to identifier extraction, LexId goes further by grouping and tracking parts of identifiers (substrings) together. The CSeR correspondence map currently tracks fields, methods, parameters, conditional expressions, method calls, simple names, and literal constants between the clone and its origin. It also uses the Levenshtein Distance (LD) to connect similar but not identical changes as an “update”.
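For reference, the Levenshtein Distance mentioned above can be computed with the standard dynamic-programming recurrence, sketched below. The distance method is the textbook algorithm. The isUpdate method and its threshold parameter are hypothetical, added only to illustrate how an edit distance might separate an “update” from an unrelated change; they are not CSeR's actual heuristic.

```java
public class Levenshtein {
    /** Minimum number of single-character insertions, deletions, and substitutions turning a into b. */
    public static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning the empty prefix of a into b[0..j) takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning a[0..i) into the empty string takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev; // reuse the two rows instead of allocating a full matrix
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /**
     * Hypothetical rule: two code lines count as an "update" of one another when they
     * differ, but their distance relative to the longer line is at most the threshold.
     */
    public static boolean isUpdate(String oldLine, String newLine, double threshold) {
        int longest = Math.max(oldLine.length(), newLine.length());
        if (longest == 0) {
            return false;
        }
        int d = distance(oldLine, newLine);
        return d > 0 && (double) d / longest <= threshold;
    }
}
```

The result is a single number, so it indicates how far apart two fragments are but not which parts differ.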
  • 38. Codelink uses the longest-common subsequence (LCS) algorithm (like the one implemented by the UNIX Diff utility) to determine the commonalities and differences of clones within a clone group. The main shortcomings of the LCS algorithm include its potentially long running time and lack of intuitive results [231]. The most popular method of code similarity in related work seems to be the Levenshtein Distance (LD) (in Clonescape, CPC, CloneTracker, and CSeR), which is a metric of the amount of editing (the edit distance) needed to make two strings the same. CloneTracker does its line mapping technique by calculating the LD for two lines of code at a time. Unlike the constructive, extensional nature of CReN and LexId’s approach, the code can be tokenized whenever LD needs to be calculated. Thus, LD is not calculated ahead of time and there is no need to track the result of LD. Also, since the Levenshtein Distance only returns a numerical value representing clone similarity, it will not tell additional information about similarity, like which parts of each clone are different. CReN and LexId’s notion of similarity, on the other hand, is purely syntax-based and requires parsing to reveal the exact commonalities and differences among clones. 2.2.2.2. Clone Model The following subsections describe the clone model for each tool, both in terms of how clone locations and clone relationships are represented. Clone Location CnP and other clone-related tools that use a tree-based representation of the source code specifically use the abstract syntax tree (AST) API provided in the Eclipse JDT framework [157]. In Eclipse, an AST node (ASTNode) contains a part of the 24
program’s source code. The source code characters and their absolute position in the source code file are captured in the AST. Each ASTNode has a starting position that denotes the numeric position of the first character in the node’s content and an ending position that denotes the numeric position of the last character in the node’s content. An ASTNode node’s character starting position can be represented as StartPos, whose value can be retrieved with the Java code node.getStartPosition(), and its character ending position can be represented as EndPos, whose value can be calculated with the Java code node.getStartPosition() + node.getLength() - 1, as shown in Figure 3.

Figure 3: The position of the source code characters as represented in an ASTNode.

CnP represents the actual source code that is copied and pasted as the largest continuous set of whole AST nodes within the range. The beginning of the code fragment (that is selected and copied-then-pasted) can be denoted as BegIntRange and the end of the code fragment can be denoted as EndIntRange, which together define the range. The case that CnP supports is when the node lies entirely within the range (in other words, CnP captures only the nodes that are fully contained within the copied-and-pasted code fragment), which is case 1 in Figure 4. In this case, a node is captured if BegIntRange <= StartPos && EndIntRange >= EndPos. Copied and pasted source code that is only partially contained within an AST node is not captured in this
representation (CnP does not capture the node’s contents for cases 2 and 3 in Figure 4, which is when the node is partly within the range or not within the range at all).

Figure 4: The three cases when capturing a range of source code using the Eclipse AST API.

Therefore, in general, CnP uses the character offset and length from the source code to determine a clone’s location in a particular file. The actual source code that is copied and pasted is represented as the largest continuous set of whole abstract syntax
tree (AST) nodes within the range. Although it is not stated in [231], Codelink probably also uses offsets, since it uses a token-oriented rather than a line-based algorithm for similarity comparisons between clones. CPC uses offsets as well. LAPIS represents a text region as a substring with a start offset and an end offset relative to the start of the file. Some clone detection tools and clone management tools, for example Clonescape, represent a clone’s location by the name of the file that it is in together with its line range. The problem with a line-based representation, however, is that it could give an imprecise clone boundary because a single line may contain multiple statements. The character offset representation, on the other hand, is able to pinpoint the exact range of all clones.

CloneTracker was the first to create a way to represent the location of clones without using a file name with character or line ranges. Instead, CloneTracker uses a “clone region descriptor (CRD)”, which describes the clone’s relative location in the file using syntactic, structural, and lexical information (for example, the clone’s alignment with code blocks). It is possible to use a CRD calculated for a code clone in an early release to locate the same clone in future releases. However, CRDs may fail to locate clones when the assumptions that the approach relies on are broken. CnP, by contrast, tracks each clone’s character offset and length from the moment it is pasted, so its clone locations do not depend on such assumptions.

Clone Relationship

Much clone-related research, such as [54, 55, 111, 137, 251], including this dissertation, refers to all similar clones as belonging to a “clone group”. Other research refers to a clone group as a “region set” [189] or a “clone class” [13, 40, 123]. In all of these cases, the related clones are viewed symmetrically, at the same level of group membership. Clonescape, on the other hand, distinguishes the original as the parent and the duplicated
  • 42. copy as the child. As a result, clones of the same parent can be called siblings. All related clones form what they call a “clone family”. While it may be useful to know the clone’s origin for comparison against the pasted code and for clone visualization, the origin information could and should be separated from the basic clone model. 2.2.2.3. Clone Visualization Clone visualization can be an effective means to make programmers aware of the clones in a system. Markers – Colored Bars and Highlights The latest version of CnP’s clone visualization feature was improved to distinguish clone groups (related, similar clones that result from a series of copy and pastes) by coloring all clones within the same group with the same color of bars. It distinguishes between the origin and its pastes by slightly darkening the colored bar that is next to each pasted region. For example, in Figure 5, the origin was the method “more_variables” (shown in the back), which has a regular shade of yellow for its visualization bar (since it is the original code fragment that was copied), while its pastes (the newly modified and related methods “more_arrays” and “more_functions”) are shown with slightly more grayed versions of the color yellow. These three clone instances belong to the same clone group, hence they are displayed with variations of the same color (yellow). A different code fragment that is copied and pasted (belonging to a different clone group) would be represented with shades of a different color, such as the color red. 28
  • 43. Figure 5: CnP clone visualization has distinction between clone groups and the clone origin and its subsequent pastes. Visualizing clones is often a challenge that all clone-related tools must address. Similar to CnP, CPC uses colored rulers to show the lines of each clone visually and CloneTracker marks the lines of clones visually in the sidebar of Eclipse. Codelink addresses the visualization issue by allowing similar parts of the clones to be hidden from view (and indicating the commonalities between linked clones in blue and differences in yellow). CSeR determines or infers each user-made change to clones as an insert, delete, update, or move, and then highlights each kind of change with a different color. Unchanged code within a clone is not highlighted. Mouse hover events reveal details about the change, including what the updated code was before in the original and what 29
has been deleted from the original. A screenshot of CSeR’s highlights and hover information is shown in Figure 6.

Figure 6: CSeR shows the changes that would be made to the ExclusionInclusionDialog class (highlighted code for inserts, deletes, updates, moves; and hover information for deletes, updates) to make the SetFilterWizardPage class in SetFilterWizardPage’s file in the Eclipse editor.

The four kinds of user-made differences between related clones, according to CSeR, are:
1. Insert – the addition of an AST (abstract syntax tree) node, highlighted in green.
2. Delete – the removal of an existing AST node, highlighted in red.
3. Update – the modification of an existing AST node, highlighted in yellow.
4. Move – the difference between the matching statements of the clones is that they have different neighbors, highlighted in blue.
  • 45. Differencing and Comparison Tools Some research looks at comparison [153] and its application, including comparing source code examples [42]. Differencing tools must somehow show the differences between files visually to the user. Though visualization is still a challenge to these tools, most are very simple in how they display files’ differences, and the main distinguishing feature to these related tools is the choice of differencing algorithm used. There are many text-based differencing tools available. Most make use of the diff algorithm [99, 241] and are based on solving the LCS (Longest Common Subsequence) problem. Since this approach is developed for text files, it has obvious disadvantages when used for Java source code [106, 107]. Some differencing tools that are based on the diff or LCS algorithm include UNIX Diff, Eclipse’s Compare Editor (which can be invoked by right clicking selected file(s) in Eclipse’s Package Explorer view and then choosing the “Compare With” menu option), Ldiff [32, 33], and Version Editor (ve) [7]*. Ve provides tight integration of the revision history and the editor so it has the limitations and disadvantages of the text-based tools and the version control system. There are a variety of graph-based differencing algorithms [5, 230, 233] and tools such as Cdiff [25, 259], Jdiff [5], Semantic Diff [105], and Exas [193]*. The graph-based approach has an advantage over the text-based tools, which only focused on syntax, since these take into account the program’s semantics as well. However, they can be slower and it is not always clear whether the extra analysis pays off. * http://directory.fsf.org/project/diffutils/, http://help.eclipse.org/help32/topic/org.eclipse.platform.doc.user/reference/ref-25.htm, http://sourceforge.net/projects/ldiff/, http://ix.cs.uoregon.edu/~datkins/ve.html * http://www.ece.iastate.edu/~nampham/projects/clone/Exas/ 31
  • 46. Many differencing tools are abstract syntax tree (AST)-based such as LaDiff [37], Breakaway [41], Jigsaw [43, 44], ChangeDistiller [66], and Coogle [215, 216], including CSeR [106, 107]*. These tools in general have the advantage of being able to obtain structured information from the tree-based representation of the source code. CSeR differs from these tools in terms of its purpose (clone differencing), its interactive and incremental updating of correspondence rather than re-computing from scratch (in contrast to what is done in Breakaway [41]), and the heuristics that it uses to infer change categories (which differs from, for example, those of ChangeDistiller [66]). Another way of looking at program changes is to use mapping or origin analysis as part of the differencing algorithm [138, 205] or the tool implementation such as Beagle [75]. More recently, additional logic has been incorporated as well to get a better understanding of the changes. The UMLDiff approach tracks the evolution of higher- level program elements (at the level of UML models) over versions of systems [256, 257, 258] and other research utilizes a novel rule-based and combination algorithm (LSdiff) [133, 136] to infer regular change patterns and overcome some of the disadvantages of the other differencing approaches. Capturing Program Structure and Edits There is a body of research that proposes structure-based editors and semantics- preserving editing environments [16, 27, 80, 139, 140, 141, 188, 199, 200, 203, 204, 206, 219, 229, 250, 261] rather than traditional text-based editors. These structured editors and IDEs can benefit programmers by letting them know exactly which edit operations are being performed, however these specific, and often stand-alone, editors are not * http://lsmr.cs.ucalgary.ca/projects/breakaway/, http://lsmr.cs.ucalgary.ca/projects/jigsaw/, http://www.clarkson.edu/~dhou/projects/CnP/ 32
  • 47. commonly used in practice. Instead, other research focuses on determining and presenting structural correspondence [41, 43, 44, 106, 107, 193] to programmers in the IDEs that they already use, like Eclipse, by utilizing the tree-based representation of the source code. Rather than bombarding the programmer with too much extra information, CSeR makes a few general categories of possible user-edits and infers which category a sequence of edits belongs to incrementally. How to efficiently parse code is a research problem itself [249]. For better performance, CSeR only compares the smallest corresponding sub-trees that contain the positions where the programmer last edited. Capturing Program Changes … the problem [with software projects] isn’t change, per se, because change is going to happen; the problem, rather, is the inability to cope with change when it comes. - Kent Beck Not only is it important to capture the current state of the clones in a system (by continuously updating clone locations and contents as they are changed), but capturing change information over time and presenting this to the programmer can be extremely beneficial. CSeR captures and displays certain clone changes in the editor, which can help programmers see the level of similarity between the clones better. Seeing update and deletion information that otherwise is not shown in the file can also be very useful in learning about the code [186, 187]. Related research in the area of software evolution looks further into program changes and multi-version programs [31, 37, 66, 77, 129, 131, 133, 134, 136, 154, 205, 262], changeability [177, 178], and evolutionary history [256]. Specifically studying the evolution of clones over multiple versions of the program helps determine whether these clones require frequent consistent changes or whether they 33
  • 48. remain dormant and impose no significant maintenance challenges. It can also pinpoint at what stage clones are refactored (when they are changed in form) and it can conclude whether the clones need to be refactored at all [1, 135]. Seeing the code as it has evolved over time in a version control system instead of just seeing the current version in the editor can be extremely beneficial in learning how and why the program changes [14]. Using Change Information from Version Control Systems There is a large body of research that focuses on mining software repositories and then analyzing the historical information from version control systems, such as SVN (Subversion) or CVS (Concurrent Versions System), for a variety of reasons [6, 7, 15, 31, 33, 70, 82, 133, 134, 154, 262]. Clone-related tools that use version control system information include Cleman and ClemanX [194, 195, 196], Clever [197], Clone Detection Toolbox [190], Clone Smell Extractor [13], and Vaci [119]*. However, this approach is limited since the information obtained is only from snapshots of when the program’s source code was checked-in or checked-out and it often requires additional analysis and inferences to be useful. Furthermore, the program histories may contain a lot of irrelevant information that is not clone-related. Given program version changes, people would need to sort through to detect likely copied-and-pasted code and eliminate extra information. Also, although people might be able to obtain information about specific changes made to a particular file, they would not automatically have correspondence information (between files) from the histories alone. * http://www.ece.iastate.edu/~nampham/projects/clone/Cleman/, http://www.ece.iastate.edu/~nampham/projects/clever/, http://www.ccfinder.net/vaci.html 34
  • 49. Warnings – Error Prevention or Detection

Not only is support for clone management important, but the prevention or detection of clone-related errors (also called bugs [13, 111, 171, 172, 174], inconsistencies [111, 171, 172], or anomalies [251]) should also be provided. CnP contains features that may either prevent errors (like CReN does) or detect potential errors (warnings) in the tracked clones.

CnP issues a warning if any identifier in the pasted code binds to a declaration in the context where it is pasted (external identifier scoping) [95, 96, 97]. For example, when a method is copied and pasted within the same class, CnP can provide a warning for each identifier within the method that is defined at the class level (outside of the method, but within the class). These warnings will alert the programmer that these particular identifier instances within the clone (method) may need to be renamed. This is useful, since it is common for programmers to copy and paste a code fragment that contains references to external identifiers that are intended only for the original fragment. The programmer can then use CReN to rename the identifier instances in the pasted location, if desired.

There are a number of software quality tools [26, 30, 50], including Axivion, CloneDetective [116], ConQAT, and PMD,* and clone bug detection/prevention tools such as CP-Miner [171, 172], CPC [251], a DECKARD-based tool [111], and FixWizard.* The famous Alice software prevents syntax errors by providing a drag-and-drop programming system to aid novice programmers [127], who tend to make errors by misunderstanding program constructs [260] and breaking implied system rules [58].
* http://www.axivion.com/index-en.html, http://conqat.cs.tum.edu/index.php/CloneDetective, http://conqat.cs.tum.edu/index.php/ConQAT, http://pmd.sourceforge.net/
* http://opera.cs.uiuc.edu/Projects/ARTS/CP-Miner.htm, http://cpc.anetwork.de/, http://wwwcsif.cs.ucdavis.edu/~jiangl/research.html, http://www.ece.iastate.edu/~nampham/projects/fixwizard/

Bug
  • 50. detection (rather than prevention), on the other hand, is often done by finding inconsistencies between the clones [118, 128, 255] when changes are made [13, 161], especially inconsistencies in identifiers [111, 171, 172] and spelling errors [45, 98, 192].

CP-Miner uses identifier mapping: an identifier is considered consistent when it always maps to the same identifier (which could have a different name) in the other fragment, and inconsistent when it maps to multiple identifiers [171, 172]. For Example 1 in Table 4 of Section 2.3, the identifier “prom_phys_total” in the copied code fragment maps to both “prom_prom_taken” and “prom_phys_total” in the pasted code. Because “prom_phys_total” does not map to “prom_prom_taken” in every instance, CP-Miner would detect it as an inconsistency.

The DECKARD-based tool reports an inconsistency if the two code fragments contain different numbers of unique identifiers [111]. For Example 3 in Table 4 of Section 2.3, the DECKARD-based tool would count two instances of the identifier “l_stride” in the copied code fragment, but only one instance in the pasted code fragment. Since only one of the two instances of “l_stride” was renamed to “r_stride” in the pasted code, the DECKARD-based tool was able to find this inconsistency.

However, both CP-Miner and the DECKARD-based tool produce false positives, which need to be inspected manually in order to verify the existence of an actual bug. Similarly, the clone smell detection tool [13] also requires human intervention to determine if the detected “unusual” changes are, in fact, bugs. Instead of being retroactive in terms of bug prevention and detection, CnP provides a form of automatic bug prevention (with its CReN and LexId renaming tools) and can give warnings on code as it is being edited.
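The two consistency checks described above can be illustrated with a minimal Python sketch. It is a simplification, not the implementation of CP-Miner or of the DECKARD-based tool: it assumes each fragment has already been reduced to a list of identifier occurrences in source order, and the function names and the example sequences are hypothetical (only the identifier names come from the text).

```python
from collections import defaultdict


def inconsistent_mappings(copied_ids, pasted_ids):
    """Mapping-style check: pair identifier occurrences by position and
    return every copied identifier that maps to more than one name."""
    if len(copied_ids) != len(pasted_ids):
        raise ValueError("fragments must have the same number of identifier occurrences")
    mapping = defaultdict(set)
    for src, dst in zip(copied_ids, pasted_ids):
        mapping[src].add(dst)
    return {src: sorted(dsts) for src, dsts in mapping.items() if len(dsts) > 1}


def unique_identifier_counts_differ(copied_ids, pasted_ids):
    """Count-style check: flag the pair when the fragments contain
    different numbers of unique identifiers."""
    return len(set(copied_ids)) != len(set(pasted_ids))


# Hypothetical occurrence sequences: only one of two occurrences was renamed.
copied = ["l_stride", "l_stride"]
pasted = ["r_stride", "l_stride"]

print(inconsistent_mappings(copied, pasted))
print(unique_identifier_counts_differ(copied, pasted))
```

Both checks flag the partially renamed pair, and both would also flag some intentional differences, which is consistent with the need for manual inspection noted above.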
  • 51. Alerts – Clone Modification Notification

Clone modification notification is a new feature found in clone-related tools. Clonescape alerts programmers when they edit a clone by showing a red status line message. CloneTracker uses notifications to alert programmers when tracked clones are being modified (for example, so that they can choose to turn on the simultaneous editing feature). CPC uses notifications to warn the programmer about possible update anomalies. Clones can be marked as “ignored”, meaning that no more notifications will be generated for this particular clone. CnP lets the programmer know via visualization that a clone is being or was edited (boxes with CReN/LexId, and highlights with CSeR).

Views and Graphs

Research in the area of clone detection visualizes clones with graphs [83, 114, 228, 232] and views [14, 208, 228]. CnP provides two views: one view to list the clone detection tool results that are reported and one view to list the clones being tracked by CnP [96]. Clones that are being tracked by CnP can be either clones that have been automatically tracked since they were copied and pasted in the IDE or clones that started being tracked after they were manually imported from the clone detection tool. The LAPIS editor suggests three possible views for future work, including a bird’s-eye view, an abbreviated context view, and an “unusual matches” view. CloneTracker uses a view to list clones and clone groups. Clonescape proposes a multi-view approach, where only the one or two views of interest are automatically shown to the programmer at one time, a technique known as fisheye view, but these views are unimplemented. CPC contains a few main views, including a clone list view, tree clone view, and a clone replay view. The use of graphs and views, like markers, is an issue that all clone-related tools face. The challenge
  • 52. is to find an alternative to the separate views that programmers need to invoke and the relatively complex graphs that they need to learn and understand.

2.2.2.4. Clone Persistence

While all software tools make use of data structures (such as vectors and maps) that store information in the system’s memory while the tool is running, this information must be recorded in some way so that it can be accessed and updated when the programmer works on the source code again at a later time. Storing the clone information between programming sessions is called clone persistence.

CnP persists the information about the tracked clones between sessions in a flat database (simple text file). Specifically, it stores each clone’s location (the file name that contains the clone with the clone’s starting character position in the file and its length in number of characters) within each clone group. In addition, as part of the information needed for consistent identifier renaming within code fragments, each identifier’s location (the identifier’s starting character position in the file and its length in number of characters) within each identifier group is also stored. The information gets saved automatically whenever Eclipse quits, and loaded automatically when Eclipse starts up. This single file covers the whole workspace, not just individual projects.

CPC also persists clone information. Codelink saves the links between clones as file meta-data, making the links persistent between sessions. However, the persistent links are not robust to edits. The latest version of CloneTracker persists the clone information that it tracks for the current project. A unique feature of CPC is that it also gathers information about the copying and pasting activities in general, and it persists the full modification history of each clone in relation to its clone group.
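The kind of flat-file persistence described for CnP can be sketched as follows. This is only an illustration under stated assumptions: the text specifies what is stored (file name, starting character position, and length, organized by clone group) but not the file layout, so the tab-separated format, the `Region` type, and the function names below are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    """A clone's location: containing file, start offset, and length in characters."""
    file: str
    offset: int
    length: int


def dump_groups(groups):
    """Serialize clone groups as one tab-separated line per clone (assumed layout):
    group id, offset, length, file name."""
    lines = []
    for group_id, regions in enumerate(groups):
        for region in regions:
            lines.append(f"{group_id}\t{region.offset}\t{region.length}\t{region.file}")
    return "\n".join(lines) + ("\n" if lines else "")


def load_groups(text):
    """Rebuild the list of clone groups from the text produced by dump_groups."""
    groups = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # tolerate blank lines
        group_id, offset, length, file = line.split("\t", 3)
        groups.setdefault(int(group_id), []).append(Region(file, int(offset), int(length)))
    return [groups[key] for key in sorted(groups)]


# Hypothetical workspace state: one group of two clones, one group of three.
tracked = [
    [Region("src/A.java", 120, 48), Region("src/A.java", 400, 48)],
    [Region("src/B.java", 10, 30), Region("src/C.java", 77, 30), Region("src/C.java", 300, 30)],
]
restored = load_groups(dump_groups(tracked))
print(restored == tracked)
```

Writing the string returned by `dump_groups` to a single workspace-level file on shutdown and passing its contents to `load_groups` on startup would mirror the save and load behavior described above. Identifier groups could be stored in the same way, since they are also sets of offset and length pairs.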
  • 53. 2.2.2.5. Clone Documentation and Clone Attributes

Some people believe that clone tracking and visualization act as a form of source code documentation by themselves. Though many tools claim to “document” the clones that they are tracking or managing in this way, clone documentation is actually defined as support for additional information to be written about the clone (which forms the clone’s external attributes). Clone documentation, such as why the clone was created (for example, for hardware variation or a bug workaround [38]), generally cannot be retrieved by the system and must be added by the programmer. Clonescape and CPC define “clone classification”; however, their approaches include documenting the structural information about clones, which does not fully fit into the previous definition. Similarly, other research on the topic of clone classification [121, 123, 124] groups clones by the region that they occur in (the level of abstraction involved and the location of the clones in a file), which also does not require user intervention. Instead, the “reasoning” type of clone classification described by the authors of Clonescape [38] is consistent with the definition provided here. In addition, the clone attribute “severity”, which the programmer sets to low, medium, or high depending on whether the clone should be removed from the system, is an example of a clone attribute according to the above definition.

2.2.3. Clone Lifecycle Support

Proactive clone management must be actively done at all times during software development and maintenance, throughout a clone’s lifecycle. When designing CnP and reviewing related work, various definitions of clone properties (in the previous section) and a variety of tool support (and lack of tool support) for each stage of the clone
  • 54. lifecycle (in this section) were learned. In this dissertation, the clone lifecycle is explicitly defined (shown in Figure 7) from clone creation, through clone capture and clone editing, to clone extinction. The following subsections present a variety of tool support for the phases of a clone’s life. Table 2 (on a following page) summarizes the design and implementation details for each of the related clone tracking tools: Clonescape [38], CPC [251], Codelink [231], LAPIS [189], and CloneTracker [54, 55], as well as CnP [95, 96, 97, 100, 101, 102] (and its parts: CReN consistent identifier renaming [103], LexId consistent substring renaming [104], and CSeR clone comparison [106, 107]), and it specifically highlights the problems that CnP addresses and the related tools do not. The emphasis of these six tools, in particular, is on supporting the editing phase of the lifecycle to avoid inconsistent modifications to clones.

Figure 7: The Clone Lifecycle – Clone Creation, Clone Capture, Clone Editing, and Clone Extinction.
  • 55. Table 2: Summary of Clone Tracking Tools with their Clone Lifecycle Support
  • 56. 2.2.3.1. Clone Creation

When the concept of clone creation is considered, two questions come to mind: how were the clones created, and why were they created? The answers to these two questions determine whether or not that particular clone is tracked and supported by the clone tracking tool.

How were the clones created?

Code clones can be created in a number of ways [115], but many, if not most, clones are undoubtedly created via copying and pasting, since duplication is very easy to do with either a simple menu selection (Edit - Copy, Edit - Paste) or a keyboard shortcut (Ctrl+C, Ctrl+V). As a result, the software tool CnP, which supports copy-and-paste-induced code clones upon creation, essentially captures one of the most common kinds of clones made, and it guarantees 100% accuracy in clone “detection”, since the copying and pasting is known exactly as it happens.

Why were the clones created?

A key distinction between clones is also the reason for the clone creation. Existing research distinguishes between intentional clones (code that the programmer intended to reuse) [212, 213] and accidental clones (code that is similar due to a protocol requirement) [2]. This and most other clone-related research focuses on intentional clones, while recognizing that accidental clones do exist. To address accidental clones, tools often allow some form of user control, such as allowing the programmer to remove certain clones from those that are automatically being tracked.
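Capturing clones at creation time, together with user control over what remains tracked, can be sketched in a few lines of Python. This is a hypothetical model rather than CnP's Eclipse implementation: the `CopyPasteTracker` class, its method names, and the `Region` type are assumptions made for illustration, and the sketch ignores the offset updates that later edits would require.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    """Location of a code fragment: file, start offset, length in characters."""
    file: str
    offset: int
    length: int


class CopyPasteTracker:
    """Records clone groups from copy and paste events as they happen."""

    def __init__(self):
        self._clipboard_source = None  # region of the most recent copy
        self.groups = []               # each group is a list of Region objects

    def on_copy(self, region):
        self._clipboard_source = region

    def on_paste(self, file, offset):
        """Register the pasted fragment as a clone of the copied one."""
        source = self._clipboard_source
        if source is None:
            return None  # nothing was copied inside the editor
        pasted = Region(file, offset, source.length)
        for group in self.groups:
            if source in group:
                group.append(pasted)
                return pasted
        self.groups.append([source, pasted])
        return pasted

    def untrack(self, region):
        """User control: stop tracking a clone; drop groups left with one member."""
        for group in self.groups:
            if region in group:
                group.remove(region)
        self.groups = [group for group in self.groups if len(group) > 1]


tracker = CopyPasteTracker()
original = Region("src/A.java", 10, 5)
tracker.on_copy(original)
first = tracker.on_paste("src/A.java", 100)
second = tracker.on_paste("src/B.java", 0)
tracker.untrack(second)  # the programmer decides this clone need not be tracked
print(len(tracker.groups[0]))
```

Because the tracker observes the paste itself, no similarity analysis is needed to know that the fragments are clones, which is the sense in which capture at creation is exact.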