Automatically Generated Patches as Debugging Aids: A Human Study (FSE 2014)

Yida Tao, Jindae Kim, Sunghun Kim
Dept. of CSE, The Hong Kong University of Science and Technology
Chang Xu
State Key Lab for Novel Software Technology, Nanjing University

Automatic Program Repair
• Promising research progress
  • ClearView [1]: prevented all 10 Firefox exploits
  • GenProg [2]: fixed 55 of 105 real bugs
[1] Perkins et al. Automatically Patching Errors in Deployed Software. SOSP’09
[2] Le Goues et al. A Systematic Study of Automated Program Repair: Fixing 55 out of 105 Bugs for $8 Each. ICSE’12

“It won't get your bug patched any quicker. You’ll just have shifted the coders' attention away from their own app's bugs, and onto the repair tool’s bugs.”
- Slashdot discussion: http://science.slashdot.org/story/09/10/29/2248246/Fixing-Bugs-But-Bypassing-the-Source-Code

#what-could-possibly-go-wrong #program-out-of-control
• Blackbox repair
• Increasing maintenance cost
• Vulnerable to attack
- A human study of patch maintainability. ISSTA’12
- Automatic patch generation learned from human-written patches. ICSE’13

Use automatically generated patches as debugging aids

Our Human Study
• Investigate the usefulness of generated patches as debugging aids
• Discuss the impact of patch quality on debugging performance
• Explore practitioners’ feedback on adopting automatic program repair

A debugging aid is given to participants, who then debug the bugs.
Debugging aids:
• Low-quality generated patch
• High-quality generated patch
• Buggy method location

95 Participants
• Grad: 44 CS graduate students
• MTurk: 23 Amazon Mechanical Turk workers
• Engr: 28 industrial software engineers

44 Graduate students
• Between-group design: three groups of 14, 15, and 15 students, each given one debugging aid (low-quality generated patch, high-quality generated patch, or buggy method location)
• Onsite setting
• Eclipse IDE
• Supervised session

Remote participants (28 Engr + 23 MTurk)
• Within-group design
• Online debugging system
• Debugging aids: low-quality generated patch, high-quality generated patch, buggy method location

Bug Selection Criteria
• Real bugs
• The bug has accepted patches written by developers
• Proper number of bugs
• The bug has generated patches of differing quality

Automatic patch generation learned from human-written patches.
Kim et al. ICSE’13

Example: auto-generated patches A and B (code shown on the slide)

    for (int i = 0; i < parenCount; i++) {
        SubString sub = (SubString) parens.get(i);
        if (sub != null) {
            args[i+1] = sub.toString();
        }
    }
    args[parenCount+1] =
        new Integer(reImpl.leftContext.length);

Average ranking from 85 developers and students (lower is better):
• Auto-generated patch A (high-quality patch): 1.6
• Auto-generated patch B (low-quality patch): 2.8
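The null guard in the example can be exercised in isolation. The sketch below is a simplified, runnable analogue of the loop on the slide, not Rhino's actual code: `String` stands in for Rhino's `SubString`, `parens` is assumed to be a `List` that may contain `null` entries, the array layout is assumed, and the names `NullGuardSketch`, `buildArgs`, and `leftContextLength` are hypothetical. Without the `if (sub != null)` check, a `null` entry would make `sub.toString()` throw a `NullPointerException`.

```java
import java.util.List;

class NullGuardSketch {
    // Hypothetical analogue of the patched loop: copy each non-null group into
    // args[i + 1], then store the left-context length in the last slot.
    // String stands in for Rhino's SubString (an assumption for illustration).
    static Object[] buildArgs(List<String> parens, int leftContextLength) {
        int parenCount = parens.size();
        // Assumed layout: slot 0 is filled elsewhere, slots 1..parenCount hold
        // the groups, and slot parenCount + 1 holds the length.
        Object[] args = new Object[parenCount + 2];
        for (int i = 0; i < parenCount; i++) {
            String sub = parens.get(i);
            if (sub != null) {  // the guard: skip null groups instead of crashing
                args[i + 1] = sub.toString();
            }
        }
        // Integer.valueOf is used here instead of the deprecated new Integer(...)
        args[parenCount + 1] = Integer.valueOf(leftContextLength);
        return args;
    }
}
```

For example, `buildArgs(Arrays.asList("a", null, "c"), 5)` returns `[null, "a", null, "c", 5]`: the null group is simply left unset rather than raising an exception.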

Participants submitted 337 patches as their debugging outcome

Number of submitted patches per debugging aid:
• Location: 109
• LowQ: 112
• HighQ: 116

Number of submitted patches per bug:
• Bug1: 66
• Bug2: 74
• Bug3: 59
• Bug4: 76
• Bug5: 62
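Both breakdowns should account for the same 337 submissions. A small sketch that re-adds the reported counts (the class and method names are illustrative only; the counts are those listed on the slide):

```java
import java.util.LinkedHashMap;
import java.util.Map;

class SubmissionTotals {
    // Counts reported on the slide, per debugging aid and per bug.
    static final Map<String, Integer> BY_AID = new LinkedHashMap<>();
    static final Map<String, Integer> BY_BUG = new LinkedHashMap<>();

    static {
        BY_AID.put("Location", 109);
        BY_AID.put("LowQ", 112);
        BY_AID.put("HighQ", 116);

        BY_BUG.put("Bug1", 66);
        BY_BUG.put("Bug2", 74);
        BY_BUG.put("Bug3", 59);
        BY_BUG.put("Bug4", 76);
        BY_BUG.put("Bug5", 62);
    }

    // Sum of all counts in one breakdown.
    static int total(Map<String, Integer> counts) {
        return counts.values().stream().mapToInt(Integer::intValue).sum();
    }
}
```

Both `total(BY_AID)` (109 + 112 + 116) and `total(BY_BUG)` (66 + 74 + 59 + 76 + 62) come to 337.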

Evaluation of debugging performance

Patch Correctness
• Passing test cases
• Matching the semantics of original accepted patches
• 3 evaluators

Debugging Time
• Eclipse plug-in
• Website timer

Multiple Regression Analysis
• Dependent variables: correctness, debugging time
• Independent variables
  • Debugging aids
  • Bugs
  • Participant types
  • Programming experience

correctness = α0 + α1 ∙ x1 + α2 ∙ x2 + α3 ∙ x3 + α4 ∙ x4
debugging time = β0 + β1 ∙ x1 + β2 ∙ x2 + β3 ∙ x3 + β4 ∙ x4
where x1 to x4 are the four independent variables.
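As a rough illustration of how coefficients such as α1 to α4 can be estimated for a linear model of this form, the sketch below fits ordinary least squares by solving the normal equations (XᵀX)b = Xᵀy with Gauss-Jordan elimination. It is an assumption-laden sketch, not the study's analysis: the slides do not say which estimator or tool was used, and categorical predictors such as debugging aid or participant type would first have to be encoded as numeric (for example 0/1 indicator) columns. The names `OlsSketch` and `fit` are hypothetical.

```java
class OlsSketch {
    // Fits y = b0 + b1*x1 + ... + bk*xk by ordinary least squares.
    // x[r] holds the k predictor values of observation r; returns {b0, b1, ..., bk}.
    static double[] fit(double[][] x, double[] y) {
        int n = x.length;
        int p = x[0].length + 1;              // +1 column for the intercept
        double[][] a = new double[p][p + 1];  // augmented matrix [X^T X | X^T y]
        for (int r = 0; r < n; r++) {
            double[] row = new double[p];
            row[0] = 1.0;                     // intercept term
            System.arraycopy(x[r], 0, row, 1, p - 1);
            for (int i = 0; i < p; i++) {
                for (int j = 0; j < p; j++) {
                    a[i][j] += row[i] * row[j];
                }
                a[i][p] += row[i] * y[r];
            }
        }
        // Gauss-Jordan elimination with partial pivoting (assumes X^T X is invertible).
        for (int col = 0; col < p; col++) {
            int pivot = col;
            for (int r = col + 1; r < p; r++) {
                if (Math.abs(a[r][col]) > Math.abs(a[pivot][col])) {
                    pivot = r;
                }
            }
            double[] tmp = a[col];
            a[col] = a[pivot];
            a[pivot] = tmp;
            // Zero out this column in every other row.
            for (int r = 0; r < p; r++) {
                if (r == col) continue;
                double factor = a[r][col] / a[col][col];
                for (int c = col; c <= p; c++) {
                    a[r][c] -= factor * a[col][c];
                }
            }
        }
        double[] b = new double[p];
        for (int i = 0; i < p; i++) {
            b[i] = a[i][p] / a[i][i];
        }
        return b;
    }
}
```

With toy data generated from y = 1 + 2x1 + 3x2, `fit(new double[][]{{0,0},{1,0},{0,1},{1,1},{2,1}}, new double[]{1,3,4,6,8})` recovers approximately `{1.0, 2.0, 3.0}`. For a yes/no outcome such as correctness, a logistic model is a common alternative; the slides do not say which was used.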

Post-study Survey
• Helpfulness of debugging aids
• Difficulty of bugs
• Opinions on using generated patches as debugging aids

1. High-quality patches significantly improve debugging correctness
% of correct patches: Location 48%, LowQ 33%, HighQ 71%
Positive coefficient (HighQ vs. Location) = 1.25, p-value = 0.00 < 0.05

2. Low-quality patches slightly undermine debugging correctness
% of correct patches: Location 48%, LowQ 33%, HighQ 71%
Negative coefficient (LowQ vs. Location) = -0.55, p-value = 0.09 (not significant at the 0.05 level)

3. High-quality patches are more useful for difficult bugs
[Figures: bug difficulty per bug; % of correct patches per bug for Location, LowQ, and HighQ]
Bugs:
• Bug1: Math-280
• Bug2: Rhino-114493
• Bug3: Rhino-192226
• Bug4: Rhino-217379
• Bug5: Rhino-76683

4. The type of debugging aid does not affect debugging time
[Figure: debugging time (min) for Location, LowQ, and HighQ]

5. Other factors’ impact on debugging performance
• Difficult bugs significantly slow down debugging
• Engr and MTurk are more likely than Grad to debug correctly
• Novices tend to benefit more from HighQ patches

6. Participants consider high-quality generated patches much more helpful than low-quality patches
[Figure: helpfulness of debugging aids (Very helpful, Helpful, Medium, Slightly helpful, Not helpful) for low-quality and high-quality generated patches]
Mann-Whitney U test: p-value = 0.001
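The Mann-Whitney U test compares two independent groups of ordinal ratings by ranks rather than means. The sketch below computes the U statistic for the first group (tied values share their mean rank) and a large-sample z-score. It assumes the helpfulness levels are encoded as numbers (for example, 1 = Not helpful up to 5 = Very helpful), omits the tie correction to the variance, and does not compute a p-value; it is not the study's analysis code, and the names are hypothetical.

```java
import java.util.Arrays;

class MannWhitneySketch {
    // U statistic for sample a: U_a = R_a - n_a(n_a + 1)/2, where R_a is the
    // sum of a's ranks in the pooled sample (tied values share their mean rank).
    static double uStatistic(double[] a, double[] b) {
        double[] pooled = new double[a.length + b.length];
        System.arraycopy(a, 0, pooled, 0, a.length);
        System.arraycopy(b, 0, pooled, a.length, b.length);
        Arrays.sort(pooled);
        double rankSumA = 0;
        for (double v : a) {
            rankSumA += averageRank(pooled, v);
        }
        return rankSumA - a.length * (a.length + 1) / 2.0;
    }

    // Mean 1-based rank of value v in a sorted array.
    static double averageRank(double[] sorted, double v) {
        int first = -1;
        int last = -1;
        for (int i = 0; i < sorted.length; i++) {
            if (sorted[i] == v) {
                if (first < 0) first = i;
                last = i;
            }
        }
        return (first + last) / 2.0 + 1;
    }

    // Normal-approximation z-score for U (no tie correction).
    static double zScore(double u, int n1, int n2) {
        double mean = n1 * n2 / 2.0;
        double sd = Math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0);
        return (u - mean) / sd;
    }
}
```

With hypothetical ratings a = {1, 2, 3} and b = {4, 5, 6}, `uStatistic(a, b)` is 0.0 (complete separation) and `uStatistic(b, a)` is 9.0; in general U_a + U_b = n1 · n2.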

Quick starting point
• Point to the buggy area
• Brainstorm
“They would seem to be useful in helping find various ideas around fixing the issue, even if the patch isn’t always correct on its own.”

Confusing, incomplete, misleading
• Wrong lead, especially for novices
• Require further refinement by humans

“Generated patches would be good at recognizing obvious problems” “…but may not recognize more involved defects.”
“Generated patches simplify the problem” “…but they may over-simplify it by not addressing the root cause.”

“I would use generated patches as debugging aids, as they provide extra diagnostic information” “…along with access to standard debugging tools.”

Threats to Validity
• Bugs and generated patches may not be representative
• Quality measure of generated patches may not generalize
• Results may not generalize to domain experts
• Possibility of blindly reusing generated patches
  • Mitigation: patches submitted in less than 1 minute were removed
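The mitigation against blind reuse amounts to a time filter on submissions. A minimal sketch, assuming each submission records when debugging started and when the patch was submitted (the `Submission` class, its fields, and `keepDeliberate` are hypothetical; the slides only state the 1-minute threshold). A submission that took exactly one minute is kept, since only those under one minute are removed.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.List;
import java.util.stream.Collectors;

class QuickSubmissionFilter {
    // Hypothetical record of one submitted patch and its debugging session.
    static final class Submission {
        final String id;
        final Instant start;
        final Instant end;

        Submission(String id, Instant start, Instant end) {
            this.id = id;
            this.start = start;
            this.end = end;
        }

        Duration debuggingTime() {
            return Duration.between(start, end);
        }
    }

    static final Duration MIN_TIME = Duration.ofMinutes(1);

    // Keeps only submissions whose debugging time is at least one minute.
    static List<Submission> keepDeliberate(List<Submission> all) {
        return all.stream()
                .filter(s -> s.debuggingTime().compareTo(MIN_TIME) >= 0)
                .collect(Collectors.toList());
    }
}
```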

Takeaway
• Auto-generated patches can be useful as debugging aids
  • Participants fix bugs more correctly with auto-generated patches
• Quality control is required
  • Participants’ debugging correctness is compromised with low-quality generated patches
• Maximize the benefits
  • Difficult bugs
  • Novice developers
