Similarity of Source Code in the Presence of Pervasive Modifications [SCAM'16]
1. Similarity of Source Code in the Presence of Pervasive Modifications
Chaiyong Ragkhitwetsagul, Jens Krinke, David Clark
Centre for Research on Evolution, Search and Testing (CREST)
Dept. of Computer Science, UCL, London, UK
2. Similarity of Source Code in the Presence of Pervasive Modifications — C. Ragkhitwetsagul, J. Krinke, D. Clark — CREST, UCL, UK
Pervasive Modifications
/* ORIGINAL */
private static int partition(Comparable[] a, int lo, int hi) {
    int i = lo;
    int j = hi+1;
    Comparable v = a[lo];
    while (true) {
        while (less(a[++i], v)) {
            if (i == hi) break;
        }
        while (less(v, a[--j])) {
            if (j == lo) break;
        }
        if (i >= j) break;
        exch(a, i, j);
    }
    exch(a, lo, j);
    return j;
}
/* PERVASIVELY MODIFIED CODE */
private static int partition(int[] bob, int left, int right) {
    int x = left;
    int y = right+1;
    for (;;) {
        while (less(bob[left], bob[--y]))
            if (y == left) break;
        while (less(bob[++x], bob[left]))
            if (x == right) break;
        if (x >= y) break;
        swap(bob, y, x);
    }
    swap(bob, y, left);
    return y;
}
From: https://www.princeton.edu/pr/pub/integrity/pages/plagiarism/
Pervasive Modifications
Changes affecting many locations in the whole method, file, or project
Examples: layout changes, identifier renaming, API changes, refactoring
Occur in code cloning, software plagiarism, and software evolution
But do not include (strong) code obfuscation
When source code is pervasively modified, which similarity detection techniques or tools get the most accurate results?
30 Similarity Analysers
Clone detectors: CCFinderX, iClones, Simian, NiCad, Deckard
Plagiarism detectors: JPlag, Plaggie, Sherlock, Sim
Compression: 7zncd, bzip2ncd, gzipncd, xz-ncd, icd, ncd
Others: diff, bsdiff, difflib, fuzzywuzzy, jellyfish, ngram, sklearn
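The compression-based analysers listed above (the *ncd tools) rest on normalised compression distance. The following is a minimal sketch of that idea, not the code of any tool in the study: it assumes DEFLATE (via java.util.zip.Deflater) as the compressor, and the class name NcdSketch and the 0..100 similarity mapping are illustrative assumptions.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.Deflater;

/** Sketch of normalised compression distance (NCD) with DEFLATE as the compressor. */
final class NcdSketch {

    /** Returns the number of bytes DEFLATE produces for the given input. */
    static int compressedSize(byte[] input) {
        Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
        deflater.setInput(input);
        deflater.finish();
        byte[] buffer = new byte[8192];
        int total = 0;
        while (!deflater.finished()) {
            total += deflater.deflate(buffer); // only the length matters here
        }
        deflater.end();
        return total;
    }

    /** NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)). */
    static double ncd(String x, String y) {
        int cx = compressedSize(x.getBytes(StandardCharsets.UTF_8));
        int cy = compressedSize(y.getBytes(StandardCharsets.UTF_8));
        int cxy = compressedSize((x + y).getBytes(StandardCharsets.UTF_8));
        return (cxy - Math.min(cx, cy)) / (double) Math.max(cx, cy);
    }

    /** Maps the distance onto a 0..100 similarity score (an assumed convention). */
    static double similarityPercent(String x, String y) {
        double distance = Math.min(1.0, Math.max(0.0, ncd(x, y)));
        return 100.0 * (1.0 - distance);
    }
}
```

Two near-identical files compress almost as well together as either does alone, so their distance is close to 0; unrelated files give a distance close to 1.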
Test Data Generation
Original source: InfixConverter.java, SqrtAlgorithm.java, Hanoi.java, Queens.java, MagicSquare.java
[Diagram: original source -> source obfuscator (ARTIFICE) -> compiler (javac) -> bytecode obfuscator (ProGuard) -> decompilers (Krakatau, Procyon) -> pervasively modified code, to be used in the detection phase]
Parameter Settings
Best Threshold
[Plot: F-measure (y-axis, 0.00 to 0.90) against threshold value T (x-axis, 0 to 100). The best threshold is T = 31, with F-measure = 0.8282.]
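The search for the best threshold can be sketched as a sweep over T = 0..100 that keeps the value with the highest F-measure. This is an illustrative sketch only: the class name ThresholdSweep, the rule "report a pair when its score is at least T", and the tie-breaking (lowest T wins) are assumptions, not details taken from the slides.

```java
/** Sketch: pick the similarity threshold T in 0..100 that maximises F-measure. */
final class ThresholdSweep {

    /**
     * F-measure when every pair with score >= t is reported as similar.
     * scores[i] is a 0..100 similarity value; similar[i] is the ground truth.
     */
    static double fMeasure(double[] scores, boolean[] similar, int t) {
        int tp = 0, fp = 0, fn = 0;
        for (int i = 0; i < scores.length; i++) {
            boolean reported = scores[i] >= t; // assumed decision rule
            if (reported && similar[i]) tp++;
            else if (reported) fp++;
            else if (similar[i]) fn++;
        }
        // F = 2TP / (2TP + FP + FN); defined as 0 when nothing is found
        return tp == 0 ? 0.0 : 2.0 * tp / (2.0 * tp + fp + fn);
    }

    /** Returns the lowest threshold that achieves the maximum F-measure. */
    static int bestThreshold(double[] scores, boolean[] similar) {
        int best = 0;
        double bestF = -1.0;
        for (int t = 0; t <= 100; t++) {
            double f = fMeasure(scores, similar, t);
            if (f > bestF) {
                bestF = f;
                best = t;
            }
        }
        return best;
    }
}
```

Plotting fMeasure for each T would give a curve of the same kind as the one on this slide.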
Optimal Configuration
Best Threshold, Best Parameter Settings
Results
Tool          Settings     T   Acc     Prec    Rec     AUC     Prec@n  F1
ccfx          b=20,t=1     4   0.9640  0.9145  0.9040  0.9468  0.9040  0.9095
simjava       r=22         5   0.9568  0.8769  0.9120  0.9490  0.8840  0.8941
jplag-text    t=8          2   0.9408  0.8235  0.8960  0.9453  0.8440  0.8582
py-difflib    noautojunk   35  0.9392  0.8901  0.7940  0.9147  0.8080  0.8393
7zncd-BZip2   mx=1         39  0.9368  0.8977  0.7720  0.9419  0.8180  0.8301
ncd-bzlib     -            31  0.9336  0.8584  0.8000  0.9482  0.8200  0.8282
jplag-java    t=3          43  0.9160  0.7526  0.8640  0.9667  0.7860  0.8045
py-sklearn    -            33  0.8488  0.5894  0.8040  0.9146  0.6200  0.6802
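The Acc, Prec, Rec and F1 columns read as the usual classification metrics over file pairs. A minimal sketch of how such values follow from confusion counts is given below; the class name PairMetrics is hypothetical, the counts in the usage note are made up for illustration, and AUC and Prec@n are not covered.

```java
/** Sketch: standard metrics from pairwise confusion counts. */
final class PairMetrics {
    final long tp; // similar pairs reported as similar
    final long fp; // dissimilar pairs reported as similar
    final long tn; // dissimilar pairs reported as dissimilar
    final long fn; // similar pairs reported as dissimilar

    PairMetrics(long tp, long fp, long tn, long fn) {
        this.tp = tp;
        this.fp = fp;
        this.tn = tn;
        this.fn = fn;
    }

    /** Fraction of all pairs that were classified correctly. */
    double accuracy() {
        return (tp + tn) / (double) (tp + fp + tn + fn);
    }

    /** Fraction of reported pairs that are truly similar. */
    double precision() {
        return tp / (double) (tp + fp);
    }

    /** Fraction of truly similar pairs that were reported. */
    double recall() {
        return tp / (double) (tp + fn);
    }

    /** Harmonic mean of precision and recall. */
    double f1() {
        double p = precision();
        double r = recall();
        return 2.0 * p * r / (p + r);
    }
}
```

With the made-up counts new PairMetrics(8, 2, 88, 2), accuracy is 0.96 and precision, recall and F1 are each 0.8.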
Highly specialised source code similarity detection techniques and tools can perform better than more general, textual similarity measures.
Normalisation by Decompilation
[Diagram: pervasively modified code -> compile (javac) -> decompile (Krakatau, Procyon) -> normalised code]
[Figure: Code Before Decompilation]
[Figure: Code After Decompilation]
Compilation and decompilation can be used as an effective normalisation method that greatly improves similarity detection on Java source code.
More info: http://crest.cs.ucl.ac.uk/resources/cloplag/