SlideShare a Scribd company logo
1 of 52
Deduplication of large
amounts of code
Romain Keramitas
FOSDEM 2019
Clones
def foo(name: str):
print('Hello World, my name is ' + name)
def bar(name: str):
print('Hello World, my name is {}'.format(name))
def baz(name: str):
print('Hello World, my name is %s' % name)
I am so happy to speak at FOSDEM this year
since it's a really awesome conference !
Natural language clones
I will be speaking at FOSDEM this year, it makes me so happy
because that conference is really awesome !
I am delighted to speak at FOSDEM this year
since it's a really amazing conference !
Natural language clones
I will be speaking at FOSDEM this year, it makes me so happy
because that conference is really awesome !
Clone Taxonomy (Kapser; Roy and Cordy )
Type I : exactly the same
Type II : structurally the same, syntactical differences
Type III : combination of type I & II and minor structural changes
Type IV : semantically the same, different structure and syntax
Type IV clone
def foo(n: int):
k = 0
for i in range(n):
k += i
return k
def bar(m: int):
counter, l = 0, 0
while counter < m:
counter += 1
l+= counter
return l
Déjà Vu approach (Lopes et al.)
def foo(n: int):
k = 0
for i in range(n):
k += i
return k
[(def, 1), (foo, 1), (n,
2), (int, 1), (k, 3),
(0, 1), (for, 1), (i, 2),
(in, 1),
(range, 1), (return, 1)]
File hash
7d02b25e38eadb33e9f96d771e1844a6
Token hash
f2ea1bb5a6208b5d4f2c930a0d7042d6
SourcererCC
Gemini approach
1. Extract syntactical and structural features from each snippet
Features Matrix M:
N x M values
Mi,j = importance of feature j
for snippet i (0 if none)
Dataset
N snippets
Gemini approach
2. Create a pairwise similarity graph between snippets,
by hashing the feature matrix
Pairwise Similarity Graph
Graph with N nodes
nodes = snippets
edge = similarity
Dataset
N snippets
Features
NxM values
Gemini approach
3. Extract connected components from the similarity graph
Connected Components
Graphs where each pair of nodes
is connected by paths
Dataset
N snippets
Features
NxM values
Similarity
Graph
N nodes
Gemini approach
4. Perform community detection on each components to obtain
clone communities
Dataset
N snippets
Clone
Communities
Parts of components
where each node is a
clone
Features
NxM values
Similarity
Graph
N nodes
Connected
Components
Feature Extraction Step
Dataset
N snippets
Features
NxM values
Similarity
Graph
N nodes
Connected
Components
Clone
Communities
Abstract Syntax Trees
def foo(x):
return x + 42
def
foo x
x
+
return
42
Abstract Syntax Trees
def(decl)
foo(id) x(id)
x(id)
+(op)
return(stat
)
42(lit)
def foo(x):
return x + 42
Identifiers and Literals
def foo(n: int):
k = 0
for i in range(n):
k += i
return k + 42
Identifiers:
[(foo, 1),(n, 2),(k, 3),(i, 2)]
Literals:
[(0, 1), (42, 1)]
Graphlets and Children
Graphlets:
[(decl,[id,id,statement]),
(id, []),(id, []),
(statement,[op.]),(op,[id,lit])
(id,[]),(lit,[])]
Children:
[(decl,3), (id, 0),(id, 0),
(statement,1),(op,2)(id,0),
(lit,0)]
def(decl)
foo(id) x(id)
x(id)
+(op)
return(stat
)
42(lit)
Graphlets and Children
Graphlets:
[((decl,[id,id,statement]),1),
((id, []),3),
((statement,[op.]),1)
((op,[id,lit]),1),((lit,[]),1)]
Children:
[((decl,3),1),((id, 0),3),
((statement,1),1),((op,2),1),
((lit,0),1)]
def
foo x
x
+
return
42
TF-IDF
wx,y = tfx,y × log ( N / dfx )
tfx,y = frequency of feature x in snippet y
dfx = number of snippets containing feature x
N = total number of snippets
Feature extraction step
1. Convert snippets to UAST using Babelfish
2. Extract weighted bags of features from each UAST
3. Perform TF-IDF to reduce the amount of features
4. Find weights for each feature type (hyperparameters)
Hashing Step
Dataset
N snippets
Features
NxM values
Similarity
Graph
N nodes
Connected
Components
Clone
Communities
Similarity between weighted bags ?
Problem: Computing similarity between each pair of snippet
is not viable
1 million snippets 499,998,500,001 similarities
428 million snippets 91,591,999,358,000,000 similarities
Weighted Jaccard Similarity
S = set of all features
A , B = subsets of S
Minhashing
1. Create perm(A) and perm(B)
2. Hash all elements in perm(A) and perm(B)
3. Select the smallest hash for A and B
4. Probability it's the same value equals J(A, B) !
Minhash signatures
We take the k smallest hash values for each snippet:
For each pair of snippets, the probability they have the same value in
any row is still equal to their Jaccard similarity !
1 0 0 0 2
3 2 1 2 3
5 6 3 6 4
... ...
... ... ... k rows
... ......
Locality-Sensitive Hashing
We divide the signature matrix into b bands of r rows
candidate pair = any snippets which have the same values in at
least one band.
1 0 0 0 2
3 2 1 2 3
5 6 3 6 4
... ...
... ... ...
r rows
... ......
r × b = k rows
Locality-Sensitive Hashing
s = similarity between the two snippets
1. Probability the signatures are the same in one band: sr
Locality-Sensitive Hashing
s = similarity between the two snippets
1. Probability the signatures are the same in one band: sr
2. Probability the signatures are different in one band: 1 - sr
Locality-Sensitive Hashing
s = similarity between the two snippets
1. Probability the signatures are the same in one band: sr
2. Probability the signatures are different in one band: 1 - sr
3. Probability the signatures are different in all bands: (1 - sr)b
Locality-Sensitive Hashing
s = similarity between the two snippets
1. Probability the signatures are the same in one band: sr
2. Probability the signatures are different in one band: 1 - sr
3. Probability the signatures are different in all bands: (1 - sr)b
4. Probability the signatures are the same in at least one band,
i.e. the snippets are a candidate pair: 1 - (1 - sr)b
Locality-Sensitive Hashing
Regardless of r and b, we get an S-curve:
Locality-Sensitive Hashing
If we choose r and b well, we can get this:
Computing the similarity graph
1. Create the signatures for each snippet
2. Select the threshold from which we decide that snippets are
clones, deduce r and b, then for each band hash
sub-signatures into buckets
3. Snippets that land in at least one common bucket are the
candidates
Connected Components Step
Dataset
N snippets
Features
NxM values
Similarity
Graph
N nodes
Connected
Components
Clone
Communities
Extracting connected components
Community Detection Step
Dataset
N snippets
Features
NxM values
Similarity
Graph
N nodes
Connected
Components
Clone
Communities
Applying community detection
Public Git Archive
182,014 projects (repos on GitHub with over 50 stars)
3 TB of code
Commit history included
PGA HEAD = ~54.5 million files
With Babelfish driver = ~14.1 million files
Post-feature extraction
7,869,917 million files processed
102,114 project
6,194,874 distinct features
Features analysis
Feature distribution:
• identifiers: 9.93 %
• literals: 65.7 %
• graphlets: 11.84 %
• children: 0.02 %
• uast2seq (DFS) : 10.49%
• node2vec (random walk) : 2.02 %
Features analysis
Average number of features per file:
• 60 identifiers
• 38 literals
• 116 graphlets
• 37 children
• 336 uast2seq (DFS)
• 460 node2vec (random walk)
Connected components analysis
95 % threshold:
• 4,270,967 CCs
• 52.4 % of files in CCs with > 1 file
• 7.83 file per CC with > 1 file
80 % threshold:
• 3,551,648 CCs
• 61.9 % of files in CCs with > 1 file
• 8.79 file per CC with > 1 file
Connected components analysis
Connected components analysis
3,551,648
Communities analysis
Detection done with Walktrap algorithm
95 % threshold:
• 526,715 CCs with > 1 file
• In those, we detected 666,692 communities
80 % threshold:
• 553,997 CCs with > 1 file
• In those, we detected 918,333 communities
text_parser.go
422 files - 327 projects - 1 filename
2344 files - 1058 projects - 584 filename
filenames
filenames
Microsoft Azure SDK
4803 files - 3 projects - 2954 filename
Thank you !
For more:
blog.sourced.tech/post/deduplicating_pga_with_apollo
github.com/src-d/gemini
pga.sourced.tech/
r.keramitas@gmail.com

More Related Content

What's hot

Hacking Go Compiler Internals / GoCon 2014 Autumn
Hacking Go Compiler Internals / GoCon 2014 AutumnHacking Go Compiler Internals / GoCon 2014 Autumn
Hacking Go Compiler Internals / GoCon 2014 Autumn
Moriyoshi Koizumi
 
Babar: Knowledge Recognition, Extraction and Representation
Babar: Knowledge Recognition, Extraction and RepresentationBabar: Knowledge Recognition, Extraction and Representation
Babar: Knowledge Recognition, Extraction and Representation
Pierre de Lacaze
 

What's hot (15)

LEX & YACC TOOL
LEX & YACC TOOLLEX & YACC TOOL
LEX & YACC TOOL
 
Huffman
HuffmanHuffman
Huffman
 
Hacking Go Compiler Internals / GoCon 2014 Autumn
Hacking Go Compiler Internals / GoCon 2014 AutumnHacking Go Compiler Internals / GoCon 2014 Autumn
Hacking Go Compiler Internals / GoCon 2014 Autumn
 
Workshop on command line tools - day 1
Workshop on command line tools - day 1Workshop on command line tools - day 1
Workshop on command line tools - day 1
 
Python strings presentation
Python strings presentationPython strings presentation
Python strings presentation
 
Locality sensitive hashing
Locality sensitive hashingLocality sensitive hashing
Locality sensitive hashing
 
Babar: Knowledge Recognition, Extraction and Representation
Babar: Knowledge Recognition, Extraction and RepresentationBabar: Knowledge Recognition, Extraction and Representation
Babar: Knowledge Recognition, Extraction and Representation
 
The Next Great Functional Programming Language
The Next Great Functional Programming LanguageThe Next Great Functional Programming Language
The Next Great Functional Programming Language
 
M3 search
M3 searchM3 search
M3 search
 
Learning python
Learning pythonLearning python
Learning python
 
Learn python in 20 minutes
Learn python in 20 minutesLearn python in 20 minutes
Learn python in 20 minutes
 
SHA-1 backdooring & exploitation
SHA-1 backdooring & exploitationSHA-1 backdooring & exploitation
SHA-1 backdooring & exploitation
 
pairs
pairspairs
pairs
 
String in python use of split method
String in python use of split methodString in python use of split method
String in python use of split method
 
Aspects of software naturalness through the generation of IdentifierNames
Aspects of software naturalness through the generation of IdentifierNamesAspects of software naturalness through the generation of IdentifierNames
Aspects of software naturalness through the generation of IdentifierNames
 

Similar to Deduplication on large amounts of code

Pharo, an innovative and open-source Smalltalk
Pharo, an innovative and open-source SmalltalkPharo, an innovative and open-source Smalltalk
Pharo, an innovative and open-source Smalltalk
Serge Stinckwich
 
C interview-questions-techpreparation
C interview-questions-techpreparationC interview-questions-techpreparation
C interview-questions-techpreparation
sonu sharma
 
C interview-questions-techpreparation
C interview-questions-techpreparationC interview-questions-techpreparation
C interview-questions-techpreparation
Kgr Sushmitha
 
Declare Your Language (at DLS)
Declare Your Language (at DLS)Declare Your Language (at DLS)
Declare Your Language (at DLS)
Eelco Visser
 
cryptography and network security chap 3
cryptography and network security chap 3cryptography and network security chap 3
cryptography and network security chap 3
Debanjan Bhattacharya
 

Similar to Deduplication on large amounts of code (20)

Finding similar items in high dimensional spaces locality sensitive hashing
Finding similar items in high dimensional spaces  locality sensitive hashingFinding similar items in high dimensional spaces  locality sensitive hashing
Finding similar items in high dimensional spaces locality sensitive hashing
 
Дмитрий Селиванов, OK.RU. Finding Similar Items in high-dimensional spaces: L...
Дмитрий Селиванов, OK.RU. Finding Similar Items in high-dimensional spaces: L...Дмитрий Селиванов, OK.RU. Finding Similar Items in high-dimensional spaces: L...
Дмитрий Селиванов, OK.RU. Finding Similar Items in high-dimensional spaces: L...
 
Text classification using Text kernels
Text classification using Text kernelsText classification using Text kernels
Text classification using Text kernels
 
Distributed Coordination
Distributed CoordinationDistributed Coordination
Distributed Coordination
 
Pharo, an innovative and open-source Smalltalk
Pharo, an innovative and open-source SmalltalkPharo, an innovative and open-source Smalltalk
Pharo, an innovative and open-source Smalltalk
 
C Interview Questions for Fresher
C Interview Questions for FresherC Interview Questions for Fresher
C Interview Questions for Fresher
 
C interview-questions-techpreparation
C interview-questions-techpreparationC interview-questions-techpreparation
C interview-questions-techpreparation
 
C interview-questions-techpreparation
C interview-questions-techpreparationC interview-questions-techpreparation
C interview-questions-techpreparation
 
C interview Question and Answer
C interview Question and AnswerC interview Question and Answer
C interview Question and Answer
 
Efficient Parallel Set-Similarity Joins Using Hadoop__HadoopSummit2010
Efficient Parallel Set-Similarity Joins Using Hadoop__HadoopSummit2010Efficient Parallel Set-Similarity Joins Using Hadoop__HadoopSummit2010
Efficient Parallel Set-Similarity Joins Using Hadoop__HadoopSummit2010
 
I1
I1I1
I1
 
Nn
NnNn
Nn
 
Manipulating string data with a pattern in R
Manipulating string data with  a pattern in RManipulating string data with  a pattern in R
Manipulating string data with a pattern in R
 
Pune Clojure Course Outline
Pune Clojure Course OutlinePune Clojure Course Outline
Pune Clojure Course Outline
 
The Curious Clojurist - Neal Ford (Thoughtworks)
The Curious Clojurist - Neal Ford (Thoughtworks)The Curious Clojurist - Neal Ford (Thoughtworks)
The Curious Clojurist - Neal Ford (Thoughtworks)
 
Declare Your Language (at DLS)
Declare Your Language (at DLS)Declare Your Language (at DLS)
Declare Your Language (at DLS)
 
Ctcompare: Comparing Multiple Code Trees for Similarity
Ctcompare: Comparing Multiple Code Trees for SimilarityCtcompare: Comparing Multiple Code Trees for Similarity
Ctcompare: Comparing Multiple Code Trees for Similarity
 
Cryptography and Network Security William Stallings Lawrie Brown
Cryptography and Network Security William Stallings Lawrie BrownCryptography and Network Security William Stallings Lawrie Brown
Cryptography and Network Security William Stallings Lawrie Brown
 
cryptography and network security chap 3
cryptography and network security chap 3cryptography and network security chap 3
cryptography and network security chap 3
 
Declarative Thinking, Declarative Practice
Declarative Thinking, Declarative PracticeDeclarative Thinking, Declarative Practice
Declarative Thinking, Declarative Practice
 

More from source{d}

More from source{d} (15)

Overton, Apple Flavored ML
Overton, Apple Flavored MLOverton, Apple Flavored ML
Overton, Apple Flavored ML
 
Unlocking Engineering Observability with advanced IT analytics
Unlocking Engineering Observability with advanced IT analyticsUnlocking Engineering Observability with advanced IT analytics
Unlocking Engineering Observability with advanced IT analytics
 
What's new in the latest source{d} releases!
What's new in the latest source{d} releases!What's new in the latest source{d} releases!
What's new in the latest source{d} releases!
 
Code as Data workshop: Using source{d} Engine to extract insights from git re...
Code as Data workshop: Using source{d} Engine to extract insights from git re...Code as Data workshop: Using source{d} Engine to extract insights from git re...
Code as Data workshop: Using source{d} Engine to extract insights from git re...
 
Gitbase, SQL interface to Git repositories
Gitbase, SQL interface to Git repositoriesGitbase, SQL interface to Git repositories
Gitbase, SQL interface to Git repositories
 
Assisted code review with source{d} lookout
Assisted code review with source{d} lookoutAssisted code review with source{d} lookout
Assisted code review with source{d} lookout
 
Machine Learning on Code - SF meetup
Machine Learning on Code - SF meetupMachine Learning on Code - SF meetup
Machine Learning on Code - SF meetup
 
Inextricably linked reproducibility and productivity in data science and ai ...
Inextricably linked reproducibility and productivity in data science and ai  ...Inextricably linked reproducibility and productivity in data science and ai  ...
Inextricably linked reproducibility and productivity in data science and ai ...
 
source{d} Engine - your code as data
source{d} Engine - your code as datasource{d} Engine - your code as data
source{d} Engine - your code as data
 
Introduction to the source{d} Stack
Introduction to the source{d} Stack Introduction to the source{d} Stack
Introduction to the source{d} Stack
 
source{d} Engine: Exploring git repos with SQL
source{d} Engine: Exploring git repos with SQLsource{d} Engine: Exploring git repos with SQL
source{d} Engine: Exploring git repos with SQL
 
Introduction to source{d} Engine and source{d} Lookout
Introduction to source{d} Engine and source{d} Lookout Introduction to source{d} Engine and source{d} Lookout
Introduction to source{d} Engine and source{d} Lookout
 
Machine learning on Go Code
Machine learning on Go CodeMachine learning on Go Code
Machine learning on Go Code
 
Improving go-git performance
Improving go-git performanceImproving go-git performance
Improving go-git performance
 
Machine learning on source code
Machine learning on source codeMachine learning on source code
Machine learning on source code
 

Recently uploaded

+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
?#DUbAI#??##{{(☎️+971_581248768%)**%*]'#abortion pills for sale in dubai@
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
vu2urc
 

Recently uploaded (20)

ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesHTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation Strategies
 
Developing An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilDeveloping An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of Brazil
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Tech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfTech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdf
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 

Deduplication on large amounts of code

  • 1. Deduplication of large amounts of code Romain Keramitas FOSDEM 2019
  • 2. Clones def foo(name: str): print('Hello World, my name is ' + name) def bar(name: str): print('Hello World, my name is {}'.format(name)) def baz(name: str): print('Hello World, my name is %s' % name)
  • 3. I am so happy to speak at FOSDEM this year since it's a really awesome conference ! Natural language clones I will be speaking at FOSDEM this year, it makes me so happy because that conference is really awesome !
  • 4. I am delighted to speak at FOSDEM this year since it's a really amazing conference ! Natural language clones I will be speaking at FOSDEM this year, it makes me so happy because that conference is really awesome !
  • 5. Clone Taxonomy (Kapser; Roy and Cordy ) Type I : exactly the same Type II : structurally the same, syntactical differences Type III : combination of type I & II and minor structural changes Type IV : semantically the same, different structure and syntax
  • 6. Type IV clone def foo(n: int): k = 0 for i in range(n): k += i return k def bar(m: int): counter, l = 0, 0 while counter < m: counter += 1 l+= counter return l
  • 7. Déjà Vu approach (Lopes et al.) def foo(n: int): k = 0 for i in range(n): k += i return k [(def, 1), (foo, 1), (n, 2), (int, 1), (k, 3), (0, 1), (for, 1), (i, 2), (in, 1), (range, 1), (return, 1)] File hash 7d02b25e38eadb33e9f96d771e1844a6 Token hash f2ea1bb5a6208b5d4f2c930a0d7042d6 SourcererCC
  • 8. Gemini approach 1. Extract syntactical and structural features from each snippet Features Matrix M: N x M values Mi,j = importance of feature j for snippet i (0 if none) Dataset N snippets
  • 9. Gemini approach 2. Create a pairwise similarity graph between snippets, by hashing the feature matrix Pairwise Similarity Graph Graph with N nodes nodes = snippets edge = similarity Dataset N snippets Features NxM values
  • 10. Gemini approach 3. Extract connected components from the similarity graph Connected Components Graphs where each pair of nodes is connected by paths Dataset N snippets Features NxM values Similarity Graph N nodes
  • 11. Gemini approach 4. Perform community detection on each components to obtain clone communities Dataset N snippets Clone Communities Parts of components where each node is a clone Features NxM values Similarity Graph N nodes Connected Components
  • 12. Feature Extraction Step Dataset N snippets Features NxM values Similarity Graph N nodes Connected Components Clone Communities
  • 13. Abstract Syntax Trees def foo(x): return x + 42 def foo x x + return 42
  • 14. Abstract Syntax Trees def(decl) foo(id) x(id) x(id) +(op) return(stat ) 42(lit) def foo(x): return x + 42
  • 15. Identifiers and Literals def foo(n: int): k = 0 for i in range(n): k += i return k + 42 Identifiers: [(foo, 1),(n, 2),(k, 3),(i, 2)] Literals: [(0, 1), (42, 1)]
  • 16. Graphlets and Children Graphlets: [(decl,[id,id,statement]), (id, []),(id, []), (statement,[op.]),(op,[id,lit]) (id,[]),(lit,[])] Children: [(decl,3), (id, 0),(id, 0), (statement,1),(op,2)(id,0), (lit,0)] def(decl) foo(id) x(id) x(id) +(op) return(stat ) 42(lit)
  • 17. Graphlets and Children Graphlets: [((decl,[id,id,statement]),1), ((id, []),3), ((statement,[op.]),1) ((op,[id,lit]),1),((lit,[]),1)] Children: [((decl,3),1),((id, 0),3), ((statement,1),1),((op,2),1), ((lit,0),1)] def foo x x + return 42
  • 18. TF-IDF wx,y = tfx,y × log ( N / dfx ) tfx,y = frequency of feature x in snippet y dfx = number of snippets containing feature x N = total number of snippets
  • 19. Feature extraction step 1. Convert snippets to UAST using Babelfish 2. Extract weighted bags of features from each UAST 3. Perform TF-IDF to reduce the amount of features 4. Find weights for each feature type (hyperparameters)
  • 20. Hashing Step Dataset N snippets Features NxM values Similarity Graph N nodes Connected Components Clone Communities
  • 21.
  • 22. Similarity between weighted bags ? Problem: Computing similarity between each pair of snippet is not viable 1 million snippets 499,998,500,001 similarities 428 million snippets 91,591,999,358,000,000 similarities
  • 23. Weighted Jaccard Similarity S = set of all features A , B = subsets of S
  • 24. Minhashing 1. Create perm(A) and perm(B) 2. Hash all elements in perm(A) and perm(B) 3. Select the smallest hash for A and B 4. Probability it's the same value equals J(A, B) !
  • 25. Minhash signatures We take the k smallest hash values for each snippet: For each pair of snippets, the probability they have the same value in any row is still equal to their Jaccard similarity ! 1 0 0 0 2 3 2 1 2 3 5 6 3 6 4 ... ... ... ... ... k rows ... ......
  • 26. Locality-Sensitive Hashing We divide the signature matrix into b bands of r rows candidate pair = any snippets which have the same values in at least one band. 1 0 0 0 2 3 2 1 2 3 5 6 3 6 4 ... ... ... ... ... r rows ... ...... r × b = k rows
  • 27. Locality-Sensitive Hashing s = similarity between the two snippets 1. Probability the signatures are the same in one band: sr
  • 28. Locality-Sensitive Hashing s = similarity between the two snippets 1. Probability the signatures are the same in one band: sr 2. Probability the signatures are different in one band: 1 - sr
  • 29. Locality-Sensitive Hashing s = similarity between the two snippets 1. Probability the signatures are the same in one band: sr 2. Probability the signatures are different in one band: 1 - sr 3. Probability the signatures are different in all bands: (1 - sr)b
  • 30. Locality-Sensitive Hashing s = similarity between the two snippets 1. Probability the signatures are the same in one band: sr 2. Probability the signatures are different in one band: 1 - sr 3. Probability the signatures are different in all bands: (1 - sr)b 4. Probability the signatures are the same in at least one band, i.e. the snippets are a candidate pair: 1 - (1 - sr)b
  • 31. Locality-Sensitive Hashing Regardless of r and b, we get an S-curve:
  • 32. Locality-Sensitive Hashing If we choose r and b well, we can get this:
  • 33. Computing the similarity graph 1. Create the signatures for each snippet 2. Select the threshold from which we decide that snippets are clones, deduce r and b, then for each band hash sub-signatures into buckets 3. Snippets that land in at least one common bucket are the candidates
  • 34. Connected Components Step Dataset N snippets Features NxM values Similarity Graph N nodes Connected Components Clone Communities
  • 36. Community Detection Step Dataset N snippets Features NxM values Similarity Graph N nodes Connected Components Clone Communities
  • 38. Public Git Archive 182,014 projects (repos on GitHub with over 50 stars) 3 TB of code Commit history included PGA HEAD = ~54.5 million files With Babelfish driver = ~14.1 million files
  • 39.
  • 40.
  • 41. Post-feature extraction 7,869,917 million files processed 102,114 project 6,194,874 distinct features
  • 42. Features analysis Feature distribution: • identifiers: 9.93 % • literals: 65.7 % • graphlets: 11.84 % • children: 0.02 % • uast2seq (DFS) : 10.49% • node2vec (random walk) : 2.02 %
  • 43. Features analysis Average number of features per file: • 60 identifiers • 38 literals • 116 graphlets • 37 children • 336 uast2seq (DFS) • 460 node2vec (random walk)
  • 44. Connected components analysis 95 % threshold: • 4,270,967 CCs • 52.4 % of files in CCs with > 1 file • 7.83 file per CC with > 1 file 80 % threshold: • 3,551,648 CCs • 61.9 % of files in CCs with > 1 file • 8.79 file per CC with > 1 file
  • 47. Communities analysis Detection done with Walktrap algorithm 95 % threshold: • 526,715 CCs with > 1 file • In those, we detected 666,692 communities 80 % threshold: • 553,997 CCs with > 1 file • In those, we detected 918,333 communities
  • 48. text_parser.go 422 files - 327 projects - 1 filename
  • 49. 2344 files - 1058 projects - 584 filename filenames
  • 51. Microsoft Azure SDK 4803 files - 3 projects - 2954 filename
  • 52. Thank you ! For more: blog.sourced.tech/post/deduplicating_pga_with_apollo github.com/src-d/gemini pga.sourced.tech/ r.keramitas@gmail.com