SlideShare una empresa de Scribd logo
1 de 16
Descargar para leer sin conexión
Ctcompare:
 Comparing Multiple Code
   Trees for Similarity

       Warren Toomey
   School of IT, Bond University


    Using lexical analysis with
  techniques borrowed from DNA
sequencing, multiple code trees can
  be quickly compared to find any
          code similarities
Why Write a Code
          Comparison Tool?
    To detect student plagiarism
●




    To determine if your codebase is
●


    `infected' by proprietary code from
    elsewhere, or by open-source code
    covered by a license like the GPL

    To trace code genealogy between trees
●


    separated by time (e.g. versions), useful
    for the new field of computing history
Code Comparison Issues
  Can rearrangement of code be
●


  detected?
  - per line? per sub-line?
● Can “munging of code” be detected?


  - variable/function/struct renaming?
● What if one or both codebases are


  proprietary? How can third parties verify
  any comparison?
● Can a timely comparison be done?

● What is the rate of false positives?


  - of missed matches?
Lexical Comparison
  First version: break the source code
●


  from each tree into lexical tokens, then
  compare runs of tokens between trees.
● This removes all the code's semantics,


  but does deal with code rearrangement
  (but not code “munging”).
Performance of Lexical
            Approach

    Performance was O(M*N), when M and N
●


    the # of tokens in each code tree

    Basically: slow, and exponential
●




    I also wanted to compare multiple code
●


    trees, and also find duplicated code
    within a tree
When is Copying
            Significant?
    Common code fragments not significant:
●


                                    13 tokens
     for (i=0; i< val; i++)
                                    10 tokens
     if (x > max) max=x;
     err= stat(file, &sb); if (err) 14 tokens

    Axiom: runs of < 16 tokens insignificant
●




    Idea: use a run of 16 tokens as a lookup
●


    key to find other trees with the same run
    of tokens
Approach in Second
          Version
   Take all 16-tokens runs in a code tree
●

● Use each as a key, and insert the runs


  into a database along with position of
  the run in the source tree
● Keys (i.e. runs of 16 tokens) with


  multiple values indicate possible code
  copying of at least 16 tokens
● For these keys, we can evaluate all code


  trees from this point on to find any real
  code similarity
Approach in Second
     Version
Algorithm
for each key in the database with multiple record nodes {
  obtain the set of record nodes from the database;

 for each combination of node pairs Na and Nb {
  if (performing a cross-tree comparison)
    skip this combination if Na and Nb in the same tree;
  if (performing a code clone comparison)
    skip this combination if Na and Nb in the same file;

  perform a full token comparison beginning at Na and Nb to
  determine the full extent of the code similarity, i.e. determine
  the actual number of tokens in common;

  add the matching token sequence to a list of quot;potential runsquot;;
 }
}
walk the list of potential runs to merge overlapping runs, and
remove runs which are subsets of larger runs;

output the details of the actual matching token runs found.
Hashing Identifiers and
    Literals in Each Code Tree
    Simply comparing tokens yields false
●


    positives, e.g
      if (xx > yy) { aa = 2 * bb + cc; }
      if (ab > bc) {ab = 5 * ab + bc; }

    I hash each identifier and literal down to
●


    a 16-bit value

    This obfuscates the original values while
●


    minimising the amount of false positives
Isomorphic Comparison
   Hashing identifiers helps to ensure code
●


  similarities are found
● Does not help if variables have been


  renamed
● This can be solved with isomorphic


  comparison: keep a 2-way variable
  isomorphism table
● Code is isomorphic when variable


  names are isomorphic across a long run
  of code
Isomorphic Example
int maxofthree(int x, int y, int z)
 {
  if ((x>y) && (x>z)) return(x);
  if (y>z) return(y);
  return(z);
 }

int bigtriple(int b, int a, int c)
 {
  if ((b>a) && (b>c)) return(b);
  if (a>c) return(a);
  return(c);
 }
Bottlenecks
   In the second version, main bottleneck
●


  is the merging of overlapping code
  fragments which have been found to
  match between two trees
● Similar to the problem with DNA


  sequencing when genes are split up into
  multiple fragments before sequencing
● Use of AVL trees here has helped to


  reduce the cost of merging immensely
● Other more mundane optimisations


  such as mmap() have also helped
Current Performance




    54 seconds to compare 14 trees of code
●


    totalling 1 million lines of C code
Results
   Many lines of code written at UNSW in
●


  late 1970s still present in System V, and
  in Open Solaris
● Thousands of lines of replicated code


  inside Linux kernel
● AT&T vs BSDi lawsuit in early 1990s:


   - expert witness found 56 common lines
     in 230K lines between BSD and Unix
   - my tool find roughly same lines, plus
     100+ more isomorphic similarities
Question?

Más contenido relacionado

La actualidad más candente

C faqs interview questions placement paper 2013
C faqs interview questions placement paper 2013C faqs interview questions placement paper 2013
C faqs interview questions placement paper 2013srikanthreddy004
 
Compression project presentation
Compression project presentationCompression project presentation
Compression project presentationfaizang909
 
Deep Learning for Machine Translation: a paradigm shift - Alberto Massidda - ...
Deep Learning for Machine Translation: a paradigm shift - Alberto Massidda - ...Deep Learning for Machine Translation: a paradigm shift - Alberto Massidda - ...
Deep Learning for Machine Translation: a paradigm shift - Alberto Massidda - ...Codemotion
 
Avoiding Hardware Aliasing
Avoiding Hardware AliasingAvoiding Hardware Aliasing
Avoiding Hardware AliasingPeter Breuer
 
Phyton Learning extracts
Phyton Learning extracts Phyton Learning extracts
Phyton Learning extracts Pavan Babu .G
 
CMSC 350 HOMEWORK 3
CMSC 350 HOMEWORK 3CMSC 350 HOMEWORK 3
CMSC 350 HOMEWORK 3HamesKellor
 
BayFP: Concurrent and Multicore Haskell
BayFP: Concurrent and Multicore HaskellBayFP: Concurrent and Multicore Haskell
BayFP: Concurrent and Multicore HaskellBryan O'Sullivan
 
Study of the Subtyping Machine of Nominal Subtyping with Variance
Study of the Subtyping Machine of Nominal Subtyping with VarianceStudy of the Subtyping Machine of Nominal Subtyping with Variance
Study of the Subtyping Machine of Nominal Subtyping with VarianceOriRoth
 
Lecture 3 RE NFA DFA
Lecture 3   RE NFA DFA Lecture 3   RE NFA DFA
Lecture 3 RE NFA DFA Rebaz Najeeb
 
Lz77 (sliding window)
Lz77 (sliding window)Lz77 (sliding window)
Lz77 (sliding window)MANISH T I
 

La actualidad más candente (19)

Lzw compression
Lzw compressionLzw compression
Lzw compression
 
C faqs interview questions placement paper 2013
C faqs interview questions placement paper 2013C faqs interview questions placement paper 2013
C faqs interview questions placement paper 2013
 
More Pointers and Arrays
More Pointers and ArraysMore Pointers and Arrays
More Pointers and Arrays
 
Dictor
DictorDictor
Dictor
 
50120130405006
5012013040500650120130405006
50120130405006
 
Compression project presentation
Compression project presentationCompression project presentation
Compression project presentation
 
Haxe vs Unicode (en)
Haxe vs Unicode (en)Haxe vs Unicode (en)
Haxe vs Unicode (en)
 
Deep Learning for Machine Translation: a paradigm shift - Alberto Massidda - ...
Deep Learning for Machine Translation: a paradigm shift - Alberto Massidda - ...Deep Learning for Machine Translation: a paradigm shift - Alberto Massidda - ...
Deep Learning for Machine Translation: a paradigm shift - Alberto Massidda - ...
 
Avoiding Hardware Aliasing
Avoiding Hardware AliasingAvoiding Hardware Aliasing
Avoiding Hardware Aliasing
 
Phyton Learning extracts
Phyton Learning extracts Phyton Learning extracts
Phyton Learning extracts
 
CMSC 350 HOMEWORK 3
CMSC 350 HOMEWORK 3CMSC 350 HOMEWORK 3
CMSC 350 HOMEWORK 3
 
BayFP: Concurrent and Multicore Haskell
BayFP: Concurrent and Multicore HaskellBayFP: Concurrent and Multicore Haskell
BayFP: Concurrent and Multicore Haskell
 
Study of the Subtyping Machine of Nominal Subtyping with Variance
Study of the Subtyping Machine of Nominal Subtyping with VarianceStudy of the Subtyping Machine of Nominal Subtyping with Variance
Study of the Subtyping Machine of Nominal Subtyping with Variance
 
C language
C languageC language
C language
 
Lecture 3 RE NFA DFA
Lecture 3   RE NFA DFA Lecture 3   RE NFA DFA
Lecture 3 RE NFA DFA
 
LZ78
LZ78LZ78
LZ78
 
Lz77 (sliding window)
Lz77 (sliding window)Lz77 (sliding window)
Lz77 (sliding window)
 
Computer Science Assignment Help
 Computer Science Assignment Help  Computer Science Assignment Help
Computer Science Assignment Help
 
7
77
7
 

Destacado

The internet Business: Success stories, Business Models etc. - A talk at IIMA
The internet Business: Success stories, Business Models etc. - A talk at IIMAThe internet Business: Success stories, Business Models etc. - A talk at IIMA
The internet Business: Success stories, Business Models etc. - A talk at IIMARohit Varma
 
Ten things that i have learnt in my marketing career_Talk at AICAR Business S...
Ten things that i have learnt in my marketing career_Talk at AICAR Business S...Ten things that i have learnt in my marketing career_Talk at AICAR Business S...
Ten things that i have learnt in my marketing career_Talk at AICAR Business S...Rohit Varma
 
المحاضرة الثالثه للدكتور صالح عراقى
المحاضرة الثالثه للدكتور صالح عراقى المحاضرة الثالثه للدكتور صالح عراقى
المحاضرة الثالثه للدكتور صالح عراقى guestfd118
 
المحاضرة الثالثه للدكتور صالح عراقى
المحاضرة الثالثه للدكتور صالح عراقىالمحاضرة الثالثه للدكتور صالح عراقى
المحاضرة الثالثه للدكتور صالح عراقىguestfd118
 
Less talked about aspects of digital marketing_The business and profession of...
Less talked about aspects of digital marketing_The business and profession of...Less talked about aspects of digital marketing_The business and profession of...
Less talked about aspects of digital marketing_The business and profession of...Rohit Varma
 
The Australian Community-Driven TV Guide
The Australian Community-Driven TV GuideThe Australian Community-Driven TV Guide
The Australian Community-Driven TV GuideDoctorWkt
 
Better Branding_McKinsey Quarterly_Nov 2003
Better Branding_McKinsey Quarterly_Nov 2003Better Branding_McKinsey Quarterly_Nov 2003
Better Branding_McKinsey Quarterly_Nov 2003Rohit Varma
 
Learnings from handling social media products for the indian market an intern...
Learnings from handling social media products for the indian market an intern...Learnings from handling social media products for the indian market an intern...
Learnings from handling social media products for the indian market an intern...Rohit Varma
 
Reconstructive Software Archaeology
Reconstructive Software ArchaeologyReconstructive Software Archaeology
Reconstructive Software ArchaeologyDoctorWkt
 
The evolution of the Internet and it's impact on print and other media_IDC_II...
The evolution of the Internet and it's impact on print and other media_IDC_II...The evolution of the Internet and it's impact on print and other media_IDC_II...
The evolution of the Internet and it's impact on print and other media_IDC_II...Rohit Varma
 
Ten things that are happening around us and ten ways India's students batch o...
Ten things that are happening around us and ten ways India's students batch o...Ten things that are happening around us and ten ways India's students batch o...
Ten things that are happening around us and ten ways India's students batch o...Rohit Varma
 
Steven Marks Engineer
Steven Marks EngineerSteven Marks Engineer
Steven Marks Engineerguestb541e
 
How does the ue know when an eNB is talking to it
How does the ue know when an eNB is talking to itHow does the ue know when an eNB is talking to it
How does the ue know when an eNB is talking to italiirfan04
 
Social networking learnings & opportunities web innovation conference, bangal...
Social networking learnings & opportunities web innovation conference, bangal...Social networking learnings & opportunities web innovation conference, bangal...
Social networking learnings & opportunities web innovation conference, bangal...Rohit Varma
 
LTE and EPC Specifications
LTE and EPC SpecificationsLTE and EPC Specifications
LTE and EPC Specificationsaliirfan04
 
Lte security overview
Lte security overviewLte security overview
Lte security overviewaliirfan04
 

Destacado (17)

The internet Business: Success stories, Business Models etc. - A talk at IIMA
The internet Business: Success stories, Business Models etc. - A talk at IIMAThe internet Business: Success stories, Business Models etc. - A talk at IIMA
The internet Business: Success stories, Business Models etc. - A talk at IIMA
 
Ten things that i have learnt in my marketing career_Talk at AICAR Business S...
Ten things that i have learnt in my marketing career_Talk at AICAR Business S...Ten things that i have learnt in my marketing career_Talk at AICAR Business S...
Ten things that i have learnt in my marketing career_Talk at AICAR Business S...
 
المحاضرة الثالثه للدكتور صالح عراقى
المحاضرة الثالثه للدكتور صالح عراقى المحاضرة الثالثه للدكتور صالح عراقى
المحاضرة الثالثه للدكتور صالح عراقى
 
المحاضرة الثالثه للدكتور صالح عراقى
المحاضرة الثالثه للدكتور صالح عراقىالمحاضرة الثالثه للدكتور صالح عراقى
المحاضرة الثالثه للدكتور صالح عراقى
 
Culture/Social Media
Culture/Social MediaCulture/Social Media
Culture/Social Media
 
Less talked about aspects of digital marketing_The business and profession of...
Less talked about aspects of digital marketing_The business and profession of...Less talked about aspects of digital marketing_The business and profession of...
Less talked about aspects of digital marketing_The business and profession of...
 
The Australian Community-Driven TV Guide
The Australian Community-Driven TV GuideThe Australian Community-Driven TV Guide
The Australian Community-Driven TV Guide
 
Better Branding_McKinsey Quarterly_Nov 2003
Better Branding_McKinsey Quarterly_Nov 2003Better Branding_McKinsey Quarterly_Nov 2003
Better Branding_McKinsey Quarterly_Nov 2003
 
Learnings from handling social media products for the indian market an intern...
Learnings from handling social media products for the indian market an intern...Learnings from handling social media products for the indian market an intern...
Learnings from handling social media products for the indian market an intern...
 
Reconstructive Software Archaeology
Reconstructive Software ArchaeologyReconstructive Software Archaeology
Reconstructive Software Archaeology
 
The evolution of the Internet and it's impact on print and other media_IDC_II...
The evolution of the Internet and it's impact on print and other media_IDC_II...The evolution of the Internet and it's impact on print and other media_IDC_II...
The evolution of the Internet and it's impact on print and other media_IDC_II...
 
Ten things that are happening around us and ten ways India's students batch o...
Ten things that are happening around us and ten ways India's students batch o...Ten things that are happening around us and ten ways India's students batch o...
Ten things that are happening around us and ten ways India's students batch o...
 
Steven Marks Engineer
Steven Marks EngineerSteven Marks Engineer
Steven Marks Engineer
 
How does the ue know when an eNB is talking to it
How does the ue know when an eNB is talking to itHow does the ue know when an eNB is talking to it
How does the ue know when an eNB is talking to it
 
Social networking learnings & opportunities web innovation conference, bangal...
Social networking learnings & opportunities web innovation conference, bangal...Social networking learnings & opportunities web innovation conference, bangal...
Social networking learnings & opportunities web innovation conference, bangal...
 
LTE and EPC Specifications
LTE and EPC SpecificationsLTE and EPC Specifications
LTE and EPC Specifications
 
Lte security overview
Lte security overviewLte security overview
Lte security overview
 

Similar a Ctcompare: Comparing Multiple Code Trees for Similarity

C Interview Questions for Fresher
C Interview Questions for FresherC Interview Questions for Fresher
C Interview Questions for FresherJaved Ahmad
 
C interview-questions-techpreparation
C interview-questions-techpreparationC interview-questions-techpreparation
C interview-questions-techpreparationsonu sharma
 
C interview-questions-techpreparation
C interview-questions-techpreparationC interview-questions-techpreparation
C interview-questions-techpreparationKgr Sushmitha
 
Distributed Coordination
Distributed CoordinationDistributed Coordination
Distributed CoordinationLuis Galárraga
 
Clone detection in Python
Clone detection in PythonClone detection in Python
Clone detection in PythonValerio Maggio
 
Faster persistent data structures through hashing
Faster persistent data structures through hashingFaster persistent data structures through hashing
Faster persistent data structures through hashingJohan Tibell
 
Regular expressions in Python
Regular expressions in PythonRegular expressions in Python
Regular expressions in PythonSujith Kumar
 
Chapter-3 compiler.pptx course materials
Chapter-3 compiler.pptx course materialsChapter-3 compiler.pptx course materials
Chapter-3 compiler.pptx course materialsgadisaAdamu
 
cryptography and network security chap 3
cryptography and network security chap 3cryptography and network security chap 3
cryptography and network security chap 3Debanjan Bhattacharya
 
Scalable up genomic analysis with ADAM
Scalable up genomic analysis with ADAMScalable up genomic analysis with ADAM
Scalable up genomic analysis with ADAMfnothaft
 
Sienna 12 huffman
Sienna 12 huffmanSienna 12 huffman
Sienna 12 huffmanchidabdu
 
Deduplication on large amounts of code
Deduplication on large amounts of codeDeduplication on large amounts of code
Deduplication on large amounts of codesource{d}
 

Similar a Ctcompare: Comparing Multiple Code Trees for Similarity (20)

I1
I1I1
I1
 
Parsing
ParsingParsing
Parsing
 
C Interview Questions for Fresher
C Interview Questions for FresherC Interview Questions for Fresher
C Interview Questions for Fresher
 
C interview-questions-techpreparation
C interview-questions-techpreparationC interview-questions-techpreparation
C interview-questions-techpreparation
 
C interview-questions-techpreparation
C interview-questions-techpreparationC interview-questions-techpreparation
C interview-questions-techpreparation
 
C interview Question and Answer
C interview Question and AnswerC interview Question and Answer
C interview Question and Answer
 
Distributed Coordination
Distributed CoordinationDistributed Coordination
Distributed Coordination
 
Perl_Part4
Perl_Part4Perl_Part4
Perl_Part4
 
Clone detection in Python
Clone detection in PythonClone detection in Python
Clone detection in Python
 
Faster persistent data structures through hashing
Faster persistent data structures through hashingFaster persistent data structures through hashing
Faster persistent data structures through hashing
 
Regular expressions in Python
Regular expressions in PythonRegular expressions in Python
Regular expressions in Python
 
Chapter-3 compiler.pptx course materials
Chapter-3 compiler.pptx course materialsChapter-3 compiler.pptx course materials
Chapter-3 compiler.pptx course materials
 
cryptography and network security chap 3
cryptography and network security chap 3cryptography and network security chap 3
cryptography and network security chap 3
 
Cryptography and Network Security William Stallings Lawrie Brown
Cryptography and Network Security William Stallings Lawrie BrownCryptography and Network Security William Stallings Lawrie Brown
Cryptography and Network Security William Stallings Lawrie Brown
 
Symbolic Execution And KLEE
Symbolic Execution And KLEESymbolic Execution And KLEE
Symbolic Execution And KLEE
 
Bh36352357
Bh36352357Bh36352357
Bh36352357
 
Scalable up genomic analysis with ADAM
Scalable up genomic analysis with ADAMScalable up genomic analysis with ADAM
Scalable up genomic analysis with ADAM
 
Sienna 12 huffman
Sienna 12 huffmanSienna 12 huffman
Sienna 12 huffman
 
Deduplication on large amounts of code
Deduplication on large amounts of codeDeduplication on large amounts of code
Deduplication on large amounts of code
 
Compiler Design
Compiler DesignCompiler Design
Compiler Design
 

Último

Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piececharlottematthew16
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionDilum Bandara
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfPrecisely
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfRankYa
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clashcharlottematthew16
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 

Último (20)

Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piece
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdf
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clash
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 

Ctcompare: Comparing Multiple Code Trees for Similarity

  • 1. Ctcompare: Comparing Multiple Code Trees for Similarity Warren Toomey School of IT, Bond University Using lexical analysis with techniques borrowed from DNA sequencing, multiple code trees can be quickly compared to find any code similarities
  • 2. Why Write a Code Comparison Tool? To detect student plagiarism ● To determine if your codebase is ● `infected' by proprietary code from elsewhere, or by open-source code covered by a license like the GPL To trace code genealogy between trees ● separated by time (e.g. versions), useful for the new field of computing history
  • 3. Code Comparison Issues Can rearrangement of code be ● detected? - per line? per sub-line? ● Can “munging of code” be detected? - variable/function/struct renaming? ● What if one or both codebases are proprietary? How can third parties verify any comparison? ● Can a timely comparison be done? ● What is the rate of false positives? - of missed matches?
  • 4. Lexical Comparison First version: break the source code ● from each tree into lexical tokens, then compare runs of tokens between trees. ● This removes all the code's semantics, but does deal with code rearrangement (but not code “munging”).
  • 5. Performance of Lexical Approach Performance was O(M*N), when M and N ● the # of tokens in each code tree Basically: slow, and exponential ● I also wanted to compare multiple code ● trees, and also find duplicated code within a tree
  • 6. When is Copying Significant? Common code fragments not significant: ● 13 tokens for (i=0; i< val; i++) 10 tokens if (x > max) max=x; err= stat(file, &sb); if (err) 14 tokens Axiom: runs of < 16 tokens insignificant ● Idea: use a run of 16 tokens as a lookup ● key to find other trees with the same run of tokens
  • 7. Approach in Second Version Take all 16-tokens runs in a code tree ● ● Use each as a key, and insert the runs into a database along with position of the run in the source tree ● Keys (i.e. runs of 16 tokens) with multiple values indicate possible code copying of at least 16 tokens ● For these keys, we can evaluate all code trees from this point on to find any real code similarity
  • 9. Algorithm for each key in the database with multiple record nodes { obtain the set of record nodes from the database; for each combination of node pairs Na and Nb { if (performing a cross-tree comparison) skip this combination if Na and Nb in the same tree; if (performing a code clone comparison) skip this combination if Na and Nb in the same file; perform a full token comparison beginning at Na and Nb to determine the full extent of the code similarity, i.e. determine the actual number of tokens in common; add the matching token sequence to a list of quot;potential runsquot;; } } walk the list of potential runs to merge overlapping runs, and remove runs which are subsets of larger runs; output the details of the actual matching token runs found.
  • 10. Hashing Identifiers and Literals in Each Code Tree Simply comparing tokens yields false ● positives, e.g if (xx > yy) { aa = 2 * bb + cc; } if (ab > bc) {ab = 5 * ab + bc; } I hash each identifier and literal down to ● a 16-bit value This obfuscates the original values while ● minimising the amount of false positives
  • 11. Isomorphic Comparison Hashing identifiers helps to ensure code ● similarities are found ● Does not help if variables have been renamed ● This can be solved with isomorphic comparison: keep a 2-way variable isomorphism table ● Code is isomorphic when variable names are isomorphic across a long run of code
  • 12. Isomorphic Example int maxofthree(int x, int y, int z) { if ((x>y) && (x>z)) return(x); if (y>z) return(y); return(z); } int bigtriple(int b, int a, int c) { if ((b>a) && (b>c)) return(b); if (a>c) return(a); return(c); }
  • 13. Bottlenecks In the second version, main bottleneck ● is the merging of overlapping code fragments which have been found to match between two trees ● Similar to the problem with DNA sequencing when genes are split up into multiple fragments before sequencing ● Use of AVL trees here has helped to reduce the cost of merging immensely ● Other more mundane optimisations such as mmap() have also helped
  • 14. Current Performance 54 seconds to compare 14 trees of code ● totalling 1 million lines of C code
  • 15. Results Many lines of code written at UNSW in ● late 1970s still present in System V, and in Open Solaris ● Thousands of lines of replicated code inside Linux kernel ● AT&T vs BSDi lawsuit in early 1990s: - expert witness found 56 common lines in 230K lines between BSD and Unix - my tool find roughly same lines, plus 100+ more isomorphic similarities