Internet-scale Source Code
                Search and Analysis Framework

                                                        Iman Keivanloo

                                                             Advisor:
                                                       Dr. Juergen Rilling



PhD Seminar
Computer Science and Software Engineering Department
November-17-2011
Agenda

• Research Context

• Major questions & answers

• Next step

• Conclusion

• Time Table

Research Context



            Internet-Scale Code Search
“is searching the Internet for source code to help solve a software
                      development problem”




                           [Gallardo, SUITE’09]

How to search for Source Code?
• Free-form Query:
  – “how to write into file in Java”




• Structural Query:
  – “select col1 from table1 where col1=‘%write’”

              [Keivanloo, SUITE’11]    [Keivanloo, ICSM’10]
Research Focus
Similar Fragment Search

Suggested simplified query:
   Select line which has
   (1) a method call statement on the trigger method.

The ideal expected answer:
   XMLReadFile inFile=new XMLReadFile("kb.xml");
   Window myWindow=new Window();
   myWindow.trigger(inFile);
   OutputStream result=new OutputStream();
   myWindow.flush(result);

Step 1: Input [the simplified structural query]
        Internet-Scale Structural Code Search Engine

   Returned fragments:
   ...
   59: Event e=new Event(50);
   60: e.trigger();
   61: e.update();
   ...

   ...
   11: CSVReadFile csvData=new CSVReadFile("input.csv");
   12: myWindow.trigger(csvData);
   13: OutputStream o=new OutputStream();
   …

   ...
   133: Listener res=new Listener();
   134: res.trigger("warm-up");
   135: res.close();
   ...

   Selected fragment:
   ...
   10: Window myWindow=new Window();
   11: CSVReadFile csvData=new CSVReadFile("...
   12: myWindow.trigger(csvData);
   13: OutputStream o=new OutputStream();
   14: myWindow.flush(o);
   15: myWindow.close();
   ...

   This line looks like a match; however, it uses .CSV instead of .XML.
   We can now use our clone search engine to find other code fragments
   similar to this one.

Step 2: Input [the selected fragment in the first step and its target line (red)]
        Real-time Clone Search Engine

   Gapped clone:
   ...
   55: Window r=new Window();
   56: long timestamp=System.Now();
   57: System.out.println("Start reasoning...");
   58: XMLStream xmldata=new XMLStream(io);
   59: r.trigger(xmldata);
   60: OutputStream o=new OutputStream();
   61: r.flush(o);
   …

   The pattern is similar but it uses XMLStream instead of XMLReadFile as the input.

   Unordered core:
   …
   89: Window var=new Window();
   90: XMLReadFile r=new XMLReadFile ("k.xml");
   91: OutputStream o=new OutputStream();
   92: var.trigger(r);
   93: var.flush(o);
   …

   This match is acceptable, even if the order is different from the 1:1 match.
Research Challenge




The Web Search Challenge




But Often Still Fail to Deliver the Expected Results
           After 10 Years of Research




No Ambiguity!




Early Conclusion




Source Code Search is similar to Web Search
Early Conclusion




Source Code Search is similar to Web Search

Web Search = Search + Analysis (Ambiguity resolution)

1. Search techniques = ?
2. Ambiguity resolution techniques = Code Analysis
Research Approach Overview


 Internet-scale Source Code
Search and Analysis Framework
Search     Analysis


        Code Clone Search
 Semantic Web-based Code Analysis
Definitions & Requirements


         Search
Clone (Source Code Clone)
     • Similar code fragments
   Fragment 1:
   for (AttributeEntity theAttributeEntity:aTableEntity.ge…
   System.out.println("Hello!");

   Fragment 2:
   for (AttributeEntity theAttributeEntity:aTableEntity.ge…
   System.out.println("Hello!");



     •   Type 1: Identical except whitespaces …
     •   Type 2: Identical except variable names ...
     •   Type 3: Identical except a few missing…
     •   Type 4: Similar functionality

[Roy, C. K., Cordy, J. R., & Koschke, R. (2009). Comparison and evaluation of code clone detection techniques
and tools: A qualitative approach. Science of Computer Programming.]
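
To make the first two clone types concrete, here is a minimal sketch, not the normalization used by SeClone. It assumes a toy tokenizer, a deliberately tiny keyword list, and a naming heuristic (capitalized identifiers are types, identifiers followed by "(" are method names); all class and method names are hypothetical. Two lines are treated as Type-1 clones if they match after whitespace is ignored, and as Type-2 clones if they match after variable names are replaced by a placeholder.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CloneNormalizer {
    // Toy tokenizer: string literals, identifiers, numbers, then any single symbol.
    private static final Pattern TOKEN =
        Pattern.compile("\"[^\"]*\"|[A-Za-z_][A-Za-z0-9_]*|\\d+|\\S");

    // Assumption: a tiny keyword list that is enough for the slide examples.
    private static final Set<String> KEYWORDS =
        Set.of("for", "if", "else", "while", "new", "return", "int", "long");

    static List<String> tokenize(String line) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(line);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    // Type-1 view: only whitespace differences are ignored.
    static String normalizeType1(String line) {
        return String.join(" ", tokenize(line));
    }

    // Type-2 view: variable names are also ignored.
    // Heuristic (assumption): type names start with an upper-case letter and
    // method names are followed by "("; both are kept, everything else that
    // looks like an identifier is treated as a variable.
    static String normalizeType2(String line) {
        List<String> tokens = tokenize(line);
        List<String> out = new ArrayList<>();
        for (int i = 0; i < tokens.size(); i++) {
            String t = tokens.get(i);
            char first = t.charAt(0);
            boolean isIdentifier = Character.isLetter(first) || first == '_';
            boolean isCall = i + 1 < tokens.size() && tokens.get(i + 1).equals("(");
            boolean isType = Character.isUpperCase(first);
            boolean isVariable = isIdentifier && !KEYWORDS.contains(t) && !isCall && !isType;
            out.add(isVariable ? "$v" : t);
        }
        return String.join(" ", out);
    }

    public static void main(String[] args) {
        String query = "for (Attribute attribute:exampleSet.getAttributes())";
        String candidate = "for (Attribute attribute:es1.getAttributes())";
        // Not a Type-1 clone: the collection variable differs.
        System.out.println(normalizeType1(query).equals(normalizeType1(candidate)));
        // A Type-2 clone: identical once variable names are abstracted.
        System.out.println(normalizeType2(query).equals(normalizeType2(candidate)));
    }
}
```

Method and type names are kept on purpose, so a line using IAttribute or a different method would not match here.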
Clone Search

Query:
   for (Attribute attribute:exampleSet.getAttributes())
   System.out.println("Hello!");

Code Database:
   for (Attribute attribute:es1.getAttributes())
   System.out.println("Test");

   for (IAttribute att:source.getAttributes()) {
   System.out.println("Please do not read me");

   for (JAttribute attribute:formType.getAttributes()) System.out.println("Test");
Clone Search

           Answer




        Query




Internet-scale Clone Search
            Query

for (Attribute attribute:exampleSet.getAttributes())
System.out.println("Hello!");
Internet-scale   Real-time Clone Search




Internet-scale Real-time Clone Search




    Requirements?
Internet-scale Real-time Clone Search




    Requirements:
  Millions LOC
  ~ 300 MLOC
Internet-scale Real-time Clone Search




    Requirements:
  Millions LOC   100 Milliseconds
Internet-scale Real-time Clone Search

                                      for (IAttribute att:source.getAttributes()) {
                                      System.out.println("Please do not read me");

                                for (JAttribute attribute:formType.getAttributes()) System.out.println("Test");

    Requirements:
  Millions LOC   100 Milliseconds   Precision   Recall   Type-1, 2, 3…
Internet-scale Real-time Clone Search




    Requirements:
  Millions LOC   100 Milliseconds   Precision   Recall   Type-1, 2, 3…
Research Question #1


                       Real-time answer (faster than 100 ms)
                        Is it actually possible?
Our Initial Analysis
• SeClone: An Internet-scale Real-time Clone Search Engine




               Search               Analysis
                Phase 1               Phase 2
                        [Keivanloo, ICPC’11]
Inside SeClone
                             Phase   1
                             • Syntactical Pattern matching




Phase 1            Phase 2
Pattern Matching




Inside SeClone
                                       Phase   2
                                       • Information Retrieval &
                                           Clustering algorithm


                                               1 for (Attribute attribute:exampleSet.getAttribute
                                                            System.out.println("The end");
                                               2 for (Attribute attribute:es1.getAttributes())
                                                            System.out.println("Test");
                                               3 for (AttributeEntity     theAttributeEntity:aTable
                                                            System.out.println("Hello!");
                                               4 for (JAttribute   attribute:formType.getAttribute
                                                            System.out.println("Test");
                                               5 for (IAttribute   att:source.getAttributes()) {
                                                            System.out.println("Please do not read m

Phase 1            Phase 2
Pattern Matching   Semantic Matching
Research Question #2



                                  The Dilemma
                       How to distribute the 100 milliseconds between
                                           phases?

                                   0      25       50      75       100




                                    Pattern Matching    Semantic Matching


                                         [Keivanloo, WCRE’11]
Our Further Analysis [WCRE’11]
Requirements
                       •   100 Milliseconds
                       •   Millions LOC
                       •   Precision
                       •   Recall
                       •   Type-1, 2, 3…

SeClone [ICPC’11]
                       O ( p * log n )

Data Characteristics

Constraints

The Dilemma
                       0     25       50      75        100
                       Pattern Matching    Semantic Matching
Source Code Characteristics




Analysis of the Data Characteristics:
          Dataset preparation

• Name: IJaDataset
   – Comprehensive (Inter-project)
       • To avoid project-specific result
   – ~18,000 Projects
   – 1,500,000 unique Java classes
       • No duplicate, empty, buggy file
   – ~300 MLOC


• online at   http://aseg.cs.concordia.ca/seclone
Analysis of the Data Characteristics:
           Granularity Effect
• Three Level Similarity (TLS): Set of similar three-line fragments
• First Level Similarity (FLS): single-line patterns
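
The difference between the two granularities can be sketched as follows. This is an illustrative toy, assuming that lines have already been normalized into patterns and that a group is simply the set of positions sharing the same key; the class, method, and sample data are hypothetical and are not taken from the SeClone implementation.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class GranularityDemo {
    // Groups every window of `size` consecutive lines by its joined text.
    // size = 1 corresponds to the FLS view, size = 3 to the TLS view.
    static Map<String, List<Integer>> groupWindows(List<String> lines, int size) {
        Map<String, List<Integer>> groups = new HashMap<>();
        for (int i = 0; i + size <= lines.size(); i++) {
            String key = String.join("\n", lines.subList(i, i + size));
            groups.computeIfAbsent(key, k -> new ArrayList<>()).add(i);
        }
        return groups;
    }

    // Hypothetical, already-normalized sample input.
    static List<String> sampleLines() {
        return List.of(
            "Window w=new Window();", "w.trigger(in);", "w.flush(out);",
            "Window w=new Window();", "w.trigger(in);", "w.close();",
            "Window w=new Window();", "w.trigger(in);", "w.flush(out);");
    }

    public static void main(String[] args) {
        List<String> lines = sampleLines();
        Map<String, List<Integer>> fls = groupWindows(lines, 1);
        Map<String, List<Integer>> tls = groupWindows(lines, 3);
        // Single-line keys give a few large groups; three-line keys give
        // more groups with fewer members each.
        System.out.println("FLS groups: " + fls.size());
        System.out.println("TLS groups: " + tls.size());
    }
}
```

On this sample the single-line view yields 4 groups with up to 3 members, while the three-line view yields 6 groups with at most 2 members.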




Analysis of the Data Characteristics:
            Clone frequency
• How many code fragments are analyzed by
  each query?

• Answer: 3 (Average)




Analysis of the Data Characteristics:
            Clone frequency
• Observation result:
  – TLS distributes the candidates into 3.9 times more groups than FLS
  – TLS groups are 6 times smaller than FLS groups




Analysis of the Data Characteristics:
            Clone frequency
• Conclusion:
  – TLS heuristic is practical for real-time clone search,
    as long as the outliers are handled properly
  – Why?
     • (1) each TLS group has 2.37 members on average
     • (2) it distributes candidates in small-size groups
     • (3) for each query, only one group must be evaluated




What Does an Outlier Look Like?
• Outlier Definition: patterns with more than 2,000 occurrences
• Observation result:
       • Only ~1000 patterns out of 30M
       • ~ 0.01% patterns
       • Mostly insignificant code patterns
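
The outlier definition above can be sketched as a simple frequency filter. This is only an illustration under stated assumptions: the 2,000-occurrence threshold comes from the slide, while the class name, method names, and the idea of holding the counts in an in-memory map are hypothetical.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class OutlierFilter {
    // Threshold from the slide: more than 2,000 occurrences.
    static final int OUTLIER_THRESHOLD = 2_000;

    // Counts how often each (already normalized) pattern occurs.
    static Map<String, Integer> countPatterns(List<String> patterns) {
        Map<String, Integer> counts = new HashMap<>();
        for (String p : patterns) {
            counts.merge(p, 1, Integer::sum);
        }
        return counts;
    }

    // Returns the patterns whose frequency exceeds the threshold.
    static Set<String> outliers(Map<String, Integer> counts) {
        Set<String> result = new TreeSet<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() > OUTLIER_THRESHOLD) {
                result.add(e.getKey());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical corpus: one very frequent, insignificant pattern
        // and one rare, meaningful pattern.
        List<String> corpus = new java.util.ArrayList<>(Collections.nCopies(2_001, "}"));
        corpus.addAll(Collections.nCopies(3, "w.trigger(in);"));
        System.out.println(outliers(countPatterns(corpus)));
    }
}
```

A search engine could then treat the returned patterns separately, for example by skipping or capping them, so that they do not dominate query time.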




Analysis of the Data Characteristics:
           Sampling efficiency
• Can sampling be used to reduce the amount
  of data being analyzed?

• Answer: Yes (e.g., 33% contains 91% of popular patterns)




Analysis of the Data Characteristics:
                Indexing
• Can 32-bit hash keys (versus MD5) be used
  without affecting index quality?

   abc  123                                           abc  123
   aXc  456                                           aXc  123



• Answer: Yes       0.002% error rate
          Only 10 cases where three distinct strings share the same key
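
The trade-off can be tried out with a small sketch. The slides do not say which 32-bit hash function was used, so this example assumes Java's built-in String.hashCode() as a stand-in and compares it with a 128-bit MD5 key; the class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Collection;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class HashKeyDemo {
    // Assumption: String.hashCode() stands in for the 32-bit hash.
    static int key32(String pattern) {
        return pattern.hashCode();
    }

    // 128-bit MD5 key, rendered as 32 hex characters.
    static String keyMd5(String pattern) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(pattern.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    // Number of distinct patterns that share a 32-bit key with another pattern.
    static int collidingPatterns(Collection<String> patterns) {
        Map<Integer, Set<String>> byKey = new HashMap<>();
        for (String p : patterns) {
            byKey.computeIfAbsent(key32(p), k -> new HashSet<>()).add(p);
        }
        int colliding = 0;
        for (Set<String> bucket : byKey.values()) {
            if (bucket.size() > 1) {
                colliding += bucket.size();
            }
        }
        return colliding;
    }

    public static void main(String[] args) {
        // "Aa" and "BB" are a well-known String.hashCode() collision.
        List<String> patterns = List.of("Aa", "BB", "w.flush(o);");
        System.out.println("32-bit collisions: " + collidingPatterns(patterns));
        System.out.println("MD5 keys differ: " + !keyMd5("Aa").equals(keyMd5("BB")));
    }
}
```

Dividing the result of collidingPatterns by the number of distinct patterns gives an error rate comparable in spirit to the one reported on the slide, although the actual figure depends on the hash function and the corpus.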
Are Method Names Reliable?
• Input Data: Koders 1-year query log
   – ~10M records
• Observation purpose:
   – Importance of method names
• Observation result:
   – 98% success rate vs. 69%
• Result interpretation:
   – Method names in this context are a reliable source of information
   – They must be preserved to increase precision



Source Code Search Framework




Internet-scale Real-time Code Clone
   Search via Multi-level Indexing




– Internet-scale & Speed
    • 32-bit Hash values
– Type-3 clone
    • Multi-level indexing
– Customized for Internet-scale Code Search
    • Special transformation rule



Response Time (Pattern Matching)
                       [WCRE’11]



• Regular queries
  – 25 microseconds




• 99.99% queries
  – 900 microseconds

Conclusion




Answer:
       Research Question #1



Is Internet-scale Real-time Code Search
                Possible?



               YES
Answer:
            Research Question #2

                 The Dilemma
How to distribute the 100 milliseconds between phases?

                       Answer:
                   0     25       50      75       100




                   Pattern Matching    Semantic Matching




            1 millisecond                      99 milliseconds
Research Opportunity



         0       25        50       75    100




  Pattern Matching    Semantic Matching

                99 milliseconds

                     Analysis
Summary
                                Step 1
• Studied characteristics of source code on the Internet
    –   Unique pattern distribution (sampling application)
    –   Pattern frequencies (multi-level search)
    –   32-bit hashing strength (code pattern)
    –   Outlier patterns
    –   Method name importance

                               Step 2
• Designed an Internet-scale clone search
    –   Customized for code search (precision)
    –   Fine granularity
    –   Multi-level Indexing approach (Type-3 clone)
    –   Microsecond range response time (up to 10 times faster)


Publication
    Code Clone Search and Detection (http://aseg.cs.concordia.ca/seclone/)
•   Iman Keivanloo, Juergen Rilling, Philippe Charland. Internet-scale Real-time Code Clone Search via Multi-level
    Indexing. 18th Working Conference on Reverse Engineering (WCRE 2011), Lero, Limerick, Ireland.
•   Iman Keivanloo, Juergen Rilling, Philippe Charland. SeClone – A Hybrid Approach to Internet-Scale Real-Time Code
    Clone Search. 19th IEEE International Conference on Program Comprehension (ICPC 2011), Kingston, Ontario,
    Canada.


    Source Code Sharing using Linked Data (secold.org)
•   Iman Keivanloo, Chris Forbes, Juergen Rilling, and Philippe Charland, "Towards Sharing Source Code Facts Using
    Linked Data," ICSE Workshop on Search-Driven Development: Users, Infrastructure, Tools and Evaluation (SUITE).
    2011.



    Source Code Search (http://aseg.cs.concordia.ca/codesearch)
•   Iman Keivanloo, Laleh Roostapour, Philipp Schugerl, Juergen Rilling. Semantic Web-based Source Code Search. 6th
    International Workshop on Semantic Web Enabled Software Engineering (SWESE 2010), June 35, San Francisco,
    USA.
•   Iman Keivanloo, Laleh Roostapour, Philipp Schugerl, Juergen Rilling. SE-CodeSearch: A Scalable Semantic Web-
    based Source Code Search Infrastructure. 26th IEEE International Conference on Software Maintenance (ICSM 2010),
    Early Research Achievements (ERA) Track, Sept. 12-18, Timișoara, Romania.




Thank you for your kind attention

           QUESTIONS?

PhD Seminar
Computer Science and Software Engineering Department   50
November-17-2011

Más contenido relacionado

La actualidad más candente

Advance Java Programs skeleton
Advance Java Programs skeletonAdvance Java Programs skeleton
Advance Java Programs skeletonIram Ramrajkar
 
Final JAVA Practical of BCA SEM-5.
Final JAVA Practical of BCA SEM-5.Final JAVA Practical of BCA SEM-5.
Final JAVA Practical of BCA SEM-5.Nishan Barot
 
IDSECCONF2013 CTF online Write Up
IDSECCONF2013 CTF online Write Up IDSECCONF2013 CTF online Write Up
IDSECCONF2013 CTF online Write Up idsecconf
 
Pythia Reloaded: An Intelligent Unit Testing-Based Code Grader for Education
Pythia Reloaded: An Intelligent Unit Testing-Based Code Grader for EducationPythia Reloaded: An Intelligent Unit Testing-Based Code Grader for Education
Pythia Reloaded: An Intelligent Unit Testing-Based Code Grader for EducationECAM Brussels Engineering School
 
sizeof(Object): how much memory objects take on JVMs and when this may matter
sizeof(Object): how much memory objects take on JVMs and when this may mattersizeof(Object): how much memory objects take on JVMs and when this may matter
sizeof(Object): how much memory objects take on JVMs and when this may matterDawid Weiss
 
03 standard class library
03 standard class library03 standard class library
03 standard class libraryeleksdev
 
Core java pract_sem iii
Core java pract_sem iiiCore java pract_sem iii
Core java pract_sem iiiNiraj Bharambe
 
C# Starter L04-Collections
C# Starter L04-CollectionsC# Starter L04-Collections
C# Starter L04-CollectionsMohammad Shaker
 
Advanced Java - Praticals
Advanced Java - PraticalsAdvanced Java - Praticals
Advanced Java - PraticalsFahad Shaikh
 
How Data Flow analysis works in a static code analyzer
How Data Flow analysis works in a static code analyzerHow Data Flow analysis works in a static code analyzer
How Data Flow analysis works in a static code analyzerAndrey Karpov
 

La actualidad más candente (20)

Advance Java Programs skeleton
Advance Java Programs skeletonAdvance Java Programs skeleton
Advance Java Programs skeleton
 
Final JAVA Practical of BCA SEM-5.
Final JAVA Practical of BCA SEM-5.Final JAVA Practical of BCA SEM-5.
Final JAVA Practical of BCA SEM-5.
 
IDSECCONF2013 CTF online Write Up
IDSECCONF2013 CTF online Write Up IDSECCONF2013 CTF online Write Up
IDSECCONF2013 CTF online Write Up
 
Lab4
Lab4Lab4
Lab4
 
Spock framework
Spock frameworkSpock framework
Spock framework
 
Ad java prac sol set
Ad java prac sol setAd java prac sol set
Ad java prac sol set
 
Pythia Reloaded: An Intelligent Unit Testing-Based Code Grader for Education
Pythia Reloaded: An Intelligent Unit Testing-Based Code Grader for EducationPythia Reloaded: An Intelligent Unit Testing-Based Code Grader for Education
Pythia Reloaded: An Intelligent Unit Testing-Based Code Grader for Education
 
sizeof(Object): how much memory objects take on JVMs and when this may matter
sizeof(Object): how much memory objects take on JVMs and when this may mattersizeof(Object): how much memory objects take on JVMs and when this may matter
sizeof(Object): how much memory objects take on JVMs and when this may matter
 
Java 7 & 8 New Features
Java 7 & 8 New FeaturesJava 7 & 8 New Features
Java 7 & 8 New Features
 
03 standard class library
03 standard class library03 standard class library
03 standard class library
 
Hack ASP.NET website
Hack ASP.NET websiteHack ASP.NET website
Hack ASP.NET website
 
Core java pract_sem iii
Core java pract_sem iiiCore java pract_sem iii
Core java pract_sem iii
 
Java Programming - 03 java control flow
Java Programming - 03 java control flowJava Programming - 03 java control flow
Java Programming - 03 java control flow
 
C# Starter L04-Collections
C# Starter L04-CollectionsC# Starter L04-Collections
C# Starter L04-Collections
 
Java Programming - 06 java file io
Java Programming - 06 java file ioJava Programming - 06 java file io
Java Programming - 06 java file io
 
Java Concurrency by Example
Java Concurrency by ExampleJava Concurrency by Example
Java Concurrency by Example
 
Advanced Java - Praticals
Advanced Java - PraticalsAdvanced Java - Praticals
Advanced Java - Praticals
 
How Data Flow analysis works in a static code analyzer
How Data Flow analysis works in a static code analyzerHow Data Flow analysis works in a static code analyzer
How Data Flow analysis works in a static code analyzer
 
Sam wd programs
Sam wd programsSam wd programs
Sam wd programs
 
DCN Practical
DCN PracticalDCN Practical
DCN Practical
 

Destacado

Projects Presentation 2009 2010, Aiesec
Projects Presentation 2009 2010, AiesecProjects Presentation 2009 2010, Aiesec
Projects Presentation 2009 2010, Aiesecguest7f7d5c3
 
엔지니어의 꿈 Fmt 최종본
엔지니어의 꿈   Fmt 최종본엔지니어의 꿈   Fmt 최종본
엔지니어의 꿈 Fmt 최종본영범 정
 
MeCC: Memory Comparison-based Code Clone Detector
MeCC: Memory Comparison-based Code Clone DetectorMeCC: Memory Comparison-based Code Clone Detector
MeCC: Memory Comparison-based Code Clone Detector영범 정
 
An Execution-Semantic and Content-and-Context-Based Code-Clone Detection and ...
An Execution-Semantic and Content-and-Context-Based Code-Clone Detection and ...An Execution-Semantic and Content-and-Context-Based Code-Clone Detection and ...
An Execution-Semantic and Content-and-Context-Based Code-Clone Detection and ...Kamiya Toshihiro
 
The impact of supercomputers on MSR
The impact of supercomputers on MSRThe impact of supercomputers on MSR
The impact of supercomputers on MSRYasutaka Kamei
 
Introducing Parameter Sensitivity to Dynamic Code-Clone Analysis Methods
Introducing Parameter Sensitivity to Dynamic Code-Clone Analysis MethodsIntroducing Parameter Sensitivity to Dynamic Code-Clone Analysis Methods
Introducing Parameter Sensitivity to Dynamic Code-Clone Analysis MethodsKamiya Toshihiro
 
31911477 internet-banking-project-documentation
31911477 internet-banking-project-documentation31911477 internet-banking-project-documentation
31911477 internet-banking-project-documentationSwaroop Mane
 
SYNOPSIS ON BANK MANAGEMENT SYSTEM
SYNOPSIS ON BANK MANAGEMENT SYSTEMSYNOPSIS ON BANK MANAGEMENT SYSTEM
SYNOPSIS ON BANK MANAGEMENT SYSTEMNitish Xavier Tirkey
 
java Project report online banking system
java Project report online banking systemjava Project report online banking system
java Project report online banking systemVishNu KuNtal
 
Internet banking - College Project
Internet banking - College ProjectInternet banking - College Project
Internet banking - College ProjectSheril Daniel
 
Internet Banking
Internet BankingInternet Banking
Internet Bankingsnehateddy
 
Reuters: Pictures of the Year 2016 (Part 2)
Reuters: Pictures of the Year 2016 (Part 2)Reuters: Pictures of the Year 2016 (Part 2)
Reuters: Pictures of the Year 2016 (Part 2)maditabalnco
 
The Six Highest Performing B2B Blog Post Formats
The Six Highest Performing B2B Blog Post FormatsThe Six Highest Performing B2B Blog Post Formats
The Six Highest Performing B2B Blog Post FormatsBarry Feldman
 
The Outcome Economy
The Outcome EconomyThe Outcome Economy
The Outcome EconomyHelge Tennø
 

Destacado (15)

Projects Presentation 2009 2010, Aiesec
Projects Presentation 2009 2010, AiesecProjects Presentation 2009 2010, Aiesec
Projects Presentation 2009 2010, Aiesec
 
엔지니어의 꿈 Fmt 최종본
엔지니어의 꿈   Fmt 최종본엔지니어의 꿈   Fmt 최종본
엔지니어의 꿈 Fmt 최종본
 
MeCC: Memory Comparison-based Code Clone Detector
MeCC: Memory Comparison-based Code Clone DetectorMeCC: Memory Comparison-based Code Clone Detector
MeCC: Memory Comparison-based Code Clone Detector
 
An Execution-Semantic and Content-and-Context-Based Code-Clone Detection and ...
An Execution-Semantic and Content-and-Context-Based Code-Clone Detection and ...An Execution-Semantic and Content-and-Context-Based Code-Clone Detection and ...
An Execution-Semantic and Content-and-Context-Based Code-Clone Detection and ...
 
The impact of supercomputers on MSR
The impact of supercomputers on MSRThe impact of supercomputers on MSR
The impact of supercomputers on MSR
 
Introducing Parameter Sensitivity to Dynamic Code-Clone Analysis Methods
Introducing Parameter Sensitivity to Dynamic Code-Clone Analysis MethodsIntroducing Parameter Sensitivity to Dynamic Code-Clone Analysis Methods
Introducing Parameter Sensitivity to Dynamic Code-Clone Analysis Methods
 
31911477 internet-banking-project-documentation
31911477 internet-banking-project-documentation31911477 internet-banking-project-documentation
31911477 internet-banking-project-documentation
 
SYNOPSIS ON BANK MANAGEMENT SYSTEM
SYNOPSIS ON BANK MANAGEMENT SYSTEMSYNOPSIS ON BANK MANAGEMENT SYSTEM
SYNOPSIS ON BANK MANAGEMENT SYSTEM
 
Internet banking
Internet bankingInternet banking
Internet banking
 
java Project report online banking system
java Project report online banking systemjava Project report online banking system
java Project report online banking system
 
Internet banking - College Project
Internet banking - College ProjectInternet banking - College Project
Internet banking - College Project
 
Internet Banking
Internet BankingInternet Banking
Internet Banking
 
Reuters: Pictures of the Year 2016 (Part 2)
Reuters: Pictures of the Year 2016 (Part 2)Reuters: Pictures of the Year 2016 (Part 2)
Reuters: Pictures of the Year 2016 (Part 2)
 
The Six Highest Performing B2B Blog Post Formats
The Six Highest Performing B2B Blog Post FormatsThe Six Highest Performing B2B Blog Post Formats
The Six Highest Performing B2B Blog Post Formats
 
The Outcome Economy
The Outcome EconomyThe Outcome Economy
The Outcome Economy
 

Similar a Source Code Clone Search (Iman keivanloo PhD seminar)

Why Windows 8 drivers are buggy
Why Windows 8 drivers are buggyWhy Windows 8 drivers are buggy
Why Windows 8 drivers are buggyAndrey Karpov
 
Checking Clang 11 with PVS-Studio
Checking Clang 11 with PVS-StudioChecking Clang 11 with PVS-Studio
Checking Clang 11 with PVS-StudioAndrey Karpov
 
[RAT資安小聚] Study on Automatically Evading Malware Detection
Source Code Clone Search (Iman Keivanloo PhD seminar)

  • 1. Internet-scale Source Code Search and Analysis Framework Iman Keivanloo Advisor: Dr. Juergen Rilling PhD Seminar Computer Science and Software Engineering Department November-17-2011
  • 2. Agenda • Research Context • Major questions & answers • Next step • Conclusion • Time Table
  • 3. Research Context: Internet-Scale Code Search “is searching the Internet for source code to help solve a software development problem” [Gallardo, SUITE’09]
  • 4. How to search for Source Code? • Free-form Query: – “how to write into file in Java” • Structural Query: – “select col1 from table1 where col1=“%write” [Keivanloo, SUITE’11] [Keivanloo, ICSM’10]
  • 5. Research Focus: Similar Fragment Search
    The ideal expected answer: XMLReadFile inFile=new XMLReadFile(“kb.xml”); Window myWindow=new Window(); myWindow.trigger(inFile); OutputStream result=new OutputStream(); myWindow.flush(result);
    Suggested simplified query: Select line which has (1) a method call statement on the trigger method.
    Step 1: Input [the simplified structural query] to the Internet-Scale Structural Code Search Engine. Results:
      – 59: Event e=new Event(50); 60: e.trigger(); 61: e.update();
      – 11: CSVReadFile csvData=new CSVReadFile(“input.csv”); 12: myWindow.trigger(csvData); 13: OutputStream o=new OutputStream(); (This line looks like a match, however it uses .CSV instead of .XML. We can now use our clone search engine to find other code fragments similar to this one.)
      – 133: Listener res=new Listener(); 134: res.trigger(“warm-up”); 135: res.close();
    Step 2: Input [the selected fragment in the first step and its target line (red)] to the Real-time Clone Search Engine. Results:
      – 10: Window myWindow=new Window(); 11: CSVReadFile csvData=new CSVReadFile(“... 12: myWindow.trigger(csvData); 13: OutputStream o=new OutputStream(); 14: myWindow.flush(o); 15: myWindow.close(); (Core from the 1:1 match)
      – 55: Window r=new Window(); 56: long timestamp=System.Now(); 57: System.out.println(“Start reasoning...”); 58: XMLStream xmldata=new XMLStream(io); 59: r.trigger(xmldata); 60: OutputStream o=new OutputStream(); 61: r.flush(o); (Gapped clone: the pattern is similar but it uses XMLStream instead of XMLFile as the input)
      – 89: Window var=new Window(); 90: XMLReadFile r=new XMLReadFile (“k.xml”); 91: OutputStream o=new OutputStream(); 92: var.trigger(r); 93: var.flush(o); (Unordered: this match is acceptable, even if the order is different)
  • 7. The Web Search Challenge
  • 8. But Often Still Fail to Deliver the Expected Results After 10 Years of Research
  • 10. Early Conclusion: Source Code Search is similar to Web Search
  • 11. Early Conclusion: Source Code Search is similar to Web Search. Search: 1. Search techniques = ? Analysis (Ambiguity resolution): 2. Ambiguity resolution techniques = Code Analysis
  • 12. Research Approach Overview: Internet-scale Source Code Search and Analysis Framework. Search: Code Clone Search. Analysis: Semantic Web-based Code Analysis
  • 14. Clone (Source Code Clone) • Similar code fragments. Fragment A: for (AttributeEntity theAttributeEntity:aTableEntity.ge… System.out.println(“Hello!"); Fragment B: for (AttributeEntity theAttributeEntity:aTableEntity.ge… System.out.println(“Hello!"); • Type 1: Identical except whitespaces … • Type 2: Identical except variable names ... • Type 3: Identical except a few missing… • Type 4: Similar functionality [Roy, C. K., Cordy, J. R., & Koschke, R. (2009). Comparison and evaluation of code clone detection techniques and tools: A qualitative approach. Science of Computer Programming, 2009.]
  • 15. Clone Search. Query: for (Attribute attribute:exampleSet.getAttributes()) System.out.println(“Hello!"); Code Database: for (Attribute attribute:es1.getAttributes()) System.out.println(“Test"); for (IAttribute att:source.getAttributes()) { System.out.println("Please do not read me"); for (JAttribute attribute:formType.getAttributes()) System.out.println(“Test");
  • 16. Clone Search Answer Query
  • 17. Internet-scale Clone Search. Query: for (Attribute attribute:exampleSet.getAttributes()) System.out.println(“Hello!");
  • 18. Internet-scale Real-time Clone Search
  • 19. Internet-scale Real-time Clone Search Requirements?
  • 20. Internet-scale Real-time Clone Search Requirements: Millions LOC ~ 300 MLOC
  • 21. Internet-scale Real-time Clone Search Requirements: • Millions LOC • 100 Milliseconds
  • 22. Internet-scale Real-time Clone Search for (IAttribute att:source.getAttributes()) { System.out.println("Please do not read me"); for (JAttribute attribute:formType.getAttributes()) System.out.println(“Test"); Requirements: • Precision • Recall • Millions LOC • 100 Milliseconds • Type-1, 2, 3…
  • 23. Internet-scale Real-time Clone Search Requirements: • Precision • Recall • Millions LOC • 100 Milliseconds • Type-1, 2, 3…
  • 24. Research Question #1 Real-time answer (faster than 100 ms) Is it actually possible?
  • 25. Our Initial Analysis • SeClone: An Internet-scale Real-time Clone Search Engine [Figure: Search, Analysis; Phase 1, Phase 2] [Keivanloo, ICPC’11]
  • 26. Inside SeClone Phase 1 • Syntactical Pattern matching [Figure: Phase 1: Pattern Matching; Phase 2]
  • 27. Inside SeClone Phase 2 • Information Retrieval & Clustering algorithm [Figure: Phase 1: Pattern Matching; Phase 2: Semantic Matching] 1 for (Attribute attribute:exampleSet.getAttribute System.out.println(“The end"); 2 for (Attribute attribute:es1.getAttributes()) System.out.println(“Test"); 3 for (AttributeEntity theAttributeEntity:aTable System.out.println(“Hello!"); 4 for (JAttribute attribute:formType.getAttribute System.out.println(“Test"); 5 for (IAttribute att:source.getAttributes()) { System.out.println("Please do not read m
  • 28. Research Question #2 The Dilemma: How to distribute the 100 milliseconds between phases? [Figure: timeline from 0 to 100 ms shared by Pattern Matching and Semantic Matching] [Keivanloo, WCRE’11]
  • 29. Our Further Analysis [WCRE’11] Requirements: • 100 Milliseconds • Millions LOC • Precision • Recall • Type-1, 2, 3… The Dilemma [Figure: timeline from 0 to 100 ms shared by Pattern Matching and Semantic Matching] Constraints. SeClone [ICPC’11]: O(p * log n). Data Characteristics
  • 31. Analysis of the Data Characteristics: Dataset preparation • Name: IJaDataset – Comprehensive (Inter-project) • To avoid project-specific results – ~18,000 Projects – 1,500,000 unique Java classes • No duplicate, empty, or buggy files – ~300 MLOC • Online at http://aseg.cs.concordia.ca/seclone
  • 32. Analysis of the Data Characteristics: Granularity Effect • Three Level Similarity (TLS): Set of similar three-line fragments • First Level Similarity (FLS): single-line patterns
  • 33. Analysis of the Data Characteristics: Clone frequency • How many code fragments are analyzed by each query? • Answer: 3 (Average)
  • 34. Analysis of the Data Characteristics: Clone frequency • Observation result: – TLS distributes the candidates into 3.9 times more groups – Its group size is 6 times smaller than FLS
  • 35. Analysis of the Data Characteristics: Clone frequency • Conclusion: – TLS heuristic is practical for real-time clone search, as long as the outliers are handled properly – Why? • (1) each TLS group has 2.37 members on average • (2) it distributes candidates in small-size groups • (3) for each query, only one group must be evaluated
  • 36. What Does an Outlier Look Like? • Outlier Definition: patterns with more than 2,000 occurrences • Observation result: • Only ~1000 patterns out of 30M • ~ 0.01% patterns • Mostly insignificant code patterns
  • 37. Analysis of the Data Characteristics: Sampling efficiency • Can sampling be used to reduce the amount of data being analyzed? • Answer: Yes (e.g., 33% contains 91% of popular patterns)
  • 38. Analysis of the Data Characteristics: Indexing • Can 32-bit hash keys (versus MD5) be used without affecting index quality? Without collision: abc → 123, aXc → 456. With collision: abc → 123, aXc → 123. • Answer: Yes. 0.002% error rate; only 10 cases where three distinct strings share the same key.
  • 39. Are Method Names Reliable? • Input Data: Koders 1-year query log – ~10M records • Observation purpose: – Importance of method names • Observation result: – 98% success rate vs. 69% • Result interpretation: – Method names in this context are a reliable source of information – They must be preserved to increase precision
  • 40. Source Code Search Framework
  • 41. Internet-scale Real-time Code Clone Search via Multi-level Indexing – Internet-scale & Speed • 32-bit Hash values – Type-3 clone • Multi-level indexing – Customized for Internet-scale Code Search • Special transformation rule
  • 42. Response Time (Pattern Matching) [WCRE’11] • Regular queries – 25 microseconds • 99.99% queries – 900 microseconds
  • 44. Answer: Research Question #1 Is Internet-scale Real-time Code Search Possible? YES
  • 45. Answer: Research Question #2 The Dilemma: How to distribute the 100 milliseconds between phases? Answer: Pattern Matching 1 millisecond, Semantic Matching 99 milliseconds
  • 46. Research Opportunity [Figure: timeline from 0 to 100 ms; Pattern Matching, then Semantic Matching (99 milliseconds)] Analysis
  • 47. Summary Step 1 • Studied characteristics of source code on the Internet – Unique pattern distribution (sampling application) – Pattern frequencies (multi-level search) – 32-bit hashing strength (code pattern) – Outlier patterns – Method name importance Step 2 • Designed an Internet-scale clone search – Customized for code search (precision) – Fine granularity – Multi-level indexing approach (Type-3 clone) – Microsecond range response time (up to 10 times faster)
  • 48. Publication Code Clone Search and Detection (http://aseg.cs.concordia.ca/seclone/) • Iman Keivanloo, Juergen Rilling, Philippe Charland. Internet-scale Real-time Code Clone Search via Multi-level Indexing. 18th Working Conference on Reverse Engineering (WCRE 2011), Lero, Limerick, Ireland. • Iman Keivanloo, Juergen Rilling, Philippe Charland. SeClone – A Hybrid Approach to Internet-Scale Real-Time Code Clone Search. 19th IEEE International Conference on Program Comprehension (ICPC 2011), Kingston, Ontario, Canada. Source Code Sharing using Linked Data (secold.org) • Iman Keivanloo, Chris Forbes, Juergen Rilling, and Philippe Charland, "Towards Sharing Source Code Facts Using Linked Data," ICSE Workshop on Search-Driven Development: Users, Infrastructure, Tools and Evaluation (SUITE). 2011. Source Code Search (http://aseg.cs.concordia.ca/codesearch) • Iman Keivanloo, Laleh Roostapour, Philipp Schugerl, Juergen Rilling. Semantic Web-based Source Code Search. 6th International Workshop on Semantic Web Enabled Software Engineering (SWESE 2010), June 35, San Francisco, USA. • Iman Keivanloo, Laleh Roostapour, Philipp Schugerl, Juergen Rilling. SE-CodeSearch: A Scalable Semantic Web-based Source Code Search Infrastructure. 26th IEEE International Conference on Software Maintenance (ICSM), Early Research Achievements (ERA) Track, Sept. 12-18, Timișoara, Romania.
  • 49. Thank you for your kind attention. Questions?
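
To make the three-line (TLS) indexing idea from slides 32 to 38 and 41 concrete, here is a minimal sketch, not SeClone's actual implementation. It assumes a hypothetical class name (ThreeLineIndex), uses whitespace removal as a stand-in for the unspecified "special transformation rule", and uses Java's String.hashCode() as the 32-bit hash; the slides do not say which hash function or normalization the authors used. Each window of three consecutive lines is hashed to a 32-bit key, so a query only has to look at one group, and groups larger than a threshold (2,000 on slide 36) can be flagged as outliers.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of a three-line pattern index with 32-bit keys. */
final class ThreeLineIndex {
    // 32-bit key -> locations ("file:firstLineIndex") of the windows with that key.
    private final Map<Integer, List<String>> groups = new HashMap<>();
    private final int outlierThreshold;

    ThreeLineIndex(int outlierThreshold) {
        this.outlierThreshold = outlierThreshold;
    }

    /** Assumed normalization: drop all whitespace so layout differences are ignored. */
    static String normalize(String line) {
        return line.replaceAll("\\s+", "");
    }

    /** 32-bit key of three consecutive lines; String.hashCode() is an assumption. */
    static int key(String a, String b, String c) {
        return (normalize(a) + "\n" + normalize(b) + "\n" + normalize(c)).hashCode();
    }

    /** Indexes every window of three consecutive lines of one file. */
    void addFile(String file, List<String> lines) {
        for (int i = 0; i + 2 < lines.size(); i++) {
            int k = key(lines.get(i), lines.get(i + 1), lines.get(i + 2));
            groups.computeIfAbsent(k, unused -> new ArrayList<>()).add(file + ":" + i);
        }
    }

    /** Returns the single group that matches a three-line query (possibly empty). */
    List<String> search(String a, String b, String c) {
        return groups.getOrDefault(key(a, b, c), Collections.<String>emptyList());
    }

    /** True when the query's group exceeds the outlier threshold. */
    boolean isOutlier(String a, String b, String c) {
        return search(a, b, c).size() > outlierThreshold;
    }
}
```

Because two different windows can share a 32-bit key, a real engine would presumably still compare the candidates in a group against the query before reporting them; that step is omitted here.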

Editor's notes

  1. Use of method names in queries resulted in a 98% "click rate" vs. 68% for queries without method names.
  2. http://dbpedia.org/sparql?default-graph-uri=http%3A%2F%2Fdbpedia.org&query=SELECT+%3Ftitle%0D%0AWHERE+{%0D%0A++++%3Fgame+%3Chttp%3A%2F%2Fpurl.org%2Fdc%2Fterms%2Fsubject%3E+%3Chttp%3A%2F%2Fdbpedia.org%2Fresource%2FCategory%3AFirst-person_shooters%3E+.%0D%0A++++%3Fgame+foaf%3Aname+%3Ftitle+.%0D%0A}%0D%0Alimit+3&debug=on&timeout=&format=text%2Fhtml&save=display&fname=