SlideShare una empresa de Scribd logo
1 de 36
Descargar para leer sin conexión
Nikos Katirtzis
Software Engineer @ Hotels.com
Improving your team’s source code
searching capabilities
Educational Background
� Meng in Electrical and Computer Engineering (Aristotle University of Thessaloniki)
� MSc in Computer Science (University of Edinburgh)
Working Experience
� Software Engineer at Hotels.com (Expedia Group)
• Part of the team that’s responsible for user authentication and identification (~2 years).
• Recently joined a team that’s exploring and evaluating new technologies.
Projects/Interests
� Developed Mantissa, a TDD code search engine, and CLAMS, an approach for mining API usage
examples from client source code.
� Particularly interested in source code searching/mining.
Who am I?
Part 1 – Searching for source code
• Why you need a source code search engine
• Overview and comparison between the most
popular code search engines
• Recommendations and what you need to
consider
• Recent advances
Presentation structure
Part 2 – Searching for API usage examples
• HApiDoc: A service that mines API usage
examples from client source code
• CLAMS or behind the scenes of HApiDoc
PART 1
Searching for source code
Monoliths are dead, long live microservices!
A monolithic application puts all its
functionality into a single process...
... and scales by replicating the monolith on
multiple servers.
A microservices architecture puts each
element of functionality into a separate
service...
... and scales by distributing these services
across servers, replicating as needed.
Source: https://martinfowler.com/articles/microservices.html
Monoliths are dead, long live microservices?
Monoliths are dead, long live microservices.
I can’t find
the code I’m
looking for!
We need to
buy him a code
search engine.
Why you need a source code search engine?
A. 0
How many searches does the average developer perform on
an internal code search engine on a typical weekday?
B. 1-2
C. 5-10 D. >10
Source: Sadowski, Caitlin, Kathryn T. Stolee, and Sebastian Elbaum. "How developers search for code: a case study."
Proceedings of the 2015 10th Joint Meeting on Foundations of Software Engineering. ACM, 2015. Available at https://research.google.com/pubs/pub43835.html.
“Software engineering is more about reading code than writing it, and part of this process is
finding the code that you should read”. (Han-Wen Nienhuys - author of Zoekt)
� Understand code dependencies in order to avoid breaking changes.
� Fix production issues faster by locating the root cause.
� Find references of hosts/code that will be deprecated.
� Avoid duplicating existing code.
� Share coding solutions and styles.
� Locate security problems (e.g. hardcoded keys/passwords).
Why you need a source code search engine?
Would you go to a souvlaki shop for fish?
Can’t I use my Git hosting service’s search?
� No partial/substring matching.
� Special characters are removed before indexing and are not allowed when searching.
� Case sensitive search not possible.
� No regex.
� Cannot configure or add any features since this is a solution that's integrated into GitHub
Enterprise.
� Best match is based on naive techniques (tf-idf).
� Too much unused space in the results page!
Can’t I use my Git hosting service’s search?
Popular CSEs
Hound
Zoekt
� Searchcode Server is a powerful code search engine with a sleek web user interface.
� Uses Lucene to index code and provides rich additional features.
� Developed by Ben Boyter.
� Originally a public source code search engine (searchcode.com), later the developer created
Searchcode Server which is the enterprise version.
Searchcode Server screenshot
Searchcode Server pros/cons
Pros Cons
 Consistent speed regardless of
search
✘ Limited partial/substring
matching
 User friendly filtering by
repo/user/language
✘ Does not support case-sensitive
matching
 Rich UI ✘ Special characters removed
 Relatively easy to setup,
maintain and monitor
✘ Does not fully support regex
 APIs ✘ Inconsistent search results
Hound
Source:. Rus Cox. “Regular Expression Matching with a Trigram Index or How Google Code Search Worked.” 2015. Available at
https://swtch.com/~rsc/regexp/regexp4.html.
� Hound is an open-source source code search engine which uses a static React frontend that
talks to a Go backend.
� Uses ngrams for indexing and matching.
� Created at Etsy by Kelly Norton and Jonathan Klein.
� Its core is based on Russ Cox’s “Regular Expression Matching with a Trigram Index or How
Google Code Search Worked” article and code.
Hound screenshot
Hound pros/cons
Pros Cons
 Supports regex and substring
searches
✘ Response time heavily depends
on query
 Case-sensitive and case-
insensitive matching
✘ Scalability issues
 File path and repo filtering ✘ Limited searching options
 APIs ✘ Limited monitoring capabilities
 Open-source ✘ Limited additional features
� Zoekt is an open-source fast trigram based source code search engine developed by
Google.
� Uses positional ngrams for indexing and matching.
� Developed at Google by Han-Wen Nienhuys.
� 10x faster than Hound, rich support for filtering.
Zoekt
Coverage Speed Approximate
queries
Filtering Ranking
Zoekt design principles
Zoekt screenshot
Zoekt pros/cons
Pros Cons
 Super fast search, consistent speed
regardless of search
✘ Poor UI
 Sophisticated design and approach for
indexing using position trigrams
✘ Limited monitoring
 Rich searching options, fully supports regex
and substring searches
✘ Limited automation around running
and deploying the service
 Easy to build features on top of it ✘ Limited additional features
 Open-source ✘ No APIs
� Sourcegraph is a fast, solid, full-featured code navigation engine with code intelligence
features by Sourcegraph.
� It leverages git grep to find code and uses Zoekt for indexed searches.
� Its language models implement the Language Server Protocol (LSP) to provide Code
Intelligence features.
� Developed by Sourcegraph, a company often referred to as the “Google for Code”.
Did you know?
• The open-source Sourcegraph browser extension adds code intelligence to files and diffs on
GitHub, GitHub Enterprise, Phabricator, and Bitbucket Server for free!
Sourcegraph demo
Sourcegraph pros/cons
Pros Cons
 Rich features inc. cross-
reference and semantic search
✘ Requires many resources to run
properly
 Excellent support (company
dedicated on that)
✘ Constantly changing price
 Excellent documentation around
setting it up and running it.
✘ Sourcegraph Enterprise price
per user
 Numerous plugins (e.g. for text
editors, IDEs, browsers)
 Core version free
Comparison
Searchcode
Server
Hound Zoekt Sourcegraph
Speed    *
Scalability    *
Searching
options    
Additional
features    
Maintainability    
Support    
License/Price    **
*Unless deployed to a cluster, the service is relatively slow when compared to its alternatives.
**The basic version of the software is free, but companies would need the Enterprise version for which there’s a cost per user. Basic version lacks indexed search and
cluster deployments.
Use:
� Searchcode Server: if you’re looking for a more generic search engine, that can be easily
maintained and monitored.
� Hound: No reason to use Hound instead of Zoekt.
� Zoekt: if you focus on speed, scalability, and search options.
� Sourcegraph: if you want to invest on a source code search engine and need additional
features such as code intelligence and integrations.
Recommendations
✍ Use a local SSD (instance store volume) since these
services constantly hit the disk.
✍ Use permanent storage to store the cloned repos. This
allows you to achieve near zero-downtime in case the
instance goes down.
✍ Make sure you understand and experiment when configuring the services, i.e.
in many services you’ll need to set limits for max file size,
max lines, etc.
✍ Monitor logs for errors.
✍ Remember, bad documents can always slow down your service!
Things to consider
EC2 with
instance store
index
Repos
Logs
EBS
✌ GitHub is experimenting with semantic code search1,2.
✌ Microsoft offers semantic code search for Azure repos in Azure DevOps Services and TFS3.
✌ Google offers fast code search for its Cloud Source Repositories. It’s searching options are
quite similar to Zoekt’s4. Plus Google’s Bazel code search.
✌ Sourcegraph becomes more and more popular by adding more languages to its Code
Intelligence feature (thanks to Microsoft’s Language Server Protocol) and by providing more
integrations and open-source browser extensions.
Recent advances
1 “Towards Natural Language Semantic Code Search”: https://githubengineering.com/towards-natural-language-semantic-code-search/
2 “GitHub experiments; semantic code search.”: https://experiments.github.com/semantic-code-search
3 “Search across all your code and work items”: https://docs.microsoft.com/en-us/azure/devops/project/search/overview?view=vsts&tabs=new-nav
4 “Searching for code”: https://cloud.google.com/source-repositories/docs/searching-code
PART 2
Searching for API usage examples
The Problem
� Lack of proper documentation for the APIs
� How can I use this API/API method?
� Creating API usage examples is time-consuming
The Concept
✌ What if we mine examples from client source code?
✌ Would be nice to cluster results
✌ And show the most indicative example(s) of each cluster
✌ And provide a summarised version of the most indicative
example(s)
CLAMS
HApiDoc Architecture
HApiDoc UI
Behind the scenes - CLAMS
✌ HApiDoc is getting open-sourced! Looking for contributors!
✌ Take a look at our GitHub space: https://github.com/HotelsDotCom
✌ Presentation material: https://github.com/nikos912000/voxxed-thes-material
✌ Using any other code search engines? Let us know!
PS: We’re hiring!
Closing Notes
Thank you!
nikos912000@notmail.com @nikos912000
www.linkedin.com/in/nkatirtzis nikos912000
1 Icon made by Gregor Cresnar from www.flaticon.com.
2 Icon made by Freepik from www.flaticon.com.
3 Icon made by Freepik from www.flaticon.com.
4 Icon made by Pixel Perfect from www.flaticon.com.
1 2
3 4

Más contenido relacionado

La actualidad más candente

Cloud Security Monitoring and Spark Analytics
Cloud Security Monitoring and Spark AnalyticsCloud Security Monitoring and Spark Analytics
Cloud Security Monitoring and Spark Analyticsamesar0
 
Vonk fhir facade (christiaan)
Vonk fhir facade (christiaan)Vonk fhir facade (christiaan)
Vonk fhir facade (christiaan)DevDays
 
Evaluating Recommended Applications
Evaluating Recommended ApplicationsEvaluating Recommended Applications
Evaluating Recommended Applicationsrsse2008
 
Webinar: Simpler Semantic Search with Solr
Webinar: Simpler Semantic Search with SolrWebinar: Simpler Semantic Search with Solr
Webinar: Simpler Semantic Search with SolrLucidworks
 
How to Manage Open Source requirements with AboutCode
How to Manage Open Source requirements with AboutCodeHow to Manage Open Source requirements with AboutCode
How to Manage Open Source requirements with AboutCodenexB Inc.
 
Best practice recommendations for utilizing open source software (from a lega...
Best practice recommendations for utilizing open source software (from a lega...Best practice recommendations for utilizing open source software (from a lega...
Best practice recommendations for utilizing open source software (from a lega...Rogue Wave Software
 
Moving into API documentation writing
Moving into API documentation writingMoving into API documentation writing
Moving into API documentation writingEllis Pratt
 
An Ultimate Guide To Hire Python Developer
An Ultimate Guide To Hire Python DeveloperAn Ultimate Guide To Hire Python Developer
An Ultimate Guide To Hire Python DeveloperRishiVardhaniM
 
REST Coder: Auto Generating Client Stubs and Documentation for REST APIs
REST Coder: Auto Generating Client Stubs and Documentation for REST APIsREST Coder: Auto Generating Client Stubs and Documentation for REST APIs
REST Coder: Auto Generating Client Stubs and Documentation for REST APIsHiranya Jayathilaka
 
Working Smarter: SEO Automation to Increase Efficiency and Effectiveness - Pa...
Working Smarter: SEO Automation to Increase Efficiency and Effectiveness - Pa...Working Smarter: SEO Automation to Increase Efficiency and Effectiveness - Pa...
Working Smarter: SEO Automation to Increase Efficiency and Effectiveness - Pa...State of Search Conference
 

La actualidad más candente (11)

Cloud Security Monitoring and Spark Analytics
Cloud Security Monitoring and Spark AnalyticsCloud Security Monitoring and Spark Analytics
Cloud Security Monitoring and Spark Analytics
 
Introduction to shodan
Introduction to shodanIntroduction to shodan
Introduction to shodan
 
Vonk fhir facade (christiaan)
Vonk fhir facade (christiaan)Vonk fhir facade (christiaan)
Vonk fhir facade (christiaan)
 
Evaluating Recommended Applications
Evaluating Recommended ApplicationsEvaluating Recommended Applications
Evaluating Recommended Applications
 
Webinar: Simpler Semantic Search with Solr
Webinar: Simpler Semantic Search with SolrWebinar: Simpler Semantic Search with Solr
Webinar: Simpler Semantic Search with Solr
 
How to Manage Open Source requirements with AboutCode
How to Manage Open Source requirements with AboutCodeHow to Manage Open Source requirements with AboutCode
How to Manage Open Source requirements with AboutCode
 
Best practice recommendations for utilizing open source software (from a lega...
Best practice recommendations for utilizing open source software (from a lega...Best practice recommendations for utilizing open source software (from a lega...
Best practice recommendations for utilizing open source software (from a lega...
 
Moving into API documentation writing
Moving into API documentation writingMoving into API documentation writing
Moving into API documentation writing
 
An Ultimate Guide To Hire Python Developer
An Ultimate Guide To Hire Python DeveloperAn Ultimate Guide To Hire Python Developer
An Ultimate Guide To Hire Python Developer
 
REST Coder: Auto Generating Client Stubs and Documentation for REST APIs
REST Coder: Auto Generating Client Stubs and Documentation for REST APIsREST Coder: Auto Generating Client Stubs and Documentation for REST APIs
REST Coder: Auto Generating Client Stubs and Documentation for REST APIs
 
Working Smarter: SEO Automation to Increase Efficiency and Effectiveness - Pa...
Working Smarter: SEO Automation to Increase Efficiency and Effectiveness - Pa...Working Smarter: SEO Automation to Increase Efficiency and Effectiveness - Pa...
Working Smarter: SEO Automation to Increase Efficiency and Effectiveness - Pa...
 

Similar a Improving your team's source code searching capabilities - Voxxed Thessaloniki 2018

System design for Web Application
System design for Web ApplicationSystem design for Web Application
System design for Web ApplicationMichael Choi
 
Open Source Security and ChatGPT-Published.pdf
Open Source Security and ChatGPT-Published.pdfOpen Source Security and ChatGPT-Published.pdf
Open Source Security and ChatGPT-Published.pdfJavier Perez
 
The Hacking Game - Think Like a Hacker Meetup 12072023.pptx
The Hacking Game - Think Like a Hacker Meetup 12072023.pptxThe Hacking Game - Think Like a Hacker Meetup 12072023.pptx
The Hacking Game - Think Like a Hacker Meetup 12072023.pptxlior mazor
 
Github Copilot vs Amazon CodeWhisperer for Java developers at JCON 2023
Github Copilot vs Amazon CodeWhisperer for Java developers at JCON 2023Github Copilot vs Amazon CodeWhisperer for Java developers at JCON 2023
Github Copilot vs Amazon CodeWhisperer for Java developers at JCON 2023Vadym Kazulkin
 
GitHub Copilot.pptx
GitHub Copilot.pptxGitHub Copilot.pptx
GitHub Copilot.pptxLuis Beltran
 
Tools to Find Source Code on the Web
Tools to Find Source Code on the WebTools to Find Source Code on the Web
Tools to Find Source Code on the Webrgallard
 
Jeremy cabral search marketing summit - scraping data-driven content (1)
Jeremy cabral   search marketing summit - scraping data-driven content (1)Jeremy cabral   search marketing summit - scraping data-driven content (1)
Jeremy cabral search marketing summit - scraping data-driven content (1)Jeremy Cabral
 
Publishing strategies for API documentation
Publishing strategies for API documentationPublishing strategies for API documentation
Publishing strategies for API documentationTom Johnson
 
Towards Reusable Research Software
Towards Reusable Research SoftwareTowards Reusable Research Software
Towards Reusable Research Softwaredgarijo
 
What would Jesus Developer do?
What would Jesus Developer do?What would Jesus Developer do?
What would Jesus Developer do?Lukáš Čech
 
Docs as Part of the Product - Open Source Summit North America 2018
Docs as Part of the Product - Open Source Summit North America 2018Docs as Part of the Product - Open Source Summit North America 2018
Docs as Part of the Product - Open Source Summit North America 2018Den Delimarsky
 
[Russia] Bugs -> max, time <= T
[Russia] Bugs -> max, time <= T[Russia] Bugs -> max, time <= T
[Russia] Bugs -> max, time <= TOWASP EEE
 
Rightsizing Open Source Software Identification
Rightsizing Open Source Software IdentificationRightsizing Open Source Software Identification
Rightsizing Open Source Software IdentificationnexB Inc.
 
IBM Developer Model Asset eXchange - Deep Learning for Everyone
IBM Developer Model Asset eXchange - Deep Learning for EveryoneIBM Developer Model Asset eXchange - Deep Learning for Everyone
IBM Developer Model Asset eXchange - Deep Learning for EveryoneNick Pentreath
 
API workshop: Introduction to APIs (TC Camp)
API workshop: Introduction to APIs (TC Camp)API workshop: Introduction to APIs (TC Camp)
API workshop: Introduction to APIs (TC Camp)Tom Johnson
 
Top 10 python frameworks for web development in 2020
Top 10 python frameworks for web development in 2020Top 10 python frameworks for web development in 2020
Top 10 python frameworks for web development in 2020Alaina Carter
 
PHPFrameworkDay 2020 - Different software evolutions from Start till Release ...
PHPFrameworkDay 2020 - Different software evolutions from Start till Release ...PHPFrameworkDay 2020 - Different software evolutions from Start till Release ...
PHPFrameworkDay 2020 - Different software evolutions from Start till Release ...Alexandr Savchenko
 
"Different software evolutions from Start till Release in PHP product" Oleksa...
"Different software evolutions from Start till Release in PHP product" Oleksa..."Different software evolutions from Start till Release in PHP product" Oleksa...
"Different software evolutions from Start till Release in PHP product" Oleksa...Fwdays
 
TechEvent 2019: Artificial Intelligence in Dev & Ops; Martin Luckow - Trivadis
TechEvent 2019: Artificial Intelligence in Dev & Ops; Martin Luckow - TrivadisTechEvent 2019: Artificial Intelligence in Dev & Ops; Martin Luckow - Trivadis
TechEvent 2019: Artificial Intelligence in Dev & Ops; Martin Luckow - TrivadisTrivadis
 

Similar a Improving your team's source code searching capabilities - Voxxed Thessaloniki 2018 (20)

System design for Web Application
System design for Web ApplicationSystem design for Web Application
System design for Web Application
 
Open Source Security and ChatGPT-Published.pdf
Open Source Security and ChatGPT-Published.pdfOpen Source Security and ChatGPT-Published.pdf
Open Source Security and ChatGPT-Published.pdf
 
The Hacking Game - Think Like a Hacker Meetup 12072023.pptx
The Hacking Game - Think Like a Hacker Meetup 12072023.pptxThe Hacking Game - Think Like a Hacker Meetup 12072023.pptx
The Hacking Game - Think Like a Hacker Meetup 12072023.pptx
 
Github Copilot vs Amazon CodeWhisperer for Java developers at JCON 2023
Github Copilot vs Amazon CodeWhisperer for Java developers at JCON 2023Github Copilot vs Amazon CodeWhisperer for Java developers at JCON 2023
Github Copilot vs Amazon CodeWhisperer for Java developers at JCON 2023
 
GitHub Copilot.pptx
GitHub Copilot.pptxGitHub Copilot.pptx
GitHub Copilot.pptx
 
Tools to Find Source Code on the Web
Tools to Find Source Code on the WebTools to Find Source Code on the Web
Tools to Find Source Code on the Web
 
Jeremy cabral search marketing summit - scraping data-driven content (1)
Jeremy cabral   search marketing summit - scraping data-driven content (1)Jeremy cabral   search marketing summit - scraping data-driven content (1)
Jeremy cabral search marketing summit - scraping data-driven content (1)
 
Publishing strategies for API documentation
Publishing strategies for API documentationPublishing strategies for API documentation
Publishing strategies for API documentation
 
Towards Reusable Research Software
Towards Reusable Research SoftwareTowards Reusable Research Software
Towards Reusable Research Software
 
What would Jesus Developer do?
What would Jesus Developer do?What would Jesus Developer do?
What would Jesus Developer do?
 
Docs as Part of the Product - Open Source Summit North America 2018
Docs as Part of the Product - Open Source Summit North America 2018Docs as Part of the Product - Open Source Summit North America 2018
Docs as Part of the Product - Open Source Summit North America 2018
 
[Russia] Bugs -> max, time <= T
[Russia] Bugs -> max, time <= T[Russia] Bugs -> max, time <= T
[Russia] Bugs -> max, time <= T
 
Rightsizing Open Source Software Identification
Rightsizing Open Source Software IdentificationRightsizing Open Source Software Identification
Rightsizing Open Source Software Identification
 
IBM Developer Model Asset eXchange - Deep Learning for Everyone
IBM Developer Model Asset eXchange - Deep Learning for EveryoneIBM Developer Model Asset eXchange - Deep Learning for Everyone
IBM Developer Model Asset eXchange - Deep Learning for Everyone
 
API workshop: Introduction to APIs (TC Camp)
API workshop: Introduction to APIs (TC Camp)API workshop: Introduction to APIs (TC Camp)
API workshop: Introduction to APIs (TC Camp)
 
Top 10 python frameworks for web development in 2020
Top 10 python frameworks for web development in 2020Top 10 python frameworks for web development in 2020
Top 10 python frameworks for web development in 2020
 
File000162
File000162File000162
File000162
 
PHPFrameworkDay 2020 - Different software evolutions from Start till Release ...
PHPFrameworkDay 2020 - Different software evolutions from Start till Release ...PHPFrameworkDay 2020 - Different software evolutions from Start till Release ...
PHPFrameworkDay 2020 - Different software evolutions from Start till Release ...
 
"Different software evolutions from Start till Release in PHP product" Oleksa...
"Different software evolutions from Start till Release in PHP product" Oleksa..."Different software evolutions from Start till Release in PHP product" Oleksa...
"Different software evolutions from Start till Release in PHP product" Oleksa...
 
TechEvent 2019: Artificial Intelligence in Dev & Ops; Martin Luckow - Trivadis
TechEvent 2019: Artificial Intelligence in Dev & Ops; Martin Luckow - TrivadisTechEvent 2019: Artificial Intelligence in Dev & Ops; Martin Luckow - Trivadis
TechEvent 2019: Artificial Intelligence in Dev & Ops; Martin Luckow - Trivadis
 

Último

Keeping your build tool updated in a multi repository world
Keeping your build tool updated in a multi repository worldKeeping your build tool updated in a multi repository world
Keeping your build tool updated in a multi repository worldRoberto Pérez Alcolea
 
[ CNCF Q1 2024 ] Intro to Continuous Profiling and Grafana Pyroscope.pdf
[ CNCF Q1 2024 ] Intro to Continuous Profiling and Grafana Pyroscope.pdf[ CNCF Q1 2024 ] Intro to Continuous Profiling and Grafana Pyroscope.pdf
[ CNCF Q1 2024 ] Intro to Continuous Profiling and Grafana Pyroscope.pdfSteve Caron
 
Enhancing Supply Chain Visibility with Cargo Cloud Solutions.pdf
Enhancing Supply Chain Visibility with Cargo Cloud Solutions.pdfEnhancing Supply Chain Visibility with Cargo Cloud Solutions.pdf
Enhancing Supply Chain Visibility with Cargo Cloud Solutions.pdfRTS corp
 
Leveraging AI for Mobile App Testing on Real Devices | Applitools + Kobiton
Leveraging AI for Mobile App Testing on Real Devices | Applitools + KobitonLeveraging AI for Mobile App Testing on Real Devices | Applitools + Kobiton
Leveraging AI for Mobile App Testing on Real Devices | Applitools + KobitonApplitools
 
JavaLand 2024 - Going serverless with Quarkus GraalVM native images and AWS L...
JavaLand 2024 - Going serverless with Quarkus GraalVM native images and AWS L...JavaLand 2024 - Going serverless with Quarkus GraalVM native images and AWS L...
JavaLand 2024 - Going serverless with Quarkus GraalVM native images and AWS L...Bert Jan Schrijver
 
Osi security architecture in network.pptx
Osi security architecture in network.pptxOsi security architecture in network.pptx
Osi security architecture in network.pptxVinzoCenzo
 
Amazon Bedrock in Action - presentation of the Bedrock's capabilities
Amazon Bedrock in Action - presentation of the Bedrock's capabilitiesAmazon Bedrock in Action - presentation of the Bedrock's capabilities
Amazon Bedrock in Action - presentation of the Bedrock's capabilitiesKrzysztofKkol1
 
GraphSummit Madrid - Product Vision and Roadmap - Luis Salvador Neo4j
GraphSummit Madrid - Product Vision and Roadmap - Luis Salvador Neo4jGraphSummit Madrid - Product Vision and Roadmap - Luis Salvador Neo4j
GraphSummit Madrid - Product Vision and Roadmap - Luis Salvador Neo4jNeo4j
 
Zer0con 2024 final share short version.pdf
Zer0con 2024 final share short version.pdfZer0con 2024 final share short version.pdf
Zer0con 2024 final share short version.pdfmaor17
 
Advantages of Cargo Cloud Solutions.pptx
Advantages of Cargo Cloud Solutions.pptxAdvantages of Cargo Cloud Solutions.pptx
Advantages of Cargo Cloud Solutions.pptxRTS corp
 
Ronisha Informatics Private Limited Catalogue
Ronisha Informatics Private Limited CatalogueRonisha Informatics Private Limited Catalogue
Ronisha Informatics Private Limited Catalogueitservices996
 
2024 DevNexus Patterns for Resiliency: Shuffle shards
2024 DevNexus Patterns for Resiliency: Shuffle shards2024 DevNexus Patterns for Resiliency: Shuffle shards
2024 DevNexus Patterns for Resiliency: Shuffle shardsChristopher Curtin
 
Understanding Flamingo - DeepMind's VLM Architecture
Understanding Flamingo - DeepMind's VLM ArchitectureUnderstanding Flamingo - DeepMind's VLM Architecture
Understanding Flamingo - DeepMind's VLM Architecturerahul_net
 
Understanding Plagiarism: Causes, Consequences and Prevention.pptx
Understanding Plagiarism: Causes, Consequences and Prevention.pptxUnderstanding Plagiarism: Causes, Consequences and Prevention.pptx
Understanding Plagiarism: Causes, Consequences and Prevention.pptxSasikiranMarri
 
2024-04-09 - From Complexity to Clarity - AWS Summit AMS.pdf
2024-04-09 - From Complexity to Clarity - AWS Summit AMS.pdf2024-04-09 - From Complexity to Clarity - AWS Summit AMS.pdf
2024-04-09 - From Complexity to Clarity - AWS Summit AMS.pdfAndrey Devyatkin
 
Effectively Troubleshoot 9 Types of OutOfMemoryError
Effectively Troubleshoot 9 Types of OutOfMemoryErrorEffectively Troubleshoot 9 Types of OutOfMemoryError
Effectively Troubleshoot 9 Types of OutOfMemoryErrorTier1 app
 
Revolutionizing the Digital Transformation Office - Leveraging OnePlan’s AI a...
Revolutionizing the Digital Transformation Office - Leveraging OnePlan’s AI a...Revolutionizing the Digital Transformation Office - Leveraging OnePlan’s AI a...
Revolutionizing the Digital Transformation Office - Leveraging OnePlan’s AI a...OnePlan Solutions
 
Pros and Cons of Selenium In Automation Testing_ A Comprehensive Assessment.pdf
Pros and Cons of Selenium In Automation Testing_ A Comprehensive Assessment.pdfPros and Cons of Selenium In Automation Testing_ A Comprehensive Assessment.pdf
Pros and Cons of Selenium In Automation Testing_ A Comprehensive Assessment.pdfkalichargn70th171
 
The Role of IoT and Sensor Technology in Cargo Cloud Solutions.pptx
The Role of IoT and Sensor Technology in Cargo Cloud Solutions.pptxThe Role of IoT and Sensor Technology in Cargo Cloud Solutions.pptx
The Role of IoT and Sensor Technology in Cargo Cloud Solutions.pptxRTS corp
 
Mastering Project Planning with Microsoft Project 2016.pptx
Mastering Project Planning with Microsoft Project 2016.pptxMastering Project Planning with Microsoft Project 2016.pptx
Mastering Project Planning with Microsoft Project 2016.pptxAS Design & AST.
 

Último (20)

Keeping your build tool updated in a multi repository world
Keeping your build tool updated in a multi repository worldKeeping your build tool updated in a multi repository world
Keeping your build tool updated in a multi repository world
 
[ CNCF Q1 2024 ] Intro to Continuous Profiling and Grafana Pyroscope.pdf
[ CNCF Q1 2024 ] Intro to Continuous Profiling and Grafana Pyroscope.pdf[ CNCF Q1 2024 ] Intro to Continuous Profiling and Grafana Pyroscope.pdf
[ CNCF Q1 2024 ] Intro to Continuous Profiling and Grafana Pyroscope.pdf
 
Enhancing Supply Chain Visibility with Cargo Cloud Solutions.pdf
Enhancing Supply Chain Visibility with Cargo Cloud Solutions.pdfEnhancing Supply Chain Visibility with Cargo Cloud Solutions.pdf
Enhancing Supply Chain Visibility with Cargo Cloud Solutions.pdf
 
Leveraging AI for Mobile App Testing on Real Devices | Applitools + Kobiton
Leveraging AI for Mobile App Testing on Real Devices | Applitools + KobitonLeveraging AI for Mobile App Testing on Real Devices | Applitools + Kobiton
Leveraging AI for Mobile App Testing on Real Devices | Applitools + Kobiton
 
JavaLand 2024 - Going serverless with Quarkus GraalVM native images and AWS L...
JavaLand 2024 - Going serverless with Quarkus GraalVM native images and AWS L...JavaLand 2024 - Going serverless with Quarkus GraalVM native images and AWS L...
JavaLand 2024 - Going serverless with Quarkus GraalVM native images and AWS L...
 
Osi security architecture in network.pptx
Osi security architecture in network.pptxOsi security architecture in network.pptx
Osi security architecture in network.pptx
 
Amazon Bedrock in Action - presentation of the Bedrock's capabilities
Amazon Bedrock in Action - presentation of the Bedrock's capabilitiesAmazon Bedrock in Action - presentation of the Bedrock's capabilities
Amazon Bedrock in Action - presentation of the Bedrock's capabilities
 
GraphSummit Madrid - Product Vision and Roadmap - Luis Salvador Neo4j
GraphSummit Madrid - Product Vision and Roadmap - Luis Salvador Neo4jGraphSummit Madrid - Product Vision and Roadmap - Luis Salvador Neo4j
GraphSummit Madrid - Product Vision and Roadmap - Luis Salvador Neo4j
 
Zer0con 2024 final share short version.pdf
Zer0con 2024 final share short version.pdfZer0con 2024 final share short version.pdf
Zer0con 2024 final share short version.pdf
 
Advantages of Cargo Cloud Solutions.pptx
Advantages of Cargo Cloud Solutions.pptxAdvantages of Cargo Cloud Solutions.pptx
Advantages of Cargo Cloud Solutions.pptx
 
Ronisha Informatics Private Limited Catalogue
Ronisha Informatics Private Limited CatalogueRonisha Informatics Private Limited Catalogue
Ronisha Informatics Private Limited Catalogue
 
2024 DevNexus Patterns for Resiliency: Shuffle shards
2024 DevNexus Patterns for Resiliency: Shuffle shards2024 DevNexus Patterns for Resiliency: Shuffle shards
2024 DevNexus Patterns for Resiliency: Shuffle shards
 
Understanding Flamingo - DeepMind's VLM Architecture
Understanding Flamingo - DeepMind's VLM ArchitectureUnderstanding Flamingo - DeepMind's VLM Architecture
Understanding Flamingo - DeepMind's VLM Architecture
 
Understanding Plagiarism: Causes, Consequences and Prevention.pptx
Understanding Plagiarism: Causes, Consequences and Prevention.pptxUnderstanding Plagiarism: Causes, Consequences and Prevention.pptx
Understanding Plagiarism: Causes, Consequences and Prevention.pptx
 
2024-04-09 - From Complexity to Clarity - AWS Summit AMS.pdf
2024-04-09 - From Complexity to Clarity - AWS Summit AMS.pdf2024-04-09 - From Complexity to Clarity - AWS Summit AMS.pdf
2024-04-09 - From Complexity to Clarity - AWS Summit AMS.pdf
 
Effectively Troubleshoot 9 Types of OutOfMemoryError
Effectively Troubleshoot 9 Types of OutOfMemoryErrorEffectively Troubleshoot 9 Types of OutOfMemoryError
Effectively Troubleshoot 9 Types of OutOfMemoryError
 
Revolutionizing the Digital Transformation Office - Leveraging OnePlan’s AI a...
Revolutionizing the Digital Transformation Office - Leveraging OnePlan’s AI a...Revolutionizing the Digital Transformation Office - Leveraging OnePlan’s AI a...
Revolutionizing the Digital Transformation Office - Leveraging OnePlan’s AI a...
 
Pros and Cons of Selenium In Automation Testing_ A Comprehensive Assessment.pdf
Pros and Cons of Selenium In Automation Testing_ A Comprehensive Assessment.pdfPros and Cons of Selenium In Automation Testing_ A Comprehensive Assessment.pdf
Pros and Cons of Selenium In Automation Testing_ A Comprehensive Assessment.pdf
 
The Role of IoT and Sensor Technology in Cargo Cloud Solutions.pptx
The Role of IoT and Sensor Technology in Cargo Cloud Solutions.pptxThe Role of IoT and Sensor Technology in Cargo Cloud Solutions.pptx
The Role of IoT and Sensor Technology in Cargo Cloud Solutions.pptx
 
Mastering Project Planning with Microsoft Project 2016.pptx
Mastering Project Planning with Microsoft Project 2016.pptxMastering Project Planning with Microsoft Project 2016.pptx
Mastering Project Planning with Microsoft Project 2016.pptx
 

Improving your team's source code searching capabilities - Voxxed Thessaloniki 2018

  • 1. Nikos Katirtzis Software Engineer @ Hotels.com Improving your team’s source code searching capabilities
  • 2. Educational Background � Meng in Electrical and Computer Engineering (Aristotle University of Thessaloniki) � MSc in Computer Science (University of Edinburgh) Working Experience � Software Engineer at Hotels.com (Expedia Group) • Part of the team that’s responsible for user authentication and identification (~2 years). • Recently joined a team that’s exploring and evaluating new technologies. Projects/Interests � Developed Mantissa, a TDD code search engine, and CLAMS, an approach for mining API usage examples from client source code. � Particularly interested in source code searching/mining. Who am I?
  • 3. Part 1 – Searching for source code • Why you need a source code search engine • Overview and comparison between the most popular code search engines • Recommendations and what you need to consider • Recent advances Presentation structure Part 2 – Searching for API usage examples • HApiDoc: A service that mines API usage examples from client source code • CLAMS or behind the scenes of HApiDoc
  • 4. PART 1 Searching for source code
  • 5. Monoliths are dead, long live microservices! A monolithic application puts all its functionality into a single process... ... and scales by replicating the monolith on multiple servers. A microservices architecture puts each element of functionality into a separate service... ... and scales by distributing these services across servers, replicating as needed. Source: https://martinfowler.com/articles/microservices.html
  • 6. Monoliths are dead, long live microservices?
  • 7. Monoliths are dead, long live microservices. I can’t find the code I’m looking for! We need to buy him a code search engine.
  • 8. Why you need a source code search engine? A. 0 How many searches does the average developer perform on an internal code search engine on a typical weekday? B. 1-2 C. 5-10 D. >10 Source: Sadowski, Caitlin, Kathryn T. Stolee, and Sebastian Elbaum. "How developers search for code: a case study." Proceedings of the 2015 10th Joint Meeting on Foundations of Software Engineering. ACM, 2015. Available at https://research.google.com/pubs/pub43835.html.
  • 9. “Software engineering is more about reading code than writing it, and part of this process is finding the code that you should read”. (Han-Wen Nienhuys - author of Zoekt) � Understand code dependencies in order to avoid breaking changes. � Fix production issues faster by locating the root cause. � Find references of hosts/code that will be deprecated. � Avoid duplicating existing code. � Share coding solutions and styles. � Locate security problems (e.g. hardcoded keys/passwords). Why you need a source code search engine?
  • 10. Would you go to a souvlaki shop for fish? Can’t I use my Git hosting service’s search?
  • 11. � No partial/substring matching. � Special characters are removed before indexing and are not allowed when searching. � Case sensitive search not possible. � No regex. � Cannot configure or add any features since this is a solution that's integrated into GitHub Enterprise. � Best match is based on naive techniques (tf-idf). � Too much unused space in the results page! Can’t I use my Git hosting service’s search?
  • 13. � Searchcode Server is a powerful code search engine with a sleek web user interface. � Uses Lucene to index code and provides rich additional features. � Developed by Ben Boyter. � Originally a public source code search engine (searchcode.com), later the developer created Searchcode Server which is the enterprise version.
  • 15. Searchcode Server pros/cons Pros Cons  Consistent speed regardless of search ✘ Limited partial/substring matching  User friendly filtering by repo/user/language ✘ Does not support case-sensitive matching  Rich UI ✘ Special characters removed  Relatively easy to setup, maintain and monitor ✘ Does not fully support regex  APIs ✘ Inconsistent search results
  • 16. Hound Source:. Rus Cox. “Regular Expression Matching with a Trigram Index or How Google Code Search Worked.” 2015. Available at https://swtch.com/~rsc/regexp/regexp4.html. � Hound is an open-source source code search engine which uses a static React frontend that talks to a Go backend. � Uses ngrams for indexing and matching. � Created at Etsy by Kelly Norton and Jonathan Klein. � Its core is based on Russ Cox’s “Regular Expression Matching with a Trigram Index or How Google Code Search Worked” article and code.
  • 18. Hound pros/cons Pros Cons  Supports regex and substring searches ✘ Response time heavily depends on query  Case-sensitive and case- insensitive matching ✘ Scalability issues  File path and repo filtering ✘ Limited searching options  APIs ✘ Limited monitoring capabilities  Open-source ✘ Limited additional features
  • 19. � Zoekt is an open-source fast trigram based source code search engine developed by Google. � Uses positional ngrams for indexing and matching. � Developed at Google by Han-Wen Nienhuys. � 10x faster than Hound, rich support for filtering. Zoekt
  • 20. Coverage Speed Approximate queries Filtering Ranking Zoekt design principles
  • 22. Zoekt pros/cons Pros Cons  Super fast search, consistent speed regardless of search ✘ Poor UI  Sophisticated design and approach for indexing using position trigrams ✘ Limited monitoring  Rich searching options, fully supports regex and substring searches ✘ Limited automation around running and deploying the service  Easy to build features on top of it ✘ Limited additional features  Open-source ✘ No APIs
  • 23. � Sourcegraph is a fast, solid, full-featured code navigation engine with code intelligence features by Sourcegraph. � It leverages git grep to find code and uses Zoekt for indexed searches. � Its language models implement the Language Server Protocol (LSP) to provide Code Intelligence features. � Developed by Sourcegraph, a company often referred to as the “Google for Code”. Did you know? • The open-source Sourcegraph browser extension adds code intelligence to files and diffs on GitHub, GitHub Enterprise, Phabricator, and Bitbucket Server for free!
  • 25. Sourcegraph pros/cons Pros Cons  Rich features inc. cross- reference and semantic search ✘ Requires many resources to run properly  Excellent support (company dedicated on that) ✘ Constantly changing price  Excellent documentation around setting it up and running it. ✘ Sourcegraph Enterprise price per user  Numerous plugins (e.g. for text editors, IDEs, browsers)  Core version free
  • 26. Comparison Searchcode Server Hound Zoekt Sourcegraph Speed    * Scalability    * Searching options     Additional features     Maintainability     Support     License/Price    ** *Unless deployed to a cluster, the service is relatively slow when compared to its alternatives. **The basic version of the software is free, but companies would need the Enterprise version for which there’s a cost per user. Basic version lacks indexed search and cluster deployments.
  • 27. Use: � Searchcode Server: if you’re looking for a more generic search engine, that can be easily maintained and monitored. � Hound: No reason to use Hound instead of Zoekt. � Zoekt: if you focus on speed, scalability, and search options. � Sourcegraph: if you want to invest on a source code search engine and need additional features such as code intelligence and integrations. Recommendations
  • 28. ✍ Use a local SSD (instance store volume) since these services constantly hit the disk. ✍ Use permanent storage to store the cloned repos. This allows you to achieve near zero-downtime in case the instance goes down. ✍ Make sure you understand and experiment when configuring the services, i.e. in many services you’ll need to set limits for max file size, max lines, etc. ✍ Monitor logs for errors. ✍ Remember, bad documents can always slow down your service! Things to consider EC2 with instance store index Repos Logs EBS
  • 29. ✌ GitHub is experimenting with semantic code search1,2. ✌ Microsoft offers semantic code search for Azure repos in Azure DevOps Services and TFS3. ✌ Google offers fast code search for its Cloud Source Repositories. It’s searching options are quite similar to Zoekt’s4. Plus Google’s Bazel code search. ✌ Sourcegraph becomes more and more popular by adding more languages to its Code Intelligence feature (thanks to Microsoft’s Language Server Protocol) and by providing more integrations and open-source browser extensions. Recent advances 1 “Towards Natural Language Semantic Code Search”: https://githubengineering.com/towards-natural-language-semantic-code-search/ 2 “GitHub experiments; semantic code search.”: https://experiments.github.com/semantic-code-search 3 “Search across all your code and work items”: https://docs.microsoft.com/en-us/azure/devops/project/search/overview?view=vsts&tabs=new-nav 4 “Searching for code”: https://cloud.google.com/source-repositories/docs/searching-code
  • 30. PART 2 Searching for API usage examples
  • 31. The Problem � Lack of proper documentation for the APIs � How can I use this API/API method? � Creating API usage examples is time-consuming The Concept ✌ What if we mine examples from client source code? ✌ Would be nice to cluster results ✌ And show the most indicative example(s) of each cluster ✌ And provide a summarised version of the most indicative example(s) CLAMS
  • 34. Behind the scenes - CLAMS
  • 35. ✌ HApiDoc is getting open-sourced! Looking for contributors! ✌ Take a look at our GitHub space: https://github.com/HotelsDotCom ✌ Presentation material: https://github.com/nikos912000/voxxed-thes-material ✌ Using any other code search engines? Let us know! PS: We’re hiring! Closing Notes
  • 36. Thank you! nikos912000@notmail.com @nikos912000 www.linkedin.com/in/nkatirtzis nikos912000 1 Icon made by Gregor Cresnar from www.flaticon.com. 2 Icon made by Freepik from www.flaticon.com. 3 Icon made by Freepik from www.flaticon.com. 4 Icon made by Pixel Perfect from www.flaticon.com. 1 2 3 4

Notas del editor

  1. Monoliths vs microservices Hotels.com’s website Shift brings advantages + challenges
  2. Survey at Google, 2015
  3. Dependencies; hundreds of microservices Prod issues: bridge between Splunk and BitBucket/GitHub Host: make sure your services don’t rely on servers you aim to kill
  4. Git hosting services don’t specialize on source code searching
  5. Substring matching; day-Monday Special characters; urls, xmls Search results page non-optimal in terms of content fitting
  6. Hound uses 3grams Ngrams: powerful concept for approximate matching. Ngrams are contiguous sequences of n items from a given sample of text.
  7. Coverage; the code that is of interest to you should be available for searching. Speed; Implements sophisticated techniques and by using an index which is based on positional ngrams. Approximate queries; performs substring or partial matching case-insensitively but it also gives you the option for case sensitive searches. Filtering; allows you to filter queries by adding extra atoms and filter out terms with the minus symbol (-). Ranking; uses ctags to find declarations such us class or method definitions and variable declarations, which are then boosted in the search ranking.
  8. Cannot be fairly compared to any existing solutions. You’ll need the Enterprise version for Enterprise environments. Charges per user. Prices keep changing.
  9. Until last year it was hard to convince someone that they should invest on source code searching. There was limited interest and progress in that area.
  10. Orchestrate any required tasks and to act as an end-to-end solution for mining usage examples using CLAMS Automates the filtering step which is the most time-consuming task for systems like CLAMS, stores results to a MongoDB instance and shows these on a neat web service.
  11. Type the fully qualified name of the method Returns summarized snippets alongside with information such as: Repository name where method is called Support; number of other client methods that are calling your API method in a similar way Additional calls to the same API