SlideShare una empresa de Scribd logo
1 de 36
Nikos Katirtzis
Software Engineer @ Hotels.com
Improving your team’s source code
searching capabilities
Educational Background
� Meng in Electrical and Computer Engineering (Aristotle University of Thessaloniki)
� MSc in Computer Science (University of Edinburgh)
Working Experience
� Software Engineer at Hotels.com (Expedia Group)
• Part of the team that’s responsible for user authentication and identification (~2 years).
• Recently joined a team that’s exploring and evaluating new technologies.
Projects/Interests
� Developed Mantissa, a TDD code search engine, and CLAMS, an approach for mining API usage
examples from client source code.
� Particularly interested in source code searching/mining.
Who am I?
Part 1 – Searching for source code
• Why you need a source code search engine
• Overview and comparison between the most
popular code search engines
• Recommendations and what you need to
consider
• Recent advances
Presentation structure
Part 2 – Searching for API usage examples
• HApiDoc: A service that mines API usage
examples from client source code
• CLAMS or behind the scenes of HApiDoc
PART 1
Searching for source code
Monoliths are dead, long live microservices!
A monolithic application puts all its
functionality into a single process...
... and scales by replicating the monolith on
multiple servers.
A microservices architecture puts each
element of functionality into a separate
service...
... and scales by distributing these services
across servers, replicating as needed.
Source: https://martinfowler.com/articles/microservices.html
Monoliths are dead, long live microservices?
Monoliths are dead, long live microservices.
I can’t find
the code I’m
looking for!
We need to
buy him a code
search engine.
Why you need a source code search engine?
A. 0
How many searches does the average developer perform on
an internal code search engine on a typical weekday?
B. 1-2
C. 5-10 D. >10
Source: Sadowski, Caitlin, Kathryn T. Stolee, and Sebastian Elbaum. "How developers search for code: a case study."
Proceedings of the 2015 10th Joint Meeting on Foundations of Software Engineering. ACM, 2015. Available at https://research.google.com/pubs/pub43835.html.
“Software engineering is more about reading code than writing it, and part of this process is
finding the code that you should read”. (Han-Wen Nienhuys - author of Zoekt)
� Understand code dependencies in order to avoid breaking changes.
� Fix production issues faster by locating the root cause.
� Find references of hosts/code that will be deprecated.
� Avoid duplicating existing code.
� Share coding solutions and styles.
� Locate security problems (e.g. hardcoded keys/passwords).
Why you need a source code search engine?
Would you go to a souvlaki shop for fish?
Can’t I use my Git hosting service’s search?
� No partial/substring matching.
� Special characters are removed before indexing and are not allowed when searching.
� Case sensitive search not possible.
� No regex.
� Cannot configure or add any features since this is a solution that's integrated into GitHub
Enterprise.
� Best match is based on naive techniques (tf-idf).
� Too much unused space in the results page!
Can’t I use my Git hosting service’s search?
Popular CSEs
Hound
Zoekt
� Searchcode Server is a powerful code search engine with a sleek web user interface.
� Uses Lucene to index code and provides rich additional features.
� Developed by Ben Boyter.
� Originally a public source code search engine (searchcode.com), later the developer created
Searchcode Server which is the enterprise version.
Searchcode Server screenshot
Searchcode Server pros/cons
Pros Cons
 Consistent speed regardless of
search
✘ Limited partial/substring
matching
 User friendly filtering by
repo/user/language
✘ Does not support case-sensitive
matching
 Rich UI ✘ Special characters removed
 Relatively easy to setup,
maintain and monitor
✘ Does not fully support regex
 APIs ✘ Inconsistent search results
Hound
Source:. Rus Cox. “Regular Expression Matching with a Trigram Index or How Google Code Search Worked.” 2015. Available at
https://swtch.com/~rsc/regexp/regexp4.html.
� Hound is an open-source source code search engine which uses a static React frontend that
talks to a Go backend.
� Uses ngrams for indexing and matching.
� Created at Etsy by Kelly Norton and Jonathan Klein.
� Its core is based on Russ Cox’s “Regular Expression Matching with a Trigram Index or How
Google Code Search Worked” article and code.
Hound screenshot
Hound pros/cons
Pros Cons
 Supports regex and substring
searches
✘ Response time heavily depends
on query
 Case-sensitive and case-
insensitive matching
✘ Scalability issues
 File path and repo filtering ✘ Limited searching options
 APIs ✘ Limited monitoring capabilities
 Open-source ✘ Limited additional features
� Zoekt is an open-source fast trigram based source code search engine developed by
Google.
� Uses positional ngrams for indexing and matching.
� Developed at Google by Han-Wen Nienhuys.
� 10x faster than Hound, rich support for filtering.
Zoekt
Coverage Speed Approximate
queries
Filtering Ranking
Zoekt design principles
Zoekt screenshot
Zoekt pros/cons
Pros Cons
 Super fast search, consistent speed
regardless of search
✘ Poor UI
 Sophisticated design and approach for
indexing using position trigrams
✘ Limited monitoring
 Rich searching options, fully supports regex
and substring searches
✘ Limited automation around running
and deploying the service
 Easy to build features on top of it ✘ Limited additional features
 Open-source ✘ No APIs
� Sourcegraph is a fast, solid, full-featured code navigation engine with code intelligence
features by Sourcegraph.
� It leverages git grep to find code and uses Zoekt for indexed searches.
� Its language models implement the Language Server Protocol (LSP) to provide Code
Intelligence features.
� Developed by Sourcegraph, a company often referred to as the “Google for Code”.
Did you know?
• The open-source Sourcegraph browser extension adds code intelligence to files and diffs on
GitHub, GitHub Enterprise, Phabricator, and Bitbucket Server for free!
Sourcegraph demo
Sourcegraph pros/cons
Pros Cons
 Rich features inc. cross-
reference and semantic search
✘ Requires many resources to run
properly
 Excellent support (company
dedicated on that)
✘ Constantly changing price
 Excellent documentation around
setting it up and running it.
✘ Sourcegraph Enterprise price
per user
 Numerous plugins (e.g. for text
editors, IDEs, browsers)
 Core version free
Comparison
Searchcode
Server
Hound Zoekt Sourcegraph
Speed    *
Scalability    *
Searching
options    
Additional
features    
Maintainability    
Support    
License/Price    **
*Unless deployed to a cluster, the service is relatively slow when compared to its alternatives.
**The basic version of the software is free, but companies would need the Enterprise version for which there’s a cost per user. Basic version lacks indexed search and
cluster deployments.
Use:
� Searchcode Server: if you’re looking for a more generic search engine, that can be easily
maintained and monitored.
� Hound: No reason to use Hound instead of Zoekt.
� Zoekt: if you focus on speed, scalability, and search options.
� Sourcegraph: if you want to invest on a source code search engine and need additional
features such as code intelligence and integrations.
Recommendations
✍ Use a local SSD (instance store volume) since these
services constantly hit the disk.
✍ Use permanent storage to store the cloned repos. This
allows you to achieve near zero-downtime in case the
instance goes down.
✍ Make sure you understand and experiment when configuring the services, i.e.
in many services you’ll need to set limits for max file size,
max lines, etc.
✍ Monitor logs for errors.
✍ Remember, bad documents can always slow down your service!
Things to consider
EC2 with
instance store
index
Repos
Logs
EBS
✌ GitHub is experimenting with semantic code search1,2.
✌ Microsoft offers semantic code search for Azure repos in Azure DevOps Services and TFS3.
✌ Google offers fast code search for its Cloud Source Repositories. It’s searching options are
quite similar to Zoekt’s4. Plus Google’s Bazel code search.
✌ Sourcegraph becomes more and more popular by adding more languages to its Code
Intelligence feature (thanks to Microsoft’s Language Server Protocol) and by providing more
integrations and open-source browser extensions.
Recent advances
1 “Towards Natural Language Semantic Code Search”: https://githubengineering.com/towards-natural-language-semantic-code-search/
2 “GitHub experiments; semantic code search.”: https://experiments.github.com/semantic-code-search
3 “Search across all your code and work items”: https://docs.microsoft.com/en-us/azure/devops/project/search/overview?view=vsts&tabs=new-nav
4 “Searching for code”: https://cloud.google.com/source-repositories/docs/searching-code
PART 2
Searching for API usage examples
The Problem
� Lack of proper documentation for the APIs
� How can I use this API/API method?
� Creating API usage examples is time-consuming
The Concept
✌ What if we mine examples from client source code?
✌ Would be nice to cluster results
✌ And show the most indicative example(s) of each cluster
✌ And provide a summarised version of the most indicative
example(s)
CLAMS
HApiDoc Architecture
HApiDoc UI
Behind the scenes - CLAMS
✌ HApiDoc is getting open-sourced! Looking for contributors!
✌ Take a look at our GitHub space: https://github.com/HotelsDotCom
✌ Presentation material: https://github.com/nikos912000/voxxed-thes-material
✌ Using any other code search engines? Let us know!
PS: We’re hiring!
Closing Notes
Thank you!
nikos912000@notmail.com @nikos912000
www.linkedin.com/in/nkatirtzis nikos912000
1 Icon made by Gregor Cresnar from www.flaticon.com.
2 Icon made by Freepik from www.flaticon.com.
3 Icon made by Freepik from www.flaticon.com.
4 Icon made by Pixel Perfect from www.flaticon.com.
1 2
3 4

Más contenido relacionado

La actualidad más candente

Cloud Security Monitoring and Spark Analytics
Cloud Security Monitoring and Spark AnalyticsCloud Security Monitoring and Spark Analytics
Cloud Security Monitoring and Spark Analyticsamesar0
 
Vonk fhir facade (christiaan)
Vonk fhir facade (christiaan)Vonk fhir facade (christiaan)
Vonk fhir facade (christiaan)DevDays
 
Evaluating Recommended Applications
Evaluating Recommended ApplicationsEvaluating Recommended Applications
Evaluating Recommended Applicationsrsse2008
 
Webinar: Simpler Semantic Search with Solr
Webinar: Simpler Semantic Search with SolrWebinar: Simpler Semantic Search with Solr
Webinar: Simpler Semantic Search with SolrLucidworks
 
How to Manage Open Source requirements with AboutCode
How to Manage Open Source requirements with AboutCodeHow to Manage Open Source requirements with AboutCode
How to Manage Open Source requirements with AboutCodenexB Inc.
 
Best practice recommendations for utilizing open source software (from a lega...
Best practice recommendations for utilizing open source software (from a lega...Best practice recommendations for utilizing open source software (from a lega...
Best practice recommendations for utilizing open source software (from a lega...Rogue Wave Software
 
Moving into API documentation writing
Moving into API documentation writingMoving into API documentation writing
Moving into API documentation writingEllis Pratt
 
An Ultimate Guide To Hire Python Developer
An Ultimate Guide To Hire Python DeveloperAn Ultimate Guide To Hire Python Developer
An Ultimate Guide To Hire Python DeveloperRishiVardhaniM
 
REST Coder: Auto Generating Client Stubs and Documentation for REST APIs
REST Coder: Auto Generating Client Stubs and Documentation for REST APIsREST Coder: Auto Generating Client Stubs and Documentation for REST APIs
REST Coder: Auto Generating Client Stubs and Documentation for REST APIsHiranya Jayathilaka
 
Working Smarter: SEO Automation to Increase Efficiency and Effectiveness - Pa...
Working Smarter: SEO Automation to Increase Efficiency and Effectiveness - Pa...Working Smarter: SEO Automation to Increase Efficiency and Effectiveness - Pa...
Working Smarter: SEO Automation to Increase Efficiency and Effectiveness - Pa...State of Search Conference
 

La actualidad más candente (11)

Cloud Security Monitoring and Spark Analytics
Cloud Security Monitoring and Spark AnalyticsCloud Security Monitoring and Spark Analytics
Cloud Security Monitoring and Spark Analytics
 
Introduction to shodan
Introduction to shodanIntroduction to shodan
Introduction to shodan
 
Vonk fhir facade (christiaan)
Vonk fhir facade (christiaan)Vonk fhir facade (christiaan)
Vonk fhir facade (christiaan)
 
Evaluating Recommended Applications
Evaluating Recommended ApplicationsEvaluating Recommended Applications
Evaluating Recommended Applications
 
Webinar: Simpler Semantic Search with Solr
Webinar: Simpler Semantic Search with SolrWebinar: Simpler Semantic Search with Solr
Webinar: Simpler Semantic Search with Solr
 
How to Manage Open Source requirements with AboutCode
How to Manage Open Source requirements with AboutCodeHow to Manage Open Source requirements with AboutCode
How to Manage Open Source requirements with AboutCode
 
Best practice recommendations for utilizing open source software (from a lega...
Best practice recommendations for utilizing open source software (from a lega...Best practice recommendations for utilizing open source software (from a lega...
Best practice recommendations for utilizing open source software (from a lega...
 
Moving into API documentation writing
Moving into API documentation writingMoving into API documentation writing
Moving into API documentation writing
 
An Ultimate Guide To Hire Python Developer
An Ultimate Guide To Hire Python DeveloperAn Ultimate Guide To Hire Python Developer
An Ultimate Guide To Hire Python Developer
 
REST Coder: Auto Generating Client Stubs and Documentation for REST APIs
REST Coder: Auto Generating Client Stubs and Documentation for REST APIsREST Coder: Auto Generating Client Stubs and Documentation for REST APIs
REST Coder: Auto Generating Client Stubs and Documentation for REST APIs
 
Working Smarter: SEO Automation to Increase Efficiency and Effectiveness - Pa...
Working Smarter: SEO Automation to Increase Efficiency and Effectiveness - Pa...Working Smarter: SEO Automation to Increase Efficiency and Effectiveness - Pa...
Working Smarter: SEO Automation to Increase Efficiency and Effectiveness - Pa...
 

Similar a Improving your team’s source code searching capabilities

System design for Web Application
System design for Web ApplicationSystem design for Web Application
System design for Web ApplicationMichael Choi
 
Open Source Security and ChatGPT-Published.pdf
Open Source Security and ChatGPT-Published.pdfOpen Source Security and ChatGPT-Published.pdf
Open Source Security and ChatGPT-Published.pdfJavier Perez
 
The Hacking Game - Think Like a Hacker Meetup 12072023.pptx
The Hacking Game - Think Like a Hacker Meetup 12072023.pptxThe Hacking Game - Think Like a Hacker Meetup 12072023.pptx
The Hacking Game - Think Like a Hacker Meetup 12072023.pptxlior mazor
 
Github Copilot vs Amazon CodeWhisperer for Java developers at JCON 2023
Github Copilot vs Amazon CodeWhisperer for Java developers at JCON 2023Github Copilot vs Amazon CodeWhisperer for Java developers at JCON 2023
Github Copilot vs Amazon CodeWhisperer for Java developers at JCON 2023Vadym Kazulkin
 
GitHub Copilot.pptx
GitHub Copilot.pptxGitHub Copilot.pptx
GitHub Copilot.pptxLuis Beltran
 
Tools to Find Source Code on the Web
Tools to Find Source Code on the WebTools to Find Source Code on the Web
Tools to Find Source Code on the Webrgallard
 
Jeremy cabral search marketing summit - scraping data-driven content (1)
Jeremy cabral   search marketing summit - scraping data-driven content (1)Jeremy cabral   search marketing summit - scraping data-driven content (1)
Jeremy cabral search marketing summit - scraping data-driven content (1)Jeremy Cabral
 
Publishing strategies for API documentation
Publishing strategies for API documentationPublishing strategies for API documentation
Publishing strategies for API documentationTom Johnson
 
Towards Reusable Research Software
Towards Reusable Research SoftwareTowards Reusable Research Software
Towards Reusable Research Softwaredgarijo
 
What would Jesus Developer do?
What would Jesus Developer do?What would Jesus Developer do?
What would Jesus Developer do?Lukáš Čech
 
Docs as Part of the Product - Open Source Summit North America 2018
Docs as Part of the Product - Open Source Summit North America 2018Docs as Part of the Product - Open Source Summit North America 2018
Docs as Part of the Product - Open Source Summit North America 2018Den Delimarsky
 
[Russia] Bugs -> max, time <= T
[Russia] Bugs -> max, time <= T[Russia] Bugs -> max, time <= T
[Russia] Bugs -> max, time <= TOWASP EEE
 
Rightsizing Open Source Software Identification
Rightsizing Open Source Software IdentificationRightsizing Open Source Software Identification
Rightsizing Open Source Software IdentificationnexB Inc.
 
IBM Developer Model Asset eXchange - Deep Learning for Everyone
IBM Developer Model Asset eXchange - Deep Learning for EveryoneIBM Developer Model Asset eXchange - Deep Learning for Everyone
IBM Developer Model Asset eXchange - Deep Learning for EveryoneNick Pentreath
 
API workshop: Introduction to APIs (TC Camp)
API workshop: Introduction to APIs (TC Camp)API workshop: Introduction to APIs (TC Camp)
API workshop: Introduction to APIs (TC Camp)Tom Johnson
 
Top 10 python frameworks for web development in 2020
Top 10 python frameworks for web development in 2020Top 10 python frameworks for web development in 2020
Top 10 python frameworks for web development in 2020Alaina Carter
 
PHPFrameworkDay 2020 - Different software evolutions from Start till Release ...
PHPFrameworkDay 2020 - Different software evolutions from Start till Release ...PHPFrameworkDay 2020 - Different software evolutions from Start till Release ...
PHPFrameworkDay 2020 - Different software evolutions from Start till Release ...Alexandr Savchenko
 
"Different software evolutions from Start till Release in PHP product" Oleksa...
"Different software evolutions from Start till Release in PHP product" Oleksa..."Different software evolutions from Start till Release in PHP product" Oleksa...
"Different software evolutions from Start till Release in PHP product" Oleksa...Fwdays
 
TechEvent 2019: Artificial Intelligence in Dev & Ops; Martin Luckow - Trivadis
TechEvent 2019: Artificial Intelligence in Dev & Ops; Martin Luckow - TrivadisTechEvent 2019: Artificial Intelligence in Dev & Ops; Martin Luckow - Trivadis
TechEvent 2019: Artificial Intelligence in Dev & Ops; Martin Luckow - TrivadisTrivadis
 

Similar a Improving your team’s source code searching capabilities (20)

System design for Web Application
System design for Web ApplicationSystem design for Web Application
System design for Web Application
 
Open Source Security and ChatGPT-Published.pdf
Open Source Security and ChatGPT-Published.pdfOpen Source Security and ChatGPT-Published.pdf
Open Source Security and ChatGPT-Published.pdf
 
The Hacking Game - Think Like a Hacker Meetup 12072023.pptx
The Hacking Game - Think Like a Hacker Meetup 12072023.pptxThe Hacking Game - Think Like a Hacker Meetup 12072023.pptx
The Hacking Game - Think Like a Hacker Meetup 12072023.pptx
 
Github Copilot vs Amazon CodeWhisperer for Java developers at JCON 2023
Github Copilot vs Amazon CodeWhisperer for Java developers at JCON 2023Github Copilot vs Amazon CodeWhisperer for Java developers at JCON 2023
Github Copilot vs Amazon CodeWhisperer for Java developers at JCON 2023
 
GitHub Copilot.pptx
GitHub Copilot.pptxGitHub Copilot.pptx
GitHub Copilot.pptx
 
Tools to Find Source Code on the Web
Tools to Find Source Code on the WebTools to Find Source Code on the Web
Tools to Find Source Code on the Web
 
Jeremy cabral search marketing summit - scraping data-driven content (1)
Jeremy cabral   search marketing summit - scraping data-driven content (1)Jeremy cabral   search marketing summit - scraping data-driven content (1)
Jeremy cabral search marketing summit - scraping data-driven content (1)
 
Publishing strategies for API documentation
Publishing strategies for API documentationPublishing strategies for API documentation
Publishing strategies for API documentation
 
Towards Reusable Research Software
Towards Reusable Research SoftwareTowards Reusable Research Software
Towards Reusable Research Software
 
What would Jesus Developer do?
What would Jesus Developer do?What would Jesus Developer do?
What would Jesus Developer do?
 
Docs as Part of the Product - Open Source Summit North America 2018
Docs as Part of the Product - Open Source Summit North America 2018Docs as Part of the Product - Open Source Summit North America 2018
Docs as Part of the Product - Open Source Summit North America 2018
 
[Russia] Bugs -> max, time <= T
[Russia] Bugs -> max, time <= T[Russia] Bugs -> max, time <= T
[Russia] Bugs -> max, time <= T
 
Rightsizing Open Source Software Identification
Rightsizing Open Source Software IdentificationRightsizing Open Source Software Identification
Rightsizing Open Source Software Identification
 
IBM Developer Model Asset eXchange - Deep Learning for Everyone
IBM Developer Model Asset eXchange - Deep Learning for EveryoneIBM Developer Model Asset eXchange - Deep Learning for Everyone
IBM Developer Model Asset eXchange - Deep Learning for Everyone
 
API workshop: Introduction to APIs (TC Camp)
API workshop: Introduction to APIs (TC Camp)API workshop: Introduction to APIs (TC Camp)
API workshop: Introduction to APIs (TC Camp)
 
Top 10 python frameworks for web development in 2020
Top 10 python frameworks for web development in 2020Top 10 python frameworks for web development in 2020
Top 10 python frameworks for web development in 2020
 
File000162
File000162File000162
File000162
 
PHPFrameworkDay 2020 - Different software evolutions from Start till Release ...
PHPFrameworkDay 2020 - Different software evolutions from Start till Release ...PHPFrameworkDay 2020 - Different software evolutions from Start till Release ...
PHPFrameworkDay 2020 - Different software evolutions from Start till Release ...
 
"Different software evolutions from Start till Release in PHP product" Oleksa...
"Different software evolutions from Start till Release in PHP product" Oleksa..."Different software evolutions from Start till Release in PHP product" Oleksa...
"Different software evolutions from Start till Release in PHP product" Oleksa...
 
TechEvent 2019: Artificial Intelligence in Dev & Ops; Martin Luckow - Trivadis
TechEvent 2019: Artificial Intelligence in Dev & Ops; Martin Luckow - TrivadisTechEvent 2019: Artificial Intelligence in Dev & Ops; Martin Luckow - Trivadis
TechEvent 2019: Artificial Intelligence in Dev & Ops; Martin Luckow - Trivadis
 

Último

What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Wonjun Hwang
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piececharlottematthew16
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embeddingZilliz
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostZilliz
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfRankYa
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clashcharlottematthew16
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
Vector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesVector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesZilliz
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 

Último (20)

What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piece
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embedding
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdf
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clash
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
Vector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesVector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector Databases
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 

Improving your team’s source code searching capabilities

  • 1. Nikos Katirtzis Software Engineer @ Hotels.com Improving your team’s source code searching capabilities
  • 2. Educational Background � Meng in Electrical and Computer Engineering (Aristotle University of Thessaloniki) � MSc in Computer Science (University of Edinburgh) Working Experience � Software Engineer at Hotels.com (Expedia Group) • Part of the team that’s responsible for user authentication and identification (~2 years). • Recently joined a team that’s exploring and evaluating new technologies. Projects/Interests � Developed Mantissa, a TDD code search engine, and CLAMS, an approach for mining API usage examples from client source code. � Particularly interested in source code searching/mining. Who am I?
  • 3. Part 1 – Searching for source code • Why you need a source code search engine • Overview and comparison between the most popular code search engines • Recommendations and what you need to consider • Recent advances Presentation structure Part 2 – Searching for API usage examples • HApiDoc: A service that mines API usage examples from client source code • CLAMS or behind the scenes of HApiDoc
  • 4. PART 1 Searching for source code
  • 5. Monoliths are dead, long live microservices! A monolithic application puts all its functionality into a single process... ... and scales by replicating the monolith on multiple servers. A microservices architecture puts each element of functionality into a separate service... ... and scales by distributing these services across servers, replicating as needed. Source: https://martinfowler.com/articles/microservices.html
  • 6. Monoliths are dead, long live microservices?
  • 7. Monoliths are dead, long live microservices. I can’t find the code I’m looking for! We need to buy him a code search engine.
  • 8. Why you need a source code search engine? A. 0 How many searches does the average developer perform on an internal code search engine on a typical weekday? B. 1-2 C. 5-10 D. >10 Source: Sadowski, Caitlin, Kathryn T. Stolee, and Sebastian Elbaum. "How developers search for code: a case study." Proceedings of the 2015 10th Joint Meeting on Foundations of Software Engineering. ACM, 2015. Available at https://research.google.com/pubs/pub43835.html.
  • 9. “Software engineering is more about reading code than writing it, and part of this process is finding the code that you should read”. (Han-Wen Nienhuys - author of Zoekt) � Understand code dependencies in order to avoid breaking changes. � Fix production issues faster by locating the root cause. � Find references of hosts/code that will be deprecated. � Avoid duplicating existing code. � Share coding solutions and styles. � Locate security problems (e.g. hardcoded keys/passwords). Why you need a source code search engine?
  • 10. Would you go to a souvlaki shop for fish? Can’t I use my Git hosting service’s search?
  • 11. � No partial/substring matching. � Special characters are removed before indexing and are not allowed when searching. � Case sensitive search not possible. � No regex. � Cannot configure or add any features since this is a solution that's integrated into GitHub Enterprise. � Best match is based on naive techniques (tf-idf). � Too much unused space in the results page! Can’t I use my Git hosting service’s search?
  • 13. � Searchcode Server is a powerful code search engine with a sleek web user interface. � Uses Lucene to index code and provides rich additional features. � Developed by Ben Boyter. � Originally a public source code search engine (searchcode.com), later the developer created Searchcode Server which is the enterprise version.
  • 15. Searchcode Server pros/cons Pros Cons  Consistent speed regardless of search ✘ Limited partial/substring matching  User friendly filtering by repo/user/language ✘ Does not support case-sensitive matching  Rich UI ✘ Special characters removed  Relatively easy to setup, maintain and monitor ✘ Does not fully support regex  APIs ✘ Inconsistent search results
  • 16. Hound Source:. Rus Cox. “Regular Expression Matching with a Trigram Index or How Google Code Search Worked.” 2015. Available at https://swtch.com/~rsc/regexp/regexp4.html. � Hound is an open-source source code search engine which uses a static React frontend that talks to a Go backend. � Uses ngrams for indexing and matching. � Created at Etsy by Kelly Norton and Jonathan Klein. � Its core is based on Russ Cox’s “Regular Expression Matching with a Trigram Index or How Google Code Search Worked” article and code.
  • 18. Hound pros/cons Pros Cons  Supports regex and substring searches ✘ Response time heavily depends on query  Case-sensitive and case- insensitive matching ✘ Scalability issues  File path and repo filtering ✘ Limited searching options  APIs ✘ Limited monitoring capabilities  Open-source ✘ Limited additional features
  • 19. � Zoekt is an open-source fast trigram based source code search engine developed by Google. � Uses positional ngrams for indexing and matching. � Developed at Google by Han-Wen Nienhuys. � 10x faster than Hound, rich support for filtering. Zoekt
  • 20. Coverage Speed Approximate queries Filtering Ranking Zoekt design principles
  • 22. Zoekt pros/cons Pros Cons  Super fast search, consistent speed regardless of search ✘ Poor UI  Sophisticated design and approach for indexing using position trigrams ✘ Limited monitoring  Rich searching options, fully supports regex and substring searches ✘ Limited automation around running and deploying the service  Easy to build features on top of it ✘ Limited additional features  Open-source ✘ No APIs
  • 23. � Sourcegraph is a fast, solid, full-featured code navigation engine with code intelligence features by Sourcegraph. � It leverages git grep to find code and uses Zoekt for indexed searches. � Its language models implement the Language Server Protocol (LSP) to provide Code Intelligence features. � Developed by Sourcegraph, a company often referred to as the “Google for Code”. Did you know? • The open-source Sourcegraph browser extension adds code intelligence to files and diffs on GitHub, GitHub Enterprise, Phabricator, and Bitbucket Server for free!
  • 25. Sourcegraph pros/cons Pros Cons  Rich features inc. cross- reference and semantic search ✘ Requires many resources to run properly  Excellent support (company dedicated on that) ✘ Constantly changing price  Excellent documentation around setting it up and running it. ✘ Sourcegraph Enterprise price per user  Numerous plugins (e.g. for text editors, IDEs, browsers)  Core version free
  • 26. Comparison Searchcode Server Hound Zoekt Sourcegraph Speed    * Scalability    * Searching options     Additional features     Maintainability     Support     License/Price    ** *Unless deployed to a cluster, the service is relatively slow when compared to its alternatives. **The basic version of the software is free, but companies would need the Enterprise version for which there’s a cost per user. Basic version lacks indexed search and cluster deployments.
  • 27. Use: � Searchcode Server: if you’re looking for a more generic search engine, that can be easily maintained and monitored. � Hound: No reason to use Hound instead of Zoekt. � Zoekt: if you focus on speed, scalability, and search options. � Sourcegraph: if you want to invest on a source code search engine and need additional features such as code intelligence and integrations. Recommendations
  • 28. ✍ Use a local SSD (instance store volume) since these services constantly hit the disk. ✍ Use permanent storage to store the cloned repos. This allows you to achieve near zero-downtime in case the instance goes down. ✍ Make sure you understand and experiment when configuring the services, i.e. in many services you’ll need to set limits for max file size, max lines, etc. ✍ Monitor logs for errors. ✍ Remember, bad documents can always slow down your service! Things to consider EC2 with instance store index Repos Logs EBS
  • 29. ✌ GitHub is experimenting with semantic code search1,2. ✌ Microsoft offers semantic code search for Azure repos in Azure DevOps Services and TFS3. ✌ Google offers fast code search for its Cloud Source Repositories. It’s searching options are quite similar to Zoekt’s4. Plus Google’s Bazel code search. ✌ Sourcegraph becomes more and more popular by adding more languages to its Code Intelligence feature (thanks to Microsoft’s Language Server Protocol) and by providing more integrations and open-source browser extensions. Recent advances 1 “Towards Natural Language Semantic Code Search”: https://githubengineering.com/towards-natural-language-semantic-code-search/ 2 “GitHub experiments; semantic code search.”: https://experiments.github.com/semantic-code-search 3 “Search across all your code and work items”: https://docs.microsoft.com/en-us/azure/devops/project/search/overview?view=vsts&tabs=new-nav 4 “Searching for code”: https://cloud.google.com/source-repositories/docs/searching-code
  • 30. PART 2 Searching for API usage examples
  • 31. The Problem � Lack of proper documentation for the APIs � How can I use this API/API method? � Creating API usage examples is time-consuming The Concept ✌ What if we mine examples from client source code? ✌ Would be nice to cluster results ✌ And show the most indicative example(s) of each cluster ✌ And provide a summarised version of the most indicative example(s) CLAMS
  • 34. Behind the scenes - CLAMS
  • 35. ✌ HApiDoc is getting open-sourced! Looking for contributors! ✌ Take a look at our GitHub space: https://github.com/HotelsDotCom ✌ Presentation material: https://github.com/nikos912000/voxxed-thes-material ✌ Using any other code search engines? Let us know! PS: We’re hiring! Closing Notes
  • 36. Thank you! nikos912000@notmail.com @nikos912000 www.linkedin.com/in/nkatirtzis nikos912000 1 Icon made by Gregor Cresnar from www.flaticon.com. 2 Icon made by Freepik from www.flaticon.com. 3 Icon made by Freepik from www.flaticon.com. 4 Icon made by Pixel Perfect from www.flaticon.com. 1 2 3 4

Notas del editor

  1. Monoliths vs microservices Hotels.com’s website Shift brings advantages + challenges
  2. Survey at Google, 2015
  3. Dependencies; hundreds of microservices Prod issues: bridge between Splunk and BitBucket/GitHub Host: make sure your services don’t rely on servers you aim to kill
  4. Git hosting services don’t specialize on source code searching
  5. Substring matching; day-Monday Special characters; urls, xmls Search results page non-optimal in terms of content fitting
  6. Hound uses 3grams Ngrams: powerful concept for approximate matching. Ngrams are contiguous sequences of n items from a given sample of text.
  7. Coverage; the code that is of interest to you should be available for searching. Speed; Implements sophisticated techniques and by using an index which is based on positional ngrams. Approximate queries; performs substring or partial matching case-insensitively but it also gives you the option for case sensitive searches. Filtering; allows you to filter queries by adding extra atoms and filter out terms with the minus symbol (-). Ranking; uses ctags to find declarations such us class or method definitions and variable declarations, which are then boosted in the search ranking.
  8. Cannot be fairly compared to any existing solutions. You’ll need the Enterprise version for Enterprise environments. Charges per user. Prices keep changing.
  9. Until last year it was hard to convince someone that they should invest on source code searching. There was limited interest and progress in that area.
  10. Orchestrate any required tasks and to act as an end-to-end solution for mining usage examples using CLAMS Automates the filtering step which is the most time-consuming task for systems like CLAMS, stores results to a MongoDB instance and shows these on a neat web service.
  11. Type the fully qualified name of the method Returns summarized snippets alongside with information such as: Repository name where method is called Support; number of other client methods that are calling your API method in a similar way Additional calls to the same API