SlideShare una empresa de Scribd logo
1 de 17
Grow-your-own search engine



      An introduction to Solr
          Andrew Clegg

   http://lucene.apache.org/solr/
Search engine toolkit

     Indexing
     Relevance ranking
     Spellchecking
     Hit highlighting
     Text processing tools
     etc. etc.

                      Originally in Java
                      Ported to C#.NET, Python, C (Lucy)
                      Perl & Ruby bindings for Lucy

Not an off-the-shelf application — requires coding
Solr is a search server built on Lucene

     Web admin interface

     Configured via XML — no coding required

     Controlled via HTTP — REST-like web services

     Database-like schema — typed fields, unique keys

     Extended query syntax

     Caching, replication, clustering, delta indexing, sharding

     Used by: AOL, news.com, Digg, Apple, Disney, MTV,
     CiteSeerX, PubGet, NASA, the White House…
Indexing




 ‗documents‘            indexing            index

 XML                    Tokenization etc.
 CSV                    Term extraction
 HTTP request           Frequency
 Database query         calculations


 Web crawling & PDF/Word/etc. extraction via other tools
Querying




users       query parsing    index lookup       index




                      ranked hits:
                      database/document IDs
                      … and/or …
                      content stored in index
Query syntax example:

  name:(kinase OR phosphotransferase)
   AND species:Saccharomyces
   AND description:phosphoryl*
   AND createdate:[-1YEAR/DAY TO NOW]


  ―name contains kinase or phosphotransferase
   and species contains Saccharomyces
   and description contains word starting with phosphoryl
   and createdate at any time between one year ago
   (rounded to the nearest day) and now‖
Ranking based on tf.idf model…

     tf = term frequency
     idf = inverse document frequency



          Query terms appearing frequently in this
          document are up-weighted

          … but …

          Query terms appearing in many
          documents are down-weighted


Highest ranked docs contain multiple instances of rare query terms
Case study: indexing the CATH protein
     domain database with Solr




           http://cathdb.info/
Whole protein
Whole protein




Chain
Whole protein




Chain                   Domain
CLASS           e.g. Mixed Alpha/Beta    depth = 1



ARCHITECTURE    e.g. Alpha/Beta Barrel   depth = 2



TOPOLOGY        e.g. TIM Barrel          depth = 3
(fold family)


HOMOLOGOUS      e.g. Aldolase class I    depth = 4
SUPERFAMILY
Previous CATH search box used plain SQL –
lots of LIKE/equals matches against relevant fields

        • Slow
        • Poor coverage (exact string match needed)
        • No relevance ordering
        • Slow
        • Hard to debug unexpected results
        • Can‘t be exposed as a service
        • Did we mention slow...?
Entities to index:
• CATH nodes, esp. superfamilies
• Proteins
• Chains
• Domains         UniProt        PDB


        GO          CATHDB             fetch &
                                       index
                  (PostgreSQL)                        user
                                                      query


           EC                                     CATH
                      KEGG                       website


Fetching/indexing data needs no coding –
just XML and a bit of SQL
Fun with indexes



       • “Give me the 10 most similar domains to this one”
       Based on term similarity.
       Useful for quickly finding functional analogues.


       • “Give me the top 10 keywords for this superfamily”
       Can be used to generate tag clouds & summaries.
       Useful for data mining too.
Some facts and figures



       • Indexing takes ~ 6 hours (dedicated box)
       • Index is currently 2.0GB
       • 357,850 entities in index (some are 1-3MB)
       • Typical single query time in Solr: 1ms – 200ms
       • Total web query time: ~ 3s




Any questions?

Más contenido relacionado

La actualidad más candente

Get the most out of Solr search with PHP
Get the most out of Solr search with PHPGet the most out of Solr search with PHP
Get the most out of Solr search with PHPPaul Borgermans
 
Integrating the Solr search engine
Integrating the Solr search engineIntegrating the Solr search engine
Integrating the Solr search engineth0masr
 
Lucene's Latest (for Libraries)
Lucene's Latest (for Libraries)Lucene's Latest (for Libraries)
Lucene's Latest (for Libraries)Erik Hatcher
 
Rapid Prototyping with Solr
Rapid Prototyping with SolrRapid Prototyping with Solr
Rapid Prototyping with SolrErik Hatcher
 
Using Apache Solr
Using Apache SolrUsing Apache Solr
Using Apache Solrpittaya
 
Solr Black Belt Pre-conference
Solr Black Belt Pre-conferenceSolr Black Belt Pre-conference
Solr Black Belt Pre-conferenceErik Hatcher
 
Introduction to Apache Lucene/Solr
Introduction to Apache Lucene/SolrIntroduction to Apache Lucene/Solr
Introduction to Apache Lucene/SolrRahul Jain
 
code4lib 2011 preconference: What's New in Solr (since 1.4.1)
code4lib 2011 preconference: What's New in Solr (since 1.4.1)code4lib 2011 preconference: What's New in Solr (since 1.4.1)
code4lib 2011 preconference: What's New in Solr (since 1.4.1)Erik Hatcher
 
Rapid Prototyping with Solr
Rapid Prototyping with SolrRapid Prototyping with Solr
Rapid Prototyping with SolrErik Hatcher
 
Solr introduction
Solr introductionSolr introduction
Solr introductionLap Tran
 
Intro to Apache Lucene and Solr
Intro to Apache Lucene and SolrIntro to Apache Lucene and Solr
Intro to Apache Lucene and SolrGrant Ingersoll
 
Lucene for Solr Developers
Lucene for Solr DevelopersLucene for Solr Developers
Lucene for Solr DevelopersErik Hatcher
 
Rebuilding Solr 6 examples - layer by layer (LuceneSolrRevolution 2016)
Rebuilding Solr 6 examples - layer by layer (LuceneSolrRevolution 2016)Rebuilding Solr 6 examples - layer by layer (LuceneSolrRevolution 2016)
Rebuilding Solr 6 examples - layer by layer (LuceneSolrRevolution 2016)Alexandre Rafalovitch
 
Introduction to Apache Solr.
Introduction to Apache Solr.Introduction to Apache Solr.
Introduction to Apache Solr.ashish0x90
 
Tutorial on developing a Solr search component plugin
Tutorial on developing a Solr search component pluginTutorial on developing a Solr search component plugin
Tutorial on developing a Solr search component pluginsearchbox-com
 
Beyond full-text searches with Lucene and Solr
Beyond full-text searches with Lucene and SolrBeyond full-text searches with Lucene and Solr
Beyond full-text searches with Lucene and SolrBertrand Delacretaz
 
Introduction to Solr
Introduction to SolrIntroduction to Solr
Introduction to SolrErik Hatcher
 

La actualidad más candente (20)

Get the most out of Solr search with PHP
Get the most out of Solr search with PHPGet the most out of Solr search with PHP
Get the most out of Solr search with PHP
 
Integrating the Solr search engine
Integrating the Solr search engineIntegrating the Solr search engine
Integrating the Solr search engine
 
Lucene's Latest (for Libraries)
Lucene's Latest (for Libraries)Lucene's Latest (for Libraries)
Lucene's Latest (for Libraries)
 
Rapid Prototyping with Solr
Rapid Prototyping with SolrRapid Prototyping with Solr
Rapid Prototyping with Solr
 
Using Apache Solr
Using Apache SolrUsing Apache Solr
Using Apache Solr
 
Solr Black Belt Pre-conference
Solr Black Belt Pre-conferenceSolr Black Belt Pre-conference
Solr Black Belt Pre-conference
 
Introduction to Apache Lucene/Solr
Introduction to Apache Lucene/SolrIntroduction to Apache Lucene/Solr
Introduction to Apache Lucene/Solr
 
Building a Search Engine Using Lucene
Building a Search Engine Using LuceneBuilding a Search Engine Using Lucene
Building a Search Engine Using Lucene
 
code4lib 2011 preconference: What's New in Solr (since 1.4.1)
code4lib 2011 preconference: What's New in Solr (since 1.4.1)code4lib 2011 preconference: What's New in Solr (since 1.4.1)
code4lib 2011 preconference: What's New in Solr (since 1.4.1)
 
Rapid Prototyping with Solr
Rapid Prototyping with SolrRapid Prototyping with Solr
Rapid Prototyping with Solr
 
Solr introduction
Solr introductionSolr introduction
Solr introduction
 
Intro to Apache Lucene and Solr
Intro to Apache Lucene and SolrIntro to Apache Lucene and Solr
Intro to Apache Lucene and Solr
 
Introduction to Apache Solr
Introduction to Apache SolrIntroduction to Apache Solr
Introduction to Apache Solr
 
Lucene for Solr Developers
Lucene for Solr DevelopersLucene for Solr Developers
Lucene for Solr Developers
 
Rebuilding Solr 6 examples - layer by layer (LuceneSolrRevolution 2016)
Rebuilding Solr 6 examples - layer by layer (LuceneSolrRevolution 2016)Rebuilding Solr 6 examples - layer by layer (LuceneSolrRevolution 2016)
Rebuilding Solr 6 examples - layer by layer (LuceneSolrRevolution 2016)
 
Introduction to Apache Solr.
Introduction to Apache Solr.Introduction to Apache Solr.
Introduction to Apache Solr.
 
Tutorial on developing a Solr search component plugin
Tutorial on developing a Solr search component pluginTutorial on developing a Solr search component plugin
Tutorial on developing a Solr search component plugin
 
Beyond full-text searches with Lucene and Solr
Beyond full-text searches with Lucene and SolrBeyond full-text searches with Lucene and Solr
Beyond full-text searches with Lucene and Solr
 
Introduction to Solr
Introduction to SolrIntroduction to Solr
Introduction to Solr
 
Solr Presentation
Solr PresentationSolr Presentation
Solr Presentation
 

Similar a Building your own search engine with Apache Solr

An Introduction to Elastic Search.
An Introduction to Elastic Search.An Introduction to Elastic Search.
An Introduction to Elastic Search.Jurriaan Persyn
 
2009 Dils Flyweb
2009 Dils Flyweb2009 Dils Flyweb
2009 Dils FlywebJun Zhao
 
Intro to Elasticsearch
Intro to ElasticsearchIntro to Elasticsearch
Intro to ElasticsearchClifford James
 
Евгений Бобров "Powered by OSS. Масштабируемая потоковая обработка и анализ б...
Евгений Бобров "Powered by OSS. Масштабируемая потоковая обработка и анализ б...Евгений Бобров "Powered by OSS. Масштабируемая потоковая обработка и анализ б...
Евгений Бобров "Powered by OSS. Масштабируемая потоковая обработка и анализ б...Fwdays
 
2010 03 Lodoxf Openflydata
2010 03 Lodoxf Openflydata2010 03 Lodoxf Openflydata
2010 03 Lodoxf OpenflydataJun Zhao
 
Tagging search solution design
Tagging search solution designTagging search solution design
Tagging search solution designAlexander Tokarev
 
BigData Faceted Search Comparison between Apache Solr vs. ElasticSearch
BigData Faceted Search Comparison between Apache Solr vs. ElasticSearchBigData Faceted Search Comparison between Apache Solr vs. ElasticSearch
BigData Faceted Search Comparison between Apache Solr vs. ElasticSearchNetConstructor, Inc.
 
Battle of the giants: Apache Solr vs ElasticSearch
Battle of the giants: Apache Solr vs ElasticSearchBattle of the giants: Apache Solr vs ElasticSearch
Battle of the giants: Apache Solr vs ElasticSearchRafał Kuć
 
Tagging search solution design Advanced edition
Tagging search solution design Advanced editionTagging search solution design Advanced edition
Tagging search solution design Advanced editionAlexander Tokarev
 
NoSQL, Apache SOLR and Apache Hadoop
NoSQL, Apache SOLR and Apache HadoopNoSQL, Apache SOLR and Apache Hadoop
NoSQL, Apache SOLR and Apache HadoopDmitry Kan
 
Introduction to elasticsearch
Introduction to elasticsearchIntroduction to elasticsearch
Introduction to elasticsearchpmanvi
 
Emerging technologies /frameworks in Big Data
Emerging technologies /frameworks in Big DataEmerging technologies /frameworks in Big Data
Emerging technologies /frameworks in Big DataRahul Jain
 
About elasticsearch
About elasticsearchAbout elasticsearch
About elasticsearchMinsoo Jun
 
Slug: A Semantic Web Crawler
Slug: A Semantic Web CrawlerSlug: A Semantic Web Crawler
Slug: A Semantic Web CrawlerLeigh Dodds
 
Large Scale Crawling with Apache Nutch and Friends
Large Scale Crawling with Apache Nutch and FriendsLarge Scale Crawling with Apache Nutch and Friends
Large Scale Crawling with Apache Nutch and Friendslucenerevolution
 
Large Scale Crawling with Apache Nutch and Friends
Large Scale Crawling with Apache Nutch and FriendsLarge Scale Crawling with Apache Nutch and Friends
Large Scale Crawling with Apache Nutch and FriendsJulien Nioche
 

Similar a Building your own search engine with Apache Solr (20)

An Introduction to Elastic Search.
An Introduction to Elastic Search.An Introduction to Elastic Search.
An Introduction to Elastic Search.
 
Solr
SolrSolr
Solr
 
2009 Dils Flyweb
2009 Dils Flyweb2009 Dils Flyweb
2009 Dils Flyweb
 
Intro to Elasticsearch
Intro to ElasticsearchIntro to Elasticsearch
Intro to Elasticsearch
 
Евгений Бобров "Powered by OSS. Масштабируемая потоковая обработка и анализ б...
Евгений Бобров "Powered by OSS. Масштабируемая потоковая обработка и анализ б...Евгений Бобров "Powered by OSS. Масштабируемая потоковая обработка и анализ б...
Евгений Бобров "Powered by OSS. Масштабируемая потоковая обработка и анализ б...
 
ProjectHub
ProjectHubProjectHub
ProjectHub
 
2010 03 Lodoxf Openflydata
2010 03 Lodoxf Openflydata2010 03 Lodoxf Openflydata
2010 03 Lodoxf Openflydata
 
Solr -
Solr - Solr -
Solr -
 
Tagging search solution design
Tagging search solution designTagging search solution design
Tagging search solution design
 
BigData Faceted Search Comparison between Apache Solr vs. ElasticSearch
BigData Faceted Search Comparison between Apache Solr vs. ElasticSearchBigData Faceted Search Comparison between Apache Solr vs. ElasticSearch
BigData Faceted Search Comparison between Apache Solr vs. ElasticSearch
 
Battle of the giants: Apache Solr vs ElasticSearch
Battle of the giants: Apache Solr vs ElasticSearchBattle of the giants: Apache Solr vs ElasticSearch
Battle of the giants: Apache Solr vs ElasticSearch
 
Tagging search solution design Advanced edition
Tagging search solution design Advanced editionTagging search solution design Advanced edition
Tagging search solution design Advanced edition
 
NoSQL, Apache SOLR and Apache Hadoop
NoSQL, Apache SOLR and Apache HadoopNoSQL, Apache SOLR and Apache Hadoop
NoSQL, Apache SOLR and Apache Hadoop
 
Introduction to elasticsearch
Introduction to elasticsearchIntroduction to elasticsearch
Introduction to elasticsearch
 
Apache solr
Apache solrApache solr
Apache solr
 
Emerging technologies /frameworks in Big Data
Emerging technologies /frameworks in Big DataEmerging technologies /frameworks in Big Data
Emerging technologies /frameworks in Big Data
 
About elasticsearch
About elasticsearchAbout elasticsearch
About elasticsearch
 
Slug: A Semantic Web Crawler
Slug: A Semantic Web CrawlerSlug: A Semantic Web Crawler
Slug: A Semantic Web Crawler
 
Large Scale Crawling with Apache Nutch and Friends
Large Scale Crawling with Apache Nutch and FriendsLarge Scale Crawling with Apache Nutch and Friends
Large Scale Crawling with Apache Nutch and Friends
 
Large Scale Crawling with Apache Nutch and Friends
Large Scale Crawling with Apache Nutch and FriendsLarge Scale Crawling with Apache Nutch and Friends
Large Scale Crawling with Apache Nutch and Friends
 

Más de Biogeeks

Perl cures coronary heart disease
Perl cures coronary heart diseasePerl cures coronary heart disease
Perl cures coronary heart diseaseBiogeeks
 
Poing: a coder’s take on protein modelling
Poing: a coder’s take on protein modellingPoing: a coder’s take on protein modelling
Poing: a coder’s take on protein modellingBiogeeks
 
Identifying genes and proteins in text: a short review of available tools and...
Identifying genes and proteins in text: a short review of available tools and...Identifying genes and proteins in text: a short review of available tools and...
Identifying genes and proteins in text: a short review of available tools and...Biogeeks
 
DASbrick: A cloud based Rich internet application for Synthetic Biology Parts...
DASbrick: A cloud based Rich internet application for Synthetic Biology Parts...DASbrick: A cloud based Rich internet application for Synthetic Biology Parts...
DASbrick: A cloud based Rich internet application for Synthetic Biology Parts...Biogeeks
 
Ondex: Data integration and visualisation
Ondex: Data integration and visualisationOndex: Data integration and visualisation
Ondex: Data integration and visualisationBiogeeks
 
ABC-SysBio – Approximate Bayesian Computation in Python with GPU support
ABC-SysBio – Approximate Bayesian Computation in Python with GPU supportABC-SysBio – Approximate Bayesian Computation in Python with GPU support
ABC-SysBio – Approximate Bayesian Computation in Python with GPU supportBiogeeks
 

Más de Biogeeks (6)

Perl cures coronary heart disease
Perl cures coronary heart diseasePerl cures coronary heart disease
Perl cures coronary heart disease
 
Poing: a coder’s take on protein modelling
Poing: a coder’s take on protein modellingPoing: a coder’s take on protein modelling
Poing: a coder’s take on protein modelling
 
Identifying genes and proteins in text: a short review of available tools and...
Identifying genes and proteins in text: a short review of available tools and...Identifying genes and proteins in text: a short review of available tools and...
Identifying genes and proteins in text: a short review of available tools and...
 
DASbrick: A cloud based Rich internet application for Synthetic Biology Parts...
DASbrick: A cloud based Rich internet application for Synthetic Biology Parts...DASbrick: A cloud based Rich internet application for Synthetic Biology Parts...
DASbrick: A cloud based Rich internet application for Synthetic Biology Parts...
 
Ondex: Data integration and visualisation
Ondex: Data integration and visualisationOndex: Data integration and visualisation
Ondex: Data integration and visualisation
 
ABC-SysBio – Approximate Bayesian Computation in Python with GPU support
ABC-SysBio – Approximate Bayesian Computation in Python with GPU supportABC-SysBio – Approximate Bayesian Computation in Python with GPU support
ABC-SysBio – Approximate Bayesian Computation in Python with GPU support
 

Último

04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century educationjfdjdjcjdnsjd
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?Antenna Manufacturer Coco
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Enterprise Knowledge
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherRemote DBA Services
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdflior mazor
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProduct Anonymous
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 

Último (20)

04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 

Building your own search engine with Apache Solr

  • 1. Grow-your-own search engine An introduction to Solr Andrew Clegg http://lucene.apache.org/solr/
  • 2. Search engine toolkit Indexing Relevance ranking Spellchecking Hit highlighting Text processing tools etc. etc. Originally in Java Ported to C#.NET, Python, C (Lucy) Perl & Ruby bindings for Lucy Not an off-the-shelf application — requires coding
  • 3. Solr is a search server built on Lucene Web admin interface Configured via XML — no coding required Controlled via HTTP — REST-like web services Database-like schema — typed fields, unique keys Extended query syntax Caching, replication, clustering, delta indexing, sharding Used by: AOL, news.com, Digg, Apple, Disney, MTV, CiteSeerX, PubGet, NASA, the White House…
  • 4. Indexing ‗documents‘ indexing index XML Tokenization etc. CSV Term extraction HTTP request Frequency Database query calculations Web crawling & PDF/Word/etc. extraction via other tools
  • 5. Querying users query parsing index lookup index ranked hits: database/document IDs … and/or … content stored in index
  • 6. Query syntax example: name:(kinase OR phosphotransferase) AND species:Saccharomyces AND description:phosphoryl* AND createdate:[-1YEAR/DAY TO NOW] ―name contains kinase or phosphotransferase and species contains Saccharomyces and description contains word starting with phosphoryl and createdate at any time between one year ago (rounded to the nearest day) and now‖
  • 7. Ranking based on tf.idf model… tf = term frequency idf = inverse document frequency Query terms appearing frequently in this document are up-weighted … but … Query terms appearing in many documents are down-weighted Highest ranked docs contain multiple instances of rare query terms
  • 8. Case study: indexing the CATH protein domain database with Solr http://cathdb.info/
  • 12. CLASS e.g. Mixed Alpha/Beta depth = 1 ARCHITECTURE e.g. Alpha/Beta Barrel depth = 2 TOPOLOGY e.g. TIM Barrel depth = 3 (fold family) HOMOLOGOUS e.g. Aldolase class I depth = 4 SUPERFAMILY
  • 13. Previous CATH search box used plain SQL – lots of LIKE/equals matches against relevant fields • Slow • Poor coverage (exact string match needed) • No relevance ordering • Slow • Hard to debug unexpected results • Can‘t be exposed as a service • Did we mention slow...?
  • 14. Entities to index: • CATH nodes, esp. superfamilies • Proteins • Chains • Domains UniProt PDB GO CATHDB fetch & index (PostgreSQL) user query EC CATH KEGG website Fetching/indexing data needs no coding – just XML and a bit of SQL
  • 15.
  • 16. Fun with indexes • “Give me the 10 most similar domains to this one” Based on term similarity. Useful for quickly finding functional analogues. • “Give me the top 10 keywords for this superfamily” Can be used to generate tag clouds & summaries. Useful for data mining too.
  • 17. Some facts and figures • Indexing takes ~ 6 hours (dedicated box) • Index is currently 2.0GB • 357,850 entities in index (some are 1-3MB) • Typical single query time in Solr: 1ms – 200ms • Total web query time: ~ 3s Any questions?