Building a Search Engine
Using Apache Lucene/Solr
Road Map
• Problem Definition
• A Basic Search Engine Pipeline
• Meet Lucene
• Lucene API Examples
• Lucene Wrappers (Apache Solr, ElasticSearch, Regain, etc.)
• Applied Lucene (Real Examples)
Problem Definition
You have a large and growing body of data, and you want it to be searchable.
Analogy: searching for a needle in a haystack while more hay keeps being added to
the stack!
- Drawbacks of SQL databases (> 500,000,000 records …)
- Scalability
- Decentralization
A Basic Search Engine Pipeline
• Crawling: Grabbing the data
• Parsing [Optional]: Understanding the data
• Indexing: Build the holding structure
• Ranking: Sort the data
• Searching: Read that holding structure
Behind The Scenes: Analysis, Tokenization, Query Parsing, Boosting,
Calculating Term Vectors, Token Filtration,
Index Inversion, etc…
What is Lucene?
• Created by Doug Cutting (Lucene 1999, Nutch 2003, Hadoop 2006)
• Free, open-source information retrieval library written in Java
• Covers the application-side work: Indexing, Searching
• High performance, backed by a decade of research
• Widely supported, easy to customize
• No dependencies
What Lucene Ain’t
• A complete search engine
• An application
• A crawler
• A document filter/recognizer
Lucene Roles
[Diagram. Indexing path: Rich Document → Gather → Parse → Make Doc → Index → (index).
Search path: Search UI → Search App (e.g. webapp) → Search → (index).]
Lucene Strength Points
• Simple API
• Speed
• Concurrency
• Smart indexing (Incremental)
• Near Real Time Search
• Vector Space Search
• Heavily Used, Supported
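To make "Vector Space Search" concrete, here is a minimal sketch that treats a query and each document as term-count vectors and compares them by cosine similarity. The class and method names are hypothetical, and the sketch uses raw term counts only; Lucene's own scoring also takes inverse document frequency, length normalization and boosts into account.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration, not part of Lucene.
final class VectorSpaceSketch {
    // Count how often each lowercased, whitespace-separated term occurs.
    static Map<String, Integer> termFreq(String text) {
        Map<String, Integer> tf = new HashMap<>();
        for (String t : text.toLowerCase().split("\\s+")) {
            if (!t.isEmpty()) tf.merge(t, 1, Integer::sum);
        }
        return tf;
    }

    // Cosine of the angle between two sparse term-count vectors (0 = nothing shared).
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            dot += e.getValue() * b.getOrDefault(e.getKey(), 0);
            normA += e.getValue() * e.getValue();
        }
        for (int v : b.values()) normB += v * v;
        return (normA == 0 || normB == 0) ? 0 : dot / Math.sqrt(normA * normB);
    }
}
```

Ranking then amounts to computing `cosine(termFreq(query), termFreq(doc))` for each candidate document and sorting by the result, highest first.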
Lucene Query Types
• Single-Term vs. Multi-Term “+name:camel +type:animal”
• Wildcard Queries “text:wonder*”
• Fuzzy Queries “room~0.8”
• Range Queries “date:[25/5/2000 TO *]”
• Grouped Queries “text:(animal AND small)”
• Proximity Queries “hamlet macbeth”~10
• Boosted Queries “hamlet^5.0 AND macbeth”
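Fuzzy queries match terms that are spelled almost like the query term. A minimal sketch of the underlying idea is the Levenshtein edit distance below; the class name is hypothetical, and how the number after `~` translates into an allowed distance is version-specific and not modeled here.

```java
// Hypothetical illustration of the edit-distance idea behind fuzzy matching.
final class FuzzySketch {
    // Levenshtein distance: minimum number of single-character inserts,
    // deletes and substitutions needed to turn a into b (two-row DP).
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp; // reuse the two rows
        }
        return prev[b.length()];
    }
}
```

With this measure, "roam" and "rooms" are each one edit away from "room", so a fuzzy query on "room" could reasonably match both.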
API Sample I (Indexing)
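import java.io.*;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.*;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.*;
import org.apache.lucene.util.Version;

// Uses the Lucene 2.9/3.x API shown on the original slide.
public class Indexer {
    private IndexWriter writer;

    public Indexer(String indexDir) throws IOException {
        Directory dir = FSDirectory.open(new File(indexDir));
        // true = create a new index, overwriting any existing one.
        // LUCENE_CURRENT is deprecated; pinning a specific version is safer.
        writer = new IndexWriter(dir, new StandardAnalyzer(Version.LUCENE_CURRENT), true,
                IndexWriter.MaxFieldLength.UNLIMITED);
    }

    public void close() throws IOException {
        writer.close();
    }

    public void index(String dataDir, FileFilter filter) throws Exception {
        File[] files = new File(dataDir).listFiles();
        if (files == null) return; // dataDir is not a readable directory
        for (File f : files) {
            // Skip sub-directories and anything the caller's filter rejects
            if (!f.isFile() || (filter != null && !filter.accept(f))) continue;
            Document doc = new Document();
            doc.add(new Field("contents", new FileReader(f))); // tokenized, not stored
            doc.add(new Field("filename", f.getName(), Field.Store.YES, Field.Index.NOT_ANALYZED));
            writer.addDocument(doc);
        }
    }
}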
Indexing Pipeline (Simplified)
[Diagram: Document → Tokenizer → TokenFilter → Document Writer → (add) → Inverted Index]
Analysis Basic Types
"The quick brown fox jumped over the lazy dogs"
WhitespaceAnalyzer :
[The] [quick] [brown] [fox] [jumped] [over] [the] [lazy] [dogs]
SimpleAnalyzer :
[the] [quick] [brown] [fox] [jumped] [over] [the] [lazy] [dogs]
StopAnalyzer :
[quick] [brown] [fox] [jumped] [over] [lazy] [dogs]
StandardAnalyzer:
[quick] [brown] [fox] [jumped] [over] [lazy] [dogs]
"XY&Z Corporation - xyz@example.com"
WhitespaceAnalyzer:
[XY&Z] [Corporation] [-] [xyz@example.com]
SimpleAnalyzer:
[xy] [z] [corporation] [xyz] [example] [com]
StopAnalyzer:
[xy] [z] [corporation] [xyz] [example] [com]
StandardAnalyzer:
[xy&z] [corporation] [xyz@example.com]
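The differences between the first three analyzers can be sketched in plain Java. This is only an approximation of the behavior listed above, not Lucene code: the class name, method names and the tiny stop-word set are assumptions made for the illustration, and StandardAnalyzer's grammar-based tokenization is left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical re-creation of three analyzer behaviours; not Lucene's implementation.
final class AnalyzerSketch {
    // Small illustrative stop-word set (an assumption, not Lucene's full list).
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("the", "a", "an", "and", "of", "to"));

    // WhitespaceAnalyzer-like: split on whitespace, keep case and punctuation.
    static List<String> whitespace(String text) {
        List<String> out = new ArrayList<>();
        for (String t : text.trim().split("\\s+")) if (!t.isEmpty()) out.add(t);
        return out;
    }

    // SimpleAnalyzer-like: split on anything that is not a letter, then lowercase.
    static List<String> simple(String text) {
        List<String> out = new ArrayList<>();
        for (String t : text.split("[^\\p{L}]+")) if (!t.isEmpty()) out.add(t.toLowerCase());
        return out;
    }

    // StopAnalyzer-like: the simple analysis followed by stop-word removal.
    static List<String> stop(String text) {
        List<String> out = simple(text);
        out.removeAll(STOP_WORDS);
        return out;
    }
}
```

Each method is a tokenizer followed by zero or more filters, which is the same chain a custom analyzer would assemble.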
The Inverted Index (In a nutshell)
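In a nutshell, an inverted index maps each term to the documents that contain it, so a lookup starts from the term instead of scanning every document. The sketch below is a toy in-memory version with hypothetical names; a real Lucene index also stores positions, frequencies and more, in on-disk segments.

```java
import java.util.Collections;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

// Toy inverted index: term -> sorted set of document ids (hypothetical names).
final class InvertedIndexSketch {
    private final Map<String, SortedSet<Integer>> postings = new TreeMap<>();

    // Tokenize on non-alphanumeric characters, lowercase, and record the doc id per term.
    void add(int docId, String text) {
        for (String term : text.toLowerCase().split("[^\\p{L}\\p{N}]+")) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
            }
        }
    }

    // Ids of all documents containing the term (empty set if the term is unknown).
    SortedSet<Integer> search(String term) {
        return postings.getOrDefault(term.toLowerCase(), Collections.emptySortedSet());
    }

    // AND query: documents that contain both terms.
    SortedSet<Integer> searchAll(String first, String second) {
        SortedSet<Integer> result = new TreeSet<>(search(first));
        result.retainAll(search(second));
        return result;
    }
}
```

An AND query becomes an intersection of two posting lists, which is why multi-term searches stay cheap even when the collection is large.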
API Sample II (Searching)
public void search(String indexDir, String q) throws IOException, ParseException {
    Directory dir = FSDirectory.open(new File(indexDir));
    IndexSearcher is = new IndexSearcher(dir, true); // true = open read-only
    try {
        // Use the same analyzer as at indexing time so query terms match indexed terms.
        // Lucene 3.x requires the Version argument on QueryParser.
        QueryParser parser = new QueryParser(Version.LUCENE_CURRENT, "contents",
                new StandardAnalyzer(Version.LUCENE_CURRENT));
        Query query = parser.parse(q);
        TopDocs hits = is.search(query, 10); // top 10 hits by score
        System.err.println("Found " + hits.totalHits + " document(s)");
        for (ScoreDoc scoreDoc : hits.scoreDocs) {
            Document doc = is.doc(scoreDoc.doc); // load the stored fields
            System.out.println(doc.get("filename"));
        }
    } finally {
        is.close(); // release the searcher even if parsing or searching fails
    }
}
Index Update
• Lucene has no in-place update for an indexed document. So?
• Incremental Indexing (Index Merging)
• Delete + Add = Update (IndexWriter.updateDocument(Term, Document) does both in one call)
• Index Optimization
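The "Delete + Add = Update" idea, and why optimization matters afterwards, can be sketched with a toy document store. Everything here is hypothetical (class, methods, the `id`/`name` fields); it only mirrors the concept that deletes are marked first and the space is reclaimed later.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Toy store showing update as delete + add; not Lucene code.
final class UpdateSketch {
    private final List<Map<String, String>> docs = new ArrayList<>();
    private final Set<Integer> deleted = new HashSet<>(); // doc numbers marked as deleted

    // Convenience builder for a two-field document (hypothetical fields).
    static Map<String, String> doc(String id, String name) {
        Map<String, String> d = new HashMap<>();
        d.put("id", id);
        d.put("name", name);
        return d;
    }

    void addDocument(Map<String, String> doc) { docs.add(doc); }

    // Mark every live document whose field equals value; returns how many were marked.
    int deleteDocuments(String field, String value) {
        int count = 0;
        for (int i = 0; i < docs.size(); i++) {
            if (!deleted.contains(i) && value.equals(docs.get(i).get(field))) {
                deleted.add(i);
                count++;
            }
        }
        return count;
    }

    // Update = delete the old version(s), then add the new one.
    void updateDocument(String field, String value, Map<String, String> doc) {
        deleteDocuments(field, value);
        addDocument(doc);
    }

    int numDocs() { return docs.size() - deleted.size(); } // live documents only
    int maxDoc() { return docs.size(); }                   // includes deleted slots

    // Reclaim the space held by deleted documents.
    void optimize() {
        List<Map<String, String>> live = new ArrayList<>();
        for (int i = 0; i < docs.size(); i++) if (!deleted.contains(i)) live.add(docs.get(i));
        docs.clear();
        docs.addAll(live);
        deleted.clear();
    }
}
```

After an update the store holds one more slot than it has live documents; the gap only closes once `optimize()` runs, which is the role merging and optimization play in the list above.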
API Sample III (Deleting)
Via IndexReader
  void deleteDocument(int docNum)
      Deletes the document numbered docNum.
  int deleteDocuments(Term term)
      Deletes all documents that have a given term indexed.
Via IndexWriter
  void deleteAll()
      Deletes all documents in the index.
  void deleteDocuments(Query query)
      Deletes the document(s) matching the provided query.
  void deleteDocuments(Query[] queries)
      Deletes the document(s) matching any of the provided queries.
  void deleteDocuments(Term term)
      Deletes the document(s) containing term.
  void deleteDocuments(Term[] terms)
      Deletes the document(s) containing any of the terms.
Some Statistics
• Measured with Lucene.NET (a .NET port of Lucene)

Local Testing (index and search are on the same device)
Dataset Size    Indexing (Min.)    Retrieval (Ms.)    Opt. (Min.)
4.3 GB          ~32, 180 MB        ~50 -> 300         0.2
40 GB           ~360, 2.6 GB       ~100 -> 3000       3.2

Over Network Testing (file server for the index files, standalone searching workstations)
Dataset Size    Indexing (Min.)    Retrieval (Ms.)    Opt. (Min.)
4.3 GB          X, 180 MB          ~300 -> 700        X
40 GB           X, 2.6 GB          ~400 -> 4500       X
Lucene Wrappers (Apache Solr)
• A Java search server built on top of Lucene
• A web application that can be deployed in any servlet container (Apache Tomcat, Jetty)
• Exposes a REST-style HTTP service
• Has an administration interface
• Built-in integration with Apache Tika (a toolkit of document parsers)
• Scalable
• Integrates with Apache Hadoop and Apache Cassandra
Solr Administration Interface
Solr Architecture (The Big Picture)
Note: Response formats include JSON, PHP, Python, …, not only XML.
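Over plain HTTP, the response format is chosen per request with the `wt` parameter. The sketch below only builds such a request URL; the class and method names are hypothetical, and the exact path (for example whether a core name sits before `select`) is an assumption that depends on how Solr is deployed.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper that builds a Solr select URL; it does not contact a server.
final class SolrUrlSketch {
    static String selectUrl(String baseUrl, String query, String wt, int rows) {
        String base = baseUrl.endsWith("/") ? baseUrl : baseUrl + "/";
        try {
            // q = the query, wt = response writer (e.g. json, xml), rows = page size
            return base + "select?q=" + URLEncoder.encode(query, "UTF-8")
                    + "&wt=" + URLEncoder.encode(wt, "UTF-8")
                    + "&rows=" + rows;
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on the JVM, so this should not happen
            throw new IllegalStateException(e);
        }
    }
}
```

For example, `selectUrl("http://localhost:8080/solr/", "*:*", "json", 10)` asks for the first ten documents of a match-all query as JSON.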
Communication with Solr (Sending Docs)
• Direct Connection OR Through APIs (SolrJ, SolrNET)
// make a connection to Solr server
SolrServer server = new HttpSolrServer("http://localhost:8080/solr/");
// prepare a doc
final SolrInputDocument doc1 = new SolrInputDocument();
doc1.addField("id", 1);
doc1.addField("firstName", "First Name");
doc1.addField("lastName", "Last Name");
final Collection<SolrInputDocument> docs = new ArrayList<SolrInputDocument>();
docs.add(doc1);
// add docs to Solr
server.add(docs);
server.commit();
Communication with Solr (Searching)
final SolrQuery query = new SolrQuery();
query.setQuery("*:*"); // match all documents
query.addSortField("firstName", SolrQuery.ORDER.asc);
final QueryResponse rsp = server.query(query);
final SolrDocumentList solrDocumentList = rsp.getResults();
for (final SolrDocument doc : solrDocumentList) {
    final String firstName = (String) doc.getFieldValue("firstName");
    // String.valueOf avoids a ClassCastException if "id" is not a string field in the schema
    final String id = String.valueOf(doc.getFieldValue("id"));
}
Some Statistics
Note 1: We are sending HTTP POST requests to the Solr server, which adds noticeable overhead
compared with the pure Lucene.NET model.
Note 2: Consider a server receiving requests from everywhere; OS-level queuing can add
some delay, depending on the queuing strategy.

Dataset Size    Indexing (Min.)                Retrieval (Ms.)    Opt. (Min.)
4.3 GB          ~39.5, 169 MB                  ~300 -> 3000       0.203
40 GB           ~400 (Not accurate), 40 GB     ~300 -> 10000      ~7 (Not accurate)
Lucene/Solr Users
• Instagram (geo-search API)
• Netflix (Generic search feature)
• SourceForge (Generic search feature)
• Eclipse (Documentation search)
• LinkedIn (Recently, Job Search)
• Krugle (SourceCode Search)
• Wikipedia (Recently, Generic Content Search)
References
• Manning Lucene in Action (2nd Edition)
• Lucene Main Website
• Another Presentation on SlideShare
Thank You


Editor's Notes

  1. Don’t forget the concept of documents.
  2. Note that you can even write your own custom analyzers, depending on the application’s needs.