SEARCH ENGINE CAPABILITIES
WITH
APACHE SOLR/LUCENE
AGENDA
• About - Search Engine & its capabilities
• Apache Solr/Lucene - Introduction
• Exploring Lucene
  • Features & Capabilities
  • Library Component Stack
  • Architecture Framework
  • Work-out
• Exploring Solr
  • Features & Capabilities
  • Architecture Overview
• Work-out search capabilities with Solr
  • About Pre-requisites & Set up
  • Work-out search capabilities
• What Next – Scope & Future
ABOUT - SEARCH ENGINE & CAPABILITIES
A search engine is a tool that processes the input provided by the end user and locates matching information, documents, or web pages in an index by applying a defined set of algorithms (crawling/spidering, indexing, querying, ranking, etc.).
A search engine's capabilities vary with demand, context, content, and model. At the top level, search engines fall into two categories:
• Full-text search across the pages of multiple web sites
• Full-text search within a single site or document
Search engines can be further categorized as:
• Crawler-based search engines
• Social search engines
• Directory search engines
• Hybrid search engines
• Specialty search engines
• Paid/promotional inclusion search engines
• Pay-per-click search engines (sponsored results)
• Open source search engines
APACHE SOLR/LUCENE - INTRODUCTION
Apache Lucene is a Java-based, high-performance, full-featured text search engine library.
• Created by Doug Cutting in 1999 and released under the Apache License
• Document-oriented architecture
• Widely recognized for its full-text indexing and searching capabilities
• Fast indexing (up to 150 GB/hour) with a small memory footprint (about 1 MB of heap)
• Flexible API that is independent of file format (e.g. PDF, HTML, Word, OpenDocument)
• Can be used to search text and documents both locally and on the web
• Forms the basis of projects such as Nutch, Solr, Elasticsearch, Compass, and DocFetcher
APACHE SOLR/LUCENE - INTRODUCTION
Apache Solr is a high-performance, Java-based enterprise search server platform built on Lucene. It provides distributed indexing, replication, load-balanced querying, and automated failover and recovery with centralized configuration.
• Created by Yonik Seeley in 2004 at CNET Networks and donated to Apache in 2006
• Runs within a servlet container such as Tomcat or Jetty (the default)
• Multi-core architecture: multiple cores can run in the same webapp
• Well recognized for distributed search capabilities, e.g. cluster search
• Open source and extensible via independent plug-ins, e.g. Carrot2
• SolrCloud support for cloud-based applications (since 2012)
EXPLORING LUCENE – FEATURES & CAPABILITIES
Lucene rests on five key fundamentals:
• Document
• Field
• Analyzer (tokenizers/filters, stop words, synonyms, multilingual support…)
• Indexing (inverted index, encoding, segmentation, data compression, commit strategy)
• Querying/searching (Lucene query model, evaluation, scoring, similarity, extensions)
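To make the five fundamentals concrete, here is a minimal sketch in plain Java (no Lucene dependency). It is not how Lucene is implemented: the class name `ToyInvertedIndex`, the stop-word list, and the AND-only matching are assumptions made for illustration. It only shows how analyzing text, building an inverted index, and querying fit together.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy illustration only; Lucene's real index is segmented, compressed and far richer.
class ToyInvertedIndex {
    // Hypothetical stop-word list chosen for this example.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "the", "is", "to", "be"));

    // Inverted index: term -> sorted ids of the documents containing it.
    private final Map<String, TreeSet<Integer>> postings = new HashMap<>();
    // Stored documents, addressed by document id.
    private final List<String> storedDocs = new ArrayList<>();

    // "Analyzer": lower-case, split on non-alphanumerics, drop stop words.
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    // "Indexing": store the text and record the document id under each term.
    int addDocument(String text) {
        int docId = storedDocs.size();
        storedDocs.add(text);
        for (String term : analyze(text)) {
            postings.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
        }
        return docId;
    }

    // "Searching": analyze the query the same way, then intersect posting lists (AND).
    Set<Integer> search(String query) {
        TreeSet<Integer> result = null;
        for (String term : analyze(query)) {
            TreeSet<Integer> docs = postings.getOrDefault(term, new TreeSet<>());
            if (result == null) {
                result = new TreeSet<>(docs);
            } else {
                result.retainAll(docs);
            }
        }
        return result == null ? new TreeSet<>() : result;
    }

    public static void main(String[] args) {
        ToyInvertedIndex index = new ToyInvertedIndex();
        index.addDocument("This is the text to be indexed.");
        index.addDocument("Lucene builds an inverted index.");
        System.out.println(index.search("text"));           // prints [0]
        System.out.println(index.search("inverted index")); // prints [1]
    }
}
```

Because the same `analyze` step runs at index time and at query time, a query for "Text" matches the stored lower-cased term; without stemming, "index" does not match "indexed".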
Building on these, Lucene provides:
• High-performance indexing (incremental or batch; index size is roughly 20-30% of the size of the text indexed)
• Powerful, complex query processing, e.g. phrase, wildcard, proximity, range, facet, and fuzzy queries
• Fielded searching and sorting, e.g. by title, author, or contents
• Ranked searching (best results returned first)
• Multiple-index searching with merged results
• Simultaneous index update and searching
• Flexible faceting, highlighting, joins, and result grouping
• Pluggable ranking models, including the Vector Space Model
• Configurable storage engine (codecs)
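Ranked searching can be illustrated with a deliberately simplified TF-IDF score, sketched below in plain Java. This is an assumption-laden toy, not Lucene's similarity formula: the class `ToyTfIdfRanker`, the `idf = ln(1 + N/df)` weighting, and the absence of length normalization are all choices made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Locale;

// Simplified TF-IDF ranking; Lucene's real scoring is more elaborate.
class ToyTfIdfRanker {
    // Lower-case and split on non-alphanumeric characters.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Number of times the term occurs in one document.
    static int termFrequency(String term, List<String> docTokens) {
        int count = 0;
        for (String t : docTokens) {
            if (t.equals(term)) {
                count++;
            }
        }
        return count;
    }

    // score(doc) = sum over query terms of tf(term, doc) * ln(1 + N / df(term)).
    static double[] score(String query, List<String> docs) {
        List<List<String>> tokenized = new ArrayList<>();
        for (String d : docs) {
            tokenized.add(tokenize(d));
        }
        double[] scores = new double[docs.size()];
        for (String term : tokenize(query)) {
            int df = 0;
            for (List<String> doc : tokenized) {
                if (doc.contains(term)) {
                    df++;
                }
            }
            if (df == 0) {
                continue; // term occurs nowhere, contributes nothing
            }
            double idf = Math.log(1.0 + (double) docs.size() / df);
            for (int i = 0; i < tokenized.size(); i++) {
                scores[i] += termFrequency(term, tokenized.get(i)) * idf;
            }
        }
        return scores;
    }

    // Indexes of matching documents, best score first.
    static List<Integer> rank(String query, List<String> docs) {
        double[] scores = score(query, docs);
        List<Integer> order = new ArrayList<>();
        for (int i = 0; i < scores.length; i++) {
            if (scores[i] > 0) {
                order.add(i);
            }
        }
        order.sort(Comparator.comparingDouble((Integer i) -> -scores[i]));
        return order;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList(
                "Solr is a search server built on Lucene",
                "Lucene is a search library; Lucene indexes text",
                "Jetty is a servlet container");
        System.out.println(rank("lucene search", docs)); // prints [1, 0]
    }
}
```

Document 1 ranks first because "lucene" occurs twice in it; document 2 matches no query term and is left out of the results.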
LUCENE – LIBRARY COMPONENT STACK
(Diagram: the Lucene library component stack. Recoverable module names, grouped by layer:)
• Modules: Test Framework, Spatial, Benchmark, Grouping, Suggest, Facet, Sandbox, Highlighter, Query Parser, Query, Misc, Join, Memory
• Analyzers: Common, ICU, Phonetic, UIMA, Stempel, Smart CN, Kuromoji, Morfologik
• Lucene Codec
• Lucene Core: Analysis, Document, Index, Search (Payload, Similarity, Span), Store, Codec (Per Field), Util (Automaton, Finite State Transducer, Compress, Packed Int/Array)
Exploring Lucene - Architecture Framework
(Diagram: Lucene architecture. Recoverable labels:)
• Input: documents of the form <doc> <field></field> <field></field> … </doc>
• Indexing side: add/update, Text Analysis Chain (per-field token stream), Index Writer (Doc Writer, Index Chain), flush/commit to segments
• Storage: Directories, Codec, segments
• Search side: open/reopen, Index Reader (Segment Readers; Collection Stats, TermsEnum, DocsAndPositionsEnum), Query Parser, Query, Scoring API, Collection, ranked results, retrieval of stored values
Exploring Lucene – Work out
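Setting your CLASSPATH
Download and extract the Lucene distribution, then add the required JARs (Lucene core, query parser, and common analyzers) to your Java CLASSPATH. The snippets below use the Lucene 4.x API.
Indexing Files
import java.io.File;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

// Analyzer that tokenizes and normalizes the text of each field
Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_CURRENT);
// Store the index on disk; FSDirectory.open takes a File in Lucene 4.x
Directory directory = FSDirectory.open(new File("/tmp/testindex"));
IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_CURRENT, analyzer);
IndexWriter iwriter = new IndexWriter(directory, config);
// A document with a single field that is both indexed and stored
Document doc = new Document();
String text = "This is the text to be indexed.";
doc.add(new Field("fieldname", text, TextField.TYPE_STORED));
iwriter.addDocument(doc);
// Closing the writer commits the changes
iwriter.close();
Searching Files
import static org.junit.Assert.assertEquals;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;

// Reuses the directory and analyzer created in the indexing step
DirectoryReader ireader = DirectoryReader.open(directory);
IndexSearcher isearcher = new IndexSearcher(ireader);
// Parse a simple query that searches "fieldname" for the term "text"
QueryParser parser = new QueryParser(Version.LUCENE_CURRENT, "fieldname", analyzer);
Query query = parser.parse("text");
// No filter (null); return at most 1000 hits
ScoreDoc[] hits = isearcher.search(query, null, 1000).scoreDocs;
// Load each hit and check its stored field value
for (int i = 0; i < hits.length; i++) {
    Document hitDoc = isearcher.doc(hits[i].doc);
    assertEquals("This is the text to be indexed.", hitDoc.get("fieldname"));
}
ireader.close();
directory.close();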
EXPLORING SOLR – FEATURES & CAPABILITIES
Solr features (in addition to Lucene)
• Caching
  • Document cache instances
  • User-level caching
  • Pluggable cache implementations
• SolrCloud
  • Automated distributed indexing/sharding
  • Real-time indexing
  • Transaction log
  • Query failover and recovery
• Additional
  • Ajax-based admin interface bundling cache and logging management, monitoring, text analysis debugging, schema browser, web query output, SolrCloud dashboard, etc.
  • Rich document parsing and indexing (PDF, Word, HTML, etc.) using Apache Tika
  • Apache UIMA integration for configurable metadata extraction
• Solr Core
  • Multi-core analysis and indices
  • Dynamically create/delete document collections
  • Pluggable query handlers
  • Extensible XML data format
  • Component-based request handler
  • Distributed search support
  • Uniqueness/duplicate document detection
  • Custom index processing chains
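The caching items above can be pictured with a small least-recently-used (LRU) cache, a common eviction policy for document and query caches. The sketch below is plain Java built on `LinkedHashMap`; the class `ToyLruCache` and its capacity are hypothetical and do not reproduce any Solr class.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Toy LRU cache: evicts the least recently accessed entry once capacity is exceeded.
class ToyLruCache<K, V> extends LinkedHashMap<K, V> {
    private static final long serialVersionUID = 1L;
    private final int maxEntries;

    ToyLruCache(int maxEntries) {
        // accessOrder = true keeps entries ordered from least to most recently used
        super(16, 0.75f, true);
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        // Called by LinkedHashMap after each insertion
        return size() > maxEntries;
    }

    public static void main(String[] args) {
        ToyLruCache<Integer, String> documentCache = new ToyLruCache<>(2);
        documentCache.put(1, "doc one");
        documentCache.put(2, "doc two");
        documentCache.get(1);            // touch doc 1 so doc 2 becomes the eldest
        documentCache.put(3, "doc three");
        System.out.println(documentCache.keySet()); // prints [1, 3]
    }
}
```

A cache like this trades memory for repeated lookups; in a search server the capacity would be tuned against index size and query patterns.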
SOLR ARCHITECTURE
(Diagram: Solr layered on top of Apache Lucene. Recoverable labels:)
• Request handlers: /select, /spell, /admin, Data Import Handler (SQL/RSS), Extracting Request Handler (PDF/Word, via Apache Tika)
• Update handlers: XML, CSV, binary
• Response writers: XML, JSON, binary
• Search components: Query, Spelling, Faceting, Highlighting, Statistics, Debug, More Like This, Clustering, Distributed Search
• Update processors: Signature, Logging, Indexing
• Solr core services: Query Parsing, Filtering, Search, Caching, Faceting, Highlighting, Index Replication
• Configuration: SolrConfig and Schema (a field type "text1" with a filter chain of whitespace, customFilter, synonyms, and porter; a field "title" of type "text1"; a custom field "cust1")
• Lucene layer: Core Search (IndexReader/Searcher), Indexing (IndexWriter), Text Analysis
• Example request: http://.../select?q=cheese&wt=xml
APACHE SOLR – PRE-REQUISITES & SET UP
Layout of the Solr distribution:
• contrib: modules that extend Solr
  • analysis-extras: additional text analysis components for multilingual support, e.g. Chinese
  • clustering: an engine for clustering search results
  • dataimporthandler (DIH): imports data into Solr from external sources
  • extraction: integration with Apache Tika, a framework for extracting text from common file formats (also used by DIH's TikaEntityProcessor)
  • uima: integration with Apache UIMA, a framework for extracting metadata from text, such as identifying proper names and detecting the language
  • velocity: a simple search UI framework based on the Velocity templating language
• dist: the Solr distributable WAR and the contrib JAR files
APACHE SOLR – PRE-REQUISITES & SET UP
Layout of the Solr distribution (continued):
• example: a complete Solr server running on the Jetty servlet engine, serving as a demo
  • example/etc: Jetty's server configuration
  • example/exampledocs: sample documents to be indexed into the default Solr configuration, along with post.jar for sending documents to Solr
  • example/solr: the default sample Solr configuration
  • example/webapps: the location from which Jetty expects to deploy Solr
QUERY EXAMPLES
• DisMax:
  http://solr/select?qt=dismax&start=0&rows=2
    &q=super man                // user query
    &qf=title^3 subject^2 body  // fields to query, with boosts
    &pf=title^2 body            // fields for phrase queries
    &ps=100                     // slop for those phrase queries
    &tie=.1                     // multi-field match reward
    &mm=2                       // number of terms that should match
    &bf=popularity              // boost function
• Facet: http://solr/select?q=foo&wt=json&indent=on&facet=true&facet.field=cat&facet.query=price:[0 TO 100]&facet.query=manu:IBM
• Filter: &q=memory&fq=inStock:true&facet=true&…
• Highlighting: http://solr/select?q=lcd&wt=json&indent=on&hl=true&hl.fl=features
• Date range: releaseDate:[2000 TO 2007]
• Wildcard: sup?r, su*r, super*
• Fuzzy (Levenshtein distance, with optional minimum similarity): spider~0.7
• Boolean: (Superman AND "Lex Luthor") OR (+Batman +Joker)
• Phrase query: balanced double quotes around the phrase; a '+' prefix marks a term as required, '-' as prohibited
• Function query: log(sum(popularity,1))
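The query strings above are shown unencoded for readability; a client has to URL-encode each parameter before sending it. The following plain-Java sketch assembles such a request URL. The class `SolrQueryUrl` and its `build` helper are hypothetical names for this example, and the host `http://solr/select` is the placeholder used on the slide, not a real endpoint.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper that turns a parameter map into an encoded Solr request URL.
class SolrQueryUrl {
    // Percent-encode one key or value as UTF-8 form data (spaces become '+').
    static String encode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    // Append the parameters to the base URL in insertion order.
    static String build(String baseUrl, Map<String, String> params) {
        StringBuilder url = new StringBuilder(baseUrl);
        char separator = '?';
        for (Map.Entry<String, String> param : params.entrySet()) {
            url.append(separator)
               .append(encode(param.getKey()))
               .append('=')
               .append(encode(param.getValue()));
            separator = '&';
        }
        return url.toString();
    }

    public static void main(String[] args) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("qt", "dismax");
        params.put("q", "super man");
        params.put("qf", "title^3 subject^2 body");
        // prints http://solr/select?qt=dismax&q=super+man&qf=title%5E3+subject%5E2+body
        System.out.println(build("http://solr/select", params));
    }
}
```

A `LinkedHashMap` keeps the parameters in the order they were added, which makes the resulting URL predictable; repeated parameters such as multiple `facet.query` values would need a list-valued structure instead.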
THANK YOU

Search Engine Capabilities - Apache Solr(Lucene)

  • 2. AGENDA  About - Search Engine & its capabilities  Apache Solr/Lucene - Introduction  Exploring Lucene  Features & Capabilities  Library Component Stack  Architecture Framework  Work-out  Exploring Solr  Features & Capabilities  Architecture Overview  Work-out search capabilities with Solr  About Pre-requisites & Set up  Work-out search capabilities  What Next – Scope & Future
  • 3. ABOUT - SEARCH ENGINE & CAPABILITIES An engine/tool which processes the input provided by the end user and find/locate an index of information, documents or web page via applying a certain set of algorithms( indexing, ranking, spider, crawling, querying etc.) defined. A Search engine capabilities varies per the demand, context, content information, model . In basic term, top level of categorization can be derived as – Multiple web sites page/full text search Single site /document full text search Further to above, Search Engine can be categorized as – Crawler-Based Search Engines Social search engines Directories Search Engines Hybrid Search Engines Specialty Search Engines Paid/Promotional Inclusion search engines Pay Per Click (sponsored results) Open source search engines
  • 4. APACHE SOLR/LUCENE - INTRODUCTION Apache Lucene is a java based high-performance full-featured text search engine library.  Is Developed by Doug Cutting in 1999 and released under Apache Software  Document oriented model architecture  Widely recognized for full text indexing searching capability  Fast indexing up to 150GB/hr and low memory (only 1MB heap)  Flexible API (independent of File Format ex.- pdf, html, word, open document)  Can be used for text/document searching across documents locally and web Extended in the project i.e.– Nutch ,Solr,Elastic search, Compass, DocFetcher
  • 5. APACHE SOLR/LUCENE - INTRODUCTION Apache Solr is enterprise high performance java based (written over Lucene) search server platform which demonstrate distributed indexing, replication, load- balanced querying, automated failover /recovery with centralized configuration.  Is developed by Yonik Seeley in 2004 at Cnetwork & donated to Apache in 2006  Runs within Servlet container like Tomcat Or Jetty (Default)  Multi Core Architecture Ability to have multiple cores running in the same webapp  Well recognized for distributed search capabilities ex.- cluster search  Open source and extendable via independent plug-in ex. – Carrot  SolrCloud Support fro Cloud based application (2012 Edition)
  • 6. EXPLORING LUCENE – FEATURES & CAPABILITIES Five key fundamentals on which Lucene works i.e. –  Document  Field  Analyzer(tokens/filter, stop words, synonym, multilingual support…)  Indexing (Inverted Index, encoding, segmentation, data compression, Commit strategy)  Querying/Searching ( Lucene query model, evaluation, scoring, Similarity,extns) As the result of the above, Lucene provides –  High-Performance Indexing ( incremental/batch. Also, size 20-30% the size of text indexed )  Powerful/Complex query processing e.g.- phrase, wildcard, proximity, range, facet , fuzzy query …  Fielded searching and sorting e.g. title, author, contents  Ranked searching ( best results returned first)  Multiple-index searching with merged results  Allows simultaneous index update and searching  Flexible faceting, highlighting, joins and result grouping  Pluggable ranking models including Vector Space Model  Configurable storage engine (Codec's)
  • 7. LUCENE – LIBRARY COMPONENT STACK Lucene Test Framework Lucene Analyzer Lucene Indexer Spatial Benchmark Grouping Analyzer ICU Suggest facet Sandbox Highlighter Query Parser Query Analyzer Common Analyzer Phonetic Analyzer UIMA Analyzer Stemple Analyzer Smart CN Analyzer Koromogi Analyzer Morfologik misc joinmemory Lucene Codec Lucene Core Search Payload Similarity Store Finite State Transducer Compress UtilAutomation Document Packed Int/Array Span AnalysisIndex Codec Codec Per Field
  • 8. Exploring Lucene - Architecture Framework Directories Codec Index Writer Index Reader query Scoring API Collection Text Analysis Chain Query Parser Doc Writer Index Chain Segment Segment Reader Collection Stat TextEnum DocAndPositionEnum <doc> <field></field> <field></field> … </doc> Ranked Result Search segment Flush/commit Open/Reopen Add/ update Retrieve Stored value Per field token stream
  • 9. Exploring Lucene – Work out Setting your CLASSPATH Download and extract Lucene distribution and jars (Lucene Core , Queryparser, common analysis)in your Java CLASSPATH Indexing Files Analyzer analyzer = new StandardAnalyzer ( Version.LUCENE_CURRENT ); Directory directory = FSDirectory.open("/tmp/testindex"); IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_CU RRENT, analyzer); IndexWriter iwriter = new IndexWriter( directory, config); Document doc = new Document(); String text = "This is the text to be indexed."; doc.add(new Field("fieldname", text, TextField.TYPE_STORED)); iwriter.addDocument(doc); iwriter.close(); Searching Files DirectoryReader ireader = DirectoryReader.open( directory ); IndexSearcher isearcher = new IndexSearcher( ireader ); QueryParser parser = new QueryParser(Version.LUCENE_CURREN T, "fieldname", analyzer); Query query = parser.parse("text"); ScoreDoc[] hits = isearcher.search (query, null, 1000).scoreDocs; for (int i = 0; i < hits.length; i++) { Document hitDoc = isearcher.doc(hits[i].doc); assertEquals("This is the text to be indexed.", hitDoc.get("fieldname")); } ireader.close(); directory.close();
  • 10. EXPLORING SOLR – FEATURES & CAPABILITIES
    Solr features (in addition to Lucene)
    Caching
      - Document cache instances
      - User-level caching
      - Pluggable cache implementations
    SolrCloud
      - Automated distributed indexing/sharding
      - Real-time indexing
      - Transaction log
      - Query failover and recovery
    Additional
      - Ajax-based admin interface with a bundle of functionality: cache and logging management, monitoring, text analysis debugging, schema browser, web query output, SolrCloud dashboard, etc.
      - Rich document parsing and indexing (PDF, Word, HTML, etc.) using Apache Tika
      - Apache UIMA integration for configurable metadata extraction
    Solr Core
      - Multi-core analysis and indices
      - Dynamically create/delete document collections
      - Pluggable query handlers
      - Extensible XML data format
      - Component-based request handler
      - Distributed search support
      - Uniqueness/duplicate document detection
      - Custom index processing chains
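As a rough illustration of what a pluggable cache implementation might look like, the sketch below builds a least-recently-used cache on top of `java.util.LinkedHashMap`. The class name `LruCacheSketch` and the size limit are assumptions made for this example; it is not Solr's own cache class and ignores concerns such as thread safety and cache warming.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical LRU cache for illustration; not a Solr class.
class LruCacheSketch<K, V> extends LinkedHashMap<K, V> {
    private final int maxEntries;

    LruCacheSketch(int maxEntries) {
        // accessOrder = true: iteration order runs from least to most recently used
        super(16, 0.75f, true);
        this.maxEntries = maxEntries;
    }

    // Called by LinkedHashMap after each insertion; returning true evicts the eldest entry.
    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > maxEntries;
    }

    public static void main(String[] args) {
        LruCacheSketch<String, String> cache = new LruCacheSketch<>(2);
        cache.put("doc1", "first document");
        cache.put("doc2", "second document");
        cache.get("doc1");                     // doc1 becomes most recently used
        cache.put("doc3", "third document");   // evicts doc2, the least recently used
        System.out.println(cache.keySet());
    }
}
```

Keeping eviction policy inside one small class is what makes the idea "pluggable": a different policy could be swapped in without touching the code that reads from or writes to the cache.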
  • 11. SOLR ARCHITECTURE
    [Architecture diagram; recoverable labels only]
    Request handlers: /select, /spell, /admin, Data Import Handler (SQL/RSS), Extracting Request Handler (PDF/Word, via Apache Tika), Analysis Request Handler
    Update handlers: XML, CSV, binary
    Update processors: Signature, Logging, Indexing
    Response writers: XML, binary, JSON
    Search components: Query, Spelling, Faceting, Highlighting, Debug, Statistics, More Like This, Distributed Search, Clustering
    Core: Caching, Faceting, Filtering, Search, Query Parsing, Highlighting, Index Replication, Text Analysis
    Configuration: SolrConfig and Schema. The schema example shows a fieldType "text1" with the filters whitespace, customFilter, synonyms (file=..) and porter (except=..), a field "title" of type "text1", and a field "cust1" (class=, cut off in the source)
    Apache Lucene layer: IndexReader/Searcher (search), IndexWriter (indexing)
    Example request: http://.../select?q=cheese&wt=xml
  • 12. APACHE SOLR – PRE-REQUISITES & SET UP
    Solr Component Set up Container
    - Contrib: modules that extend Solr
      - Analysis-extras: text analysis components for multilingual support (e.g. Chinese)
      - Clustering: engine for clustering search results
      - DataImportHandler (DIH): contrib module that imports data into Solr from external data sources
      - Extraction: integration with Apache Tika, a framework for extracting text from common file formats (also used by DIH's TikaEntityProcessor)
      - UIMA: integration with Apache UIMA, a framework for extracting metadata from text, such as identifying proper names and the language
      - Velocity: a simple search UI framework based on the Velocity templating language
    - Dist: the Solr distributable WAR and contrib jar files
  • 13. APACHE SOLR – PRE-REQUISITES & SET UP
    Solr Component Set up Container
    - Example: a complete Solr server with the Jetty servlet engine, serving as a demo
      - example/etc contains Jetty's server configuration
      - example/exampledocs contains sample documents to be indexed into the default Solr configuration, along with post.jar for sending documents to Solr
      - example/solr is the default sample Solr configuration
      - example/webapps is where Jetty expects to deploy Solr from
  • 14. QUERY EXAMPLES
    DisMax - http://solr/select?qt=dismax&start=0&rows=2
        &q=super man                // user query
        &qf=title^3 subject^2 body  // fields to query
        &pf=title^2,body            // fields to do phrase queries
        &ps=100                     // slop for those phrase queries
        &tie=.1                     // multi-field match reward
        &mm=2                       // number of terms that should match
        &bf=popularity              // boost function
    Facet - http://solr/select?q=foo&wt=json&indent=on&facet=true&facet.field=cat&facet.query=price:[0 TO 100]&facet.query=manu:IBM
    Filter - &q=memory&fq=inStock:true&facet=true&…
    Highlighting - http://solr/select?q=lcd&wt=json&indent=on&hl=true&hl.fl=features
    - Date Range - releaseDate:[2000 TO 2007]
    - Wildcard - sup?r, su*r, super*
    - Fuzzy - Levenshtein distance with optional minimum similarity: spider~0.7
    - Boolean - (Superman AND "Lex Luthor") OR (+Batman +Joker)
    - Balanced quotes for phrase queries; '+' for required, '-' for prohibited
    - Functional - log(sum(popularity,1))
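The query strings above contain spaces, carets and brackets, which need URL encoding before they are sent over HTTP. The following sketch assembles such a request URL using only the Java standard library. The class `SolrUrlSketch` and the method `buildUrl` are hypothetical helpers for this example, and the base URL `http://solr/select` is simply the placeholder host used on the slide.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper for illustration; not part of Solr or SolrJ.
class SolrUrlSketch {
    // Builds base?k1=v1&k2=v2... from alternating keys and values.
    // Varargs (instead of a Map) allow repeated keys such as facet.query.
    static String buildUrl(String base, String... keyValues) {
        if (keyValues.length % 2 != 0) {
            throw new IllegalArgumentException("expected alternating keys and values");
        }
        StringBuilder url = new StringBuilder(base);
        try {
            for (int i = 0; i < keyValues.length; i += 2) {
                url.append(i == 0 ? '?' : '&')
                   .append(URLEncoder.encode(keyValues[i], "UTF-8"))
                   .append('=')
                   .append(URLEncoder.encode(keyValues[i + 1], "UTF-8"));
            }
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on the JVM, so this should not happen.
            throw new IllegalStateException(e);
        }
        return url.toString();
    }

    public static void main(String[] args) {
        // DisMax request with the parameters from the slide.
        System.out.println(buildUrl("http://solr/select",
                "qt", "dismax", "start", "0", "rows", "2",
                "q", "super man",
                "qf", "title^3 subject^2 body",
                "mm", "2"));
        // Facet request with a repeated facet.query parameter.
        System.out.println(buildUrl("http://solr/select",
                "q", "foo", "facet", "true", "facet.field", "cat",
                "facet.query", "price:[0 TO 100]",
                "facet.query", "manu:IBM"));
    }
}
```

`URLEncoder` applies form encoding, so a space becomes `+` and `^` becomes `%5E`. In a real application a client library would normally handle this, but the sketch shows what ends up on the wire.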
  • 15. REFERENCES
    - http://lucene.apache.org/core/features.html
    - http://lucene.apache.org/solr/4_1_0/tutorial.html
    - http://en.wikipedia.org/wiki/Index_(search_engine)
    - http://wiki.apache.org/solr/SolrQuerySyntax
    - http://wiki.apache.org/solr/SolrFacetingOverview
    - http://horicky.blogspot.in/2013/02/text-processing-part-2-inverted-index.html

Editor's notes

  1. Apache Nutch: provides web crawling and HTML parsing
     Apache Solr: an enterprise search server
     ElasticSearch: an enterprise search server
     Compass: a Java search engine framework
     DocFetcher: a multiplatform desktop search application
  2. Accepts several types of queries:
     - Term query (e.g., buffer edit)
     - Phrase query (e.g., "buffer edit")
     - Boolean query (e.g., buffer AND edit OR modify)
     - Wildcard query (e.g., te?t, test*, te*t)
     - Range query (e.g., date:[20020101 TO 20030101])
     - Fuzzy query: uses the Levenshtein distance between strings (e.g., roam~ searches for terms similar to roam, like "roam", "foam")
     - Proximity query: finds terms within a specific distance of each other (e.g., "jakarta apache"~10 searches for "apache" and "jakarta" within 10 terms of each other in a document)
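The fuzzy query above relies on the Levenshtein distance, the minimum number of single-character insertions, deletions and substitutions needed to turn one string into another. A minimal sketch of the classic dynamic-programming computation follows. The class and method names are hypothetical, and Lucene's own fuzzy matching is implemented differently (it does not compare the query against every term this way).

```java
// Illustrative textbook algorithm; not how Lucene evaluates fuzzy queries internally.
class EditDistanceSketch {
    // Returns the Levenshtein distance between a and b using two rolling rows.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        // Distance from the empty prefix of a to each prefix of b: j insertions.
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // i deletions to reach the empty prefix of b
            for (int j = 1; j <= b.length(); j++) {
                int substitution = prev[j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1);
                int insertion = curr[j - 1] + 1;
                int deletion = prev[j] + 1;
                curr[j] = Math.min(substitution, Math.min(insertion, deletion));
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        System.out.println(levenshtein("roam", "foam")); // one substitution: r -> f
        System.out.println(levenshtein("roam", "roam")); // identical strings
    }
}
```

With this measure, "foam" is one edit away from "roam", which is why it appears among the terms matched by roam~ in the note.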
  3. RequestHandlers: handle a request at a URL like /select
     SearchComponents: part of a SearchHandler, a componentized request handler
       - Include Query, Facet, Highlight, Debug, Stats
       - Distributed-search capable
     UpdateHandlers: handle an indexing request
     Update Processor Chains: per-handler componentized chains that handle updates
     Query Parser plugins
       - Mix and match query types in a single request
       - Function plugins for Function Query
     Text Analysis plugins: Analyzers, Tokenizers, TokenFilters
     ResponseWriters: serialize and stream the response to the client
     Each request handler can be mapped to a different URL. SearchHandler is a componentized RequestHandler that allows search components to be chained together and also provides the framework for distributed search operations. Each SearchHandler can have its own custom set of search components, along with default or invariant parameters. All of the configuration is declarative, including adding new request handlers or search components. The QueryResponse object is very generic and can return any type of data.
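The chaining of search components and the role of default and invariant parameters can be sketched in a few lines of plain Java. Everything here (`HandlerChainSketch`, `Component`, `Handler`) is a hypothetical stand-in, not Solr's actual classes or signatures. The parameter precedence assumed is that defaults apply only when the request omits a value, while invariants always override the request.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical types for illustration; these are not Solr's interfaces.
class HandlerChainSketch {
    // One step in the chain: reads parameters, contributes to the response.
    interface Component {
        void process(Map<String, String> params, Map<String, Object> response);
    }

    static final class Handler {
        private final List<Component> components = new ArrayList<>();
        private final Map<String, String> defaults = new LinkedHashMap<>();
        private final Map<String, String> invariants = new LinkedHashMap<>();

        Handler add(Component c) { components.add(c); return this; }
        Handler withDefault(String key, String value) { defaults.put(key, value); return this; }
        Handler withInvariant(String key, String value) { invariants.put(key, value); return this; }

        Map<String, Object> handle(Map<String, String> request) {
            // Precedence: defaults < request parameters < invariants.
            Map<String, String> params = new LinkedHashMap<>(defaults);
            params.putAll(request);
            params.putAll(invariants);
            Map<String, Object> response = new LinkedHashMap<>();
            // Each component runs in order and adds its own section to the response.
            for (Component c : components) {
                c.process(params, response);
            }
            return response;
        }
    }

    public static void main(String[] args) {
        Handler select = new Handler()
                .withDefault("rows", "10")
                .withInvariant("wt", "json")
                .add((p, r) -> r.put("query", p.get("q") + " (rows=" + p.get("rows") + ")"))
                .add((p, r) -> {
                    if ("true".equals(p.get("facet"))) {
                        r.put("facets", "counts for " + p.get("facet.field"));
                    }
                });
        Map<String, String> request = new LinkedHashMap<>();
        request.put("q", "cheese");
        request.put("facet", "true");
        request.put("facet.field", "cat");
        System.out.println(select.handle(request));
    }
}
```

Because each component only reads parameters and writes its own part of a generic response map, components can be added, removed or reordered per handler, which is the property the note attributes to SearchHandler.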