Search Engine Capabilities - Apache Solr(Lucene)

SEARCH ENGINE CAPABILITIES WITH APACHE SOLR/LUCENE

AGENDA
 About - Search Engine & its capabilities
 Apache Solr/Lucene - Introduction
 Exploring Lucene
   • Features & Capabilities
   • Library Component Stack
   • Architecture Framework
   • Work-out
 Exploring Solr
   • Features & Capabilities
   • Architecture Overview
 Work-out search capabilities with Solr
   • About Pre-requisites & Set up
   • Work-out search capabilities
 What Next – Scope & Future

ABOUT - SEARCH ENGINE & CAPABILITIES
A search engine is a tool that processes the input provided by the end user and
locates matching information, documents, or web pages in an index by applying a
defined set of algorithms (crawling/spidering, indexing, ranking, querying, etc.).
A search engine's capabilities vary with demand, context, content, and model. At
the top level, search engines can be categorized as:
Multi-site web page/full-text search
Single-site/document full-text search
Beyond that, search engines can be further categorized as:
Crawler-based search engines
Social search engines
Directory search engines
Hybrid search engines
Specialty search engines
Paid/promotional inclusion search engines
Pay-per-click (sponsored results)
Open source search engines

APACHE SOLR/LUCENE - INTRODUCTION
Apache Lucene is a Java-based, high-performance, full-featured text search engine
library.
 Written by Doug Cutting in 1999 and released under the Apache Software License
 Document-oriented model architecture
 Widely recognized for its full-text indexing and searching capability
 Fast indexing (up to 150 GB/hr) with low memory requirements (about 1 MB of heap)
 Flexible API, independent of file format (e.g. PDF, HTML, Word, OpenDocument)
 Can be used for text/document search across local documents and the web
 Basis for other projects, e.g. Nutch, Solr, Elasticsearch, Compass, DocFetcher

APACHE SOLR/LUCENE - INTRODUCTION
Apache Solr is an enterprise-grade, high-performance, Java-based search server
platform built on Lucene. It provides distributed indexing, replication,
load-balanced querying, and automated failover/recovery with centralized
configuration.
 Created by Yonik Seeley in 2004 at CNET Networks and donated to Apache in 2006
 Runs within a servlet container such as Tomcat or Jetty (the default)
 Multi-core architecture: multiple cores can run in the same webapp
 Well recognized for distributed search capabilities, e.g. cluster search
 Open source and extensible via independent plug-ins, e.g. Carrot2
 SolrCloud support for cloud-based applications (2012 edition)

EXPLORING LUCENE – FEATURES & CAPABILITIES
Lucene works on five key fundamentals:
 Document
 Field
 Analyzer (tokenizers/filters, stop words, synonyms, multilingual support, ...)
 Indexing (inverted index, encoding, segmentation, data compression, commit strategy)
 Querying/Searching (Lucene query model, evaluation, scoring, similarity, extensions)
Building on these, Lucene provides:
 High-performance indexing (incremental or batch; index size roughly 20-30% of the size of the text indexed)
 Powerful, complex query processing, e.g. phrase, wildcard, proximity, range, facet, and fuzzy queries
 Fielded searching and sorting, e.g. by title, author, contents
 Ranked searching (best results returned first)
 Multiple-index searching with merged results
 Simultaneous index update and searching
 Flexible faceting, highlighting, joins, and result grouping
 Pluggable ranking models, including the Vector Space Model
 Configurable storage engine (codecs)
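To make the analyzer, indexing, and searching fundamentals more concrete, here is a minimal sketch of an inverted index in plain Java. It is a toy illustration and not how Lucene is implemented: the class name TinyInvertedIndex, the stop-word list, and the AND-only search are all assumptions made for this example, and there is no stemming, scoring, or segment handling.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Toy inverted index: illustrative only, not Lucene's implementation. */
public class TinyInvertedIndex {
    // Hypothetical stop-word list chosen for this example.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "the", "is", "to", "be"));

    // term -> sorted ids of the documents that contain the term
    private final Map<String, TreeSet<Integer>> postings = new HashMap<>();
    private int nextDocId = 0;

    /** Analyzer step: lowercase, split on non-alphanumerics, drop stop words. */
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!raw.isEmpty() && !STOP_WORDS.contains(raw)) {
                tokens.add(raw);
            }
        }
        return tokens;
    }

    /** Indexing step: record the new document id under each of its terms. */
    public int addDocument(String text) {
        int docId = nextDocId++;
        for (String term : analyze(text)) {
            postings.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
        }
        return docId;
    }

    /** Searching step: ids of documents that contain every query term (AND). */
    public Set<Integer> search(String query) {
        TreeSet<Integer> result = null;
        for (String term : analyze(query)) {
            TreeSet<Integer> docs = postings.getOrDefault(term, new TreeSet<>());
            if (result == null) {
                result = new TreeSet<>(docs);
            } else {
                result.retainAll(docs);
            }
        }
        return result == null ? new TreeSet<>() : result;
    }

    public static void main(String[] args) {
        TinyInvertedIndex index = new TinyInvertedIndex();
        index.addDocument("This is the text to be indexed.");   // doc 0
        index.addDocument("Lucene builds an inverted index.");  // doc 1
        index.addDocument("Solr is a search server.");          // doc 2
        System.out.println(index.search("index"));         // prints [1]
        System.out.println(index.search("INDEXED text"));  // prints [0]
    }
}
```

Because the query goes through the same analyze step as the documents, "INDEXED" matches "indexed". Without stemming, "index" does not match "indexed", which is the kind of gap that analyzer filters are meant to close.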

LUCENE – LIBRARY COMPONENT STACK
[Diagram: Lucene library component stack]
 Lucene Core: Analysis, Document, Index, Codec, Per-Field Codec, Store, Search, Span, Payload, Similarity, Util, Automaton, Finite State Transducer, Packed Int/Array, Compress
 Analyzers: Common, ICU, Phonetic, UIMA, Stempel, Smart CN, Kuromoji, Morfologik
 Other modules: Codec, Query Parser, Query, Highlighter, Suggest, Facet, Grouping, Join, Memory, Spatial, Misc, Sandbox, Benchmark, Test Framework

Exploring Lucene - Architecture Framework
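[Diagram: Lucene architecture]
 Indexing path: documents (<doc> with <field> elements) are added/updated through the IndexWriter; the text analysis chain produces a per-field token stream; the doc writer and index chain flush/commit segments to the Directory through the Codec.
 Search path: the IndexReader opens/reopens the index, with one segment reader per segment (exposing collection statistics, TermsEnum, and DocsAndPositionsEnum); the query parser builds a query, which runs through the scoring and collection APIs to return ranked results and retrieve stored values.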

Exploring Lucene – Work out
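Setting your CLASSPATH
Download and extract the Lucene distribution, then add the Lucene JARs (core, queryparser, analyzers-common) to your Java CLASSPATH.
Indexing Files
// Lucene 4.x API. JUnit is needed only for assertEquals in the search step.
// The statements below belong inside a method declared with "throws Exception".
import static org.junit.Assert.assertEquals;
import java.io.File;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_CURRENT);
// Store the index on disk; in Lucene 4.x FSDirectory.open takes a java.io.File.
Directory directory = FSDirectory.open(new File("/tmp/testindex"));
IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_CURRENT, analyzer);
IndexWriter iwriter = new IndexWriter(directory, config);
Document doc = new Document();
String text = "This is the text to be indexed.";
// TYPE_STORED keeps the original text so it can be returned with search hits.
doc.add(new Field("fieldname", text, TextField.TYPE_STORED));
iwriter.addDocument(doc);
iwriter.close();
Searching Files
// Open the index written above and search it.
DirectoryReader ireader = DirectoryReader.open(directory);
IndexSearcher isearcher = new IndexSearcher(ireader);
// Parse a simple query that looks for "text" in the field "fieldname".
QueryParser parser = new QueryParser(Version.LUCENE_CURRENT, "fieldname", analyzer);
Query query = parser.parse("text");
// No filter (null); return at most 1000 hits.
ScoreDoc[] hits = isearcher.search(query, null, 1000).scoreDocs;
// Load each hit and check its stored field value.
for (int i = 0; i < hits.length; i++) {
    Document hitDoc = isearcher.doc(hits[i].doc);
    assertEquals("This is the text to be indexed.", hitDoc.get("fieldname"));
}
ireader.close();
directory.close();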

EXPLORING SOLR – FEATURES & CAPABILITIES
Solr features (in addition to Lucene)
 Caching
   • Document cache instances
   • User-level caching
   • Pluggable cache implementations
 SolrCloud
   • Automated distributed indexing/sharding
   • Real-time indexing
   • Transaction log
   • Query failover and recovery
 Additional
   • Ajax-based admin interface with a bundle of functionality: cache and logging management, monitoring, text analysis debugging, schema browser, web query output, SolrCloud dashboard, etc.
   • Rich document parsing and indexing (PDF, Word, HTML, etc.) using Apache Tika
   • Apache UIMA integration for configurable metadata extraction
 Solr Core
   • Multi-core analysis and indices
   • Dynamically create/delete document collections
   • Pluggable query handlers
   • Extensible XML data format
   • Component-based request handlers
   • Distributed search support
   • Uniqueness/duplicate document detection
   • Custom index processing chains
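As an illustration of the caching idea listed above, the following is a minimal least-recently-used (LRU) cache sketch in plain Java. LRU is one common eviction policy for search caches; this is not Solr's own cache code, and the class name TinyLruCache and the cached keys and values are made up for the example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Toy LRU cache: illustrative only, not Solr's cache implementation. */
public class TinyLruCache<K, V> extends LinkedHashMap<K, V> {
    private static final long serialVersionUID = 1L;
    private final int maxEntries;

    public TinyLruCache(int maxEntries) {
        // accessOrder = true: entries are ordered from least to most recently used.
        super(16, 0.75f, true);
        this.maxEntries = maxEntries;
    }

    /** Called by LinkedHashMap after each put; evicts the eldest entry when full. */
    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > maxEntries;
    }

    public static void main(String[] args) {
        TinyLruCache<String, String> cache = new TinyLruCache<>(2);
        cache.put("q=lcd", "results for lcd");
        cache.put("q=memory", "results for memory");
        cache.get("q=lcd");                           // touch: now most recently used
        cache.put("q=cheese", "results for cheese");  // evicts "q=memory"
        System.out.println(cache.keySet());           // prints [q=lcd, q=cheese]
    }
}
```

A bounded cache like this trades memory for repeated-query speed; the size limit and the eviction policy are the knobs a pluggable cache implementation would typically expose.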

SOLR ARCHITECTURE
[Diagram: Solr architecture, layered on Apache Lucene]
 Request handlers: /select, /spell, /admin; example request: http://.../select?q=cheese&wt=xml
 Response writers: XML, JSON, binary
 Update handlers: XML, CSV, binary; Data Import Handler (SQL/RSS); Extracting Request Handler (PDF/Word) via Apache Tika
 Search components: Query, Spelling, Faceting, Highlighting, Statistics, Debug, More Like This, Clustering, Distributed Search
 Update processors: Signature, Logging, Indexing
 Configuration: SolrConfig and Schema (the schema example shows a field type "text1" with whitespace, custom, synonym, and Porter filters, a field "title" of type "text1", and a custom field "cust1")
 Core: Query Parsing, Caching, Faceting, Filtering, Highlighting, Core Search (IndexReader/Searcher), Indexing (IndexWriter), Text Analysis, Index Replication

APACHE SOLR – PRE-REQUISITES & SET UP
Solr components and container setup
 contrib: modules with extensions to Solr
   • analysis-extras: text analysis components for multilingual support, e.g. Chinese
   • clustering: engine for clustering search results
   • dataimporthandler (DIH): imports data into Solr from a variety of sources
   • extraction: integration with Apache Tika, a framework for extracting text from common file formats (also used by DIH's TikaEntityProcessor)
   • uima: integration with Apache UIMA, a framework for extracting metadata from text, e.g. identifying proper names and the language
   • velocity: simple search UI framework based on the Velocity templating language
 dist: Solr distributable WAR and contrib JAR files

 example: a complete Solr server with the Jetty servlet engine, serving as a demo
• example/etc contains Jetty's server configuration
• example/exampledocs contains sample documents to be indexed into the default Solr
configuration, along with post.jar for sending documents to Solr
• example/solr is the default sample Solr configuration
• example/webapps is where Jetty expects to deploy Solr from
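The sample documents sent to Solr are XML "add" messages made of <doc> and <field> elements. As a rough sketch of that shape, the plain-Java helper below builds such a message from name/value pairs. The class name SolrAddXml and the field names and values are hypothetical, and a real document must use fields defined in the Solr schema.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Builds a Solr XML "add" message from field name/value pairs (sketch). */
public class SolrAddXml {
    /** Escapes the characters that are special in XML text and attribute values. */
    static String escape(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;");
    }

    /** Wraps the fields of one document in <add><doc>...</doc></add>. */
    static String toAddXml(Map<String, String> fields) {
        StringBuilder xml = new StringBuilder("<add><doc>");
        for (Map.Entry<String, String> f : fields.entrySet()) {
            xml.append("<field name=\"").append(escape(f.getKey())).append("\">")
               .append(escape(f.getValue()))
               .append("</field>");
        }
        return xml.append("</doc></add>").toString();
    }

    public static void main(String[] args) {
        // Hypothetical field names and values, for illustration only.
        Map<String, String> doc = new LinkedHashMap<>();
        doc.put("id", "book-1");
        doc.put("title", "Search & Find");
        System.out.println(toAddXml(doc));
        // prints <add><doc><field name="id">book-1</field><field name="title">Search &amp; Find</field></doc></add>
    }
}
```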

QUERY EXAMPLES
DisMax - http://solr/select?qt=dismax&start=0&rows=2
&q=super man // user query
&qf=title^3 subject^2 body // fields to query
&pf=title^2,body // fields to do phrase queries
&ps=100 // slop for those phrase queries
&tie=.1 // multi-field match reward
&mm=2 // number of terms that should match
&bf=popularity // boost function
Facet - http://solr/select?q=foo&wt=json&indent=on&facet=true&facet.field=cat
&facet.query=price:[0 TO 100]&facet.query=manu:IBM
Filter - &q=memory&fq=inStock:true&facet=true&…
Highlighting - http://solr/select?q=lcd&wt=json&indent=on&hl=true&hl.fl=features
 Date Range - releaseDate:[2000 TO 2007]
 Wildcard - sup?r, su*r, super*
 Fuzzy - based on Levenshtein distance, with an optional minimum similarity: spider~0.7
 Boolean - (Superman AND "Lex Luthor") OR (+Batman +Joker)
 Phrase - balanced double quotes delimit a phrase query; '+' marks a term as required, '-' as prohibited
 Functional - log(sum(popularity,1))
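The query examples above are written unencoded for readability. When a client sends them over HTTP, characters such as spaces, ':', '[' and '^' need URL-encoding. The sketch below shows one way to assemble such a request URL in plain Java; the class name SolrQueryUrl and the build helper are hypothetical, and the base URL http://solr is simply the placeholder host used in the examples.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

/** Assembles a URL-encoded Solr /select request from name/value pairs (sketch). */
public class SolrQueryUrl {
    /**
     * Parameters are passed as alternating names and values so that a name
     * such as facet.query can be repeated.
     */
    static String build(String baseUrl, String... keyValuePairs) {
        if (keyValuePairs.length % 2 != 0) {
            throw new IllegalArgumentException("parameters must come in name/value pairs");
        }
        StringBuilder url = new StringBuilder(baseUrl).append("/select");
        try {
            for (int i = 0; i < keyValuePairs.length; i += 2) {
                url.append(i == 0 ? '?' : '&')
                   .append(URLEncoder.encode(keyValuePairs[i], "UTF-8"))
                   .append('=')
                   .append(URLEncoder.encode(keyValuePairs[i + 1], "UTF-8"));
            }
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is required to be available on every Java platform.
            throw new IllegalStateException(e);
        }
        return url.toString();
    }

    public static void main(String[] args) {
        // The facet example from the slide, with its two facet.query parameters.
        String url = build("http://solr",
                "q", "foo",
                "wt", "json",
                "facet", "true",
                "facet.field", "cat",
                "facet.query", "price:[0 TO 100]",
                "facet.query", "manu:IBM");
        System.out.println(url);
    }
}
```

Passing parameters as alternating names and values, rather than as a map, is a deliberate choice here: Solr requests often repeat a parameter name, and a map would silently keep only one value.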

REFERENCES
 http://lucene.apache.org/core/features.html
 http://lucene.apache.org/solr/4_1_0/tutorial.html
 http://en.wikipedia.org/wiki/Index_(search_engine)
 http://wiki.apache.org/solr/SolrQuerySyntax
 http://wiki.apache.org/solr/SolrFacetingOverview
 http://horicky.blogspot.in/2013/02/text-processing-part-2-inverted-index.html
