2. AGENDA
About - Search Engine & its capabilities
Apache Solr/Lucene - Introduction
Exploring Lucene
Features & Capabilities
Library Component Stack
Architecture Framework
Work-out
Exploring Solr
Features & Capabilities
Architecture Overview
Work-out search capabilities with Solr
About Pre-requisites & Set up
Work-out search capabilities
What Next – Scope & Future
3. ABOUT - SEARCH ENGINE & CAPABILITIES
A search engine is a tool that processes the input provided by the end user and
locates matching information, documents, or web pages in an index by applying a
defined set of algorithms (crawling/spidering, indexing, ranking, querying, etc.).
A search engine's capabilities vary with demand, context, content, and model. In
basic terms, the top-level categorization is –
Full-text search across multiple web sites/pages
Full-text search within a single site/document
Beyond that, search engines can be categorized as –
Crawler-Based Search Engines
Social search engines
Directories Search Engines
Hybrid Search Engines
Specialty Search Engines
Paid/Promotional Inclusion search engines
Pay Per Click (sponsored results)
Open source search engines
4. APACHE SOLR/LUCENE - INTRODUCTION
Apache Lucene is a high-performance, full-featured text search engine library
written in Java.
Developed by Doug Cutting in 1999 and released under the Apache Software License
Document-oriented model architecture
Widely recognized for its full-text indexing and searching capability
Fast indexing (up to 150 GB/hr) with low memory requirements (only 1 MB heap)
Flexible API, independent of file format (e.g. PDF, HTML, Word, OpenDocument)
Can be used for text/document searching across local documents and the web
Extended by projects such as Nutch, Solr, Elasticsearch, Compass, and DocFetcher
5. APACHE SOLR/LUCENE - INTRODUCTION
Apache Solr is a high-performance enterprise search server platform written in Java
on top of Lucene. It provides distributed indexing, replication, load-balanced
querying, and automated failover/recovery with centralized configuration.
Developed by Yonik Seeley in 2004 at CNET Networks and donated to Apache in 2006
Runs within a servlet container such as Tomcat or Jetty (the default)
Multi-core architecture: multiple cores can run in the same webapp
Well recognized for distributed search capabilities, e.g. cluster search
Open source and extensible via independent plug-ins, e.g. Carrot2
SolrCloud support for cloud-based applications (2012 edition)
6. EXPLORING LUCENE – FEATURES & CAPABILITIES
Lucene works on five key fundamentals –
Document
Field
Analyzer (tokens/filters, stop words, synonyms, multilingual support…)
Indexing (inverted index, encoding, segmentation, data compression, commit strategy)
Querying/Searching (Lucene query model, evaluation, scoring, similarity, extensions)
As a result of the above, Lucene provides –
High-performance indexing (incremental or batch; index size is roughly 20-30% of the size of the text indexed)
Powerful/complex query processing, e.g. phrase, wildcard, proximity, range, facet, and fuzzy queries
Fielded searching and sorting e.g. title, author, contents
Ranked searching ( best results returned first)
Multiple-index searching with merged results
Allows simultaneous index update and searching
Flexible faceting, highlighting, joins and result grouping
Pluggable ranking models including Vector Space Model
Configurable storage engine (codecs)
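To make the Analyzer and inverted-index fundamentals above concrete, here is a minimal sketch using only the Java standard library. It is not Lucene's implementation: the class name, the stop-word list, and the AND-only search are assumptions chosen for illustration, and there are no segments, scoring, or compression.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

/** Toy inverted index; illustrative only, not how Lucene stores its index. */
final class InvertedIndexSketch {
    // Hypothetical stop-word list; real analyzers ship configurable lists.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "the", "is", "to", "be"));

    // term -> sorted ids of the documents that contain the term
    private final Map<String, TreeSet<Integer>> postings = new TreeMap<>();
    private int nextDocId = 0;

    /** "Analyzer": lower-case, split on non-alphanumerics, drop stop words. */
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!raw.isEmpty() && !STOP_WORDS.contains(raw)) {
                tokens.add(raw);
            }
        }
        return tokens;
    }

    /** Indexes one text value and returns the id assigned to the document. */
    int addDocument(String text) {
        int docId = nextDocId++;
        for (String term : analyze(text)) {
            postings.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
        }
        return docId;
    }

    /** Returns ids of documents containing every query term (AND semantics). */
    Set<Integer> search(String query) {
        Set<Integer> result = null;
        for (String term : analyze(query)) {
            Set<Integer> docs = postings.getOrDefault(term, new TreeSet<>());
            if (result == null) {
                result = new TreeSet<>(docs);
            } else {
                result.retainAll(docs);
            }
        }
        return result == null ? new TreeSet<>() : result;
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.addDocument("This is the text to be indexed.");
        index.addDocument("Lucene is a text search library.");
        System.out.println(index.search("text"));            // [0, 1]
        System.out.println(index.search("search library"));  // [1]
    }
}
```

The same analysis step runs at index time and at query time, which is why a query for "Text" or "text" finds the same documents.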
8. Exploring Lucene - Architecture Framework
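[Figure: Lucene architecture. Indexing side: documents (<doc> made of <field> elements) are added/updated through the text analysis chain (per-field token stream) and the IndexWriter (doc writer, index chain), then flushed/committed as segments through the Codec into Directories. Search side: the query parser builds a query; the IndexReader (opened/reopened over segment readers, exposing collection statistics, TermsEnum, and DocsAndPositionsEnum) feeds the scoring and collection APIs, which return ranked results and retrieve stored values.]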
9. Exploring Lucene – Work out
Setting your CLASSPATH
Download and extract the Lucene distribution, then add the required jars (Lucene core, queryparser, analyzers-common) to your Java
CLASSPATH
Indexing Files
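// Lucene 4.x API, as used on this slide
Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_CURRENT);
// In Lucene 4.x, FSDirectory.open takes a java.io.File, not a String
Directory directory = FSDirectory.open(new File("/tmp/testindex"));
IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_CURRENT, analyzer);
IndexWriter iwriter = new IndexWriter(directory, config);
// A document is a set of fields; this one has a single stored, analyzed text field
Document doc = new Document();
String text = "This is the text to be indexed.";
doc.add(new Field("fieldname", text, TextField.TYPE_STORED));
iwriter.addDocument(doc);
// Closing the writer commits the pending changes
iwriter.close();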
Searching Files
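// Open a reader on the same directory and search it (Lucene 4.x API)
DirectoryReader ireader = DirectoryReader.open(directory);
IndexSearcher isearcher = new IndexSearcher(ireader);
// Parse the query against "fieldname" with the same analyzer used at index time
QueryParser parser = new QueryParser(Version.LUCENE_CURRENT, "fieldname", analyzer);
Query query = parser.parse("text");
// No filter (null); return at most the top 1000 hits
ScoreDoc[] hits = isearcher.search(query, null, 1000).scoreDocs;
for (int i = 0; i < hits.length; i++) {
    Document hitDoc = isearcher.doc(hits[i].doc);
    // assertEquals comes from JUnit; the stored field value is returned unchanged
    assertEquals("This is the text to be indexed.", hitDoc.get("fieldname"));
}
ireader.close();
directory.close();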
10. EXPLORING SOLR – FEATURES & CAPABILITIES
Solr features (in addition to Lucene)
Caching
Document cache instances
User level caching
Pluggable Cache implementations
SolrCloud
Automated distributed indexing/sharding
Real time indexing
Transaction log
Query fail over and recovery
Additional
Ajax-based admin interface bundling:
cache and logging management
monitoring
text analysis debugging
schema browser
web query output
SolrCloud dashboard, etc.
Rich document parsing and indexing (PDF, Word, HTML, etc.) using Apache Tika
Apache UIMA integration for configurable metadata extraction
Solr Core
Multi-Core Analysis and Indices
Dynamically create/ delete document collections
Pluggable query handlers
Extensible XML data format
Component based request handler
Distributed search support
Uniqueness/duplicate document Detection
Custom index processing chains
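Duplicate document detection, listed above, generally works by computing a signature (hash) over selected field values and comparing it with signatures already seen. The sketch below illustrates that idea with the Java standard library only. The class and method names are hypothetical, the choice of MD5 is an assumption for the example, and none of this is Solr's own code or configuration.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashSet;
import java.util.Set;

/** Illustrative duplicate detection via content signatures; not Solr's implementation. */
final class DuplicateDetectorSketch {
    private final Set<String> seenSignatures = new HashSet<>();

    /** Hex-encoded MD5 over the given field values, in order. */
    static String signature(String... fieldValues) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            for (String value : fieldValues) {
                md5.update(value.getBytes(StandardCharsets.UTF_8));
                md5.update((byte) 0); // separator so ("ab","c") differs from ("a","bc")
            }
            StringBuilder hex = new StringBuilder();
            for (byte b : md5.digest()) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 must be available on every Java platform implementation
            throw new IllegalStateException(e);
        }
    }

    /** Returns true if the document is new, false if an identical one was already added. */
    boolean addIfUnique(String... fieldValues) {
        return seenSignatures.add(signature(fieldValues));
    }
}
```

An exact hash like this only catches identical field values; catching near-duplicates would need a fuzzier signature, which is outside this sketch.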
11. SOLR ARCHITECTURE
[Figure: Solr architecture, layered on Apache Lucene. Request handlers (/select, /spell, /admin; example request http://.../select?q=cheese&wt=xml), update handlers (XML, CSV, binary, Data Import Handler for SQL/RSS, Extracting Request Handler for PDF/Word via Apache Tika), and response writers (XML, JSON, binary) sit on top of the search components (query, spelling, faceting, highlighting, statistics, debug, more like this, clustering, distributed search) and update processors (signature, logging, indexing). Behaviour is configured by SolrConfig and the schema (field types with analysis filters such as whitespace, synonyms, porter, and custom filters). The core layer provides query parsing, caching, faceting, filtering, highlighting, text analysis, and index replication over Lucene's IndexReader/Searcher and IndexWriter.]
12. APACHE SOLR – PRE-REQUISITES & SET UP
Solr Component Set up Container
Contrib: modules for extensions to Solr
• Analysis-extras: text analysis components for multilingual support, e.g. Chinese
• Clustering: engine for clustering search results
• DataImportHandler (DIH): contrib module that imports data into Solr from external sources
• Extraction: integration with Apache Tika (a framework for extracting text from common file formats, also used by DIH's TikaEntityProcessor)
• UIMA: integration with Apache UIMA (a framework for extracting metadata from text, identifying proper names in text, and identifying the language)
• Velocity: simple search UI framework based on the Velocity templating language
Dist: Solr distributable WAR and contrib jar files
13. APACHE SOLR – PRE-REQUISITES & SET UP
Solr Component Set up Container
Example: a complete Solr server with the Jetty servlet engine, serving as a demo
• example/etc contains Jetty's server configuration
• example/exampledocs contains sample documents to be indexed into the default Solr configuration, along with post.jar for sending documents to Solr
• example/solr is the default sample Solr configuration
• example/webapps is where Jetty expects to deploy Solr from
14. QUERY EXAMPLES
DisMax - http://solr/select?qt=dismax&start=0&rows=2
&q=super man // user query
&qf=title^3 subject^2 body // fields to query, with boosts
&pf=title^2,body // fields to run phrase queries against
&ps=100 // slop for those phrase queries
&tie=.1 // multi-field match reward
&mm=2 // number of terms that should match
&bf=popularity // boost function
Facet - http://solr/select?q=foo&wt=json&indent=on&facet=true&facet.field=cat
&facet.query=price:[0 TO 100]&facet.query=manu:IBM
Filter - &q=memory&fq=inStock:true&facet=true&…
Highlighting - http://solr/select?q=lcd&wt=json&indent=on&hl=true&hl.fl=features
Date Range - releaseDate:[2000 TO 2007]
Wildcard - sup?r, su*r, super*
Fuzzy - based on Levenshtein distance, with an optional minimum similarity: spider~0.7
Boolean - (Superman AND "Lex Luthor") OR (+Batman +Joker)
Use balanced quotes for phrase queries, '+' for required terms, '-' for prohibited terms
Function - log(sum(popularity,1))
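The query examples above are shown unencoded for readability. When they are sent over HTTP, spaces, '^', ':' and brackets in parameter values need URL encoding. The sketch below builds such a URL with the Java standard library. The class name and the varargs key/value convention are assumptions for illustration, and http://solr/select is the placeholder host from the slide, not a real endpoint.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

/** Builds an encoded Solr-style query URL; illustrative helper, not part of Solr or SolrJ. */
final class SolrQueryUrlSketch {
    static String encode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always supported by the Java platform
            throw new IllegalStateException(e);
        }
    }

    /** keyValuePairs is name1, value1, name2, value2, ...; repeated names are allowed. */
    static String buildUrl(String baseUrl, String... keyValuePairs) {
        if (keyValuePairs.length % 2 != 0) {
            throw new IllegalArgumentException("expected name/value pairs");
        }
        StringBuilder url = new StringBuilder(baseUrl);
        char separator = '?';
        for (int i = 0; i < keyValuePairs.length; i += 2) {
            url.append(separator)
               .append(encode(keyValuePairs[i]))
               .append('=')
               .append(encode(keyValuePairs[i + 1]));
            separator = '&';
        }
        return url.toString();
    }

    public static void main(String[] args) {
        System.out.println(buildUrl("http://solr/select",
                "qt", "dismax",
                "q", "super man",
                "qf", "title^3 subject^2 body",
                "mm", "2"));
        // http://solr/select?qt=dismax&q=super+man&qf=title%5E3+subject%5E2+body&mm=2
    }
}
```

Passing pairs rather than a map keeps repeated parameters possible, as in the facet example with two facet.query values.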
Apache Nutch - provides web crawling and HTML parsing
Apache Solr - an enterprise search server
ElasticSearch - an enterprise search server
Compass - a Java search engine framework
DocFetcher - a multiplatform desktop search application
Accepts several types of queries:
– Term query (e.g., buffer edit)
– Phrase query (e.g., "buffer edit")
– Boolean query (e.g., buffer AND edit OR modify)
– Wildcard query (e.g., te?t, test*, te*t)
– Range query (e.g., date:[20020101 TO 20030101])
– Fuzzy query: uses the Levenshtein distance between strings (e.g., roam~ searches for terms similar to roam, like "roam" and "foam")
– Proximity query: finds terms within a specific distance of each other (e.g., "jakarta apache"~10 searches for "apache" and "jakarta" within 10 terms of each other in a document)
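Two of the query types above are easy to illustrate in isolation: fuzzy matching rests on the Levenshtein (edit) distance, and wildcard matching treats ? as one character and * as any run of characters. The sketch below shows both with the Java standard library. It is a textbook formulation under a hypothetical class name; Lucene itself matches against the index with more efficient automaton-based techniques, so this is not how Lucene evaluates these queries.

```java
import java.util.regex.Pattern;

/** Textbook versions of fuzzy and wildcard term matching; illustrative only. */
final class QueryTypeSketch {
    /** Levenshtein distance: minimum number of single-character inserts, deletes, or substitutions. */
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // distance from the empty prefix of a
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** Translates ? and * into a regular expression and matches the whole term. */
    static boolean wildcardMatches(String pattern, String term) {
        StringBuilder regex = new StringBuilder();
        for (char c : pattern.toCharArray()) {
            if (c == '?') {
                regex.append('.');
            } else if (c == '*') {
                regex.append(".*");
            } else {
                regex.append(Pattern.quote(String.valueOf(c)));
            }
        }
        return Pattern.matches(regex.toString(), term);
    }
}
```

For example, editDistance("roam", "foam") is 1, which is why roam~ can match "foam", and wildcardMatches("te?t", "test") is true while wildcardMatches("te?t", "tet") is false.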
RequestHandlers – handle a request at a URL like /select
SearchComponents – parts of a SearchHandler, a componentized request handler
Include Query, Facet, Highlight, Debug, Stats
Distributed search capable
UpdateHandlers – handle an indexing request
Update Processor Chains – per-handler componentized chains that handle updates
Query Parser plugins
Mix and match query types in a single request
Function plugins for Function Query
Text Analysis plugins: Analyzers, Tokenizers, TokenFilters
ResponseWriters serialize & stream response to client
Each request handler can be mapped to a different URL
SearchHandler is a componentized RequestHandler that allows search components to be chained together and also provides the framework for distributed search operations.
Each SearchHandler can have its own custom set of search components, along with default or invariant parameters
All of the configuration is declarative – including adding new request handlers or search components.
The QueryResponse object is very generic and can handle returning any type of data
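The chaining described above, where a handler applies its default parameters and then runs each search component in turn against a shared, generic response, can be sketched as follows. All type and method names here are hypothetical and only mirror the idea; they are not Solr's actual classes or signatures, and in Solr the wiring is declared in configuration rather than in code.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** One step in the chain; hypothetical stand-in for a search component. */
interface SearchComponentSketch {
    void process(Map<String, String> params, Map<String, Object> response);
}

/** Hypothetical componentized handler: defaults plus an ordered list of components. */
final class SearchHandlerSketch {
    private final Map<String, String> defaults;
    private final List<SearchComponentSketch> components;

    SearchHandlerSketch(Map<String, String> defaults, SearchComponentSketch... components) {
        this.defaults = defaults;
        this.components = Arrays.asList(components);
    }

    /** Request parameters override the handler's defaults; components run in order. */
    Map<String, Object> handle(Map<String, String> requestParams) {
        Map<String, String> params = new LinkedHashMap<>(defaults);
        params.putAll(requestParams);
        Map<String, Object> response = new LinkedHashMap<>();
        for (SearchComponentSketch component : components) {
            component.process(params, response);
        }
        return response;
    }
}
```

A query step and a facet step could then be passed as two lambdas, each adding its own section to the response map, which is the role the generic response object plays in the notes above.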