Enterprise Search platform building scalable REST services on Apache Lucene

Enterprise Search platform
Building solid scalable enterprise search REST services on top of Apache Lucene

Tommaso Teofili

Agenda

• Apache Lucene overview

• Why do we need Apache Solr?

• Everyman tales from Solr

• Enterprise what?

• One step beyond...

Apache Lucene overview

• Information Retrieval library

• Inverted indexes are quick and efficient

• Vector space model

• Advanced search options (synonyms, stopwords, similarity, nearness)

• Different language implementations (Java, .NET, C, Python)
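To make the inverted index idea concrete, here is a minimal toy sketch in Python. It is not Lucene code and none of the names come from the Lucene API; the stopword list and the sample documents are made up for illustration. It only shows why looking up a term is cheap once each term maps to the documents that contain it.

```python
from collections import defaultdict

STOPWORDS = {"a", "an", "the", "of", "on"}  # tiny illustrative stopword list


def tokenize(text):
    """Lowercase, split on whitespace and drop stopwords."""
    return [t for t in text.lower().split() if t not in STOPWORDS]


def build_inverted_index(docs):
    """Map each term to the set of ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search_all(index, query):
    """Return the ids of documents containing every query term (AND semantics)."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Hypothetical sample documents
docs = {
    1: "The search server on top of Lucene",
    2: "Lucene is an information retrieval library",
    3: "A search library",
}
index = build_inverted_index(docs)
```

A query never scans the documents themselves: it intersects the posting sets of its terms, so `search_all(index, "lucene search")` only touches the entries for "lucene" and "search".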

The Lucene API

• Lucene indexes are built on a Directory

• Directory can be accessed by IndexReaders and IndexWriters

• IndexSearchers are built on top of Directories and IndexReaders

• IndexWriters can write Documents inside the index

• Documents are made of Fields

• Fields have value(s) and options

• Directory > IndexReader/Writer > Document > Field

Indexing Lucene

• A Lucene index has one or more segments and a generation

• Changes to the index must be committed (and can optionally be optimized)

• No fixed schema

• Each field can be STORED, INDEXED and ANALYZED

• Each field can have NORMS and TERM VECTORS

Searching Lucene

• Open an IndexSearcher on top of an IndexReader over a Directory

• Many query types: TermQuery, MultiTermQuery, BooleanQuery,
WildcardQuery, PhraseQuery, PrefixQuery, MultiPhraseQuery, FuzzyQuery,
TermRangeQuery, NumericRangeQuery

• Get results from a TopDocs object

Why do we need Apache Solr?

• Lucene is a library

• Lucene by itself can only be queried programmatically

• Often the search system has to be totally independent from other
systems (e.g. a CMS)

• A ready to deploy search server is what you need

• Need to scale both vertically and horizontally

Apache Solr - Overview

• Ready to use enterprise search server

• REST (and programmatic) API

• Results in XML, JSON, PHP, Ruby, etc...

• Exploits the power of Lucene

• Scaling capabilities (replication, distributed search)

• Easy administration interface

• Easy to extend and customize (plugin architecture)

Apache Solr - Project status

• Latest release: 1.4.1, in June 2010

• Lots of new features on trunk

• Most of the new features are on branch 3.0

• A huge, very active community

• Lucid Imagination powered project

Solr - 5 minutes tutorial

• Download latest release (1.4.1)

• cd $SOLR_HOME/example

• java -jar -server start.jar

• You have an up and running Solr instance you can access via http://localhost:8983/solr
(this runs on top of Jetty)

• cd $SOLR_HOME/example/exampledocs

• Index with the command: sh post.sh *.xml

• Search with your browser

Solr - Query syntax

• Default operator is OR (you can override it by adding &q.op=AND to the HTTP request)

• You can query fields with fieldname:value

• Common + - AND OR NOT modifiers

• Range queries on date or numeric fields, e.g.: timestamp:[* TO NOW]

• Boost terms, e.g.: roma^2 inter

• Fuzzy search, e.g.: roam~0.6

• ...
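Queries such as the ones above contain characters (spaces, brackets, `^`) that must be URL-encoded before they travel in an HTTP request. The following is a minimal sketch using only Python's standard library; the `select_url` helper is hypothetical, and the base URL assumes the example instance from the tutorial with its `/select` handler.

```python
from urllib.parse import urlencode, parse_qs, urlsplit

# Assumption: the example Solr instance from the tutorial
SOLR_SELECT = "http://localhost:8983/solr/select"


def select_url(query, default_operator=None):
    """Build a search URL; optionally override the default OR operator."""
    params = [("q", query)]
    if default_operator is not None:
        params.append(("q.op", default_operator))
    return SOLR_SELECT + "?" + urlencode(params)


range_url = select_url("timestamp:[* TO NOW]")
boost_url = select_url("roma^2 inter", default_operator="AND")

# Decoding the query string gives back the original Solr syntax
decoded = parse_qs(urlsplit(boost_url).query)
```

Letting `urlencode` do the escaping avoids hand-crafting `%5B` and `+` sequences and keeps the Solr query readable in the calling code.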

Solr - Basic configuration steps

• Define fields, types and analysis inside schema.xml

• Play with solrconfig.xml:

    • request handlers (update, search)

    • index parameters

    • caches

    • deletion policy

    • autowarming

    • replication, clustering, etc...

Solr - schema.xml

• Types

• Analyzers to use for each type

• Fields with name, type and options

• Unique key

• Dynamic fields

• Copy fields

• Don’t use the default schema.xml, write it from scratch!

Solr - Type definition

• Analyzers for querying and indexing are defined inside the schema

Solr - solrconfig.xml

• Where Solr will write the index

• Index merge factor

• Control different caches: documents, query results, filters

• Request handlers available to consume (HTTP) requests, typically at least a (standard)
search and an update handler exist

• Update request processor chains to configure indexing behavior

• Event listeners (newSearcher, firstSearcher)

• and much more...

Solr - Indexing

• Index updates are sent as XML commands via HTTP POST

• <add> to insert and update

• <delete> to remove by unique key or query
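As a rough illustration of what these XML commands look like, the sketch below builds them with Python's standard `xml.etree.ElementTree`. Solr's XML update format uses `<add>` with nested `<doc>`/`<field>` elements and `<delete>` with an `<id>` or `<query>` child; the helper names and the `id`/`title` fields are assumptions for the example and must match your own schema.xml. The messages would be POSTed to the update handler and followed by a `<commit/>`.

```python
import xml.etree.ElementTree as ET


def add_message(docs):
    """Build an <add> command from a list of {field name: value} dicts."""
    root = ET.Element("add")
    for doc in docs:
        doc_el = ET.SubElement(root, "doc")
        for name, value in doc.items():
            field = ET.SubElement(doc_el, "field", name=name)
            field.text = str(value)
    return ET.tostring(root, encoding="unicode")


def delete_by_id(doc_id):
    """Build a <delete> command that removes one document by unique key."""
    root = ET.Element("delete")
    ET.SubElement(root, "id").text = str(doc_id)
    return ET.tostring(root, encoding="unicode")


def delete_by_query(query):
    """Build a <delete> command that removes every document matching a query."""
    root = ET.Element("delete")
    ET.SubElement(root, "query").text = query
    return ET.tostring(root, encoding="unicode")


# Hypothetical document; special characters are escaped by ElementTree
add_xml = add_message([{"id": "doc1", "title": "Lucene & Solr"}])
```

Building the XML with a library rather than string concatenation means characters such as `&` and `<` in field values are escaped correctly.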

Solr - Searching

• HTTP GET to the Solr instance with the mandatory q parameter, which specifies the
query

• df - the default field to query

• fl - the list of fields to return (stored fields only)

• sort - fields used for sorting, defaults to score (which is not a field)

• start, rows - paging attributes

• wt - response type, defaults to xml, can be json, php, ruby, etc.
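The parameters listed above can be combined into one request. Here is a small sketch that turns a page number into `start`/`rows` and assembles the URL; the `search_url` helper, the page size and the field names are assumptions for illustration only.

```python
from urllib.parse import urlencode


def search_url(base, q, page=0, page_size=10, df=None, fl=None, sort=None, wt="json"):
    """Build a Solr search URL for the given zero-based result page."""
    params = {
        "q": q,
        "start": page * page_size,  # offset of the first result to return
        "rows": page_size,          # number of results per page
        "wt": wt,                   # response writer (xml, json, ...)
    }
    if df:
        params["df"] = df           # default field to query
    if fl:
        params["fl"] = ",".join(fl)  # stored fields to return
    if sort:
        params["sort"] = sort
    return base + "?" + urlencode(params)


# Third page (page index 2) of ten results, returning two hypothetical fields
url = search_url(
    "http://localhost:8983/solr/select",
    "solr",
    page=2,
    fl=["id", "title"],
    sort="score desc",
)
```

Deriving `start` from a page index keeps the paging arithmetic in one place, so callers only think in terms of pages.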

Solr - Data import

• Typically “old” systems rely on databases

• Data can be imported from DBs using the DataImportHandler component

• Define datasource, driver and mappings

Solr - Highlighting

• Useful when a snippet of the search results is needed

• In Solr 1.4.1 only stored fields can be highlighted

• Add &hl=true&hl.fl=field1,field2 to the HTTP search request in order to enable
highlighting on field1 and field2
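When highlighting is enabled, the snippets come back in a separate section of the response, keyed by each document's unique key, rather than inside the documents themselves. The sketch below pairs them up. The sample JSON is made up for illustration and assumes a `wt=json` response with `id` as the unique key; the `attach_snippets` helper is hypothetical.

```python
import json

# Made-up sample in the shape of a wt=json response with hl=true&hl.fl=title
SAMPLE_RESPONSE = """
{
  "response": {
    "numFound": 1,
    "start": 0,
    "docs": [{"id": "doc1", "title": "Apache Solr"}]
  },
  "highlighting": {
    "doc1": {"title": ["Apache <em>Solr</em>"]}
  }
}
"""


def attach_snippets(response, unique_key="id"):
    """Pair every returned document with its highlighted snippets, if any."""
    highlighting = response.get("highlighting", {})
    results = []
    for doc in response["response"]["docs"]:
        snippets = highlighting.get(doc[unique_key], {})
        results.append({"doc": doc, "snippets": snippets})
    return results


results = attach_snippets(json.loads(SAMPLE_RESPONSE))
```

Documents without a match in a highlighted field simply get an empty `snippets` dict, so the caller can fall back to the stored field value.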

Solr - Faceting

• Break up search results into multiple categories showing counts for each

• Often used in stores

• Can be very useful in guiding user experience

• The user can then drill down into the results of a single category
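Field faceting is requested with the `facet=true` and `facet.field` parameters, and with the JSON response writer the counts for a field come back, by default, as a flat list that alternates value and count. A minimal sketch of requesting and reading them follows; the `category` field, the helper names and the sample counts are made up for illustration.

```python
from urllib.parse import urlencode


def facet_params(q, field):
    """Query string asking for the results plus per-value counts on one field."""
    return urlencode({"q": q, "facet": "true", "facet.field": field, "wt": "json"})


def parse_facet_field(response, field):
    """Turn the flat [value, count, value, count, ...] list into pairs."""
    flat = response["facet_counts"]["facet_fields"][field]
    return list(zip(flat[0::2], flat[1::2]))


# Made-up fragment in the shape of a JSON facet response
sample = {
    "facet_counts": {
        "facet_fields": {"category": ["books", 12, "music", 5, "games", 0]}
    }
}
counts = parse_facet_field(sample, "category")
```

The resulting pairs are what a store-style sidebar would render as "books (12)", "music (5)" and so on.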

Solr - Filter queries

• Queries used as filters against the actual query

• Restrict the superset of documents that can be returned, without influencing the score

• Useful for domain specific queries where you want the user to search only in
certain “areas” of the index

• Add &fq=somefilterquery with the default Solr syntax
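Several filter queries can be sent in the same request by repeating the `fq` parameter; Solr returns only the documents that match all of them. A small sketch, where the `filtered_search_url` helper and the filter values are assumptions for illustration:

```python
from urllib.parse import urlencode


def filtered_search_url(base, q, filters):
    """Build a search URL with one fq parameter per filter query."""
    # A list of tuples (rather than a dict) lets the fq key repeat
    params = [("q", q)] + [("fq", f) for f in filters]
    return base + "?" + urlencode(params)


url = filtered_search_url(
    "http://localhost:8983/solr/select",
    "lucene",
    ["category:information", "timestamp:[* TO NOW]"],
)
```

Keeping the restriction in `fq` instead of appending it to `q` leaves the relevance score of the main query untouched.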

Solr - Enterprise what?

• Multicore

• Replication

• Distributed search

• ...

Solr - Multi core

• Define multiple Solr cores inside a single Solr instance

• Each core maintains its own index

• Unified administration interface

• Runtime commands to create, swap, load, unload, delete cores
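The runtime commands are plain HTTP requests to the CoreAdmin handler, selected with an `action` parameter. The sketch below assumes the handler is exposed at `/solr/admin/cores` (the path is configurable in solr.xml) and uses hypothetical core names; `core_admin_url` is an illustrative helper, not part of Solr.

```python
from urllib.parse import urlencode

# Assumption: the admin path configured in solr.xml
CORE_ADMIN = "http://localhost:8983/solr/admin/cores"


def core_admin_url(action, **params):
    """Build a CoreAdmin request URL for the given action and parameters."""
    query = [("action", action.upper())] + sorted(params.items())
    return CORE_ADMIN + "?" + urlencode(query)


# Swap a freshly built "staging" core with the "live" one (hypothetical names)
swap_url = core_admin_url("swap", core="live", other="staging")

# Create a new core from an instance directory (hypothetical names)
create_url = core_admin_url("create", name="core1", instanceDir="core1")
```

Swapping is handy for rebuilding an index offline and then switching it in without restarting the instance.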

Solr - Replication

• Under high traffic it is useful to replicate a Solr instance and split the queries across
the replicas (possibly with a load balancer in front)

• The master holds the original index

• Each slave polls the master, asking for the latest version of the index

• If a slave has an older version of the index, it asks the master for the difference
(rsync-like)

• In the meantime the indexes remain available
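The polling cycle can be pictured with a toy model. This is not Solr's replication code: the dictionaries, file names and checksums below are invented, and the sketch only illustrates the "compare versions, then fetch the difference" idea.

```python
def files_to_fetch(master_files, slave_files):
    """Names of files the slave lacks or holds with a different checksum."""
    return sorted(
        name
        for name, checksum in master_files.items()
        if slave_files.get(name) != checksum
    )


def poll(master, slave):
    """One polling cycle: sync only when the master has a newer index version."""
    if slave["version"] >= master["version"]:
        return []  # already up to date, nothing is transferred
    fetched = files_to_fetch(master["files"], slave["files"])
    for name in fetched:
        slave["files"][name] = master["files"][name]
    # Drop files that no longer belong to the master's index
    for name in list(slave["files"]):
        if name not in master["files"]:
            del slave["files"][name]
    slave["version"] = master["version"]
    return fetched


# Invented index states: file name -> checksum
master = {"version": 3, "files": {"_0.cfs": "a1", "_1.cfs": "b2", "segments_3": "c3"}}
slave = {"version": 2, "files": {"_0.cfs": "a1", "segments_2": "z9"}}

first_poll = poll(master, slave)   # fetches only what changed
second_poll = poll(master, slave)  # nothing left to fetch
```

Only the new or changed files cross the network, which is why the slides compare the mechanism to rsync.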

Solr - Distributed search

• When an index is too large, in terms of space or memory required, it can be
useful to define two or more shards

• A shard is a Solr instance and can be searched or indexed independently

• At the same time it’s possible to query all the shards and have the result
merged from the sub-results of each shard

• http://localhost:8983/solr/select?shards=localhost:8983/solr,localhost:7574/solr&indent=true&q=category:information

• Note that the distribution of documents among the indexes is up to the user (or whoever
feeds the indexes)
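Since distributing the documents is left to whoever feeds the indexes, one common approach is to pick the shard from a hash of the document's unique key. The sketch below shows that idea together with a helper that rebuilds the distributed query URL from the slide. Hash-based routing is one possible strategy, not something Solr 1.4 does for you, and both helper names are hypothetical.

```python
import hashlib
from urllib.parse import urlencode

# The two shards used in the example URL above
SHARDS = ["localhost:8983/solr", "localhost:7574/solr"]


def shard_for(doc_id, shards=SHARDS):
    """Pick the shard that should index a document, based on a stable hash."""
    digest = hashlib.md5(doc_id.encode("utf-8")).hexdigest()
    return shards[int(digest, 16) % len(shards)]


def distributed_query_url(q, shards=SHARDS):
    """Query the first shard and ask it to merge results from all shards."""
    params = {"shards": ",".join(shards), "indent": "true", "q": q}
    # Keep ':', '/' and ',' unescaped so the URL stays readable
    return "http://" + shards[0] + "/select?" + urlencode(params, safe=":/,")


url = distributed_query_url("category:information")
target = shard_for("doc-42")  # hypothetical document id
```

A stable hash sends the same id to the same shard every time, so an update replaces the existing document instead of creating a duplicate on another shard. Note that changing the number of shards changes the mapping and requires reindexing.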

One step beyond...

• Solr in the cloud

• Spatial search

• Solr & UIMA :-)

References

• http://lucene.apache.org/solr/

• http://lucene.apache.org/solr/tutorial.html

• http://wiki.apache.org/solr/FrontPage
