Search Engine-Building with Lucene and Solr, Part 2 (SoCal Code Camp LA 2013)

Search Engine-Building
with Lucene and Solr
Part 2
Kai Chan
SoCal Code Camp, November 2013

Overview
●
●
●
●
●
●
●

indexing process
searching process
advanced features
scaling/redundancy
resources
demo
questions/answers

Indexing Process
● request handler
○ data are read to create documents

● update request processor chain
○
○
○
○

optional document-wide processing
fields can be added, changed, removed
analysis
creation of indexed and stored fields

● update handler
○ the index is updated

Update Request Processor Chain
● de-duplication
○ creates a signature (hash) for each document to be
added
○ replaces (delete) existing documents with the same
signature
○ MD5Signature
■ exact hashing
○ Lookup3Signature
■ faster calculation and smaller hash than MD5
○ TextProfileSignature
■ fuzzy hashing, near-duplicate detection

Update Request Processor Chain
● language detection
○ detects the language used in field(s)
○ adds a language field to the document
○ TikaLanguageIdentifierUpdateProcessorFa
ctory
■ uses Apache Tika
○ LangDetectLanguageIdentifierUpdateProce
ssorFactory
■ uses language-detection library
○ external programs
■ e.g. Chromium Compact Language Detector
See Also: Language detection with Google's Compact Language Detector <http://blog.mikemccandless.com/2011/10/languagedetection-with-googles-compact.html>

Analysis
● analyzed
○ tokenization, i.e. breaking down the content to be
search into smaller units (“tokens”)
○ manipulation of tokens

● not analyzed
○ the whole content treated as 1 unit for searching

● analyzed v.s. not analyzed
○ are individual tokens meaningful on their own?
○ are individual tokens used in queries?

Example 1: book title
Lucene in Action, Second Edition: Covers Apache Lucene 3.0


search for “Lucene”: no match


makes more sense to tokenize

Example 2: ISBN

1-933-98817-7

1 933 98817 7

makes more sense to not tokenize

1 933 98817 7

search for “933”: match

Analysis
analyzed:
● text

How about URL?

not analyzed:
● number
● serial number
● GUID
● checksum

Analysis
● character filter(s)
○ character replacement
○ e.g. accent marks with their base forms
café → cafe
jalapeño → jalapeno

● tokenizer
● token filter(s)

Analysis
● tokenizer
○ create tokens (“words”) from characters
○ sometimes straightforward
○ many unusual cases:
e-mail address, URL, code, etc.

● token filter(s)

Analysis
● tokenizer
● token filter(s)
○ token replacement
■ change case, remove apostrophe
■ remove stop words (a, and, the, for)
■ split/join words (ice-cream, ice cream, icecream)
■ stemming (importing, imported → import)
■ synonym (nation → country)

Field value:
Let's sign up for the amazing So-Cal Code Camp® at http://bit.ly/oZiZsu. Free WiFi!

Tokens (text_general):
1
2
3 4
6
6
7 8
9
10
11 12
13
14
17
let's sign up for the amazing so cal code camp at http bit.ly oZiZsu
fi

Tokens (text_en):
1
2
3
17
let
sign up
fi
Tokens (text_en_splitting):
1
2
3
20
let
sign up
fi

6
amaz

6
amaz

7

8

9

10

so cal code camp

7

8

9

10

so cal code camp
socal

12
http

12
http

13

14

bit.li ozizsu

13

14 1516 17

20

16

free wi

15

16

free wi

18

19

bit ly o zi zsu free wi
httpbitlyozizsu

wifi
8

15

17

Searching Process
●
●
●
●
●
●

query parsing
analysis
scoring
sorting
loading of stored fields
optional search components
○
○
○
○

faceting
term vector
More Like This
highlighting

Scoring
● for a given query, each document not filtered
out gets a score (float)
● higher score: higher in the results
● scoring algorithms
○ default: TF-IDF
○ other: Okapi BM25, etc.
○ very customizable

See Also: Lucene/Solr Revolution 2013 presentation “Beyond TF-IDF: Why, What and How”

Scoring - TF-IDF
● term frequency (TF)
○ how many times does this term appear in this
document?

● inverse document frequency (IDF)
○ how many documents contain this term?
○ score proportional to the inverse of document
frequency

Scoring - Other Factors
● coordination factor (coord)
○ documents that contains all or most query terms get
higher scores

● normalizing factor (norm)
○ adjust for field length and query complexity

Scoring - Boost
● manual override: ask Lucene/Solr to give a
higher score to some particular thing(s)
● index-time
○ per document
○ per field (of a particular document)

● search-time
○ per query

More Like This
● finds documents similar in content (of one
field) to those matched
● constructs a query based on the highest
scoring terms in a document
● requires the field to:
○ have stored term vectors (recommended), or
○ be stored

Credit: How MoreLikeThis Works in Lucene <http://cephas.net/blog/2008/03/30/how-morelikethis-works-in-lucene/>

Spell Checking
● typos in queries happen
● returns spell checking suggestion (if any)
within the same result
● can also be used for auto-complete
○ treating a prefix as a spelling mistake
○ returning full words as suggestions

/select?q=text:"busness comunication"&spellcheck=true&wt=xml

<lst name="spellcheck">
<lst name="suggestions">
<lst name="busness">
<int name="numFound">1</int>
<int name="startOffset">6</int>
<int name="endOffset">13</int>
<arr name="suggestion">
<str>business</str>
</arr>
</lst>
<lst name="comunication">
<int name="numFound">1</int>
<int name="startOffset"
>14</int>
<int name="endOffset">26</int>
<arr name="suggestion">
<str>communication</str>
</arr>
</lst>
</lst>
</lst>

Query Elevation
● a.k.a. “sponsored search”
● make sure certain documents appear at the
top of the results for a certain query

Credit: Google Web Search <http://www.google.com/>

Query Elevation
● configure the elevator search component
in solrconfig.xml
● in elevate.xml, specify the queries and
the list of documents (by id) to elevate or
exclude
● enable query elevation:
enableElevation=true
● (optional) override the sort parameter:
forceElevation=true

Function Query
● like formulas in Excel
● apply functions to field values for filtering
and scoring

Function Query
● query:
q={!func} cos(angle)
● query (range):
q={!frange l=0.5 u=1} cos(angle)
● field:
fl=angle,cos(angle)
● sort:
sort=cos(angle) desc

Spatial Search
● data: contains locations
(longitudes, latitudes)
○ e.g. merchants with store locations

● search: filter and/or sort by location

Credit: Google Maps <http://maps.google.com/>

Spatial Search
● geofilt
○ circle centered at a given point
○ distance from a given point
○ fq={!geofilt sfield=store}&pt=45.15,
-93.85&d=5

● bbox
○ square (“bounding box”) centered at a given point
○ distance from a given point + corners
○ fq={!bbox sfield=store}&pt=45.15,-93.85
&d=5

Credit: Apache Solr Reference Guide 4.5 <http://lucene.apache.org/>

geofilt

bbox

5 km

(45.15, -93.85)


5 km

(45.15, -93.85)

geofilt

x

bbox

x

x
x

o
o

5 km

5 km
o

o

(45.15, -93.85)

(45.15, -93.85)

o

o

x


o

Spatial Search
● geodist
○ returns the distance between the location given in a
field and a certain coordinate
○ e.g. sort by ascending distance from (45.15,-93.85),
and return the distances as the score:
q={!func}geodist()&sfield=store&pt=45.
15,-93.85&sort=score+asc


Scaling/Redundancy - Problems
● collection too large for a single machine
● too many requests for a single machine
● a machine can go down

Scaling/Redundancy - Solutions
● collection too large for a single machine
○ distribution
■ spread the collection across multiple machines

● too many requests for a single machine
○ distribution
■ spread the requests across multiple machines

● a machine can go down
○ replication
■ copy data and configuration across multiple
machines
■ make sure no single point of failure

SolrCloud
● Solr instances
● ZooKeeper instances

SolrCloud
● Solr instances
○ collection (logical index) divided into one or more
partial collections (“shards”)
○ for each shard, one or more Solr instances keep
copies of the data
■ one as leader - handles reads and writes
■ others as replicas - handle reads


SolrCloud
● Solr instances
○ management of Solr instances
○ leader election
○ node discovery

collection (i.e. logical index)

shard 1:
⅓ of the
collection

shard 2:
⅓ of the
collection

shard 3:
⅓ of the
collection

leader

replica

replica

leader

replica

replica

leader

replica

replica

replica


shard 1:
⅓ of the
collection

shard 2:
⅓ of the
collection

shard 3:
⅓ of the
collection

leader

replica

replica

replica

leader

replica

replica

replica

leader

replica

replica


shard 1:
⅓ of the
collection

shard 2:
⅓ of the
collection

shard 3:
⅓ of the
collection

leader

replica

replica

replica

(offline)

leader

replica

replica

leader

replica

replica


shard 1:
⅓ of the
collection

shard 2:
⅓ of the
collection

shard 3:
⅓ of the
collection

leader

replica

replica

replica

replica

leader

replica

replica

leader

replica

replica

Resources - Books
● Lucene in Action
○ written by 3 committer and PMC members
○ somewhat outdated (2010; covers Lucene 3.0)
○ http://www.manning.com/hatcher3/

● Solr in Action
○ early access; coming out later this year
○ http://www.manning.com/grainger/

● Apache Solr 4 Cookbook
○ common problems and useful tips
○ http://www.packtpub.com/apache-solr-4cookbook/book

Resources - Books
● Introduction to Information Retrieval
○ not specific to Lucene/Solr, but about IR concepts
○ free e-book
○ http://nlp.stanford.edu/IR-book/

● Managing Gigabytes
○ indexing, compression and other topics
○ accompanied by MG4J - a full-text search software
○ http://mg4j.di.unimi.it/

Resources - Web
● official websites
○ Lucene Core - http://lucene.apache.org/core/
○ Solr - http://lucene.apache.org/solr/

● mailing lists
● Wiki sites
○ Lucene Core - http://wiki.apache.org/lucene-java/
○ Solr - http://wiki.apache.org/solr/

● reference guides
○ API Documentation for Lucene and Solr
○ Apache Solr Reference Guide

Getting Started
● download Solr
○ requires Java 6 or newer to run

● Solr comes bundled/configured with Jetty
○ <Solr directory>/example/start.jar

● "exampledocs" directory contains sample
documents
○ <Solr directory>/example/exampledocs/post.jar
○ java -Durl=http://localhost:
8983/solr/update -jar post.jar *.xml

● use the Solr admin interface
○ http://localhost:8983/solr/

Search Engine-Building with Lucene and Solr, Part 2 (SoCal Code Camp LA 2013)

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Similar to Search Engine-Building with Lucene and Solr, Part 2 (SoCal Code Camp LA 2013)

Similar to Search Engine-Building with Lucene and Solr, Part 2 (SoCal Code Camp LA 2013) (20)

Recently uploaded

Recently uploaded (20)

Search Engine-Building with Lucene and Solr, Part 2 (SoCal Code Camp LA 2013)