These are the slides for the session I presented at SoCal Code Camp Los Angeles on November 10, 2013.
http://www.socalcodecamp.com/socalcodecamp/session.aspx?sid=8cdfd955-2cd4-44a2-ad08-5353e079685a
3. Indexing Process
● request handler
○ data are read to create documents
● update request processor chain
○ optional document-wide processing
○ fields can be added, changed, removed
○ analysis
○ creation of indexed and stored fields
● update handler
○ the index is updated
4. Update Request Processor Chain
● de-duplication
○ creates a signature (hash) for each document to be added
○ replaces (deletes) existing documents with the same signature
○ MD5Signature
■ exact hashing
○ Lookup3Signature
■ faster calculation and smaller hash than MD5
○ TextProfileSignature
■ fuzzy hashing, near-duplicate detection
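A minimal solrconfig.xml sketch of such a chain (the field names and signature field here are illustrative, not required):

```xml
<updateRequestProcessorChain name="dedupe">
  <processor class="solr.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <!-- field that stores the computed hash -->
    <str name="signatureField">id</str>
    <!-- delete existing docs that share the signature -->
    <bool name="overwriteDupes">true</bool>
    <!-- fields included in the signature (example names) -->
    <str name="fields">name,features,cat</str>
    <str name="signatureClass">solr.processor.Lookup3Signature</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
```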
5. Update Request Processor Chain
● language detection
○ detects the language used in field(s)
○ adds a language field to the document
○ TikaLanguageIdentifierUpdateProcessorFactory
■ uses Apache Tika
○ LangDetectLanguageIdentifierUpdateProcessorFactory
■ uses the language-detection library
○ external programs
■ e.g. Chromium Compact Language Detector
See Also: Language detection with Google's Compact Language Detector <http://blog.mikemccandless.com/2011/10/languagedetection-with-googles-compact.html>
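A sketch of configuring the language-detection variant in solrconfig.xml (the input field names and language field are illustrative):

```xml
<updateRequestProcessorChain name="langid">
  <processor class="solr.LangDetectLanguageIdentifierUpdateProcessorFactory">
    <!-- fields to inspect (example names) -->
    <str name="langid.fl">title,body</str>
    <!-- field to receive the detected language code -->
    <str name="langid.langField">language_s</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
```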
6. Analysis
● analyzed
○ tokenization, i.e. breaking down the content to be searched into smaller units (“tokens”)
○ manipulation of tokens
● not analyzed
○ the whole content treated as 1 unit for searching
● analyzed vs. not analyzed
○ are individual tokens meaningful on their own?
○ are individual tokens used in queries?
7. Example 1: book title
Lucene in Action, Second Edition: Covers Apache Lucene 3.0
● not tokenized: the whole title is one unit, so a search for “Lucene” alone finds no match
● tokenized: “Lucene”, “in”, “Action”, …; a search for “Lucene” matches
● makes more sense to tokenize
8. Example 2: ISBN
1-933-98817-7
● tokenized: “1”, “933”, “98817”, “7”; a search for “933” matches, a likely false positive
● not tokenized: the whole ISBN is one unit
● makes more sense to not tokenize
9. Analysis
● character filter(s)
○ character replacement
○ e.g. accent marks with their base forms
café → cafe
jalapeño → jalapeno
● tokenizer
● token filter(s)
10. Analysis
● character filter(s)
● tokenizer
○ create tokens (“words”) from characters
○ sometimes straightforward
○ many unusual cases:
e-mail address, URL, code, etc.
● token filter(s)
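A schema sketch showing all three stages in order (the field type name is illustrative; the mapping file ships with the Solr example configuration):

```xml
<fieldType name="text_folded" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- character filter: replace accented characters with base forms -->
    <charFilter class="solr.MappingCharFilterFactory"
                mapping="mapping-ISOLatin1Accent.txt"/>
    <!-- tokenizer: split characters into tokens -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- token filter: manipulate the tokens -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```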
12. Field value:
Let's sign up for the amazing So-Cal Code Camp® at http://bit.ly/oZiZsu. Free WiFi!
Tokens (text_general):
let's | sign | up | for | the | amazing | so | cal | code | camp | at | http | bit.ly | ozizsu | free | wifi
Tokens (text_en):
let | sign | up | amaz | so | cal | code | camp | http | bit.li | ozizsu | free | wifi
Tokens (text_en_splitting):
let | sign | up | amaz | so | cal | socal | code | camp | http | bit | ly | o | zi | zsu | httpbitlyozizsu | free | wi | fi | wifi
14. Scoring
● for a given query, each document not filtered
out gets a score (float)
● higher score: higher in the results
● scoring algorithms
○ default: TF-IDF
○ other: Okapi BM25, etc.
○ very customizable
See Also: Lucene/Solr Revolution 2013 presentation “Beyond TF-IDF: Why, What and How”
15. Scoring - TF-IDF
● term frequency (TF)
○ how many times does this term appear in this
document?
● inverse document frequency (IDF)
○ how many documents contain this term?
○ score proportional to the inverse of document
frequency
16. Scoring - Other Factors
● coordination factor (coord)
○ documents that contain all or most query terms get higher scores
● normalizing factor (norm)
○ adjust for field length and query complexity
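Roughly, the TF, IDF, coord and norm factors combine as in Lucene's classic similarity (a simplified sketch; query norm and boosts are omitted):

```
score(q,d) ≈ coord(q,d) · Σ over terms t in q of [ tf(t,d) · idf(t)² · norm(t,d) ]

tf(t,d) = √(frequency of t in d)
idf(t)  = 1 + ln( numDocs / (docFreq(t) + 1) )
```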
17. Scoring - Boost
● manual override: ask Lucene/Solr to give a
higher score to some particular thing(s)
● index-time
○ per document
○ per field (of a particular document)
● search-time
○ per query
18. More Like This
● finds documents similar in content (of one
field) to those matched
● constructs a query based on the highest
scoring terms in a document
● requires the field to:
○ have stored term vectors (recommended), or
○ be stored
Credit: How MoreLikeThis Works in Lucene <http://cephas.net/blog/2008/03/30/how-morelikethis-works-in-lucene/>
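A sketch of a MoreLikeThis request (the document id and field name are illustrative; the field should have term vectors or be stored, as above):

```
http://localhost:8983/solr/select?q=id:SP2514N
    &mlt=true&mlt.fl=features&mlt.mintf=1&mlt.mindf=1&mlt.count=5
```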
19. Spell Checking
● typos in queries happen
● returns spell checking suggestion (if any)
within the same result
● can also be used for auto-complete
○ treating a prefix as a spelling mistake
○ returning full words as suggestions
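As a sketch, assuming a spellcheck search component is configured in solrconfig.xml, a misspelled query can request suggestions in the same response:

```
http://localhost:8983/solr/select?q=lucen
    &spellcheck=true&spellcheck.collate=true
```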
23. Query Elevation
● configure the elevator search component
in solrconfig.xml
● in elevate.xml, specify the queries and
the list of documents (by id) to elevate or
exclude
● enable query elevation:
enableElevation=true
● (optional) override the sort parameter:
forceElevation=true
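A minimal elevate.xml sketch (the query text and document ids are illustrative):

```xml
<elevate>
  <query text="code camp">
    <!-- these documents are pinned to the top, in this order -->
    <doc id="doc1" />
    <doc id="doc2" />
    <!-- this document is excluded from the results -->
    <doc id="doc3" exclude="true" />
  </query>
</elevate>
```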
24. Function Query
● like formulas in Excel
● apply functions to field values for filtering
and scoring
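For instance (a sketch; `price` and `popularity` are assumed field names), a function can drive both filtering and sorting:

```
fq={!frange l=10 u=100}product(price,1.08)
sort=div(popularity,price) desc
```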
28. Spatial Search
● geofilt
○ circle centered at a given point
○ distance from a given point
○ fq={!geofilt sfield=store}&pt=45.15,-93.85&d=5
● bbox
○ square (“bounding box”) centered at a given point
○ distance from a given point + corners
○ fq={!bbox sfield=store}&pt=45.15,-93.85&d=5
Credit: Apache Solr Reference Guide 4.5 <http://lucene.apache.org/>
31. Spatial Search
● geodist
○ returns the distance between the location given in a
field and a certain coordinate
○ e.g. sort by ascending distance from (45.15,-93.85), and return the distances as the score:
q={!func}geodist()&sfield=store&pt=45.15,-93.85&sort=score+asc
Credit: Apache Solr Reference Guide 4.5 <http://lucene.apache.org/>
32. Scaling/Redundancy - Problems
● collection too large for a single machine
● too many requests for a single machine
● a machine can go down
33. Scaling/Redundancy - Solutions
● collection too large for a single machine
○ distribution
■ spread the collection across multiple machines
● too many requests for a single machine
○ distribution
■ spread the requests across multiple machines
● a machine can go down
○ replication
■ copy data and configuration across multiple
machines
■ make sure no single point of failure
35. SolrCloud
● Solr instances
○ collection (logical index) divided into one or more
partial collections (“shards”)
○ for each shard, one or more Solr instances keep
copies of the data
■ one as leader - handles reads and writes
■ others as replicas - handle reads
● ZooKeeper instances
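With SolrCloud running, such a sharded, replicated collection can be created through the Collections API, e.g. (the collection name is illustrative):

```
http://localhost:8983/solr/admin/collections?action=CREATE
    &name=mycollection&numShards=3&replicationFactor=3
```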
37. collection (i.e. logical index)
[diagram: the collection is divided into three shards of ⅓ each; every shard has one leader and several replicas]
38. collection (i.e. logical index)
[diagram: the same layout, with an additional replica added]
39. collection (i.e. logical index)
[diagram: the leader of one shard goes offline]
40. collection (i.e. logical index)
[diagram: a replica of that shard takes over as the new leader]
41. Resources - Books
● Lucene in Action
○ written by three committers/PMC members
○ somewhat outdated (2010; covers Lucene 3.0)
○ http://www.manning.com/hatcher3/
● Solr in Action
○ early access; coming out later this year
○ http://www.manning.com/grainger/
● Apache Solr 4 Cookbook
○ common problems and useful tips
○ http://www.packtpub.com/apache-solr-4cookbook/book
42. Resources - Books
● Introduction to Information Retrieval
○ not specific to Lucene/Solr, but about IR concepts
○ free e-book
○ http://nlp.stanford.edu/IR-book/
● Managing Gigabytes
○ indexing, compression and other topics
○ accompanied by MG4J, a full-text search engine
○ http://mg4j.di.unimi.it/
43. Resources - Web
● official websites
○ Lucene Core - http://lucene.apache.org/core/
○ Solr - http://lucene.apache.org/solr/
● mailing lists
● Wiki sites
○ Lucene Core - http://wiki.apache.org/lucene-java/
○ Solr - http://wiki.apache.org/solr/
● reference guides
○ API Documentation for Lucene and Solr
○ Apache Solr Reference Guide
44. Getting Started
● download Solr
○ requires Java 6 or newer to run
● Solr comes bundled/configured with Jetty
○ <Solr directory>/example/start.jar
● "exampledocs" directory contains sample
documents
○ <Solr directory>/example/exampledocs/post.jar
○ java -Durl=http://localhost:8983/solr/update -jar post.jar *.xml
● use the Solr admin interface
○ http://localhost:8983/solr/