4. Indexing In-Depth
• Deletions and Updates
• Optimize
• Important Internals
– File Formats
– Segments, Commits, Merging
– Compound File System
• Performance
5. Lucene File Formats and Structures
• http://lucene.apache.org/java/2_4_0/fileformats.html
• A Lucene index is made up of one or more
Segments
• Lucene tracks Documents internally by an int “id”
• This id may change across index operations
– You should not rely on it unless you know your index isn’t
changing
• You can ask for a Document by this id on the
IndexReader
6. Segments
• Each Segment is an independent index containing:
– Field Names
– Stored Field values
– Term Dictionary, proximity info and normalization
factors
– Term Vectors (optional)
– Deleted Docs
• Compound File System (CFS) stores all of these logical
pieces in a single file
7. How Lucene Indexes
• Lucene indexes Documents into memory
– At certain trigger points, memory (segments)
are committed/flushed to the Directory
• Can be forced by calling commit()
– Segments are periodically merged (more in a
moment)
8. Segments and Merging
• May be created when new documents are
added
• Are merged from time to time based on
segment size in relation to:
– MergePolicy
– MergeScheduler
– Optimization
9. Merge Policy
• Identifies Segments to be merged
• Two Current Implementations
– LogDocMergePolicy
– LogByteSizeMergePolicy
• mergeFactor - Max # of segments allowed
before merging
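To make mergeFactor concrete, here is a toy simulation rather than Lucene code. It assumes every flush writes one small segment and that, whenever the newest mergeFactor segments have the same size, they are merged into one. The class ToyLogMerge and its methods are hypothetical; the real LogDocMergePolicy and LogByteSizeMergePolicy group segments into size levels and are more involved.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of a logarithmic merge policy; not Lucene's implementation. */
final class ToyLogMerge {
    private final int mergeFactor;
    // Segment sizes in documents, oldest first.
    private final List<Integer> segments = new ArrayList<>();

    ToyLogMerge(int mergeFactor) {
        if (mergeFactor < 2) {
            throw new IllegalArgumentException("mergeFactor must be >= 2");
        }
        this.mergeFactor = mergeFactor;
    }

    /** Flush one new segment of the given size, then cascade merges. */
    void flush(int docs) {
        segments.add(docs);
        while (tailIsMergeable()) {
            int merged = 0;
            for (int i = 0; i < mergeFactor; i++) {
                merged += segments.remove(segments.size() - 1);
            }
            segments.add(merged);
        }
    }

    /** True when the newest mergeFactor segments all have the same size. */
    private boolean tailIsMergeable() {
        int n = segments.size();
        if (n < mergeFactor) return false;
        int size = segments.get(n - 1);
        for (int i = n - mergeFactor; i < n; i++) {
            if (segments.get(i) != size) return false;
        }
        return true;
    }

    List<Integer> segmentSizes() {
        return new ArrayList<>(segments);
    }

    public static void main(String[] args) {
        ToyLogMerge index = new ToyLogMerge(3);
        for (int i = 1; i <= 10; i++) {
            index.flush(1);
            System.out.println("after flush " + i + ": " + index.segmentSizes());
        }
    }
}
```

In this model a larger mergeFactor leaves more segments on disk between merges (less merge work while indexing, more segments to search), while a smaller one merges sooner.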
11. Optimize
• Optimize is the process of merging
segments down into a single segment
• This process can yield significant speedups
in search
• Can be slow
• Can also do partial optimizes
12. Final Thoughts On Merging
• Usually don’t have to think about it, except
when to optimize
• In high update, performance critical
environments, you may need to dig into it
more as it can sometimes cause long pauses
• Good to optimize when you can, otherwise,
keep a low mergeFactor
13. Deletion
• A deletion only marks the Document as
deleted
– Doesn’t get physically removed until a merge
• Deletions can be a bit confusing
– Both IndexReader and IndexWriter
have delete methods
• By: id, term(s), Query(s)
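The mark-now, remove-at-merge behavior can be pictured with a small stand-in, assuming a segment is just a list of documents plus a bit set of deleted ids. ToySegment is a hypothetical class for illustration only; its method names echo IndexReader's maxDoc(), numDocs() and isDeleted(), but nothing here touches Lucene.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

/** Toy segment showing mark-then-merge deletion; not Lucene code. */
final class ToySegment {
    private final List<String> docs = new ArrayList<>();
    private final BitSet deleted = new BitSet();

    /** Appends a document and returns its id within this segment. */
    int add(String doc) {
        docs.add(doc);
        return docs.size() - 1;
    }

    /** Deleting only sets a bit; the document's data stays in place. */
    void delete(int id) {
        if (id < 0 || id >= docs.size()) {
            throw new IndexOutOfBoundsException("no such id: " + id);
        }
        deleted.set(id);
    }

    boolean isDeleted(int id) { return deleted.get(id); }

    /** One more than the largest id, counting deleted documents. */
    int maxDoc() { return docs.size(); }

    /** Number of live (not deleted) documents. */
    int numDocs() { return docs.size() - deleted.cardinality(); }

    String document(int id) { return docs.get(id); }

    /** A merge copies live documents into a new segment, renumbering them. */
    ToySegment merge() {
        ToySegment out = new ToySegment();
        for (int id = 0; id < docs.size(); id++) {
            if (!deleted.get(id)) out.add(docs.get(id));
        }
        return out;
    }

    public static void main(String[] args) {
        ToySegment s = new ToySegment();
        s.add("a");
        s.delete(s.add("b"));
        s.add("c");
        System.out.println("before merge: maxDoc=" + s.maxDoc() + " numDocs=" + s.numDocs());
        ToySegment merged = s.merge();
        System.out.println("after merge:  maxDoc=" + merged.maxDoc() + " numDocs=" + merged.numDocs());
    }
}
```

The merge step also renumbers the surviving documents, which is one reason the internal id should not be treated as stable.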
14. Task
– Build your index from yesterday and then try
some deletes
• Id, term, Query
– Also try out an optimize on a FSDirectory
against the full Reuters sample
– 15-20 minutes
15. Updates
• Updates are always a delete and an add
• Updates are always a delete and an add
– Yes, that is a repeat!
– Nature of data structures used in search
• See
IndexWriter.updateDocument()
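A rough sketch of "update = delete + add", assuming each document carries a unique key term that identifies it. ToyUpdatableIndex and its methods are hypothetical stand-ins that only borrow the method names; the point is that the replacement document gets a new internal id while the old copy is merely marked deleted.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

/** Toy index where an update is a delete followed by an add; not Lucene code. */
final class ToyUpdatableIndex {
    private final List<String> keys = new ArrayList<>();
    private final List<String> bodies = new ArrayList<>();
    private final BitSet deleted = new BitSet();

    /** Appends a document and returns its internal id. */
    int addDocument(String key, String body) {
        keys.add(key);
        bodies.add(body);
        return keys.size() - 1;
    }

    /** Marks every live document with this key as deleted; returns how many. */
    int deleteDocuments(String key) {
        int count = 0;
        for (int id = 0; id < keys.size(); id++) {
            if (!deleted.get(id) && keys.get(id).equals(key)) {
                deleted.set(id);
                count++;
            }
        }
        return count;
    }

    /** Delete by key, then add the replacement; returns the new id. */
    int updateDocument(String key, String body) {
        deleteDocuments(key);
        return addDocument(key, body);
    }

    /** Returns the id of the live document for a key, or -1 if none. */
    int find(String key) {
        for (int id = 0; id < keys.size(); id++) {
            if (!deleted.get(id) && keys.get(id).equals(key)) return id;
        }
        return -1;
    }

    String body(int id) { return bodies.get(id); }

    public static void main(String[] args) {
        ToyUpdatableIndex index = new ToyUpdatableIndex();
        int before = index.addDocument("doc-1", "old text");
        int after = index.updateDocument("doc-1", "new text");
        System.out.println("id before=" + before + " id after=" + after
                + " body=" + index.body(index.find("doc-1")));
    }
}
```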
16. Performance Factors
• setRAMBufferSizeMB
– New model for automagically controlling indexing
factors based on the amount of memory in use
– Obsoletes setMaxBufferedDocs
• maxBufferedDocs
– Minimum # of docs buffered in memory before a flush occurs and a
new segment is created
– Usually, Larger == faster, but more RAM
17. More Factors
• mergeFactor
– How often segments are merged
– Smaller == less RAM, better for incremental updates
– Larger == faster, better for batch indexing
• maxFieldLength
– Limit the number of terms indexed per Field in a Document
• Analysis
• Reuse
– Document, TokenStream, Token
18. Index Threading
• IndexWriter and IndexReader are thread-safe and can be shared
between threads without external synchronization
• One open IndexWriter per Directory
• Parallel Indexing
– Index to separate Directory instances
– Merge using IndexWriter.addIndexes
– Could also distribute and collect
19. Benchmarking Indexing
• contrib/benchmark
• Try out different algorithms between Lucene 2.2
and 2.3
– contrib/benchmark/conf:
• indexing.alg
• indexing-multithreaded.alg
• Info:
– Mac Pro 2 x 2GHz Dual-Core Xeon
– 4 GB RAM
– ant run-task -Dtask.alg=./conf/indexing.alg -Dtask.mem=1024M
21. Searching
• Earlier we touched on basics of search
using the QueryParser
• Now look at:
– Searcher/IndexReader Lifecycle
– Query classes
– More details on the QueryParser
– Filters
– Sorting
22. Lifecycle
• Recall that the IndexReader loads a snapshot
of the index into memory
– This means updates made since loading the index will
not be seen
• Business rules are needed to define how often to
reload the index, if at all
– IndexReader.isCurrent() can help
• Loading an index is an expensive operation
– Do not open a Searcher/IndexReader for every
search
23. Reopen
• It is possible to have IndexReader reopen new
or changed segments
– Save some on the cost of loading a new index
• Does not close the old reader, so application must
• See
DeletionsUpdatesTest.testReopen()
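One way to picture the "application must close the old reader" rule is a small holder that swaps in the reopened reader and closes the one it replaces. This is a sketch under assumptions: ToyReader and ReaderHolder are hypothetical stand-ins, not Lucene classes, and the identity check covers the case where a reopen hands back the same instance because nothing changed. A real application would also have to wait for in-flight searches on the old reader before closing it, which this sketch does not attempt.

```java
import java.io.Closeable;
import java.util.Objects;
import java.util.concurrent.atomic.AtomicReference;

/** Holds the reader that searches should use; closes readers it replaces. */
final class ReaderHolder {
    private final AtomicReference<ToyReader> current;

    ReaderHolder(ToyReader initial) {
        current = new AtomicReference<>(Objects.requireNonNull(initial));
    }

    ToyReader get() { return current.get(); }

    /** Installs the candidate and closes the previous reader if it differs. */
    void swap(ToyReader candidate) {
        Objects.requireNonNull(candidate);
        ToyReader old = current.getAndSet(candidate);
        if (old != candidate) {
            old.close();
        }
    }

    public static void main(String[] args) {
        ToyReader first = new ToyReader();
        ReaderHolder holder = new ReaderHolder(first);
        holder.swap(first);            // unchanged index: nothing is closed
        holder.swap(new ToyReader());  // changed index: old reader is closed
        System.out.println("first reader closed: " + first.closed);
    }
}

/** Stand-in for a reader; it only tracks whether it was closed. */
final class ToyReader implements Closeable {
    volatile boolean closed = false;

    @Override
    public void close() { closed = true; }
}
```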
24. Query Classes
• TermQuery is the basis for all non-span queries
• BooleanQuery combines multiple Query
instances as clauses
– should
– required
• PhraseQuery finds terms occurring near each
other, position-wise
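– “slop” is the maximum number of position moves allowed to bring the
terms into phrase order (an edit distance over term positions, not characters)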
• Take 2-3 minutes to explore Query
implementations
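For the slop idea, a simplified two-term model may help, assuming each term occurs once and positions are plain integers. ToySlop is a hypothetical helper; Lucene's sloppy phrase scoring handles more terms and repeated terms, so treat this as an approximation of the concept only.

```java
/** Simplified slop arithmetic for a two-term phrase; not Lucene code. */
final class ToySlop {
    /**
     * Slop needed for the phrase "first second" to match, given the
     * position of each term in the document.
     */
    static int requiredSlop(int firstPos, int secondPos) {
        // In an exact match the second term sits at firstPos + 1.
        return Math.abs(secondPos - (firstPos + 1));
    }

    static boolean matches(int firstPos, int secondPos, int slop) {
        return requiredSlop(firstPos, secondPos) <= slop;
    }

    public static void main(String[] args) {
        // Document: "the quick brown fox" -> quick at 1, fox at 3.
        System.out.println("\"quick fox\" needs slop " + requiredSlop(1, 3));
        System.out.println("\"fox quick\" needs slop " + requiredSlop(3, 1));
    }
}
```

Note that terms appearing in reverse order cost more moves than the gap between them suggests, because the terms have to pass over each other.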
25. Spans
• Spans provide information about where
matches took place
• Not supported by the QueryParser
• Can be used in BooleanQuery clauses
• Take 2-3 minutes to explore SpanQuery
classes
– SpanNearQuery useful for doing phrase
matching
26. QueryParser
• MultiFieldQueryParser
• Boolean operators cause confusion
– Better to think in terms of required (+ operator) and not
allowed (- operator)
• Check JIRA for QueryParser issues
• http://www.gossamer-threads.com/lists/lucene/java-user/40945
• Most applications either modify QP, create their
own, or restrict to a subset of the syntax
• Your users may not need all the “flexibility” of
the QP
27. Sorting
• Lucene default sort is by score
• Searcher has several methods that take in a
Sort object
• Sorting should be addressed during indexing
• Sorting is done on Fields containing a single
term that can be used for comparison
• The SortField defines the different sort types
available
– AUTO, STRING, INT, FLOAT, CUSTOM, SCORE,
DOC
28. Sorting II
• Look at Searcher, Sort and
SortField
• Custom sorting is done with a
SortComparatorSource
• Sorting can be very expensive
– Terms are cached in the FieldCache
29. Filters
• Filters restrict the search space to a
subset of Documents
• Use Cases
– Search within a Search
– Restrict by date
– Rating
– Security
– Author
30. Filter Classes
• QueryWrapperFilter (QueryFilter)
– Restrict to subset of Documents that match a Query
• RangeFilter
– Restrict to Documents that fall within a range
– Better alternative to RangeQuery
• CachingWrapperFilter
– Wrap another Filter and provide caching
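The filter classes above can be approximated with bit sets, assuming a filter is simply "one bit per document id" and a caching wrapper remembers the bits per index snapshot. All types below (ToyFilter, ContainsFilter, CachingToyFilter, ToyFilteredSearch) are hypothetical, and the String array stands in for an index reader.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

/** Runs a one-word "query" over only the documents a filter allows. */
final class ToyFilteredSearch {
    static List<Integer> search(String[] docs, String queryWord, ToyFilter filter) {
        BitSet allowed = filter.bits(docs);
        List<Integer> hits = new ArrayList<>();
        for (int id = allowed.nextSetBit(0); id >= 0; id = allowed.nextSetBit(id + 1)) {
            if (docs[id].contains(queryWord)) hits.add(id);
        }
        return hits;
    }

    public static void main(String[] args) {
        String[] docs = {"lucene search 2008", "solr search 2007", "lucene index 2007"};
        ToyFilter only2007 = new CachingToyFilter(new ContainsFilter("2007"));
        System.out.println(search(docs, "search", only2007));
        System.out.println(search(docs, "lucene", only2007));
    }
}

/** A filter yields one bit per document id: set means "allowed". */
interface ToyFilter {
    BitSet bits(String[] docs);
}

/** Allows documents whose text contains a given word. */
final class ContainsFilter implements ToyFilter {
    private final String word;
    int computations = 0; // how many times the bits were rebuilt

    ContainsFilter(String word) { this.word = word; }

    @Override
    public BitSet bits(String[] docs) {
        computations++;
        BitSet b = new BitSet(docs.length);
        for (int id = 0; id < docs.length; id++) {
            if (docs[id].contains(word)) b.set(id);
        }
        return b;
    }
}

/** Caches the wrapped filter's bits for the most recent snapshot. */
final class CachingToyFilter implements ToyFilter {
    private final ToyFilter inner;
    private String[] cachedFor;
    private BitSet cached;

    CachingToyFilter(ToyFilter inner) { this.inner = inner; }

    @Override
    public synchronized BitSet bits(String[] docs) {
        if (cachedFor != docs) { // identity check: same snapshot, same bits
            cached = inner.bits(docs);
            cachedFor = docs;
        }
        return (BitSet) cached.clone();
    }
}
```

The cache only pays off when the same snapshot is searched repeatedly, which ties back to keeping one long-lived Searcher.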
31. Task
• Modify your program to sort by a field and
to filter by a query or some other criteria
– ~15 minutes
32. Searchers
• MultiSearcher
– Search over multiple Searchables, including remote
• MultiReader
– Not a Searcher, but can be used with
IndexSearcher to achieve same results for local
indexes
• ParallelMultiSearcher
– Like MultiSearcher, but threaded
• RemoteSearchable
– RMI based remote searching
• Look at MultiSearcherTest in example
code
33. Expert Results
• Searcher has several “expert” methods
• HitCollector allows low-level access to all
Documents as they are scored
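The collector idea is a callback invoked once per matching document with its score. The sketch below imitates that shape with hypothetical types (ToyCollectorDemo and its nested Collector); the "score" is just a term count, chosen only to keep the example self-contained.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy per-hit callback in the spirit of a hit collector; not Lucene code. */
final class ToyCollectorDemo {
    /** Called once for every matching document. */
    interface Collector {
        void collect(int doc, float score);
    }

    /** Scores each document by how often the word occurs; reports matches. */
    static void search(String[] docs, String word, Collector collector) {
        for (int id = 0; id < docs.length; id++) {
            int freq = 0;
            for (String token : docs[id].split("\\s+")) {
                if (token.equals(word)) freq++;
            }
            if (freq > 0) collector.collect(id, (float) freq);
        }
    }

    /** Example collector: keep only the ids whose score reaches a minimum. */
    static List<Integer> idsScoringAtLeast(String[] docs, String word, final float min) {
        final List<Integer> ids = new ArrayList<>();
        search(docs, word, new Collector() {
            @Override
            public void collect(int doc, float score) {
                if (score >= min) ids.add(doc);
            }
        });
        return ids;
    }

    public static void main(String[] args) {
        String[] docs = {"a b a", "b", "a"};
        System.out.println(idsScoringAtLeast(docs, "a", 2f));
    }
}
```

Because the callback sees every match, it should do as little work as possible per hit.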
34. Search Performance
• Search speed is based on a number of factors:
– Query Type(s)
– Query Size
– Analysis
– Occurrences of Query Terms
– Optimize
– Index Size
– Index type (RAMDirectory, other)
– Usual Suspects
• CPU
• Memory
• I/O
• Business Needs
35. Query Types
• Be careful with WildcardQuery as it rewrites
to a BooleanQuery containing all the terms
that match the wildcards
• Avoid starting a WildcardQuery with a wildcard
• Use ConstantScoreRangeQuery instead of
RangeQuery
• Be careful with range queries and dates
– User mailing list and Wiki have useful tips for
optimizing date handling
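Why a leading wildcard hurts can be seen with a sorted term dictionary, assuming terms are kept in sorted order so that a known prefix allows a seek. ToyWildcard is a hypothetical illustration: a trailing wildcard visits only the terms sharing the prefix, while a leading wildcard has to examine every term. Either way, each matching term becomes one clause of the rewritten query.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SortedSet;

/** Toy wildcard expansion against a sorted term dictionary; not Lucene code. */
final class ToyWildcard {
    /** Matching terms plus how many dictionary entries were examined. */
    static final class Expansion {
        final List<String> terms = new ArrayList<>();
        int examined = 0;
    }

    /** Expands "prefix*": seek to the prefix, stop at the first non-match. */
    static Expansion expandPrefix(SortedSet<String> dict, String prefix) {
        Expansion e = new Expansion();
        for (String term : dict.tailSet(prefix)) {
            e.examined++;
            if (!term.startsWith(prefix)) break;
            e.terms.add(term);
        }
        return e;
    }

    /** Expands "*suffix": no seek is possible, so every term is examined. */
    static Expansion expandSuffix(SortedSet<String> dict, String suffix) {
        Expansion e = new Expansion();
        for (String term : dict) {
            e.examined++;
            if (term.endsWith(suffix)) e.terms.add(term);
        }
        return e;
    }

    public static void main(String[] args) {
        SortedSet<String> dict = new java.util.TreeSet<>(java.util.Arrays.asList(
                "apple", "index", "indexer", "indexing", "search", "searching", "zebra"));
        Expansion p = expandPrefix(dict, "index");
        Expansion s = expandSuffix(dict, "ing");
        System.out.println("index* -> " + p.terms + ", examined " + p.examined);
        System.out.println("*ing   -> " + s.terms + ", examined " + s.examined);
    }
}
```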
36. Query Size
• Stopword removal
• Search an “all” field instead of many fields with the same
terms
• Disambiguation
– May be useful when doing synonym expansion
– Difficult to automate and may be slower
– Some applications may allow the user to disambiguate
• Relevance Feedback/More Like This
– Use most important words
– “Important” can be defined in a number of ways
37. Usual Suspects
• CPU
– Profile your application
• Memory
– Examine your heap size, garbage collection approach
• I/O
– Cache your Searcher
• Define business logic for refreshing based on indexing needs
– Warm your Searcher before going live -- See Solr
• Business Needs
– Do you really need to support Wildcards?
– What about date range queries down to the millisecond?
38. FieldSelector
• Prior to version 2.1, Lucene always loaded all
Fields in a Document
• FieldSelector API addition allows Lucene to
skip large Fields
– Options: Load, Lazy Load, No Load, Load and Break,
Load for Merge, Size, Size and Break
• Makes storage of original content more viable
without large cost of loading it when not used
• FieldSelectorTest in example code
39. Relevance
• At some point along your journey, you will
get results that you think are “bad”
• Is it a big deal?
– Content, Content, Content!
– Relevance Judgments
– Don’t break other queries just to “fix” one
• Hardcode it!
– A query doesn’t always have to result in a
“search”
40. Scoring and Similarity
• Lucene has a sophisticated scoring
mechanism designed to meet most needs
• Has hooks for modifying scores
• Scoring is handled by the Query, Weight
and Scorer classes
41. Explanations
• explain(Query, int) method is
useful for understanding why a Document
scored the way it did
• Shows all the pieces that went into scoring
the result:
– Tf, DF, boosts, etc.
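The pieces an explanation lists can be reproduced by hand for a single term. The sketch below uses the default similarity formulas as I understand them (tf = sqrt(freq), idf = 1 + ln(numDocs / (docFreq + 1)), length norm = 1 / sqrt(numTerms)) and leaves out boosts, query normalization, coord, and the precision lost when norms are stored. ToyExplain is a hypothetical helper, and it uses double where Lucene uses float.

```java
/** Hand computation of single-term scoring factors; not Lucene code. */
final class ToyExplain {
    /** Term frequency factor: grows with the square root of the raw count. */
    static double tf(int freq) {
        return Math.sqrt(freq);
    }

    /** Inverse document frequency: rarer terms score higher. */
    static double idf(int docFreq, int numDocs) {
        return Math.log(numDocs / (double) (docFreq + 1)) + 1.0;
    }

    /** Length normalization: shorter fields score higher. */
    static double lengthNorm(int numTerms) {
        return 1.0 / Math.sqrt(numTerms);
    }

    /** Product of the three factors for one term in one field. */
    static double fieldWeight(int freq, int docFreq, int numDocs, int numTerms) {
        return tf(freq) * idf(docFreq, numDocs) * lengthNorm(numTerms);
    }

    public static void main(String[] args) {
        int freq = 4, docFreq = 9, numDocs = 10, numTerms = 16;
        System.out.println("tf         = " + tf(freq));
        System.out.println("idf        = " + idf(docFreq, numDocs));
        System.out.println("lengthNorm = " + lengthNorm(numTerms));
        System.out.println("fieldWeight= " + fieldWeight(freq, docFreq, numDocs, numTerms));
    }
}
```

Comparing numbers like these against the output of explain() for a simple TermQuery is a quick way to see which factor dominates a surprising score.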
42. Tuning Relevance
• FunctionQuery from Solr (variation in
Lucene)
• Override Similarity
• Implement own Query and related classes
• Payloads
• Boosts
43. Task
• Open Luke and try some queries and then
use the “explain” button
• Or, write some code to do explains on a
query and some documents
• See how Query type, boosting, other
factors play a role in the score
44. Terms and Term Vectors
• Sometimes you need access to the Term
Dictionary:
– Auto suggest
– Frequency information
• Sometimes you need a Document-centric
view of terms, frequencies, positions and
offsets
– Term Vectors
45. Term Information
• TermEnum gives access to terms and how many
Documents they occur in
– IndexReader.terms()
• TermDocs gives access to the frequency of a
term in a Document
– IndexReader.termDocs()
– TermPositions extends TermDocs and
provides access to position and payload info
– IndexReader.termPositions()
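As an illustration of the auto-suggest use case, the sketch below assumes the term dictionary has already been read into a sorted map of term to document frequency, which is roughly the information a term enumeration exposes. ToySuggest is hypothetical and does not call Lucene; it seeks to the prefix and ranks the candidates by document frequency.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

/** Toy prefix suggester over a term -> document frequency map; not Lucene code. */
final class ToySuggest {
    /** Returns up to max terms starting with prefix, most frequent first. */
    static List<String> suggest(SortedMap<String, Integer> termDocFreq, String prefix, int max) {
        List<Map.Entry<String, Integer>> matches = new ArrayList<>();
        // Seek to the first term >= prefix, then walk while the prefix matches.
        for (Map.Entry<String, Integer> e : termDocFreq.tailMap(prefix).entrySet()) {
            if (!e.getKey().startsWith(prefix)) break;
            matches.add(e);
        }
        // Stable sort: ties keep dictionary order.
        matches.sort((a, b) -> b.getValue().compareTo(a.getValue()));
        List<String> out = new ArrayList<>();
        for (int i = 0; i < matches.size() && i < max; i++) {
            out.add(matches.get(i).getKey());
        }
        return out;
    }

    public static void main(String[] args) {
        SortedMap<String, Integer> dict = new TreeMap<>();
        dict.put("lucene", 5);
        dict.put("lucid", 7);
        dict.put("luke", 2);
        dict.put("solr", 3);
        System.out.println(suggest(dict, "lu", 2));
    }
}
```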
46. Term Vectors
• Term Vectors give access to term frequency
information in a given Document
– IndexReader.getTermFreqVector
• TermVectorMapper provides callbacks
for working with Term Vectors
49. Recap
• Indexing
• Searching
• Performance
• Odds and Ends
– Explains
– FieldSelector
– Relevance
– Terms and Term Vectors
50. Class Project
• Your chance to really dig in and get your
hands dirty
• Ask Questions
• Options…
51. Option I
• Start building out your Lucene Application!
– Index your Data (or any data)
• Threading/Updates/Deletions
• Analysis
– Search
• Caching/Warming
• Dealing with Updates
• Multi-threaded
– Display
52. Option II
• Dig deeper into an area of interest
– Performance
• How fast can you index?
• Search? Queries per Second?
– Analysis
– Query Parsing
– Scoring
– Contrib
53. Option III
• Dig into JIRA issues and find something to
fix in Lucene
• https://issues.apache.org/jira/secure/Dashboard.jspa
• http://wiki.apache.org/lucene-java/HowToCon
57. Open Discussion
• Multilingual Best Practices
– UNICODE
– One Index versus many
• Advanced Analysis
• Distributed Lucene
• Crawling
• Hadoop
• Nutch
• Solr
59. Finally…
• Please take the time to fill out a survey to
help me improve this training
– Located in base directory of source
– Email it to me at trainer@lucenebootcamp.com
• There are several Lucene related talks on
Wednesday