Munching & crunching - Lucene index post-processing

Lucene EuroCon 10 presentation on index post-processing (splitting, merging, sorting, pruning), tiered search, bitwise search, and a few slides on MapReduce indexing models (I ran out of time to show them, but they are there...)

  1. Munching & crunching – Lucene index post-processing and applications. Andrzej Białecki <andrzej.bialecki@lucidimagination.com>
  2. Intro
     • Started using Lucene in 2003 (1.2-dev?)
     • Created Luke – the Lucene Index Toolbox
     • Nutch and Hadoop committer, Lucene PMC member
     • Nutch project lead
  3. Munching and crunching? But really...
     • Stir your imagination
     • Think outside the box
     • Show some unorthodox uses and practical applications
     • Close ties to scalability, performance, distributed search and query latency
  4. Agenda
     • Post-processing: splitting, merging, sorting, pruning
     • Tiered search
     • Bit-wise search
     • (Map-reduce indexing models)
     Apache Lucene EuroCon, 20 May 2010
  5. Why post-process indexes?
     • Isn't it better to build them right from the start?
     • Sometimes it's not convenient or feasible:
       • Correcting the impact of unexpected common words
       • Targeting a specific index size or composition: creating evenly-sized shards, re-balancing shards across servers, fitting indexes completely in RAM
     • ... and sometimes it's impossible to do it right from the start:
       • Trimming index size while retaining the quality of top-N results
  6. Merging indexes
     • It's easy to merge several small indexes into one
     • A fundamental Lucene operation during indexing (SegmentMerger)
     • Command-line utilities exist: IndexMergeTool
     • API:
       • IndexWriter.addIndexes(IndexReader...)
       • IndexWriter.addIndexesNoOptimize(Directory...)
       • Hopefully a more flexible API on the flex branch
     • Solr: through CoreAdmin action=mergeindexes
     • Note: schemas must be compatible
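At its core, merging inverted indexes means concatenating the document spaces and offsetting the doc IDs of the second index, the same renumbering a Lucene segment merge performs. A minimal, illustrative sketch (toy dictionaries, not Lucene's actual SegmentMerger code):

```python
def merge_postings(a, a_num_docs, b):
    """Merge two toy inverted indexes (term -> sorted list of doc IDs).

    Doc IDs coming from `b` are shifted by the number of documents in `a`,
    mirroring how a segment merge renumbers documents into one contiguous
    ID space.
    """
    merged = {t: list(p) for t, p in a.items()}
    for term, postings in b.items():
        merged.setdefault(term, []).extend(d + a_num_docs for d in postings)
    return merged
```

For example, merging an index of 3 documents with a second index maps the second index's doc 0 to doc 3 in the result.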
  7. Splitting indexes
     • IndexSplitter tool:
       • Moves whole segments (e.g. _0, _1, _2) to standalone indexes
     • Pros: nearly no IO/CPU involved – just rename segment files and create a new SegmentInfos file
     • Cons:
       • Requires a multi-segment index!
       • Very limited control over the content of the resulting indexes → MergePolicy
  8. Splitting indexes, take 2
     • MultiPassIndexSplitter tool:
       • Uses an IndexReader that keeps the list of deletions in memory – the source index remains unmodified
       • For each partition:
         • Marks all source documents not in the partition as deleted
         • Writes a target split using IndexWriter.addIndexes(IndexReader) – IndexWriter knows how to skip deleted documents
         • Removes the "deleted" mark from all source documents
     • Pros:
       • Arbitrary splits possible (even partially overlapping)
       • Source index remains intact
     • Cons:
       • Reads the complete index N times – I/O is O(N * indexSize)
       • Takes twice as much space (source index remains intact)... but maybe it's a feature?
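The multi-pass idea above can be sketched abstractly: one full scan of the source per partition, with everything outside the partition treated as deleted for that pass. This is an illustration of the workflow, not the MultiPassIndexSplitter code itself:

```python
def multi_pass_split(docs, partitions):
    """Toy version of the MultiPassIndexSplitter idea.

    For each partition, scan the whole source, mark ("delete") everything
    outside the partition, and copy only the survivors to a new index.
    The source list is never modified.

    `docs` is a list of documents; each partition is a set of doc IDs.
    """
    splits = []
    for part in partitions:
        # One full pass over the source per partition -> O(N * indexSize) I/O.
        deleted = {i for i in range(len(docs)) if i not in part}
        splits.append([d for i, d in enumerate(docs) if i not in deleted])
    return splits
```

Note that overlapping partitions work too, since the "deletion" marks are recomputed from scratch on every pass.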
  9. Splitting indexes, take 3
     • SinglePassSplitter:
       • Uses the same processing workflow as SegmentMerger, only with multiple outputs
       • Writes new SegmentInfos and FieldInfos
       • Merges (pass-through) stored fields, the term dictionary, postings with payloads, and term vectors
       • Renumbers document ID-s on-the-fly to form a contiguous space in each output
     • Pros: flexibility as with MultiPassIndexSplitter
     • Status: work started, to be contributed soon...
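The key trick is the on-the-fly renumbering: each output partition must get its own contiguous doc-ID space in a single pass over the source. A minimal sketch of that logic (illustrative only; `assign` is a hypothetical partitioning function, not a Lucene API):

```python
def single_pass_split(postings, num_docs, assign):
    """Sketch of SinglePassSplitter-style doc-ID renumbering.

    `postings`: term -> sorted list of source doc IDs.
    `assign(doc_id)`: which partition each source document belongs to.
    Each output partition receives contiguous doc IDs starting at 0.
    """
    n_parts = max(assign(d) for d in range(num_docs)) + 1
    remap, counters = {}, [0] * n_parts
    for d in range(num_docs):            # single pass over the documents
        p = assign(d)
        remap[d] = (p, counters[p])      # new contiguous ID within partition p
        counters[p] += 1
    outputs = [dict() for _ in range(n_parts)]
    for term, plist in postings.items(): # pass-through of the postings
        for d in plist:
            p, new_id = remap[d]
            outputs[p].setdefault(term, []).append(new_id)
    return outputs
```

Because `assign` is arbitrary, this supports splits by range, round-robin, field value, and so on, while reading the source only once.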
  10. Splitting indexes, summary
     • SinglePassSplitter – best tradeoff of flexibility/IO/CPU
     • Interesting scenarios with SinglePassSplitter:
       • Split by ranges, round-robin, by field value, by frequency, to a target size, etc.
       • "Extract" a handful of documents to a separate index
       • "Move" documents between indexes: "extract" from the source, add to the target (merge), delete from the source
     • Now the source index may reside on a network FS – the amount of IO is O(1 * indexSize)
  11. Index sorting – introduction
     • "Early termination" technique:
       • If full execution of a query takes too long, terminate early and estimate
     • Termination conditions:
       • Number of documents – LimitedCollector in Nutch
       • Time – TimeLimitingCollector (see also the extended TimeLimitingIndexReader in LUCENE-1720)
     • Problems:
       • Difficult to estimate total hits
       • Important docs may not be collected if they have high docID-s
  12. Index sorting – details
     • Define a global ordering of documents (e.g. PageRank, popularity, quality, etc.)
     • Documents with a good rank should generally score higher
     • Sort (internal) ID-s by this ordering, descending
     • Map from old to new ID-s to follow this ordering
     • Change the ID-s in the postings
     • With the original index, early termination == poor results; with the sorted index, early termination == good results
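The steps above amount to building a permutation of doc IDs from the rank ordering and rewriting every postings list through it. A small sketch of that mapping (illustrative, not Nutch's IndexSorter code):

```python
def sort_index(postings, ranks):
    """Reassign internal doc IDs so higher-ranked documents get lower IDs,
    then rewrite every postings list with the new IDs.

    `postings`: term -> sorted list of doc IDs; `ranks[d]`: quality of doc d.
    """
    # Old doc IDs ordered by decreasing rank; position in this order
    # becomes the new doc ID.
    order = sorted(range(len(ranks)), key=lambda d: -ranks[d])
    old_to_new = {old: new for new, old in enumerate(order)}
    return {
        term: sorted(old_to_new[d] for d in plist)
        for term, plist in postings.items()
    }
```

After sorting, a collector that stops after the first k hits has already seen the highest-ranked candidates, which is what makes early termination safe.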
  13. Index sorting – summary
     • Implementation in Nutch: IndexSorter
     • Based on PageRank – sorts by decreasing page quality
     • Uses FilterIndexReader
     • NOTE: "early termination" will (significantly) reduce the quality of results with non-sorted indexes – use both or neither
  14. Index pruning
     • Quick refresher on index composition:
       • Stored fields
       • Term dictionary
       • Term frequency data
       • Positional data (postings), with or without payload data
       • Term frequency vectors
     • The number of documents may run into the millions
     • The number of terms is commonly well into the millions
     • Not to mention individual postings...
  15. Index pruning & top-N retrieval
     • N is usually << 1000
     • Very often search quality is judged based on the top 20
     • Question: do we really need to keep and process ALL terms and ALL postings for good-quality top-N search for common queries?
  16. Index pruning hypothesis
     • There should be a way to remove some of the less important data, while retaining the quality of top-N results!
     • Question: what data is less important? Some answers:
       • That of poorly-scoring documents
       • That of common (less selective) terms
     • Dynamic pruning skips less relevant data during query processing → runtime cost...
     • But can we do this work in advance (static pruning)?
  17. What do we need for top-N results?
     • Work backwards. For each common query:
       • Run it against the full index
       • Record the top-N matching documents
     • For each document in the results:
       • Record the terms and term positions that contributed to the score
     • Finally: remove all non-recorded postings and terms
     • First proposed by D. Carmel (2001) for single-term queries
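For the single-term case this "work backwards" procedure reduces to: score each term's postings as a one-term query, keep the top-N, drop the rest. A sketch under that assumption (the per-doc `scores` table stands in for TF-IDF scores; this is not the LUCENE-1812 implementation):

```python
def carmel_prune(postings, scores, n):
    """Carmel-style static pruning for single-term queries (illustrative).

    `postings`: term -> list of doc IDs; `scores[term][doc]` plays the
    role of doc's score for the one-term query `term`. Only the postings
    of each term's top-N scoring documents survive.
    """
    pruned = {}
    for term, plist in postings.items():
        top = sorted(plist, key=lambda d: -scores[term][d])[:n]
        pruned[term] = sorted(top)
    return pruned
```

Any single-term query against the pruned index returns the same top-N documents as against the full index; as the next slide shows, phrase queries are where this breaks down.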
  18. ... but it's too simplistic
     • Toy example (terms quick, brown, fox; index before vs. after pruning):
       • Query 1: brown – topN(full) == topN(pruned)
       • Query 2: "brown fox" – topN(full) != topN(pruned)
     • Hmm, what about less common queries?
     • 80/20 rule of "good enough"?
     • Term-level pruning is too primitive; alternatives:
       • Document-centric pruning
       • Impact-centric pruning
       • Position-centric pruning
  19. Smarter pruning
     • Not all term positions are equally important
     • Metrics of term and position importance:
       • Plain in-document term frequency (TF)
       • TF-IDF score obtained from the top-N results of a TermQuery (Carmel method)
       • Residual IDF – a measure of term informativeness (selectivity)
       • Key-phrase positions, or term clusters
       • Kullback-Leibler divergence of the term between language models (corpus language model vs. document language model)
  20. Applications
     • Obviously, performance-related
     • Some papers claim a modest impact on quality when pruning up to 60% of postings
       • See LUCENE-1812 for some benchmarks confirming this claim
     • Removal / restructuring of (some) stored content
       • Legacy indexes, or ones created with a fossilized external chain
  21. Stored field pruning
     • Some stored data can be compacted, removed, or restructured
     • Use case: source text for generating "snippets"
       • Split the content into sentences
       • Reorder sentences by a static "importance" score (e.g. how many rare terms they contain)
       • NOTE: this may use collection-wide statistics!
       • Remove the bottom x% of sentences
  22. LUCENE-1812: contrib/pruning tools and API
     • Based on FilterIndexReader
     • Produces output indexes via IndexWriter.addIndexes(IndexReader[])
     • Design:
       • PruningReader – subclass of FilterIndexReader with the necessary boilerplate and hooks for pruning policies
       • StorePruningPolicy – implements rules for modifying stored fields (and the list of field names)
       • TermPruningPolicy – implements rules for modifying the term dictionary, postings and payloads
       • PruningTool – command-line utility to configure and run PruningReader
  23. Details of LUCENE-1812
     • Data flow: source index → PruningReader (StorePruningPolicy filters stored fields; TermPruningPolicy filters the term dict, postings + payloads, term vectors) → IndexWriter.addIndexes(IndexReader...) → target index
     • IndexWriter consumes source data filtered via PruningReader
     • Internal document ID-s are preserved – suitable for bitset ops and retrieval by internal ID:
       • If the source index has no deletions
       • If the target index is empty
  24. API: StorePruningPolicy
     • May remove (some) fields from (some) documents
     • May as well modify the values
     • May rename / add fields
  25. API: TermPruningPolicy
     • Thresholds (in order of precedence): per term, per field, default
     • Plain TF pruning – TFTermPruningPolicy:
       • Removes all postings for a term where TF (in-document term frequency) is below a threshold
     • Top-N term-level – CarmelTermPruningPolicy:
       • TermQuery search for the top-N docs
       • Removes all postings for a term outside the top-N docs
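The plain TF policy is the simplest of the two: keep a posting only when the term occurs in the document at least a threshold number of times. A toy sketch of that rule (illustrative, not the TFTermPruningPolicy code):

```python
def tf_prune(postings, min_tf):
    """TF-threshold pruning over a toy index.

    `postings`: term -> {doc_id: in-document term frequency}. Drop every
    posting whose TF is below `min_tf`; drop terms left with no postings.
    """
    pruned = {}
    for term, docs in postings.items():
        kept = {d: tf for d, tf in docs.items() if tf >= min_tf}
        if kept:                     # terms with no surviving postings vanish
            pruned[term] = kept
    return pruned
```

This tends to keep postings where the term is topically central to the document, which is why term-query recall stays good while phrase recall suffers.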
  26. Results so far...
     • TF pruning:
       • Term query recall very good
       • Phrase query recall very poor – expected...
     • Carmel pruning – slightly better term position selection, but still a heavy negative impact on phrase queries
     • Recognizing and keeping key phrases would help:
       • Use query logs for frequent-phrase mining?
       • Use a collocation miner (Mahout)?
     • Savings from pruning will be smaller, but quality will significantly improve
  27. References
     • Static Index Pruning for Information Retrieval Systems, Carmel et al., SIGIR'01
     • A document-centric approach to static index pruning in text retrieval systems, Büttcher & Clarke, CIKM'06
     • Locality-based pruning methods for web search, de Moura et al., ACM TIS '08
     • Pruning strategies for mixed-mode querying, Anh & Moffat, CIKM'06
  28. Index pruning applied...
     • Index 1: a heavily pruned index that fits in RAM:
       • Excellent speed
       • Poor search quality for many less-common query types
     • Index 2: a slightly pruned index that fits partially in RAM:
       • Good speed, good quality for many common query types
       • Still poor quality for some other rare query types
     • Index 3: the full index on disk:
       • Slow speed
       • Excellent quality for all query types
     • QUESTION: can we come up with a combined search strategy?
  29. Tiered search
     • Tier 1: search box 1 – RAM, 70% pruned
     • Tier 2: search box 2 – SSD, 30% pruned
     • Tier 3: search box 3 – HDD, 0% pruned
     • Strategy: predict the best tier, then evaluate the answer
     • Can we predict the best tier without actually running the query?
     • How do we evaluate whether the predictor was right?
  30. Tiered search: tier selector and evaluator
     • The best tier can be predicted (often enough):
       • Carmel pruning yields excellent results for simple term queries
       • Phrase-based pruning yields good results for phrase queries (though less often)
     • Quality evaluator: when is the predictor wrong?
       • Could be very complex, based on a gold standard and qrels
       • Could be very simple: an acceptable number of results
     • Fall-back strategy:
       • Serial: poor latency, but minimizes load on bulkier tiers
       • Partially parallel: submit only the border-line queries to the next tier; pick the first acceptable answer – reduces latency
  31. Tiered versus distributed
     • Both are applicable to indexes and query loads exceeding single-machine capabilities
     • Distributed sharded search:
       • Increases latency for all queries (send + execute + integrate results from all shards)
       • ... plus replicas to increase QPS: increases hardware / management costs while not improving latency
     • Tiered search:
       • Excellent latency for common queries
       • More complex to build and maintain
       • Arguably lower hardware cost for comparable scale / QPS
  32. Tiered search benefits
     • The majority of common queries are handled by the first tier: RAM-based, high QPS, low latency
     • Partially parallel mode reduces average latency for more complex queries
     • Hardware investment likely smaller than for a distributed search setup of comparable QPS / latency
  33. Example Lucene API for tiered search
     • Could be implemented as a Solr SearchComponent...
  34. Lucene implementation details
  35. References
     • Efficiency trade-offs in two-tier web search systems, Baeza-Yates et al., SIGIR'09
     • ResIn: a combination of results caching and index pruning for high-performance web search engines, Baeza-Yates et al., SIGIR'08
     • Three-level caching for efficient query processing in large Web search engines, Long & Suel, WWW'05
  36. Bit-wise search
     • Given a bit pattern query: 1010 1001 0101 0001
     • Find documents with matching bit patterns in a field
     • Applications:
       • Permission checking
       • De-duplication
       • Plagiarism detection
     • Two variants: non-scoring (filtering) and scoring
  37. Non-scoring bitwise search (LUCENE-2460)
     • Builds a Filter from the intersection of:
       • A DocIdSet of documents matching a Query (e.g. "type:a")
       • An integer value and operation (AND, OR, XOR) applied to a "value source" that caches integer values of a field (from FieldCache)
     • Example: docs 0-4 with flags 0x01..0x05, filtered with op=AND val=0x01
     • Corresponding Solr field type and QParser: SOLR-1913
     • Useful for filtering (not scoring)
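The filtering step can be illustrated with plain integers: intersect the query's candidate set with the documents whose cached field value matches the bit operation. This is a sketch with one plausible semantics for the three operations, not the actual LUCENE-2460 API:

```python
def bitwise_filter(candidates, flags, op, value):
    """Non-scoring bitwise filter over a toy "value source" (sketch).

    `candidates`: doc IDs already matching the base query.
    `flags`: per-document cached integer field values.
    Assumed semantics: AND keeps docs with all query bits set, OR keeps
    docs sharing any query bit, XOR keeps exact pattern matches.
    """
    ops = {
        "AND": lambda f: f & value == value,
        "OR":  lambda f: f & value != 0,
        "XOR": lambda f: f ^ value == 0,
    }
    match = ops[op]
    return {d for d in candidates if match(flags[d])}
```

Matching the slide's example: with flags 0x01..0x05 on docs 0-4, `op=AND val=0x01` keeps exactly the documents whose lowest flag bit is set.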
  38. Scoring bitwise search (SOLR-1918)
     • A BooleanQuery in disguise
     • Solr 32-bit BitwiseField:
       • The analyzer creates bitmask terms from the field value (e.g. 1010 → Y-1000 | N-0100 | Y-0010 | N-0001)
       • Currently supports only a single value per field
     • Creates a BooleanQuery from the query int value, e.g. Q = bits:Y1000 bits:N0100 bits:Y0010 bits:N0001
     • Useful when searching for best-matching (ranked) bit patterns
     • Example: flags D1=1010, D2=1011, D3=0011 – D1 matches 4 of 4 clauses → #1, D2 matches 3 of 4 → #2, D3 matches 2 of 4 → #3
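The scoring variant can be mimicked with sets: turn each bit position into a Y/N "term" and count how many of the query's clauses a document matches, exactly as a BooleanQuery with optional clauses would. A sketch of the idea, not the SOLR-1918 code:

```python
def bitwise_score(query_bits, doc_bits, width=4):
    """Score = number of bit positions where doc and query agree (sketch).

    Each position i becomes a term (Y, i) or (N, i) depending on whether
    the bit is set, so the score is the BooleanQuery-style clause overlap.
    """
    def terms(bits):
        return {("Y" if bits >> i & 1 else "N", i) for i in range(width)}
    return len(terms(query_bits) & terms(doc_bits))


def rank(query_bits, docs, width=4):
    """Rank doc IDs by descending bitwise agreement with the query."""
    return sorted(docs, key=lambda d: -bitwise_score(query_bits, docs[d], width))
```

Running the slide's example (query 1010 against D1=1010, D2=1011, D3=0011) reproduces the ranking D1, D2, D3 with 4, 3 and 2 matching clauses.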
  39. Summary
     • Index post-processing covers a range of useful scenarios:
       • Merging and splitting, remodeling, extracting, moving...
       • Pruning less important data
     • Tiered search + pruned indexes:
       • High performance
       • Practically unchanged quality
       • Less hardware
     • Bitwise search:
       • Filtering by matching bits
       • Ranking by best-matching patterns
  40. Meta-summary
     • Stir your imagination
     • Think outside the box
     • Show some unorthodox uses and practical applications
     • Close ties to scalability, performance, distributed search and query latency
  41. Q&A
  42. Thank you!
  43. Massive indexing with map-reduce
     • Map-reduce indexing models:
       • Google model
       • Nutch model
       • Modified Nutch model
       • Hadoop contrib/indexing model
     • Tradeoff analysis and recommendations
  44. Google model
     • Map(): IN: <seq, docText>; terms = analyze(docText); foreach (term) emit(term, <seq, position>)
     • Reduce(): IN: <term, list(<seq, pos>)>; foreach (<seq, pos>) docId = calculate(seq, taskId); Postings(term).append(docId, pos)
     • Pros: analysis on the map side
     • Cons:
       • Too many tiny intermediate records → Combiner
       • DocID synchronization across map and reduce tasks
     • Lucene: very difficult (impossible?) to create an index this way
  45. Nutch model (also in SOLR-1301)
     • Map(): IN: <seq, docPart>; docId = docPart.get("url"); emit(docId, docPart)
     • Reduce(): IN: <docId, list(docPart)>; doc = luceneDoc(list(docPart)); indexWriter.addDocument(doc)
     • Pros: easy to build a Lucene index
     • Cons:
       • Analysis on the reduce side
       • Many costly merge operations (large indexes built from scratch on the reduce side)
       • (Plus currently needs a copy from local FS to HDFS – see LUCENE-2373)
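The Nutch model's map/shuffle/reduce phases can be walked through in a single process to make the data flow concrete. A toy sketch, assuming each document part is a dict keyed by "url" (real runs use Hadoop and real Lucene documents):

```python
from collections import defaultdict

def nutch_model_index(doc_parts):
    """Single-process walk-through of the Nutch map-reduce indexing model.

    Map: key each document part by its URL.
    Shuffle: group the emitted parts by key, as the framework would.
    Reduce: assemble the full document and "add" it to the index.
    """
    # Map phase: emit (docId, docPart) pairs.
    emitted = [(part["url"], part) for part in doc_parts]
    # Shuffle phase: group values by key.
    grouped = defaultdict(list)
    for doc_id, part in emitted:
        grouped[doc_id].append(part)
    # Reduce phase: one assembled document per URL.
    index = {}
    for doc_id, parts in sorted(grouped.items()):
        doc = {}
        for p in parts:
            doc.update(p)       # merge parts into one Lucene-style document
        index[doc_id] = doc
    return index
```

Note that all the text handling happens in the reduce step here, which is exactly the "analysis on the reduce side" drawback the slide lists; the modified Nutch model moves `analyze()` into the map phase.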
  46. Modified Nutch model (N/A...)
     • Map(): IN: <seq, docPart>; docId = docPart.get("url"); ts = analyze(docPart); emit(docId, <docPart, ts>)
     • Reduce(): IN: <docId, list(<docPart, ts>)>; doc = luceneDoc(list(<docPart, ts>)); indexWriter.addDocument(doc)
     • Pros:
       • Analysis on the map side
       • Easy to build a Lucene index
     • Cons:
       • Many costly merge operations (large indexes built from scratch on the reduce side)
       • (Plus currently needs a copy from local FS to HDFS – see LUCENE-2373)
  47. Hadoop contrib/indexing model
     • Map(): IN: <seq, docText>; doc = luceneDoc(docText); indexWriter.addDocument(doc); emit(random, indexData)
     • Reduce(): IN: <random, list(indexData)>; foreach (indexData) indexWriter.addIndexes(indexData)
     • Pros:
       • Analysis on the map side
       • Many merges on the map side
       • Also supports other operations (deletes, updates)
     • Cons:
       • Serialization is costly; records are big and require more RAM to sort
  48. Massive indexing – summary
     • If you first need to collect document parts → the SOLR-1301 model
     • If you use complex analysis → Hadoop contrib/index
     • NOTE: there is no good integration yet of Solr and the Hadoop contrib/index module...
