
Lucene/Solr 8: The Next Major Release Steve Rowe, Lucidworks

Presented at Activate 2018


  1. 1. Lucene/Solr 8: The next major release • Steve Rowe, Senior Software Developer, Lucidworks • @steven_a_rowe • #Activate18 #ActivateSearch
  2. 2. Agenda • Recent release cadence • 7.X • 8.0 • 8.X [slide graphic: "YOU ARE HERE" marker]
  3. 3. 7.X average: 11 weeks • 6.X average: 10 weeks
  4. 4. 7.X 1. Metrics 2. Autoscaling 3. CDCR 4. Time Routed Aliases 5. Replica types 6. Streaming expressions 7. JSON facet API 8. Configsets / schema 9. Text Analysis / ML 10. Collections API 11. Queries 12. Large index segment merging 13. Replication / recovery / rolling upgrades 14. Block-join / nested docs 15. Miscellaneous
  5. 5. 7.X: Metrics
     • Continuation of 6.X work to support Autoscaling efforts
     • 7.0:
       - Aggregated metrics collected in overseer
       - solrconfig.xml <jmx> ➞ solr.xml <metrics><reporter>
     • 7.1: Prometheus metrics exporter contrib
     • 7.4: /admin/metrics/history API: basic long-term key metric time series aggregation
       • Fixed-width windows at several resolutions
       • Not yet in Admin UI: SOLR-12426
  6. 6. 7.X: Autoscaling
     • 7.0:
       - Preferences and policy DSL: flexible replica placement
         [ { minimize: cores }, { maximize: freedisk } ]
         { replica: "<2", shard: "#EACH", node: "#ANY" }
       - Diagnostics API: return sorted nodes, policy violations
     • 7.1:
       - autoAddReplicas ported to autoscaling framework
       - Add/remove/suspend/resume triggers and listeners
       - Triggers for added and lost nodes
       - ComputePlanAction / ExecutePlanAction
       - /autoscaling/history API: cluster events and actions
     • 7.2:
       - Search rate trigger
       - /autoscaling/suggestions API
       - UTILIZENODE collections API command
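The preference and policy snippets on this slide use relaxed JSON. As a minimal sketch, they could be written as strict JSON request bodies like this. The command names `set-cluster-preferences` and `set-cluster-policy` and the endpoint path are assumptions based on the Solr 7 autoscaling write API, and the host and port are placeholders; nothing is sent over the network here.

```python
import json

# Preferences and policy rule copied from the slide, written as strict JSON.
preferences = [{"minimize": "cores"}, {"maximize": "freedisk"}]
policy = [{"replica": "<2", "shard": "#EACH", "node": "#ANY"}]


def autoscaling_command(name, value):
    """Serialize one autoscaling write command as a JSON request body."""
    return json.dumps({name: value}, sort_keys=True)


# Command names are assumptions (Solr 7 autoscaling write API).
prefs_body = autoscaling_command("set-cluster-preferences", preferences)
policy_body = autoscaling_command("set-cluster-policy", policy)

# Assumed endpoint with a placeholder host; each body would be POSTed here.
AUTOSCALING_URL = "http://localhost:8983/solr/admin/autoscaling"

print(AUTOSCALING_URL)
print(prefs_body)
print(policy_body)
```

The policy rule reads as: for each shard, keep fewer than two replicas on any single node.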
  7. 7. 7.X: Autoscaling
     • 7.3:
       - Simulation framework
       - Arbitrary metric threshold trigger
       - Scheduled trigger
       - Admin UI to display and execute suggestions
  8. 8. 7.X: Autoscaling
     • 7.4:
       - Periodic house-keeping task: cleans up inactive shards
       - Index size trigger: document count or size in bytes
     • 7.5:
       - Policy replica attribute: #ALL, #EQUAL, percentage, range, and floating point values
       - Policy cores attribute: #EQUAL, percentage, range, and floating point values
       - Percentage in freedisk policy attribute
       - Simulation framework: test scaling up to 1 billion docs
  9. 9. 7.X: Cross Data Center Replication
     • 7.2: Support bi-directional syncing of CDCR clusters. This is not active-active, but rather passive-active or active-passive: only one cluster is active at a time.
  10. 10. 7.X: Time Routed Aliases
     • 7.3:
       - Specialization of Solr’s collection alias feature
       - Support time series data, e.g. logs / sensor data
       - Maintain performance under continuous indexing
       - CREATEALIAS: start, interval, retention policy
       - Automatically create new collections
       - Automatically delete old collections (optional)
       - Route updates based on timestamp
       - Search against all aliased collections*
     • 7.5: Preemptively create the next collection when updates are near the latest collection’s end date (optional)
     * Pending optimization: minimize queried collections (SOLR-9562)
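A CREATEALIAS call for a time routed alias might be assembled as in the sketch below. The `router.*` and `create-collection.*` parameter names are recalled from the Solr 7.4/7.5 reference guide and should be checked against the target version; the alias name, timestamp field, and all values are hypothetical. The URL is only built, not requested.

```python
from urllib.parse import urlencode

# Hypothetical alias settings: one collection per day, keep 30 days of data.
params = {
    "action": "CREATEALIAS",
    "name": "logs",                              # hypothetical alias name
    "router.name": "time",
    "router.field": "timestamp_dt",              # hypothetical timestamp field
    "router.start": "NOW/DAY",                   # start of the first collection
    "router.interval": "+1DAY",                  # time span of each collection
    "router.autoDeleteAge": "/DAY-30DAYS",       # optional retention policy
    "router.preemptiveCreateMath": "90MINUTES",  # optional, 7.5: create the next collection early
    "create-collection.collection.configName": "_default",
    "create-collection.numShards": "2",
}

# Placeholder host; urlencode percent-encodes "+" and "/" in the date math.
url = "http://localhost:8983/solr/admin/collections?" + urlencode(params)
print(url)
```

Updates sent to the alias would then be routed by the value of the timestamp field, and searches against the alias would span all of its collections.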
  11. 11. 7.X: Replica types
     • 7.0:
       | Type        | Indexes locally | Supports soft commit & RTG | Pulls segments from leader | Writes to TLog | Can become shard leader | Queryable |
       | NRT         | ✅              | ✅                         |                            | ✅             | ✅                      | ✅        |
       | TLOG leader | ✅              | ✅                         |                            | ✅             | ✅                      | ✅        |
       | TLOG        |                 |                            | ✅                         | ✅             | ✅                      | ✅        |
       | PULL        |                 |                            | ✅                         |                |                         | ✅        |
     • 7.4: Query param to prioritize replicas by type, e.g. shards.preference=replica.type:PULL,replica.type:TLOG
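The `shards.preference` parameter from the 7.4 bullet can be attached to an ordinary select request. The sketch below only builds the URL; the host, the collection name `mycollection`, and the helper name `select_url` are hypothetical.

```python
from urllib.parse import urlencode


def select_url(base, collection, q, replica_types):
    """Build a /select URL that prefers replicas in the given type order."""
    preference = ",".join("replica.type:" + t for t in replica_types)
    query = urlencode({"q": q, "shards.preference": preference})
    return f"{base}/{collection}/select?{query}"


# Prefer PULL replicas, then TLOG replicas, as in the slide's example.
url = select_url("http://localhost:8983/solr", "mycollection", "*:*", ["PULL", "TLOG"])
print(url)
```

Sending queries to PULL replicas first keeps search load away from the replicas that also handle indexing.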
  12. 12. 7.X: Streaming expressions
     • Parallel computation function suite
     • Some use cases: MapReduce, aggregations, parallel SQL, pub/sub messaging, graph traversal, machine learning, statistical programming
     • Each 7.X release has added many new functions
     • 7.5: Ref guide: Math Expressions User Guide
  13. 13. 7.X: JSON Facet API
     • 7.0: Terms facets: added optional refinement support
     • 7.4: Semantic Knowledge Graph support via new relatedness() aggregate function
       • Finds ad-hoc relationships by scoring documents relative to foreground and background document sets
     • 7.5: Heatmap facet support
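As a rough sketch of how `relatedness()` is used, the request body below ranks the values of a terms facet by how strongly they relate to a foreground set compared with a background set. The field name `hobby`, the foreground query, and the use of the `params` block to define `$fore` and `$back` are illustrative assumptions, not taken from the slides.

```python
import json

# Hypothetical request: which hobbies are most related to people aged 35+?
request_body = {
    "query": "*:*",
    "limit": 0,
    "params": {
        "fore": "age:[35 TO *]",  # foreground document set (assumed field)
        "back": "*:*",            # background document set
    },
    "facet": {
        "hobbies": {
            "type": "terms",
            "field": "hobby",      # assumed field name
            "sort": {"r": "desc"}, # rank buckets by relatedness score
            "facet": {"r": "relatedness($fore,$back)"},
        }
    },
}

# Body that would be POSTed to a collection's /query endpoint.
print(json.dumps(request_body, indent=2))
```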
  14. 14. 7.X: Configsets / schema
     • 7.0:
       - _default configset
       - Data-driven schema: auto-guessed text fields indexed 2 ways:
         • tokenized for search
         • strings for sorting/faceting: "*_str" string field, max 256 chars
       - Turn off data-driven schema functionality:
         curl http://host:8983/solr/mycollection/config -d "{ set-user-property: { update.autoCreateFields: false }}"
     • 7.5: Disable configset upload: -Dconfigset.upload.enabled=false
  15. 15. 7.X: Text analysis / machine learning
     • 7.1: Bengali normalizer and stemmer
     • 7.2: Enable off-ZooKeeper storage of large (>1MB) LTR models
     • 7.3: OpenNLP integration: tokenization, POS tagging, phrase chunking, lemmatization, NER, language detection
     • 7.4:
       - ProtectedTermFilterFactory: don’t filter protected terms
       - TaggerRequestHandler (a.k.a. SolrTextTagger): NER
     • 7.5:
       - "nori" Korean morphological text analysis: "*_txt_ko"
       - PhrasesIdentificationComponent: identify and score candidate query phrases based on index statistics
       - UIMA integration removed
  16. 16. 7.X: Collections API
     • 7.3: Add collection level properties similar to cluster properties
     • 7.4: Cluster-wide defaults for numShards, nrtReplicas, tlogReplicas, pullReplicas
     • 7.5:
       - Support co-locating replicas of two or more collections together in a node via the withCollection parameter to the CREATE and MODIFYCOLLECTION commands
       - SPLITSHARD: New split method using hard links: splitMethod=link
         • 3-5 times faster than the original splitMethod=rewrite
         • Slows down replication
         • Increases disk usage on replica nodes
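The two 7.5 additions map onto Collections API parameters. The sketch below builds, but does not send, one SPLITSHARD request with `splitMethod=link` and one CREATE request with `withCollection`. The host, collection names, and shard name are hypothetical.

```python
from urllib.parse import urlencode

BASE = "http://localhost:8983/solr/admin/collections"  # placeholder host


def collections_url(**params):
    """Build a Collections API URL from keyword parameters."""
    return BASE + "?" + urlencode(params)


# Split shard1 of a hypothetical "products" collection using hard links.
split_url = collections_url(
    action="SPLITSHARD", collection="products", shard="shard1", splitMethod="link"
)

# Create a hypothetical "orders" collection whose replicas are placed on
# nodes that also host a replica of the existing "products" collection.
create_url = collections_url(
    action="CREATE", name="orders", numShards="1", withCollection="products"
)

print(split_url)
print(create_url)
```

Given the trade-offs listed on the slide, `splitMethod=link` seems best suited to cases where split time matters more than temporary disk usage on replica nodes.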
  17. 17. 7.X: Queries • 7.1: JSON query DSL
 curl http://localhost:8983/solr/books/query -d ' { query: { bool: { must: [ "title:solr", {lucene: {df: content, query: "lucene solr"}} ], must_not: [ {frange: {u: 3.0, query: ranking}} ]}}}'
  18. 18. 7.X: Queries
     • 7.2: New synonymQueryStyle field type option: enable generation of appropriate queries for hierarchical relations between overlapping terms
       • as_same_term (default): SynonymQuery(bird,robin)
       • pick_best: Dismax(bird,robin)
       • as_distinct_terms: (bird OR robin)
     • 7.4: JSON query DSL: Enable query/filter tagging, e.g. { "#colorfilt" : "color:blue" } equivalent to local-param {!tag=colorfilt}color:blue
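Tagging is mostly useful for multi-select faceting, where a facet ignores the filter on its own field. The sketch below pairs the tagged filter from the slide with a terms facet that excludes it. The `domain.excludeTags` option is recalled from the JSON Facet API documentation, and the helper `as_local_param` is a hypothetical function that only illustrates the equivalence stated on the slide.

```python
import json


def as_local_param(tagged_filter):
    """Render a {"#tag": "query"} filter in the equivalent local-param form."""
    (key, query), = tagged_filter.items()
    return "{!tag=" + key.lstrip("#") + "}" + query


color_filter = {"#colorfilt": "color:blue"}

request_body = {
    "query": "*:*",
    "filter": [color_filter],
    "facet": {
        "colors": {
            "type": "terms",
            "field": "color",
            # Count all colors, not only blue, by ignoring the tagged filter.
            "domain": {"excludeTags": "colorfilt"},
        }
    },
}

print(as_local_param(color_filter))
print(json.dumps(request_body))
```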

  19. 19. 7.X: Large index segment merging
     • Problem: Overly large segments (e.g. as a result of force-merge/optimize) stop being eligible for merging, and can start accumulating >50% deleted documents, wasting space and skewing index stats.
     • 7.5:
       - TieredMergePolicy now respects maxSegmentSizeMB by default when executing force-merge/optimize and expunge-deletes
       - TieredMergePolicy’s reclaimDeletesWeight has been replaced with a new deletesPctAllowed setting to control how aggressively deletes should be reclaimed
  20. 20. 7.X: Replication/recovery/rolling upgrades
     • 7.3: The old Leader-Initiated-Recovery (LIR) implementation is deprecated and replaced
       • To perform a rolling upgrade to Solr 8, you must be on Solr 7.3 or higher
     • 7.4:
       - IndexFetcher now skips fetching identical files
       - Buffering updates are written to a separate TLog
       - Parallel replay of buffering TLogs
  21. 21. 7.X: Block-join / nested documents
     • 7.3: Added filters and excludeTags local-params for {!parent} and {!child} query parsers, usable for multi-select faceting
     • 7.5: WIP: Allow Solr to more faithfully represent deeply nested document relationships, rather than requiring reconstruction based on the flattened list of child docs returned by Solr
  22. 22. 7.X: Miscellaneous
     • 7.3: add-distinct atomic updates
     • 7.4:
       - Ignore large document URP
       - TLog: maxSize auto hard-commit setting (in addition to maxDocs & maxTime)
     • 7.5: Custom cluster properties allowed with ext. prefix
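An `add-distinct` atomic update adds values to a multi-valued field only when they are not already present. The sketch below shows a hypothetical update document and a small local function that mimics the effect for illustration. The document id and field name are made up, and the function is a model of the behavior, not Solr's implementation.

```python
import json


def add_distinct(existing, new_values):
    """Append each new value only if the field does not already contain it."""
    result = list(existing)
    for value in new_values:
        if value not in result:
            result.append(value)
    return result


# Hypothetical atomic update for a multi-valued "tags" field.
update = [{"id": "book1", "tags": {"add-distinct": ["search", "lucene"]}}]
body = json.dumps(update)  # would be POSTed to the collection's /update handler

# If the stored document already has tags ["lucene"], only "search" is added.
merged = add_distinct(["lucene"], update[0]["tags"]["add-distinct"])
print(body)
print(merged)
```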
  23. 23. 8.0 • Autoscaling • Index upgrades • HTTP/2 • Miscellaneous
  24. 24. 8.0: Autoscaling
     • Suggestions API: rebalance options even if no violations
     • Suggestions API: add-replica for lost replicas
     • maxOps limit for index size trigger
     • Autoscaling policy framework will be the default replica placement strategy
  25. 25. 8.0: Index upgrades
     • 7.0: Lucene indexes record the major Lucene version that created the index, and the minimum Lucene version that contributed to segments.
     • 8.0: Version N-2 or older indexes will now fail to open, even if they have been merged into an N-1 index.
       • IndexUpgrader will not upgrade 6.X or earlier indexes
       • Re-indexing will be required to upgrade
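The rule above can be modeled with a single comparison on the major version that created the index. This is a simplified illustration of the slide's statement, not Lucene's actual check, and the function name is hypothetical.

```python
def can_open_index(created_major: int, reader_major: int) -> bool:
    """Return True if a reader of major version reader_major may open the index.

    Simplified model: only the major version that originally created the
    index matters, not the versions that later merged or rewrote it.
    """
    return reader_major - 1 <= created_major <= reader_major


# Created by 6.x, later fully merged by 7.x: still recorded as created by 6.
print(can_open_index(created_major=6, reader_major=8))

# Created by 7.x: one major version back, so 8.x can open it.
print(can_open_index(created_major=7, reader_major=8))
```

Under this model, an index first created on 6.X has to be rebuilt by re-indexing before moving to 8.0, which matches the last bullet on the slide.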
  26. 26. 8.0: HTTP/2
     • May 2018: Mark Miller announced his Star Burst effort: many cleanups and performance enhancements
     • July 2018: Cao Manh Dat took up the HTTP/2 aspects: SOLR-12639
     • Indexing test: 33M docs, 1 shard, 2 replicas (SOLR-12642)
       • Garbage: Leader: 26% less; replica: 76% less
       • Indexing throughput: 54% higher
       • CPU time: Leader: 39% higher; replica: 76% lower
     • Ready to merge back to master, pending release of Jetty 9.4.13, containing SPNEGO HTTP/2 implementation
  27. 27. 8.0: Miscellaneous
     • Lucene: scores must be non-negative
       • Function(Score)Query-s convert negative scores to zero
     • TODO: remove deprecations
       • Trie fields? Removal effectively blocked by:
         • SOLR-12074: Add numeric equivalent to StrField
         • SOLR-11127: Mechanism to migrate the .system collection (a.k.a. blob store) schema from Trie (pre-7.0) to Points (7.0+)
  28. 28. 8.X • Lucene/Solr minimum JDK • Luke: Lucene Toolbox • New Lucene features
  29. 29. 8.X: Lucene/Solr minimum JDK
     • Oracle will end free JDK 8 support in January 2019
     • Both JDK 9 & 10 are already EOL, no more Oracle support
     • JDK 11 will very likely be next minimum supported JDK, no schedule yet
     • Under JDK 9+, Solr’s Hadoop-related functionality has problems, including with Kerberos
     • Uwe Schindler’s Jenkins server tests Lucene/Solr on Oracle 9+10+11+12 JDKs
       • All have higher Solr test failure rates than on JDK 8
  30. 30. 8.X: Luke: UI framework & licensing
     • Andrzej Bialecki: Initial implementation: Thinlet, GPL
     • Mark Harwood: GWT
     • Mark Miller: Apache Pivot
     • Dmitry Kan and Tomoko Uchida took ownership on Github
     • Tomoko Uchida: JavaFX (bundled w/JDK 8)
     • LUCENE-2562: Make Luke a Lucene/Solr Module
     • JavaFX/OpenJFX unbundled from Java 11 JDK, GPL+CPE
     • Tomoko Uchida: Swing (7.5 release available)
  31. 31. 8.X: New Lucene features
     • Index impacts, Block-Max WAND, similarity cleanups
       • Some queries (especially term queries and disjunctions) are much faster when number of hits is not required
     • FeatureField: incorporate static relevance signals, e.g. PageRank
     • Soft deletes
       • Merge policy retains deleted docs according to policy
       • Enables document history, e.g. for time-travel indexes
     • RAMDirectory replaced by ByteBuffersDirectory
  32. 32. Questions?
  33. 33. Thank you! • Steve Rowe, Senior Software Engineer, Lucidworks • @steven_a_rowe • #Activate18 #ActivateSearch
