Petabyte search at scale: understand how DataStax Enterprise (DSE) Search enables complex, real-time, multi-dimensional queries on massive datasets. This talk covers when and why to use DSE Search, best practices, data modeling, and performance tuning and optimization. It also takes a deep dive into how DSE Search operates and into the fundamentals of bitmap indexing.
37. Index Size
• Core index size
• Fields, term frequency, count, and settings
• Check the number of dynamic fields and their term frequency using Luke
• termVectors="false"
• termPositions="false"
• termOffsets="false"
• omitNorms="true"
• Only index fields you intend to search
39. Indexing throughput
• Set autoSoftCommit as high as possible
• Disable all caches except filterCache
• Increase RAM buffer to 512-1024MB
• Enable realtime indexing
• Large heap (20GB) with G1 (CASSANDRA-7486) or CASSANDRA-8150 tuning
• Increase back_pressure_threshold_per_core to 2000-5000
• Set max_solr_concurrency_per_core to number of cores
• Recommend more cores (32)
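The two per-core settings above can be derived from the host rather than typed by hand. The following is a minimal sketch, assuming the settings live in dse.yaml under exactly the names given on the slide; the variable names and the choice of 2000 (the low end of the suggested range) are illustrative only, and the script merely prints the lines instead of editing any file.

```shell
# Number of online CPU cores on this host.
CORES="$(getconf _NPROCESSORS_ONLN)"

# Low end of the 2000-5000 range suggested above (illustrative choice).
BACK_PRESSURE=2000

# Build the two lines as they might appear in dse.yaml (placement assumed).
DSE_YAML_LINES="$(printf 'max_solr_concurrency_per_core: %s\nback_pressure_threshold_per_core: %s' "$CORES" "$BACK_PRESSURE")"

echo "$DSE_YAML_LINES"
```

Review the printed values before copying them into the node configuration; a restart is typically needed for such settings to take effect.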
43. Query Latency and Throughput
• Set autoSoftCommit as high as possible
• Disable all caches except filterCache
• Use docValues for faceted or sorted fields
• Large heap (20GB) with G1 (CASSANDRA-7486) or CASSANDRA-8150 tuning
• Move query parameters to filters
• Use single pass queries where possible
• Recommend more cores (32)
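As an illustration of "move query parameters to filters", the sketch below builds the same search two ways using Solr's standard q and fq parameters. The host, port, core name (demo_ks.events) and field names are hypothetical; the curl command is only printed, not executed.

```shell
# Hypothetical host and core name (DSE Search names cores keyspace.table).
BASE_URL="http://localhost:8983/solr/demo_ks.events/select"

# Before: every restriction sits in the main query, so all clauses are
# scored together and none of them is served from the filterCache.
SLOW_PARAMS="q=body:timeout+AND+status:active+AND+region:emea"

# After: only the free-text clause stays in q; each exact-match restriction
# becomes its own fq parameter, which Solr can cache and reuse independently.
FAST_PARAMS="q=body:timeout&fq=status:active&fq=region:emea"

echo "curl '${BASE_URL}?${FAST_PARAMS}'"
```

Filter queries do not contribute to relevance scoring, which is why this pairs well with keeping the filterCache enabled while disabling the other caches.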
46. CASSANDRA-7486 (G1) Tuning
MAX_HEAP_SIZE="20G"
JVM_OPTS="$JVM_OPTS -XX:+UseG1GC"
JVM_OPTS="$JVM_OPTS -XX:-UseBiasedLocking"
JVM_OPTS="$JVM_OPTS -XX:+PerfDisableSharedMem"
JVM_OPTS="$JVM_OPTS -XX:G1RSetUpdatingPauseTimePercent=5"
# set these to the number of cores
JVM_OPTS="$JVM_OPTS -XX:ConcGCThreads=8"
JVM_OPTS="$JVM_OPTS -XX:ParallelGCThreads=8"
JVM_OPTS="$JVM_OPTS -XX:SurvivorRatio=8"
JVM_OPTS="$JVM_OPTS -XX:G1ReservePercent=15"
JVM_OPTS="$JVM_OPTS -XX:InitiatingHeapOccupancyPercent=25"
JVM_OPTS="$JVM_OPTS -XX:MaxGCPauseMillis=500"
JVM_OPTS="$JVM_OPTS -XX:G1HeapRegionSize=32m"
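The hard-coded thread counts above follow the "number of cores" comment for an 8-core machine. A minimal sketch of deriving them at startup instead, assuming the fragment is sourced from a shell environment script such as cassandra-env.sh; the CORES variable is introduced here for illustration.

```shell
# Start from whatever JVM_OPTS already holds (empty if unset).
JVM_OPTS="${JVM_OPTS:-}"

# Number of online CPU cores on this host.
CORES="$(getconf _NPROCESSORS_ONLN)"

# Size both GC thread pools to the core count, per the comment above.
JVM_OPTS="$JVM_OPTS -XX:ConcGCThreads=${CORES}"
JVM_OPTS="$JVM_OPTS -XX:ParallelGCThreads=${CORES}"

echo "$JVM_OPTS"
```

This keeps one configuration file valid across nodes with different core counts, at the cost of making the effective values visible only at runtime.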
47. DSE 4.7 Improvements
DSP-4477 - Pivot faceting
DSP-4476 - Pagination
DSP-3740 - Live indexing
DSP-4091 - Remove support for stored copy fields
DSP-4703 - Query Solr from Spark
DSP-4518 - Improved memory usage for faceting
DSP-3931 - Filter cache sizing is now global across all segments
DSP-4475 - Verify/Integrate single pass distributed queries (SOLR-5768)
DSP-4072 - Fault-tolerant distributed queries
DSP-3958 - Improve shard routing by taking into account node health factors
DSP-3935 - Implement faceting inside CQL Solr queries
48. DSE vs ElasticSearch
Feature | DSE | ElasticSearch
Replication and multiple datacenters | Based on Cassandra, multi-DC support for free, real-time replication, high availability | Master slave, long replication delay, doesn't do multi-DC well
Scalability | Hundreds of nodes, hundreds of terabytes | Tens of nodes, a couple of terabytes
Data loss possible | No | Yes
Primary Data Store | Yes | No
Operational Complexity | Single system | Multiple systems
Analytics | Yes | No
Dynamic Schema | Sorta | Sorta, slightly easier