Lucene and Bloom-Filtered Segments
Performance improvements to be gained from
“knowing what we don’t know”


                                       Mark Harwood
Benefits
2x speed-up on primary-key lookups

Small speed-up on general text searches (1.06x)

Optimised memory overhead

Minimal impact on indexing speeds

Minimal extra disk space
Approach
       One appropriately sized Bitset is held per segment, per
         Bloom-filtered field

        e.g. 4 segments x 2 filtered fields = 8 bitsets
        Segment 1:  URL  000010001000000101000001   PKey 0010000001001001000001
        Segment 2:  URL  000010001000001            PKey 001000000100001
        Segment 3:  URL  000000001                  PKey 001000001
        Segment 4:  URL  000000001                  PKey 001000001
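The bookkeeping described on this slide can be sketched as a two-level map. This is only an illustration: the class name SegmentFilters and its methods are hypothetical and are not part of Lucene or of the LUCENE-4069 patch.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

// Hypothetical holder: one BitSet per (segment, Bloom-filtered field) pair.
class SegmentFilters {
    private final Map<String, Map<String, BitSet>> bySegment = new HashMap<>();

    void put(String segment, String field, BitSet bits) {
        bySegment.computeIfAbsent(segment, s -> new HashMap<>()).put(field, bits);
    }

    // Returns null when the field is not Bloom-filtered in that segment.
    BitSet get(String segment, String field) {
        Map<String, BitSet> fields = bySegment.get(segment);
        return fields == null ? null : fields.get(field);
    }

    // Total number of bitsets held, e.g. 4 segments x 2 fields = 8.
    int count() {
        int n = 0;
        for (Map<String, BitSet> fields : bySegment.values()) {
            n += fields.size();
        }
        return n;
    }
}
```

Registering URL and PKey filters for four segments gives a count of 8, matching the example above.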
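Fail-fast searches: modified TermInfosReader
                 int hash = searchTerm.hashCode();
                 // mask the sign bit: hashCode() may be negative
                 int bitIndex = (hash & 0x7fffffff) % bitsetSize;
                 // an unset bit means the term is definitely not in this segment
                 if (!bitset.get(bitIndex)) return false;
                 // term might be in index: continue as normal search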


        Segment 1:  URL  000010001000000101000001   PKey 0010000001001001000001

An unset bit guarantees the term is missing from the segment, so the search can be avoided.

This is most effective on fields with many low doc-frequency terms, or in scenarios where query terms often don’t exist in the index.
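A minimal, self-contained sketch of the fail-fast check, assuming a single hash function (String.hashCode()) and java.util.BitSet. The class FailFastFilter and its method names are made up for illustration and do not reflect the actual TermInfosReader changes in the patch.

```java
import java.util.BitSet;

// Single-hash Bloom filter sketch; names are illustrative, not Lucene's API.
class FailFastFilter {
    private final BitSet bits;
    private final int size;

    FailFastFilter(int size) {
        this.size = size;
        this.bits = new BitSet(size);
    }

    private int bitIndex(String term) {
        // Mask the sign bit so a negative hashCode() cannot yield a negative index.
        return (term.hashCode() & 0x7fffffff) % size;
    }

    // Called at indexing time for every term written to the segment.
    void add(String term) {
        bits.set(bitIndex(term));
    }

    // false: the term is definitely absent from the segment.
    // true: the term may be present (false positives are possible).
    boolean mightContain(String term) {
        return bits.get(bitIndex(term));
    }
}
```

A caller would consult mightContain before the normal term lookup and skip the segment entirely when it returns false. A true result still requires the normal search, because two different terms can map to the same bit.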
Memory efficiency
        Bitset sizes are automatically tuned according to:

        1. the volume of terms in the segment

        2. desired saturation settings (a sparser bitset gives fewer
             false positives)
        Segment 1:  URL  000010001000000101000001   PKey 0010000001001001000001
        Segment 2:  URL  000010001000001            PKey 001000000100001
        Segment 3:  URL  000000001                  PKey 001000001
        Segment 4:  URL  000010000                  PKey 001000001
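One way to turn a term count and a saturation target into a bitset size is sketched below. It assumes a single hash function and uses the standard approximation that n terms hashed into m bits set about 1 - exp(-n/m) of them. The class BitsetSizing and the method bitsFor are hypothetical, and the slides do not say which formula the patch actually uses.

```java
// Hypothetical sizing helper; not taken from the LUCENE-4069 patch.
class BitsetSizing {
    // Approximate number of bits m needed so that inserting numTerms terms
    // with one hash function leaves roughly targetSaturation of the bits set.
    // Derived from: expected saturation ~= 1 - exp(-numTerms / m).
    static int bitsFor(int numTerms, double targetSaturation) {
        if (numTerms <= 0) {
            return 1;
        }
        if (targetSaturation <= 0.0 || targetSaturation >= 1.0) {
            throw new IllegalArgumentException("saturation must be in (0, 1)");
        }
        double m = -numTerms / Math.log(1.0 - targetSaturation);
        // Clamp to the largest size an int-indexed bitset can address.
        return (int) Math.min(Integer.MAX_VALUE, Math.ceil(m));
    }
}
```

With one hash function the false-positive rate is roughly equal to the saturation, so a lower target costs more bits per term but rejects more absent terms.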
Indexing: a modified TermInfosWriter
      Term writes are gathered in a large bitset:

00000000000000000000000000000000001000000000000000000000000000010000000

      The final flush operation consolidates the information in the big
      bitset into a suitably compact bitset for storage on disk, based on
      how many set bits were accumulated:

000000000000001000000000010000000

      This re-mapping saves disk space and the RAM required when
      servicing queries.
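The re-mapping step could look like the sketch below, which folds each set bit i of the large bitset onto i % smallSize. This is an assumption about how such a consolidation might work, not the patch's code. The class BitsetFolding and the method fold are hypothetical, and the caller is assumed to have chosen smallSize from the number of accumulated set bits.

```java
import java.util.BitSet;

// Hypothetical flush-time consolidation; not taken from the LUCENE-4069 patch.
class BitsetFolding {
    // Folds a large filter into a smaller one by re-mapping bit i to i % smallSize.
    // Assumes bigSize is an exact multiple of smallSize, so that
    // (hash % bigSize) % smallSize == hash % smallSize for non-negative hashes,
    // and lookups against the small bitset still find every recorded term.
    static BitSet fold(BitSet big, int bigSize, int smallSize) {
        if (smallSize <= 0 || bigSize % smallSize != 0) {
            throw new IllegalArgumentException("bigSize must be a multiple of smallSize");
        }
        BitSet small = new BitSet(smallSize);
        // Visit only the set bits of the large bitset.
        for (int i = big.nextSetBit(0); i >= 0; i = big.nextSetBit(i + 1)) {
            small.set(i % smallSize);
        }
        return small;
    }
}
```

Folding can only merge bits, never clear them, so it introduces no false negatives. The cost is a higher saturation in the smaller bitset.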
Notes
See JIRA LUCENE-4069 for patch to Lucene 3.6

Core modifications pass the existing 3.6 JUnit tests (but without
exercising any Bloom filtering logic).

Benchmarks contrasting Bloom-filtered indexes with non-filtered are
here: http://goo.gl/X7QqU

TODOs
    Currently relies on a field naming convention to introduce a Bloom filter to
    the index (append “_blm” to the indexed field name when writing).

    How to properly declare the need for a Bloom filter? Changes to
    IndexWriterConfig? A new Fieldable/FieldInfo setting? Dare I invoke the
    “schema” word?

    Where to expose tuning settings, e.g. saturation preferences?

    Can we give some up-front hints to TermInfosWriter about the size of the
    segment being written, so the initial choice of BitSet size can be reduced?

    Formal JUnit tests are required to exercise Bloom-filtered indexes and
    confirm there are no false negatives. Can this be covered as part of the
    existing random testing frameworks which exercise various index config options?
