Lucene with Bloom-Filtered Segments
1. Lucene and Bloom-Filtered Segments
Performance improvements to be gained from “knowing what we don’t know”
Mark Harwood
2. Benefits
2x speed-up on primary-key lookups
Small speed-up on general text searches (1.06x)
Optimised memory overhead
Minimal impact on indexing speed
Minimal extra disk space
3. Approach
One appropriately sized bitset is held per segment, per Bloom-filtered field,
e.g. 4 segments x 2 filtered fields = 8 bitsets
[Diagram: one URL bitset and one PKey bitset held for each of Segments 1–4]
4. Fail-fast searches: a modified TermInfosReader
int hash = searchTerm.hashCode();
int bitIndex = hash % bitsetSize;
if (!bitset.contains(bitIndex)) return false;
// term might be in index – continue as normal search
[Diagram: Segment 1 with its URL and PKey bitsets, highlighting an unset bit]
An unset bit guarantees the term is missing from the segment and a search can be avoided.
This is most effective on fields with many low doc-frequency terms, or in scenarios where query terms often don’t exist in the index.
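The fail-fast check above can be fleshed out into a small self-contained sketch. The class and method names here (`SegmentBloomFilter`, `addTerm`, `mayContain`) are illustrative, not the actual LUCENE-4069 classes; `Math.floorMod` is used because `hashCode()` can be negative, which a plain `%` would not handle safely.

```java
import java.util.BitSet;

// Hypothetical sketch of the per-segment, per-field Bloom filter.
// One bit per term hash: an unset bit proves the term is absent,
// so the (more expensive) term-dictionary lookup can be skipped.
public class SegmentBloomFilter {
    private final BitSet bitset;
    private final int bitsetSize;

    public SegmentBloomFilter(BitSet bitset, int bitsetSize) {
        this.bitset = bitset;
        this.bitsetSize = bitsetSize;
    }

    // Index time: record the term by setting its hashed bit.
    public void addTerm(String term) {
        bitset.set(indexFor(term));
    }

    // Query time: false means "definitely not in this segment";
    // true means "might be present - continue the normal search".
    public boolean mayContain(String term) {
        return bitset.get(indexFor(term));
    }

    private int indexFor(String term) {
        // floorMod keeps the index non-negative even for negative hash codes.
        return Math.floorMod(term.hashCode(), bitsetSize);
    }
}
```

Note the one-sided guarantee: `mayContain` can return a false positive (two terms hashing to the same bit) but never a false negative, which is exactly what makes the early `return false` safe.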
5. Memory efficiency
Bitset sizes are automatically tuned according to:
1. the volume of terms in the segment
2. desired saturation settings (more sparse = more accurate)
[Diagram: per-segment URL and PKey bitsets of varying sizes across Segments 1–4]
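One plausible sizing rule, sketched below under stated assumptions (this is illustrative, not the exact LUCENE-4069 heuristic): with a single hash function, at most one bit is set per term, so a bitset of roughly `numTerms / targetSaturation` bits keeps the fraction of set bits, and hence the false-positive rate, at or below the target. Rounding up to a power of two keeps the modulo cheap and lets bitsets halve cleanly.

```java
// Illustrative sizing heuristic for a single-hash Bloom filter.
public class BloomSizing {
    // targetSaturation is the desired max fraction of set bits,
    // e.g. 0.1 means at most ~10% of bits set (sparser = more accurate).
    public static int chooseBitsetSize(int numTerms, double targetSaturation) {
        int required = (int) Math.ceil(numTerms / targetSaturation);
        // Round up to the next power of two.
        int size = 1;
        while (size < required) size <<= 1;
        return size;
    }
}
```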
6. Indexing: a modified TermInfosWriter
Term writes are gathered in a large bitset:
00000000000000000000000000000000001000000000000000000000000000010000000
The final flush operation consolidates the information in the big bitset into a suitably compact bitset for storage on disk, based on how many set bits were accumulated:
000000000000001000000000010000000
This re-mapping saves disk space and the RAM required when servicing queries.
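The flush-time re-mapping can be sketched as a modulo fold (a hypothetical sketch, not the patch’s actual code): every set bit in the big in-memory bitset sets the corresponding bit in the smaller one, so the no-false-negatives guarantee survives the downsizing. This assumes the small size divides the big size (e.g. both powers of two), so that a query-time `hash % smallSize` lands on the same folded bit as the original `hash % bigSize`.

```java
import java.util.BitSet;

public class BitsetDownsizer {
    // Fold a large bitset into a compact one of smallSize bits.
    // Precondition (assumed): smallSize divides the big bitset's size,
    // so (hash % bigSize) % smallSize == hash % smallSize.
    public static BitSet downsize(BitSet big, int smallSize) {
        BitSet small = new BitSet(smallSize);
        // Iterate only over set bits, remapping each one.
        for (int i = big.nextSetBit(0); i >= 0; i = big.nextSetBit(i + 1)) {
            small.set(i % smallSize);
        }
        return small;
    }
}
```

Folding can only merge bits, never clear them, which is why the compact on-disk bitset trades a slightly higher false-positive rate for the disk and RAM savings the slide describes.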
7. Notes
See JIRA LUCENE-4069 for a patch to Lucene 3.6.
Core modifications pass the existing 3.6 JUnit tests (but without exercising any Bloom-filtering logic).
Benchmarks contrasting Bloom-filtered indexes with non-filtered ones are here: http://goo.gl/X7QqU
TODOs
Currently relies on a field-naming convention to introduce a Bloom filter to the index (append “_blm” to the indexed field name when writing).
How to properly declare the need for a Bloom filter? Changes to IndexWriterConfig? A new Fieldable/FieldInfo setting? Dare I invoke the “schema” word?
Where to expose tuning settings, e.g. saturation preferences?
Can we give some up-front hints to TermInfosWriter about the size of the segment being written, so the initial choice of bitset size can be reduced?
Formal JUnit tests are required to exercise Bloom-filtered indexes – no false negatives. Can this be covered as part of the existing random-testing frameworks which exercise various index config options?