Faceting optimizations for Solr

OCTOBER 13-16, 2015 • AUSTIN, TX

Toke Eskildsen
Search Engineer / Solr Hacker
State and University Library, Denmark
@TokeEskildsen / te@statsbiblioteket.dk

3/55
Overview

Web scale at the State and University Library, Denmark

Field faceting 101

Optimizations
− Reuse
− Tracking
− Caching
− Alternative counters

4/55
Web scale for a small web

Denmark
− Consolidation circa 10th century
− 5.6 million people

Danish Net Archive (http://netarkivet.dk)
− Constitution 2005
− 20 billion items / 590TB+ raw data

5/55
Indexing 20 billion web items / 590TB into Solr

Solr index size is 1/9th of real data = 70TB

Each shard holds 200M documents / 900GB
− Shards are built chronologically by a dedicated machine
− Projected 80 shards
− Current build time per shard: 4 days
− Total build time is 20 CPU-core years
− So far only 7.4 billion documents / 27TB in index

6/55
Searching a 7.4 billion documents / 27TB Solr index

SolrCloud with 2 machines, each having
− 16 HT-cores, 256GB RAM, 25 * 930GB SSD
− 25 shards @ 900GB
− 1 Solr/shard/SSD, Xmx=8g, Solr 4.10
− Disk cache 100GB or < 1% of index size

8/55
String faceting 101 (single shard)
counter = new int[ordinals]
for docID: result.getDocIDs()
    for ordinal: getOrdinals(docID)
        counter[ordinal]++
for ordinal = 0 ; ordinal < counter.length ; ordinal++
    priorityQueue.add(ordinal, counter[ordinal])
for entry: priorityQueue
    result.add(resolveTerm(entry.ordinal), entry.count)
ord term counter
0 A 0
1 B 3
2 C 0
3 D 1006
4 E 1
5 F 1
6 G 0
7 H 0
8 I 3
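
The counting loop above can be sketched in plain Java. This is a minimal illustration, not Solr's actual implementation: the class name `SimpleFacet`, the per-document ordinal array `docOrdinals` and the `terms` array are assumptions standing in for the index structures.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical stand-in for single-shard string faceting; not a Solr API.
final class SimpleFacet {
    /** Returns the topN "term=count" strings, highest count first. */
    static List<String> topTerms(int[] hits, int[][] docOrdinals, String[] terms, int topN) {
        int[] counter = new int[terms.length];            // one counter per ordinal
        for (int docID : hits) {
            for (int ordinal : docOrdinals[docID]) {
                counter[ordinal]++;
            }
        }
        // Min-heap on count, holding {ordinal, count}; keeps only the topN best.
        PriorityQueue<int[]> queue =
                new PriorityQueue<>((a, b) -> Integer.compare(a[1], b[1]));
        for (int ordinal = 0; ordinal < counter.length; ordinal++) {  // visits every ordinal
            if (counter[ordinal] == 0) continue;
            queue.add(new int[] {ordinal, counter[ordinal]});
            if (queue.size() > topN) queue.poll();        // drop the current smallest
        }
        List<String> result = new ArrayList<>();
        while (!queue.isEmpty()) {
            int[] entry = queue.poll();                   // smallest first
            result.add(0, terms[entry[0]] + "=" + entry[1]);  // prepend for descending order
        }
        return result;
    }
}
```

Note that the extraction loop touches every ordinal regardless of how many documents matched, which is the cost the later slides address.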

9/55
Test setup 1 (easy start)

Solr setup
− 16 HT-cores, 256GB RAM, SSD
− Single shard 250M documents / 900GB

URL field
− Single String value
− 200M unique terms

3 concurrent “users”

Random search terms

10/55
Vanilla Solr, single shard, 250M documents, 200M values, 3 users

11/55
Allocating and dereferencing 800MB arrays

12/55
Reuse the counter
counter = new int[ordinals]
counter[ordinal]++
<counter is no longer referenced and will be garbage collected at some point>

13/55
Reuse the counter
counter = pool.getCounter()
counter[ordinal]++
pool.release(counter)
Note: The JSON Facet API in Solr 5 already supports reuse of counters
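
A pool along these lines could look as follows. This is a sketch under assumptions: `CounterPool`, `getCounter` and `release` mirror the names on the slide and are not an existing Solr class, and the pool is unbounded for simplicity.

```java
import java.util.ArrayDeque;
import java.util.Arrays;

// Hypothetical counter pool; names follow the slide, not a Solr API.
final class CounterPool {
    private final ArrayDeque<int[]> free = new ArrayDeque<>();
    private final int ordinals;

    CounterPool(int ordinals) {
        this.ordinals = ordinals;
    }

    /** Hands out a zeroed counter, allocating only when the pool is empty. */
    synchronized int[] getCounter() {
        int[] counter = free.poll();                  // null when nothing is pooled
        return counter != null ? counter : new int[ordinals];
    }

    /** Clears the counter and makes it available to the next request. */
    void release(int[] counter) {
        Arrays.fill(counter, 0);                      // still O(ordinals): clearing is not free
        synchronized (this) {
            free.push(counter);
        }
    }
}
```

Reuse avoids repeated allocation and garbage collection of large arrays, but each release still has to clear the whole array.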

14/55
Using and clearing 800MB arrays

15/55
Reusing counters vs. not doing so

16/55
Reusing counters, now with readable visualization

17/55
Reusing counters, now with readable visualization
Why does it always take more than 500ms?

18/55
Iteration is not free
counter[ordinal]++
200M unique terms = 800MB

19/55
Tracking updated counters
ord  counter  tracker
0    0        N/A
1    0        N/A
2    0        N/A
3    0        N/A
4    0        N/A
5    0        N/A
6    0        N/A
7    0        N/A
8    0        N/A

20/55
ord  counter  tracker
0    0        3
1    0        N/A
2    0        N/A
3    1        N/A
4    0        N/A
5    0        N/A
6    0        N/A
7    0        N/A
8    0        N/A
counter[3]++

21/55
ord  counter  tracker
0    0        3
1    1        1
2    0        N/A
3    1        N/A
4    0        N/A
5    0        N/A
6    0        N/A
7    0        N/A
8    0        N/A
counter[3]++
counter[1]++

22/55
ord  counter  tracker
0    0        3
1    3        1
2    0        N/A
3    1        N/A
4    0        N/A
5    0        N/A
6    0        N/A
7    0        N/A
8    0        N/A
counter[3]++
counter[1]++
counter[1]++
counter[1]++

23/55
ord  counter  tracker
0    0        3
1    3        1
2    0        8
3    1006     4
4    1        5
5    1        N/A
6    0        N/A
7    0        N/A
8    3        N/A
counter[3]++
counter[1]++
counter[1]++
counter[1]++
counter[8]++
counter[8]++
counter[4]++
counter[8]++
counter[5]++
counter[3]++
counter[3]++
…
counter[3]++

24/55
if counter[ordinal]++ == 0 && tracked < maxTracked
    tracker[tracked++] = ordinal

if tracked < maxTracked
    for i = 0 ; i < tracked ; i++
        priorityQueue.add(tracker[i], counter[tracker[i]])
else
    for ordinal = 0 ; ordinal < counter.length ; ordinal++
        priorityQueue.add(ordinal, counter[ordinal])

ord  counter  tracker
0    0        3
1    3        1
2    0        8
3    1006     4
4    1        5
5    1        N/A
6    0        N/A
7    0        N/A
8    3        N/A
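
The tracking logic can be sketched as a small Java class. The class name `TrackedCounter` and its methods are hypothetical. The `clear` method is an assumption that goes beyond the slide: it uses the same tracker to reset only the touched counters.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of a counter that remembers which ordinals were touched.
final class TrackedCounter {
    final int[] counter;
    final int[] tracker;
    int tracked = 0;

    TrackedCounter(int ordinals, int maxTracked) {
        counter = new int[ordinals];
        tracker = new int[maxTracked];
    }

    void increment(int ordinal) {
        // Record the ordinal the first time it is seen, while there is room.
        if (counter[ordinal]++ == 0 && tracked < tracker.length) {
            tracker[tracked++] = ordinal;
        }
    }

    /** True when the tracker is full and may have missed ordinals. */
    boolean needsFullScan() {
        return tracked >= tracker.length;
    }

    /** Ordinals with non-zero counts, read from the tracker when it is complete. */
    List<Integer> nonZeroOrdinals() {
        List<Integer> result = new ArrayList<>();
        if (!needsFullScan()) {
            for (int i = 0; i < tracked; i++) result.add(tracker[i]);
        } else {
            for (int ord = 0; ord < counter.length; ord++) {
                if (counter[ord] != 0) result.add(ord);
            }
        }
        return result;
    }

    /** Assumption, not on the slide: reset only touched counters when possible. */
    void clear() {
        if (!needsFullScan()) {
            for (int i = 0; i < tracked; i++) counter[tracker[i]] = 0;
        } else {
            Arrays.fill(counter, 0);
        }
        tracked = 0;
    }
}
```

With a small result set, both extraction and clearing touch only the tracked entries instead of all ordinals. When the tracker fills up, the code falls back to the full scan.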

25/55

26/55
Distributed faceting
Phase 1) All shards perform faceting.
The Merger calculates the top-X terms.
Phase 2) The term counts are requested from the shards that did not return them in phase 1.
The Merger calculates the final counts for the top-X terms.
for term: fineCountRequest.getTerms()
    result.add(term, searcher.numDocs(query(field:term), base.getDocIDs()))
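
The merger's side of the two phases can be sketched as below. This is a simplified illustration, not Solr's merging code. `FacetMerger`, `candidates` and `fineCountRequests` are hypothetical names, and details such as over-requesting are left out.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical merger for two-phase distributed faceting; not a Solr API.
final class FacetMerger {
    /** Phase 1: sums the shard responses and returns the topX terms by summed count. */
    static List<String> candidates(List<Map<String, Integer>> shardResponses, int topX) {
        Map<String, Integer> summed = new HashMap<>();
        for (Map<String, Integer> response : shardResponses) {
            response.forEach((term, count) -> summed.merge(term, count, Integer::sum));
        }
        List<String> terms = new ArrayList<>(summed.keySet());
        // Highest count first; ties are broken alphabetically for determinism.
        terms.sort((a, b) -> {
            int byCount = Integer.compare(summed.get(b), summed.get(a));
            return byCount != 0 ? byCount : a.compareTo(b);
        });
        return terms.subList(0, Math.min(topX, terms.size()));
    }

    /** Phase 2: for each shard, the candidate terms it did not report in phase 1. */
    static List<List<String>> fineCountRequests(List<Map<String, Integer>> shardResponses,
                                                List<String> candidates) {
        List<List<String>> requests = new ArrayList<>();
        for (Map<String, Integer> response : shardResponses) {
            List<String> missing = new ArrayList<>();
            for (String term : candidates) {
                if (!response.containsKey(term)) missing.add(term);
            }
            requests.add(missing);
        }
        return requests;
    }
}
```

Each shard then answers its request list with exact counts, which the merger adds to the phase 1 sums.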

27/55
Test setup 2 (more shards, smaller field)

Solr setup
− 9 shards @ 250M documents / 900GB

domain field
− Single String value
− 1.1M unique terms per shard

1 concurrent “user”

Random search terms

28/55
Pit of Pain™ (or maybe “Horrible Hill”?)

29/55
Fine counting can be slow
Phase 1: Standard faceting
Phase 2:
result.add(term, searcher.numDocs(query(field:term), base.getDocIDs()))

30/55
Alternative fine counting
counter.increment(ordinal)    <- same counting as phase 1, which yields the counters below
result.add(term, counter.get(getOrdinal(term)))
ord counter
0 0
1 3
2 0
3 1006
4 1
5 1
6 0
7 0
8 3

31/55
Using cached counters from phase 1 in phase 2
counter = pool.getCounter(key)
for term: query.getTerms()
    result.add(term, counter.get(getOrdinal(term)))
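
A minimal sketch of that lookup, assuming the filled phase 1 counter is kept under some request key. The class `FineCountCache`, the key format and the `termToOrdinal` map are hypothetical, and a real cache would need a size bound and eviction.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical cache of filled phase-1 counters; not a Solr API.
final class FineCountCache {
    private final Map<String, int[]> cache = new HashMap<>();   // key -> filled counter
    private final Map<String, Integer> termToOrdinal;

    FineCountCache(Map<String, Integer> termToOrdinal) {
        this.termToOrdinal = termToOrdinal;
    }

    /** Called after phase 1 instead of releasing the counter right away. */
    void store(String key, int[] counter) {
        cache.put(key, counter);
    }

    /** Phase 2: answers by array lookup; returns null on a cache miss. */
    Map<String, Integer> fineCounts(String key, List<String> terms) {
        int[] counter = cache.get(key);
        if (counter == null) return null;        // caller falls back to per-term counting
        Map<String, Integer> result = new LinkedHashMap<>();
        for (String term : terms) {
            Integer ordinal = termToOrdinal.get(term);
            result.put(term, ordinal == null ? 0 : counter[ordinal]);  // unknown term: count 0
        }
        return result;
    }
}
```

On a hit, each requested term costs one ordinal lookup and one array read rather than a separate query against the index.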

32/55
Pit of Pain™ practically eliminated

33/55
Pit of Pain™ practically eliminated
Stick figure CC BY-NC 2.5 Randall Munroe xkcd.com

34/55
Test setup 3 (more shards, more fields)

Solr setup
− 23 shards @ 250M documents / 900GB

Faceting on 6 fields
− url: ~200M unique terms / shard
− domain & host: ~1M unique terms each / shard
− type, suffix, year: < 1000 unique terms / shard

35/55
1 machine, 7 billion documents / 23TB total index, 6 facet fields

36/55
High-cardinality can mean different things
Single shard / 250,000,000 docs / 900GB
Field   References     Max docs/term  Unique terms
domain  250,000,000    3,000,000      1,100,000
url     250,000,000    56,000         200,000,000
links   5,800,000,000  5,000,000      610,000,000
2440 MB / counter

37/55
Remember: 1 machine = 25 shards
25 shards / 7 billion / 23TB
Field   References       Max docs/term  Unique terms
domain  7,000,000,000    3,000,000      ~25,000,000
url     7,000,000,000    56,000         ~5,000,000,000
links   125,000,000,000  5,000,000      ~15,000,000,000
60 GB / facet call
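
The memory figures follow from one 4-byte int per unique term. A small worked example, with `CounterMemory` as an illustrative helper name and sizes in decimal units:

```java
// Illustrative arithmetic only: an int[] counter costs 4 bytes per unique term.
final class CounterMemory {
    static long bytesForIntCounters(long uniqueTerms) {
        return uniqueTerms * Integer.BYTES;
    }

    public static void main(String[] args) {
        // links, single shard: 610,000,000 terms
        System.out.println(bytesForIntCounters(610_000_000L) / 1_000_000 + " MB per counter");
        // links, 25 shards on one machine: ~15,000,000,000 terms
        System.out.println(bytesForIntCounters(15_000_000_000L) / 1_000_000_000 + " GB per facet call");
    }
}
```

The two printed values match the 2440 MB and 60 GB stated on the slides for the links field.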

38/55
Different distributions
domain 1.1M url 200M links 600M
High max
Low max
Very long tail
Short tail

39/55
Theoretical lower limit per counter: ⌈log2(max_count + 1)⌉ bits
max=1
max=7
max=2047
max=3
max=63
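
As a worked example of the limit, the bits needed to hold counts up to a given maximum equal the position of the highest set bit. The helper name `bitsRequired` is illustrative and the class is not part of Solr.

```java
// Illustrative helper: minimum number of bits that can represent 0..maxCount.
final class BitsRequired {
    static int bitsRequired(long maxCount) {
        // 64 minus the leading zeros gives the highest set bit; at least 1 bit is used.
        return Math.max(1, 64 - Long.numberOfLeadingZeros(maxCount));
    }

    public static void main(String[] args) {
        for (long max : new long[] {1, 3, 7, 63, 2047}) {
            System.out.println("max=" + max + " -> " + bitsRequired(max) + " bits");
        }
    }
}
```

For the maxima shown on the slide this gives 1, 2, 3, 6 and 11 bits, compared with the 32 bits of a plain int.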

40/55
int vs. PackedInts
Field   int[ordinals]  PackedInts(ordinals, maxBPV)
domain  4 MB           3 MB (72%)
url     780 MB         420 MB (53%)
links   2350 MB        1760 MB (75%)
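
The idea behind packed counters can be sketched with a fixed-width bit array over a long[]. This is a simplified illustration and not Lucene's PackedInts class. The name `PackedCounters` is hypothetical, and maxBPV is assumed to mean the bits needed for the largest count in the field.

```java
// Hypothetical fixed-width packed counter array; not Lucene's PackedInts.
final class PackedCounters {
    private final long[] blocks;
    private final int bitsPerValue;
    private final long mask;

    PackedCounters(int valueCount, int bitsPerValue) {
        if (bitsPerValue < 1 || bitsPerValue > 63) {
            throw new IllegalArgumentException("bitsPerValue must be 1..63");
        }
        this.bitsPerValue = bitsPerValue;
        this.mask = (1L << bitsPerValue) - 1;
        long totalBits = (long) valueCount * bitsPerValue;
        this.blocks = new long[(int) ((totalBits + 63) / 64)];
    }

    long get(int index) {
        long bitPos = (long) index * bitsPerValue;
        int block = (int) (bitPos >>> 6);
        int offset = (int) (bitPos & 63);
        long value = blocks[block] >>> offset;
        if (offset + bitsPerValue > 64) {                 // value spills into the next block
            value |= blocks[block + 1] << (64 - offset);
        }
        return value & mask;
    }

    void set(int index, long value) {
        value &= mask;                                    // values above the maximum wrap silently
        long bitPos = (long) index * bitsPerValue;
        int block = (int) (bitPos >>> 6);
        int offset = (int) (bitPos & 63);
        blocks[block] = (blocks[block] & ~(mask << offset)) | (value << offset);
        if (offset + bitsPerValue > 64) {
            int fit = 64 - offset;                        // bits that fit in the first block
            blocks[block + 1] = (blocks[block + 1] & ~(mask >>> fit)) | (value >>> fit);
        }
    }

    void increment(int index) {
        set(index, get(index) + 1);
    }

    long bytesUsed() {
        return blocks.length * 8L;
    }
}
```

Every counter still uses the width required by the single largest count, which is why fields with one very frequent term save relatively little.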

41/55
n-plane-z counters
Platonic ideal Harsh reality
Plane d
Plane c
Plane b
Plane a

42/55
Plane d
Plane c
Plane b
Plane a
L: 0 ≣ 000000

43/55
Plane d
Plane c
Plane b
Plane a
L: 0 ≣ 000000
L: 1 ≣ 000001

44/55
Plane d
Plane c
Plane b
Plane a
L: 0 ≣ 000000
L: 1 ≣ 000001
L: 2 ≣ 000011

45/55
Plane d
Plane c
Plane b
Plane a
L: 0 ≣ 000000
L: 1 ≣ 000001
L: 2 ≣ 000011
L: 3 ≣ 000101

46/55
Plane d
Plane c
Plane b
Plane a
L: 0 ≣ 000000
L: 1 ≣ 000001
L: 2 ≣ 000011
L: 3 ≣ 000101
L: 4 ≣ 000111
L: 5 ≣ 001001
L: 6 ≣ 001011
L: 7 ≣ 001101
...
L: 12 ≣ 010111

47/55
Comparison of counter structures
Field   int[ordinals]  PackedInts(ordinals, maxBPV)  n-plane-z
domain  4 MB           3 MB (72%)                    1 MB (30%)
url     780 MB         420 MB (53%)                  66 MB ( 8%)
links   2350 MB        1760 MB (75%)                 311 MB (13%)

49/55
I could go on about

Threaded counting

Heuristic faceting

Fine count skipping

Counter capping

Monotonically increasing tracker for n-plane-z

Regexp filtering

50/55
What about huge result sets?

Rare for explorative term-based searches

Common for batch extractions

Threading works poorly as #shards > #CPUs

But how bad is it really?

52/55
Heuristic faceting

Use sampling to guess top-X terms
− Re-use the existing tracked counters
− 1:1000 sampling seems usable for the field links, which has 5 billion references per shard

Fine-count the guessed terms
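
The two steps can be sketched as follows. This is an illustration under assumptions, not the talk's implementation: `HeuristicFacet` and its methods are hypothetical, sampling simply takes every n-th hit, and fine counting is modelled as intersecting a per-term posting list with the hit set.

```java
import java.util.BitSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.IntStream;

// Hypothetical sketch of sampling followed by exact counting of the guesses.
final class HeuristicFacet {
    /** Counts only every sampleEvery-th hit and returns the topX ordinals by sampled count. */
    static int[] guessTopOrdinals(int[] hits, int[][] docOrdinals, int ordinals,
                                  int sampleEvery, int topX) {
        int[] sampled = new int[ordinals];
        for (int i = 0; i < hits.length; i += sampleEvery) {
            for (int ordinal : docOrdinals[hits[i]]) sampled[ordinal]++;
        }
        return IntStream.range(0, ordinals)
                .filter(ord -> sampled[ord] > 0)
                .boxed()
                .sorted((a, b) -> Integer.compare(sampled[b], sampled[a]))  // highest first
                .limit(topX)
                .mapToInt(Integer::intValue)
                .toArray();
    }

    /** Exact counts for the guessed ordinals: postings[ordinal] intersected with the hits. */
    static Map<Integer, Integer> fineCount(int[] guessed, int[][] postings, BitSet hitSet) {
        Map<Integer, Integer> exact = new LinkedHashMap<>();
        for (int ordinal : guessed) {
            int count = 0;
            for (int docID : postings[ordinal]) {
                if (hitSet.get(docID)) count++;
            }
            exact.put(ordinal, count);
        }
        return exact;
    }
}
```

The counts returned are exact for the guessed terms. What sampling can get wrong is which terms make it into the guess.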

53/55
Over-provisioning helps validity

55/55
Never enough time, but talk to me about

Threaded counting

Monotonically increasing tracker for n-plane-z

Regexp filtering

Fine count skipping

Counter capping

56/55
Extra info
The techniques presented can be tested with sparse faceting, available as a plug-in replacement WAR for Solr 4.10 at https://tokee.github.io/lucene-solr/. A version for Solr 5 will eventually be implemented, but the timeframe is unknown.
There are currently no plans for incorporating the full feature set in the official Solr distribution. The suggested approach for incorporation is to split it into multiple independent or semi-independent features, starting with those applicable to most people, such as the distributed faceting fine count optimization.
In-depth descriptions and performance tests of the different features can be found at https://sbdevel.wordpress.com.

57/55
18M documents / 50GB, facet on 5 fields (2*10M values, 3*smaller)

58/55
6 billion docs / 20TB, 25 shards, single machine
facet on 6 fields (1*4000M, 2*20M, 3*smaller)

59/55
7 billion docs / 23TB, 25 shards, single machine
facet on 5 fields (2*20M, 3*smaller)
