6. Search Infrastructure at Etsy
Parallel clusters ‘flip’ & ‘flop’ - dark/live
Listings index:
2013: unsharded Solr, one large JVM
2014: locally sharded Solr, 8 smaller JVMs
Big win on the latency tail.
7. Speeding up search
• Low-level improvements: use less CPU
• Parallelize: use more cores
8. Amdahl’s law
“The speedup of a program using multiple processors in parallel computing is
limited by the time needed for the sequential fraction of the program.”
Wikipedia
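As a quick numeric illustration of the law (a sketch, not from the talk): with 90% of the work parallelizable, eight cores yield under 5x, and no number of cores beats 10x.

```java
// Amdahl's law: speedup(p, n) = 1 / ((1 - p) + p / n), where p is the
// parallelizable fraction of the program and n is the number of processors.
public class Amdahl {
    static double speedup(double p, int n) {
        return 1.0 / ((1.0 - p) + p / n);
    }

    public static void main(String[] args) {
        System.out.printf("8 cores:  %.2fx%n", speedup(0.9, 8));         // ~4.71x
        System.out.printf("1M cores: %.2fx%n", speedup(0.9, 1_000_000)); // approaches 10x
    }
}
```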
10. Why not shard?
New challenges arise once you go distributed, some examples:
• More moving parts and failure modes to deal with
• Missing features with distributed search
• Index statistics on shards can vary, distorting IDF
Tempting to defer sharding when index size doesn't yet demand it.
12. Collectors!
• Hits are accumulated - 'collected' - by the Collector abstraction.
• Invoked for every hit that matches the Query.
• Has the Scorer available to get the score for current hit if needed.
• Output - e.g. top N hits, number of hits, grouped hits - can be retrieved when done.
13. Existing solution
IndexSearcher(IndexReaderContext, ExecutorService)
Special-cased for:
TopScoreDocCollector (sort by score)
and
TopFieldCollector (arbitrary sort-spec)
Only ships with parallel search support for the above TopDocs collectors.
14. Not composable
Difficult to build parallelization for every possible permutation.
With Solr you may have:
TimeLimitingCollector
|— MultiCollector
|— TopScoreDocCollector
|— DocSetCollector
15. Sync considered harmful
protected void search(
List<LeafReaderContext> leaves,
Weight weight,
Collector collector
) throws IOException {
// TODO: should we make this
// threaded...? the Collector could be sync'd?
// always use single thread:
for (LeafReaderContext ctx : leaves) { // search each subreader
…
collect() called for every single document that matches the query.
Can expect a lot of contention!
IndexSearcher.java
17. API review: Before
public interface Collector {
LeafCollector getLeafCollector(LeafReaderContext context) throws IOException;
}
public interface LeafCollector {
void setScorer(Scorer scorer) throws IOException;
void collect(int doc) throws IOException;
boolean acceptsDocsOutOfOrder();
}
18. New methods: Collector
public interface Collector {
LeafCollector getLeafCollector(LeafReaderContext context) throws IOException;
// NEW METHODS:
boolean isParallelizable();
void setParallelized();
void done() throws IOException;
}
19. New methods: LeafCollector
public interface LeafCollector {
void setScorer(Scorer scorer) throws IOException;
void collect(int doc) throws IOException;
boolean acceptsDocsOutOfOrder();
// NEW METHOD:
void leafDone() throws IOException;
}
20. Opt-in
Collector.isParallelizable()
Every Collector in the chain needs to be parallelizable, so the work can be tackled one collector at a time.
public class MultiCollector implements Collector {
…
@Override
public boolean isParallelizable() {
for (Collector c: collectors) {
if (!c.isParallelizable()) {
return false;
}
}
return true;
}
…
}
21. Don’t penalize serial
Collector.setParallelized()
A 'heads-up' to the Collector about whether collection will be parallelized, so it can adapt in case the parallelism-friendly approach carries unnecessary cost in the serial case.
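A hypothetical counter sketching how a collector might use that heads-up (everything besides the setParallelized()/leafDone() names is invented for illustration): in serial mode the shared field is updated directly with no merge step; when parallelized, each leaf keeps a private count merged on the primary thread.

```java
// Illustrative sketch, not Lucene code: a hit counter that adapts to the
// setParallelized() signal so the serial path pays no merge overhead.
public class AdaptiveHitCounter {
    private boolean parallelized;
    private int totalHits;

    public void setParallelized() { this.parallelized = true; }

    public int getTotalHits() { return totalHits; }

    public LeafCounter newLeafCounter() { return new LeafCounter(); }

    public class LeafCounter {
        private int leafHits; // leaf-private, only used when parallelized

        public void collect(int doc) {
            if (parallelized) {
                leafHits++;   // no contention: nothing shared is touched per hit
            } else {
                totalHits++;  // serial: write shared state directly, no merge later
            }
        }

        /** Guaranteed to run on the primary search thread. */
        public void leafDone() {
            if (parallelized) {
                totalHits += leafHits; // single, synchronization-free merge point
            }
        }
    }
}
```

A real collector would branch once per leaf (returning different LeafCollector implementations) rather than testing the flag on every collect() call; the per-hit if is kept here for brevity.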
22. Non-blocking constructs
Guaranteed to always execute on the primary search thread:
(existing) LeafCollector Collector.getLeafCollector()
(new) void LeafCollector.leafDone()
(new) void Collector.done()
=> safe places to act on shared mutable state
23. New search strategy
IndexSearcher(IndexReaderContext, SearchStrategy)
IndexSearcher.search() factored into:
• SerialSearchStrategy
• ParallelSearchStrategy(Executor e, int parallelism)
• parallelism throttles the maximum number of concurrent tasks at the request level
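A sketch of how such a strategy might be wired up (an assumed implementation with stand-in interfaces, not the actual LUCENE-5299 patch): worker threads run only the hot collect() loop, while getLeafCollector(), leafDone() and done() stay on the primary thread, and a Semaphore enforces the per-request parallelism cap.

```java
import java.util.List;
import java.util.concurrent.CompletionService;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Semaphore;

// Stand-in interfaces mirroring the proposed API; each "leaf" is modeled
// as just an array of matching doc IDs in this sketch.
public class ParallelSearchSketch {

    public interface LeafCollector { void collect(int doc); void leafDone(); }
    public interface Collector { LeafCollector getLeafCollector(int leafOrd); void done(); }

    public static void search(List<int[]> leaves, Collector collector,
                              ExecutorService executor, int parallelism) {
        Semaphore throttle = new Semaphore(parallelism); // request-level task cap
        CompletionService<LeafCollector> completed = new ExecutorCompletionService<>(executor);
        try {
            for (int ord = 0; ord < leaves.size(); ord++) {
                int[] leafHits = leaves.get(ord);
                throttle.acquire();
                // getLeafCollector() on the primary thread, per the API contract:
                LeafCollector leafCollector = collector.getLeafCollector(ord);
                completed.submit(() -> {
                    try {
                        for (int doc : leafHits) leafCollector.collect(doc); // worker thread
                    } finally {
                        throttle.release();
                    }
                    return leafCollector;
                });
            }
            // leafDone() and done() run back on the primary thread,
            // so they are safe places to touch shared mutable state:
            for (int i = 0; i < leaves.size(); i++) completed.take().get().leafDone();
            collector.done();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}
```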
24. Parallel search - not just collection
• Scoring is thread-safe and segment-level.
• Collection is also segment-level, but typically computes its outcome as shared state across leaves, e.g. the TopDocs over your whole index.
• By making Collector API parallelism-friendly, we can parallelize search as a whole.
25. Stupidly parallelizable
public class TotalHitCountCollector implements Collector {
private int totalHits;
@Override
public LeafCollector getLeafCollector(LeafReaderContext context) throws IOException {
return new LeafCollector() {
private int totalHits = 0;
..
@Override
public void collect(int doc) throws IOException {
totalHits++;
}
..
@Override
public void leafDone() throws IOException {
TotalHitCountCollector.this.totalHits += totalHits;
}
};
}
..
@Override
public boolean isParallelizable() {
return true;
}
}
26. Fun to parallelize
Solr DocSetCollector
populates document IDs in a FixedBitSet(maxDoc) - internally a long[]

leaf | docBase | maxDoc | docId range
  0  |    0    |   42   | [0–41]
  1  |   42    |   20   | [42–61]
To address a possible race condition at segment boundaries, when parallelized:
• collect() the first and last 64 document IDs of each segment into LeafCollector-private longs; all others go into the shared bitset.
• at leafDone(), merge these boundary document IDs into the shared bitset.
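The boundary trick can be sketched with a plain long[] standing in for FixedBitSet (an illustrative, simplified class, not Solr's actual DocSetCollector; it assumes every leaf covers more than 128 docs so the two boundary ranges never overlap):

```java
// Sketch (assumed, simplified): a shared long[] bitset written lock-free by
// concurrent leaves. Interior 64-bit words are touched by exactly one leaf;
// words straddling a segment boundary are only written from leafDone().
public class BoundarySafeBitSetSketch {
    final long[] words; // shared bitset over all docs in the index

    public BoundarySafeBitSetSketch(int maxDoc) {
        words = new long[(maxDoc + 63) >>> 6];
    }

    public boolean get(int doc) {
        return (words[doc >>> 6] & (1L << (doc & 63))) != 0;
    }

    public LeafBits newLeaf(int docBase, int maxDoc) { return new LeafBits(docBase, maxDoc); }

    public class LeafBits {
        final int docBase, maxDoc;
        long firstWord, lastWord; // leaf-private bits for the two boundary ranges

        LeafBits(int docBase, int maxDoc) { this.docBase = docBase; this.maxDoc = maxDoc; }

        public void collect(int doc) { // doc is index-global here
            if (doc < docBase + 64) {
                firstWord |= 1L << (doc - docBase);                // leaf-private
            } else if (doc >= docBase + maxDoc - 64) {
                lastWord |= 1L << (doc - (docBase + maxDoc - 64)); // leaf-private
            } else {
                words[doc >>> 6] |= 1L << (doc & 63); // interior word: owned by this leaf
            }
        }

        /** Runs on the primary search thread: safe to touch shared boundary words. */
        public void leafDone() {
            for (int i = 0; i < 64; i++) {
                if ((firstWord & (1L << i)) != 0) set(docBase + i);
                if ((lastWord & (1L << i)) != 0) set(docBase + maxDoc - 64 + i);
            }
        }

        private void set(int doc) { words[doc >>> 6] |= 1L << (doc & 63); }
    }
}
```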
27. Bigger tradeoffs
Lucene's TopScoreDocCollector uses a single priority queue in the serial case.
When parallelized:
• More memory: a lazy pool of HitQueues - grab one in getLeafCollector(), return it in leafDone(), merge them in done().
• More computation: besides the merge step itself, with multiple priority queues hits that won't ultimately make the cut are less likely to be discarded immediately.
30. Replay testing
Replayed traffic from Etsy listing search request logs to an experimental cluster
running LUCENE-5299 changes in an unsharded setup.
[Chart: p95 and p99 latency, serial vs. parallel]
31. Throughput
In general, the system needs to do more work overall, which impacts throughput:
• concurrency overhead
• context switches
• locally optimal choices at the leaf-level
• merge cost
[Chart: user CPU %, serial vs. parallel]
32. Sharding comparison
Segment-level parallelism                    | Sharding
Limited to a single JVM.                     | Scalable across JVMs.
Distributed search not required.             | Distributed search required.
Sensitive to segment count and sizing.       | Index shards can be kept similarly sized.
                                             | Prone to "shard lag": limited by the slowest shard.
In-process merging is cheaper.               | Merge cost higher due to serialization.
Existing solution has limited applicability; | Tried-and-tested approach.
LUCENE-5299 solution not in trunk.           |
34. Next steps
• Figure out whether the serial penalty is real.
• Semantics around exceptions during collection and ‘done’ callbacks.
• Lots more collectors can be made parallelizable.
• Your contributions welcome - LUCENE-5299.
• Committer interest especially welcome!