Lucene @ Yelp provides various search services using Lucene, including business search, phone search, list search, review search, and auto-completion. The services were originally too slow because searching the large index required seeking across hard disks. The solution was to shard the index across multiple machines ("federation") and have a coordinator retrieve and combine results. Lucy is used for indexing and searching individual shards efficiently. Statistical modeling and simulations showed that fewer hits need to be retrieved from each shard to obtain the overall top results than the naive approach retrieves.
20. Federation
1. Split index across multiple machines
2. Shard on business id
3. TF-IDF scores from different machines should be
comparable
21. Mapping businesses to shards
1. Assigning businesses to shards
shard = shardlist[hash(business_id) % len(shardlist)]
Problems
1. Involves re-indexing all the businesses if we want to add a
new shard
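A minimal sketch of the modulo mapping above and of why adding a shard is expensive. It assumes string business ids and swaps Python's built-in hash() for an MD5-based one, because hash() on strings is salted per process; the names stable_hash, shard_for, and fraction_moved are illustrative, not from the talk.

```python
import hashlib

def stable_hash(business_id: str) -> int:
    """Process-independent hash (built-in hash() is salted per run for strings)."""
    digest = hashlib.md5(business_id.encode("utf-8")).hexdigest()
    return int(digest, 16)

def shard_for(business_id: str, shardlist: list) -> str:
    """The slide's rule: shardlist[hash(business_id) % len(shardlist)]."""
    return shardlist[stable_hash(business_id) % len(shardlist)]

def fraction_moved(business_ids, old_shards, new_shards) -> float:
    """Share of businesses whose shard changes when the shard list changes."""
    moved = sum(
        1 for b in business_ids
        if shard_for(b, old_shards) != shard_for(b, new_shards)
    )
    return moved / len(business_ids)

# Hypothetical ids and shard names, for illustration only.
business_ids = [f"biz-{i}" for i in range(10000)]
old_shards = ["shard0", "shard1", "shard2", "shard3"]
new_shards = old_shards + ["shard4"]

moved = fraction_moved(business_ids, old_shards, new_shards)
print(f"{moved:.0%} of businesses change shard")
```

With a uniform hash, going from 4 to 5 shards leaves a business in place only when its hash gives the same remainder modulo 4 and modulo 5, which is about one case in five. Most businesses therefore land on a different shard, so in practice the whole index has to be rebuilt.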
24. Lucy Master Slave Architecture
Separate indexing (masters)
A master for each shard of a service
Searching (slaves)
A slave for every replica of a service
32. Federator: Combining results across shards
1. Once we distribute an index across shards we need a
component which will search all these shards and combine
their results.
2. Written in Python (runs inside a python web process).
3. Uses Tornado IO loop to send requests to all shards.
4. The transfer protocol for the requests is JSON-RPC
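The fan-out described above might look roughly like the sketch below. It is not Yelp's federator: the standard-library asyncio event loop stands in for the Tornado IO loop so the example runs without extra packages, the shards are simulated by a coroutine instead of real network calls, and the JSON-RPC 2.0 envelope, the "search" method name, and the parameter names are all assumptions.

```python
import asyncio
import json

def make_request(query: str, num_hits: int, request_id: int) -> str:
    """Build a JSON-RPC 2.0 style request body (method and params are hypothetical)."""
    return json.dumps({
        "jsonrpc": "2.0",
        "method": "search",
        "params": {"query": query, "num_hits": num_hits},
        "id": request_id,
    })

async def fake_shard(shard_id: int, body: str) -> str:
    """Stand-in for a network call to one shard; returns a JSON-RPC style response."""
    request = json.loads(body)
    await asyncio.sleep(0)  # yield to the event loop, as real I/O would
    hits = [
        {"business_id": f"s{shard_id}-b{i}", "score": 1.0 / (i + 1 + shard_id)}
        for i in range(request["params"]["num_hits"])
    ]
    return json.dumps({"jsonrpc": "2.0", "result": hits, "id": request["id"]})

async def federate(query: str, num_hits: int, num_shards: int) -> list:
    """Send the same query to every shard concurrently and combine the hits."""
    bodies = [make_request(query, num_hits, i) for i in range(num_shards)]
    responses = await asyncio.gather(
        *(fake_shard(i, body) for i, body in enumerate(bodies))
    )
    hits = [hit for r in responses for hit in json.loads(r)["result"]]
    hits.sort(key=lambda h: h["score"], reverse=True)  # best score first
    return hits[:num_hits]

top = asyncio.run(federate("pizza", num_hits=3, num_shards=4))
for hit in top:
    print(hit["business_id"], round(hit["score"], 3))
```

The point of the event loop is that all shard requests are in flight at once, so the federator waits roughly as long as the slowest shard, not the sum of all shards.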
37. Executing queries
1. Gather the top results for a query
2. Collect attribute statistics for attributes such as places and categories
38. Lucene
1. Efficiently executes queries over the index
2. Scores how relevant the business is to the words in the
query (word score)
3. Upgrading Lucene to 2.9/3.1 is WIP
43. Efficiently Retrieving top k hits
1. When user moves through multiple pages the number of
hits to be returned increases
num hits = start + count
2. So if we need to retrieve 500 hits the naive way would be to
retrieve 500 hits from each shard and then sort them
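The naive approach can be sketched as follows. This only illustrates the baseline the slide describes; the (score, business_id) tuple layout and the function name naive_top_hits are assumptions, and the scores are made up.

```python
import heapq
from itertools import islice

def naive_top_hits(shards, start, count):
    """Fetch start + count hits from every shard, merge them, return one page.

    `shards` is a list of per-shard hit lists; each hit is a
    (score, business_id) tuple (an assumed layout).
    """
    num_hits = start + count
    # Each shard returns its own best num_hits hits, highest score first.
    per_shard = [
        sorted(hits, key=lambda h: h[0], reverse=True)[:num_hits]
        for hits in shards
    ]
    # Merge the already-sorted per-shard lists in descending score order.
    merged = heapq.merge(*per_shard, key=lambda h: h[0], reverse=True)
    # Skip the first `start` hits and keep the requested page.
    return list(islice(merged, start, num_hits))

# Made-up scores for three shards.
shards = [
    [(0.9, "a"), (0.4, "b"), (0.1, "c")],
    [(0.8, "d"), (0.7, "e")],
    [(0.95, "f"), (0.2, "g")],
]
page = naive_top_hits(shards, start=2, count=2)
print(page)  # [(0.8, 'd'), (0.7, 'e')]
```

The cost is that every shard ships up to start + count hits, so the federator receives up to that many hits times the number of shards, even though only count of them are shown.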