Lucene at Yelp: Federating Index and Retrieving Top K Hits Efficiently

Yelp runs several search services on Lucene, including business search, phone search, list search, review search, and auto-completion. Search was originally too slow because queries against the large index required hard-disk seeks. The solution was to shard the index across multiple machines ("federation") and have a coordinator, the federator, query every shard and combine the results. Lucy handles indexing and searching on the individual shards. Statistical modeling and simulation showed that fewer hits need to be retrieved from each shard than the naive approach requests: 44 instead of 100 per shard for the top 100 hits.

Bio

1. Over a decade of experience in information retrieval
2. Used IR techniques at Symantec's DLP group
3. Search Engineer at Yelp

Outline

1. Overview of search services at Yelp
2. Federation Motivation
3. Lucy Indexing
4. Lucy Searching
5. Efficiently Retrieving top k hits

Hard Disk Seek Latency
Disk seek 10,000,000 ns

Source: Software Engineering Advice from Building Large-Scale Distributed Systems, Jeffrey Dean

RAM read latency

Main memory reference 100 ns

Pinning Index in RAM

● vmtouch
● mlock
● http://hoytech.com/vmtouch/

Problem

Index is too large to fit in memory on a single machine

Geographical Sharding drawbacks

1. Cumbersome manual process to determine shard boundary
2. No guarantee that a boundary can be found.

Federation

1. Split index across multiple machines
2. Shard on business id
3. TF-IDF scores from different machines should be comparable

Mapping businesses to shards

1. Assigning businesses to shards

shard = shardlist[hash(business_id) % len(shardlist)]

Problems

1. Involves re-indexing all the businesses if we want to add a new shard
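The re-indexing cost of the modulo scheme can be estimated with a small sketch. It assumes a process-independent hash (SHA-256 via hashlib, chosen here only because Python's built-in hash() for strings varies between processes); the business ids, shard names, and helper names are made up for illustration.

```python
import hashlib

def stable_hash(business_id: str) -> int:
    """Process-independent hash; SHA-256 is an assumption, not Yelp's choice."""
    digest = hashlib.sha256(business_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

def shard_for(business_id: str, shardlist: list) -> str:
    # Same rule as on the slide: hash modulo the number of shards.
    return shardlist[stable_hash(business_id) % len(shardlist)]

def fraction_moved(business_ids, old_shards, new_shards) -> float:
    """Share of businesses whose shard changes between two shard lists."""
    moved = sum(
        1 for b in business_ids
        if shard_for(b, old_shards) != shard_for(b, new_shards)
    )
    return moved / len(business_ids)

# Hypothetical data: 10,000 businesses, growing from 4 to 5 shards.
business_ids = [f"biz-{i}" for i in range(10_000)]
four = ["shard0", "shard1", "shard2", "shard3"]
five = four + ["shard4"]
moved = fraction_moved(business_ids, four, five)
print(f"{moved:.0%} of businesses change shard")
```

With a uniform hash, a business keeps its shard only when its hash gives the same remainder modulo 4 and modulo 5, which holds for 4 of every 20 hash values, so roughly four in five businesses would have to be re-indexed on a different shard.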

Virtual Nodes

Advantages

1. Flexibility (move vbuckets from one shard to another)
2. Split hot spot shards
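A minimal sketch of how virtual buckets (vbuckets) could provide this flexibility: a business hashes to one of a fixed number of vbuckets, and a separate table maps vbuckets to shards. The vbucket count, the SHA-256 hash, the round-robin initial table, and all names are assumptions made for illustration; the slides do not give these details.

```python
import hashlib

NUM_VBUCKETS = 1024  # assumed value; it stays fixed as shards come and go

def vbucket_for(business_id: str) -> int:
    """Map a business to a vbucket with a process-independent hash."""
    digest = hashlib.sha256(business_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_VBUCKETS

def build_table(shards):
    """Initial vbucket -> shard table, assigned round-robin."""
    return {vb: shards[vb % len(shards)] for vb in range(NUM_VBUCKETS)}

def shard_for(business_id, table):
    return table[vbucket_for(business_id)]

table = build_table(["shard0", "shard1", "shard2", "shard3"])
business_ids = [f"biz-{i}" for i in range(10_000)]
before = {b: shard_for(b, table) for b in business_ids}

# Add a fifth shard by moving every fifth vbucket to it.
new_table = dict(table)
for vb in range(0, NUM_VBUCKETS, 5):
    new_table[vb] = "shard4"

after = {b: shard_for(b, new_table) for b in business_ids}
moved = [b for b in business_ids if before[b] != after[b]]
print(f"{len(moved) / len(business_ids):.0%} of businesses moved")
```

Only businesses in the reassigned vbuckets move, and they all move to the new shard. A hot shard could be split the same way, by handing some of its vbuckets to another machine.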

Lucy Master Slave Architecture

Separate indexing (masters)
A master for each shard of a service

Searching (slaves)
A slave for every replica of a service

Federator: Combining results across shards

1. Once we distribute an index across shards, we need a component that searches all these shards and combines their results.
2. Written in Python (runs inside a Python web process).
3. Uses the Tornado IO loop to send requests to all shards.
4. The transfer protocol for the requests is JSON-RPC.
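The fan-out and merge described above can be sketched as follows. The slides name the Tornado IO loop and JSON-RPC; this sketch swaps in the standard-library asyncio and in-memory stand-ins so that it runs without dependencies. The shard names, scores, and function names are hypothetical.

```python
import asyncio
import heapq

# Hypothetical in-memory shards: lists of (score, business_id).
SHARDS = {
    "shard0": [(0.91, "biz-a"), (0.40, "biz-d")],
    "shard1": [(0.75, "biz-b"), (0.10, "biz-e")],
    "shard2": [(0.88, "biz-c")],
}

async def search_shard(name, query, num_hits):
    """Stand-in for one remote call to a shard; the query is ignored here."""
    await asyncio.sleep(0)  # yield to the event loop, as a network call would
    hits = sorted(SHARDS[name], reverse=True)
    return hits[:num_hits]

async def federated_search(query, num_hits):
    # Fan out: all shard requests are in flight at the same time.
    per_shard = await asyncio.gather(
        *(search_shard(name, query, num_hits) for name in SHARDS)
    )
    # Combine: keep the best num_hits across every shard's answer.
    all_hits = [hit for hits in per_shard for hit in hits]
    return heapq.nlargest(num_hits, all_hits)

top = asyncio.run(federated_search("pizza", 3))
print(top)
```

Merging by raw score only works if scores from different shards are comparable, which is the requirement the Federation slide places on TF-IDF scores.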

Executing queries

1. Gather the top results for a query
2. Collect attribute statistics for attributes like places and categories

Lucene

1. Efficiently executes queries over the index
2. Scores how relevant a business is to the words in the query (word score)
3. Upgrading Lucene to 2.9/3.1 is a work in progress

Efficiently Retrieving top k hits

1. When a user moves through multiple pages, the number of hits to be returned increases

num hits = start + count

2. So if we need to retrieve 500 hits, the naive way would be to retrieve 500 hits from each shard and then sort them
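The naive approach can be sketched as below: every shard returns its best start + count hits, the federator merges them, and only the last count hits form the requested page. The data layout (per-shard lists of (score, id) sorted best first) and the function name are assumptions made for illustration.

```python
import heapq
from itertools import islice

def naive_page(shard_results, start, count):
    """Return (page, hits_transferred) for the naive strategy.

    shard_results: one list per shard of (score, business_id), best first.
    """
    num_hits = start + count
    # Each shard sends its own top num_hits, whatever the other shards hold.
    truncated = [hits[:num_hits] for hits in shard_results]
    transferred = sum(len(hits) for hits in truncated)
    # Merge the already-sorted lists and cut out the requested page.
    merged = heapq.merge(*truncated, reverse=True)
    page = list(islice(merged, start, num_hits))
    return page, transferred

# Hypothetical data: 5 shards with 1,000 hits each and distinct scores.
shards = [
    sorted(((i * 5 + s, f"biz-{i * 5 + s}") for i in range(1000)), reverse=True)
    for s in range(5)
]
page, transferred = naive_page(shards, start=490, count=10)
print(len(page), "hits on the page,", transferred, "hits transferred")
```

For the 500-hit case on five shards, 2,500 hits cross the network to produce a page of 10, which is the overhead the statistical approach in the following slides reduces.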

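Binomial Distribution
Probability that r of the top k hits are in a particular shard, where p is the probability that a hit is in that shard: $P(X = r) = \binom{k}{r} p^r (1 - p)^{k - r}$

Mean: $kp$

Variance: $kp(1 - p)$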
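The binomial model can be evaluated numerically with the standard library. The sketch below uses k = 100 and p = 0.2 from the Simulation slide and computes how likely a shard is to hold more than n of the top k hits. The function names are illustrative. The per-shard limits 24, 32, and 44 are taken from that slide; they happen to lie 1, 3, and 6 standard deviations above the mean, but whether the slide derived them that way is an assumption, since its formula did not survive.

```python
import math

def binomial_pmf(r, k, p):
    """P(X = r): exactly r of the top k hits fall in one shard."""
    return math.comb(k, r) * p**r * (1 - p) ** (k - r)

def tail_probability(n, k, p):
    """P(X > n): the shard holds more top-k hits than the n we request."""
    return sum(binomial_pmf(r, k, p) for r in range(n + 1, k + 1))

k, p = 100, 0.2
mean = k * p
std_dev = math.sqrt(k * p * (1 - p))
print(f"mean={mean:.1f} std_dev={std_dev:.1f}")

for n in (24, 32, 44):
    sigmas = (n - mean) / std_dev
    print(f"n={n} ({sigmas:.1f} std devs above mean): "
          f"P(X > n) = {tail_probability(n, k, p):.3g}")
```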

Simulation

k = 100, p = 0.2

Formula    Hits selected from each shard    Results missed (%)
           24                               0.017
           32                               0.0001407
           44                               0.00000
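A simulation of this kind can be sketched as a small Monte Carlo experiment. It is not the author's code. It assumes each of the top k hits lands on one of 5 shards uniformly at random (matching p = 0.2), counts a hit as missed when its shard holds more top-k hits than the per-shard limit, and reports missed hits as a fraction of k. The trial count, seed, and function name are arbitrary choices for illustration.

```python
import random
from collections import Counter

def missed_fraction(k, num_shards, per_shard_limit, trials=2000, seed=0):
    """Average fraction of the true top k lost to the per-shard limit."""
    rng = random.Random(seed)  # seeded so repeated runs agree
    missed = 0
    for _ in range(trials):
        # Place each of the top k hits on a uniformly random shard.
        counts = Counter(rng.randrange(num_shards) for _ in range(k))
        # A shard can only return per_shard_limit hits; the rest are missed.
        missed += sum(max(0, c - per_shard_limit) for c in counts.values())
    return missed / (k * trials)

for limit in (24, 32, 44, 100):
    print(limit, missed_fraction(k=100, num_shards=5, per_shard_limit=limit))
```

Requesting all k hits from every shard can never miss a result, so the limit of 100 serves as a sanity check; the smaller limits trade a small miss rate for less data transferred.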

Results

1. ~50% savings over 100 hits (44 hits requested from each shard)
2. 77% savings over 1000 hits (228 hits requested from each shard)

Future work

1. In-memory index
2. Move towards real time search

Lucene at Yelp: Federating Index and Retrieving Top K Hits Efficiently

1. Lucene @ Yelp Sudarshan Gaikaiwari

2. Bio 1. Over a decade of experience in information retrieval 2. Used IR techniques at Symantec's DLP group 3. Search Engineer at Yelp

3. Outline 1. Overview of search services at Yelp 2. Federation Motivation 3. Lucy Indexing 4. Lucy Searching 5. Efficiently Retrieving top k hits

4. The services we provide

5. Lucy: business search

6. Lucy also powers phone search

7. Cathy: she 'talks' a lot

8. Listsearch: it searches lists....

9. Reviewsearch: it searches reviews....

10. DYM: did you really mean that?

11. Suggest: auto completion

12. Federation Motivation

13. Problem Search is too slow

14. Hard Disk Seek Latency Disk seek 10,000,000 ns Source: Software Engineering Advice from Building Large-Scale Distributed Systems, Jeffrey Dean

15. RAM read latency Main memory reference 100 ns

16. Pinning Index in RAM ● vmtouch ● mlock ● http://hoytech.com/vmtouch/

17. Problem Index is too large to fit in memory on a single machine

18. Geographical sharding

19. Geographical Sharding drawbacks 1. Cumbersome manual process to determine shard boundary 2. No guarantee that a boundary can be found.

20. Federation 1. Split index across multiple machines 2. Shard on business id 3. TF-IDF scores from different machines should be comparable

21. Mapping businesses to shards 1. Assigning businesses to shards shard = shardlist[hash(business_id) % len(shardlist)] Problems 1. Involves re-indexing all the businesses if we want to add a new shard

22. Virtual Nodes

23. Advantages 1. Flexibility (move vbuckets from one shard to another) 2. Split hot spot shards

24. Lucy Master Slave Architecture Separate indexing (masters) A master for each shard of a service Searching (slaves) A slave for every replica of a service

25. Lucy Indexing

26.

27.

28.

29.

30. Lucy Searching

31.

32. Federator: Combining results across shards 1. Once we distribute an index across shards, we need a component that searches all these shards and combines their results. 2. Written in Python (runs inside a Python web process). 3. Uses the Tornado IO loop to send requests to all shards. 4. The transfer protocol for the requests is JSON-RPC.

33. Lucy Server

34.

35.

36. Tokens to Business Attributes

37. Executing queries 1. Gather the top results for a query 2. Collect attribute statistics for attributes like places and categories

38. Lucene 1. Efficiently executes queries over the index 2. Scores how relevant a business is to the words in the query (word score) 3. Upgrading Lucene to 2.9/3.1 is a work in progress

39.

40. Successive geobounds relaxation

41. Successive geobounds relaxation

42. Federation

43. Efficiently Retrieving top k hits 1. When user moves through multiple pages the number of hits to be returned increases num hits = start + count 2. So if we need to retrieve 500 hits the naive way would be to retrieve 500 hits from each shard and then sort them

44. Distribution of hits in shards

45.

46. Probability a hit is in a shard

47. Binomial Distribution Probability (r of top k hits) are in a particular shard Mean Variance

48. Formula Std Deviation Formula

49. Simulation Formula Hits selected from each Results Missed (%) shard k = 100 p = 0.2 24 0.017 32 0.0001407 44 0.00000

50. Simulation Graph

51. Results 1. ~ 50% savings over 100 hits (44 hits requested from each shard) 2. 77% savings over 1000 hits (228 hits requested from each shard)

52. Future work 1. In memory index 2. Move towards real time search

53. Come Join Us!

54. Thank You smg@yelp.com
