Comparing Distributed Indexing: To MapReduce or Not?
Richard McCreadie, Craig Macdonald, Iadh Ounis
Talk Outline
MOTIVATIONS
Why is Efficient Indexing Important?

Collection   Data    Year   Docs   Size (GB)
WT2G         Web     1999   240k   2.0
GOV          Web     2002   1.8M   18.0
Blogs06      Blogs   2006   3M     13.0
GOV2         Web     2004   25M    425.0
ClueWeb09    Web     2009   1.2B   25,000
Solutions? MapReduce
Contributions
CLASSICAL INDEXING
Classical Indexing
(I.H. Witten, A. Moffat and T.C. Bell, 1999)
[Figure: inverted index structure. Lexicon entry: term, total docs, total frequency, pointer. Posting list entry: document number, frequency.]
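The lexicon and posting list fields named on this slide can be illustrated with a small sketch. This is only an assumption-laden toy, not the structure used by any particular system: the function names `build_index` and `lookup` are hypothetical, tokenisation is plain whitespace splitting, and the "pointer" is modelled as an offset into an in-memory list standing in for the on-disk posting file.

```python
from collections import Counter, defaultdict


def build_index(docs):
    """Build a toy lexicon and posting file from {doc_id: text}."""
    postings = defaultdict(list)  # term -> [(doc_id, frequency), ...]
    for doc_id in sorted(docs):
        for term, freq in Counter(docs[doc_id].lower().split()).items():
            postings[term].append((doc_id, freq))

    posting_file = []  # flat list standing in for the on-disk posting file
    lexicon = {}       # term -> (total_docs, total_frequency, pointer)
    for term in sorted(postings):
        plist = postings[term]
        pointer = len(posting_file)  # offset of this term's first posting
        posting_file.extend(plist)
        lexicon[term] = (len(plist), sum(f for _, f in plist), pointer)
    return lexicon, posting_file


def lookup(lexicon, posting_file, term):
    """Follow the lexicon pointer to read one term's posting list."""
    total_docs, _, pointer = lexicon[term]
    return posting_file[pointer:pointer + total_docs]
```

Each lexicon entry records how many postings to read and where they start, so a lookup touches only the slice of the posting file that belongs to the queried term.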
Single-Pass In-Memory Indexing
[Figure: Indexer, % used RAM, terms t1, t2, t3, compressed files on disk, final inverted index.]
How can Indexing be Distributed?
Problems with Classical Approaches
MAPREDUCE INDEXING
MapReduce
Indexing with MapReduce
Dean & Ghemawat, MapReduce: Simplified data processing on large clusters. OSDI 2004
Emit each word in the document. A big intermediate sort. Lots of merging.

Approach      Emits           Sorting   Num emits per map   Emit size
D&G_Token     Tokens          Lots      Lots                Tiny
D&G_Term      Terms           Lots      Many                Tiny
Nutch         Documents       Little    Some                Average
Single-Pass   Posting lists   Some      Few                 Large
D&G_Token & D&G_Term
D&G_Token: emit every token. D&G_Term: emit every term.
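The difference between emitting every token and emitting every term can be sketched as two map functions run through a tiny in-process stand-in for the shuffle. This is a minimal illustration, not the authors' or Hadoop's implementation: the names `map_token`, `map_term`, `reduce_token`, `reduce_term` and `run` are hypothetical, and the assumption that the term variant carries a per-document frequency with each emit is ours.

```python
from collections import Counter, defaultdict


def map_token(doc_id, text):
    """D&G_Token style (assumed): one emit per token occurrence."""
    for token in text.lower().split():
        yield token, doc_id


def map_term(doc_id, text):
    """D&G_Term style (assumed): one emit per unique term, with its frequency."""
    for term, tf in Counter(text.lower().split()).items():
        yield term, (doc_id, tf)


def reduce_token(term, doc_ids):
    # Term frequencies must be recovered by counting repeated doc ids.
    return term, sorted(Counter(doc_ids).items())


def reduce_term(term, postings):
    # Postings already carry frequencies; only ordering is needed.
    return term, sorted(postings)


def run(docs, mapper, reducer):
    """Group map output by key, reduce each group, and count the emits."""
    groups = defaultdict(list)
    emits = 0
    for doc_id, text in docs.items():
        for key, value in mapper(doc_id, text):
            groups[key].append(value)
            emits += 1
    index = dict(reducer(k, v) for k, v in sorted(groups.items()))
    return index, emits
```

Both variants produce the same posting lists, but the term variant emits fewer pairs whenever a document repeats a word, which is consistent with the "Lots" versus "Many" entries in the comparison table.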
Nutch Style Indexing
Emit every document
Our Single-Pass Indexing Strategy
Emit limited posting lists
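One way to picture "emit limited posting lists" is a map task that buffers postings across documents and flushes partial posting lists only when a memory budget is reached. The sketch below is written under stated assumptions rather than taken from the talk: the memory budget is modelled as a simple count of buffered postings (`max_postings`, a hypothetical parameter), and `single_pass_map` and `single_pass_reduce` are illustrative names.

```python
from collections import Counter, defaultdict


def single_pass_map(docs, max_postings=4):
    """Yield (term, partial_posting_list) pairs, flushing when the buffer fills.

    docs is an iterable of (doc_id, text) pairs in increasing doc_id order.
    """
    buffer = defaultdict(list)
    buffered = 0
    for doc_id, text in docs:
        for term, tf in Counter(text.lower().split()).items():
            buffer[term].append((doc_id, tf))
            buffered += 1
        if buffered >= max_postings:  # stand-in for "RAM is nearly full"
            yield from sorted(buffer.items())
            buffer, buffered = defaultdict(list), 0
    if buffer:  # final flush at the end of the map task
        yield from sorted(buffer.items())


def single_pass_reduce(emitted):
    """Merge partial posting lists for each term into one full list."""
    merged = defaultdict(list)
    for term, partial in emitted:
        merged[term].extend(partial)
    return {term: sorted(plist) for term, plist in sorted(merged.items())}
```

Compared with emitting once per token or per term, each emit here is larger but there are far fewer of them, which matches the "Few" and "Large" entries for Single-Pass in the comparison table.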
Our MapReduce Indexing Strategy (2)
EXPERIMENTATION & RESULTS
Research Questions
Evaluation of MapReduce Indexing
Experimental Setup
Target Indexing Throughput

Table 1: Throughput (MB/sec) when indexing .GOV2 with m machines

                              Number of Machines Allocated
Indexing Strategy             1       2       4       6       8
Shared-Nothing Distributed    3       6       12      18      24
Shared-Corpus Distributed
MapReduce D&G_Token
MapReduce D&G_Term
MapReduce Single-Pass
Baseline Indexing Throughput

Table 1: Throughput (MB/sec) when indexing .GOV2 with m machines

                              Number of Machines Allocated
Indexing Strategy             1       2       4       6       8
Shared-Nothing Distributed    3       6       12      18      24
Shared-Corpus Distributed     2.44    4.6     12.8    12.4    12.8
MapReduce D&G_Token
MapReduce D&G_Term
MapReduce Single-Pass
D&G_Token Indexing Throughput

Table 1: Throughput (MB/sec) when indexing .GOV2 with m machines

                              Number of Machines Allocated
Indexing Strategy             1       2       4       6       8
Shared-Nothing Distributed    3       6       12      18      24
Shared-Corpus Distributed     2.44    4.6     12.8    12.4    12.8
MapReduce D&G_Token           -       -       -       -       -
MapReduce D&G_Term
MapReduce Single-Pass
D&G_Term Indexing Throughput

Table 1: Throughput (MB/sec) when indexing .GOV2 with m machines

                              Number of Machines Allocated
Indexing Strategy             1       2       4       6       8
Shared-Nothing Distributed    3       6       12      18      24
Shared-Corpus Distributed     2.44    4.6     12.8    12.4    12.8
MapReduce D&G_Token           -       -       -       -       -
MapReduce D&G_Term            1.15    1.59    4.01    4.71    6.38
MapReduce Single-Pass
Single-Pass Indexing Throughput

Table 1: Throughput (MB/sec) when indexing .GOV2 with m machines

                              Number of Machines Allocated
Indexing Strategy             1       2       4       6       8
Shared-Nothing Distributed    3       6       12      18      24
Shared-Corpus Distributed     2.44    4.6     12.8    12.4    12.8
MapReduce D&G_Token           -       -       -       -       -
MapReduce D&G_Term            1.15    1.59    4.01    4.71    6.38
MapReduce Single-Pass         2.59    5.19    9.45    13.16   17.31
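The throughput figures in Table 1 can be turned into speedup and scaling efficiency relative to a single machine. The short sketch below does that arithmetic; the figures are copied from the table, while the names `THROUGHPUT`, `speedup` and `efficiency` are hypothetical, and "efficiency" is taken here to mean speedup divided by machine count, which is our assumption about a useful summary rather than a metric reported in the talk.

```python
# Throughput in MB/sec, copied from Table 1 (.GOV2); keys are machine counts.
THROUGHPUT = {
    "Shared-Corpus Distributed": {1: 2.44, 2: 4.6, 4: 12.8, 6: 12.4, 8: 12.8},
    "MapReduce D&G_Term": {1: 1.15, 2: 1.59, 4: 4.01, 6: 4.71, 8: 6.38},
    "MapReduce Single-Pass": {1: 2.59, 2: 5.19, 4: 9.45, 6: 13.16, 8: 17.31},
}


def speedup(row, machines):
    """Throughput with `machines` machines relative to one machine."""
    return row[machines] / row[1]


def efficiency(row, machines):
    """Speedup divided by machine count (1.0 would be linear scaling)."""
    return speedup(row, machines) / machines


if __name__ == "__main__":
    for name, row in THROUGHPUT.items():
        for m in sorted(row):
            print(f"{name:28s} m={m}  speedup={speedup(row, m):5.2f}  "
                  f"efficiency={efficiency(row, m):4.2f}")
```

On these figures the single-pass strategy is about 6.7 times faster on 8 machines than on 1, while the shared-corpus baseline stops improving beyond 4 machines.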
CONCLUSION
Conclusions
Questions?

Editor's notes

  1. © Terrier Development Team, University of Glasgow, 2005