SlideShare una empresa de Scribd logo
1 de 21
“Join Algorithms in MapReduce Framework”
Shrihari A Rathod
• Introduction
• Objectives
• Join Algorithm Selection
• Discussions (Comparison, Advantages, Issues , Applications)
• References
AGENDA
INTRODUCTION
MapReduce:
• Large scale data processing in parallel
• Two phases in MapReduce
• Read a lot of Data
• Map: Extract something you care about from each record
• Shuffle and Sort
• Reduce: Aggregate, Summarize, Filter or Transform
• Write the results
• Outline stays the same, map and reduce change to fit the problem
INTRODUCTION
MapReduce A Brief
Map Phase
• map(in_key, in_value) → list(out_key, intermediate_value)
• Processes input key value pairs
• Produces set of intermediate pairs
Reduce Phase
• reduce(out_key,list(intermediate_value)) → list(out_value)
• Combines all intermediate value for a particular key.
• Produces a set of merged output values(usually just one)
INTRODUCTION
Hadoop:
• Apache Hadoop is an open-source software framework that allows for the distributed
processing of large datasets across clusters of commodity computers using a simple
programming model.
• Hadoop is an open source implementation of Google MapReduce, GFS (distributed file
system)
• Supports running of applications on large clusters of commodity hardware.
• Task are divided into Map-Reduce framework
• Provides a distributed file system that stores data on the compute nodes.
INTRODUCTION
• The Master nodes oversee the two key
functional pieces that make up Hadoop:
storing lots of data (HDFS), and running
parallel computations on all that data (Map
Reduce).
• The Name Node oversees & coordinates the
data storage function (HDFS), while the Job
Tracker oversees & coordinates the parallel
processing of data using Map Reduce.
• Slave Nodes do all the dirty work of storing
the data and running the
computations. Each slave runs both a Data
Node and Task Tracker daemon that
communicate with and receive instructions
from their master nodes.
• The Task Tracker daemon is a slave to the
Job Tracker, the Data Node daemon &
Name Node.
INTRODUCTION
• Two stages- a ‘Map stage’ and a ‘Reduce stage’.
• A mapper’s job during Map Stage is to “read” the data from join tables and
to “return” the ‘join key’ and ‘join value’ pair into an intermediate file.
• Further, in the shuffle stage, this intermediate file is then sorted and merged.
• The reducer’s job during reduce stage is to take this sorted result as input & complete the
task of join.
Join Operation:
OBJECTIVE
Objectives:
The contributions of this seminar can be summarized as follows:
• There is a great controversy over which join implementation the user should choose.
• Analysis of MapReduce framework, make a comparison between different ways to implement
join in MapReduce. Examine the existing algorithms for joining datasets using Map/Reduce.
• Analysis of affecting factors to MapReduce, and study the new influence of these factors
when implementing join in MapReduce
• This seminar describes various methods for Joining datasets using Hadoop MapReduce &
proposes a strategy to find out best suitable join processing algorithm.
Map Side Join
A map-side join takes place when the
data is joined before it reaches the map
function.
• A given key has to be in the same
partition in each dataset so that all
partitions that can hold a certain key
are joined together. For this to work,
all datasets should be partitioned
using the same partitioner and
moreover, the number of partitions in
each dataset should be identical.
• The sort order of the data in each
dataset must be identical. This
requires that all datasets must be
sorted using the same comparator.
JOIN ALGORITHMS
Reduce Side Join
Map Phase
The ‘map’ phase only pre-processes the tuples of the two datasets to organize them in terms of
the join key.
Partitioning and Grouping Phase
The partitioner partitions the tuples among the reducers based on the join key such that all tuples
from both datasets having the same key go to the same reducer. The reduce function is called
once for a key and the list of values associated with it. This list of values is generated by
grouping together all the tuples associated with the same key. We are sending a composite
TextPair key and hence the Reducer will consider (key, tag) as a key
Reduce Phase
The framework sorts the keys (in this case, the composite TextPair key) and passes them on with
the corresponding values to the Reducer. Since the sorting is done on the composite key (primary
sort on Key and secondary sort on Tag), tuples from one table will all come before the other table.
JOIN ALGORITHMS
Reduce Side Join
JOIN ALGORITHMS
Repartition Join [8]
Map Phase :
• Each map task works on a split of either R or L.
• Each map task tags the record with its originating table.
• Outputs the extracted join key and the tagged record as a (key, value) pair.
• The outputs are then partitioned, sorted and merged by the framework.
Reducer Phase :
• All the records for each join key are grouped together and eventually fed to a reducer.
• For each join key, the reduce function first separates and buffers the input records into two
sets according to the table tag.
• Performs a cross-product between records in the above sets.
JOIN ALGORITHMS
Repartition Join [8]
JOIN ALGORITHMS
Broadcast Join [8]
• Broadcast join is used there need to
join the smaller data set with large data
set.
• With the assumption that smaller data
set will fit into memory easily.
• The smaller data set is pushed into
distributed cache and replicated among
the cluster nodes.
• In init method of mapper a hash
map(or other associative array) is built
from smaller data set.
• And join operation is performed in
map method and output is emitted.
JOIN ALGORITHMS
Trojan Join [8]
• Trojan Join supports more
effective join by assuming
we know the schema and
workload
• Idea is to co-partition the
data at load time.
• Joins are locally processed
within each node at query
time, but we are free to
group the data on any
attribute other than the join
attribute in same mapreduce
job
JOIN ALGORITHMS
Replicated Join [8]
JOIN ALGORITHMS
Decision Tree for Join Algorithm Selection [8]
JOIN ALGORITHM SELECTION
Schema & Join Condition Known
No Yes
Two Relation Join Trojan Join
Yes No
Cost effective Is Replication Efficient
Yes No
Broadcast Join Selectivity No Yes
Less High
Optimized Broadcast Join Repartition Join Replicated Join
DISSCUSSIONS
Criteria for Join Algorithm Selection
• Prior knowledge of schema and join condition
• Transmission over network
• Number of tuples
DISSCUSSIONS
Comparison of Join processing methods
Join Type MapReduce
jobs
Advantages Issues
Repartition 1 MapReduce
job
Simple
implementation
of Reduce phase
Sorting and movement of
tuples over network
Broadcast 1 Map phase No sorting and
movement of tuples
Useful only if one relation
is small.
Trojan 1 Map phase Uses schema
knowledge
Useful, if join conditions
are known
Replicated 1 MapReduce
job
Efficient for Star
join
and Chain join
For large relations more
number of reducers /
replicas are required
REFERENCES
[1] J. Dean and S. Ghemawat. Mapreduce: simplified data processing on large clusters.
Commun. ACM, 51(1):107–113, 2008.
[2] Fuhui Wu et al. “Comparison & Performance Analysis of Join Approach in MapReduce”
ISCTCS 2012, CCIS 320, pp. 629–636[8]
[3] Marko Lalić et al. “Comparison of a Sequential & a MapReduce Approach to Joining
Large Datasets” MIPRO 2013, pp.1289-1291[3]
[4] Spyros, B., Jignesh, M.P., Vuk, E., Jun, R., Eugene, J., Yuanyuan, T.: A Comparison of Join
Algorithms for Log Processing in MapReduce. In: SIGMOD 2010, June 6–11. ACM,
Indianapolis (2010)
[5] Foto N. Afrati et al. “Optimizing Multiway Joins in a Map-Reduce Environment” IEEE
Transactions On Knowledge And Data Engineering, pp. 1282- 1298[17], 2011
[6] Alper Okcan et al. “Processing Theta-Joins using MapReduce” SIGMOD‟11, June 12–16,
2011 pp. 949-951[12]
[7] Xiaofei Zhang et al. “Efficient Multiway Theta Join Processing Using MapReduce”
Proceedings of the VLDB Endowment, Vol. 5, No. 11, pp.1184-1196[12]
[8] Anwar Shaikh et al. “Join Query Processing in MapReduce Environment” CNC 2012,
LNICST , pp.275-281[7]
THANK YOU !

Más contenido relacionado

La actualidad más candente

Git and Github slides.pdf
Git and Github slides.pdfGit and Github slides.pdf
Git and Github slides.pdfTilton2
 
Git the Docs: A fun, hands-on introduction to version control
Git the Docs: A fun, hands-on introduction to version controlGit the Docs: A fun, hands-on introduction to version control
Git the Docs: A fun, hands-on introduction to version controlBecky Todd
 
Git 101: Git and GitHub for Beginners
Git 101: Git and GitHub for Beginners Git 101: Git and GitHub for Beginners
Git 101: Git and GitHub for Beginners HubSpot
 
Challenges of Conventional Systems.pptx
Challenges of Conventional Systems.pptxChallenges of Conventional Systems.pptx
Challenges of Conventional Systems.pptxGovardhanV7
 
Go logging using eBPF
Go logging using eBPFGo logging using eBPF
Go logging using eBPFZain Asgar
 
Fine-tuning Group Replication for Performance
Fine-tuning Group Replication for PerformanceFine-tuning Group Replication for Performance
Fine-tuning Group Replication for PerformanceVitor Oliveira
 
Linux basic commands
Linux basic commandsLinux basic commands
Linux basic commandsSagar Kumar
 
ClickHouse Data Warehouse 101: The First Billion Rows, by Alexander Zaitsev a...
ClickHouse Data Warehouse 101: The First Billion Rows, by Alexander Zaitsev a...ClickHouse Data Warehouse 101: The First Billion Rows, by Alexander Zaitsev a...
ClickHouse Data Warehouse 101: The First Billion Rows, by Alexander Zaitsev a...Altinity Ltd
 
Introduction to systemd
Introduction to systemdIntroduction to systemd
Introduction to systemdYusaku OGAWA
 
Distributed Database System
Distributed Database SystemDistributed Database System
Distributed Database SystemSulemang
 
Linux network file system (nfs)
Linux   network file system (nfs)Linux   network file system (nfs)
Linux network file system (nfs)Raghu nath
 
Monitoring Oracle Database Instances with Zabbix
Monitoring Oracle Database Instances with ZabbixMonitoring Oracle Database Instances with Zabbix
Monitoring Oracle Database Instances with ZabbixGerger
 

La actualidad más candente (20)

Git and Github slides.pdf
Git and Github slides.pdfGit and Github slides.pdf
Git and Github slides.pdf
 
Git basics
Git basicsGit basics
Git basics
 
Git the Docs: A fun, hands-on introduction to version control
Git the Docs: A fun, hands-on introduction to version controlGit the Docs: A fun, hands-on introduction to version control
Git the Docs: A fun, hands-on introduction to version control
 
Git 101: Git and GitHub for Beginners
Git 101: Git and GitHub for Beginners Git 101: Git and GitHub for Beginners
Git 101: Git and GitHub for Beginners
 
Challenges of Conventional Systems.pptx
Challenges of Conventional Systems.pptxChallenges of Conventional Systems.pptx
Challenges of Conventional Systems.pptx
 
Rds data lake @ Robinhood
Rds data lake @ Robinhood Rds data lake @ Robinhood
Rds data lake @ Robinhood
 
Source Code management System
Source Code management SystemSource Code management System
Source Code management System
 
Hadoop Internals
Hadoop InternalsHadoop Internals
Hadoop Internals
 
Map Reduce
Map ReduceMap Reduce
Map Reduce
 
Go logging using eBPF
Go logging using eBPFGo logging using eBPF
Go logging using eBPF
 
Fine-tuning Group Replication for Performance
Fine-tuning Group Replication for PerformanceFine-tuning Group Replication for Performance
Fine-tuning Group Replication for Performance
 
Linux basic commands
Linux basic commandsLinux basic commands
Linux basic commands
 
GitLab.pptx
GitLab.pptxGitLab.pptx
GitLab.pptx
 
ClickHouse Data Warehouse 101: The First Billion Rows, by Alexander Zaitsev a...
ClickHouse Data Warehouse 101: The First Billion Rows, by Alexander Zaitsev a...ClickHouse Data Warehouse 101: The First Billion Rows, by Alexander Zaitsev a...
ClickHouse Data Warehouse 101: The First Billion Rows, by Alexander Zaitsev a...
 
Mongo db report
Mongo db reportMongo db report
Mongo db report
 
Introduction to systemd
Introduction to systemdIntroduction to systemd
Introduction to systemd
 
Distributed Database System
Distributed Database SystemDistributed Database System
Distributed Database System
 
Linux boot process
Linux boot processLinux boot process
Linux boot process
 
Linux network file system (nfs)
Linux   network file system (nfs)Linux   network file system (nfs)
Linux network file system (nfs)
 
Monitoring Oracle Database Instances with Zabbix
Monitoring Oracle Database Instances with ZabbixMonitoring Oracle Database Instances with Zabbix
Monitoring Oracle Database Instances with Zabbix
 

Similar a Join Algorithms in MapReduce

Mapreduce script
Mapreduce scriptMapreduce script
Mapreduce scriptHaripritha
 
Introduction to the Map-Reduce framework.pdf
Introduction to the Map-Reduce framework.pdfIntroduction to the Map-Reduce framework.pdf
Introduction to the Map-Reduce framework.pdfBikalAdhikari4
 
Hadoop eco system with mapreduce hive and pig
Hadoop eco system with mapreduce hive and pigHadoop eco system with mapreduce hive and pig
Hadoop eco system with mapreduce hive and pigKhanKhaja1
 
Repartition join in mapreduce
Repartition join in mapreduceRepartition join in mapreduce
Repartition join in mapreduceUday Vakalapudi
 
Map reduce presentation
Map reduce presentationMap reduce presentation
Map reduce presentationateeq ateeq
 
module3part-1-bigdata-230301002404-3db4f2a4 (1).pdf
module3part-1-bigdata-230301002404-3db4f2a4 (1).pdfmodule3part-1-bigdata-230301002404-3db4f2a4 (1).pdf
module3part-1-bigdata-230301002404-3db4f2a4 (1).pdfTSANKARARAO
 
Lecture 4 principles of parallel algorithm design updated
Lecture 4   principles of parallel algorithm design updatedLecture 4   principles of parallel algorithm design updated
Lecture 4 principles of parallel algorithm design updatedVajira Thambawita
 
Comparing Distributed Indexing To Mapreduce or Not?
Comparing Distributed Indexing To Mapreduce or Not?Comparing Distributed Indexing To Mapreduce or Not?
Comparing Distributed Indexing To Mapreduce or Not?TerrierTeam
 
Map reduce presentation
Map reduce presentationMap reduce presentation
Map reduce presentationAhmad El Tawil
 
Hadoop secondary sort and a custom comparator
Hadoop secondary sort and a custom comparatorHadoop secondary sort and a custom comparator
Hadoop secondary sort and a custom comparatorSubhas Kumar Ghosh
 

Similar a Join Algorithms in MapReduce (20)

Mapreduce script
Mapreduce scriptMapreduce script
Mapreduce script
 
MapReduce.pptx
MapReduce.pptxMapReduce.pptx
MapReduce.pptx
 
Hadoop Architecture
Hadoop ArchitectureHadoop Architecture
Hadoop Architecture
 
Introduction to MapReduce
Introduction to MapReduceIntroduction to MapReduce
Introduction to MapReduce
 
Introduction to the Map-Reduce framework.pdf
Introduction to the Map-Reduce framework.pdfIntroduction to the Map-Reduce framework.pdf
Introduction to the Map-Reduce framework.pdf
 
Hadoop eco system with mapreduce hive and pig
Hadoop eco system with mapreduce hive and pigHadoop eco system with mapreduce hive and pig
Hadoop eco system with mapreduce hive and pig
 
Hadoop map reduce concepts
Hadoop map reduce conceptsHadoop map reduce concepts
Hadoop map reduce concepts
 
Hadoop Mapreduce joins
Hadoop Mapreduce joinsHadoop Mapreduce joins
Hadoop Mapreduce joins
 
Repartition join in mapreduce
Repartition join in mapreduceRepartition join in mapreduce
Repartition join in mapreduce
 
Pig Experience
Pig ExperiencePig Experience
Pig Experience
 
2 mapreduce-model-principles
2 mapreduce-model-principles2 mapreduce-model-principles
2 mapreduce-model-principles
 
Unit3 MapReduce
Unit3 MapReduceUnit3 MapReduce
Unit3 MapReduce
 
Map reduce presentation
Map reduce presentationMap reduce presentation
Map reduce presentation
 
module3part-1-bigdata-230301002404-3db4f2a4 (1).pdf
module3part-1-bigdata-230301002404-3db4f2a4 (1).pdfmodule3part-1-bigdata-230301002404-3db4f2a4 (1).pdf
module3part-1-bigdata-230301002404-3db4f2a4 (1).pdf
 
Big Data.pptx
Big Data.pptxBig Data.pptx
Big Data.pptx
 
MapReduce-Notes.pdf
MapReduce-Notes.pdfMapReduce-Notes.pdf
MapReduce-Notes.pdf
 
Lecture 4 principles of parallel algorithm design updated
Lecture 4   principles of parallel algorithm design updatedLecture 4   principles of parallel algorithm design updated
Lecture 4 principles of parallel algorithm design updated
 
Comparing Distributed Indexing To Mapreduce or Not?
Comparing Distributed Indexing To Mapreduce or Not?Comparing Distributed Indexing To Mapreduce or Not?
Comparing Distributed Indexing To Mapreduce or Not?
 
Map reduce presentation
Map reduce presentationMap reduce presentation
Map reduce presentation
 
Hadoop secondary sort and a custom comparator
Hadoop secondary sort and a custom comparatorHadoop secondary sort and a custom comparator
Hadoop secondary sort and a custom comparator
 

Último

Software Development Life Cycle By Team Orange (Dept. of Pharmacy)
Software Development Life Cycle By  Team Orange (Dept. of Pharmacy)Software Development Life Cycle By  Team Orange (Dept. of Pharmacy)
Software Development Life Cycle By Team Orange (Dept. of Pharmacy)Suman Mia
 
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...Dr.Costas Sachpazis
 
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLSMANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLSSIVASHANKAR N
 
Introduction to Multiple Access Protocol.pptx
Introduction to Multiple Access Protocol.pptxIntroduction to Multiple Access Protocol.pptx
Introduction to Multiple Access Protocol.pptxupamatechverse
 
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur High Profile
 
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...ranjana rawat
 
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...roncy bisnoi
 
Porous Ceramics seminar and technical writing
Porous Ceramics seminar and technical writingPorous Ceramics seminar and technical writing
Porous Ceramics seminar and technical writingrakeshbaidya232001
 
Introduction and different types of Ethernet.pptx
Introduction and different types of Ethernet.pptxIntroduction and different types of Ethernet.pptx
Introduction and different types of Ethernet.pptxupamatechverse
 
Extrusion Processes and Their Limitations
Extrusion Processes and Their LimitationsExtrusion Processes and Their Limitations
Extrusion Processes and Their Limitations120cr0395
 
Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service Nashik
Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service NashikCall Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service Nashik
Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service NashikCall Girls in Nagpur High Profile
 
HARDNESS, FRACTURE TOUGHNESS AND STRENGTH OF CERAMICS
HARDNESS, FRACTURE TOUGHNESS AND STRENGTH OF CERAMICSHARDNESS, FRACTURE TOUGHNESS AND STRENGTH OF CERAMICS
HARDNESS, FRACTURE TOUGHNESS AND STRENGTH OF CERAMICSRajkumarAkumalla
 
(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Service
(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Service(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Service
(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Serviceranjana rawat
 
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...Christo Ananth
 
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130Suhani Kapoor
 
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...ranjana rawat
 
Coefficient of Thermal Expansion and their Importance.pptx
Coefficient of Thermal Expansion and their Importance.pptxCoefficient of Thermal Expansion and their Importance.pptx
Coefficient of Thermal Expansion and their Importance.pptxAsutosh Ranjan
 
UNIT-III FMM. DIMENSIONAL ANALYSIS
UNIT-III FMM.        DIMENSIONAL ANALYSISUNIT-III FMM.        DIMENSIONAL ANALYSIS
UNIT-III FMM. DIMENSIONAL ANALYSISrknatarajan
 
College Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
College Call Girls Nashik Nehal 7001305949 Independent Escort Service NashikCollege Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
College Call Girls Nashik Nehal 7001305949 Independent Escort Service NashikCall Girls in Nagpur High Profile
 

Último (20)

Software Development Life Cycle By Team Orange (Dept. of Pharmacy)
Software Development Life Cycle By  Team Orange (Dept. of Pharmacy)Software Development Life Cycle By  Team Orange (Dept. of Pharmacy)
Software Development Life Cycle By Team Orange (Dept. of Pharmacy)
 
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
 
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLSMANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
 
Introduction to Multiple Access Protocol.pptx
Introduction to Multiple Access Protocol.pptxIntroduction to Multiple Access Protocol.pptx
Introduction to Multiple Access Protocol.pptx
 
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
 
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...
 
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
 
Porous Ceramics seminar and technical writing
Porous Ceramics seminar and technical writingPorous Ceramics seminar and technical writing
Porous Ceramics seminar and technical writing
 
Introduction and different types of Ethernet.pptx
Introduction and different types of Ethernet.pptxIntroduction and different types of Ethernet.pptx
Introduction and different types of Ethernet.pptx
 
Extrusion Processes and Their Limitations
Extrusion Processes and Their LimitationsExtrusion Processes and Their Limitations
Extrusion Processes and Their Limitations
 
Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service Nashik
Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service NashikCall Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service Nashik
Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service Nashik
 
HARDNESS, FRACTURE TOUGHNESS AND STRENGTH OF CERAMICS
HARDNESS, FRACTURE TOUGHNESS AND STRENGTH OF CERAMICSHARDNESS, FRACTURE TOUGHNESS AND STRENGTH OF CERAMICS
HARDNESS, FRACTURE TOUGHNESS AND STRENGTH OF CERAMICS
 
(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Service
(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Service(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Service
(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Service
 
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
 
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
 
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
 
Coefficient of Thermal Expansion and their Importance.pptx
Coefficient of Thermal Expansion and their Importance.pptxCoefficient of Thermal Expansion and their Importance.pptx
Coefficient of Thermal Expansion and their Importance.pptx
 
UNIT-III FMM. DIMENSIONAL ANALYSIS
UNIT-III FMM.        DIMENSIONAL ANALYSISUNIT-III FMM.        DIMENSIONAL ANALYSIS
UNIT-III FMM. DIMENSIONAL ANALYSIS
 
College Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
College Call Girls Nashik Nehal 7001305949 Independent Escort Service NashikCollege Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
College Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
 
Water Industry Process Automation & Control Monthly - April 2024
Water Industry Process Automation & Control Monthly - April 2024Water Industry Process Automation & Control Monthly - April 2024
Water Industry Process Automation & Control Monthly - April 2024
 

Join Algorithms in MapReduce

  • 1. “Join Algorithms in MapReduce Framework” Shrihari A Rathod
  • 2. • Introduction • Objectives • Join Algorithm Selection • Discussions (Comparison, Advantages, Issues , Applications) • References AGENDA
  • 3. INTRODUCTION MapReduce: • Large scale data processing in parallel • Two phases in MapReduce • Read a lot of Data • Map: Extract something you care about from each record • Shuffle and Sort • Reduce: Aggregate, Summarize, Filter or Transform • Write the results • Outline stays the same, map and reduce change to fit the problem
  • 4. INTRODUCTION MapReduce A Brief Map Phase • map(in_key, in_value) → list(out_key, intermediate_value) • Processes input key value pairs • Produces set of intermediate pairs Reduce Phase • reduce(out_key,list(intermediate_value)) → list(out_value) • Combines all intermediate value for a particular key. • Produces a set of merged output values(usually just one)
  • 5. INTRODUCTION Hadoop: • Apache Hadoop is an open-source software framework that allows for the distributed processing of large datasets across clusters of commodity computers using a simple programming model. • Hadoop is an open source implementation of Google MapReduce, GFS (distributed file system) • Supports running of applications on large clusters of commodity hardware. • Task are divided into Map-Reduce framework • Provides a distributed file system that stores data on the compute nodes.
  • 6. INTRODUCTION • The Master nodes oversee the two key functional pieces that make up Hadoop: storing lots of data (HDFS), and running parallel computations on all that data (Map Reduce). • The Name Node oversees & coordinates the data storage function (HDFS), while the Job Tracker oversees & coordinates the parallel processing of data using Map Reduce. • Slave Nodes do all the dirty work of storing the data and running the computations. Each slave runs both a Data Node and Task Tracker daemon that communicate with and receive instructions from their master nodes. • The Task Tracker daemon is a slave to the Job Tracker, the Data Node daemon & Name Node.
  • 7. INTRODUCTION • Two stages- a ‘Map stage’ and a ‘Reduce stage’. • A mapper’s job during Map Stage is to “read” the data from join tables and to “return” the ‘join key’ and ‘join value’ pair into an intermediate file. • Further, in the shuffle stage, this intermediate file is then sorted and merged. • The reducer’s job during reduce stage is to take this sorted result as input & complete the task of join. Join Operation:
  • 8. OBJECTIVE Objectives: The contributions of this seminar can be summarized as follows: • There is a great controversy over which join implementation the user should choose. • Analysis of MapReduce framework, make a comparison between different ways to implement join in MapReduce. Examine the existing algorithms for joining datasets using Map/Reduce. • Analysis of affecting factors to MapReduce, and study the new influence of these factors when implementing join in MapReduce • This seminar describes various methods for Joining datasets using Hadoop MapReduce & proposes a strategy to find out best suitable join processing algorithm.
  • 9. Map Side Join A map-side join takes place when the data is joined before it reaches the map function. • A given key has to be in the same partition in each dataset so that all partitions that can hold a certain key are joined together. For this to work, all datasets should be partitioned using the same partitioner and moreover, the number of partitions in each dataset should be identical. • The sort order of the data in each dataset must be identical. This requires that all datasets must be sorted using the same comparator. JOIN ALGORITHMS
  • 10. Reduce Side Join Map Phase The ‘map’ phase only pre-processes the tuples of the two datasets to organize them in terms of the join key. Partitioning and Grouping Phase The partitioner partitions the tuples among the reducers based on the join key such that all tuples from both datasets having the same key go to the same reducer. The reduce function is called once for a key and the list of values associated with it. This list of values is generated by grouping together all the tuples associated with the same key. We are sending a composite TextPair key and hence the Reducer will consider (key, tag) as a key Reduce Phase The framework sorts the keys (in this case, the composite TextPair key) and passes them on with the corresponding values to the Reducer. Since the sorting is done on the composite key (primary sort on Key and secondary sort on Tag), tuples from one table will all come before the other table. JOIN ALGORITHMS
  • 11. Reduce Side Join JOIN ALGORITHMS
  • 12. Repartition Join [8] Map Phase : • Each map task works on a split of either R or L. • Each map task tags the record with its originating table. • Outputs the extracted join key and the tagged record as a (key, value) pair. • The outputs are then partitioned, sorted and merged by the framework. Reducer Phase : • All the records for each join key are grouped together and eventually fed to a reducer. • For each join key, the reduce function first separates and buffers the input records into two sets according to the table tag. • Performs a cross-product between records in the above sets. JOIN ALGORITHMS
  • 14. Broadcast Join [8] • Broadcast join is used there need to join the smaller data set with large data set. • With the assumption that smaller data set will fit into memory easily. • The smaller data set is pushed into distributed cache and replicated among the cluster nodes. • In init method of mapper a hash map(or other associative array) is built from smaller data set. • And join operation is performed in map method and output is emitted. JOIN ALGORITHMS
  • 15. Trojan Join [8] • Trojan Join supports more effective join by assuming we know the schema and workload • Idea is to co-partition the data at load time. • Joins are locally processed within each node at query time, but we are free to group the data on any attribute other than the join attribute in same mapreduce job JOIN ALGORITHMS
  • 17. Decision Tree for Join Algorithm Selection [8] JOIN ALGORITHM SELECTION Schema & Join Condition Known No Yes Two Relation Join Trojan Join Yes No Cost effective Is Replication Efficient Yes No Broadcast Join Selectivity No Yes Less High Optimized Broadcast Join Repartition Join Replicated Join
  • 18. DISSCUSSIONS Criteria for Join Algorithm Selection • Prior knowledge of schema and join condition • Transmission over network • Number of tuples
  • 19. DISSCUSSIONS Comparison of Join processing methods Join Type MapReduce jobs Advantages Issues Repartition 1 MapReduce job Simple implementation of Reduce phase Sorting and movement of tuples over network Broadcast 1 Map phase No sorting and movement of tuples Useful only if one relation is small. Trojan 1 Map phase Uses schema knowledge Useful, if join conditions are known Replicated 1 MapReduce job Efficient for Star join and Chain join For large relations more number of reducers / replicas are required
  • 20. REFERENCES [1] J. Dean and S. Ghemawat. Mapreduce: simplified data processing on large clusters. Commun. ACM, 51(1):107–113, 2008. [2] Fuhui Wu et al. “Comparison & Performance Analysis of Join Approach in MapReduce” ISCTCS 2012, CCIS 320, pp. 629–636[8] [3] Marko Lalić et al. “Comparison of a Sequential & a MapReduce Approach to Joining Large Datasets” MIPRO 2013, pp.1289-1291[3] [4] Spyros, B., Jignesh, M.P., Vuk, E., Jun, R., Eugene, J., Yuanyuan, T.: A Comparison of Join Algorithms for Log Processing in MapReduce. In: SIGMOD 2010, June 6–11. ACM, Indianapolis (2010) [5] Foto N. Afrati et al. “Optimizing Multiway Joins in a Map-Reduce Environment” IEEE Transactions On Knowledge And Data Engineering, pp. 1282- 1298[17], 2011 [6] Alper Okcan et al. “Processing Theta-Joins using MapReduce” SIGMOD‟11, June 12–16, 2011 pp. 949-951[12] [7] Xiaofei Zhang et al. “Efficient Multiway Theta Join Processing Using MapReduce” Proceedings of the VLDB Endowment, Vol. 5, No. 11, pp.1184-1196[12] [8] Anwar Shaikh et al. “Join Query Processing in MapReduce Environment” CNC 2012, LNICST , pp.275-281[7]

Notas del editor

  1. MapReduce provides: Automatic parallelization & distribution Fault Tolerance I/O Scheduling Status and Monitoring