2. Slide 2 www.edureka.co/big-data-and-hadoop
Objectives
At the end of this module, you will be able to:
• Explain what a Reduce-side join is
• Explain why Reduce-side joins are used
• Identify where MapReduce is used
• Describe the MapReduce flow
• List the steps to implement MapReduce
• Run a Reduce-side join using MapReduce
Slide 3
Why do we join data?
Consider an example: a customer's data is split across two files/tables.

File 1:
Cust_id | Name  | Item
001     | John  | iphone
002     | Jenny | laptop

File 2:
Cust_id | City    | Phone
001     | NewYork | 123456
003     | Vegas   | 365895

To get the complete details, one needs to join both data files. Using joins, we can generate data that is useful and sensible, based on some key; here it is Cust_id:

001 | John | iphone | NewYork | 123456
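The join above can be sketched in plain Python (a minimal sketch, not Hadoop; the variable names are illustrative):

```python
# Minimal sketch of joining two datasets on a common key (Cust_id).
# The data mirrors the example tables above.

names = {"001": ("John", "iphone"), "002": ("Jenny", "laptop")}
contacts = {"001": ("NewYork", "123456"), "003": ("Vegas", "365895")}

def inner_join(left, right):
    """Return joined rows for keys present in both datasets."""
    return {k: left[k] + right[k] for k in left if k in right}

joined = inner_join(names, contacts)
# Only Cust_id 001 appears in both tables, so only it survives the join.
```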
Slide 4
Types of Join in MapReduce
Data joins in Hadoop fall into two categories:

Map-side join:
• Happens on the map side
• Done in memory
• Suited when one dataset is big and the other is small
• Memory-expensive

Reduce-side join:
• Happens on the reduce side
• Done off memory
• Suited when both datasets are huge
• Cheap in memory
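To illustrate the contrast, a map-side join holds the small dataset in memory and joins during the map phase, so no shuffle is needed. This is a plain-Python simulation, not the Hadoop API; in Hadoop the small table would typically be loaded via the distributed cache:

```python
# Sketch of a map-side join: the small dataset fits in memory and each
# big-table record is joined against it directly in the map phase.

small = {"001": "NewYork", "003": "Vegas"}  # small table, cached in memory

def map_side_join(record, cached):
    """Join one big-table record against the in-memory small table."""
    cust_id, name = record
    city = cached.get(cust_id)          # pure in-memory lookup, no shuffle
    return (cust_id, name, city) if city else None

big = [("001", "John"), ("002", "Jenny")]
results = [r for r in (map_side_join(rec, small) for rec in big) if r]
```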
Slide 5
Where should a Reduce-Side Join be used?
• Joining data is arguably one of the biggest uses of Hadoop.
• Reduce-side joins are straightforward to implement because Hadoop sends identical keys to the same reducer, so by default the data is organized for us.
• Handy when all the files to be joined are huge in size.
• Should be used when you are not in a hurry for the result, since joining huge datasets takes time.
Slide 7
Where is MapReduce Used?
Weather Forecasting
Problem Statement:
» Finding the maximum temperature recorded in a year.
HealthCare
Problem Statement:
» De-identify personal health information.
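The weather use case can be sketched as a map and a reduce function. This is a plain-Python simulation of the job, with an illustrative `(year, temperature)` input format:

```python
# Sketch of the "maximum temperature in a year" job: the mapper emits
# (year, temperature) pairs and the reducer keeps the maximum per year.

from collections import defaultdict

def mapper(record):
    year, temp = record
    yield (year, temp)                 # emit (year, temperature)

def reducer(year, temps):
    return (year, max(temps))          # keep the maximum per year

records = [(2020, 25), (2020, 31), (2021, 28), (2021, 22)]

# Simulate the shuffle: group all values by key.
grouped = defaultdict(list)
for rec in records:
    for k, v in mapper(rec):
        grouped[k].append(v)

result = dict(reducer(y, ts) for y, ts in grouped.items())
```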
Slide 8
Where is MapReduce Used?
(Diagram: MapReduce mind map)
• MapReduce is a programming model: a large-scale, distributed, parallel programming model and design pattern.
• Functions: Map and Reduce.
• Used in: Classification, Analytics, Recommendation, Index and Search.
• Design patterns: Classification (e.g., Top N records), Analytics (e.g., Join, Selection), Recommendation (e.g., Sort), Summarization (e.g., Inverted Index).
• Implemented for: Google, Apache Hadoop, HDFS, Pig, Hive, HBase.
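One of the patterns above, Top N records, can be sketched in plain Python: each mapper keeps a local top N of its split, and a single reducer merges the local lists. The data and N are illustrative:

```python
# Sketch of the "Top N records" design pattern: local top-N per split
# in the map phase, merged into a global top-N in the reduce phase.

import heapq

def map_top_n(split, n):
    return heapq.nlargest(n, split)    # local top-N for one split

def reduce_top_n(local_tops, n):
    merged = [x for top in local_tops for x in top]
    return heapq.nlargest(n, merged)   # global top-N across all splits

splits = [[5, 1, 9, 3], [7, 2, 8], [4, 6]]
top3 = reduce_top_n([map_top_n(s, 3) for s in splits], 3)
```

Keeping only a local top N in each mapper means far less data crosses the shuffle than sending every record.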
Slide 9
MapReduce Paradigm
The Overall MapReduce Word Count Process

Input (K1, V1):
Deer Bear River
Car Car River
Deer Car Bear

Splitting: each line becomes a separate split.

Mapping (List(K2, V2)):
Deer, 1 | Bear, 1 | River, 1
Car, 1 | Car, 1 | River, 1
Deer, 1 | Car, 1 | Bear, 1

Shuffling (K2, List(V2)):
Bear, (1, 1)
Car, (1, 1, 1)
Deer, (1, 1)
River, (1, 1)

Reducing:
Bear, 2
Car, 3
Deer, 2
River, 2

Final Result (List(K3, V3)):
Bear, 2; Car, 3; Deer, 2; River, 2
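The word-count flow above can be simulated end to end in plain Python (a sketch of the phases, not the Hadoop API):

```python
# Simulation of the word-count flow: splitting, mapping, shuffling
# (grouping values by key) and reducing.

from collections import defaultdict

lines = ["Deer Bear River", "Car Car River", "Deer Car Bear"]

# Mapping: emit (word, 1) for every word          -> List(K2, V2)
mapped = [(word, 1) for line in lines for word in line.split()]

# Shuffling: group all values by key              -> K2, List(V2)
shuffled = defaultdict(list)
for word, count in mapped:
    shuffled[word].append(count)

# Reducing: sum the values per key                -> List(K3, V3)
counts = {word: sum(vals) for word, vals in shuffled.items()}
```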
Slides 12-15
MapReduce Job Submission Flow
1. Input data is distributed to nodes.
2. Each map task works on a "split" of data.
3. Mapper outputs intermediate data.
4. Data is exchanged between nodes in a "shuffle" process.
5. Intermediate data with the same key goes to the same reducer.
6. Reducer output is stored.

(Diagram: input data flows to Map tasks on Node 1 and Node 2, which shuffle their intermediate output to Reduce tasks on Node 1 and Node 2.)
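Step 5, routing intermediate data with the same key to the same reducer, is done by a partitioner. Hadoop's default `HashPartitioner` hashes the key modulo the number of reducers; the sketch below uses Python's built-in `hash()` as a stand-in for Java's `hashCode()`:

```python
# Sketch of partitioning: identical keys always hash to the same
# reducer index, so each reducer sees all values for its keys.

def partition(key, num_reducers):
    return hash(key) % num_reducers

num_reducers = 2
keys = ["Bear", "Car", "Deer", "River", "Bear", "Car"]
routing = [(k, partition(k, num_reducers)) for k in keys]
# Every occurrence of a given key maps to the same reducer index.
```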
16. Slide 16 www.edureka.co/big-data-and-hadoop
Apart from keys we use tagging to identify the source of the file in reduce side joins.
We use different mappers to read the files individually.
Each value emitted from the mappers is tagged with unique identifier for a file
Output of all the mapper would go to one-one reducer based on unique keys
In the reducer, fields from different data sources are joined based on the common key from different files.
How it works Reduce Side??
Slide 17
How does a Reduce-side join work?
(Diagram: File 1 is read by Map Task 1 and File 2 by Map Task 2; each emits {tag} value pairs. After shuffling and sorting, the partitioner routes each key to Reducer 1 or Reducer 2, which write the outputs Part-001 and Part-002.)
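The whole flow can be simulated in plain Python: two mappers tag their records with the source file, the shuffle groups tagged values by Cust_id, and the reducer stitches fields from both sources together. This is a sketch of the technique, not the Hadoop API, and it assumes at most one record per key per file:

```python
# Simulation of a reduce-side join with tagging: the tag tells the
# reducer which source file each value came from.

from collections import defaultdict

file1 = [("001", ("John", "iphone")), ("002", ("Jenny", "laptop"))]
file2 = [("001", ("NewYork", "123456")), ("003", ("Vegas", "365895"))]

def tagged_map(records, tag):
    for key, value in records:
        yield key, (tag, value)            # tag identifies the source file

# Shuffle: group tagged values by key.
grouped = defaultdict(list)
for k, v in list(tagged_map(file1, "F1")) + list(tagged_map(file2, "F2")):
    grouped[k].append(v)

def join_reducer(key, tagged_values):
    """Join fields from both sources; skip keys missing from either file."""
    by_tag = dict(tagged_values)           # assumes one record per tag
    if "F1" in by_tag and "F2" in by_tag:
        return key, by_tag["F1"] + by_tag["F2"]
    return None

joined = [r for r in (join_reducer(k, vs) for k, vs in grouped.items()) if r]
```

As on slide 3, only Cust_id 001 appears in both files, so it is the only joined record.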