2. Slide 2 www.edureka.co/big-data-and-hadoop
Objectives
At the end of this module, you will be able to:
• Explain what a Reduce-side join is
• Explain why Reduce-side joins are used
• Identify where MapReduce is used
• Describe the MapReduce flow
• List the steps to implement MapReduce
• Run a Reduce-side join using MapReduce
Slide 3
Why do we join data?
Consider an example: a customer's data is split across two files/tables.

File 1:
Cust_id | Name  | Item
001     | John  | iphone
002     | Jenny | laptop

File 2:
Cust_id | City    | Phone
001     | NewYork | 123456
003     | Vegas   | 365895

To get the complete details, one needs to join both data files. Using joins, we can generate data that is useful and sensible, based on some key; here it is Cust_id:

001 | John | iphone | NewYork | 123456
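The join above can be sketched in plain Python (a minimal sketch, not Hadoop; the variable names are illustrative):

```python
# Minimal sketch of joining two datasets on a common key (Cust_id).
# The data mirrors the example tables above.

names = {"001": ("John", "iphone"), "002": ("Jenny", "laptop")}
contacts = {"001": ("NewYork", "123456"), "003": ("Vegas", "365895")}

def inner_join(left, right):
    """Return joined rows for keys present in both datasets."""
    return {k: left[k] + right[k] for k in left if k in right}

joined = inner_join(names, contacts)
# Only Cust_id 001 appears in both tables, so only it survives the join.
```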
Slide 4
Types of Join in MapReduce
Data joins in Hadoop fall into two categories:

Map-side join:
• Happens on the map side
• Done in memory
• Suited when one dataset is big and the other is small
• Memory-expensive

Reduce-side join:
• Happens on the reduce side
• Done off memory
• Suited when both datasets are huge
• Cheap in memory
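To illustrate the contrast, a map-side join holds the small dataset in memory and joins during the map phase, so no shuffle is needed. This is a plain-Python simulation, not the Hadoop API; in Hadoop the small table would typically be loaded via the distributed cache:

```python
# Sketch of a map-side join: the small dataset fits in memory and each
# big-table record is joined against it directly in the map phase.

small = {"001": "NewYork", "003": "Vegas"}  # small table, cached in memory

def map_side_join(record, cached):
    """Join one big-table record against the in-memory small table."""
    cust_id, name = record
    city = cached.get(cust_id)          # pure in-memory lookup, no shuffle
    return (cust_id, name, city) if city else None

big = [("001", "John"), ("002", "Jenny")]
results = [r for r in (map_side_join(rec, small) for rec in big) if r]
```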
Slide 5
Where should a Reduce-Side Join be used?
• Joining data is arguably one of the biggest uses of Hadoop.
• Reduce-side joins are straightforward to implement because Hadoop sends identical keys to the same reducer, so by default the data is organized for us.
• Handy when all the files to be joined are huge in size.
• Should be used when you are not in a hurry for the result, since joining huge datasets takes time.
Slide 7
Where is MapReduce Used?
Weather Forecasting
Problem Statement:
» Finding the maximum temperature recorded in a year.
HealthCare
Problem Statement:
» De-identify personal health information.
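The weather use case can be sketched as a map and a reduce function. This is a plain-Python simulation of the job, with an illustrative `(year, temperature)` input format:

```python
# Sketch of the "maximum temperature in a year" job: the mapper emits
# (year, temperature) pairs and the reducer keeps the maximum per year.

from collections import defaultdict

def mapper(record):
    year, temp = record
    yield (year, temp)                 # emit (year, temperature)

def reducer(year, temps):
    return (year, max(temps))          # keep the maximum per year

records = [(2020, 25), (2020, 31), (2021, 28), (2021, 22)]

# Simulate the shuffle: group all values by key.
grouped = defaultdict(list)
for rec in records:
    for k, v in mapper(rec):
        grouped[k].append(v)

result = dict(reducer(y, ts) for y, ts in grouped.items())
```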
Slide 8
Where is MapReduce Used?
(Diagram: MapReduce mind map)
• MapReduce is a programming model: a large-scale, distributed, parallel programming model and design pattern.
• Functions: Map and Reduce.
• Used in: Classification, Analytics, Recommendation, Index and Search.
• Design patterns: Classification (e.g., Top N records), Analytics (e.g., Join, Selection), Recommendation (e.g., Sort), Summarization (e.g., Inverted Index).
• Implemented for: Google, Apache Hadoop, HDFS, Pig, Hive, HBase.
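One of the patterns above, Top N records, can be sketched in plain Python: each mapper keeps a local top N of its split, and a single reducer merges the local lists. The data and N are illustrative:

```python
# Sketch of the "Top N records" design pattern: local top-N per split
# in the map phase, merged into a global top-N in the reduce phase.

import heapq

def map_top_n(split, n):
    return heapq.nlargest(n, split)    # local top-N for one split

def reduce_top_n(local_tops, n):
    merged = [x for top in local_tops for x in top]
    return heapq.nlargest(n, merged)   # global top-N across all splits

splits = [[5, 1, 9, 3], [7, 2, 8], [4, 6]]
top3 = reduce_top_n([map_top_n(s, 3) for s in splits], 3)
```

Keeping only a local top N in each mapper means far less data crosses the shuffle than sending every record.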
Slide 9
MapReduce Paradigm
The Overall MapReduce Word Count Process

Input (K1, V1):
Deer Bear River
Car Car River
Deer Car Bear

Splitting: each line becomes a separate split.

Mapping (List(K2, V2)):
Deer, 1 | Bear, 1 | River, 1
Car, 1 | Car, 1 | River, 1
Deer, 1 | Car, 1 | Bear, 1

Shuffling (K2, List(V2)):
Bear, (1, 1)
Car, (1, 1, 1)
Deer, (1, 1)
River, (1, 1)

Reducing:
Bear, 2
Car, 3
Deer, 2
River, 2

Final Result (List(K3, V3)):
Bear, 2; Car, 3; Deer, 2; River, 2
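The word-count flow above can be simulated end to end in plain Python (a sketch of the phases, not the Hadoop API):

```python
# Simulation of the word-count flow: splitting, mapping, shuffling
# (grouping values by key) and reducing.

from collections import defaultdict

lines = ["Deer Bear River", "Car Car River", "Deer Car Bear"]

# Mapping: emit (word, 1) for every word          -> List(K2, V2)
mapped = [(word, 1) for line in lines for word in line.split()]

# Shuffling: group all values by key              -> K2, List(V2)
shuffled = defaultdict(list)
for word, count in mapped:
    shuffled[word].append(count)

# Reducing: sum the values per key                -> List(K3, V3)
counts = {word: sum(vals) for word, vals in shuffled.items()}
```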
Slides 12-15
MapReduce Job Submission Flow
1. Input data is distributed to nodes.
2. Each map task works on a "split" of data.
3. Mapper outputs intermediate data.
4. Data is exchanged between nodes in a "shuffle" process.
5. Intermediate data with the same key goes to the same reducer.
6. Reducer output is stored.

(Diagram: input data flows to Map tasks on Node 1 and Node 2, which shuffle their intermediate output to Reduce tasks on Node 1 and Node 2.)
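Step 5, routing intermediate data with the same key to the same reducer, is done by a partitioner. Hadoop's default `HashPartitioner` hashes the key modulo the number of reducers; the sketch below uses Python's built-in `hash()` as a stand-in for Java's `hashCode()`:

```python
# Sketch of partitioning: identical keys always hash to the same
# reducer index, so each reducer sees all values for its keys.

def partition(key, num_reducers):
    return hash(key) % num_reducers

num_reducers = 2
keys = ["Bear", "Car", "Deer", "River", "Bear", "Car"]
routing = [(k, partition(k, num_reducers)) for k in keys]
# Every occurrence of a given key maps to the same reducer index.
```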
16. Slide 16 www.edureka.co/big-data-and-hadoop
Apart from keys we use tagging to identify the source of the file in reduce side joins.
We use different mappers to read the files individually.
Each value emitted from the mappers is tagged with unique identifier for a file
Output of all the mapper would go to one-one reducer based on unique keys
In the reducer, fields from different data sources are joined based on the common key from different files.
How it works Reduce Side??
Slide 17
How does a Reduce-side join work?
(Diagram: File 1 is read by Map Task 1 and File 2 by Map Task 2; each emits {tag} value pairs. After shuffling and sorting, the partitioner routes each key to Reducer 1 or Reducer 2, which write the outputs Part-001 and Part-002.)
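The whole flow can be simulated in plain Python: two mappers tag their records with the source file, the shuffle groups tagged values by Cust_id, and the reducer stitches fields from both sources together. This is a sketch of the technique, not the Hadoop API, and it assumes at most one record per key per file:

```python
# Simulation of a reduce-side join with tagging: the tag tells the
# reducer which source file each value came from.

from collections import defaultdict

file1 = [("001", ("John", "iphone")), ("002", ("Jenny", "laptop"))]
file2 = [("001", ("NewYork", "123456")), ("003", ("Vegas", "365895"))]

def tagged_map(records, tag):
    for key, value in records:
        yield key, (tag, value)            # tag identifies the source file

# Shuffle: group tagged values by key.
grouped = defaultdict(list)
for k, v in list(tagged_map(file1, "F1")) + list(tagged_map(file2, "F2")):
    grouped[k].append(v)

def join_reducer(key, tagged_values):
    """Join fields from both sources; skip keys missing from either file."""
    by_tag = dict(tagged_values)           # assumes one record per tag
    if "F1" in by_tag and "F2" in by_tag:
        return key, by_tag["F1"] + by_tag["F2"]
    return None

joined = [r for r in (join_reducer(k, vs) for k, vs in grouped.items()) if r]
```

As on slide 3, only Cust_id 001 appears in both files, so it is the only joined record.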