MapReduce
Writing MapReduce with Java
MapReduce
Recap - Why MapReduce?
● Instead of processing Big Data directly
● We break down the logic into
○ map()
■ Executed on machines with data
■ Gives out key-value pairs
○ reduce()
■ Gets output of maps grouped by key
■ Grouping is done by MapReduce Framework
■ Can aggregate data
MapReduce
MAP / REDUCE - Why JAVA
Why in Java?
• Java is Hadoop's primary, best-supported API
• Can modify the framework's behaviour to a very large extent
MapReduce
MAP / REDUCE - JAVA - Objective
Write a map-reduce job to count unique words
this is a cow
this is a buffalo
there is a hen
3 a
1 buffalo
1 cow
1 hen
3 is
1 there
2 this
MapReduce
MAP / REDUCE - JAVA - Objective
Write a map-reduce job to count unique words in text file
/data/mr/wordcount/input/big.txt
Location in HDFS
In CloudxLab
Input File
MapReduce
MAP / REDUCE - JAVA - Mapper
InputFormat
Datanode
HDFS Block1
Record1
(key, value)
Record2
Record3
Map()
Mapper
Map()
Map()
(key1, value1)
(key2, value2)
Nothing
(key3, value3)
InputSplit
We need to write the code which
would break down the input record
into key-value.
MapReduce
MAP / REDUCE - JAVA - Mapper
TextInputFormat
Datanode
this is a cow\nthis is a buffalo\nthere is a hen
Record
(0, "this is a cow")
(15, "this is a buffalo")
(34, "there is a hen")
InputSplit
MapReduce
MAP / REDUCE - JAVA - Mapper
TextInputFormat
Datanode
this is a cow\nthis is a buffalo\nthere is a hen
Record
(0, "this is a cow")
(15, "this is a buffalo")
(34, "there is a hen")
InputSplit
34
15
Byte offset at which
each line starts
MapReduce
public class StubMapper extends Mapper<Object, Text, Text, LongWritable> {
// A class in Java is a complex data type that can also contain methods.
// In other words, a class is a blueprint.
// Person is a class; sandeep is an object (an instance of Person).
}
MAP / REDUCE - JAVA - Mapper - Class
MapReduce
public class StubMapper extends Mapper<Object, Text, Text, LongWritable> {
// Our class StubMapper inherits from the parent class Mapper,
// which is provided by the framework.
// StubMapper is initialized for each input split.
}
MAP / REDUCE - JAVA - Mapper - Extends
MapReduce
public class StubMapper extends Mapper<Object, Text, Text, LongWritable> {
@Override
public void map(Object key, Text value, Context context)
throws IOException, InterruptedException {
String[] words = value.toString().split("[ t]+");
for(String word:words)
{
context.write(new Text(word), new LongWritable(1));
}
}
}
MAP / REDUCE - JAVA - Mapper - Datatypes
Data types of the input and output keys and values
Data type of the input key.
In our example, it is the
byte offset at which the
value (the line) starts
The data type of the
input value.
In our case, the input
value is each line, i.e.
Text
The data type of the
output key.
We are going to emit
the word as the key,
therefore it is Text
The data type of the output value.
We are going to emit the value
1, therefore it is LongWritable
MapReduce
public class StubMapper extends Mapper<Object, Text, Text, LongWritable> {
@Override
public void map(Object key, Text value, Context context)
throws IOException, InterruptedException {
String[] words = value.toString().split("[ t]+");
for(String word:words)
{
context.write(new Text(word), new LongWritable(1));
}
}
}
MAP / REDUCE - JAVA - Mapper - method
MapReduce
public class StubMapper extends Mapper<Object, Text, Text, LongWritable> {
@Override
public void map(Object key, Text value, Context context)
throws IOException, InterruptedException {
String[] words = value.toString().split("[ t]+");
for(String word:words)
{
context.write(new Text(word), new LongWritable(1));
}
}
}
MAP / REDUCE - JAVA - Mapper - method
The input line is split on spaces
or tabs into an array of strings
MapReduce
public class StubMapper extends Mapper<Object, Text, Text, LongWritable> {
@Override
public void map(Object key, Text value, Context context)
throws IOException, InterruptedException {
String[] words = value.toString().split("[ t]+");
for(String word:words)
{
context.write(new Text(word), new LongWritable(1));
}
}
}
MAP / REDUCE - JAVA - Mapper - method
For each of the words ...
MapReduce
public class StubMapper extends Mapper<Object, Text, Text, LongWritable> {
@Override
public void map(Object key, Text value, Context context)
throws IOException, InterruptedException {
String[] words = value.toString().split("[ t]+");
for(String word:words)
{
context.write(new Text(word), new LongWritable(1));
}
}
}
MAP / REDUCE - JAVA - Mapper - method
… we give out the
word as key ...
… and numeric 1 as the value.
MapReduce
MAP / REDUCE - JAVA - Writable
What is "new Text(word) "?
The usual Java types for representing numbers and text were not efficient, so
the MapReduce team designed their own classes, called Writables
Java Hadoop
String Text
long LongWritable
int IntWritable
MapReduce
MAP / REDUCE - JAVA - Writable
What is "new Text(word) "?
Before handling over anything to MapReduce, you need to wrap it into
corresponding writable class or create a new one.
new Text(word)
new LongWritable(1)
Wrapping
value.toString()
Unwrapping
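A minimal sketch of wrapping and unwrapping (the variable names here are just for illustration):
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;

// Wrapping: plain Java values go into Writable containers
Text wrappedWord = new Text("cow");              // String -> Text
LongWritable wrappedCount = new LongWritable(1); // long -> LongWritable

// Unwrapping: Writables back into plain Java values
String word = wrappedWord.toString();            // Text -> String
long count = wrappedCount.get();                 // LongWritable -> long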
MapReduce
public class StubMapper extends Mapper<Object, Text, Text,
LongWritable> {
@Override
public void map(Object key, Text value, Context context)
throws IOException, InterruptedException {
String[] words = value.toString().split("[ t]+");
for(String word:words)
{
context.write(new Text(word), new LongWritable(1));
}
}
}
MAP / REDUCE - Java - Full Code
Create a Mapper
MapReduce
MAP / REDUCE - Java - Full Code
Take a look at the complete code in the GitHub folder:
https://github.com/singhabhinav/cloudxlab/tree/master/hdpexamples/java/src/com/cloudxlab/wordcount
MapReduce
MAP / REDUCE - Java - Complete Code
The output of Mapper
Record
(0, "this is a cow")
(15, "this is a buffalo")
(34, "there is a hen")
InputSplit
this 1
is 1
a 1
cow 1
this 1
is 1
a 1
buffalo 1
there 1
is 1
a 1
hen 1
StubMapper.map()
StubMapper.map()
StubMapper.map()
MapReduce
MAP / REDUCE - JAVA - Reducer
public class StubReducer extends Reducer<Text, LongWritable, Text,
LongWritable> {
@Override
public void reduce(Text key, Iterable<LongWritable> values, Context context)
throws IOException, InterruptedException {
long sum = 0;
for(LongWritable iw:values)
{
sum += iw.get();
}
context.write(key, new LongWritable(sum));
}
}
Create a Reducer
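For example, the framework groups the mapper output by key before calling
reduce(), so for our sample input the reducer is invoked once per word, e.g.
with ("is", [1, 1, 1]), and writes ("is", 3).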
MapReduce
MAP / REDUCE - JAVA
public class StubDriver {
public static void main(String[] args) throws Exception {
Job job = Job.getInstance();
job.setJarByClass(StubDriver.class);
job.setMapperClass(StubMapper.class);
job.setReducerClass(StubReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(LongWritable.class);
FileInputFormat.addInputPath(job, new Path("/data/mr/wordcount/input/big.txt"));
FileOutputFormat.setOutputPath(job, new Path("javamrout"));
boolean result = job.waitForCompletion(true);
System.exit(result ? 0 : 1);
}
}
Create a Driver
MapReduce
MAP / REDUCE - JAVA
Writing Map-Reduce in Java (Continued)
9. Export jar
10. scp jar to the hadoop server
11. Run it using the following command:
hadoop jar sandeep/training2.jar StubDriver <arguments>
e.g: hadoop jar sandeep/training2.jar StubDriver
/users/root/wordcount/input
/users/root/wordcount/output16/
12. In case external jars are needed, use -libjars (see below)
13. Testing: Add all the jars provided
Using external Jars:
$ export LIBJARS=/path/jar1,/path/jar2
$ hadoop jar my-example.jar com.example.MyTool -libjars ${LIBJARS}
MapReduce
MAP / REDUCE - JAVA - Hands-On
## These are the examples of Map-Reduce
git clone https://github.com/<your github login>/bigdata.git
cd cloudxlab/hdpexamples/java
ant jar
To run the word-count MapReduce job, use:
hadoop jar build/jar/hdpexamples.jar
com.cloudxlab.wordcount.StubDriver
MapReduce
MAP / REDUCE - INPUT SPLITS (CONT.)
public abstract class InputSplit {
public abstract long getLength();
public abstract String[] getLocations();
}
• Has length and locations
• Largest gets processed first
• InputFormat creates splits
• Default one is TextInputFormat
• Extend it to create custom
splits/records (see the sketch below)
public abstract class InputFormat<K, V> {
public abstract List<InputSplit> getSplits(JobContext context);
public abstract RecordReader<K, V> createRecordReader(InputSplit split,
TaskAttemptContext context);
}
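As a minimal sketch of such an extension (the class name is an assumption), a subclass of TextInputFormat can make files non-splittable, so each mapper processes a whole file:
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class WholeFileTextInputFormat extends TextInputFormat {
    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        // One split per file, regardless of HDFS block size
        return false;
    }
}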
MapReduce
MAP / REDUCE - Secondary Sorting
• The key-value pairs generated by Mapper are sorted by key
• Reducer receives the values for each key.
• These values are not sorted.
• To have these sorted, you need to use Secondary Sorting.
[Diagram: Mapper → partitioning → sorting (primary & secondary) → grouping → Reducer → HDFS]
MapReduce
MAP / REDUCE - Secondary Sorting
1. Define Sorting:
a. Create a WritableComparable class to serve as the "key"
b. In this class, return the Primary and Secondary Keys.
2. Define Grouping
a. Create a Grouping class by extending WritableComparator
3. Define Partitioning
a. Extend Partitioner and implement how to partition on the Primary Key
A sketch of step 1 follows below.
See the folder "nextword" from the "Session 5" project
More: here and here and here and in "Hadoop: The Definitive Guide".
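A minimal sketch of step 1, assuming a hypothetical composite key of (word, count): a WritableComparable that sorts by the primary key first and the secondary key next.
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

public class WordCountPair implements WritableComparable<WordCountPair> {
    private String word; // primary key
    private long count;  // secondary key

    public WordCountPair() {} // required no-arg constructor

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeUTF(word);
        out.writeLong(count);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        word = in.readUTF();
        count = in.readLong();
    }

    @Override
    public int compareTo(WordCountPair other) {
        int cmp = word.compareTo(other.word);    // primary sort
        if (cmp != 0) return cmp;
        return Long.compare(other.count, count); // secondary sort, descending
    }
}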
MapReduce
MAP / REDUCE - DATA FLOW WITH SINGLE REDUCER
Network Transfer
Local Transfer
Node
MapReduce
MAP / REDUCE - MULTIPLE REDUCERS
MapReduce
MAP / REDUCE - PARTITIONER
• Defines the key for partitioning
• Decides which key goes to which reducers
public static class AgePartitioner extends Partitioner<Text, Text> {
@Override
public int getPartition(Text gender, Text value, int numReduceTasks) {
if (gender.toString().equals("M"))
return 0;
else
return 1;
}
}
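Assuming the driver shown earlier, wiring in this partitioner and matching the number of reducers to its two partitions looks like this:
job.setPartitionerClass(AgePartitioner.class);
job.setNumReduceTasks(2); // one reducer per partition returned above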
MapReduce
MAP / REDUCE - HOW MANY REDUCERS?
• By default, one
• Too many reducers: the shuffling effort is high
• Too few reducers: computation takes a long time
• Tune it to the total number of available slots
MapReduce
MAP / REDUCE - COMBINER FUNCTIONS
• Runs on the same node after Map has finished
• Processes the output of Map
• Helps minimise the data transfer
• Does not replace reducer
• Should be commutative and associative
MapReduce
MAP / REDUCE - COMBINER FUNCTIONS
• Defined Using Reducer Class - Same signature as reducer
• No matter how it is applied, the output should be the same
• Examples: Sum, Min, Max
• max(0, 20, 10, 25) = max(max(0, 20), max(10, 25)) = max(20, 25) = 25
• = max(max(0, 10), max(20, 25)) = max(10, 25) = 25
• Not: average or mean
• avg(0, 20, 10, 25) = 13.75
• but avg(avg(0, 20), avg(10, 25)) = avg(10, 17.5) = 13.75
• while avg(avg(0, 10, 20), avg(25)) = avg(10, 25) = 17.5
• Quiz: is the function f(a, b, c, …) = sqrt(a*a + b*b + c*c + …)
commutative and associative?
job.setCombinerClass(MaxTemperatureReducer.class);
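In our word-count example, the reducer can double as the combiner, since summing counts is commutative and associative:
job.setCombinerClass(StubReducer.class); // partial sums are computed on the map side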
MapReduce
MAP / REDUCE - Job Chaining
Map1 Reduce1
Job1
Map2 Reduce2
Job2
Method1:
Using our Java Code:
if(job1.waitForCompletion(true))
{
job2.waitForCompletion(true);
}
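A fuller sketch of Method 1 (the intermediate directory name here is an assumption): job2 reads the directory that job1 wrote.
Job job1 = Job.getInstance();
// ... set jar, mapper, reducer and output types for job1 ...
FileInputFormat.addInputPath(job1, new Path("inputdir"));
FileOutputFormat.setOutputPath(job1, new Path("intermediate"));

Job job2 = Job.getInstance();
// ... set jar, mapper, reducer and output types for job2 ...
FileInputFormat.addInputPath(job2, new Path("intermediate"));
FileOutputFormat.setOutputPath(job2, new Path("outputdir"));

if (job1.waitForCompletion(true)) {
    System.exit(job2.waitForCompletion(true) ? 0 : 1);
}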
MapReduce
MAP / REDUCE - Job Chaining
Method2: Using Unix
hadoop jar x.jar Driver1 inputdir outputdir1 && hadoop jar x.jar Driver2 outputdir1 outdir2
Method3: Using Oozie
We will discuss it later.
Method4: Using dependencies
//job2 can’t start until job1 completes
job2.addDependingJob(job1);
See this project.
In this project, we chain our previously built word count with a new job that
orders the words in descending order of counts
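A minimal sketch of Method 4, assuming the new-API JobControl classes (org.apache.hadoop.mapreduce.lib.jobcontrol), where addDependingJob() lives on ControlledJob:
import org.apache.hadoop.mapreduce.lib.jobcontrol.ControlledJob;
import org.apache.hadoop.mapreduce.lib.jobcontrol.JobControl;

// inside a main(...) that throws Exception
ControlledJob cjob1 = new ControlledJob(job1.getConfiguration());
ControlledJob cjob2 = new ControlledJob(job2.getConfiguration());
cjob2.addDependingJob(cjob1); // cjob2 can't start until cjob1 completes

JobControl control = new JobControl("wordcount-chain");
control.addJob(cjob1);
control.addJob(cjob2);
new Thread(control).start(); // run() blocks, so drive it from a thread
while (!control.allFinished()) {
    Thread.sleep(500);
}
control.stop();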
MapReduce
1. For running C/C++ code
2. Lower overhead than Streaming (communicates over sockets rather than stdin/stdout)
3. You can run as following:
$ bin/hadoop pipes -input inputPath -output outputPath -program path/to/executable
MAP / REDUCE - Pipes
MapReduce
Thank you.
Hadoop & Spark
+1 419 665 3276 (US)
+91 803 959 1464 (IN)
support@knowbigdata.com
Subscribe to our Youtube channel for latest videos -
https://www.youtube.com/channel/UCxugRFe5wETYA7nMH6VGyEA
MapReduce
Writing Map-Reduce in Java
1. Install Eclipse
2. Create a Java project
3. Add Libs: hadoop-mapreduce-client-core.jar,
hadoop-common.jar
4. Change JDK to 7.0
5. Change the Java Compiler settings to 1.6
MAP / REDUCE - JAVA
MapReduce
MAP / REDUCE - JAVA
Checkout & Follow instructions at
https://github.com/girisandeep/mrexamples
MapReduce
MAP / REDUCE - JAVA
public class StubTest {
@Before
public void setUp() {
mapDriver = new MapDriver<Object, Text, Text, LongWritable>();
mapDriver.setMapper(new StubMapper());
reduceDriver = new ReduceDriver<Text, LongWritable, Text, LongWritable>();
reduceDriver.setReducer(new StubReducer());
mapReduceDriver = new MapReduceDriver<Object, Text, Text, LongWritable, Text,
LongWritable>();
mapReduceDriver.setMapper(new StubMapper());
mapReduceDriver.setReducer(new StubReducer());
}
@Test
public void testXYZ() {
….
}
}
14. Create Test Case
MapReduce
MAP / REDUCE - JAVA
@Test
public void testMapReduce() throws IOException {
mapReduceDriver.addInput(new Pair<Object, Text>
("1", new Text("sandeep giri is here")));
mapReduceDriver.addInput(new Pair<Object, Text>
("2", new Text("teach the map and reduce class is fun.")));
List<Pair<Text, LongWritable>> output = mapReduceDriver.run();
for (Pair<Text, LongWritable> p : output) {
System.out.print(p.getFirst() + "-" + p.getSecond());
//assert here
….
}
}
15. Create a test case
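For comparison, a minimal mapper-only test using MRUnit's fluent API might look like this (the input line is just an example):
@Test
public void testMapper() throws IOException {
    mapDriver.withInput(new LongWritable(0), new Text("this is a cow"))
             .withOutput(new Text("this"), new LongWritable(1))
             .withOutput(new Text("is"), new LongWritable(1))
             .withOutput(new Text("a"), new LongWritable(1))
             .withOutput(new Text("cow"), new LongWritable(1))
             .runTest();
}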
MapReduce
Custom Writable
● Objects that are serialized need to extend Writable
● Examples: Text, IntWritable, LongWritable, FloatWritable, BooleanWritable etc. (See)
● You can define your own (see the sketch below)
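A minimal sketch of defining your own, assuming a hypothetical value type holding two ints:
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

public class PointWritable implements Writable {
    private int x;
    private int y;

    public PointWritable() {} // required no-arg constructor

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeInt(x);
        out.writeInt(y);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        x = in.readInt();
        y = in.readInt();
    }
}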
MapReduce
AVAILABLE INPUT SPLITS
• You can directly read files inside your mapper (see the sketch below):
• FileSystem fs = FileSystem.get(URI.create(uri), conf);
• Third Party - GZip Splittable: http://niels.basjes.nl/splittable-gzip
• https://github.com/twitter/hadoop-lzo
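Expanding on the FileSystem bullet, a minimal sketch (the uri variable is an assumption) of reading a side file inside a mapper's setup():
Configuration conf = context.getConfiguration();
FileSystem fs = FileSystem.get(URI.create(uri), conf);
try (BufferedReader reader = new BufferedReader(
        new InputStreamReader(fs.open(new Path(uri))))) {
    String line;
    while ((line = reader.readLine()) != null) {
        // e.g. load each line into an in-memory lookup table
    }
}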
Notes