Java MapReduce Programming on Apache Hadoop
Aaron T. Myers, aka ATM
with thanks to Sandy Ryza
Introductions
● Software Engineer/Tech Lead for HDFS at Cloudera
● Committer/PMC Member on the Apache Hadoop project
● My work focuses primarily on HDFS and Hadoop security
What is MapReduce?
● A distributed programming paradigm
What is a distributed programming paradigm?
Help!
Distributed Systems are Hard
● Monitoring
● RPC protocols, serialization
● Fault tolerance
● Deployment
● Scheduling/Resource Management
Writing Data Parallel Programs Should Not Be
MapReduce to the Rescue
● You specify map(...) and reduce(...) functions
○ map = (k1, v1) -> list(k2, v2)
○ reduce = (k2, list(v2)) -> list(k3, v3)
● The framework does the rest
○ Split up the data
○ Run several mappers over the splits
○ Shuffle the data around for the reducers
○ Run several reducers
○ Store the final results
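The division of labor above can be sketched in a single process. The following is a minimal plain-Java illustration, not Hadoop code: the class name MiniMapReduce and its methods are hypothetical, everything runs in memory, and the "shuffle" is just a sorted grouping by key. It uses word counting as the example.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Single-process illustration of map -> shuffle -> reduce for word counting. */
class MiniMapReduce {

    /** map: one input line -> a list of (word, 1) pairs. */
    public static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                out.add(new AbstractMap.SimpleEntry<>(word, 1));
            }
        }
        return out;
    }

    /** shuffle: group every emitted value under its key, with keys in sorted order. */
    public static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    /** reduce: one key plus all of its values -> a single total. */
    public static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int value : values) {
            sum += value;
        }
        return sum;
    }

    /** Runs the three phases in order over the given input lines. */
    public static Map<String, Integer> run(List<String> lines) {
        List<Map.Entry<String, Integer>> mapped = new ArrayList<>();
        for (String line : lines) {
            mapped.addAll(map(line));
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> entry : shuffle(mapped).entrySet()) {
            result.put(entry.getKey(), reduce(entry.getKey(), entry.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "apple apple banana",
            "a happy airplane",
            "airplane on the runway",
            "runway apple runway",
            "rumple on the apple");
        System.out.println(run(lines));
    }
}
```

In a real job the framework runs map() and reduce() on many machines and moves the grouped data between them; only the two functions are yours to write.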
Map
[Figure: the input data is split into map inputs, one line each; map() runs on every line and emits a (word, 1) pair per word; the map outputs then go to the shuffle.]
Input Data / Map Inputs:
apple apple banana
a happy airplane
airplane on the runway
runway apple runway
rumple on the apple
Map Outputs:
apple - 1, apple - 1, banana - 1
a - 1, happy - 1, airplane - 1
airplane - 1, on - 1, the - 1, runway - 1
runway - 1, apple - 1, runway - 1
rumple - 1, on - 1, the - 1, apple - 1
Reduce
[Figure: the shuffle groups the map outputs by key; reduce() runs once per key and sums that key's values to produce the reduce output.]
Shuffle (reduce inputs):
a - 1
airplane - 1, 1
apple - 1, 1, 1, 1
banana - 1
happy - 1
on - 1, 1
rumple - 1
runway - 1, 1, 1
the - 1, 1
Reduce Output:
a - 1
airplane - 2
apple - 4
banana - 1
happy - 1
on - 2
rumple - 1
runway - 3
the - 2
What is (Core) Hadoop?
● An open source platform for storing, processing, and analyzing enormous amounts of data
● Consists of…
○ A distributed file system (HDFS)
○ An implementation of the Map/Reduce paradigm (Hadoop MapReduce)
● Written in Java!
What is Hadoop?
Traditional Operating System
● Storage: File System
● Execution/Scheduling: Processes
What is Hadoop?
Hadoop (Distributed operating system)
● Storage: Hadoop Distributed File System (HDFS)
● Execution/Scheduling: MapReduce
HDFS (briefly)
● Distributed file system that runs on all nodes in the cluster
○ Co-located with Hadoop MapReduce daemons
● Looks like a pretty normal Unix file system
○ hadoop fs -ls /user/atm/
○ hadoop fs -cp /user/atm/data.txt /user/atm/data2.txt
○ hadoop fs -rm /user/atm/data.txt
○ …
● Don’t use the normal Java File API
○ Instead use org.apache.hadoop.fs.FileSystem API
Writing MapReduce programs in Java
● Interface to MapReduce in Hadoop is a Java API
● WordCount!
Word Count Map Function
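import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Emits (word, 1) for every whitespace-separated token in the input line.
public class WordCountMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {
  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();
  public void map(LongWritable key, Text value,
      OutputCollector<Text, IntWritable> output,
      Reporter reporter) throws IOException {
    String line = value.toString();
    StringTokenizer itr = new StringTokenizer(line);
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      output.collect(word, one);
    }
  }
}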
Word Count Reduce Function
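import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Sums the counts emitted for one word and writes (word, total).
public class WordCountReducer extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {
  public void reduce(Text key, Iterator<IntWritable> values,
      OutputCollector<Text, IntWritable> output,
      Reporter reporter) throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));
  }
}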
Word Count Driver
InputFormats
● TextInputFormat
○ Each line becomes <LongWritable, Text> = <byte offset in file, whole line>
● KeyValueTextInputFormat
○ Splits lines on delimiter into Text key and Text value
● SequenceFileInputFormat
○ Reads key/value pairs from SequenceFile, a Hadoop format
● DBInputFormat
○ Uses JDBC to connect to a database
● Many more, or write your own!
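To make the record shapes concrete, here is a minimal plain-Java sketch (no Hadoop classes) of the key/value pairs the first two formats would hand to map(). The class InputFormatSketch and its methods are hypothetical names used only for illustration. It assumes '\n' line endings and a single-character delimiter, and it ignores how real input formats deal with split boundaries.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Plain-Java illustration of the records two text-based input formats produce. */
class InputFormatSketch {

    /** TextInputFormat-style records: key = byte offset of the line start, value = the line. */
    public static List<String[]> textRecords(String fileContents) {
        List<String[]> records = new ArrayList<>();
        long offset = 0;
        for (String line : fileContents.split("\n")) {
            records.add(new String[] { Long.toString(offset), line });
            // Advance past the line's bytes plus one byte for the '\n' terminator.
            offset += line.getBytes(StandardCharsets.UTF_8).length + 1;
        }
        return records;
    }

    /** KeyValueTextInputFormat-style record: split the line at the first delimiter. */
    public static String[] keyValueRecord(String line, char delimiter) {
        int i = line.indexOf(delimiter);
        if (i < 0) {
            // No delimiter: the whole line is the key and the value is empty.
            return new String[] { line, "" };
        }
        return new String[] { line.substring(0, i), line.substring(i + 1) };
    }

    public static void main(String[] args) {
        for (String[] record : textRecords("apple apple banana\na happy airplane\n")) {
            System.out.println("<" + record[0] + ", " + record[1] + ">");
        }
        String[] kv = keyValueRecord("apple\t4", '\t');
        System.out.println("<" + kv[0] + ", " + kv[1] + ">");
    }
}
```

The byte-offset key is rarely useful on its own, which is why the word-count mapper ignores its LongWritable key and only reads the Text value.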
Serialization
● Writables
○ Native to Hadoop
○ Implement serialization for higher level structures yourself
● Avro
○ Extensible
○ Cross-language
○ Handles serialization of higher level structures for you
● And others…
○ Parquet, Thrift, etc.
Writables
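import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

public class MyNumberAndStringWritable implements Writable {
  private int number;
  private String str;
  // Write the fields in a fixed order.
  public void write(DataOutput out) throws IOException {
    out.writeInt(number);
    out.writeUTF(str);
  }
  // Read the fields back in the same order they were written.
  public void readFields(DataInput in) throws IOException {
    number = in.readInt();
    str = in.readUTF();
  }
}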
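The write/readFields contract can be tried without a cluster, because DataOutput and DataInput are plain java.io interfaces. The sketch below round-trips an object through a byte array. SimpleWritable is a hypothetical stand-in for org.apache.hadoop.io.Writable so that the example needs no Hadoop jars, and the helper names are illustrative only.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

/** Round-trips a Writable-style object through a byte array. */
class WritableRoundTrip {

    /** Hypothetical stand-in for org.apache.hadoop.io.Writable. */
    interface SimpleWritable {
        void write(DataOutput out) throws IOException;
        void readFields(DataInput in) throws IOException;
    }

    static class NumberAndString implements SimpleWritable {
        int number;
        String str = "";

        NumberAndString() {}

        NumberAndString(int number, String str) {
            this.number = number;
            this.str = str;
        }

        public void write(DataOutput out) throws IOException {
            out.writeInt(number);
            out.writeUTF(str);
        }

        public void readFields(DataInput in) throws IOException {
            // Fields must be read in exactly the order they were written.
            number = in.readInt();
            str = in.readUTF();
        }
    }

    /** Serializes the object to bytes using its own write() method. */
    public static byte[] serialize(SimpleWritable writable) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            writable.write(out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Fills an existing object from bytes using its own readFields() method. */
    public static void deserialize(byte[] data, SimpleWritable target) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            target.readFields(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] data = serialize(new NumberAndString(42, "apple"));
        NumberAndString copy = new NumberAndString();
        deserialize(data, copy);
        System.out.println(copy.number + " " + copy.str + " (" + data.length + " bytes)");
    }
}
```

Note that the target object is created empty and then filled in by readFields; this is why a Writable-style class needs a no-argument constructor.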
Avro
protocol MyMapReduceObjects {
  record MyNumberAndString {
    string str;
    int number;
  }
}
Testing MapReduce Programs
● First, write unit tests (duh) with MRUnit
● LocalJobRunner
○ Runs job in single process
● Single-node cluster (Cloudera VM!)
○ Multiple processes on the same machine
● On the real cluster
MRUnit
@Test
public void testMapper() throws IOException {
  MapDriver<LongWritable, Text, Text, IntWritable> mapDriver =
      new MapDriver<LongWritable, Text, Text, IntWritable>(new WordCountMapper());
  String line = "apple banana banana carrot";
  mapDriver.withInput(new LongWritable(0), new Text(line));
  mapDriver.withOutput(new Text("apple"), new IntWritable(1));
  mapDriver.withOutput(new Text("banana"), new IntWritable(1));
  mapDriver.withOutput(new Text("banana"), new IntWritable(1));
  mapDriver.withOutput(new Text("carrot"), new IntWritable(1));
  mapDriver.runTest();
}
MRUnit
@Test
public void testReducer() throws IOException {
  ReduceDriver<Text, IntWritable, Text, IntWritable> reduceDriver =
      new ReduceDriver<Text, IntWritable, Text, IntWritable>(new WordCountReducer());
  reduceDriver.withInput(new Text("apple"),
      Arrays.asList(new IntWritable(1), new IntWritable(2)));
  reduceDriver.withOutput(new Text("apple"), new IntWritable(3));
  reduceDriver.runTest();
}
Counters
Map-Reduce Framework
  Map input records=183
  Map output records=183
  Map output bytes=533563
  Map output materialized bytes=534190
  Input split bytes=144
  Combine input records=0
  Combine output records=0
  Reduce input groups=183
  Reduce shuffle bytes=0
  Reduce input records=183
  Reduce output records=183
  Spilled Records=366
  Shuffled Maps =0
  Failed Shuffles=0
  Merged Map outputs=0
  GC time elapsed (ms)=7
  CPU time spent (ms)=0
  Physical memory (bytes) snapshot=0
  Virtual memory (bytes) snapshot=0
File System Counters
  FILE: Number of bytes read=1844866
  FILE: Number of bytes written=1927344
  FILE: Number of read operations=0
  FILE: Number of large read operations=0
  FILE: Number of write operations=0
File Input Format Counters
  Bytes Read=655137
File Output Format Counters
  Bytes Written=537484
Counters
if (record.isUgly()) {
  context.getCounter("Ugly Record Counters",
      "Ugly Records").increment(1);
}
Counters
(Same built-in counter groups as in the first listing, now followed by the custom group.)
Ugly Record Counters
  Ugly Records=1024
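A custom counter is identified by a group name and a counter name, and the job total is the sum of what every task reported. The following plain-Java sketch illustrates that bookkeeping only. CounterSketch is a hypothetical class, not Hadoop's Counters implementation, and the split of 600 and 424 between two tasks is an assumed example that happens to add up to the 1024 shown above.

```java
import java.util.Map;
import java.util.TreeMap;

/** Plain-Java sketch of named counters: per-task tallies merged into job totals. */
class CounterSketch {
    // group name -> (counter name -> value)
    private final Map<String, Map<String, Long>> groups = new TreeMap<>();

    /** Adds amount to the counter identified by (group, name), creating it on first use. */
    public void increment(String group, String name, long amount) {
        groups.computeIfAbsent(group, g -> new TreeMap<>()).merge(name, amount, Long::sum);
    }

    /** Returns the current value, or 0 if the counter was never incremented. */
    public long get(String group, String name) {
        Map<String, Long> counters = groups.get(group);
        if (counters == null) {
            return 0L;
        }
        Long value = counters.get(name);
        return value == null ? 0L : value;
    }

    /** Folds another task's counters into this one. */
    public void mergeFrom(CounterSketch other) {
        other.groups.forEach((group, counters) ->
            counters.forEach((name, value) -> increment(group, name, value)));
    }

    public static void main(String[] args) {
        CounterSketch task1 = new CounterSketch();
        CounterSketch task2 = new CounterSketch();
        task1.increment("Ugly Record Counters", "Ugly Records", 600);
        task2.increment("Ugly Record Counters", "Ugly Records", 424);

        CounterSketch job = new CounterSketch();
        job.mergeFrom(task1);
        job.mergeFrom(task2);
        System.out.println("Ugly Records=" + job.get("Ugly Record Counters", "Ugly Records"));
    }
}
```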
Distributed Cache
We need some data and libraries on all the nodes.
Distributed Cache
[Figure: files stored in HDFS are distributed through the Distributed Cache to a local copy on each node, where that node's map or reduce tasks read them.]
Distributed Cache
In our driver:
DistributedCache.addCacheFile(
    new URI("/some/path/to/ourfile.txt"), conf);
In our mapper or reducer:
// Local paths of the cached files on this node.
private Path[] localFiles;
@Override
public void setup(Context context) throws IOException,
    InterruptedException {
  Configuration conf = context.getConfiguration();
  localFiles = DistributedCache.getLocalCacheFiles(conf);
}
Java Technologies Built on MapReduce
Crunch
● Library on top of MapReduce that makes it easy to write pipelines of jobs in Java
● Contains capabilities like joins and aggregation functions to save programmers from writing these for each job
Crunch
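import org.apache.crunch.DoFn;
import org.apache.crunch.Emitter;
import org.apache.crunch.PCollection;
import org.apache.crunch.PTable;
import org.apache.crunch.Pipeline;
import org.apache.crunch.impl.mr.MRPipeline;
import org.apache.crunch.lib.Aggregate;
import org.apache.crunch.types.writable.Writables;

public class WordCount {
  public static void main(String[] args) throws Exception {
    Pipeline pipeline = new MRPipeline(WordCount.class);
    PCollection<String> lines = pipeline.readTextFile(args[0]);
    PCollection<String> words = lines.parallelDo("my splitter", new DoFn<String, String>() {
      public void process(String line, Emitter<String> emitter) {
        // Split on runs of whitespace.
        for (String word : line.split("\\s+")) {
          emitter.emit(word);
        }
      }
    }, Writables.strings());
    PTable<String, Long> counts = Aggregate.count(words);
    pipeline.writeTextFile(counts, args[1]);
    pipeline.run();
  }
}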
Mahout
● Machine Learning on Hadoop
○ Collaborative Filtering
○ User and Item based recommenders
○ K-Means, Fuzzy K-Means clustering
○ Dirichlet process clustering
○ Latent Dirichlet Allocation
○ Singular value decomposition
○ Parallel Frequent Pattern mining
○ Complementary Naive Bayes classifier
○ Random forest decision tree based classifier
Non-Java technologies that use
MapReduce
● Hive
○ SQL -> M/R translator, metadata manager
● Pig
○ Scripting DSL -> M/R translator
● DistCp
○ HDFS tool to bulk copy data from one HDFS cluster
to another
Thanks!
● Questions?

Hadoop - Introduction to map reduce programming - Reunião 12/04/2014

  • 1. Java MapReduce Programming on Apache Hadoop Aaron T. Myers, aka ATM with thanks to Sandy Ryza
  • 2. Introductions ● Software Engineer/Tech Lead for HDFS at Cloudera ● Committer/PMC Member on the Apache Hadoop project ● My work focuses primarily on HDFS and Hadoop security
  • 3. What is MapReduce? ● A distributed programming paradigm
  • 4. What is a distributed programming paradigm? Help!
  • 5. What is a distributed programming paradigm?
  • 6. Distributed Systems are Hard ● Monitoring ● RPC protocols, serialization ● Fault tolerance ● Deployment ● Scheduling/Resource Management
  • 7. Writing Data Parallel Programs Should Not Be
  • 8. MapReduce to the Rescue ● You specify map(...) and reduce(...) functions ○ map = (list(k, v) -> list(k, v)) ○ reduce = (k, list(v) -> k, v) ● The framework does the rest ○ Split up the data ○ Run several mappers over the splits ○ Shuffle the data around for the reducers ○ Run several reducers ○ Store the final results
  • 9. Map apple apple banana a happy airplane airplane on the runway runway apple runway rumple on the apple apple apple banana a happy airplane airplane on the runway runway apple runway rumple on the apple apple - 1 apple - 1 banana - 1 a - 1 happy - 1 airplane - 1 on - 1 the - 1 runway - 1 runway - 1 runway - 1 apple - 1 rumple - 1 on - 1 the - 1 apple - 1 map() map() map() map() map() Map Inputs Map OutputsInput Data Map Function Shuffle
  • 10. Reduce reduce() reduce() reduce() reduce() reduce() reduce() reduce() reduce() a - 1 airplane - 1 apple - 4 banana - 1 on - 2 runway - 3 rumple - 1 the - 2 a - 1, 1 airplane - 1 apple - 1, 1, 1, 1 banana - 1 on - 1, 1 runway - 1, 1, 1 rumple - 1 the - 1, 1 Shuffle Reduce Output
  • 11. What is (Core) Hadoop? ● An open source platform for storing, processing, and analyzing enormous amounts of data ● Consists of… ○ A distributed file system (HDFS) ○ An implementation of the Map/Reduce paradigm (Hadoop MapReduce) ● Written in Java!
  • 12. What is Hadoop? Traditional Operating System Storage: File System Execution/Scheduling: Processes
  • 13. What is Hadoop? Hadoop (Distributed operating system) Storage: Hadoop Distributed File System (HDFS) Execution/Scheduling: MapReduce
  • 14. HDFS (briefly) ● Distributed file system that runs on all nodes in the cluster ○ Co-located with Hadoop MapReduce daemons ● Looks like a pretty normal Unix file system ○ hadoop fs -ls /user/atm/ ○ hadoop fs -cp /user/atm/data.txt /user/atm/data2.txt ○ hadoop fs -rm /user/atm/data.txt ○ … ● Don’t use the normal Java File API ○ Instead use org.apache.hadoop.fs.FileSystem API
  • 15. Writing MapReduce programs in Java ● Interface to MapReduce in Hadoop is Java API ● WordCount!
  • 16. Word Count Map Function public class WordCountMapper extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> { private final static IntWritable one= new IntWritable(1); private Text word = new Text(); public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable>output, Reporter reporter) throws IOException { String line = value.toString(); StringTokenizer itr = new StringTokenizer(line); while (itr.hasMoreTokens()) { word.set(itr.nextToken()); output.collect(word, one); } } }
  • 17. Word Count Reduce Function public static class WordCountReducer extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> { public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable>output, Reporter reporter) throws IOException { int sum = 0; while (values.hasNext()) { sum += values.next().get(); } output.collect(key, new IntWritable(sum)); } }
  • 19. InputFormats ● TextInputFormat ○ Each line becomes <LongWritable, Text> = <byte offset in file, whole line> ● KeyValueTextInputFormat ○ Splits lines on delimiter into Text key and Text value ● SequenceFileInputFormat ○ Reads key/value pairs from SequenceFile, a Hadoop format ● DBInputFormat ○ Uses JDBC to connect to a database ● Many more, or write your own!
  • 20. Serialization ● Writables ○ Native to Hadoop ○ Implement serialization for higher level structures yourself ● Avro ○ Extensible ○ Cross-language ○ Handles serialization of higher level structures for you ● And others… ○ Parquet, Thrift, etc.
  • 21. Writables public class MyNumberAndStringWritable implements Writable { private int number; private String str; public void write(DataOutput out) throws IOException { out.writeInt(number); out.writeUTF(str); } public void readFields(DataInput in) throws IOException { number = in.readInt(); str = in.readUTF(); } }
  • 22. Avro protocol MyMapReduceObjects { record MyNumberAndString { string str; int number; } }
  • 23. Testing MapReduce Programs ● First, write unit tests (duh) with MRUnit ● LocalJobRunner ○ Runs job in single process ● Single-node cluster (Cloudera VM!) ○ Multiple processes on the same machine ● On the real cluster
  • 24. MRUnit @Test public void testMapper() throws IOException { MapDriver<LongWritable, Text, Text, IntWritable> mapDriver = new MapDriver<LongWritable, Text, Text, IntWritable>(new WordCountMapper()); String line = "apple banana banana carrot"; mapDriver.withInput(new LongWritable(0), new Text(line)); mapDriver.withOutput(new Text("apple"), new IntWritable(1)); mapDriver.withOutput(new Text("banana"), new IntWritable(1)); mapDriver.withOutput(new Text("banana"), new IntWritable(1)); mapDriver.withOutput(new Text("carrot"), new IntWritable(1)); mapDriver.runTest(); }
  • 25. MRUnit @Test public void testReducer() throws IOException { ReduceDriver<Text, IntWritable, Text, IntWritable> reduceDriver = new ReduceDriver<Text, IntWritable, Text, IntWritable>(new WordCountReducer()); reduceDriver.withInput(new Text("apple"), Arrays.asList(new IntWritable(1), new IntWritable(2))); reduceDriver.withOutput(new Text("apple"), new IntWritable(3)); reduceDriver.runTest(); }
  • 26. Counters Map-Reduce Framework Map input records=183 Map output records=183 Map output bytes=533563 Map output materialized bytes=534190 Input split bytes=144 Combine input records=0 Combine output records=0 Reduce input groups=183 Reduce shuffle bytes=0 Reduce input records=183 Reduce output records=183 Spilled Records=366 Shuffled Maps =0 Failed Shuffles=0 Merged Map outputs=0 GC time elapsed (ms)=7 CPU time spent (ms)=0 Physical memory (bytes) snapshot=0 Virtual memory (bytes) snapshot=0 File System Counters FILE: Number of bytes read=1844866 FILE: Number of bytes written=1927344 FILE: Number of read operations=0 FILE: Number of large read operations=0 FILE: Number of write operations=0 File Input Format Counters Bytes Read=655137 File Output Format Counters Bytes Written=537484
  • 27. Counters if (record.isUgly()) { context.getCounter("Ugly Record Counters", "Ugly Records").increment(1); }
  • 28. Counters Map-Reduce Framework Map input records=183 Map output records=183 Map output bytes=533563 Map output materialized bytes=534190 Input split bytes=144 Combine input records=0 Combine output records=0 Reduce input groups=183 Reduce shuffle bytes=0 Reduce input records=183 Reduce output records=183 Spilled Records=366 Shuffled Maps =0 Failed Shuffles=0 Merged Map outputs=0 GC time elapsed (ms)=7 CPU time spent (ms)=0 Physical memory (bytes) snapshot=0 Virtual memory (bytes) snapshot=0 File System Counters FILE: Number of bytes read=1844866 FILE: Number of bytes written=1927344 FILE: Number of read operations=0 FILE: Number of large read operations=0 FILE: Number of write operations=0 File Input Format Counters Bytes Read=655137 File Output Format Counters Bytes Written=537484 Ugly Record Counters Ugly Records=1024
  • 29. Distributed Cache We need some data and libraries on all the nodes.
  • 30. Distributed Cache [Diagram: a file stored in HDFS is copied by the Distributed Cache to a local copy on each node; the map or reduce tasks running on that node read the local copy.]
  • 31. Distributed Cache In our driver: DistributedCache.addCacheFile(new URI("/some/path/to/ourfile.txt"), conf); In our mapper or reducer: @Override public void setup(Context context) throws IOException, InterruptedException { Configuration conf = context.getConfiguration(); localFiles = DistributedCache.getLocalCacheFiles(conf); }
  • 33. Crunch ● Library on top of MapReduce that makes it easy to write pipelines of jobs in Java ● Contains capabilities like joins and aggregation functions to save programmers from writing these for each job
  • 34. Crunch public class WordCount { public static void main(String[] args) throws Exception { Pipeline pipeline = new MRPipeline(WordCount.class); PCollection<String> lines = pipeline.readTextFile(args[0]); PCollection<String> words = lines.parallelDo("my splitter", new DoFn<String, String>() { public void process(String line, Emitter<String> emitter) { for (String word : line.split("\\s+")) { emitter.emit(word); } } }, Writables.strings()); PTable<String, Long> counts = Aggregate.count(words); pipeline.writeTextFile(counts, args[1]); pipeline.run(); } }
  • 35. Mahout ● Machine Learning on Hadoop ○ Collaborative Filtering ○ User and Item based recommenders ○ K-Means, Fuzzy K-Means clustering ○ Dirichlet process clustering ○ Latent Dirichlet Allocation ○ Singular value decomposition ○ Parallel Frequent Pattern mining ○ Complementary Naive Bayes classifier ○ Random forest decision tree based classifier
  • 36. Non-Java technologies that use MapReduce ● Hive ○ SQL -> M/R translator, metadata manager ● Pig ○ Scripting DSL -> M/R translator ● DistCp ○ HDFS tool to bulk copy data from one HDFS cluster to another