MapReduce with Hadoop

MapReduce with HADOOP

Vitalie Scurtu

What is hadoop?
Hadoop is a set of open source frameworks for
parallel and distributive computing:
• HDFS: Distributed file system
• MapReduce: A technique and a framework for
parallel computation in cluster.
• ZooKeeper: A configuration service.
• and others: Hive ,HBase ,Mahout, Pig.
• Yahoo's Hadoop clusters was used to sort 1 terabyte of data in 209
seconds in Terabyte Sorting Competition.

Why distributed computing?
• Reduced costs. More computers are cheaper
then more powerful computer.
• Scalability. We can add new computer to the
cluster anytime.
• Super power and super speed.
• Distributed algorithms.
• Stability
• Robust frameworks.

Configuring Hadoop
• It is java and it uses xml file for configuration.
• Installation is very simple.
• Every computer can become a part of the cluster.
• To try a demo we need only 30 minutes.
• Uses an advanced configuration system named
ZooKeeper
• cat /usr/local/hadoop/conf/slaves
hadoop-master
hadoop-slave01
hadoop-slave02
hadoop-slave03
hadoop-slave06

HDFS
Hadoop Distributed File System
• Distributed file system
• Support for huge files (GB, terrabyte)
• Hardware Failure safe, replication
• File access model is “Write-once-read-many”
• Cross-platform (java)

MapReduce
• An uniq model for distributed computation, main algorithm is divided in
two
– Map
• Accepts in input key-value pairs (dictionary)
• Records must be independend (Key A does not depend on Key B)
• It does the intermediary computations and prepares the data for Reduce stage.
– Reduce
• Accepts in input collections of key-value with intermediary results.
• Parallel Sorting and Grouping functions.
• Returns the final result.
– Map -> Reduce
• It is not only a distributed framework but also a development methodology thanks to its
uniq formula. The algorithms contrains makes it possible for the developer to think
about implementation and not to focus on the parallel computation. Once a problem is
transormed into a MapReduce algorithm, the framework is applicable.
– Computation time: max(time_of_each_map) + max(time_of_each_reduce)

MapReduce

Map1

Map2
Reduce Output
Input Map3

Map4

Example of Applications
• Problem: Extract all the texts from a database
with 1 million posts and compute the occurency
of each token.
mapper.py <- Takes as input an id
-> Prints each token with its occurency
reducer.py <- Takes as input a list of tokens with
ids occurency
-> Sums the occurency of all tokens
and outputs the final result.

Experiment 1, 100K docs, 5 slaves
• Time without MapReduce
– 906.63user
– 4.18system
– 0:14:32 elapsed
– 104%CPU (0avgtext+0avgdata 0maxresident)k
• Time with MapReduce
– 3.79user
– 0.40system
– 0:21:00 elapsed

– 10/10/25 11:10:36 INFO streaming.StreamJob: map 0% reduce 0%

Experiment 2, 1M doc, 5 slaves
– 6892.08user
– 25.03system
– 1:56:37 elapsed
– 6.30user
– 0.98system
– 3:26:18elapsed


Experiment 3, 1M doc, 3 slaves
– 6892.08user
– 25.03system
– 1:56:37 elapsed
– 5.50user
– 0.97system
– 00:53:20elapsed

What’s next?
• MapReduce can be applied in many problems
and natural language processing applications.
Examples
– Sentiment analysis.
– Computing probabilities of huge data.
– Retrieval problem.
– Huge data statistics and analysis.
– MapReduce is not only a framework it is also a
distributed computing methodology.

MapReduce with Hadoop

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Viewers also liked

Viewers also liked (17)

Similar to MapReduce with Hadoop

Similar to MapReduce with Hadoop (20)

Recently uploaded

Recently uploaded (20)

MapReduce with Hadoop