2. What is Hadoop?
Hadoop is a set of open source frameworks for
parallel and distributed computing:
• HDFS: Distributed file system
• MapReduce: A technique and a framework for
parallel computation in a cluster.
• ZooKeeper: A coordination and configuration service.
• And others: Hive, HBase, Mahout, Pig.
• Yahoo's Hadoop cluster was used to sort 1 terabyte of data in 209
seconds in the Terabyte Sort competition.
3. Why distributed computing?
• Reduced costs. Many commodity computers are cheaper
than one more powerful computer.
• Scalability. We can add new computers to the
cluster at any time.
• More computing power and higher speed.
• Distributed algorithms.
• Stability
• Robust frameworks.
4. Configuring Hadoop
• It is written in Java and uses XML files for configuration.
• Installation is very simple.
• Every computer can become a part of the cluster.
• A demo takes only 30 minutes to try.
• Uses an advanced configuration system named
ZooKeeper
• cat /usr/local/hadoop/conf/slaves
hadoop-master
hadoop-slave01
hadoop-slave02
hadoop-slave03
hadoop-slave06
5. HDFS
Hadoop Distributed File System
• Distributed file system
• Support for huge files (gigabytes to terabytes)
• Tolerates hardware failure through replication
• File access model is “Write-once-read-many”
• Cross-platform (Java)
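The replication point above can be illustrated with a toy model: a file is cut into fixed-size blocks and each block is copied to several nodes, so losing one node does not lose data. This is only a sketch. The round-robin placement, the tiny block size and all function names are assumptions made for illustration; the placement policy HDFS actually uses is more involved.

```python
def split_into_blocks(data: bytes, block_size: int) -> list:
    """Cut a file's bytes into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks: int, nodes: list, replication: int) -> dict:
    """Assign each block to `replication` distinct nodes (toy round-robin policy)."""
    if replication > len(nodes):
        raise ValueError("replication factor exceeds the number of nodes")
    return {
        block_id: [nodes[(block_id + r) % len(nodes)] for r in range(replication)]
        for block_id in range(num_blocks)
    }


def readable_after_failure(placement: dict, failed_node: str) -> bool:
    """True if every block still has at least one replica on a live node."""
    return all(
        any(node != failed_node for node in replicas)
        for replicas in placement.values()
    )


if __name__ == "__main__":
    nodes = ["hadoop-slave01", "hadoop-slave02", "hadoop-slave03"]
    blocks = split_into_blocks(b"x" * 10, block_size=4)
    placement = place_replicas(len(blocks), nodes, replication=2)
    for block_id, replicas in placement.items():
        print(block_id, replicas)
    print(readable_after_failure(placement, "hadoop-slave01"))
```

With a replication factor of 1 the same check fails as soon as the node holding a block goes down, which is the case replication is meant to prevent.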
6. MapReduce
• A unique model for distributed computation; the main algorithm is divided into
two stages
– Map
• Accepts key-value pairs as input (like a dictionary)
• Records must be independent (Key A does not depend on Key B)
• Performs the intermediate computations and prepares the data for the Reduce stage.
– Reduce
• Accepts as input collections of key-value pairs holding the intermediate results.
• The pairs are sorted and grouped by key, in parallel, before they are reduced.
• Returns the final result.
– Map -> Reduce
• It is not only a distributed framework but also a development methodology, thanks to its
unique formula. The algorithm's constraints let the developer think
about the implementation without focusing on the parallel computation. Once a problem is
transformed into a MapReduce algorithm, the framework is applicable.
– Computation time, if all tasks of a stage run in parallel: max(time_of_each_map) + max(time_of_each_reduce)
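The Map -> Reduce flow described above can be sketched in a few lines of single-process Python. This is a minimal illustration of the model, not Hadoop's implementation: the function names, the sample records and the per-author token count are all assumptions chosen for the example. The last function restates the computation-time formula from this slide.

```python
from collections import defaultdict


def map_phase(records, map_fn):
    """Apply map_fn to each (key, value) record independently."""
    intermediate = []
    for key, value in records:
        intermediate.extend(map_fn(key, value))
    return intermediate


def shuffle(intermediate):
    """Group intermediate values by key and sort the keys."""
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_phase(groups, reduce_fn):
    """Apply reduce_fn to each (key, list_of_values) group."""
    return [reduce_fn(key, values) for key, values in groups]


def run_mapreduce(records, map_fn, reduce_fn):
    return reduce_phase(shuffle(map_phase(records, map_fn)), reduce_fn)


def estimate_wall_time(map_times, reduce_times):
    """Formula from the slide: slowest map task plus slowest reduce task."""
    return max(map_times) + max(reduce_times)


# Hypothetical sample data: (post_id, (author, text))
RECORDS = [
    (1, ("ana", "hadoop is fun")),
    (2, ("bob", "map then reduce")),
    (3, ("ana", "hdfs stores blocks")),
]


def count_tokens_map(post_id, post):
    author, text = post
    return [(author, len(text.split()))]


def sum_reduce(author, counts):
    return (author, sum(counts))


if __name__ == "__main__":
    print(run_mapreduce(RECORDS, count_tokens_map, sum_reduce))
    # → [('ana', 6), ('bob', 3)]
```

Because each record is mapped without looking at any other record, the loop in map_phase is the part a cluster can spread across machines.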
8. Example Application
• Problem: Extract all the texts from a database
with 1 million posts and compute the number of occurrences
of each token.
mapper.py <- Takes a post id as input
-> Prints each token with its number of occurrences
reducer.py <- Takes as input a list of tokens with
their occurrence counts
-> Sums the occurrences of each token
and outputs the final result.
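A mapper and reducer of this kind could look roughly like the sketch below. It is not the code used for the experiments. Two things are assumptions: each input line is taken to be "post_id<TAB>text" (the slides fetch the text from a database by id instead), and both stages live in one file selected by a command-line argument rather than in separate mapper.py and reducer.py files.

```python
import sys
from collections import Counter
from itertools import groupby


def mapper(lines):
    """Emit 'token<TAB>count' for each distinct token of each post.

    Assumes every line is 'post_id<TAB>text'; lines without a tab yield nothing.
    """
    for line in lines:
        _post_id, _sep, text = line.rstrip("\n").partition("\t")
        for token, count in Counter(text.lower().split()).items():
            yield f"{token}\t{count}"


def reducer(lines):
    """Sum the counts of each token.

    Relies on the input being sorted by token, which the framework
    guarantees between the map and reduce stages.
    """
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines if line.strip())
    for token, group in groupby(pairs, key=lambda pair: pair[0]):
        yield f"{token}\t{sum(int(count) for _token, count in group)}"


if __name__ == "__main__" and len(sys.argv) == 2:
    # Usage: python wordcount.py map   or   python wordcount.py reduce
    stage = {"map": mapper, "reduce": reducer}[sys.argv[1]]
    for output_line in stage(sys.stdin):
        print(output_line)
```

The pipeline can be checked locally without a cluster by piping a tab-separated file through the map stage, then `sort`, then the reduce stage; the sort step stands in for the grouping Hadoop performs between the two stages.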
9. Experiment 1, 100K docs, 5 slaves
• Time without MapReduce
– 906.63user
– 4.18system
– 0:14:32 elapsed
– 104%CPU (0avgtext+0avgdata 0maxresident)k
• Time with MapReduce
– 3.79user
– 0.40system
– 0:21:00 elapsed
– 0%CPU (0avgtext+0avgdata 0maxresident)k
– 10/10/25 11:10:36 INFO streaming.StreamJob: map 0% reduce 0%
– 10/10/25 11:10:50 INFO streaming.StreamJob: map 16% reduce 0%
– 10/10/25 11:11:48 INFO streaming.StreamJob: map 33% reduce 0%
– 10/10/25 11:12:10 INFO streaming.StreamJob: map 49% reduce 0%
– 10/10/25 11:14:09 INFO streaming.StreamJob: map 66% reduce 0%
– 10/10/25 11:14:37 INFO streaming.StreamJob: map 82% reduce 0%
– 10/10/25 11:16:26 INFO streaming.StreamJob: map 83% reduce 0%
– 10/10/25 11:18:12 INFO streaming.StreamJob: map 83% reduce 17%
– 10/10/25 11:20:18 INFO streaming.StreamJob: map 99% reduce 17%
10. Experiment 2, 1M docs, 5 slaves
• Time without MapReduce
– 6892.08user
– 25.03system
– 1:56:37 elapsed
– 98%CPU (0avgtext+0avgdata 0maxresident)k
• Time with MapReduce
– 6.30user
– 0.98system
– 3:26:18 elapsed
– 0%CPU (0avgtext+0avgdata 0maxresident)k
– 10/10/26 15:04:36 INFO streaming.StreamJob: map 100% reduce 14%
– 10/10/26 15:04:37 INFO streaming.StreamJob: map 100% reduce 16%
– 10/10/26 15:04:39 INFO streaming.StreamJob: map 100% reduce 25%
– 10/10/26 15:04:40 INFO streaming.StreamJob: map 100% reduce 27%
– 10/10/26 15:04:42 INFO streaming.StreamJob: map 100% reduce 30%
– 10/10/26 15:04:44 INFO streaming.StreamJob: map 100% reduce 32%
– 10/10/26 15:04:45 INFO streaming.StreamJob: map 100% reduce 34%
– 10/10/26 15:04:48 INFO streaming.StreamJob: map 100% reduce 35%
– 10/10/26 15:07:29 INFO streaming.StreamJob: map 83% reduce 35%
– 10/10/26 15:07:35 INFO streaming.StreamJob: map 100% reduce 35%
– 10/10/26 15:09:57 INFO streaming.StreamJob: map 100% reduce 36%
– 10/10/26 15:09:59 INFO streaming.StreamJob: map 100% reduce 37%
11. Experiment 3, 1M docs, 3 slaves
• Time without MapReduce
– 6892.08user
– 25.03system
– 1:56:37 elapsed
– 98%CPU (0avgtext+0avgdata 0maxresident)k
• Time with MapReduce
– 5.50user
– 0.97system
– 00:53:20 elapsed
– 0%CPU (0avgtext+0avgdata 0maxresident)k
– 10/10/26 15:04:36 INFO streaming.StreamJob: map 100% reduce 14%
– 10/10/26 15:04:37 INFO streaming.StreamJob: map 100% reduce 16%
– 10/10/26 15:04:39 INFO streaming.StreamJob: map 100% reduce 25%
– 10/10/26 15:04:40 INFO streaming.StreamJob: map 100% reduce 27%
– 10/10/26 15:04:42 INFO streaming.StreamJob: map 100% reduce 30%
– 10/10/26 15:04:44 INFO streaming.StreamJob: map 100% reduce 32%
– 10/10/26 15:04:45 INFO streaming.StreamJob: map 100% reduce 34%
– 10/10/26 15:04:48 INFO streaming.StreamJob: map 100% reduce 35%
– 10/10/26 15:07:29 INFO streaming.StreamJob: map 83% reduce 35%
– 10/10/26 15:07:35 INFO streaming.StreamJob: map 100% reduce 35%
– 10/10/26 15:09:57 INFO streaming.StreamJob: map 100% reduce 36%
– 10/10/26 15:09:59 INFO streaming.StreamJob: map 100% reduce 37%
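To compare the three experiments, the elapsed times reported above can be turned into a speedup ratio (time without MapReduce divided by time with MapReduce). The snippet below only does that arithmetic on the figures from the slides; the helper names are made up for the example.

```python
def to_seconds(elapsed: str) -> int:
    """Convert an 'H:MM:SS' elapsed time into seconds."""
    hours, minutes, seconds = (int(part) for part in elapsed.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def speedup(without_mapreduce: str, with_mapreduce: str) -> float:
    """Ratio above 1 means the MapReduce run finished faster."""
    return to_seconds(without_mapreduce) / to_seconds(with_mapreduce)


# (label, elapsed without MapReduce, elapsed with MapReduce), copied from the slides
EXPERIMENTS = [
    ("Experiment 1, 100K docs, 5 slaves", "0:14:32", "0:21:00"),
    ("Experiment 2, 1M docs, 5 slaves", "1:56:37", "3:26:18"),
    ("Experiment 3, 1M docs, 3 slaves", "1:56:37", "00:53:20"),
]

if __name__ == "__main__":
    for label, baseline, mapreduce in EXPERIMENTS:
        print(f"{label}: {speedup(baseline, mapreduce):.2f}x")
    # → Experiment 1, 100K docs, 5 slaves: 0.69x
    # → Experiment 2, 1M docs, 5 slaves: 0.57x
    # → Experiment 3, 1M docs, 3 slaves: 2.19x
```

By these figures, only Experiment 3 finished faster with MapReduce than without it.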
12. What’s next?
• MapReduce can be applied to many problems
and natural language processing applications.
Examples:
– Sentiment analysis.
– Computing probabilities over huge data sets.
– Retrieval problems.
– Statistics and analysis of huge data sets.
• MapReduce is not only a framework; it is also a
distributed computing methodology.