MapReduce with HADOOP



      Vitalie Scurtu
What is Hadoop?
Hadoop is a set of open source frameworks for
  parallel and distributed computing:
• HDFS: A distributed file system.
• MapReduce: A technique and a framework for
  parallel computation on a cluster.
• ZooKeeper: A configuration and coordination service.
• and others: Hive, HBase, Mahout, Pig.
• Yahoo's Hadoop cluster was used to sort 1 terabyte of data in 209
  seconds in the Terabyte Sorting Competition.
Why distributed computing?
• Reduced costs. Many ordinary computers are cheaper
  than one more powerful computer.
• Scalability. We can add new computers to the
  cluster at any time.
• More computing power and higher speed.
• Distributed algorithms.
• Stability.
• Robust frameworks.
Configuring Hadoop
• It is written in Java and uses XML files for configuration.
• Installation is very simple.
• Every computer can become a part of the cluster.
• Trying a demo takes only about 30 minutes.
• It uses an advanced configuration system named
  ZooKeeper.
• cat /usr/local/hadoop/conf/slaves
         hadoop-master
         hadoop-slave01
         hadoop-slave02
         hadoop-slave03
         hadoop-slave06
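
As an illustration of the XML configuration mentioned above, here is a minimal sketch that reads properties from a file in the <configuration>/<property> layout used by Hadoop's site files, plus a helper that reads a slaves file. The sample values (host name, port, replication factor) and the function names read_properties and read_slaves are assumptions made for this example, not settings taken from the cluster in the slides.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample in the <configuration>/<property> layout;
# the host, port and values are illustrative only.
SAMPLE_SITE_XML = """<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://hadoop-master:9000</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
</configuration>
"""

SAMPLE_SLAVES = """hadoop-master
hadoop-slave01

hadoop-slave02
"""


def read_properties(xml_text):
    """Return a dict mapping each property name to its value (as strings)."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.findall("property"):
        name = prop.findtext("name")
        value = prop.findtext("value")
        if name is not None:
            props[name.strip()] = (value or "").strip()
    return props


def read_slaves(text):
    """Return the host names listed one per line, skipping blank lines."""
    return [line.strip() for line in text.splitlines() if line.strip()]


props = read_properties(SAMPLE_SITE_XML)
hosts = read_slaves(SAMPLE_SLAVES)
print(props)
print(hosts)
```

All values come back as strings, so a caller would convert numeric settings such as the replication factor itself.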
HDFS
          Hadoop Distributed File System
•   Distributed file system
•   Support for huge files (gigabytes to terabytes)
•   Tolerates hardware failure through replication
•   File access model is “write-once-read-many”
•   Cross-platform (Java)
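
To make the replication idea concrete, the following toy model cuts a file into fixed-size blocks and assigns each block to several nodes. It is only a sketch: the round-robin placement, the block size and the node names are assumptions chosen for illustration, and this is not the placement policy HDFS itself uses.

```python
def split_into_blocks(file_size, block_size):
    """Return the sizes of the blocks a file of file_size bytes is cut into."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    full, rest = divmod(file_size, block_size)
    return [block_size] * full + ([rest] if rest else [])


def place_replicas(num_blocks, nodes, replication):
    """Assign every block to `replication` distinct nodes (toy round-robin)."""
    if replication > len(nodes):
        raise ValueError("not enough nodes for the requested replication")
    return {
        block_id: [nodes[(block_id + r) % len(nodes)] for r in range(replication)]
        for block_id in range(num_blocks)
    }


def readable_after_failure(placement, failed_node):
    """True if every block still has at least one replica on a live node."""
    return all(
        any(node != failed_node for node in replicas)
        for replicas in placement.values()
    )


# Hypothetical numbers: a 200 MB file, 64 MB blocks, four nodes, three replicas.
nodes = ["node-a", "node-b", "node-c", "node-d"]
blocks = split_into_blocks(200, 64)
placement = place_replicas(len(blocks), nodes, replication=3)
print(blocks)
print(placement)
print(readable_after_failure(placement, "node-a"))
```

With more than one replica per block, losing a single node leaves every block readable, which is the property the "hardware failure" bullet refers to.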
MapReduce
• A unique model for distributed computation; the main algorithm is divided into
  two stages
    – Map
        • Accepts key-value pairs (a dictionary) as input.
        • Records must be independent (key A does not depend on key B).
        • Performs the intermediate computations and prepares the data for the Reduce stage.
    – Reduce
        • Accepts collections of key-value pairs holding the intermediate results as input.
        • Relies on parallel sorting and grouping of the intermediate keys.
        • Returns the final result.
    – Map -> Reduce
        • It is not only a distributed framework but also a development methodology, thanks to its
          unique formula. The constraints on the algorithm let the developer think
          about the implementation without focusing on the parallel computation. Once a problem is
          transformed into a MapReduce algorithm, the framework can be applied.
    – Computation time (when all tasks of a stage run in parallel): max(time_of_each_map) + max(time_of_each_reduce)
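
The Map -> Reduce flow described above can be sketched in a single process. This is a minimal in-memory model, not Hadoop's implementation: the function map_reduce and the inverted-index example (which documents contain which token) are assumptions introduced only to show how independent records, grouping by key, and a per-key reduce fit together.

```python
from collections import defaultdict


def map_reduce(records, mapper, reducer):
    """Run mapper over (key, value) records, group by key, then reduce each group."""
    # Map stage: every record is processed independently of the others.
    intermediate = []
    for key, value in records:
        intermediate.extend(mapper(key, value))

    # Grouping stage: collect all intermediate values that share a key.
    groups = defaultdict(list)
    for out_key, out_value in intermediate:
        groups[out_key].append(out_value)

    # Reduce stage: one call per key, visited in sorted key order.
    return {key: reducer(key, groups[key]) for key in sorted(groups)}


def emit_token_doc(doc_id, text):
    """Mapper: emit (token, doc_id) for every token in the text."""
    return [(token, doc_id) for token in text.lower().split()]


def collect_doc_ids(token, doc_ids):
    """Reducer: return the sorted, de-duplicated list of document ids."""
    return sorted(set(doc_ids))


docs = [(1, "to be or not to be"), (2, "to map and to reduce")]
index = map_reduce(docs, emit_token_doc, collect_doc_ids)
print(index)
```

Because the mapper only sees one record at a time, the loop in the map stage could be split across machines without changing the result; that is the constraint the slide refers to.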
MapReduce

        +-> Map1 -+
        +-> Map2 -+
Input --+         +-> Reduce -> Output
        +-> Map3 -+
        +-> Map4 -+
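
With the maps running side by side as in the diagram, the estimate max(time_of_each_map) + max(time_of_each_reduce) can be compared with a purely sequential run. The task durations below are hypothetical values chosen for the example, and the function names are assumptions; the model ignores scheduling, data transfer and startup overhead.

```python
def estimated_job_time(map_times, reduce_times):
    """Wall-clock estimate when all maps, then all reduces, run in parallel."""
    if not map_times or not reduce_times:
        raise ValueError("need at least one map and one reduce task")
    return max(map_times) + max(reduce_times)


def sequential_time(map_times, reduce_times):
    """Time needed if every task ran one after another on a single machine."""
    return sum(map_times) + sum(reduce_times)


# Hypothetical durations in seconds for Map1..Map4 and a single Reduce.
map_times = [40.0, 55.0, 35.0, 50.0]
reduce_times = [20.0]

parallel = estimated_job_time(map_times, reduce_times)
serial = sequential_time(map_times, reduce_times)
print(parallel, serial)
```

The slowest map task sets the pace of the whole map stage, so evenly sized input splits matter more than the average task time.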
Example of Applications
• Problem: Extract all the texts from a database
   with 1 million posts and compute the number of occurrences
   of each token.
   mapper.py  <- Takes a post id as input
              -> Prints each token with its occurrence count
   reducer.py <- Takes as input a list of tokens with
                 their occurrence counts
              -> Sums the occurrences of each token
                 and outputs the final result.
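
A minimal sketch of what mapper.py and reducer.py could look like in the Hadoop Streaming style, where each step reads text lines and writes tab-separated key/value lines. This is not the author's code: it assumes the post texts arrive directly as input lines (the original mapper received a post id and fetched the text from the database), and the tokenizer is a simple regular expression chosen for the example.

```python
import re
from itertools import groupby

TOKEN_RE = re.compile(r"\w+")


def mapper(lines):
    """Yield 'token<TAB>1' for every token found in every input line."""
    for line in lines:
        for token in TOKEN_RE.findall(line.lower()):
            yield f"{token}\t1"


def reducer(sorted_lines):
    """Sum the counts of consecutive lines that share the same token.

    The input must already be sorted by token, as the framework's
    sort step guarantees between the map and reduce stages.
    """
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for token, group in groupby(pairs, key=lambda pair: pair[0]):
        total = sum(int(count) for _, count in group)
        yield f"{token}\t{total}"


# Local stand-in for a streaming job: map, sort, then reduce.
posts = ["Hadoop maps and Hadoop reduces", "hadoop sorts"]
mapped = sorted(mapper(posts))
counts = list(reducer(mapped))
print(counts)
```

In an actual streaming job the two functions would live in separate scripts, each fed from sys.stdin and printing its output lines; the call to sorted() here only stands in for the sort the framework performs between the stages.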
Experiment 1, 100K docs, 5 slaves
•   Time without MapReduce
     –   906.63user
     –   4.18system
     –   0:14:32 elapsed
     –   104%CPU (0avgtext+0avgdata 0maxresident)k
•   Time with MapReduce
     –   3.79user
     –   0.40system
     –   0:21:00 elapsed
     –   0%CPU (0avgtext+0avgdata 0maxresident)k

     –   10/10/25 11:10:36 INFO streaming.StreamJob:   map 0% reduce 0%
     –   10/10/25 11:10:50 INFO streaming.StreamJob:   map 16% reduce 0%
     –   10/10/25 11:11:48 INFO streaming.StreamJob:   map 33% reduce 0%
     –   10/10/25 11:12:10 INFO streaming.StreamJob:   map 49% reduce 0%
     –   10/10/25 11:14:09 INFO streaming.StreamJob:   map 66% reduce 0%
     –   10/10/25 11:14:37 INFO streaming.StreamJob:   map 82% reduce 0%
     –   10/10/25 11:16:26 INFO streaming.StreamJob:   map 83% reduce 0%
     –   10/10/25 11:18:12 INFO streaming.StreamJob:   map 83% reduce 17%
     –   10/10/25 11:20:18 INFO streaming.StreamJob:   map 99% reduce 17%
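
Progress lines like the ones above can be turned into structured data to see how long each stage took. The sketch below assumes the timestamp is in year/month/day order with a two-digit year, and the regular expression and function name are written for this example only.

```python
import re
from datetime import datetime

# Matches e.g. "10/10/25 11:14:09 INFO streaming.StreamJob:   map 66% reduce 0%"
LOG_RE = re.compile(
    r"(\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) INFO streaming\.StreamJob:"
    r"\s+map (\d+)% reduce (\d+)%"
)


def parse_progress(lines):
    """Return (timestamp, map_percent, reduce_percent) for each matching line."""
    progress = []
    for line in lines:
        match = LOG_RE.search(line)
        if match:
            # Assumed format: two-digit year / month / day.
            stamp = datetime.strptime(match.group(1), "%y/%m/%d %H:%M:%S")
            progress.append((stamp, int(match.group(2)), int(match.group(3))))
    return progress


LOG = [
    "10/10/25 11:10:36 INFO streaming.StreamJob:   map 0% reduce 0%",
    "10/10/25 11:14:09 INFO streaming.StreamJob:   map 66% reduce 0%",
    "10/10/25 11:20:18 INFO streaming.StreamJob:   map 99% reduce 17%",
]

progress = parse_progress(LOG)
elapsed = (progress[-1][0] - progress[0][0]).total_seconds()
print(progress)
print(elapsed)
```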
Experiment 2, 1M docs, 5 slaves
•   Time without MapReduce
     –   6892.08user
     –   25.03system
     –   1:56:37 elapsed
     –   98%CPU (0avgtext+0avgdata 0maxresident)k
•   Time with MapReduce
     –   6.30user
     –   0.98system
     –   3:26:18 elapsed
     –   0%CPU (0avgtext+0avgdata 0maxresident)k

     –   10/10/26 15:04:36 INFO streaming.StreamJob:   map 100% reduce 14%
     –   10/10/26 15:04:37 INFO streaming.StreamJob:   map 100% reduce 16%
     –   10/10/26 15:04:39 INFO streaming.StreamJob:   map 100% reduce 25%
     –   10/10/26 15:04:40 INFO streaming.StreamJob:   map 100% reduce 27%
     –   10/10/26 15:04:42 INFO streaming.StreamJob:   map 100% reduce 30%
     –   10/10/26 15:04:44 INFO streaming.StreamJob:   map 100% reduce 32%
     –   10/10/26 15:04:45 INFO streaming.StreamJob:   map 100% reduce 34%
     –   10/10/26 15:04:48 INFO streaming.StreamJob:   map 100% reduce 35%
     –   10/10/26 15:07:29 INFO streaming.StreamJob:   map 83% reduce 35%
     –   10/10/26 15:07:35 INFO streaming.StreamJob:   map 100% reduce 35%
     –   10/10/26 15:09:57 INFO streaming.StreamJob:   map 100% reduce 36%
     –   10/10/26 15:09:59 INFO streaming.StreamJob:   map 100% reduce 37%
Experiment 3, 1M docs, 3 slaves
•   Time without MapReduce
     –   6892.08user
     –   25.03system
     –   1:56:37 elapsed
     –   98%CPU (0avgtext+0avgdata 0maxresident)k
•   Time with MapReduce
     –   5.50user
     –   0.97system
     –   0:53:20 elapsed
     –   0%CPU (0avgtext+0avgdata 0maxresident)k
     –   10/10/26 15:04:36 INFO streaming.StreamJob:   map 100% reduce 14%
     –   10/10/26 15:04:37 INFO streaming.StreamJob:   map 100% reduce 16%
     –   10/10/26 15:04:39 INFO streaming.StreamJob:   map 100% reduce 25%
     –   10/10/26 15:04:40 INFO streaming.StreamJob:   map 100% reduce 27%
     –   10/10/26 15:04:42 INFO streaming.StreamJob:   map 100% reduce 30%
     –   10/10/26 15:04:44 INFO streaming.StreamJob:   map 100% reduce 32%
     –   10/10/26 15:04:45 INFO streaming.StreamJob:   map 100% reduce 34%
     –   10/10/26 15:04:48 INFO streaming.StreamJob:   map 100% reduce 35%
     –   10/10/26 15:07:29 INFO streaming.StreamJob:   map 83% reduce 35%
     –   10/10/26 15:07:35 INFO streaming.StreamJob:   map 100% reduce 35%
     –   10/10/26 15:09:57 INFO streaming.StreamJob:   map 100% reduce 36%
     –   10/10/26 15:09:59 INFO streaming.StreamJob:   map 100% reduce 37%
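
The elapsed times reported in the three experiments can be compared directly. This small sketch converts the h:mm:ss values from the slides into seconds and computes the ratio of the time without MapReduce to the time with it; the helper names are assumptions, and the ratio says nothing about why a given run was faster or slower.

```python
def to_seconds(elapsed):
    """Convert an 'h:mm:ss' (or 'mm:ss') string into a number of seconds."""
    seconds = 0
    for part in elapsed.split(":"):
        seconds = seconds * 60 + int(part)
    return seconds


def speedup(without_mr, with_mr):
    """Ratio of elapsed time without MapReduce to elapsed time with it."""
    return to_seconds(without_mr) / to_seconds(with_mr)


# Elapsed times as reported on the experiment slides.
EXPERIMENTS = {
    "Experiment 1 (100K docs, 5 slaves)": ("0:14:32", "0:21:00"),
    "Experiment 2 (1M docs, 5 slaves)": ("1:56:37", "3:26:18"),
    "Experiment 3 (1M docs, 3 slaves)": ("1:56:37", "0:53:20"),
}

for name, (without_mr, with_mr) in EXPERIMENTS.items():
    print(f"{name}: {speedup(without_mr, with_mr):.2f}x")
```

A ratio below 1 means the MapReduce run took longer than the single-machine run. On the reported numbers the ratios are about 0.69 and 0.57 for the first two experiments and about 2.19 for the third.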
What’s next?
• MapReduce can be applied to many problems
  and natural language processing applications.
  Examples:
  – Sentiment analysis.
  – Computing probabilities over huge data sets.
  – Retrieval problems.
  – Statistics and analysis of huge data sets.
• MapReduce is not only a framework; it is also a
  distributed computing methodology.
