MapReduce Paradigm

  Dilip Reddy Kancharla
        Spring 2012
Outline
• Introduction
• Motivating example
• Hadoop
  – Hadoop MapReduce
  – HDFS
• Pros & Cons of MapReduce
• Hadoop Applicability to different workflows
• Conclusions and Future work
MapReduce Execution Overview [DG08]
[Figure: the user program forks a master and several workers. The master assigns map tasks and reduce tasks to workers. Map workers read the input splits (Split 1 to Split 5) and write intermediate key/value pairs to local disk; reduce workers read them remotely and write the output files (Output file 1, Output file 2). Stages, left to right: Input Files, Map Phase, Intermediate Operations, Reduce Phase, Output Files.]
MapReduce Paradigm
• Splits input files into blocks (typically 64 MB each)
• Operates on key/value pairs
• Mappers filter and transform input data
• Reducers aggregate the mappers' output
• Efficient way to process data on a cluster:
  – Move code to data
  – Run code on all machines
• Map (hash function)
     (k1, v1)  ->  list(k2, v2)
• Reduce (aggregate function)
     (k2, list(v2))  ->  list(k3, v3)
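The two signatures above can be illustrated with a word count. This is a minimal in-memory sketch, not Hadoop code: the names `map_fn`, `reduce_fn`, and `run_job` are hypothetical, and the grouping step inside `run_job` stands in for the shuffle that a real framework performs across machines.

```python
from collections import defaultdict
from typing import Iterator, List, Tuple

# Hypothetical names for illustration; everything runs in one process.


def map_fn(key: int, value: str) -> Iterator[Tuple[str, int]]:
    """Map: (k1, v1) -> list(k2, v2). k1 is a line number, v1 a line of text."""
    for word in value.lower().split():
        yield word, 1


def reduce_fn(key: str, values: List[int]) -> Iterator[Tuple[str, int]]:
    """Reduce: (k2, list(v2)) -> list(k3, v3). Sums the counts for one word."""
    yield key, sum(values)


def run_job(records, mapper, reducer):
    """Apply the mapper, group values by intermediate key, then apply the reducer."""
    groups = defaultdict(list)
    for k1, v1 in records:
        for k2, v2 in mapper(k1, v1):
            groups[k2].append(v2)      # stand-in for the shuffle
    output = []
    for k2 in sorted(groups):          # reducers see keys in sorted order
        output.extend(reducer(k2, groups[k2]))
    return output


lines = [(0, "the quick brown fox"), (1, "the lazy dog"), (2, "The fox")]
result = run_job(lines, map_fn, reduce_fn)
print(result)
```

The user writes only `map_fn` and `reduce_fn`; splitting, grouping, and scheduling belong to the framework.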
Advanced MapReduce
• Hadoop Streaming
  – Lets you write the mapper and reducer in other
    languages, such as Python or Ruby
• Chaining MapReduce jobs
• Joining data
• Bloom filters
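Hadoop Streaming passes records to the mapper and reducer as lines of text on standard input and reads their results from standard output, with key and value separated by a tab by default. The sketch below imitates that contract with plain Python functions over lists of lines, so it can run without a cluster. The function names and the local `sorted` call (a stand-in for the framework's sort between the two stages) are assumptions made for illustration.

```python
from itertools import groupby

# Hypothetical helper names; a real streaming job would wrap each function in
# a script that reads sys.stdin and prints to standard output.


def mapper(lines):
    """Emit one 'word<TAB>1' line for every word in the input lines."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reducer(sorted_lines):
    """Sum the counts per word. Input lines must already be sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


mapped = list(mapper(["a b a", "b c"]))
shuffled = sorted(mapped)            # local stand-in for the framework's sort
reduced = list(reducer(shuffled))
print(reduced)
```

Because the reducer only relies on its input being sorted by key, it can use `groupby` and never needs to hold more than one key's values at a time.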
Hadoop
• Open Source Implementation of MapReduce by
  Apache Software Foundation.
• Created by Doug Cutting.
• Derived from Google's MapReduce and Google
  File System (GFS) papers.
• Apache Hadoop is a software framework that
  supports data-intensive distributed applications
  under a free license
• It enables applications to work with thousands of
  computationally independent computers and
  petabytes of data.
Hadoop Architecture
• Hadoop MapReduce
  – Single master node, many worker nodes
  – Client submits a job to master node
  – Master splits each job into tasks (map and reduce),
    and assigns tasks to worker nodes
• Hadoop Distributed File System (HDFS)
  – Single name node, many data nodes
  – Files stored as large, fixed-size (e.g. 64 MB) blocks
  – HDFS typically holds map input and reduce output
Hadoop Architecture (diagram)
[Figure: a Namenode, backed by a Secondary Namenode, and a JobTracker sit above three worker machines. Each worker runs a Data node and a TaskTracker, and each TaskTracker runs several Map tasks and a Reduce task.]
Job Scheduling in Hadoop
• One map task for each block of the input file
  – Applies user-defined map function to each record in
    the block
  – Record = <key, value>
• User-defined number of reduce tasks
  – Each reduce task is assigned a set of record groups
  – For each group, apply user-defined reduce function to
    the record values in that group
• Reduce tasks read from every map task
  – Each read returns the record groups for that reduce
    task
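The scheduling rules above reduce to two small calculations: how many map tasks a file produces, and which reduce task receives a given key. The sketch below assumes the 64 MB block size quoted on the slides and uses a CRC32 hash as a deterministic stand-in for the framework's partitioning function; the function names are hypothetical.

```python
import math
import zlib

BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB, the block size quoted on the slides


def num_map_tasks(file_size_bytes: int, block_size: int = BLOCK_SIZE) -> int:
    """One map task per block, so round the block count up."""
    return math.ceil(file_size_bytes / block_size)


def reduce_task_for(key: str, num_reduce_tasks: int) -> int:
    """Assign a key to a reduce task. CRC32 is an assumption made here so the
    result is stable across runs; it is not Hadoop's own hash function."""
    return zlib.crc32(key.encode("utf-8")) % num_reduce_tasks


# Each reduce task reads its own record groups from every map task's output.
map_outputs = [[("fox", 1), ("dog", 1)], [("fox", 1), ("cat", 1)]]
R = 2  # user-defined number of reduce tasks
inbox = {r: [] for r in range(R)}
for output in map_outputs:
    for key, value in output:
        inbox[reduce_task_for(key, R)].append((key, value))

print(num_map_tasks(200 * 1024 * 1024))  # a 200 MB file spans 4 blocks
print(inbox)
```

All records with the same key land in the same inbox, which is what lets a single reduce call see every value for that key.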
Dataflow in Hadoop
• Map tasks write their output to local disk
  – Output available after map task has completed
• Reduce tasks write their output to HDFS
  – Once job is finished, next job’s map tasks can be
    scheduled, and will read input from HDFS
• Therefore, fault tolerance is simple: simply
  re-run tasks on failure
  – No consumers see partial operator output
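The re-run rule works because a task's output becomes visible only once the task has finished. One way to sketch that idea on a single machine is to write to a temporary file and rename it into place on success. The helper `run_task_with_retries` and the file name `part-00000` are hypothetical, and this is an illustration of the principle, not Hadoop's commit mechanism.

```python
import os
import tempfile


def run_task_with_retries(task, output_path, max_attempts=3):
    """Re-run `task` until it succeeds; publish its output only on success."""
    directory = os.path.dirname(os.path.abspath(output_path))
    for attempt in range(1, max_attempts + 1):
        fd, tmp_path = tempfile.mkstemp(dir=directory)
        try:
            with os.fdopen(fd, "w") as tmp:
                task(tmp, attempt)
            os.replace(tmp_path, output_path)  # atomic rename: the commit point
            return attempt
        except Exception:
            # Discard the partial output so no consumer can ever read it.
            if os.path.exists(tmp_path):
                os.remove(tmp_path)
    raise RuntimeError(f"task failed after {max_attempts} attempts")


def flaky_task(out, attempt):
    """Simulated task that crashes halfway through its first attempt."""
    out.write("partial\n")
    if attempt == 1:
        raise IOError("simulated worker failure")
    out.write("complete\n")


workdir = tempfile.mkdtemp()
final_path = os.path.join(workdir, "part-00000")
attempts_used = run_task_with_retries(flaky_task, final_path)
with open(final_path) as f:
    committed = f.read()
print(attempts_used, repr(committed))
```

The failed first attempt leaves nothing behind, so a downstream reader sees either no file or the complete one.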
Dataflow in Hadoop [CAHER10]
[Figure, step 1: a job is submitted, and its map and reduce tasks are scheduled.]
[Figure, step 2: map tasks read the input file from HDFS, one block (Block 1, Block 2) per map task.]
[Figure, step 3: each map task writes its output to the local file system; reduce tasks fetch it with HTTP GET.]
[Figure, step 4: reduce tasks write the final answer to HDFS.]
HDFS
• Data is distributed and replicated over
  multiple machines.
• Files are not stored contiguously on servers;
  they are broken up into blocks.
• Designed for large files (large means GB or TB)
• Block oriented
• Linux-style commands (e.g. ls, cp, mkdir, mv)
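The first two bullets, distribution with replication and block-oriented storage, can be sketched as two small functions. The replication factor of 3 and the round-robin placement are assumptions made for illustration (the slides give neither), and the placement here is deliberately simpler than what a real name node does; the function names are hypothetical.

```python
def split_into_blocks(file_size, block_size):
    """Return (offset, length) for each block; the last block may be shorter."""
    return [(offset, min(block_size, file_size - offset))
            for offset in range(0, file_size, block_size)]


def place_replicas(num_blocks, datanodes, replication=3):
    """Assign each block to `replication` distinct data nodes, round-robin.
    This is a stand-in policy for illustration only."""
    if replication > len(datanodes):
        raise ValueError("not enough data nodes for the requested replication")
    return {
        block: [datanodes[(block + i) % len(datanodes)] for i in range(replication)]
        for block in range(num_blocks)
    }


blocks = split_into_blocks(file_size=150, block_size=64)
placement = place_replicas(len(blocks), ["dn1", "dn2", "dn3", "dn4"])
print(blocks)
print(placement)
```

No single data node holds the whole file, and losing any one node still leaves two copies of every block.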
Different Workflows [MTAGS11]
Hadoop Applicability by Workflow [MTAGS11]
  Score meaning:
  • A score of 0 means Hadoop is easily adaptable to the workflow
  • A score of 0.5 means Hadoop is moderately adaptable to the
    workflow
  • A score of 1 marks a workflow area where Hadoop needs
    improvement
Relative Merits and Demerits of Hadoop Over DBMS
Pros
• Fault tolerance
• Self-healing: rebalances files across the cluster
• Highly scalable
• Highly flexible, as it does not depend on any data model or schema
Cons
• No high-level language like SQL in DBMS
• No schema and no index
• Low efficiency
• Very young (since 2004) compared to over 40 years of DBMS

Hadoop                           Relational
Scale out (add more machines)    Scaling is difficult
Key/value pairs                  Tables
Say how to process the data      Say what you want (SQL)
Offline / batch                  Online / real time
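The "say how" versus "say what" row can be made concrete by computing the same per-region total both ways. The sketch below is an assumption-laden toy: the `sales` data and the function names are invented for illustration, SQLite stands in for a relational DBMS, and the MapReduce side runs in a single process.

```python
import sqlite3
from collections import defaultdict

sales = [("north", 10), ("south", 5), ("north", 7), ("east", 3)]

# Relational: declare the result you want and let the engine decide how.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?, ?)", sales)
sql_totals = dict(conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region"))
conn.close()


# MapReduce style: spell out how each record is transformed and aggregated.
def map_fn(record):
    region, amount = record
    yield region, amount


def reduce_fn(region, amounts):
    return region, sum(amounts)


groups = defaultdict(list)
for record in sales:
    for key, value in map_fn(record):
        groups[key].append(value)
mr_totals = dict(reduce_fn(key, values) for key, values in groups.items())

print(sql_totals)
print(mr_totals)
```

Both paths produce the same totals; the difference is that the SQL version leaves the execution plan to the engine, while the MapReduce version makes the programmer write it.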
Conclusions and Future Work
• MapReduce is easy to program
• Hadoop=HDFS+MapReduce
• Distributed, Parallel processing
• Designed for fault tolerance and high scalability
• MapReduce is unlikely to replace DBMS in data
  warehousing; instead, we expect the two to
  complement each other and help in analyzing
  patterns in scientific data
• Finally, efficiency, and especially I/O cost, needs
  to be addressed for MapReduce to be applied successfully
References
[LLCCM12] Kyong-Ha Lee, Yoon-Joon Lee, Hyunsik Choi, Yon Dohn Chung, and Bongki Moon, “Parallel data processing with MapReduce: a survey,” SIGMOD, January 2012, pp. 11-20.
[MTAGS11] Elif Dede, Madhusudhan Govindaraju, Daniel Gunter, and Lavanya Ramakrishnan, “Riding the Elephant: Managing Ensembles with Hadoop,” Proceedings of the 2011 ACM international workshop on Many task computing on grids and supercomputers, ACM, New York, NY, USA, pp. 49-58.
[DG08] Jeffrey Dean and Sanjay Ghemawat, “MapReduce: simplified data processing on large clusters,” Communications of the ACM, January 2008, pp. 107-113.
[CAHER10] Tyson Condie, Neil Conway, Peter Alvaro, Joseph M. Hellerstein, Khaled Elmeleegy, and Russell Sears, “MapReduce online,” Proceedings of the 7th USENIX conference on Networked systems design and implementation (NSDI'10), USENIX Association, Berkeley, CA, USA, 2010, pp. 21-37.
Thank You!



Questions?

Editor's notes

  1. If distributed computing is so hard, do we need it?
  2. Code runs on the machines that hold the data, unlike conventional systems, where we move data to the code, process it, and then store the results back.
  3. Out of the scope of the papers.
  4. The master (JobTracker) is responsible for assigning tasks to the workers. Each worker runs a TaskTracker process that manages the execution of the tasks currently assigned to that node. Each TaskTracker has a fixed number of slots for executing tasks. Each map task is assigned a portion of the input file called a split. By default, a split contains a single HDFS block, so the total number of file blocks determines the number of map tasks.
  5. Because reducers begin processing data as soon as it is produced by mappers, they can generate and refine an approximation of their final answer during the course of execution. MapReduce jobs can run continuously, accepting new data as it arrives and analyzing it immediately. This allows MapReduce to be used for applications such as event monitoring and stream processing.
     – DataNode: stores the actual file blocks on disk, not entire files. It reports block information to the NameNode and receives instructions from it.
     – Secondary NameNode: keeps a snapshot of the NameNode. It is not a failover server for the NameNode, but it helps minimize downtime and data loss if the NameNode fails.
     – JobTracker: partitions tasks across the cluster, tracks MapReduce tasks, and restarts failed tasks on different nodes.
     – TaskTracker: does the task processing and logs every event.
  6. The input to a job is an input specification expressed as key-value pairs. Each job consists of two stages. First, a user-defined map function is applied to each input record to produce a list of intermediate key-value pairs. Second, a user-defined reduce function is called once for each distinct key in the map output and is passed the list of intermediate values associated with that key. A reduce task runs in three steps: the shuffle phase (each reduce task is assigned a partition of the key range produced by the map step, so it must fetch the content of this partition from every map task's output); the sort phase, which groups records with the same key; and the application of the user-defined reduce function.
  7. The buffer content is written to the local file system as an index file and a data file. The index file is used for indexing; the data file contains only the records, which are sorted by key within each partition segment. A reduce task fetches data from each map task by issuing HTTP requests to a configurable number of TaskTrackers at once (5 by default). The JobTracker relays the location of every TaskTracker that hosts map output to every TaskTracker that is executing a reduce task. A reduce task cannot fetch the output of a map task until the map has finished executing and committed its final output to disk.
  8. The map phase reads the task's split (an HDFS block) from HDFS, parses it into records (key/value pairs), and applies the map function to each record. After the map function has been applied to each input record, the commit phase registers the final output with the TaskTracker, which then informs the JobTracker that the task has finished executing.
  9. A reduce task fetches data from each map task by issuing HTTP requests to a configurable number of TaskTrackers at once (5 by default). The JobTracker relays the location of every TaskTracker that hosts map output to every TaskTracker that is executing a reduce task. Note that a reduce task cannot fetch the output of a map task until the map has finished executing and committed its final output to disk.
  10. In this design, the output of both map and reduce tasks is written to disk before it can be consumed. This is particularly expensive for reduce tasks, because their output is written to HDFS. Output materialization simplifies fault tolerance, because it reduces the amount of state that must be restored to consistency after a node failure. If any task (either map or reduce) fails, the JobTracker simply schedules a new task to perform the same work as the failed task.
  11. While it was possible to implement all patterns in the framework, the level of difficulty varied. This evaluation helps in identifying whether an application's workflow is suitable to run in the MapReduce framework.
  12. Fault tolerance: the system tolerates node failures because data is highly replicated. Scalability: simply by adding nodes, we can process as much data as we want. Low efficiency: with fault tolerance and scalability as its primary goals, MapReduce operations are not always optimized for I/O efficiency. Also, map and reduce are blocking operations.
  13. Easy, since it hides the implementation details of parallelization, fault tolerance, locality optimization, and load balancing. Horizontal scale-out lets us process as much data as we want by simply adding nodes.
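
The job flow described in notes 4 and 6 (one map task per split, then shuffle, sort, and reduce) can be illustrated with a small single-process sketch. This is only a toy simulation under stated assumptions, not Hadoop code: the names `num_map_tasks`, `partition`, `map_fn`, `reduce_fn`, and `run_job` are hypothetical, the word-count logic is just an example workload, and the CRC32-based partitioner is an assumption chosen for determinism.

```python
import itertools
import math
import zlib
from collections import defaultdict

BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB, the typical block size named in the slides


def num_map_tasks(file_size_bytes, block_size=BLOCK_SIZE):
    """By default a split is one HDFS block, so there is one map task per block."""
    return math.ceil(file_size_bytes / block_size)


def partition(key, num_reducers):
    """Assign a key to a reducer (assumed partitioner: CRC32 of the key)."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def map_fn(_key, line):
    """Example user-defined map: emit (word, 1) for every word in a record."""
    return [(word, 1) for word in line.split()]


def reduce_fn(key, values):
    """Example user-defined reduce: called once per distinct key."""
    return key, sum(values)


def run_job(splits, num_reducers=2):
    # Map phase: each split is handled by its own (simulated) map task,
    # which buckets its intermediate pairs by target reducer.
    map_outputs = []
    for split in splits:
        buckets = defaultdict(list)
        for offset, line in enumerate(split):
            for k, v in map_fn(offset, line):
                buckets[partition(k, num_reducers)].append((k, v))
        map_outputs.append(buckets)

    results = {}
    for r in range(num_reducers):
        # Shuffle: reducer r fetches its partition from every map output.
        fetched = [pair for out in map_outputs for pair in out.get(r, [])]
        # Sort: bring records with the same key next to each other.
        fetched.sort(key=lambda kv: kv[0])
        # Reduce: apply the user-defined function once per distinct key.
        for key, group in itertools.groupby(fetched, key=lambda kv: kv[0]):
            k, total = reduce_fn(key, [v for _, v in group])
            results[k] = total
    return results
```

For example, `num_map_tasks(200 * 1024 * 1024)` gives 4 map tasks for a 200 MB file, and `run_job([["a b a"], ["b c"]])` counts the words across two splits. In a real cluster the map outputs would be written to local disk and fetched over HTTP, as notes 7 and 9 describe; here they are simply kept in memory.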