Large Scale Computing with
MapReduce
Sen Han

                             1
• By 2020, there will be 5,200 GB of data for every person
  on Earth.
• Over the next eight years, the amount of digital data produced
  will exceed 40 zettabytes, the equivalent of 5,200 GB of data
  for every person.
• The data recorded by each of the big experiments at the
  Large Hadron Collider (LHC) at CERN in Geneva is
  enough to fill around 100,000 DVDs every year.
• Sources: Facebook, Google, etc.




Data Explosion                                                 2
• Big Data in Fields:

     Sport              Finance
     Banking            Science
     Marketing          Journalism
     Medicine           Education




Data Explosion                       3
Downloading a large amount of web pages → Creating indexes →
Retrieving the most related pages




A case study of Google                  4
• Single-thread performance doesn’t matter
  • Throughput more important than peak performance.
• Stuff Breaks
  • A single server can run for years, but a large cluster may
    lose around 10 machines a day.
• "Ultra-reliable" hardware doesn't really help.
  • Software needs to be fault tolerant.
  • Lower-priced commodity machines are the better choice.




Large Data Set                                         5
              Traditional RDBMS           MapReduce
  Data Size   Gigabytes                   Petabytes
  Access      Interactive and batch       Batch
  Updates     Read and write many times   Write once, read many times
  Structure   Static schema               Dynamic schema
  Integrity   High                        Low
  Scaling     Nonlinear                   Linear




Streaming Data                                                          6
MapReduce in Google   7
• Map:
  • Produces a set of intermediate key/value pairs.
• Reduce:
  • Merges all intermediate values associated with the same key
    and delivers the results.




Functional MapReduce                                 8
• map(String key, String value):
       // key: document name
       // value: document contents
       for each word w in value:
               EmitIntermediate(w, "1");
• reduce(String key, Iterator values):
       // key: a word
       // values: a list of counts
       int result = 0;
       for each v in values:
               result += ParseInt(v);
       Emit(AsString(result));




Functional MapReduce                     9
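
For comparison, the same word count as a runnable Hadoop job in Java; a
minimal, self-contained sketch using the standard org.apache.hadoop.mapreduce
API, with input and output paths taken from the command line:

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {
      // Map: emit (word, 1) for every word in the input split.
      public static class TokenizerMapper
          extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);
          }
        }
      }

      // Reduce: sum the counts collected for each word.
      public static class IntSumReducer
          extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values,
                           Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable v : values) sum += v.get();
          context.write(key, new IntWritable(sum));
        }
      }

      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }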
•   Parallel map over input.
•   Parallel grouping of intermediate data.
•   Parallel map over groups.
•   Parallel reduction per group.




Discover Parallelism in
MapReduce                                     10
Distributed MapReduce   11
Distributed MapReduce   12
• One master, many workers
  • Input data split into M map tasks (typically 64 MB in size)
  • Reduce phase partitioned into R reduce tasks.
  • Tasks are assigned to workers dynamically.
  • Often M = 200,000; R = 4,000; workers=2,000
• Master assigns each map task to a free worker.
  • Considers locality of data to the worker when assigning tasks.
  • Worker reads task input (often from local disk)
  • Worker produces R local files containing intermediate k/v pairs.
• Master assigns each reduce task to a free worker.
  • Worker reads intermediate k/v pairs from map workers.
  • Worker sorts & applies the user's Reduce op to produce the output.



MapReduce: Job
Scheduling                                                             13
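
Intermediate keys are routed to one of the R reduce tasks by a partition
function, by default a hash of the key modulo R. A minimal sketch of that
behavior, written against Hadoop's Partitioner interface:

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    // hash(key) mod R decides which of the R reduce tasks
    // receives a given intermediate key.
    public class SimpleHashPartitioner extends Partitioner<Text, IntWritable> {
      @Override
      public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        // Mask the sign bit so the result is always non-negative.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
      }
    }

All pairs that share a key land in the same partition, which is what lets
each reduce task see every value for the keys it owns.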
MapReduce: Job
Scheduling       14
•   On worker failure:
•   Detect failure via periodic heartbeats.
•   Re-execute completed and in-progress map tasks (their output
    lives on the failed machine's local disk, so it is lost with the
    worker).
•   Re-execute in-progress reduce tasks.
•   Task completion is committed through the master.
•   On master failure:
•   State is checkpointed to GFS: a new master recovers and
    continues.



MapReduce: Fault
Tolerance                                                  15
•   Master scheduling:
•   Asks GFS for the locations of replicas of input file blocks.
•   Map inputs are typically split into 64 MB chunks (== GFS block size).
•   Map tasks are scheduled so that a GFS replica of the input block
    is on the same machine or the same rack.




MapReduce: Locality
Optimization                                                   16
•   Optional secondary keys for ordering.
•   Compression of intermediate data.
•   Combiner: useful for saving network bandwidth (see the sketch
    after this slide).
•   User-defined counters.




MapReduce: Other
refinements                                         17
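
A combiner runs a partial reduce on each map worker before intermediate
data crosses the network. For an associative and commutative operation
such as the word-count sum, the reducer itself can often double as the
combiner; a sketch, reusing the WordCount job shown earlier:

    // Partial sums are computed locally on each map worker, so the
    // network carries one (word, partialCount) pair per mapper
    // instead of one pair per word occurrence.
    job.setCombinerClass(IntSumReducer.class);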
• Distributed Grep
  • The map function emits a line if it matches a given pattern. The reduce
     function is an identity function that just copies the supplied intermediate data
     to the output.
• Count of URL Access Frequency
  • Input is web page logs. The map function outputs a <URL, 1> pair per
     access. The reduce function adds together all values for the same URL
     and emits a <URL, total count> pair.
• Reverse Web-Link Graph
  • The map function outputs (target, source) pairs for each link to a target
     URL found in a page named source. The reduce function concatenates the
     list of all source URLs associated with a given target URL and emits the
     pair (target, list(source)).
• Term-Vector per Host
  • A term vector summarizes the most important words that occur in a document
     or a set of documents as a list of (word, frequency) pairs.




MapReduce: examples                                                                18
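
As an illustration, a sketch of the distributed-grep map function in
Hadoop Java; the configuration key "grep.pattern" is a hypothetical name
for passing the pattern in, and the reduce stage can be the identity (or
omitted entirely by running with zero reducers):

    import java.io.IOException;
    import java.util.regex.Pattern;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class GrepMapper extends Mapper<Object, Text, Text, NullWritable> {
      private Pattern pattern;

      @Override
      protected void setup(Context context) {
        // "grep.pattern" is an assumed configuration key for this sketch;
        // the job driver must set it before submission.
        pattern = Pattern.compile(context.getConfiguration().get("grep.pattern"));
      }

      @Override
      public void map(Object key, Text line, Context context)
          throws IOException, InterruptedException {
        // Emit the line itself when it matches; no reduce-side work needed.
        if (pattern.matcher(line.toString()).find()) {
          context.write(line, NullWritable.get());
        }
      }
    }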
• MapReduce runtime library[8]:
  •   Automatic parallelization.
  •   Load balancing.
  •   Network and disk transfer optimization.
  •   Handling of machine failure.
  •   Robustness.




MapReduce: runtime
library                                         19
• Economy: a cluster of commodity computers.
• Usability: a simple interface for submitting computing jobs; the
  distributed computation is handled behind the scenes, so users
  need not deal with those issues themselves.
• Reliability: fault tolerant.




Hadoop: an open-source
library                                                  20
Existing Limitations   21
•   Built-in backup became a necessity.
•   Built-in automated recovery mechanism.
•   Running things in parallel (distributed programming).
•   Easy to administer.
•   Something that is cost-effective.




What was required?                                         22
• Google’s MapReduce


       • Apache Nutch (Open source web search
         engine)



       • Apache Lucene (Text search Library)




Origin of HADOOP                                23
File System component of Hadoop.



•   Streaming Data Access
•   Hardware Failure
•   Commodity Hardware
•   Moving Data is Expensive




HDFS                               24
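
A minimal sketch of streaming-data access through the HDFS client API;
the NameNode address and file path below are hypothetical:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsRead {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical NameNode address; normally read from core-site.xml.
        conf.set("fs.defaultFS", "hdfs://namenode:8020");
        try (FileSystem fs = FileSystem.get(conf);
             BufferedReader in = new BufferedReader(
                 new InputStreamReader(fs.open(new Path("/data/first.txt"))))) {
          String line;
          while ((line = in.readLine()) != null) {
            // The client streams the file's blocks directly from DataNodes.
            System.out.println(line);
          }
        }
      }
    }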
•   Scalable
•   Fault tolerant
•   Distributed file system
•   Data Storage
•   Cost effective processing




Hadoop                          25
(Diagram: the user's HDFS client asks the NameNode (master) for
metadata, then reads and writes data directly to the DataNodes (slaves).)




HDFS Core Architecture                             26
•   Only one NameNode.
•   Selects DataNodes to create replicas.
•   Image – the file system metadata held in memory
•   Checkpoint – a persistent record of the image, written to disk
•   Journal – a write-ahead log of changes to the image
•   CheckpointNode / BackupNode




NameNode                                    27
•   Variable block size (default is 128 MB).
•   Replicas at multiple locations (default 3).
•   Namespace of all the blocks is stored in the NameNode.
•   Handshake with the NameNode at startup.
•   Storage ID – identifies a DataNode.
•   Block report of replicas sent every hour.
•   Heartbeat – signals normal operation of the DataNode.




DataNode                                               28
• Backup of the state of the file system.
• To protect from data loss during software upgrades.
• The DataNode copies its storage directories and hard-links the
  blocks into them.
(Diagram: the DataNode and the NameNode each take a snapshot while
heartbeats continue between them.)




Snapshot
                                                       29
• Data in file cannot be modified once saved. (Only
  Append)
• Only one client can have write access to a file at a time.
• Soft limit and Hard limit.
• Bytes are sent in a pipeline to the DataNodes (in the form of packets).
• Optimized for Batch programming systems.




Reads and Writes                                          30
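
The write-once model is visible in the client API: existing bytes cannot
be overwritten, but a file can be extended. A sketch, assuming the
cluster permits appends; the path is hypothetical:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsAppend {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path log = new Path("/logs/events.log"); // hypothetical path
        // Only one writer may hold the lease on this file at a time;
        // existing contents cannot be modified, only extended.
        try (FSDataOutputStream out = fs.append(log)) {
          out.writeBytes("new record\n");
        }
      }
    }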
• Two rules:
  1. No DataNode contains more than one replica of the same block.
  2. No rack contains more than two replicas of the same
     block.
• Placement of replicas plays a vital role.
• The block report gives the number of replicas.
• Replication priority queue for under-replicated blocks.




Replica Management                                         31
•   Permissions follow the POSIX model (read, write, and execute).
•   Recent versions use Kerberos authentication.
•   Not designed for use over untrusted networks.
•   Security features are still weak, but are being improved.




Security                                               32
•   Yahoo
•   Facebook
•   Twitter
•   eBay
•   LinkedIn
•   Amazon(A9)




Who uses Hadoop?      33
• Yahoo played a vital role in the development
  of Hadoop.
• Initially used for indexing of web crawl results.
• Used to block spam entering the mail servers, for filtering,
  content optimization, etc.




                                                      34
•   When Facebook first started – a commercial RDBMS.
•   Needed infrastructure to handle such huge volumes of data.
•   Days turned into hours.
•   Log processing, Recommendation systems, Data
    warehouse and archiving.




                                                        35
• Uses LZO compression to store data.
• Used for analyzing and collecting information.
• Uses Scala programming language along with
  Hadoop.
• Tweets, log information, etc.




                                                   36
•   Huge volumes of data.
•   Teradata and Hadoop are used together to store data.
•   Uses Hadoop to understand customer needs.
•   Search queries, server logs, click-throughs, etc.




                                                         37
• Uses Hadoop to analyze data.
• New data products like
    • People you may know
    • Jobs matching your skills
    • Profile visitors, etc.




                                  38
•   Amazon A9
•   The New York Times
•   IBM
•   Last.fm
•   Veoh
•   And the list goes on…..




Other Applications            39
• Optimized for high throughput of data at the expense of
  latency.
• Single point of failure (the NameNode) and limited NameNode
  memory.
• Data in files cannot be modified (append only).
• Hadoop is not a substitute for a database.
• Consumes immense power.



Where Hadoop does not
work?                                                       40
YOU CHOOSE YOURSELF




Which is the best?       41
Hadoop is supplemented by an ecosystem of Apache
projects such as
  •   PIG
  •   HIVE
  •   ZOOKEEPER
  •   HBASE
  •   SQOOP




                                                    42
•   Pig
•   Hive
•   HBase
•   ZooKeeper
•   Sqoop




Hadoop applications   43
• Pig is a large-scale data analysis platform based on
  Hadoop.
• Provides an SQL-like language called Pig Latin.
• Converts SQL-like data requests into a series of optimized
  MapReduce computations.
• Enables complex, massive-scale parallel data computation.
• Provides a simple operation and programming interface.




Pig description                                            44
The scope of Pig   45
•   Amazon/A9
 •   AOL
 •   Facebook
 •   Fox Interactive Media
 •   Google
 •   IBM
 •   New York Times
 •   PowerSet (now Microsoft)
 •   Quantcast
 •   Rackspace/Mailtrust
 •   Veoh
 •   Yahoo!




• Who uses Pig?                  46
•   Ad-hoc analysis
 •   Runs on a cluster computing architecture
 •   Operations use an SQL-like syntax
 •   Open-source code




• Pig characteristics                            47
• Install Pig and connect to the local Hadoop cluster
 • Three ways to run Pig: scripts, the Grunt shell, and the embedded method




• Pig usage                                          49
• records = LOAD 'first.txt' AS (itemname: chararray, price:
  int, quality: int);
• filter_records = FILTER records BY price != 999 AND
  quality == 0;
• group_records = GROUP filter_records BY itemname;
• max_price = FOREACH group_records GENERATE
  group, MAX(filter_records.price);
• DUMP max_price;




• Pig usage                                               50
SQL                                  Pig
SQL is a declarative query           Pig is a data-flow programming
language                             language
A relational database management     Pig accepts data in a looser schema,
system (RDBMS) stores data in        which can be defined at run
strictly defined tables              time
Simple data structures               Pig supports complex nested data
Supports transactions, indexes,      Does not support transactions,
and random reads                     indexes, or random reads




• Pig and SQL comparison 51
•   Programs are built from a series of statements
•   Operations and commands are case-insensitive
•   Aliases and function names are case-sensitive
•   Statements can span multiple lines and together form the
    program logic




• Pig Latin                                                52
•   Hive is a data warehouse tool built on Hadoop
•   Maps structured data files onto database tables
•   Provides a complete SQL-like query language (HiveQL)
•   Converts SQL statements into MapReduce tasks for execution




• Hive                                                   53
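
A small HiveQL sketch of the idea, with a hypothetical table and file
path; Hive compiles the SELECT into one or more MapReduce jobs:

    -- Hypothetical table mapped onto delimited files in HDFS.
    CREATE TABLE items (itemname STRING, price INT, quality INT)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

    LOAD DATA INPATH '/data/first.txt' INTO TABLE items;

    -- Hive compiles this into MapReduce: the map phase reads rows,
    -- the reduce phase computes the per-item aggregate.
    SELECT itemname, MAX(price)
    FROM items
    WHERE price != 999 AND quality = 0
    GROUP BY itemname;

This is the same query the Pig Latin example above expresses as a data
flow; Hive lets users state it declaratively instead.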
• Storage (Hadoop Distributed File System HDFS)
• Computing (MapReduce computing framework)




• Hive Framework                                  54
• Data stored in HDFS is divided into blocks
• Blocks are distributed across multiple machines




• Hive File System                        55
• Pig is a programming language that simplifies common Hadoop
  tasks
• Hive plays the role of the data warehouse in Hadoop
• Compared with the raw Hadoop Java APIs, Pig can significantly
  reduce the amount of code
• Pig attracts a large number of software developers




• About Pig and Hive                                     56
• HBase is a distributed, open-source, column-oriented
  database
• A distributed storage system for structured data
• Provides Bigtable-like capabilities
• Subproject of the Apache Hadoop project
• Suitable for unstructured data storage
• HBase is column-based rather than row-based
• For workloads that require random, real-time read and write
  access




HBase                                                   57
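
A minimal sketch of random, real-time access through the HBase Java
client API; the table, column family, and row key below are hypothetical:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.*;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseExample {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("pages"))) {
          // Write one cell: row key, column family, qualifier, value.
          Put put = new Put(Bytes.toBytes("row1"));
          put.addColumn(Bytes.toBytes("content"), Bytes.toBytes("html"),
                        Bytes.toBytes("<html>...</html>"));
          table.put(put);

          // Random read of the same row; no MapReduce job involved.
          Result result = table.get(new Get(Bytes.toBytes("row1")));
          System.out.println(Bytes.toString(
              result.getValue(Bytes.toBytes("content"), Bytes.toBytes("html"))));
        }
      }
    }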
• Hadoop's distributed coordination service
• Provides simple operations and higher-level abstractions such as
  ordering and notifications
• Used to implement many coordination data structures and
  protocols
• Provides an open-source, shared repository of generic
  coordination patterns and methods
• High performance: benchmark write throughput exceeds 10,000
  ops/s, and throughput is several times higher for read-dominated
  workloads




ZooKeeper                                              59
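
A minimal sketch of the ZooKeeper Java client API; the server address
and znode paths are hypothetical. Ephemeral znodes like the one below
are the building block for locks, group membership, and leader election:

    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    public class ZkExample {
      public static void main(String[] args) throws Exception {
        // Connect with a 3-second session timeout; the watcher just logs.
        ZooKeeper zk = new ZooKeeper("zkhost:2181", 3000,
            event -> System.out.println("event: " + event));

        // Ensure the parent znode exists (persistent, survives sessions).
        if (zk.exists("/workers", false) == null) {
          zk.create("/workers", new byte[0],
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        }

        // An ephemeral znode disappears when this session ends, which is
        // what makes it usable for group membership and leader election.
        zk.create("/workers/worker-1", "alive".getBytes(),
                  ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);

        System.out.println(zk.getChildren("/workers", false));
        zk.close();
      }
    }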
• Aims to assist in efficient data exchange between
  RDBMSs and Hadoop
• Can list and inspect database tables, among other useful utilities
• Supports JDBC-compliant databases such as DB2 and
  MySQL




Sqoop                                                  60
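
A typical command-line invocation, with hypothetical connection
details; Sqoop turns this into a parallel MapReduce import job:

    # Import a MySQL table into HDFS; Sqoop generates the MapReduce job.
    sqoop import \
      --connect jdbc:mysql://dbhost/shop \
      --username reporter -P \
      --table items \
      --target-dir /data/items \
      --num-mappers 4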
•   Data analysis. Retrieved from: http://public.web.cern.ch/public/en/research/DataAnalysis-en.html
•   James Gallagher. DNA sequencing of MRSA used to stop outbreak. http://www.bbc.co.uk/news/health-
    20314024
•   Shankland, S. (2009). Google uncloaks once-secret server. Retrieved from:
    http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en/us/archive/ma
    preduce-osdi04.pdf
•   J. Dean and S. Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. In OSDI’04, 6th
    Symposium on Operating Systems Design and Implementation, Sponsored by USENIX, in cooperation
    with ACM SIGOPS, pages 137–150, 2004.
•   Ralf Lämmel. (2007). Google's MapReduce programming model—Revisited. Science of Computer
    Programming, Volume 68, Issue 3, October 2007.
•   Lucas Mearian. By 2020, there will be 5,200 GB of data for every person on Earth.
    http://www.computerworld.com/s/article/9234563/By_2020_there_will_be_5_200_GB_of_data_for_every
    _person_on_Earth
•   Tom White. Hadoop: the definitive guide.
    http://books.google.com/books?id=Wu_xeGdU4G8C&pg=PA648&dq=hadoop&hl=en&sa=X&ei=6mfKU
    PW7Je3U2QWtzIDgCg&ved=0CDcQ6AEwAA
•   Ilan Horn. Introduction to MapReduce, an Abstraction for Large-Scale Computation.
    http://www.slideshare.net/rantav/introduction-to-map-reduce#btnNext




References                                                                                              62
• A brief view of the platform. Retrieved from: http://hadooper.blogspot.com/
• Pig. Retrieved from: http://pig.apache.org/
• Applications and organizations using Hadoop. Retrieved from:
  http://wiki.apache.org/hadoop/PoweredBy
• Installing and Running Pig. Retrieved from:
  http://ofps.oreilly.com/titles/9781449302641/running_pig.html
• Gates, Alan. Programming Pig. 1st ed. O'Reilly Media, 2011. 11-50. Print.
• What is Hive? Retrieved from: http://hive.apache.org/docs/r0.8.1/
• Hive vs. Pig. Retrieved from: http://www.larsgeorge.com/2009/10/hive-vs-pig.html
• George, Lars. HBase: The Definitive Guide. 1st ed. O'Reilly Media, 2011. 212-
  215. Print.
• White, Tom. Hadoop: The Definitive Guide. 1st ed. O'Reilly Media, 2009. 312-
  368. Print.




References                                                                     63
