Large Scale Data Processing and
Storage

           Ilayaraja Prabakaran
                Product Engineer

              ilayaraja@rediff.co.in
Agenda
 Introduction to large data problem
 MapReduce programming model
 Web mining using MapReduce
 MapReduce with Hadoop
 Hadoop Distributed File System
 Elastic MapReduce
 Scalable storage architecture
Large Data !
Internet 2009 !
 Websites
   234 million - The number of websites by December 2009.
   47 million - Added websites in 2009


 Social Media
   126 million – The number of blogs on the Internet (as
   tracked by BlogPulse).
   27.3 million – Number of tweets on Twitter per day
   (November, 2009)
   350 million – People on Facebook.
Internet 2009 !
 Images
   4 billion – Photos hosted by Flickr (October 2009).
   2.5 billion – Photos uploaded each month to Facebook.


 Videos
   1 billion – The total number of videos YouTube serves in
   one day.
   924 million – Videos viewed per month on Hulu in the US
   (November 2009).
The good news is that “Big Data” is here.

The bad news is that we are struggling to store and
                   analyze it.

    Anyway, should you worry about it?
3 papers ..
 Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung, The
 Google File System, 19th ACM Symposium on Operating
 Systems Principles, Lake George, NY, October, 2003.
 Jeffrey Dean and Sanjay Ghemawat,
 MapReduce: Simplified Data Processing on Large Clusters,
 OSDI'04: Sixth Symposium on Operating System Design and
 Implementation, San Francisco, CA, December, 2004.
 Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh,
 Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew
 Fikes, and Robert E. Gruber, Bigtable: A Distributed Storage
 System for Structured Data, OSDI'06: Seventh Symposium on
 Operating System Design and Implementation, Seattle, WA,
 November, 2006.
Open-source Solutions
 MapReduce
 GFS
 BigTable
MapReduce
Programming model for processing multi-terabyte
data sets on hundreds of CPUs in
parallel.
MapReduce provides:
- Automatic parallelization and distribution
- Fault tolerance
- I/O scheduling
- Status and Monitoring
Programming model
  Input & Output: set of key/value pairs
  Programmer specifies two functions:
  map (in_key, in_value) -> list(out_key, intermediate_value)
  Processes input key/value pair
  Produces set of intermediate pairs
  reduce (out_key, list(intermediate_value)) -> list(out_key, out_value)
  Combines all intermediate values for a
  particular key
  Produces a set of merged output values
  (usually just one)
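The programming model above can be sketched in a few lines of Python. This is a single-process illustration, not how a real cluster runs it: `run_mapreduce`, `map_parity` and `reduce_max` are hypothetical names, and the grouping step stands in for the framework's shuffle.

```python
from collections import defaultdict


def run_mapreduce(inputs, map_fn, reduce_fn):
    """Apply map_fn to every input pair, group by key, then apply reduce_fn."""
    grouped = defaultdict(list)
    # Map phase: each (in_key, in_value) yields intermediate (out_key, value) pairs.
    for in_key, in_value in inputs:
        for out_key, intermediate_value in map_fn(in_key, in_value):
            grouped[out_key].append(intermediate_value)
    # Reduce phase: each key is reduced together with all of its values.
    results = []
    for out_key in sorted(grouped):
        for out_value in reduce_fn(out_key, grouped[out_key]):
            results.append((out_key, out_value))
    return results


def map_parity(name, numbers):
    # Illustrative map: key every number by its parity.
    for n in numbers:
        yield ("even" if n % 2 == 0 else "odd"), n


def reduce_max(key, values):
    # Illustrative reduce: merge the values of one key into a single output.
    yield max(values)


result = run_mapreduce(
    [("batch1", [3, 8, 5]), ("batch2", [7, 2])], map_parity, reduce_max
)
print(result)  # [('even', 8), ('odd', 7)]
```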
Execution
Parallel Execution
Example



     Thinking in MapReduce
Sam’s Mother
 Believed “an apple a day keeps the doctor away”
 (Picture: Mother, Sam, an apple)
Ref. SALSA HPC Group at Community Grids Labs
One day
 Sam thought of drinking the apple.
 He used a [picture] to cut the [picture] and a [picture] to make juice.
Next Day
 Sam applied his invention to all the fruits he
 could find in the fruit basket

  (map [picture] ‘([pictures]))
  (a, [picture], o, [picture], p, [picture], …)
  reduce
  A list of values mapped into another list of values, which gets reduced
  into a single value.
  Classical notion of MapReduce in functional programming
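The classical functional notion can be tried directly with Python's built-in `map` and `functools.reduce`. The fruit names and the `make_juice` and `mix` helpers are assumptions made up for this sketch; the slide itself only shows pictures.

```python
from functools import reduce

fruits = ["apple", "orange", "pear"]  # assumed contents of the fruit basket


def make_juice(fruit):
    # The "map" function: applied to each value independently.
    return f"{fruit} juice"


def mix(glass, juice):
    # The "reduce" function: folds two values into one.
    return f"{glass} + {juice}"


juices = list(map(make_juice, fruits))  # a list mapped into another list
mixed = reduce(mix, juices)             # the list reduced into a single value
print(juices)  # ['apple juice', 'orange juice', 'pear juice']
print(mixed)   # apple juice + orange juice + pear juice
```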
18 Years Later
 Sam got his first job in JuiceRUs for his talent in
 making juice
 Wait!
  Now, it’s not just one basket
  but a whole container of fruits
  Also, they produce a list of
  juice types separately
  (Large data and a list of values for output)
  But Sam had just ONE [picture]
  and ONE [picture]
  NOT ENOUGH!!
Brave Sam
 Implemented a parallel version of his innovation
  (a, [picture], o, [picture], p, [picture], …)
  (a, [picture], o, [picture], p, [picture], …)
  Grouped by key
  Each input to a reduce is a key, value-list
  (possibly a list of these, depending on the
  grouping/hashing mechanism)
  e.g. a, ([pictures] …)
  Reduced into a list of values
Brave Sam
 Implemented a parallel version of his innovation
  A list of key, value pairs mapped into another list of key, value pairs,
  which gets grouped by key and reduced into a list of values.
  The idea of MapReduce in data-intensive computing
Word Count
• map(String input_key, String input_value):
  // input_key: document name
  // input_value: document contents
  for each word w in input_value:
        EmitIntermediate(w, 1);

• reduce(String output_key, Iterator intermediate_values):
  // output_key: a word
  // intermediate_values: a list of counts
  int result = 0;
  for each v in intermediate_values:
        result += ParseInt(v);
  Emit(output_key, AsString(result));
Word Count: Example
            a rose is a rose is a rose

   Map output  Grouped by key      Reduce output
   a,1
   rose,1
   is,1
   a,1         a: 1,1,1            a,3
   rose,1      rose: 1,1,1         rose,3
   is,1        is: 1,1             is,2
   a,1
   rose,1
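A minimal runnable version of the word count, assuming everything fits in one process. The function names are illustrative, and the in-memory grouping replaces the shuffle a real MapReduce run would perform.

```python
from collections import defaultdict


def map_word_count(doc_name, contents):
    # Emit (word, 1) for every word in the document.
    for word in contents.split():
        yield word, 1


def reduce_word_count(word, counts):
    # Sum all the counts emitted for one word.
    return word, sum(counts)


def word_count(documents):
    grouped = defaultdict(list)
    for doc_name, contents in documents.items():
        for word, one in map_word_count(doc_name, contents):
            grouped[word].append(one)  # group intermediate values by key
    return dict(reduce_word_count(w, c) for w, c in grouped.items())


counts = word_count({"doc1": "a rose is a rose is a rose"})
print(counts)  # {'a': 3, 'rose': 3, 'is': 2}
```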
Demo Time

Let's have some fun ☺
Rediff uses MapReduce for:
 Web crawling and indexing
 Web data mining
 - Reverse web-link graph
 - n-gram database
 - Anchor text analysis
 Mining usage logs
 - Related queries
 - Search suggest
 - Query classification
Reverse Web-link Graph
Key: http://www.rediff.com/news
Values:
fromUrl: http://www.rediff.com
   anchor: news
fromUrl: http://en.wikipedia.org/wiki/Rediff.com
   anchor: rediff news
   anchor: rediff headlines
fromUrl: http://www.alexa.com/siteinfo/rediff.com
   anchor: rediff.com
…
Web Graph: MapReduce
• map(String input_key, String input_value):
  // input_key: from-url
  // input_value: document contents

  for each outlink x in input_value: // parsed data
       to-url = x.url        // outgoing link
       anchor = x.anchor     // clickable text
       from-url = input_key
       EmitIntermediate(to-url, (from-url, anchor));
Web Graph: MapReduce
• reduce(String output_key, Iterator intermediate_values):
  // output_key: a to-url
  // intermediate_values: a list of InLinks,
  // i.e. (from-url, anchor) pairs

  result = new InLinks( )
  for each v in intermediate_values:
       result.add(v.url, v.anchor)
  Emit(output_key, result);
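The same map and reduce can be sketched in Python. It assumes the pages are already parsed into `(to_url, anchor)` outlinks; the function names and the in-memory grouping are illustrative only.

```python
from collections import defaultdict


def map_outlinks(from_url, outlinks):
    # Turn each outgoing link around: key it by the page it points to.
    for to_url, anchor in outlinks:
        yield to_url, (from_url, anchor)


def reduce_inlinks(to_url, inlinks):
    # Collect every (from_url, anchor) pair that points at to_url.
    return to_url, sorted(inlinks)


def reverse_link_graph(pages):
    grouped = defaultdict(list)
    for from_url, outlinks in pages.items():
        for to_url, inlink in map_outlinks(from_url, outlinks):
            grouped[to_url].append(inlink)
    return dict(reduce_inlinks(url, links) for url, links in grouped.items())


news = "http://www.rediff.com/news"
wiki = "http://en.wikipedia.org/wiki/Rediff.com"
pages = {
    "http://www.rediff.com": [(news, "news")],
    wiki: [(news, "rediff news"), (news, "rediff headlines")],
}
graph = reverse_link_graph(pages)
for from_url, anchor in graph[news]:
    print(from_url, "|", anchor)
```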
Navigational Search
Anchor text mining
 Input: Web Graph
 Output: ranked set of anchors.
Anchor text mining: MapReduce
   map(key, value)
Key: to-url; value: InLinks
for each inlink ‘i’ in value:
  for each n-gram ‘ng’ in i.anchor:
      score = calc_rank(ng)
      emit( (to-url, ng), score )
Anchor text mining: MapReduce
   reduce(key, values)
Key: (to-url, ng) pair; values: an iterator over
  scores
agg_score = 0
for each score ‘s’ in values:
  agg_score = agg_score + s
emit( (to-url, ng), agg_score )
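A rough Python sketch of the two steps above. The slides do not say how `calc_rank` scores an n-gram, so the version here (one point per word) is a stand-in assumption, as are the `ngrams` helper and the limit of two words per n-gram.

```python
from collections import defaultdict


def ngrams(text, max_n=2):
    # All word n-grams of the anchor text, up to max_n words (assumed limit).
    words = text.lower().split()
    for n in range(1, max_n + 1):
        for i in range(len(words) - n + 1):
            yield " ".join(words[i:i + n])


def calc_rank(ngram):
    # Hypothetical scoring function: one point per word in the n-gram.
    return float(len(ngram.split()))


def map_anchors(to_url, inlinks):
    for from_url, anchor in inlinks:
        for ng in ngrams(anchor):
            yield (to_url, ng), calc_rank(ng)


def reduce_scores(key, scores):
    return key, sum(scores)


def rank_anchors(web_graph):
    grouped = defaultdict(list)
    for to_url, inlinks in web_graph.items():
        for key, score in map_anchors(to_url, inlinks):
            grouped[key].append(score)
    totals = dict(reduce_scores(k, s) for k, s in grouped.items())
    ranked = defaultdict(list)
    for (to_url, ng), score in totals.items():
        ranked[to_url].append((ng, score))
    # Highest aggregate score first; ties broken alphabetically.
    return {u: sorted(v, key=lambda p: (-p[1], p[0])) for u, v in ranked.items()}


url = "http://www.rediff.com/news"
web_graph = {url: [("site-a", "news"), ("site-b", "rediff news"),
                   ("site-b", "rediff headlines")]}
ranked = rank_anchors(web_graph)
print(ranked[url])
```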
Hadoop
Open-source implementation of
        MapReduce
Hadoop
 Created by Doug Cutting
 Originated for Apache Nutch
Why “Hadoop”?
Doug Cutting: “The name my kid gave a stuffed yellow
  elephant. Short, relatively easy to spell and pronounce,
  meaningless, and not used elsewhere: those are my naming
  criteria. Kids are good at generating such.”
Implementation
 Hadoop: MapReduce APIs
 HDFS: Storage
 Mapper Interface
    map(WritableComparable key, Writable value,
    OutputCollector output, Reporter reporter)
 Reducer Interface
    reduce(WritableComparable key, Iterator values,
    OutputCollector output, Reporter reporter)
 Programmers just have to override these
 methods, which makes life easier!
 Hadoop takes care of splitting the work, data flow,
 execution, handling failures, and so on.
Data flow
Map
Reduce
Driver Method
Combiner
 Performs local aggregation of the
 intermediate outputs.
 Cuts down the amount of data transferred
 from the Mapper to the Reducer.
    Map output   Combiner (per mapper)        Reduce output
    a,1
    rose,1       a: 1,1 into (a,2)
    is,1         rose: 1 into (rose,1)
    a,1          is: 1 into (is,1)            a,3
    rose,1       rose: 1,1 into (rose,2)      rose,3
    is,1         is: 1 into (is,1)            is,2
    a,1          a: 1 into (a,1)
    rose,1
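The effect of a combiner can be checked with a small Python sketch. It assumes two mappers, each holding half of the sentence; the split boundaries and function names are illustrative.

```python
from collections import Counter


def map_words(text):
    return [(word, 1) for word in text.split()]


def combine(pairs):
    # Local aggregation on one mapper's output, before anything is transferred.
    local = Counter()
    for word, count in pairs:
        local[word] += count
    return list(local.items())


def reduce_counts(pairs):
    total = Counter()
    for word, count in pairs:
        total[word] += count
    return dict(total)


splits = ["a rose is a", "rose is a rose"]  # assumed: one split per mapper
map_outputs = [map_words(s) for s in splits]
combined = [combine(pairs) for pairs in map_outputs]

pairs_without_combiner = sum(len(p) for p in map_outputs)
pairs_with_combiner = sum(len(p) for p in combined)
final = reduce_counts(pair for part in combined for pair in part)

print(pairs_without_combiner, pairs_with_combiner)  # 8 6
print(final)  # {'a': 3, 'rose': 3, 'is': 2}
```

The saving is small here because the input is tiny; the more often a key repeats within one mapper's output, the more pairs the combiner removes.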
Variations
 Identity Reducer
 - Zero reduce tasks
 - Examples:
      “Cleaning web link graph”
      “Populating HDFS from other data sources”
 - Map does the job and writes the output to HDFS.
 MapReduce Chain
 - For problems that cannot be solved in a single map and
 reduce phase.
 - A series of map and reduce functions is defined.
 - The output of each job is the input to the next job.
Streaming
  Allows you to write map/reduce in any
  programming language,
  e.g. Python, C++, Perl, Bash.
  I/O is represented textually.
  Input is read from stdin and output is written to stdout as
  tab-separated key, value pairs.
  Format: key \t value \n
$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar
-input myInputDirs
-output myOutputDir
-mapper myPythonMapper.py
-reducer myPythonReducer.py
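A streaming mapper and reducer for word count might look like the sketch below. Both live in one file here for brevity and are selected by a `map` or `reduce` command-line argument, which is a convention assumed for this example. The reducer relies on its input arriving sorted by key, which is what the framework delivers.

```python
import sys
from itertools import groupby


def map_lines(lines):
    # Mapper: one "word<TAB>1" line per word read from the input.
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reduce_lines(lines):
    # Reducer: input lines are sorted by key, so equal keys are adjacent.
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{key}\t{sum(int(value) for _, value in group)}"


if __name__ == "__main__" and len(sys.argv) == 2 and sys.argv[1] in ("map", "reduce"):
    step = map_lines if sys.argv[1] == "map" else reduce_lines
    for out in step(sys.stdin):
        print(out)
```

Locally, the pipeline can be imitated by piping the mapper's output through `sort` into the reducer.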
Pipes
 API that provides strong coupling
 between C++ code and Hadoop.
 Improved performance over
 streaming.
 Key and value pairs are STL strings.
 APIs: getInputKey(), getInputValue()
  bin/hadoop pipes -input inputPath -output outputPath -program
 path/to/pipes/program/executable
Hadoop Distributed File System
            (HDFS)
HDFS design principles
 Handling hardware failures
 Streaming data access
 Storing very large files
 Running on cluster of commodity
 hardware
 Simple coherency model
 Data locality
 Portability
HDFS Architecture
HDFS Operation (Read)
HDFS operation (Write)
HDFS Robustness
 NameNode failure, DataNode failure
 and network partitions
 Heartbeats and re-replication
 Cluster rebalancing
 Data integrity: checksums
 Metadata disk failure: FsImage, EditLog
 Snapshots
Anatomy of Hadoop MapReduce
      Job run on HDFS
Map/Reduce Processes
 Launching Application
 - User application code
 - Submits a specific kind of Map/Reduce job
 JobTracker
 - Handles all jobs
 - Makes all scheduling decisions
 TaskTracker
 - Manager for all tasks on a given node
 Task
 - Runs an individual map or reduce fragment
 - Forks from the TaskTracker
Process Diagram
Job Control Flow
 Application launcher creates and submits job.
 JobTracker initializes job, creates FileSplits, and
 adds tasks to queue.
 TaskTrackers ask for a new map or reduce task
 every 10 seconds or when the previous task
 finishes.
 As tasks run, the TaskTracker reports status to
 the JobTracker every 10 seconds.
 Application launcher stops waiting when the job
 completes.
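The control flow above can be imitated with a toy simulation. It is only a sketch: real heartbeats are periodic messages between processes, while here one loop iteration stands for one heartbeat round, and every task is assumed to finish within a single round. The class and method names are made up for the example.

```python
from collections import deque


class JobTracker:
    def __init__(self, file_splits):
        # One map task per FileSplit, queued in order.
        self.queue = deque(f"map-{i}" for i, _ in enumerate(file_splits))
        self.status = {}  # task -> (state, tracker)

    def heartbeat(self, tracker, finished_task=None):
        # A TaskTracker reports status and asks for a new task.
        if finished_task is not None:
            self.status[finished_task] = ("done", tracker)
        if self.queue:
            task = self.queue.popleft()
            self.status[task] = ("running", tracker)
            return task
        return None

    def job_complete(self):
        return not self.queue and all(
            state == "done" for state, _ in self.status.values()
        )


def run_job(file_splits, trackers):
    job_tracker = JobTracker(file_splits)
    current = {t: None for t in trackers}
    rounds = 0
    # The launcher waits until the JobTracker reports completion.
    while not job_tracker.job_complete():
        for t in trackers:
            current[t] = job_tracker.heartbeat(t, finished_task=current[t])
        rounds += 1
    return job_tracker, rounds


tracker, rounds = run_job(["s0", "s1", "s2", "s3", "s4"], ["tt1", "tt2"])
print(rounds, tracker.status)
```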
Hadoop Map/Reduce Job Admin.
Progress of reduce phase
HDFS
Hadoop Benchmarking
Jim Gray’s Sort Benchmark
 Started by Jim Gray at Microsoft in 1998
 Currently managed by 3 of the previous
 winners
 Sorts varying numbers of 100-byte
 records
 - 10-byte key
 - 90-byte value
 Multiple variants:
   Minute Sort: sort must finish in < 60.0 secs
   Terabyte Sort: sort 10^12 bytes
   Gray Sort: >= 10^14 bytes and >= 1 hour
Hadoop won Terabyte Sort ☺
 Hadoop won this in 2008
 Took 209 seconds to complete
 910 nodes, 1800 maps and 1800 reduces
 2 quad-core Xeons @ 2.0 GHz per node
 8 GB RAM per node
Terabyte Sort Task Timeline
Further stats.

Bytes     Nodes   Maps     Reduces   Replication   Time
500 GB    1406    8000     2600      1             59 s
1 TB      1460    8000     2700      1             62 s
100 TB    3452    190000   10000     2             173 m
1000 TB   3658    80000    20000     2             975 m
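To compare the runs, the times in the table can be converted to a sort rate. The short calculation below assumes decimal units (500 GB = 0.5 TB) and uses only the figures listed above.

```python
# (label, terabytes sorted, elapsed minutes), taken from the table above
runs = [
    ("500 GB", 0.5, 59 / 60),
    ("1 TB", 1.0, 62 / 60),
    ("100 TB", 100.0, 173.0),
    ("1000 TB", 1000.0, 975.0),
]

# Sort rate in terabytes per minute for each run
rates = {label: terabytes / minutes for label, terabytes, minutes in runs}

for label, rate in rates.items():
    print(f"{label:>8}: {rate:.2f} TB/min")
```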
Petabyte Sort Task Timeline
Notes on Petabyte Sort
 80,000 maps and 20,000 reduces
 Each node ran 2 maps and 2 reduces at a
 time
 Tail of maps was 100 minutes
 Tail of reduces was 80 minutes
 - caused by one slow node
 Used speculative execution
 The “waste” tasks at the end are mostly
 speculative execution
Cloud Computing & Elastic
MapReduce
Impact of Cloud
Definition & Characteristics
     “A pool of highly scalable, abstracted infrastructure,
        capable of hosting end-customer applications,
                that is billed by consumption”

Characteristics:
  Dynamic computing infrastructure
  Service-centric approach
  Self service based usage model
  Minimally or self-managed platform
  Consumption based billing
Amazon Web Services (AWS)
 Elastic Compute Cloud (EC2)
 Elastic MapReduce
 Simple Storage Service (S3)
 Elastic Block Store
 Elastic Load Balancing
 Amazon CloudWatch
Elastic MapReduce (EMR)
 Automatically spins up a Hadoop implementation
 of the MapReduce framework on an EC2 cluster.
 Sub-divides the data in a job flow into smaller
 chunks so that they can be processed (the “map”
 function) in parallel.
 Recombines the processed data into the final
 solution (the “reduce” function).
 Uses S3 as the source of input data and the
 destination of output data.
 Easy-to-use console for launching jobs with
 dynamic configuration
BigTable
Motivation
 Lots of (semi-)structured data
  – URLs:
      • Contents, crawl metadata, links, anchors,
        pagerank, …
  – Per-user Data:
      • User preference settings, recent queries/search
        results, …
  – Geographic locations:
      • Physical entities (shops, restaurants, etc.), roads,
        satellite image data, …
 Scale is large
  – Billions of URLs, many versions/page (~20K/version)
  – Hundreds of millions of users, thousands of q/sec
  – 100TB+ of satellite image data
Why not just use commercial DB?

 Scale is too large for most commercial databases
 Even if it weren’t, cost would be very high
  – Building internally means system can be
    applied across many projects for low
    incremental cost
 Low-level storage optimizations help
 performance significantly
  – Much harder to do when running on top of a
    database layer

  – Also fun and challenging to build large-scale
    systems ☺
Goals
 Want asynchronous processes to be
 continuously updating different pieces of data
  – Want access to most current data at any time
 Need to support
  – Very high read/write rates (millions of ops per
    second)
  – Efficient scans over all or interesting subsets
    of data
 Often want to examine data changes over time
  – E.g. Contents of a web page over multiple
    crawls
BigTable
 Distributed multi-level map
 – With an interesting data model
 Fault-tolerant, persistent
 Scalable
 –   Thousands of servers
 –   Terabytes of in-memory data
 –   Petabytes of disk-based data
 –   Millions of reads/writes per second, efficient
     scans
 Self-managing
 – Servers can be added/removed dynamically
 – Servers adjust to load imbalance
HBase & Hypertable
 Use data model similar to BigTable
 Sparse, distributed, persistent multi-
 dimensional sorted map
 Map is indexed by
 - row key
 - column key
 - timestamp
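The data model can be sketched as a sorted map keyed by (row key, column key, timestamp). The class below is a single-process illustration, not the API of HBase or Hypertable; the class name, method names and the example row are assumptions. Timestamps are stored negated so that the newest version of a cell sorts first.

```python
import bisect


class SparseSortedTable:
    def __init__(self):
        self._keys = []   # sorted list of (row, column, -timestamp)
        self._cells = {}  # same key -> value; absent cells take no space

    def put(self, row, column, timestamp, value):
        key = (row, column, -timestamp)
        if key not in self._cells:
            bisect.insort(self._keys, key)
        self._cells[key] = value

    def get(self, row, column, timestamp=None):
        # Newest version at or before `timestamp` (newest overall if None).
        bound = float("-inf") if timestamp is None else -timestamp
        i = bisect.bisect_left(self._keys, (row, column, bound))
        if i < len(self._keys) and self._keys[i][:2] == (row, column):
            return self._cells[self._keys[i]]
        return None

    def scan(self, row_start, row_end):
        # All cells whose row key falls in [row_start, row_end), in sorted order.
        i = bisect.bisect_left(self._keys, (row_start,))
        while i < len(self._keys) and self._keys[i][0] < row_end:
            row, column, neg_ts = self._keys[i]
            yield row, column, -neg_ts, self._cells[self._keys[i]]
            i += 1


table = SparseSortedTable()
table.put("com.rediff.www", "contents:", 1, "<html>v1</html>")
table.put("com.rediff.www", "contents:", 2, "<html>v2</html>")
table.put("com.rediff.www", "anchor:wiki", 1, "rediff news")
print(table.get("com.rediff.www", "contents:"))     # <html>v2</html>
print(table.get("com.rediff.www", "contents:", 1))  # <html>v1</html>
```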
Table: Visual representation
 (figure: hypertable.org)
Table: Actual Representation
 (figure: hypertable.org)
System Overview
 (figure: hypertable.org)
Range Server
 Manages ranges of table data
 Caches updates in memory (CellCache)
 Periodically spills (compacts) cached updates to
 disk (CellStore)




                                         hypertable.org
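The cache-then-spill behaviour can be illustrated with a toy class. It is a sketch under simplifying assumptions: the "CellStores" are immutable in-memory tuples rather than files on disk, the spill threshold is a made-up parameter, and none of the names come from Hypertable's code.

```python
class RangeServerSketch:
    def __init__(self, cache_limit=3):
        self.cache_limit = cache_limit  # assumed spill threshold (number of cells)
        self.cell_cache = {}            # recent updates, held in memory
        self.cell_stores = []           # spilled, immutable, sorted runs (oldest first)

    def put(self, key, value):
        self.cell_cache[key] = value
        if len(self.cell_cache) >= self.cache_limit:
            self.compact()

    def compact(self):
        # Spill the cached updates as one sorted, immutable store.
        if self.cell_cache:
            self.cell_stores.append(tuple(sorted(self.cell_cache.items())))
            self.cell_cache = {}

    def get(self, key):
        # The newest data wins: check the cache, then stores from newest to oldest.
        if key in self.cell_cache:
            return self.cell_cache[key]
        for store in reversed(self.cell_stores):
            found = dict(store).get(key)
            if found is not None:
                return found
        return None


server = RangeServerSketch(cache_limit=3)
for key, value in [("a", 1), ("b", 2), ("c", 3), ("a", 10)]:
    server.put(key, value)
print(server.cell_stores)  # [(('a', 1), ('b', 2), ('c', 3))]
print(server.cell_cache)   # {'a': 10}
print(server.get("a"), server.get("b"))  # 10 2
```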
Master
 Single Master (hot standbys)
 Directs meta operations
 – CREATE TABLE
 – DROP TABLE
 – ALTER TABLE
 Handles recovery of RangeServer
 Manages RangeServer Load Balancing
 Client data does not move through Master

                                hypertable.org
Hyperspace
 Chubby equivalent
 – Distributed Lock Manager
 – Filesystem for storing small amounts of
   metadata
 – Highly available
 “Root of distributed data structures”


                             hypertable.org
Optimizations
 Compression: Cell Store blocks are compressed
 Caching: Block Cache & Query Cache
 Bloom Filter: Indicates if key is not present
 Access Groups: minimizing I/O by locality
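A Bloom filter answers "definitely not present" or "possibly present", which lets a server skip reading a store that cannot contain the key. The sketch below is a generic textbook version; the sizes, the SHA-256 based hashing and the names are assumptions, not Hypertable's implementation.

```python
import hashlib


class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for readability

    def _positions(self, key):
        # Derive num_hashes bit positions by salting one hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for position in self._positions(key):
            self.bits[position] = 1

    def might_contain(self, key):
        # False means the key was never added; True may be a false positive.
        return all(self.bits[position] for position in self._positions(key))


bloom = BloomFilter()
print(bloom.might_contain("com.rediff.www"))  # False: nothing added yet
bloom.add("com.rediff.www")
print(bloom.might_contain("com.rediff.www"))  # True
```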
Q&A
Thanks Much !
References
 Jeffrey Dean and Sanjay Ghemawat, MapReduce: Simplified
 Data Processing on Large Clusters
 SALSA HPC Group at Community Grids Labs
 http://code.google.com/edu/parallel/mapreduce-tutorial.html
 http://developer.yahoo.net/blogs/hadoop/2009/05/hadoop_sorts_a_petabyte_in_162.html
 http://hadoop.apache.org/
 http://aws.amazon.com
 http://www.emc.com
 http://pingdom.com/

Future Visions: Predictions to Guide and Time Tech Innovation, Peter Udo DiehlPeter Udo Diehl
 
A Business-Centric Approach to Design System Strategy
A Business-Centric Approach to Design System StrategyA Business-Centric Approach to Design System Strategy
A Business-Centric Approach to Design System StrategyUXDXConf
 
Demystifying gRPC in .Net by John Staveley
Demystifying gRPC in .Net by John StaveleyDemystifying gRPC in .Net by John Staveley
Demystifying gRPC in .Net by John StaveleyJohn Staveley
 
Intro in Product Management - Коротко про професію продакт менеджера
Intro in Product Management - Коротко про професію продакт менеджераIntro in Product Management - Коротко про професію продакт менеджера
Intro in Product Management - Коротко про професію продакт менеджераMark Opanasiuk
 
Powerful Start- the Key to Project Success, Barbara Laskowska
Powerful Start- the Key to Project Success, Barbara LaskowskaPowerful Start- the Key to Project Success, Barbara Laskowska
Powerful Start- the Key to Project Success, Barbara LaskowskaCzechDreamin
 
PLAI - Acceleration Program for Generative A.I. Startups
PLAI - Acceleration Program for Generative A.I. StartupsPLAI - Acceleration Program for Generative A.I. Startups
PLAI - Acceleration Program for Generative A.I. StartupsStefano
 
Unpacking Value Delivery - Agile Oxford Meetup - May 2024.pptx
Unpacking Value Delivery - Agile Oxford Meetup - May 2024.pptxUnpacking Value Delivery - Agile Oxford Meetup - May 2024.pptx
Unpacking Value Delivery - Agile Oxford Meetup - May 2024.pptxDavid Michel
 
Designing for Hardware Accessibility at Comcast
Designing for Hardware Accessibility at ComcastDesigning for Hardware Accessibility at Comcast
Designing for Hardware Accessibility at ComcastUXDXConf
 
UiPath Test Automation using UiPath Test Suite series, part 2
UiPath Test Automation using UiPath Test Suite series, part 2UiPath Test Automation using UiPath Test Suite series, part 2
UiPath Test Automation using UiPath Test Suite series, part 2DianaGray10
 
IESVE for Early Stage Design and Planning
IESVE for Early Stage Design and PlanningIESVE for Early Stage Design and Planning
IESVE for Early Stage Design and PlanningIES VE
 
Buy Epson EcoTank L3210 Colour Printer Online.pdf
Buy Epson EcoTank L3210 Colour Printer Online.pdfBuy Epson EcoTank L3210 Colour Printer Online.pdf
Buy Epson EcoTank L3210 Colour Printer Online.pdfEasyPrinterHelp
 
Optimizing NoSQL Performance Through Observability
Optimizing NoSQL Performance Through ObservabilityOptimizing NoSQL Performance Through Observability
Optimizing NoSQL Performance Through ObservabilityScyllaDB
 

Último (20)

SOQL 201 for Admins & Developers: Slice & Dice Your Org’s Data With Aggregate...
SOQL 201 for Admins & Developers: Slice & Dice Your Org’s Data With Aggregate...SOQL 201 for Admins & Developers: Slice & Dice Your Org’s Data With Aggregate...
SOQL 201 for Admins & Developers: Slice & Dice Your Org’s Data With Aggregate...
 
Linux Foundation Edge _ Overview of FDO Software Components _ Randy at Intel.pdf
Linux Foundation Edge _ Overview of FDO Software Components _ Randy at Intel.pdfLinux Foundation Edge _ Overview of FDO Software Components _ Randy at Intel.pdf
Linux Foundation Edge _ Overview of FDO Software Components _ Randy at Intel.pdf
 
Choosing the Right FDO Deployment Model for Your Application _ Geoffrey at In...
Choosing the Right FDO Deployment Model for Your Application _ Geoffrey at In...Choosing the Right FDO Deployment Model for Your Application _ Geoffrey at In...
Choosing the Right FDO Deployment Model for Your Application _ Geoffrey at In...
 
UiPath Test Automation using UiPath Test Suite series, part 1
UiPath Test Automation using UiPath Test Suite series, part 1UiPath Test Automation using UiPath Test Suite series, part 1
UiPath Test Automation using UiPath Test Suite series, part 1
 
Buy Epson EcoTank L3210 Colour Printer Online.pptx
Buy Epson EcoTank L3210 Colour Printer Online.pptxBuy Epson EcoTank L3210 Colour Printer Online.pptx
Buy Epson EcoTank L3210 Colour Printer Online.pptx
 
WSO2CONMay2024OpenSourceConferenceDebrief.pptx
WSO2CONMay2024OpenSourceConferenceDebrief.pptxWSO2CONMay2024OpenSourceConferenceDebrief.pptx
WSO2CONMay2024OpenSourceConferenceDebrief.pptx
 
Introduction to Open Source RAG and RAG Evaluation
Introduction to Open Source RAG and RAG EvaluationIntroduction to Open Source RAG and RAG Evaluation
Introduction to Open Source RAG and RAG Evaluation
 
10 Differences between Sales Cloud and CPQ, Blanka Doktorová
10 Differences between Sales Cloud and CPQ, Blanka Doktorová10 Differences between Sales Cloud and CPQ, Blanka Doktorová
10 Differences between Sales Cloud and CPQ, Blanka Doktorová
 
Future Visions: Predictions to Guide and Time Tech Innovation, Peter Udo Diehl
Future Visions: Predictions to Guide and Time Tech Innovation, Peter Udo DiehlFuture Visions: Predictions to Guide and Time Tech Innovation, Peter Udo Diehl
Future Visions: Predictions to Guide and Time Tech Innovation, Peter Udo Diehl
 
A Business-Centric Approach to Design System Strategy
A Business-Centric Approach to Design System StrategyA Business-Centric Approach to Design System Strategy
A Business-Centric Approach to Design System Strategy
 
Demystifying gRPC in .Net by John Staveley
Demystifying gRPC in .Net by John StaveleyDemystifying gRPC in .Net by John Staveley
Demystifying gRPC in .Net by John Staveley
 
Intro in Product Management - Коротко про професію продакт менеджера
Intro in Product Management - Коротко про професію продакт менеджераIntro in Product Management - Коротко про професію продакт менеджера
Intro in Product Management - Коротко про професію продакт менеджера
 
Powerful Start- the Key to Project Success, Barbara Laskowska
Powerful Start- the Key to Project Success, Barbara LaskowskaPowerful Start- the Key to Project Success, Barbara Laskowska
Powerful Start- the Key to Project Success, Barbara Laskowska
 
PLAI - Acceleration Program for Generative A.I. Startups
PLAI - Acceleration Program for Generative A.I. StartupsPLAI - Acceleration Program for Generative A.I. Startups
PLAI - Acceleration Program for Generative A.I. Startups
 
Unpacking Value Delivery - Agile Oxford Meetup - May 2024.pptx
Unpacking Value Delivery - Agile Oxford Meetup - May 2024.pptxUnpacking Value Delivery - Agile Oxford Meetup - May 2024.pptx
Unpacking Value Delivery - Agile Oxford Meetup - May 2024.pptx
 
Designing for Hardware Accessibility at Comcast
Designing for Hardware Accessibility at ComcastDesigning for Hardware Accessibility at Comcast
Designing for Hardware Accessibility at Comcast
 
UiPath Test Automation using UiPath Test Suite series, part 2
UiPath Test Automation using UiPath Test Suite series, part 2UiPath Test Automation using UiPath Test Suite series, part 2
UiPath Test Automation using UiPath Test Suite series, part 2
 
IESVE for Early Stage Design and Planning
IESVE for Early Stage Design and PlanningIESVE for Early Stage Design and Planning
IESVE for Early Stage Design and Planning
 
Buy Epson EcoTank L3210 Colour Printer Online.pdf
Buy Epson EcoTank L3210 Colour Printer Online.pdfBuy Epson EcoTank L3210 Colour Printer Online.pdf
Buy Epson EcoTank L3210 Colour Printer Online.pdf
 
Optimizing NoSQL Performance Through Observability
Optimizing NoSQL Performance Through ObservabilityOptimizing NoSQL Performance Through Observability
Optimizing NoSQL Performance Through Observability
 

Large Scale Data Processing & Storage

  • 1. Large Scale Data Processing and Storage Ilayaraja Prabakaran Product Engineer ilayaraja@rediff.co.in
  • 2. Agenda: Introduction to the large data problem; MapReduce programming model; Web mining using MapReduce; MapReduce with Hadoop; Hadoop Distributed File System; Elastic MapReduce; Scalable storage architecture
  • 7. Internet 2009 ! Websites 234 million - The number of websites by December 2009. 47 million - Added websites in 2009 Social Media 126 million – The number of blogs on the Internet (as tracked by BlogPulse). 27.3 million – Number of tweets on Twitter per day (November, 2009) 350 million – People on Facebook.
  • 8. Internet 2009 ! Images 4 billion – Photos hosted by Flickr (October 2009). 2.5 billion – Photos uploaded each month to Facebook. Videos 1 billion – The total number of videos YouTube serves in one day. 924 million – Videos viewed per month on Hulu in the US (November 2009).
  • 9. The good news is that “Big Data” is here. The bad news is that we are struggling to store and analyze it. Anyway, should you worry about it?
  • 10. 3 papers .. Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung, The Google File System, 19th ACM Symposium on Operating Systems Principles, Lake George, NY, October, 2003. Jeffrey Dean and Sanjay Ghemawat, MapReduce: Simplified Data Processing on Large Clusters, OSDI'04: Sixth Symposium on Operating System Design and Implementation, San Francisco, CA, December, 2004. Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, and Robert E. Gruber, Bigtable: A Distributed Storage System for Structured Data, OSDI'06: Seventh Symposium on Operating System Design and Implementation, Seattle, WA, November, 2006.
  • 11. Opensource Solutions MapReduce GFS BigTable
  • 12. MapReduce: a programming model for processing multi-terabyte data on hundreds of CPUs in parallel. MapReduce provides: automatic parallelization and distribution; fault tolerance; I/O scheduling; status and monitoring.
  • 13. Programming model. Input & Output: each a set of key/value pairs. Programmer specifies two functions: map(in_key, in_value) -> list(out_key, intermediate_value): processes an input key/value pair and produces a set of intermediate pairs. reduce(out_key, list(intermediate_value)) -> list(out_key, out_value): combines all intermediate values for a particular key and produces a set of merged output values (usually just one).
  • 16. Example Thinking in MapReduce
  • 17. Sam’s Mother Believed “an apple a day keeps a doctor away” Mother Sam An Apple Ref. SALSA HPC Group at Community Grids Labs
  • 18. One day Sam thought of drinking the apple He used a to cut the and a to make juice.
  • 19. Next Day Sam applied his invention to all the fruits he could find in the fruit basket (map ‘( )) A list of values mapped into another list of values, which gets reduced into a single value (a, , o, , p, , …) reduce Classical Notion of MapReduce in Functional Programming
  • 20. 18 Years Later Sam got his first job in JuiceRUs for his talent in making juice Wait! Now, it’s not just one basket but a whole container of fruits Large data and list of values for output Also, they produce a list of juice types separately But, Sam had just ONE and ONE NOT ENOUGH !!
  • 21. Brave Sam Implemented a parallel version of his innovation (a, , o, , p, , …) (a, , o, , p, , …) Grouped by key Each input to a reduce is a key, value-list (possibly a list of these, depending on the grouping/hashing mechanism) e.g. a, ( …) Reduced into a list of values
  • 22. Brave Sam Implemented a parallel version of his innovation A list of key, value pairs mapped into another list of key, value pairs which gets grouped by the key and reduced into a list of values The idea of MapReduce in Data Intensive Computing
  • 23. Word Count
      map(String input_key, String input_value):
        // input_key: document name
        // input_value: document contents
        for each word w in input_value:
          EmitIntermediate(w, 1);
      reduce(String output_key, Iterator intermediate_values):
        // output_key: a word
        // output_values: a list of counts
        int result = 0;
        for each v in intermediate_values:
          result += ParseInt(v);
        Emit(output_key, AsString(result));
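The word-count pseudocode can be tried out locally with a tiny in-memory driver. This is only a sketch of the programming model, not Hadoop code: `run_mapreduce`, the two sample documents, and the function names are assumptions made for illustration, and the "shuffle" is just a dictionary that groups intermediate values by key.

```python
from collections import defaultdict


def map_word_count(doc_name, contents):
    """Map step: emit (word, 1) for every word in one document."""
    for word in contents.split():
        yield word, 1


def reduce_word_count(word, counts):
    """Reduce step: sum all counts that were emitted for one word."""
    return word, sum(counts)


def run_mapreduce(inputs, mapper, reducer):
    """Toy single-process driver: map, group by key, then reduce."""
    groups = defaultdict(list)
    for key, value in inputs.items():
        for out_key, intermediate in mapper(key, value):
            groups[out_key].append(intermediate)
    # Keys are visited in sorted order, mimicking the sorted reduce input.
    return dict(reducer(k, vs) for k, vs in sorted(groups.items()))


# Hypothetical input: document name -> document contents.
documents = {"doc1": "a rose is a rose", "doc2": "is a rose"}
word_counts = run_mapreduce(documents, map_word_count, reduce_word_count)
print(word_counts)
```

In a real cluster the map calls run on many machines and the grouping happens over the network; only the two user functions stay the same.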
  • 24. Word Count: Example a rose is a rose is a rose a,1 rose,1 is,1 a,1 a1,1,1,1 a,4 rose,1 rose1,1,1,1 rose,4 is,1 is1,1,1 is,3 a,1 rose,1 is,1
  • 25. Demo Time Let's have some fun ☺
  • 26. rediff uses MapReduce for: Web crawling and indexing; Web data mining (reverse web-link graph, n-gram database, anchor text analysis); Mining usage logs (related queries, search suggest, query classification)
  • 27. Reverse Web-link Graph
      Key: http://www.rediff.com/news
      Values:
        fromUrl: http://www.rediff.com
          anchor: news
        fromUrl: http://en.wikipedia.org/wiki/Rediff.com
          anchor: rediff news
          anchor: rediff headlines
        fromUrl: http://www.alexa.com/siteinfo/rediff.com
          anchor: rediff.com
        …….
  • 28. Web Graph: MapReduce
      map(String input_key, String input_value):
        // input_key: from-url
        // input_value: document contents
        for each outlink x in input_value: // parsed data
          to-url = x.url // outgoing link
          anchor = x.anchor // click-able text
          from-url = input_key
          EmitIntermediate(to-url, from-url,anchor);
  • 29. Web Graph: MapReduce
      reduce(String output_key, Iterator intermediate_values):
        // output_key: a to-url
        // output_values: a list of InLinks, i.e. from-url,anchor pairs
        result = new InLinks( )
        for each v in intermediate_values:
          result.add(v.url, v.anchor)
        Emit(output_key, result);
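The two Web Graph slides can be sketched in Python as follows. The `pages` dictionary stands in for already-parsed crawl output and is an assumption (the slides parse outlinks from raw document contents); the helper names are hypothetical, and the grouping step is simulated in memory.

```python
from collections import defaultdict

# Hypothetical parsed crawl data: from-url -> list of (to-url, anchor) outlinks.
pages = {
    "http://www.rediff.com": [
        ("http://www.rediff.com/news", "news"),
    ],
    "http://en.wikipedia.org/wiki/Rediff.com": [
        ("http://www.rediff.com/news", "rediff news"),
        ("http://www.rediff.com/news", "rediff headlines"),
    ],
}


def map_outlinks(from_url, outlinks):
    """Map step: invert each outlink so it is keyed by its target URL."""
    for to_url, anchor in outlinks:
        yield to_url, (from_url, anchor)


def reduce_inlinks(to_url, inlinks):
    """Reduce step: collect every (from-url, anchor) pair for one target."""
    return to_url, sorted(inlinks)


def build_reverse_graph(crawled_pages):
    """Toy driver: run the map step, group by to-url, run the reduce step."""
    grouped = defaultdict(list)
    for from_url, outlinks in crawled_pages.items():
        for to_url, inlink in map_outlinks(from_url, outlinks):
            grouped[to_url].append(inlink)
    return dict(reduce_inlinks(url, links) for url, links in grouped.items())


reverse_graph = build_reverse_graph(pages)
for to_url, inlinks in reverse_graph.items():
    print(to_url, inlinks)
```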
  • 31. Anchor text mining Input: Web Graph Output: ranked set of anchors.
  • 32. Anchor text mining: MapReduce
      map(key, value)
        // key: to-url; value: Inlinks
        for each inlink ‘i’ in value:
          for each n-gram ‘ng’ in anchor:
            score = calc_rank(ng)
            emit( to-url, ng, score )
  • 33. Anchor text mining: MapReduce
      reduce(key, values)
        // key: to-url, ng pair; values: an iterator over score
        agg_score = 0
        for each score ‘s’ in values:
          agg_score = agg_score + s
        emit( to-url, ng, agg_score )
  • 35. Hadoop. Created by Doug Cutting. Originated for Apache Nutch. Why “Hadoop”? Doug Cutting: “The name my kid gave a stuffed yellow elephant. Short, relatively easy to spell and pronounce, meaningless, and not used elsewhere: those are my naming criteria. Kids are good at generating such.”
  • 36. Implementation. Hadoop: MapReduce APIs. HDFS: storage. Mapper interface: map(WritableComparable key, Writable value, OutputCollector output, Reporter reporter). Reducer interface: reduce(WritableComparable key, Iterator values, OutputCollector output, Reporter reporter). Programmers just have to override these methods, which makes life easier! The framework takes care of splitting the work, data flow, execution, handling failures, and so on.
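A rough Python rendering of the anchor-text job is shown below. The slides do not define `calc_rank`, so the version here (every n-gram occurrence scores 1.0) is purely a placeholder assumption, as are the word-level 1-gram and 2-gram extraction and the sample `web_graph`.

```python
from collections import defaultdict


def ngrams(text, max_n=2):
    """Yield word n-grams of length 1..max_n from an anchor text."""
    words = text.lower().split()
    for n in range(1, max_n + 1):
        for i in range(len(words) - n + 1):
            yield " ".join(words[i:i + n])


def calc_rank(ngram):
    """Placeholder scoring (assumption): each occurrence is worth 1.0."""
    return 1.0


def map_anchors(to_url, inlinks):
    """Map step: emit ((to-url, n-gram), score) for every anchor n-gram."""
    for _from_url, anchor in inlinks:
        for ng in ngrams(anchor):
            yield (to_url, ng), calc_rank(ng)


def reduce_scores(key, scores):
    """Reduce step: aggregate the scores of one (to-url, n-gram) pair."""
    return key, sum(scores)


def rank_anchors(web_graph):
    """Toy driver: group map output by key, reduce, sort best-first."""
    grouped = defaultdict(list)
    for to_url, inlinks in web_graph.items():
        for key, score in map_anchors(to_url, inlinks):
            grouped[key].append(score)
    totals = [reduce_scores(key, scores) for key, scores in grouped.items()]
    return sorted(totals, key=lambda item: (-item[1], item[0]))


# Hypothetical input in the shape of the reverse web-link graph.
web_graph = {
    "http://www.rediff.com/news": [
        ("http://www.rediff.com", "news"),
        ("http://en.wikipedia.org/wiki/Rediff.com", "rediff news"),
        ("http://en.wikipedia.org/wiki/Rediff.com", "rediff headlines"),
    ],
}
ranked = rank_anchors(web_graph)
for (url, ng), score in ranked:
    print(url, repr(ng), score)
```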
  • 38. Map
  • 41. Combiner. Performs local aggregation of the intermediate outputs. Cuts down the amount of data transferred from the Mapper to the Reducer.
      First mapper output: a,1 rose,1 is,1 a,1; combined: a: 1,1 into (a,2); rose: 1 into (rose,1); is: 1 into (is,1)
      Second mapper output: rose,1 is,1 a,1 rose,1; combined: rose: 1,1 into (rose,2); is: 1 into (is,1); a: 1 into (a,1)
      Reduce output: a,3 rose,3 is,2
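The effect of a combiner can be illustrated with a small simulation. The two input splits below are an assumption chosen to match the example counts on the slide; the functions are hypothetical stand-ins for the mapper, combiner and reducer, all running in one process.

```python
from collections import Counter


def map_words(text):
    """Map step: one (word, 1) pair per word in the split."""
    return [(word, 1) for word in text.split()]


def combine(pairs):
    """Combiner: pre-aggregate one mapper's output before it is shuffled."""
    local = Counter()
    for word, count in pairs:
        local[word] += count
    return sorted(local.items())


def reduce_counts(all_pairs):
    """Reduce step: sum the (possibly pre-aggregated) counts per word."""
    totals = Counter()
    for word, count in all_pairs:
        totals[word] += count
    return dict(totals)


# Assumed input splits, one per mapper.
splits = ["a rose is a", "rose is a rose"]
mapped = [map_words(split) for split in splits]
combined = [combine(pairs) for pairs in mapped]

pairs_without_combiner = sum(len(pairs) for pairs in mapped)
pairs_with_combiner = sum(len(pairs) for pairs in combined)
totals = reduce_counts(pair for part in combined for pair in part)

print(pairs_without_combiner, pairs_with_combiner, totals)
```

The final totals are the same with or without the combiner; only the number of pairs sent to the reducer changes, which is why the combiner function has to be associative and commutative, as summation is here.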
  • 42. Variations Identity Reducer - Zero reduce tasks - Examples: “Cleaning web link graph” “Populating HDFS from other data sources” - Map does the job and writes the output to HDFS. MapReduce Chain - Problems that are not solvable just by one map and reduce phase. - Series of map and reduce functions defined - Output of previous job goes as input to next job.
  • 43. Streaming: Allows you to write map/reduce in any programming language, e.g. Python, C++, Perl, bash. I/O is represented textually: read from stdin and written to stdout as tab-separated key, value pairs. Format: key \t value \n. Command: $HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar -input myInputDirs -output myOutputDir -mapper myPythonMapper.py -reducer myPythonReducer.py
  • 44. Pipes: API that provides strong coupling between C++ code and Hadoop. Improved performance over streaming. Key and value pairs are STL strings. APIs: getInputKey(), getInputValue(). Command: bin/hadoop pipes -input inputPath -output outputPath -program path/to/pipes/program/executable
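As a sketch of what a Python streaming mapper and reducer might contain, the functions below work on lines of text in the `key \t value` format. They are written as plain generators so they can be exercised without a cluster; the word-count logic and the in-memory `sorted` call (standing in for the framework's sort between map and reduce) are assumptions for illustration.

```python
from itertools import groupby


def mapper(lines):
    """Emit 'word<TAB>1' for each word of each input line."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reducer(sorted_lines):
    """Sum counts per word; input lines must already be sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"


# Local simulation: sorted() plays the role of the shuffle/sort phase.
mapped = sorted(mapper(["a rose is a rose", "is a rose"]))
reduced = list(reducer(mapped))
print("\n".join(reduced))
```

To use these as real streaming scripts, each function would be placed in its own file, fed `sys.stdin`, and have its results printed to stdout; those files are what the `-mapper` and `-reducer` options point to.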
  • 45. Hadoop Distributed File System (HDFS)
  • 46. HDFS design principles Handling hardware failures Streaming data access Storing very large files Running on cluster of commodity hardware Simple coherency model Data locality Portability
  • 50. HDFS Robustness Name node failure, Data node failure and network partitions Heartbeats and Re-replication Cluster Rebalancing Data Integrity: checksum Metadata disk failure: FsImage, Editlog Snapshots
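The "Data Integrity: checksum" item can be illustrated with a short sketch: compute a checksum per fixed-size chunk when data is written, then recompute and compare on read. The 512-byte chunk size and the use of CRC32 via `zlib.crc32` are assumptions made for this example, not a statement of how any particular HDFS version is configured.

```python
import zlib

CHUNK_SIZE = 512  # assumed chunk size for illustration


def checksum_chunks(data, chunk_size=CHUNK_SIZE):
    """Return one CRC32 value per fixed-size chunk of the data."""
    return [
        zlib.crc32(data[offset:offset + chunk_size])
        for offset in range(0, len(data), chunk_size)
    ]


def find_corrupt_chunks(data, expected, chunk_size=CHUNK_SIZE):
    """Return indexes of chunks whose checksum no longer matches."""
    actual = checksum_chunks(data, chunk_size)
    return [i for i, (a, e) in enumerate(zip(actual, expected)) if a != e]


# Write path: store checksums next to the data (2048 bytes -> 4 chunks).
block = bytes(range(256)) * 8
stored_checksums = checksum_chunks(block)

# Read path: simulate a flipped byte inside the second chunk.
damaged = bytearray(block)
damaged[700] ^= 0xFF
corrupt = find_corrupt_chunks(bytes(damaged), stored_checksums)
print(corrupt)
```

When a reader detects a mismatch like this, it can fetch the same block from another replica, which is where checksums and replication work together.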
  • 51. Anatomy of Hadoop MapReduce Job run on HDFS
  • 52. Map/Reduce Processes Launching Application - User application code - Submits a specific kind of Map/Reduce job JobTracker - Handles all jobs - Makes all scheduling decisions TaskTracker - Manager for all tasks on a given node Task - Runs an individual map or reduce fragment - Forks from the TaskTracker
  • 54. Job Control Flow Application launcher creates and submits job. JobTracker initializes job, creates FileSplits, and adds tasks to queue. TaskTrackers ask for a new map or reduce task every 10 seconds or when the previous task finishes. As tasks run, the TaskTracker reports status to the JobTracker every 10 seconds. Application launcher stops waiting when the job completes.
  • 57. HDFS
  • 59. Jim Gray’s Sort Benchmark Started by Jim Gray at Microsoft in 1998 Currently managed by 3 of the previous winners Sorts varying numbers of 100-byte records - 10-byte key - 90-byte value Multiple variants: Minute Sort: sort must finish within 60.0 secs Terabyte Sort: 10^12 bytes sort Gray Sort: = 10^14 bytes and = 1 hour
  • 60. Hadoop won Terabyte Sort ☺ Hadoop won this in 2008 Took 209 seconds to complete 910 nodes, 1800 maps and 1800 reduces. 2 quad-core Xeons @ 2.0 GHz per node, 8 GB RAM per node.
  • 61. Terabyte Sort Task Timeline
  • 62. Further stats.
      Bytes     Nodes   Maps     Reduces   Replication   Time
      500 GB    1406    8000     2600      1             59 s
      1 TB      1460    8000     2700      1             62 s
      100 TB    3452    190000   10000     2             173 m
      1000 TB   3658    80000    20000     2             975 m
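The sort statistics become easier to compare when turned into throughput. The short calculation below derives aggregate and per-node rates from the figures in the table; it assumes decimal units (1 TB = 10^12 bytes, as in the Terabyte Sort definition) and the helper name `throughput` is made up for this sketch.

```python
# (label, bytes sorted, nodes, elapsed seconds), taken from the table above.
runs = [
    ("500 GB", 500e9, 1406, 59),
    ("1 TB", 1e12, 1460, 62),
    ("100 TB", 100e12, 3452, 173 * 60),
    ("1000 TB", 1000e12, 3658, 975 * 60),
]


def throughput(num_bytes, nodes, seconds):
    """Return (cluster GB/s, per-node MB/s) for one sort run."""
    bytes_per_second = num_bytes / seconds
    return bytes_per_second / 1e9, bytes_per_second / nodes / 1e6


rates = {label: throughput(b, n, s) for label, b, n, s in runs}
for label, (cluster_gb_s, node_mb_s) in rates.items():
    print(f"{label:>8}: {cluster_gb_s:5.1f} GB/s cluster, {node_mb_s:5.1f} MB/s per node")
```

These rates count each input byte once; the sort reads, shuffles and writes the data (with replication 2 for the larger runs), so the actual disk and network traffic per node is several times higher.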
  • 63. Petabyte Sort Task Timeline
  • 64. Notes on Petabyte Sort 80,000 maps and 20,000 reduces Each node ran 2 maps and 2 reduces at a time Tail of maps was 100 minutes Tail of reduces was 80 minutes - caused by one slow node Used speculative execution The “waste” tasks at the end are mostly speculative execution
  • 65. Cloud Computing Elastic MapReduce
  • 67. Definition Characteristics “A pool of highly scalable, abstracted infrastructure, capable of hosting end-customer applications, that is billed by consumption” Characteristics: Dynamic computing infrastructure Service-centric approach Self service based usage model Minimally or self-managed platform Consumption based billing
  • 68. Amazon web services (AWS) Elastic Compute Cloud (EC2) Elastic MapReduce Simple Storage Service (S3) Elastic Block Storage Elastic Load Balancing Amazon CloudWatch
  • 69. Elastic MapReduce (EMR) Automatically spins up a Hadoop implementation of the MapReduce framework on an EC2 cluster. Sub-divides the data in a job flow into smaller chunks so that they can be processed (the “map” function) in parallel. Recombines the processed data into the final solution (the “reduce” function). Uses S3 as the source of input data and the destination of output data. Easy-to-use console for launching jobs with dynamic configuration.
  • 71. Motivation Lots of (semi-)structured data – URLs: • Contents, crawl metadata, links, anchors, pagerank, … – Per-user Data: • User preference settings, recent queries/search results, … – Geographic locations: • Physical entities (shops, restaurants, etc.), roads, satellite image data, … Scale is large – Billions of URLs, many versions/page (~20K/version) – Hundreds of millions of users, thousands of q/sec – 100TB+ of satellite image data
  • 72. Why not just use commercial DB? Scale is too large for most commercial databases Even if it weren’t, cost would be very high – Building internally means system can be applied across many projects for low incremental cost Low-level storage optimizations help performance significantly – Much harder to do when running on top of a database layer – Also fun and challenging to build large-scale systems ☺
  • 73. Goals Want asynchronous processes to be continuously updating different pieces of data – Want access to most current data at any time Need to support – Very high read/write rates (millions of ops per second) – Efficient scans over all or interesting subsets of data Often want to examine data changes over time – E.g. Contents of a web page over multiple crawls
  • 74. BigTable Distributed multi-level map – With an interesting data model Fault-tolerant, persistent Scalable – Thousands of servers – Terabytes of in-memory data – Petabytes of disk-based data – Millions of reads/writes per second, efficient scans Self-managing – Servers can be added/removed dynamically – Servers adjust to load imbalance
  • 75. HBase and Hypertable use a data model similar to BigTable: a sparse, distributed, persistent multi-dimensional sorted map. The map is indexed by row key, column key, and timestamp.
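The "sorted map indexed by row key, column key and timestamp" idea can be sketched as a small in-memory class. This is a toy, single-process model and not the HBase or Hypertable API; the class name `SparseTable`, its methods, and the sample row and column names are all assumptions for illustration.

```python
import bisect


class SparseTable:
    """Toy sorted map keyed by (row, column, timestamp), newest version first."""

    def __init__(self):
        self._keys = []    # sorted list of (row, column, -timestamp)
        self._values = {}  # same key -> stored value

    def put(self, row, column, timestamp, value):
        key = (row, column, -timestamp)
        if key not in self._values:
            bisect.insort(self._keys, key)
        self._values[key] = value

    def get(self, row, column, timestamp=None):
        """Return the newest value at or before timestamp (newest if None)."""
        bound = float("inf") if timestamp is None else timestamp
        start = bisect.bisect_left(self._keys, (row, column, -bound))
        if start < len(self._keys):
            found_row, found_column, neg_ts = self._keys[start]
            if (found_row, found_column) == (row, column):
                return self._values[(found_row, found_column, neg_ts)]
        return None

    def scan_row(self, row):
        """Return (column, timestamp, value) for every cell of one row, in key order."""
        start = bisect.bisect_left(self._keys, (row,))
        cells = []
        for key in self._keys[start:]:
            if key[0] != row:
                break
            cells.append((key[1], -key[2], self._values[key]))
        return cells


table = SparseTable()
row = "com.rediff.www/news"  # hypothetical row key
table.put(row, "contents", 1, "<html>v1")
table.put(row, "contents", 2, "<html>v2")
table.put(row, "anchor:en.wikipedia.org", 1, "rediff news")

print(table.get(row, "contents"))               # newest version
print(table.get(row, "contents", timestamp=1))  # version as of timestamp 1
print(table.scan_row(row))
```

Storing the negated timestamp keeps versions of a cell adjacent and newest-first, so both "latest value" lookups and row scans reduce to a binary search followed by a sequential read. The table is sparse because only cells that were actually written occupy any space.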
  • 78. System Overview hypertable.org
  • 79. Range Server Manages ranges of table data Caches updates in memory (CellCache) Periodically spills (compacts) cached updates to disk (CellStore) hypertable.org
  • 80. Master Single Master (hot standbys) Directs meta operations – CREATE TABLE – DROP TABLE – ALTER TABLE Handles recovery of RangeServer Manages RangeServer Load Balancing Client data does not move through Master hypertable.org
  • 81. Hyperspace Chubby equivalent – Distributed Lock Manager – Filesystem for storing small amounts of metadata – Highly available “Root of distributed data structures” hypertable.org
  • 82. Optimizations Compression: Cell Store blocks are compressed Caching: Block Cache Query Cache Bloom Filter: Indicates if key is not present Access Groups: minimizing I/O by locality
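The Bloom filter optimization ("indicates if key is not present") can be sketched in a few lines. The bit-array size, the number of hash functions, and the use of seeded SHA-256 digests as hash functions are assumptions chosen for clarity rather than speed, and do not reflect Hypertable's actual implementation.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key):
        """Derive num_hashes bit positions from seeded SHA-256 digests."""
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        """False means the key was definitely never added."""
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(key)
        )


# Hypothetical use: one filter per on-disk store, consulted before any disk read.
store_filter = BloomFilter()
for row_key in ["com.rediff.www/news", "com.rediff.www/sports"]:
    store_filter.add(row_key)

print(store_filter.might_contain("com.rediff.www/news"))
print(store_filter.might_contain("org.example/unknown"))
```

A negative answer lets a server skip a store file entirely; a positive answer only means the file has to be checked, since the filter can report keys that were never added.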
  • 83. QA
  • 85. References Jeffrey Dean and Sanjay Ghemawat, MapReduce: Simplified Data Processing on Large Clusters SALSA HPC Group at Community Grids Labs http://code.google.com/edu/parallel/mapreduce-tutorial.html http://developer.yahoo.net/blogs/hadoop/2009/05/hadoop_sorts_a_petabyte_in_162.html http://hadoop.apache.org/ http://aws.amazon.com http://www.emc.com http://pingdom.com/