Large Scale Data Analysis with
     Map/Reduce, part I
           Marin Dimitrov
        (technology watch #1)


              Feb 2010
Contents

• Map/Reduce
• Dryad
• Sector/Sphere
• Open source M/R frameworks & tools
   –   Hadoop (Yahoo/Apache)
   –   Cloud MapReduce (Accenture)
   –   Elastic MapReduce (Hadoop on AWS)
   –   MR.Flow
• Some M/R algorithms
   – Graph algorithms, Text Indexing & retrieval



                    Large Scale Data Analysis (Map/Reduce), part I   Feb, 2010   #2
Contents



                       Part I

Distributed computing
      frameworks


Scalability & Parallelisation

• Scalability approaches
   – Scale up (vertical scaling)
       • Only one direction of improvement (bigger box)
   – Scale out (horizontal scaling)
       • Two directions – add more nodes + scale up each node
       • Can achieve x4 the performance of a similarly priced scale-up system
         (ref?)
   – Hybrid (“scale out in a box”)
• Not all algorithms can be parallelised
   – Algorithms with state
   – Dependencies from one iteration to another (recurrence, induction)




Parallelisation approaches

• Parallelisation approaches
   – Task decomposition
       • Distribute coarse-grained (synchronisation wise) and computationally
         expensive tasks (otherwise too much coordination/management
         overhead)
       • Dependencies - execution order vs. data dependencies
       • Move the data to the processing (when needed)
   – Data decomposition
       • Each parallel task works with a data partition assigned to it (no sharing)
       • Data has regular structure, i.e. chunks expected to need the same
         amount of processing time
       • Two criteria: granularity (size of chunk) and shape (data exchange
         between chunk neighbours)
       • Move the processing to the data
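
A minimal Python sketch of data decomposition, assuming a thread pool stands in for the parallel nodes. The names `chunked` and `parallel_sum` are illustrative only and do not come from any framework covered in this deck.

```python
from concurrent.futures import ThreadPoolExecutor


def chunked(data, size):
    """Split data into partitions of at most `size` items (the granularity)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def parallel_sum(data, chunk_size=4, workers=3):
    """Each task sums only its own partition; partial results are combined last."""
    partitions = chunked(data, chunk_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # No data is shared between tasks: every task sees exactly one partition.
        partials = list(pool.map(sum, partitions))
    return sum(partials)


if __name__ == "__main__":
    print(parallel_sum(list(range(10))))
```

Equal-sized chunks only balance the load when every item needs roughly the same processing time, which is the "regular structure" assumption above.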



Amdahl’s law

• Impossible to achieve linear speedup
• Maximum speedup is always bounded by the overhead for
  parallelisation and by the serial processing part
• Amdahl’s law
   – max_speedup = 1 / ((1 - P) + P / N)
   – P: proportion of the program that can be parallelised (1-P still
     remains serial or overhead)
   – N: number of processors / parallel nodes
   – Example: P=75% (i.e. 25% serial or overhead)
  N (parallel nodes)    2         4         8        16        32     1024      64K
  Max speedup           1.60      2.29      2.91     3.37      3.66   3.99      3.99
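
The table can be checked with a few lines of Python. This sketch assumes Amdahl's law in its usual form, max_speedup = 1 / ((1 - P) + P / N); the function name is illustrative.

```python
def max_speedup(p, n):
    """Amdahl's law: p is the parallelisable fraction, n the number of nodes."""
    return 1.0 / ((1.0 - p) + p / n)


if __name__ == "__main__":
    for n in (2, 4, 8, 16, 32, 1024, 64 * 1024):
        print(f"N={n:>6}  max speedup={max_speedup(0.75, n):.2f}")
```

For P = 75% the speedup approaches, but never reaches, the limit 1 / (1 - P) = 4, no matter how many nodes are added.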


Map/Reduce

• Google (2004), US patent (2010)
• General idea - co-locate data with computation nodes
   – Data decomposition (parallelization) – no data/order dependencies
     between tasks (except the Map-to-Reduce phase)
   – Try to utilise data locality (bandwidth is $$$)
   – Implicit data flow (higher abstraction level than MPI)
   – Partial failure handling (failed map/reduce tasks are re-scheduled)
• Structure
   – Map - for each input (Ki,Vi) produce zero or more output pairs
     (Km,Vm)
   – Combine – optional intermediate aggregation (less M->R data
     transfer)
   – Reduce - for input pair (Km, list(V1,V2,…, Vn)) produce zero or more
     output pairs (Kr,Vr)
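
The Map / Combine / Reduce structure can be sketched as a single-process Python driver. This is an illustration under simplifying assumptions, not Google's or Hadoop's API: `run_mapreduce`, `emit_lengths` and `sum_values` are hypothetical names, each input record plays the role of one Map task, and the "transfer" is just a dictionary update.

```python
from collections import defaultdict


def run_mapreduce(records, mapper, reducer, combiner=None):
    """Run one Map -> (Combine) -> group by key -> Reduce pass in memory."""
    groups = defaultdict(list)
    for k_in, v_in in records:
        local = defaultdict(list)                  # output of a single Map task
        for k_m, v_m in mapper(k_in, v_in):
            local[k_m].append(v_m)
        for k_m, values in local.items():
            if combiner is None:
                pairs = [(k_m, v) for v in values]
            else:
                pairs = combiner(k_m, values)      # local aggregation, less M->R data
            for k_c, v_c in pairs:
                groups[k_c].append(v_c)
    output = []
    for k_m in sorted(groups):                     # Reduce sees (Km, list(V1..Vn))
        output.extend(reducer(k_m, groups[k_m]))
    return output


def emit_lengths(doc_id, text):
    """Map: one (word length, 1) pair per word."""
    for word in text.split():
        yield len(word), 1


def sum_values(key, values):
    """Usable as both Combiner and Reducer because addition is associative."""
    yield key, sum(values)


RECORDS = [("d1", "a bb cc"), ("d2", "bb ddd")]

if __name__ == "__main__":
    print(run_mapreduce(RECORDS, emit_lengths, sum_values, combiner=sum_values))
```

The result is the same with or without the combiner; the combiner only reduces the number of pairs that would cross the network.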

Map/Reduce (2)

[Figure, (C) Jimmy Lin]
Map/Reduce - examples

• In other words…
   – Map = partitioning of the data (compute part of a problem across
     several servers)
   – Reduce = processing of the partitions (aggregate the partial results
     from all servers into a single resultset)
   – The M/R framework takes care of grouping of partitions by key
• Example: word count
   – Map (1 task per document in the collection)
       • In: doc_x
       • Out: (term_1, count_{1,x}), (term_2, count_{2,x}), …
   – Reduce (1 task per term in the collection)
       • In: (term_1, <count_{1,x}, count_{1,y}, … count_{1,z}>)
       • Out: (term_1, SUM(count_{1,x}, count_{1,y}, … count_{1,z}))
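
The word count example, sketched in Python with one Map call per document and one Reduce call per term. The function names and the two sample documents are made up for illustration; the grouping loop stands in for the work the M/R framework does between the two phases.

```python
from collections import Counter, defaultdict


def map_word_count(doc_text):
    """Map task for one document: emit (term, count of the term in this document)."""
    return list(Counter(doc_text.lower().split()).items())


def reduce_word_count(term, counts):
    """Reduce task for one term: sum the per-document counts."""
    return term, sum(counts)


def word_count(docs):
    grouped = defaultdict(list)              # the framework's group-by-key step
    for text in docs.values():               # one Map task per document
        for term, count in map_word_count(text):
            grouped[term].append(count)
    return dict(reduce_word_count(t, c) for t, c in grouped.items())


if __name__ == "__main__":
    print(word_count({"x": "to be or not to be", "y": "to do"}))
```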


Map/Reduce - examples (2)

• Example: Shortest path in graph (naïve)
   – Map: in (node_in, dist); out (node_out, dist+1) where node_in -> node_out
   – Reduce: in (node_r, <dist_{a,r}, dist_{b,r}, …, dist_{c,r}>); out (node_r,
     MIN(dist_{a,r}, dist_{b,r}, …, dist_{c,r}))
   – Multiple M/R iterations required, start with (node_start, 0)
• Example: Inverted indexing (full text search)
   – Map
       • In: doc_x
       • Out: (term_1, (doc_x, pos'_{1,x})), (term_1, (doc_x, pos''_{1,x})), (term_2, (doc_x, pos_{2,x})) …
   – Reduce
       • In: (term_1, <(doc_x, pos'_{1,x}), (doc_x, pos''_{1,x}), (doc_y, pos_{1,y}), … (doc_z, pos_{1,z})>)
       • Out: (term_1, <(doc_x, <pos'_{1,x}, pos''_{1,x}, …>), (doc_y, <pos_{1,y}>), … (doc_z,
         <pos_{1,z}>)>)



Map/Reduce - examples (3)

• Inverted index example rundown
• input
   – Doc1: “Why did the chicken cross the road?”
   – Doc2: “The chicken and egg problem”
   – Doc3: “Kentucky Fried Chicken”
• Map phase (3 parallel tasks)
   – map1 => (“why”,(doc1,1)), (“did”,(doc1,2)), (“the”,(doc1,3)),
     (“chicken”,(doc1,4)), (“cross”,(doc1,5)), (“the”,(doc1,6)),
     (“road”,(doc1,7))
   – map2 => (“the”,(doc2,1)), (“chicken”,(doc2,2)), (“and”,(doc2,3)),
     (“egg”,(doc2,4)), (“problem”, (doc2,5))
   – map3 => (“kentucky”,(doc3,1)), (“fried”,(doc3,2)), (“chicken”,(doc3,3))



Map/Reduce - examples (4)

• Inverted index example rundown (cont.)
• Intermediate shuffle & sort phase
   –   (“why”, <(doc1,1)>),
   –   (“did”, <(doc1,2)>),
   –   (“the”, <(doc1,3), (doc1,6), (doc2,1)>)
   –   (“chicken”, <(doc1,4), (doc2,2), (doc3,3)>)
   –   (“cross”, <(doc1,5)>)
   –   (“road”, <(doc1,7)>)
   –   (“and”, <(doc2,3)>)
   –   (“egg”, <(doc2,4)>)
   –   (“problem”, <(doc2,5)>)
   –   (“kentucky”, <(doc3,1)>)
   –   (“fried”, <(doc3,2)>)

Map/Reduce - examples (5)

• Inverted index example rundown (cont.)
• Reduce phase (11 parallel tasks)
   –   (“why”, <(doc1,<1>)>),
   –   (“did”, <(doc1,<2>)>),
   –   (“the”, <(doc1, <3,6>), (doc2, <1>)>)
   –   (“chicken”, <(doc1,<4>), (doc2,<2>), (doc3,<3>)>)
   –   (“cross”, <(doc1,<5>)>)
   –   (“road”, <(doc1,<7>)>)
   –   (“and”, <(doc2,<3>)>)
   –   (“egg”, <(doc2,<4>)>)
   –   (“problem”, <(doc2,<5>)>)
   –   (“kentucky”, <(doc3,<1>)>)
   –   (“fried”, <(doc3,<2>)>)
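
The inverted index rundown (map, shuffle & sort, reduce) can be reproduced with a short Python sketch. The tokenisation rule (lowercase, letters only) and all function names are assumptions made for this example; everything runs in one process.

```python
import re
from collections import defaultdict

DOCS = {
    "doc1": "Why did the chicken cross the road?",
    "doc2": "The chicken and egg problem",
    "doc3": "Kentucky Fried Chicken",
}


def map_index(doc_id, text):
    """Map: emit (term, (doc_id, position)) with 1-based positions."""
    terms = re.findall(r"[a-z]+", text.lower())
    return [(term, (doc_id, pos)) for pos, term in enumerate(terms, start=1)]


def shuffle(pairs):
    """Shuffle & sort: group all postings by term."""
    grouped = defaultdict(list)
    for term, posting in pairs:
        grouped[term].append(posting)
    return grouped


def reduce_index(term, postings):
    """Reduce: collapse postings into (doc_id, [positions]), sorted by doc_id."""
    by_doc = defaultdict(list)
    for doc_id, pos in postings:
        by_doc[doc_id].append(pos)
    return term, [(doc_id, sorted(by_doc[doc_id])) for doc_id in sorted(by_doc)]


def build_index(docs):
    pairs = [p for doc_id, text in docs.items() for p in map_index(doc_id, text)]
    return dict(reduce_index(t, ps) for t, ps in shuffle(pairs).items())


if __name__ == "__main__":
    for term, postings in build_index(DOCS).items():
        print(term, postings)
```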

Map/Reduce – pros & cons

• Good for
   – Lots of input, intermediate & output data
   – Little or no synchronisation required
   – “Read once”, batch oriented datasets (ETL)
• Bad for
   –   Fast response time
   –   Large amounts of shared data
   –   Fine-grained synchronisation required
   –   CPU intensive operations (as opposed to data intensive)




Dryad

• Microsoft Research (2007),
  http://research.microsoft.com/en-us/projects/dryad/
• General purpose distributed execution engine
   – Focus on throughput, not latency
   – Automatic management of scheduling, distribution & fault tolerance
• Simple DAG model
   – Vertices -> processes (processing nodes)
   – Edges -> communication channels between the processes
• DAG model benefits
   – Generic scheduler
   – No deadlocks / deterministic
   – Easier fault tolerance
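
A toy illustration of the DAG model in Python, not Dryad's actual API: vertices are plain functions, edges are (upstream, downstream) pairs, and a generic scheduler runs each vertex once its inputs exist. It uses the standard-library `graphlib` module (Python 3.9+); `run_dag` and the sample vertices are hypothetical names, and execution is sequential in a single process.

```python
from graphlib import TopologicalSorter


def run_dag(vertices, edges, sources):
    """Run every vertex after all of its upstream vertices have produced output.

    vertices: name -> function taking a list of upstream outputs
    edges:    list of (upstream, downstream) channels
    sources:  name -> initial input for vertices without upstream channels
    """
    upstream = {name: [] for name in vertices}
    for src, dst in edges:
        upstream[dst].append(src)
    results = {}
    # The graph is acyclic, so a valid execution order always exists.
    for name in TopologicalSorter(upstream).static_order():
        if upstream[name]:
            inputs = [results[u] for u in upstream[name]]
        else:
            inputs = [sources[name]]
        results[name] = vertices[name](inputs)
    return results


VERTICES = {
    "read_a": lambda ins: ins[0],
    "read_b": lambda ins: ins[0],
    "merge": lambda ins: sorted(ins[0] + ins[1]),
    "total": lambda ins: sum(ins[0]),
}
EDGES = [("read_a", "merge"), ("read_b", "merge"), ("merge", "total")]
SOURCES = {"read_a": [3, 1], "read_b": [2]}

if __name__ == "__main__":
    print(run_dag(VERTICES, EDGES, SOURCES))
```

Because each vertex depends only on its recorded upstream outputs, a failed vertex could be re-run in isolation, which is what makes fault tolerance easier in this model.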

Dryad DAG jobs

[Figure, (C) Michael Isard]

Dryad (3)

• The job graph can mutate during execution (?)
• Channel types (one way)
   –   Files on a DFS
   –   Temporary file
   –   Shared memory FIFO
   –   TCP pipes
• Fault tolerance
   – Node fails => re-run
   – Input disappears => re-run upstream node
   – Node is slow => run a duplicate copy at another node, get first result




Dryad architecture & components

[Figure, (C) Mihai Budiu]

Dryad programming

• C++ API (incl. Map/Reduce interfaces)
• SQL Server Integration Services (SSIS)
   – Many parallel SQL Server instances (each is a vertex in the DAG)
• DryadLINQ
   – LINQ to Dryad translator
• Distributed shell
   – Generalisation of the Unix shell & pipes
   – Many inputs/outputs per process!
   – Pipes span multiple machines




Dryad vs. Map/Reduce

[Figure, (C) Mihai Budiu]

Contents



                       Part II

Open Source Map/Reduce
      frameworks


Hadoop

• Originated in Apache Nutch (2004); Yahoo is currently the major
  contributor
• http://hadoop.apache.org/
• Not only a Map/Reduce implementation!
   –   HDFS – distributed filesystem
   –   HBase – distributed column store
   –   Pig – high level query language (SQL like)
   –   Hive – Hadoop based data warehouse
   –   ZooKeeper, Chukwa, Pipes/Streaming, …
• Also available on Amazon EC2
• Largest Hadoop cluster – 25K nodes / 100K cores (Yahoo)


Hadoop - Map/Reduce

• Components
  – Job client
  – Job Tracker
      • Only one
      • Scheduling, coordinating, monitoring, failure handling
  – Task Tracker
      • Many
      • Executes tasks received from the Job Tracker
      • Sends “heartbeats” and progress reports back to the Job Tracker
  – Task Runner
      • The actual Map or Reduce task started in a separate JVM
      • Crashes & failures do not affect the Task Tracker on the node!




Hadoop - Map/Reduce (2)

[Figure, (C) Tom White]

Hadoop - Map/Reduce (3)

• Integrated with HDFS
   – Map tasks executed on the HDFS node where the data is (data
     locality => reduce traffic)
   – Data locality is not possible for Reduce tasks
   – Intermediate outputs of Map tasks (nodes) are not stored on HDFS,
     but locally, and then sent to the proper Reduce task (node)
• Status updates
   – Task Runner => Task Tracker, progress updates every 3s
   – Task Tracker => Job Tracker, heartbeat + progress for all local tasks
     every 5s
   – If a task has no progress report for too long, it will be considered
     failed and re-started



Hadoop - Map/Reduce (4)

• Some extras
   – Counters
       •   Gather stats about a task
       •   Globally aggregated (Task Runner => Task Tracker => Job Tracker)
       •   M/R counters: M/R input records, M/R output records
       •   Filesystem counters: bytes read/written
       •   Job counters: launched M/R tasks, failed M/R tasks, …
   – Joins
       • Copy the small set on each node and perform joins locally. Useful when
         one dataset is very large, the other very small (e.g. “Scalable Distributed
         Reasoning using MapReduce” from VUA)
       • Map side join – data is joined before the Map function, very efficient but
         less flexible (datasets must be partitioned & sorted in a particular way)
       • Reduce side join – more general but less efficient (Map generates (K,V)
         pairs using the join key)
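
Two of the join strategies above, sketched in plain Python rather than with the Hadoop API. The function names and the sample records are hypothetical; the point is only where the join work happens in each case.

```python
from collections import defaultdict


def replicated_join(large, small):
    """Copy join: `small` is held in memory on every node, so each map task
    joins its share of `large` locally and no Reduce phase is needed."""
    lookup = dict(small)
    return [(key, value, lookup[key]) for key, value in large if key in lookup]


def reduce_side_join(left, right):
    """Reduce-side join: Map emits every record under its join key, tagged by
    source; Reduce pairs up the two sides for each key."""
    grouped = defaultdict(lambda: ([], []))
    for key, value in left:
        grouped[key][0].append(value)     # tagged as coming from `left`
    for key, value in right:
        grouped[key][1].append(value)     # tagged as coming from `right`
    return [(key, lv, rv)
            for key in sorted(grouped)
            for lv in grouped[key][0]
            for rv in grouped[key][1]]


ORDERS = [(1, "book"), (2, "pen"), (1, "ink"), (3, "cup")]
USERS = [(1, "ann"), (2, "bob")]

if __name__ == "__main__":
    print(replicated_join(ORDERS, USERS))
    print(reduce_side_join(ORDERS, USERS))
```

Both produce the same joined rows; the reduce-side variant moves every record through the shuffle, which is why it is the more general but less efficient option.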


Hadoop - Map/Reduce (5)

• Built-in mappers and reducers
   – Chain – run a chain/pipe of sequential Maps (M+RM*). The last Map
     output is the Task output
   – FieldSelection – select a list of fields from the input dataset to be
     used as MR keys/values
   – TokenCounterMapper, SumReducer – (remember the “word count”
     example?)
   – RegexMapper – matches a regex in the input key/value pairs




Cloud MapReduce

• Accenture (2010)
• http://code.google.com/p/cloudmapreduce/
• Map/Reduce implementation for AWS (EC2, S3, SimpleDB,
  SQS)
   – fast (reported as up to 60 times faster than Hadoop/EC2 in some
     cases)
   – scalable & robust (no single bottleneck or single point of failure)
   – simple (3 KLOC)
• Features
   – No need for centralised coordinator (JobTracker), just put job status
     in the cloud datastore (SimpleDB)
   – All data transfer & communication is handled by the Cloud
   – All I/O and storage is handled by the Cloud

Cloud MapReduce (2)

[Figure, (C) Ricky Ho]

Cloud MapReduce (3)

• Job client workflow
   1.   Store input data (S3)
   2.   Create a Map task for each data split & put it into the Mapper
        Queue (SQS)
   3.   Create multiple Partition Queues (SQS)
   4.   Create Reducer Queue (SQS) & put a Reduce task for each Partition
        Queue
   5.   Create the Output Queue (SQS)
   6.   Create a Job Request (ref to all queues) and put it into SimpleDB
   7.   Start EC2 instances for Mappers & Reducers
   8.   Poll SimpleDB for job status
   9.   When the job is complete, download the results from S3



Cloud MapReduce (4)

• Mapper workflow
   1.   Dequeue a Map task from the Mapper Queue
   2.   Fetch data from S3
   3.   Perform the user defined map function, add the output (Km,Vm)
        pairs to one of the Partition Queues (chosen by hash(Km)) => several
        partition keys may share the same partition queue!
   4.   When done remove Map task from Mapper Queue
• Reducer workflow
   1.   Dequeue a Reduce task from the Reducer Queue
   2.   Dequeue the (Km,Vm) pairs from the corresponding Partition Queue
        => several partitions may share the same queue!
   3.   Perform a user defined reduce function and add output pairs (Kr,Vr)
        to the Output Queue
   4.   When done remove the Reduce task from the Reducer Queue
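
The mapper and reducer workflows can be simulated in Python with in-memory queues. This is only a sketch: `collections.deque` stands in for SQS, nothing here calls AWS, and `run_job`, `split_words` and `sum_counts` are hypothetical names. A CRC32 hash is an assumption chosen because it is stable across runs.

```python
import zlib
from collections import defaultdict, deque


def run_job(splits, map_fn, reduce_fn, num_partitions=2):
    """Simulate the queue-driven mapper and reducer workflows in one process."""
    mapper_queue = deque(splits)                     # one Map task per data split
    partition_queues = [deque() for _ in range(num_partitions)]
    reducer_queue = deque(range(num_partitions))     # one Reduce task per partition queue
    output_queue = deque()

    # Mapper workflow
    while mapper_queue:
        split = mapper_queue[0]                      # task stays queued while it runs
        for key, value in map_fn(split):
            # hash(Km) picks the partition queue; several keys may share one.
            index = zlib.crc32(str(key).encode()) % num_partitions
            partition_queues[index].append((key, value))
        mapper_queue.popleft()                       # remove the task only when done

    # Reducer workflow
    while reducer_queue:
        index = reducer_queue[0]
        grouped = defaultdict(list)
        while partition_queues[index]:
            key, value = partition_queues[index].popleft()
            grouped[key].append(value)               # separate the keys sharing this queue
        for key in sorted(grouped):
            output_queue.append(reduce_fn(key, grouped[key]))
        reducer_queue.popleft()
    return list(output_queue)


def split_words(split):
    return [(word, 1) for word in split.split()]


def sum_counts(key, values):
    return key, sum(values)


if __name__ == "__main__":
    print(sorted(run_job(["a b a", "b c"], split_words, sum_counts)))
```

Removing a task from its queue only after the work is finished mirrors the "when done remove" steps: a worker that dies mid-task leaves the task available for another worker.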
MR.Flow

• Web based M/R editor
   – http://www.mr-flow.com
   – Reusable M/R modules
   – Execution & status monitoring (Hadoop clusters)




Contents



                    Part III

Some Map/Reduce
   algorithms


General considerations

• Map execution order is not deterministic
• Map processing time cannot be predicted
• Reduce tasks cannot start before all Maps have finished
  (dataset needs to be fully partitioned)
• Not suitable for continuous input streams
• There will be a spike in network utilisation after the Map /
  before the Reduce phase
• Number & size of key/value pairs
   – Object creation & serialisation overhead (Amdahl’s law!)
• Aggregate partial results when possible!
   – Use Combiners

Graph algorithms

• Very suitable for M/R processing
   – Data (graph node) locality
   – “spreading activation” type of processing
   – Some algorithms with sequential dependency not suitable for M/R
       • Breadth-first search algorithms better than depth-first

• General Approach
   – Graph represented by adjacency lists
   – Map task – input: node + its adjacency list; perform some analysis
     over the node link structure; output: target key + analysis result
   – Reduce task – aggregate values by key
   – Perform multiple iterations (with a termination criterion)
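
The general approach, applied to the naive unweighted shortest-path problem as a Python sketch. It assumes every node appears as a key in the adjacency-list dictionary; the function names and the sample graph are illustrative, and each loop iteration stands for one full M/R pass.

```python
from collections import defaultdict

INF = float("inf")


def map_node(node, dist, neighbours):
    """Map: keep the node's own distance and offer dist + 1 to each neighbour."""
    yield node, dist
    if dist != INF:
        for target in neighbours:
            yield target, dist + 1


def reduce_node(node, candidates):
    """Reduce: the smallest candidate distance wins."""
    return node, min(candidates)


def shortest_paths(graph, start):
    """Repeat M/R passes until no distance changes (the termination criterion)."""
    dist = {node: INF for node in graph}
    dist[start] = 0
    while True:
        grouped = defaultdict(list)
        for node, neighbours in graph.items():
            for key, value in map_node(node, dist[node], neighbours):
                grouped[key].append(value)
        new_dist = dict(reduce_node(n, c) for n, c in grouped.items())
        if new_dist == dist:
            return dist
        dist = new_dist


GRAPH = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": [], "e": ["a"]}

if __name__ == "__main__":
    print(shortest_paths(GRAPH, "a"))
```

Each pass pushes the frontier one hop further, which is why breadth-first style processing maps well onto M/R while depth-first traversal does not.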




Social Network Analysis

• Problem: recommend new friends (friend-of-a-friend, FOAF)
• Map task
   – U (target user) is fixed and its friends list copied to all cluster nodes
     (“copy join”); each cluster node stores part of the social graph
   – In: (X, <friendsX>), i.e. the local data for the cluster node
   – Out:
       • if (U, X) are friends => (U, <friends_X \ friends_U>), i.e. the users who are
         friends of X but not already friends of U
       • nil otherwise

• Reduce task
   – In: (U, <<friends_A \ friends_U>, <friends_B \ friends_U>, … >), i.e. the FOAF
     lists for all users A, B, etc. who are friends with U
   – Out (U, <(X1, N1), (X2, N2), …>), where each X is a FOAF for U, and N is
     its total number of occurrences in all FOAF lists (sort/rank the result!)
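
A Python sketch of the FOAF recommendation, run in a single process. The function names and the sample graph are made up for illustration. One detail is an addition not stated on the slide: U is also removed from each candidate set, since U is by definition a friend of X.

```python
from collections import Counter


def map_foaf(user, friends_of_u, x, friends_of_x):
    """Map: if X is a friend of U, emit X's friends that U does not know yet."""
    if x not in friends_of_u:
        return None                                   # "nil otherwise"
    return user, friends_of_x - friends_of_u - {user}


def reduce_foaf(user, foaf_lists):
    """Reduce: count occurrences of each candidate and rank by that count."""
    counts = Counter(c for foaf in foaf_lists for c in foaf)
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return user, ranked


def recommend(graph, user):
    friends_of_u = graph[user]                        # "copy join": sent to every node
    emitted = [map_foaf(user, friends_of_u, x, fx) for x, fx in graph.items()]
    return reduce_foaf(user, [pair[1] for pair in emitted if pair is not None])


GRAPH = {
    "u": {"a", "b"},
    "a": {"u", "c", "d"},
    "b": {"u", "c"},
    "c": {"a", "b"},
    "d": {"a"},
}

if __name__ == "__main__":
    print(recommend(GRAPH, "u"))
```

In this sample, "c" is ranked first because it is reachable through two of U's friends and "d" through only one.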

PageRank with M/R

[Figure, (C) Jimmy Lin]

Text Indexing & Retrieval

• Indexing is very suitable for M/R
   – Focus on scalability, not on latency & response time
   – Batch oriented
• Map task
   – emit (Term, (DocID, position))
• Reduce task
   – Group pairs by Term and sort by DocID




Text Indexing & Retrieval (2)

[Figure, (C) Jimmy Lin]

Text Indexing & Retrieval (3)

• Retrieval not suitable for M/R
   – Focus on response time
   – Startup of Mappers & Reducers is usually prohibitively expensive
• Katta
   – http://katta.sourceforge.net/
   – Distributed Lucene indexing with Hadoop (HDFS)
   – Multicast querying & ranking




Useful links

• "MapReduce: Simplified Data Processing on Large Clusters"
• “Dryad: Distributed Data-Parallel Programs from Sequential
  Building Blocks”
• “Cloud MapReduce Technical Report”
• Data-Intensive Text Processing with MapReduce
• Hadoop - The Definitive Guide




Q&A




    Questions?




Scaling Your Engineering Organization with Distributed SitesMarin Dimitrov
 
Building, Scaling and Leading High-Performance Teams
Building, Scaling and Leading High-Performance TeamsBuilding, Scaling and Leading High-Performance Teams
Building, Scaling and Leading High-Performance TeamsMarin Dimitrov
 
Uber @ Career Days 2017 (Sofia University)
Uber @ Career Days 2017 (Sofia University)Uber @ Career Days 2017 (Sofia University)
Uber @ Career Days 2017 (Sofia University)Marin Dimitrov
 
GraphDB Connectors – Powering Complex SPARQL Queries
GraphDB Connectors – Powering Complex SPARQL QueriesGraphDB Connectors – Powering Complex SPARQL Queries
GraphDB Connectors – Powering Complex SPARQL QueriesMarin Dimitrov
 
DataGraft Platform: RDF Database-as-a-Service
DataGraft Platform: RDF Database-as-a-ServiceDataGraft Platform: RDF Database-as-a-Service
DataGraft Platform: RDF Database-as-a-ServiceMarin Dimitrov
 
On-Demand RDF Graph Databases in the Cloud
On-Demand RDF Graph Databases in the CloudOn-Demand RDF Graph Databases in the Cloud
On-Demand RDF Graph Databases in the CloudMarin Dimitrov
 
Low-cost Open Data As-a-Service
Low-cost Open Data As-a-ServiceLow-cost Open Data As-a-Service
Low-cost Open Data As-a-ServiceMarin Dimitrov
 
Text Analytics & Linked Data Management As-a-Service
Text Analytics & Linked Data Management As-a-ServiceText Analytics & Linked Data Management As-a-Service
Text Analytics & Linked Data Management As-a-ServiceMarin Dimitrov
 
RDF Database-as-a-Service with S4
RDF Database-as-a-Service with S4RDF Database-as-a-Service with S4
RDF Database-as-a-Service with S4Marin Dimitrov
 
Scaling up Linked Data
Scaling up Linked DataScaling up Linked Data
Scaling up Linked DataMarin Dimitrov
 
Enabling Low-cost Open Data Publishing and Reuse
Enabling Low-cost Open Data Publishing and ReuseEnabling Low-cost Open Data Publishing and Reuse
Enabling Low-cost Open Data Publishing and ReuseMarin Dimitrov
 
S4: The Self-Service Semantic Suite
S4: The Self-Service Semantic SuiteS4: The Self-Service Semantic Suite
S4: The Self-Service Semantic SuiteMarin Dimitrov
 
Scaling to Millions of Concurrent SPARQL Queries on the Cloud
Scaling to Millions of Concurrent SPARQL Queries on the CloudScaling to Millions of Concurrent SPARQL Queries on the Cloud
Scaling to Millions of Concurrent SPARQL Queries on the CloudMarin Dimitrov
 

More from Marin Dimitrov (20)

Measuring the Productivity of Your Engineering Organisation - the Good, the B...
Measuring the Productivity of Your Engineering Organisation - the Good, the B...Measuring the Productivity of Your Engineering Organisation - the Good, the B...
Measuring the Productivity of Your Engineering Organisation - the Good, the B...
 
Mapping Your Career Journey
Mapping Your Career JourneyMapping Your Career Journey
Mapping Your Career Journey
 
Open Source @ Uber
Open Source @ Uber Open Source @ Uber
Open Source @ Uber
 
Trust - the Key Success Factor for Teams & Organisations
Trust - the Key Success Factor for Teams & OrganisationsTrust - the Key Success Factor for Teams & Organisations
Trust - the Key Success Factor for Teams & Organisations
 
Uber @ Telerik Academy 2018
Uber @ Telerik Academy 2018Uber @ Telerik Academy 2018
Uber @ Telerik Academy 2018
 
Machine Learning @ Uber
Machine Learning @ UberMachine Learning @ Uber
Machine Learning @ Uber
 
Career Advice for My Younger Self
Career Advice for My Younger SelfCareer Advice for My Younger Self
Career Advice for My Younger Self
 
Scaling Your Engineering Organization with Distributed Sites
Scaling Your Engineering Organization with Distributed SitesScaling Your Engineering Organization with Distributed Sites
Scaling Your Engineering Organization with Distributed Sites
 
Building, Scaling and Leading High-Performance Teams
Building, Scaling and Leading High-Performance TeamsBuilding, Scaling and Leading High-Performance Teams
Building, Scaling and Leading High-Performance Teams
 
Uber @ Career Days 2017 (Sofia University)
Uber @ Career Days 2017 (Sofia University)Uber @ Career Days 2017 (Sofia University)
Uber @ Career Days 2017 (Sofia University)
 
GraphDB Connectors – Powering Complex SPARQL Queries
GraphDB Connectors – Powering Complex SPARQL QueriesGraphDB Connectors – Powering Complex SPARQL Queries
GraphDB Connectors – Powering Complex SPARQL Queries
 
DataGraft Platform: RDF Database-as-a-Service
DataGraft Platform: RDF Database-as-a-ServiceDataGraft Platform: RDF Database-as-a-Service
DataGraft Platform: RDF Database-as-a-Service
 
On-Demand RDF Graph Databases in the Cloud
On-Demand RDF Graph Databases in the CloudOn-Demand RDF Graph Databases in the Cloud
On-Demand RDF Graph Databases in the Cloud
 
Low-cost Open Data As-a-Service
Low-cost Open Data As-a-ServiceLow-cost Open Data As-a-Service
Low-cost Open Data As-a-Service
 
Text Analytics & Linked Data Management As-a-Service
Text Analytics & Linked Data Management As-a-ServiceText Analytics & Linked Data Management As-a-Service
Text Analytics & Linked Data Management As-a-Service
 
RDF Database-as-a-Service with S4
RDF Database-as-a-Service with S4RDF Database-as-a-Service with S4
RDF Database-as-a-Service with S4
 
Scaling up Linked Data
Scaling up Linked DataScaling up Linked Data
Scaling up Linked Data
 
Enabling Low-cost Open Data Publishing and Reuse
Enabling Low-cost Open Data Publishing and ReuseEnabling Low-cost Open Data Publishing and Reuse
Enabling Low-cost Open Data Publishing and Reuse
 
S4: The Self-Service Semantic Suite
S4: The Self-Service Semantic SuiteS4: The Self-Service Semantic Suite
S4: The Self-Service Semantic Suite
 
Scaling to Millions of Concurrent SPARQL Queries on the Cloud
Scaling to Millions of Concurrent SPARQL Queries on the CloudScaling to Millions of Concurrent SPARQL Queries on the Cloud
Scaling to Millions of Concurrent SPARQL Queries on the Cloud
 

Recently uploaded

A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxLoriGlavin3
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
What is Artificial Intelligence?????????
What is Artificial Intelligence?????????What is Artificial Intelligence?????????
What is Artificial Intelligence?????????blackmambaettijean
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESmohitsingh558521
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningLars Bell
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsNathaniel Shimoni
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxLoriGlavin3
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionDilum Bandara
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersNicole Novielli
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxLoriGlavin3
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfLoriGlavin3
 

Recently uploaded (20)

A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
What is Artificial Intelligence?????????
What is Artificial Intelligence?????????What is Artificial Intelligence?????????
What is Artificial Intelligence?????????
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directions
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software Developers
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 

Large Scale Data Analysis with Map/Reduce, part I

  • 1. Large Scale Data Analysis with Map/Reduce, part I Marin Dimitrov (technology watch #1) Feb 2010
  • 2. Contents
      • Map/Reduce
      • Dryad
      • Sector/Sphere
      • Open source M/R frameworks & tools
          – Hadoop (Yahoo/Apache)
          – Cloud MapReduce (Accenture)
          – Elastic MapReduce (Hadoop on AWS)
          – MR.Flow
      • Some M/R algorithms
          – Graph algorithms, Text Indexing & retrieval
      Large Scale Data Analysis (Map/Reduce), part I Feb, 2010 #2
  • 3. Contents: Part I, Distributed computing frameworks
  • 4. Scalability & Parallelisation
      • Scalability approaches
          – Scale up (vertical scaling)
              • Only one direction of improvement (bigger box)
          – Scale out (horizontal scaling)
              • Two directions: add more nodes + scale up each node
              • Can achieve x4 the performance of a similarly priced scale-up system (ref?)
          – Hybrid (“scale out in a box”)
      • Not every algorithm can be parallelised
          – Algorithms with state
          – Dependencies from one iteration to another (recurrence, induction)
  • 5. Parallelisation approaches
      • Task decomposition
          – Distribute coarse-grained (synchronisation wise) and computationally expensive tasks (otherwise too much coordination/management overhead)
          – Dependencies: execution order vs. data dependencies
          – Move the data to the processing (when needed)
      • Data decomposition
          – Each parallel task works with a data partition assigned to it (no sharing)
          – Data has regular structure, i.e. chunks are expected to need the same amount of processing time
          – Two criteria: granularity (size of chunk) and shape (data exchange between chunk neighbours)
          – Move the processing to the data
  • 6. Amdahl’s law
      • Impossible to achieve linear speedup
      • Maximum speedup is always bounded by the overhead for parallelisation and by the serial processing part
      • Amdahl’s law
          – max_speedup = 1 / ((1 - P) + P/N)
          – P: proportion of the program that can be parallelised (1-P still remains serial or overhead)
          – N: number of processors / parallel nodes
          – Example: P=75% (i.e. 25% serial or overhead)

            N (parallel nodes)   2     4     8     16    32    1024  64K
            Max speedup          1.60  2.29  2.91  3.37  3.66  3.99  3.99
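The example row on this slide can be reproduced with a few lines of Python. This is a minimal sketch that assumes the usual form of Amdahl's law, 1 / ((1 - P) + P/N); the function name `max_speedup` is made up for illustration.

```python
def max_speedup(p: float, n: int) -> float:
    """Upper bound on speedup when a fraction p of the work runs on n parallel nodes."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    if n < 1:
        raise ValueError("n must be at least 1")
    # Serial part (1 - p) is unaffected; parallel part p is divided across n nodes.
    return 1.0 / ((1.0 - p) + p / n)


if __name__ == "__main__":
    for n in (2, 4, 8, 16, 32, 1024):
        print(f"N={n:>5}  max speedup={max_speedup(0.75, n):.2f}")
```

With P=75% the bound approaches, but never reaches, 1 / (1 - P) = 4, no matter how many nodes are added.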
  • 7. Map/Reduce
      • Google (paper published 2004), US patent (2010)
      • General idea: co-locate data with computation nodes
          – Data decomposition (parallelisation): no data/order dependencies between tasks (except the Map-to-Reduce phase)
          – Try to utilise data locality (bandwidth is $$$)
          – Implicit data flow (higher abstraction level than MPI)
          – Partial failure handling (failed map/reduce tasks are re-scheduled)
      • Structure
          – Map: for each input pair (K_i, V_i) produce zero or more output pairs (K_m, V_m)
          – Combine: optional intermediate aggregation (less M->R data transfer)
          – Reduce: for each input pair (K_m, list(V_1, V_2, …, V_n)) produce zero or more output pairs (K_r, V_r)
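The Map / Combine / Reduce structure can be sketched as a single-process Python function. This is only an in-memory illustration, not the API of any real framework: `run_mapreduce` and the temperature example are hypothetical, and each input record stands in for one map task.

```python
from collections import defaultdict


def run_mapreduce(inputs, mapper, reducer, combiner=None):
    """One in-memory Map -> (Combine) -> shuffle -> Reduce pass.

    inputs is an iterable of (key, value) pairs; mapper, combiner and
    reducer each return an iterable of (key, value) pairs.
    """
    groups = defaultdict(list)
    for k_in, v_in in inputs:
        # Simulated map task: collect its output locally first.
        local = defaultdict(list)
        for k, v in mapper(k_in, v_in):
            local[k].append(v)
        for k, values in local.items():
            # Optional combine step shrinks the data sent to the reducers.
            pairs = combiner(k, values) if combiner else [(k, v) for v in values]
            for k2, v2 in pairs:
                groups[k2].append(v2)  # shuffle: group by intermediate key
    output = []
    for k in sorted(groups):
        output.extend(reducer(k, groups[k]))
    return output


def parse_readings(_chunk_id, chunk):
    # Each line of a chunk is "<city> <temperature>".
    for line in chunk.splitlines():
        city, temp = line.split()
        yield city, int(temp)


def keep_max(city, temps):
    yield city, max(temps)


chunks = [(0, "sofia 3\nvarna 5\nsofia 7"), (1, "varna 9\nsofia 1")]
hottest = run_mapreduce(chunks, parse_readings, keep_max, combiner=keep_max)
```

Because taking a maximum is associative, the same function can serve as combiner and reducer; with the combiner, the first chunk sends two values to the reduce phase instead of three.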
  • 8. Map/Reduce (2) (figure, (C) Jimmy Lin)
  • 9. Map/Reduce - examples
      • In other words…
          – Map = partitioning of the data (compute part of a problem across several servers)
          – Reduce = processing of the partitions (aggregate the partial results from all servers into a single resultset)
          – The M/R framework takes care of grouping of partitions by key
      • Example: word count
          – Map (1 task per document in the collection)
              • In: doc_x
              • Out: (term_1, count_{1,x}), (term_2, count_{2,x}), …
          – Reduce (1 task per term in the collection)
              • In: (term_1, <count_{1,x}, count_{1,y}, …, count_{1,z}>)
              • Out: (term_1, SUM(count_{1,x}, count_{1,y}, …, count_{1,z}))
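The word count example can be sketched in plain Python, with one simulated map call per document and one reduce call per term. The function names are illustrative, and splitting on whitespace is a simplifying assumption about tokenisation.

```python
from collections import Counter, defaultdict


def map_doc(doc_id, text):
    """Map: emit (term, count of term in this document)."""
    return list(Counter(text.lower().split()).items())


def reduce_term(term, counts):
    """Reduce: sum the per-document counts for one term."""
    return term, sum(counts)


def word_count(docs):
    grouped = defaultdict(list)  # stands in for the framework's grouping by key
    for doc_id, text in docs.items():
        for term, count in map_doc(doc_id, text):
            grouped[term].append(count)
    return dict(reduce_term(term, counts) for term, counts in grouped.items())


totals = word_count({"d1": "the chicken the road", "d2": "the egg"})
```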
  • 10. Map/Reduce examples (2)
      • Example: Shortest path in graph (naïve)
          – Map: in (node_in, dist); out (node_out, dist+1) where node_in -> node_out
          – Reduce: in (node_r, <dist_{a,r}, dist_{b,r}, …, dist_{c,r}>); out (node_r, MIN(dist_{a,r}, dist_{b,r}, …, dist_{c,r}))
          – Multiple M/R iterations required, start with (node_start, 0)
      • Example: Inverted indexing (full text search)
          – Map
              • In: doc_x
              • Out: (term_1, (doc_x, pos'_{1,x})), (term_1, (doc_x, pos''_{1,x})), (term_2, (doc_x, pos_{2,x})), …
          – Reduce
              • In: (term_1, <(doc_x, pos'_{1,x}), (doc_x, pos''_{1,x}), (doc_y, pos_{1,y}), …, (doc_z, pos_{1,z})>)
              • Out: (term_1, <(doc_x, <pos'_{1,x}, pos''_{1,x}, …>), (doc_y, <pos_{1,y}>), …, (doc_z, <pos_{1,z}>)>)
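The naïve shortest path iteration can be sketched as repeated map and reduce passes over a dictionary of known distances. This is a minimal in-memory sketch for unweighted edges; the names are illustrative, and re-emitting each node's own distance in the map step is an added assumption so that already-reached nodes are not lost between iterations.

```python
from collections import defaultdict


def map_node(node, dist, graph):
    """Map: keep the node's own distance and offer dist + 1 to each neighbour."""
    yield node, dist
    for neighbour in graph.get(node, ()):
        yield neighbour, dist + 1


def reduce_node(node, dists):
    """Reduce: the shortest candidate distance wins."""
    return node, min(dists)


def shortest_paths(graph, start):
    distances = {start: 0}
    while True:
        grouped = defaultdict(list)
        for node, dist in distances.items():
            for target, candidate in map_node(node, dist, graph):
                grouped[target].append(candidate)
        updated = dict(reduce_node(n, ds) for n, ds in grouped.items())
        if updated == distances:  # termination: an iteration changed nothing
            return distances
        distances = updated


graph = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": ["e"], "e": []}
distances = shortest_paths(graph, "a")
```

Each loop iteration corresponds to one full M/R job, which is why the approach needs as many jobs as the longest shortest path has hops.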
  • 11. Map/Reduce - examples (3)
      • Inverted index example rundown
      • Input
          – Doc1: “Why did the chicken cross the road?”
          – Doc2: “The chicken and egg problem”
          – Doc3: “Kentucky Fried Chicken”
      • Map phase (3 parallel tasks)
          – map1 => (“why”,(doc1,1)), (“did”,(doc1,2)), (“the”,(doc1,3)), (“chicken”,(doc1,4)), (“cross”,(doc1,5)), (“the”,(doc1,6)), (“road”,(doc1,7))
          – map2 => (“the”,(doc2,1)), (“chicken”,(doc2,2)), (“and”,(doc2,3)), (“egg”,(doc2,4)), (“problem”,(doc2,5))
          – map3 => (“kentucky”,(doc3,1)), (“fried”,(doc3,2)), (“chicken”,(doc3,3))
  • 12. Map/Reduce - examples (4)
      • Inverted index example rundown (cont.)
      • Intermediate shuffle & sort phase
          – (“why”, <(doc1,1)>)
          – (“did”, <(doc1,2)>)
          – (“the”, <(doc1,3), (doc1,6), (doc2,1)>)
          – (“chicken”, <(doc1,4), (doc2,2), (doc3,3)>)
          – (“cross”, <(doc1,5)>)
          – (“road”, <(doc1,7)>)
          – (“and”, <(doc2,3)>)
          – (“egg”, <(doc2,4)>)
          – (“problem”, <(doc2,5)>)
          – (“kentucky”, <(doc3,1)>)
          – (“fried”, <(doc3,2)>)
  • 13. Map/Reduce - examples (5)
      • Inverted index example rundown (cont.)
      • Reduce phase (11 parallel tasks)
          – (“why”, <(doc1,<1>)>)
          – (“did”, <(doc1,<2>)>)
          – (“the”, <(doc1,<3,6>), (doc2,<1>)>)
          – (“chicken”, <(doc1,<4>), (doc2,<2>), (doc3,<3>)>)
          – (“cross”, <(doc1,<5>)>)
          – (“road”, <(doc1,<7>)>)
          – (“and”, <(doc2,<3>)>)
          – (“egg”, <(doc2,<4>)>)
          – (“problem”, <(doc2,<5>)>)
          – (“kentucky”, <(doc3,<1>)>)
          – (“fried”, <(doc3,<2>)>)
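The inverted index rundown can be replayed end to end with a short Python sketch over the same three documents. The function names are illustrative, and lower-casing plus a `\w+` regular expression is an assumed tokeniser (the slides do not specify one).

```python
import re
from collections import defaultdict


def map_doc(doc_id, text):
    """Map: emit (term, (doc_id, position)) with 1-based positions."""
    for pos, term in enumerate(re.findall(r"\w+", text.lower()), start=1):
        yield term, (doc_id, pos)


def reduce_term(term, postings):
    """Reduce: group the positions of one term by document."""
    by_doc = defaultdict(list)
    for doc_id, pos in sorted(postings):
        by_doc[doc_id].append(pos)
    return term, list(by_doc.items())


def build_index(docs):
    shuffled = defaultdict(list)  # shuffle & sort: group postings by term
    for doc_id, text in docs.items():
        for term, posting in map_doc(doc_id, text):
            shuffled[term].append(posting)
    return dict(reduce_term(term, postings) for term, postings in shuffled.items())


docs = {
    "doc1": "Why did the chicken cross the road?",
    "doc2": "The chicken and egg problem",
    "doc3": "Kentucky Fried Chicken",
}
index = build_index(docs)
```

Here `index["the"]` holds positions 3 and 6 for doc1 and position 1 for doc2, matching the reduce output listed in the rundown.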
  • 14. Map/Reduce – pros & cons
      • Good for
          – Lots of input, intermediate & output data
          – Little or no synchronisation required
          – “Read once”, batch oriented datasets (ETL)
      • Bad for
          – Fast response time
          – Large amounts of shared data
          – Fine-grained synchronisation required
          – CPU intensive operations (as opposed to data intensive)
  • 15. Dryad
      • Microsoft Research (2007), http://research.microsoft.com/en-us/projects/dryad/
      • General purpose distributed execution engine
          – Focus on throughput, not latency
          – Automatic management of scheduling, distribution & fault tolerance
      • Simple DAG model
          – Vertices -> processes (processing nodes)
          – Edges -> communication channels between the processes
      • DAG model benefits
          – Generic scheduler
          – No deadlocks / deterministic
          – Easier fault tolerance
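Why a DAG allows a generic, deadlock-free scheduler can be illustrated with Python's standard `graphlib` module (Python 3.9+). This is a toy single-process sketch and not the Dryad API: `run_dag` and the example job are hypothetical. Each vertex is a function that runs once all of its upstream vertices have produced results.

```python
from graphlib import TopologicalSorter


def run_dag(vertices):
    """Run every vertex after its inputs.

    vertices maps a vertex name to (function, [names of upstream vertices]).
    A cycle would raise graphlib.CycleError, so a valid job cannot deadlock.
    """
    order = TopologicalSorter({name: deps for name, (_, deps) in vertices.items()})
    results = {}
    for name in order.static_order():
        func, deps = vertices[name]
        # Edges act as channels: upstream results are passed as arguments.
        results[name] = func(*(results[dep] for dep in deps))
    return results


job = {
    "read_a": (lambda: [3, 1], []),
    "read_b": (lambda: [2], []),
    "merge": (lambda a, b: sorted(a + b), ["read_a", "read_b"]),
    "total": (lambda merged: sum(merged), ["merge"]),
}
results = run_dag(job)
```

Because every vertex is a deterministic function of its inputs, re-running a failed vertex (or its upstream vertex) yields the same result, which is what makes fault tolerance simpler in this model.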
  • 16. Dryad DAG jobs (figure, (C) Michael Isard)
  • 17. Dryad (3)
      • The job graph can mutate during execution (?)
      • Channel types (one way)
          – Files on a DFS
          – Temporary file
          – Shared memory FIFO
          – TCP pipes
      • Fault tolerance
          – Node fails => re-run
          – Input disappears => re-run upstream node
          – Node is slow => run a duplicate copy at another node, get first result
  • 18. Dryad architecture & components (figure, (C) Mihai Budiu)
  • 19. Dryad programming
      • C++ API (incl. Map/Reduce interfaces)
      • SQL Server Integration Services (SSIS)
          – Many parallel SQL Server instances (each is a vertex in the DAG)
      • DryadLINQ
          – LINQ to Dryad translator
      • Distributed shell
          – Generalisation of the Unix shell & pipes
          – Many inputs/outputs per process!
          – Pipes span multiple machines
  • 20. Dryad vs. Map/Reduce (figure, (C) Mihai Budiu)
  • 21. Contents: Part II, Open Source Map/Reduce frameworks
  • 22. Hadoop
      • Apache Nutch (2004), Yahoo is currently the major contributor
      • http://hadoop.apache.org/
      • Not only a Map/Reduce implementation!
          – HDFS – distributed filesystem
          – HBase – distributed column store
          – Pig – high level query language (SQL like)
          – Hive – Hadoop based data warehouse
          – ZooKeeper, Chukwa, Pipes/Streaming, …
      • Also available on Amazon EC2
      • Largest Hadoop cluster – 25K nodes / 100K cores (Yahoo)
  • 23. Hadoop - Map/Reduce
      • Components
          – Job client
          – Job Tracker
              • Only one
              • Scheduling, coordinating, monitoring, failure handling
          – Task Tracker
              • Many
              • Executes tasks received from the Job Tracker
              • Sends “heartbeats” and progress reports back to the Job Tracker
          – Task Runner
              • The actual Map or Reduce task, started in a separate JVM
              • Crashes & failures do not affect the Task Tracker on the node!
  • 24. Hadoop - Map/Reduce (2) (figure, (C) Tom White)
  • 25. Hadoop - Map/Reduce (3)
      • Integrated with HDFS
          – Map tasks are executed on the HDFS node where the data is (data locality => reduce traffic)
          – Data locality is not possible for Reduce tasks
          – Intermediate outputs of Map tasks (nodes) are not stored on HDFS, but locally, and then sent to the proper Reduce task (node)
      • Status updates
          – Task Runner => Task Tracker, progress updates every 3s
          – Task Tracker => Job Tracker, heartbeat + progress for all local tasks every 5s
          – If a task has no progress report for too long, it will be considered failed and re-started
  • 26. Hadoop - Map/Reduce (4)
      • Some extras
          – Counters
              • Gather stats about a task
              • Globally aggregated (Task Runner => Task Tracker => Job Tracker)
              • M/R counters: M/R input records, M/R output records
              • Filesystem counters: bytes read/written
              • Job counters: launched M/R tasks, failed M/R tasks, …
          – Joins
              • Copy join – copy the small set to each node and perform the joins locally. Useful when one dataset is very large and the other very small (e.g. “Scalable Distributed Reasoning using MapReduce” from VUA)
              • Map side join – data is joined before the Map function, very efficient but less flexible (datasets must be partitioned & sorted in a particular way)
              • Reduce side join – more general but less efficient (Map generates (K,V) pairs using the join key)
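The difference between copying the small dataset to every node and joining on the reduce side can be sketched in plain Python. This is an in-memory illustration rather than Hadoop code; the function names and the orders/users data are made up, and the copy join assumes unique keys in the small dataset.

```python
from collections import defaultdict


def replicated_join(large, small):
    """Copy join: every node holds the whole small dataset as a lookup table."""
    lookup = dict(small)  # assumption: keys in the small set are unique
    return [(key, value, lookup[key]) for key, value in large if key in lookup]


def reduce_side_join(left, right):
    """Reduce-side join: map tags records by source, reduce pairs them per key."""
    grouped = defaultdict(lambda: ([], []))
    for key, value in left:
        grouped[key][0].append(value)   # tagged as coming from the left dataset
    for key, value in right:
        grouped[key][1].append(value)   # tagged as coming from the right dataset
    return [
        (key, l_value, r_value)
        for key in sorted(grouped)
        for l_value in grouped[key][0]
        for r_value in grouped[key][1]
    ]


orders = [(1, "book"), (2, "pen"), (1, "lamp"), (3, "cup")]
users = [(1, "ann"), (2, "bob")]
copy_joined = replicated_join(orders, users)
reduce_joined = reduce_side_join(orders, users)
```

The copy join needs no shuffle at all, while the reduce-side join moves every record of both datasets across the network, which is the efficiency trade-off noted above.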
  • 27. Hadoop - Map/Reduce (5)
      • Built-in mappers and reducers
          – Chain – run a chain/pipe of sequential Maps (M+ R M*). The last Map output is the Task output
          – FieldSelection – select a list of fields from the input dataset to be used as MR keys/values
          – TokenCounterMapper, SumReducer – (remember the “word count” example?)
          – RegexMapper – matches a regex in the input key/value pairs
  • 28. Cloud MapReduce
      • Accenture (2010)
      • http://code.google.com/p/cloudmapreduce/
      • Map/Reduce implementation for AWS (EC2, S3, SimpleDB, SQS)
          – fast (reported as up to 60 times faster than Hadoop/EC2 in some cases)
          – scalable & robust (no single point of bottleneck or failure)
          – simple (3 KLOC)
      • Features
          – No need for a centralised coordinator (JobTracker), just put job status in the cloud datastore (SimpleDB)
          – All data transfer & communication is handled by the Cloud
          – All I/O and storage is handled by the Cloud
  • 29. Cloud MapReduce (2) (figure, (C) Ricky Ho)
  • 30. Cloud MapReduce (3)
      • Job client workflow
          1. Store input data (S3)
          2. Create a Map task for each data split & put it into the Mapper Queue (SQS)
          3. Create multiple Partition Queues (SQS)
          4. Create the Reducer Queue (SQS) & put a Reduce task for each Partition Queue
          5. Create the Output Queue (SQS)
          6. Create a Job Request (ref to all queues) and put it into SimpleDB
          7. Start EC2 instances for Mappers & Reducers
          8. Poll SimpleDB for job status
          9. When the job completes, download the results from S3
  • 31. Cloud MapReduce (4)
      • Mapper workflow
          1. Dequeue a Map task from the Mapper Queue
          2. Fetch data from S3
          3. Perform the user defined map function, add the output (Km,Vm) pairs to the Partition Queue selected by hash(Km) => several partition keys may share the same partition queue!
          4. When done, remove the Map task from the Mapper Queue
      • Reducer workflow
          1. Dequeue a Reduce task from the Reducer Queue
          2. Dequeue the (Km,Vm) pairs from the corresponding Partition Queue => several partitions may share the same queue!
          3. Perform the user defined reduce function and add the output pairs (Kr,Vr) to the Output Queue
          4. When done, remove the Reduce task from the Reducer Queue
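The queue-driven mapper and reducer workflows can be imitated in a single process, with `collections.deque` standing in for the SQS queues. This is a rough sketch and not Cloud MapReduce code: all names are hypothetical, CRC32 is an assumed hash function, and tasks are removed from a queue as soon as they are dequeued rather than after completion.

```python
import zlib
from collections import defaultdict, deque


def partition_for(key, n_queues):
    """Pick a partition queue from a stable hash of the intermediate key."""
    return zlib.crc32(key.encode("utf-8")) % n_queues


def run_job(splits, map_fn, reduce_fn, n_queues=2):
    mapper_queue = deque(splits)                       # one Map task per split
    partition_queues = [deque() for _ in range(n_queues)]
    reducer_queue = deque(range(n_queues))             # one Reduce task per queue
    output_queue = deque()

    while mapper_queue:                                # mapper workflow
        split = mapper_queue.popleft()
        for key, value in map_fn(split):
            partition_queues[partition_for(key, n_queues)].append((key, value))

    while reducer_queue:                               # reducer workflow
        queue = partition_queues[reducer_queue.popleft()]
        grouped = defaultdict(list)
        while queue:
            key, value = queue.popleft()
            grouped[key].append(value)                 # several keys share one queue
        for key, values in grouped.items():
            output_queue.extend(reduce_fn(key, values))
    return list(output_queue)


pairs = run_job(
    ["a b a", "b c"],
    map_fn=lambda split: [(word, 1) for word in split.split()],
    reduce_fn=lambda key, values: [(key, sum(values))],
)
```

Since several keys can land in the same partition queue, the reducer has to regroup by key after draining the queue, which is the point behind the "may share the same queue" remarks.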
  • 32. MR.Flow
      • Web based M/R editor
          – http://www.mr-flow.com
          – Reusable M/R modules
          – Execution & status monitoring (Hadoop clusters)
  • 33. Contents: Part III, Some Map/Reduce algorithms
  • 34. General considerations
      • Map execution order is not deterministic
      • Map processing time cannot be predicted
      • Reduce tasks cannot start before all Maps have finished (dataset needs to be fully partitioned)
      • Not suitable for continuous input streams
      • There will be a spike in network utilisation after the Map / before the Reduce phase
      • Number & size of key/value pairs
          – Object creation & serialisation overhead (Amdahl’s law!)
      • Aggregate partial results when possible!
          – Use Combiners
  • 35. Graph algorithms
      • Very suitable for M/R processing
          – Data (graph node) locality
          – “Spreading activation” type of processing
          – Some algorithms with sequential dependency are not suitable for M/R
      • Breadth-first search algorithms are better suited than depth-first
      • General approach
          – Graph represented by adjacency lists
          – Map task – input: node + its adjacency list; perform some analysis over the node link structure; output: target key + analysis result
          – Reduce task – aggregate values by key
          – Perform multiple iterations (with a termination criterion)
  • 36. Social Network Analysis • Problem: recommend new friends (friend-of-a-friend, FOAF) • Map task – U (target user) is fixed and its friends list copied to all cluster nodes (“copy join”); each cluster node stores part of the social graph – In: (X, <friendsX>), i.e. the local data for the cluster node – Out: • if (U, X) are friends => (U, <friendsX \ friendsU>), i.e. the users who are friends of X but not already friends of U • nil otherwise • Reduce task – In: (U, <<friendsA \ friendsU>, <friendsB \ friendsU>, … >), i.e. the FOAF lists for all users A, B, etc. who are friends with U – Out: (U, <(X1, N1), (X2, N2), …>), where each X is a FOAF for U, and N is its total number of occurrences in all FOAF lists (sort/rank the result!)
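A minimal in-memory sketch of the Map and Reduce tasks on this slide follows. Two details are assumptions added for the example: U itself is removed from each candidate list (U is a friend of every X it is connected to), and ties in the ranking are broken alphabetically. All function names are hypothetical.

```python
from collections import Counter


def foaf_map(u, friends_u, x, friends_x):
    # friends_u is the copy of U's friends list held on every node ("copy join").
    if x in friends_u:
        # Friends of X who are not already friends of U (and are not U itself).
        yield u, set(friends_x) - set(friends_u) - {u}
    # Otherwise emit nothing (nil).


def foaf_reduce(u, foaf_lists):
    counts = Counter()
    for candidates in foaf_lists:
        counts.update(candidates)
    # Rank by number of occurrences, most frequent first.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return u, ranked


def recommend(graph, u):
    friends_u = graph[u]
    lists = [
        candidates
        for x, friends_x in graph.items()
        for _, candidates in foaf_map(u, friends_u, x, friends_x)
    ]
    return foaf_reduce(u, lists)[1]


graph = {
    "ann": {"bob", "cat"},
    "bob": {"ann", "dan", "eve"},
    "cat": {"ann", "dan"},
    "dan": {"bob", "cat"},
    "eve": {"bob"},
}
suggestions = recommend(graph, "ann")
```

In this toy graph, dan is reachable through both bob and cat, so dan ranks above eve, who is reachable only through bob.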
  • 37. PageRank with M/R (C) Jimmy Lin
  • 38. Text Indexing & Retrieval • Indexing is very suitable for M/R – Focus on scalability, not on latency & response time – Batch oriented • Map task – emit (Term, (DocID, position)) • Reduce task – Group pairs by Term and sort by DocID
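The indexing Map and Reduce tasks can be sketched as follows. This is an illustration under simple assumptions: tokens are produced by lower-casing and splitting on whitespace, and a dict of lists stands in for the shuffle phase. The function names are hypothetical.

```python
from collections import defaultdict


def index_map(doc_id, text):
    # Emit (Term, (DocID, position)) for every token in the document.
    for position, term in enumerate(text.lower().split()):
        yield term, (doc_id, position)


def index_reduce(term, postings):
    # Postings arrive grouped by term; sort them by DocID, then position.
    return term, sorted(postings)


def build_index(docs):
    grouped = defaultdict(list)  # stands in for the shuffle phase
    for doc_id, text in docs.items():
        for term, posting in index_map(doc_id, text):
            grouped[term].append(posting)
    return dict(index_reduce(term, postings) for term, postings in grouped.items())


index = build_index({2: "big data", 1: "Data analysis of big data"})
```

Sorting the postings in the reducer produces the sorted-by-DocID postings lists regardless of the order in which the Map tasks finished.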
  • 39. Text Indexing & Retrieval (2) (C) Jimmy Lin
  • 40. Text Indexing & Retrieval (3) • Retrieval not suitable for M/R – Focus on response time – Startup of Mappers & Reducers is usually prohibitively expensive • Katta – http://katta.sourceforge.net/ – Distributed Lucene indexing with Hadoop (HDFS) – Multicast querying & ranking
  • 41. Useful links • “MapReduce: Simplified Data Processing on Large Clusters” • “Dryad: Distributed Data-Parallel Programs from Sequential Building Blocks” • “Cloud MapReduce Technical Report” • “Data-Intensive Text Processing with MapReduce” • “Hadoop: The Definitive Guide”
  • 42. Q&A Questions?