DISTRIBUTED COMPUTING IN
PRAXIS
GFS, BIGTABLE, MAPREDUCE, CHUBBY




Dominik Roblek
Software Engineer
Google Inc.
GOOGLE TECHNOLOGY LAYERS

 • Services and Applications
    – Google™ search, Gmail™, Ads system, Google Maps™

 • Distributed Computing

 • Computing Platform
    – Commodity PC hardware, Linux, physical network

                JavaBlend 2008, http://www.javablend.net/                      2
IMPLICATIONS OF GOOGLE ENVIRONMENT
 • Single-process performance does not matter
    – Total throughput is more important

 • Stuff breaks
    – If you have one server, it may stay up three years
    – If you have 10,000 servers, expect to lose ten a day

 • “Ultra-reliable” hardware doesn’t really help
    – At large scales, reliable hardware still fails, albeit less often
    – Software still needs to be fault-tolerant


BUILDING BLOCKS OF google.com?
 • Distributed data
    – Google File System (GFS)
    – BigTable

 • Job manager

 • Distributed computation
    – MapReduce

 • Distributed lock service
    – Chubby

SCALABLE DISTRIBUTED FILE SYSTEM




     Google File System
           (GFS)

GFS: REQUIREMENTS
 • High component failure rates
   – Inexpensive commodity components fail all the time

 • Modest number of huge files
    – Just a few million, most of them multi-GB

 • Files are write-once, mostly appended to
   – Perhaps concurrently
   – Large streaming reads

GFS: DESIGN DECISIONS
 •   Files stored as chunks
      – Fixed size (64MB)

 •   Reliability through replication

 •   Each chunk replicated 3+ times

 •   Single master to coordinate access, keep metadata
      – Simple centralized management

 •   No data caching
      – Little benefit due to large data sets, streaming reads
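
The fixed 64 MB chunk size makes a client's bookkeeping trivial. A minimal sketch (hypothetical helper names, not actual GFS client code) of how a byte offset maps to the chunk whose replica locations the client must request from the master:

```python
# Sketch, not GFS code: locate the chunk that covers a byte offset.
CHUNK_SIZE = 64 * 1024 * 1024  # fixed 64 MB chunk size from the slide

def chunk_index(offset):
    """Index of the chunk containing the given byte offset."""
    return offset // CHUNK_SIZE

def chunk_range(index):
    """Byte range [start, end) covered by a chunk."""
    return index * CHUNK_SIZE, (index + 1) * CHUNK_SIZE
```

Because the chunk index is a pure function of the offset, a client caches the master's answer and never contacts it again for reads within the same chunk.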


GFS: ARCHITECTURE




               Where is a potential weakness of this design?
GFS: WEAK POINT - SINGLE MASTER
 • From distributed systems theory we know a single master is a
    – Single point of failure
    – Scalability bottleneck

 • GFS solutions
    – Shadow masters
    – Minimize master involvement
        • never move data through it, use only for metadata
        • large chunk size
        • master delegates authority to primary replicas in data mutations
          (chunk leases)


GFS: METADATA
 • Global metadata is stored on the master
    – File and chunk namespaces
    – Mapping from files to chunks
       • Locations of each chunk’s replicas
    – All in memory (64 bytes / chunk)

 • Master has an operation log for persistent logging of
   critical metadata updates
    – Persistent on local disk
    – Replicated
    – Checkpoints for faster recovery

GFS: MUTATIONS
 • Mutations must be done
   for all replicas

 • Master picks one replica
   as primary; gives it a
   “lease” for mutations
    – Primary defines a serial
      order of mutations

 • Data flow decoupled from
   control flow

GFS: OPEN SOURCE ALTERNATIVES
 • Hadoop Distributed File System - HDFS (Java)
    – http://hadoop.apache.org/core/docs/current/hdfs_design.html




DISTRIBUTED STORAGE FOR LARGE STRUCTURED DATA SETS




                 Bigtable


BIGTABLE: REQUIREMENTS
 • Want to store petabytes of structured data across
   thousands of commodity servers

 • Want a simple data format that supports dynamic control
   over data layout and format

 • Must support very high read/write rates
    – millions of operations per second

 • Latency requirements:
    – backend bulk processing
    – real-time data serving
BIGTABLE: STRUCTURE
 •    Bigtable is a multi-dimensional map:
       – sparse
       – persistent
       – distributed

 •    Key:
       – Row name
       – Column name
       – Timestamp

 •    Value:
       – array of bytes

     (rowName: string, columnName: string, timestamp: long) → byte[]
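
The signature above can be illustrated with a toy in-memory stand-in (a plain dictionary, not Bigtable's actual API):

```python
# Toy illustration of the Bigtable data model:
# (row name, column name, timestamp) -> value bytes.
table = {}

def put(row, col, ts, value):
    table[(row, col, ts)] = value

def get_latest(row, col):
    """Most recent value for a (row, column) cell, chosen by timestamp."""
    versions = [(ts, v) for (r, c, ts), v in table.items()
                if r == row and c == col]
    return max(versions)[1] if versions else None

put("com.cnn.www", "contents:", 3, b"<html>v1")
put("com.cnn.www", "contents:", 5, b"<html>v2")
```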

BIGTABLE: EXAMPLE
•   A web crawling system might use a Bigtable that stores web pages
    – Each row key could represent a specific URL
    – Columns represent page contents, the references to that page, and
      other metadata
    – The row range for a table is dynamically partitioned between servers

•   Rows are clustered together on machines by key
    – Using reversed URLs as keys minimizes the number of machines that
      store pages from a single domain
    – Each cell is timestamped, so there can be multiple versions of the
      same data in the table
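
The reversed-URL trick can be sketched as follows (`row_key` is a hypothetical helper, not part of Bigtable):

```python
# Sketch: reverse a URL's hostname so that pages from one domain sort
# next to each other under Bigtable's lexicographic row ordering
# (the "com.cnn.www"-style keys used in the example).
def row_key(hostname, path=""):
    reversed_host = ".".join(reversed(hostname.split(".")))
    return reversed_host + path
```

With this key, "maps.google.com" and "www.google.com" share the "com.google." prefix, so their rows land in the same or neighbouring tablets.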


BIGTABLE: EXAMPLE

 • Row “com.cnn.www”:
    – “contents:” → “<html>…” at timestamps t3, t5, t6
    – “anchor:cnnsi.com” → “CNN” at t9
    – “anchor:my.look.ca” → “CNN.com” at t8




BIGTABLE: ROWS

 • Name is an arbitrary string
   – Access to data in a row is atomic
   – Row creation is implicit upon storing data

 • Rows ordered lexicographically
   – Rows close together lexicographically usually
     on one or a small number of machines

BIGTABLE: TABLETS

 • Row range for a table is dynamically
   partitioned into tablets

 • Tablet holds contiguous range of rows
   – Reads over short row ranges are efficient
   – Clients can choose row keys to achieve
     locality

BIGTABLE: COLUMNS
 •   Columns have two-level name structure
                  <column_family>:[<column_qualifier>]

 •   Column family:
     – Creation must be explicit
     – Has associated type information and other metadata
     – Unit of access control

 •   Column qualifier
     – Unbounded number of columns
     – Creation of column within a family is implicit at updates
         • Additional dimensions

BIGTABLE: TIMESTAMPS
 •   Used to store different versions of data in a cell
      – New writes default to current time
      – Can also be set explicitly by clients

 •   Lookup options
      – Return all values
      – Return most recent K values
      – Return all values in timestamp range

 •   Column families can be marked with attributes
      – Only retain most recent K values in a cell
      – Keep values until they are older than K seconds
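
These retention attributes amount to a per-cell garbage-collection policy. A small sketch under assumed semantics (`gc_versions` is a hypothetical helper, not Bigtable's API):

```python
# Sketch of the per-column-family retention policies described above:
# keep only the most recent K versions, or drop versions older than
# K seconds. versions is a list of (timestamp, value) pairs.
def gc_versions(versions, keep_last=None, max_age_s=None, now=None):
    out = sorted(versions, reverse=True)  # newest first
    if max_age_s is not None and now is not None:
        out = [(ts, v) for ts, v in out if now - ts <= max_age_s]
    if keep_last is not None:
        out = out[:keep_last]
    return out
```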


BIGTABLE: AT GOOGLE

 • Good match for most of our applications:
   – Google Earth™
   – Google Maps™
   – Google Talk™
   – Google Finance™
   – Orkut™


BIGTABLE: OPEN SOURCE ALTERNATIVES

 • HBase (Java)
   – http://hadoop.apache.org/hbase/

 • Hypertable (C++)
   – http://www.hypertable.org/




PROGRAMMING MODEL FOR PROCESSING LARGE DATA SETS




            MapReduce


MAPREDUCE: REQUIREMENTS

 • Want to process lots of data ( > 1 TB)
 • Want to run it on thousands of commodity PCs
 • Must be robust
 • … And simple to use
MAPREDUCE: DESCRIPTION
 •   A simple programming model that applies to many large-scale
     computing problems
      – Based on principles of functional languages
      – Scalable, robust

 •   Hide messy details in MapReduce runtime library:
      –   automatic parallelization
      –   load balancing
      –   network and disk transfer optimization
      –   handling of machine failures
      –   robustness

 •   Improvements to the core library benefit all users of the library!
MAPREDUCE: FUNCTIONAL PROGRAMMING
 •   Functions don’t change data structures
     –   They always create new ones
     –   Input data remain unchanged

 •   Functions don’t have side effects

 •   Data flows are implicit in program design

 •   Order of operations does not matter

                   z := f(g(x), h(x, y), k(y))
MAPREDUCE: TYPICAL EXECUTION FLOW
•   Read a lot of data

•   Map: extract something you care about from each record

•   Shuffle and Sort

•   Reduce: aggregate, summarize, filter, or transform

•   Write the results




     Outline stays the same, map and reduce change to fit the problem
MAPREDUCE: PROGRAMMING INTERFACE

 User must implement two functions

 Map(input_key, input_value)
   → (output_key, intermediate_value)

 Reduce(output_key, intermediate_value_list)
   → output_value_list
MAPREDUCE: MAP
 • Records from the data source …
    – lines out of files
    – rows of a database
    – etc.
 • … are fed into the map function as (key, value) pairs
    – filename, line
    – etc.

 • map produces zero, one or more intermediate values
   along with an output key from the input
MAPREDUCE: REDUCE

 • After the map phase is over, all the
   intermediate values for a given output key
   are combined together into a list

 • reduce combines those intermediate
   values into zero, one or more final values
   for that same output key
MAPREDUCE: EXAMPLE - WORD FREQUENCY 1/5
 • Input is files with one document per record

 • Specify a map function that takes a key/value pair
    – key = document name
    – value = document contents

 • Output of map function is zero, one or more key/value
   pairs
    – In our case, output (word, “1”) once per word in the document


MAPREDUCE: EXAMPLE - WORD FREQUENCY 2/5

              “To be or not to be?”
                        “document1”


                             “to”, “1”
                            “be”, “1”
                             “or”, “1”
                                 …


MAPREDUCE: EXAMPLE - WORD FREQUENCY 3/5
 • MapReduce library gathers together all pairs
   with the same key
   – shuffle/sort

 • reduce function combines the values for a key
   – In our case, compute the sum

 • Output of reduce is zero, one or more values
   paired with key and saved

MAPREDUCE: EXAMPLE - WORD FREQUENCY 4/5

      key = “be”        key = “not”           key = “or”              key = “to”
   values = “1”, “1”   values = “1”          values = “1”          values = “1”, “1”

         “2”                 “1”                     “1”                 “2”


                                    “be”, “2”
                                   “not”, “1”
                                    “or”, “1”
                                    “to”, “2”

MAPREDUCE: EXAMPLE - WORD FREQUENCY 5/5
Map(String input_key, String input_value):
  // input_key: document name
  // input_value: document contents
  for each word w in input_value:
    EmitIntermediate(w, "1");

Reduce(String output_key, Iterator intermediate_values):
  // output_key: a word, same for input and output
  // intermediate_values: a list of counts
  int result = 0;
  for each v in intermediate_values:
    result += ParseInt(v);
  Emit(AsString(result));
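
The same word-frequency job can be written as a runnable single-process sketch; the real runtime distributes the map, shuffle, and reduce steps across machines, but the data flow is identical:

```python
# Single-process sketch of the word-frequency MapReduce, not the
# distributed runtime. map_fn, reduce_fn, and mapreduce are stand-ins.
from collections import defaultdict

def map_fn(doc_name, contents):
    # map: emit (word, 1) once per word in the document
    for word in contents.lower().split():
        yield word, 1

def reduce_fn(word, counts):
    # reduce: sum the counts for one word
    return word, sum(counts)

def mapreduce(docs):
    intermediate = defaultdict(list)
    for name, contents in docs.items():
        for key, value in map_fn(name, contents):   # map phase
            intermediate[key].append(value)         # shuffle/sort by key
    return dict(reduce_fn(k, vs) for k, vs in intermediate.items())

counts = mapreduce({"document1": "To be or not to be"})
# counts == {"to": 2, "be": 2, "or": 1, "not": 1}
```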

MAPREDUCE: DISTRIBUTED EXECUTION




MAPREDUCE: LOGICAL FLOW




MAPREDUCE: PARALLEL FLOW 1/2
 • map functions run in parallel, creating different
   intermediate values from different input data sets
 • reduce functions also run in parallel, each
   working on a different output key
   – All values are processed independently
 • Bottleneck
   – reduce phase can’t start until map phase is
     completely finished


MAPREDUCE: PARALLEL FLOW 2/2




MAPREDUCE: WIDELY APPLICABLE
 • distributed grep
 • distributed sort
 • document clustering
 • machine learning
 • web access log stats
 • inverted index construction
 • statistical machine translation
 • etc.
MAPREDUCE: EXAMPLE - LANGUAGE MODEL STATISTICS
 • Used in our statistical machine translation system
 • Need to count how many times every 5-word sequence occurs
   in a large corpus of documents (and keep all those where
   count >= 4)

 • map:
    – extract 5-word sequences => count from document
 • reduce:
    – sum the counts
    – keep those where count >= 4
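
A single-process sketch of this job (hypothetical helper names), with the count >= 4 filter applied in the reduce step:

```python
# Sketch of the 5-gram statistics job: map emits each 5-word sequence
# with a count of 1; reduce sums and keeps sequences seen >= 4 times.
from collections import Counter

def map_5grams(words):
    for i in range(len(words) - 4):
        yield tuple(words[i:i + 5]), 1

def count_5grams(corpus_words, threshold=4):
    counts = Counter()
    for gram, one in map_5grams(corpus_words):
        counts[gram] += one
    return {g: c for g, c in counts.items() if c >= threshold}
```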


MAPREDUCE: EXAMPLE - JOINING WITH OTHER DATA
 • Generate per-doc summary, but include per-host
   information (e.g. # of pages on host, important terms on
   host)
    – per-host information might involve RPC to a set of machines
      containing data for all sites

 • map:
    – extract host name from URL, lookup per-host info, combine with
      per-doc data and emit
 • reduce:
    – identity function (just emit input value directly)

MAPREDUCE: FAULT TOLERANCE

 • Master detects worker failures
   – Re-executes failed map tasks
   – Re-executes reduce tasks

 • Master notices particular input key/values
   cause crashes in map
   – Skips those values on re-execution
MAPREDUCE: LOCAL OPTIMIZATIONS

 • Master program divides up tasks based on
   location of data
   – tries to have map tasks on same machine as
     physical file data, or at least same rack
MAPREDUCE: SLOW MAP TASKS
• reduce phase cannot start before the map phase
  completes
    – One slow disk controller can slow down the whole system

• Master redundantly starts slow-moving map tasks
   – Uses the results of whichever copy finishes first
MAPREDUCE: COMBINE

 • combine is a mini-reduce phase that runs
   on the same machine as the map phase
   – It aggregates the results of the local map phase
   – Saves network bandwidth
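
A word-count combiner can be sketched like this (a local stand-in, not the MapReduce library's interface):

```python
# Sketch of a combiner for word count: sum the 1s emitted by one map
# task before they cross the network; same shape as the reduce step.
from collections import Counter

def combine(pairs):
    """Pre-aggregate (word, count) pairs produced by one map task."""
    local = Counter()
    for word, count in pairs:
        local[word] += count
    return list(local.items())

# A map task that emitted ("to", 1) twice now ships ("to", 2) once.
```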
MAPREDUCE: CONCLUSION
 • MapReduce proved to be an extremely useful
   abstraction
   – It greatly simplifies the processing of huge amounts of
     data

 • MapReduce is easy to use
   – Programmer can focus on the problem
   – MapReduce takes care of the messy details

MAPREDUCE: OPEN SOURCE ALTERNATIVES
 • Hadoop (Java)
   – http://hadoop.apache.org/

 • Disco (Erlang, Python)
   – http://discoproject.org/

 • etc.

LOCK SERVICE FOR LOOSELY-COUPLED DISTRIBUTED SYSTEMS




                 Chubby

CHUBBY: SYNCHRONIZED ACCESS TO SHARED RESOURCES
 •   Key element of distributed architecture at Google:
      – Used by GFS, Bigtable and MapReduce

 •   Interface similar to distributed file system with advisory locks
      – Access control list
      – No links

 •   Every Chubby file can hold a small amount of data

 •   Every Chubby file or directory can be used as a read or write lock
      – Locks are advisory, not mandatory
          • Clients must be well-behaved
          • A client that does not hold a lock can still read the content of a Chubby file
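
The advisory semantics can be illustrated with a toy lock (not Chubby's API): holding the lock is a convention among well-behaved clients, not an enforcement mechanism, and reads proceed regardless.

```python
# Toy illustration of an *advisory* lock. Nothing here prevents a
# client from touching the protected data without acquiring the lock;
# safety depends on clients cooperating, as the slide notes.
class AdvisoryLock:
    def __init__(self):
        self.holder = None

    def acquire(self, client):
        """Grant the lock if free; return whether it was granted."""
        if self.holder is None:
            self.holder = client
            return True
        return False

    def release(self, client):
        if self.holder == client:
            self.holder = None
```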

CHUBBY: DESIGN
 • Design emphasis not on high performance, but
   on availability and reliability

 • Reading and writing is atomic

 • Chubby service is composed of 5 active replicas
   – One of them elected as master
   – Requires the majority of replicas to be alive

CHUBBY: EVENTS

 • Client can subscribe for various events:
   – file contents modified
   – child node added, removed, or modified
   – lock acquired
   – conflicting lock request from another client
   – etc.


REFERENCES
 •   Bibliography:
      –   Ghemawat, S., Gobioff, H., and Leung, S.-T. (2003). The google file system. In SOSP '03: Proceedings of the
          nineteenth ACM symposium on Operating systems principles, pages 29-43. ACM Press.
      –   Chang, F., Dean, J., Ghemawat, S., Hsieh, W. C., Wallach, D. A., Burrows, M., Chandra, T., Fikes, A., and
          Gruber, R. E. (2006). Bigtable: A distributed storage system for structured data. In Operating Systems Design
          and Implementation, pages 205-218.
      –   Dean, J. and Ghemawat, S (2004). Mapreduce: Simplified data processing on large clusters. In OSDI '04:
          Proceedings of the 6th symposium on Operating systems design and implementation, pages 137-150.
      –   Dean, J. (2006). Experiences with MapReduce, an abstraction for large-scale computation. In PACT '06:
          Proceedings of the 15th international Conference on Parallel Architectures and Compilation Techniques. ACM.
      –   Burrows, M. (2006). The chubby lock service for loosely-coupled distributed systems. In OSDI '06: Proceedings
          of the 7th symposium on Operating systems design and implementation, pages 335-350.
 •   Partially based on:
      –   Bisciglia, C., Kimball, A., & Michels-Slettvet, S. (2007). Distributed Computing Seminar, Lecture 2: MapReduce
          Theory and Implementation. Retrieved September 6, 2008, from
          http://code.google.com/edu/submissions/mapreduce-minilecture/lec2-mapred.ppt
      –   Dean, J. (2006). Experiences with MapReduce, an abstraction for large-scale computation. Retrieved
          September 6, 2008, from http://www.cs.virginia.edu/~pact2006/program/mapreduce-pact06-keynote.pdf
      –   Bisciglia, C., Kimball, A., & Michels-Slettvet, S. (2007). Distributed Computing Seminar, Lecture 3: Distributed
          Filesystems. Retrieved September 6, 2008, from http://code.google.com/edu/submissions/mapreduce-
          minilecture/lec3-dfs.ppt
      –   Stokely, M. (2007). Distributed Computing at Google. Retrieved September 6, 2008, from
          http://www.swinog.ch/meetings/swinog15/SRE-Recruiting-SwiNOG2007.ppt

Scaling Redis Cluster Deployments for Genome Analysis (featuring LSU) - Terry...
 
Storage Systems for High Scalable Systems Presentation
Storage Systems for High Scalable Systems PresentationStorage Systems for High Scalable Systems Presentation
Storage Systems for High Scalable Systems Presentation
 

Recently uploaded

Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfRankYa
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfPrecisely
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clashcharlottematthew16
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostZilliz
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 

Recently uploaded (20)

Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdf
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clash
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 

[Roblek] Distributed computing in practice

  • 1. DISTRIBUTED COMPUTING IN PRAXIS GFS, BIGTABLE, MAPREDUCE, CHUBBY Dominik Roblek Software Engineer Google Inc.
  • 2. GOOGLE TECHNOLOGY LAYERS Google™ search Gmail™ Ads system Services and Applications Google Maps™ Distributed Computing Commodity PC Hardware Linux Computing Platform Physical Network JavaBlend 2008, http://www.javablend.net/ 2
  • 3. IMPLICATIONS OF GOOGLE ENVIRONMENT • Single process performance does not matter – Total throughput is more important • Stuff breaks – If you have one server, it may stay up three years – If you have 10,000 servers, expect to lose ten a day • “Ultra-reliable” hardware doesn’t really help – At large scales, reliable hardware still fails, albeit less often – Software still needs to be fault-tolerant
  • 4. BUILDING BLOCKS OF google.com? • Distributed data – Google File System (GFS) – BigTable • Job manager • Distributed computation – MapReduce • Distributed lock service – Chubby
  • 5. SCALABLE DISTRIBUTED FILE SYSTEM Google File System (GFS)
  • 6. GFS: REQUIREMENTS • High component failure rates – Inexpensive commodity components fail all the time • Modest number of huge files – Just a few million, most of them multi-GB • Files are write-once, mostly appended to – Perhaps concurrently – Large streaming reads
  • 7. GFS: DESIGN DECISIONS • Files stored as chunks – Fixed size (64MB) • Reliability through replication • Each chunk replicated 3+ times • Single master to coordinate access, keep metadata – Simple centralized management • No data caching – Little benefit due to large data sets, streaming reads
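The fixed chunk size means a client can translate a file offset into a chunk index with simple arithmetic and only ask the master for metadata. A minimal sketch of that translation (the function name is illustrative, not the GFS client API):

```python
CHUNK_SIZE = 64 * 1024 * 1024  # fixed 64 MB chunks, as in GFS

def chunk_location(byte_offset):
    """Translate a file byte offset into (chunk index, offset within chunk).
    The client does this locally, then asks the master only for the chunk
    handle and replica locations of that chunk index."""
    return byte_offset // CHUNK_SIZE, byte_offset % CHUNK_SIZE

# byte 200,000,000 falls in chunk 2 (the third chunk)
idx, off = chunk_location(200_000_000)
```

Because the translation is local and chunks are large, the master handles far fewer requests than a file system with small blocks would require.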
  • 8. GFS: ARCHITECTURE Where is a potential weakness of this design?
  • 9. GFS: WEAK POINT - SINGLE MASTER • From distributed systems we know this is a – Single point of failure – Scalability bottleneck • GFS solutions – Shadow masters – Minimize master involvement • never move data through it, use only for metadata • large chunk size • master delegates authority to primary replicas in data mutations (chunk leases)
  • 10. GFS: METADATA • Global metadata is stored on the master – File and chunk namespaces – Mapping from files to chunks • Locations of each chunk’s replicas – All in memory (64 bytes / chunk) • Master has an operation log for persistent logging of critical metadata updates – Persistent on local disk – Replicated – Checkpoints for faster recovery
  • 11. GFS: MUTATIONS • Mutations must be done for all replicas • Master picks one replica as primary; gives it a “lease” for mutations – Primary defines a serial order of mutations • Data flow decoupled from control flow
  • 12. GFS: OPEN SOURCE ALTERNATIVES • Hadoop Distributed File System - HDFS (Java) – http://hadoop.apache.org/core/docs/current/hdfs_design.html
  • 13. DISTRIBUTED STORAGE FOR LARGE STRUCTURED DATA SETS Bigtable
  • 14. BIGTABLE: REQUIREMENTS • Want to store petabytes of structured data across thousands of commodity servers • Want a simple data format that supports dynamic control over data layout and format • Must support very high read/write rates – millions of operations per second • Latency requirements: – backend bulk processing – real-time data serving
  • 15. BIGTABLE: STRUCTURE • Bigtable is a multi-dimensional map: – sparse – persistent – distributed • Key: – Row name – Column name – Timestamp • Value: – array of bytes (rowName: string, columnName: string, timestamp: long) → byte[]
  • 16. BIGTABLE: EXAMPLE • A web crawling system might use Bigtable that stores web pages – Each row key could represent a specific URL – Columns represent page contents, the references to that page, and other metadata – The row range for a table is dynamically partitioned between servers • Rows are clustered together on machines by key – Using reversed URLs as keys minimizes the number of machines where pages from a single domain are stored – Each cell is timestamped so there could be multiple versions of the same data in the table
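The data model above can be sketched as a sparse dictionary keyed by (row, column, timestamp), with row keys built by reversing the hostname so pages from one domain sort adjacently (a toy illustration, not Bigtable's API):

```python
def row_key(url):
    """Reverse the hostname so pages from one domain sort adjacently,
    e.g. 'www.cnn.com' -> 'com.cnn.www' (illustrative helper)."""
    host, _, path = url.partition("/")
    reversed_host = ".".join(reversed(host.split(".")))
    return reversed_host + ("/" + path if path else "")

# Bigtable as a sparse map: (row, column, timestamp) -> bytes.
# Only cells that exist are stored; most of the key space is empty.
table = {}
table[(row_key("www.cnn.com"), "contents:", 6)] = b"<html>..."
table[(row_key("www.cnn.com"), "anchor:cnnsi.com", 9)] = b"CNN"
table[(row_key("www.cnn.com"), "anchor:my.look.ca", 8)] = b"CNN.com"
```

Because rows are stored in lexicographic order, all `com.cnn.*` rows land in a contiguous range and therefore on few tablet servers.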
  • 17. BIGTABLE: EXAMPLE Row “com.cnn.www”: column “contents:” holds “<html>…” at t3, t5, t6; column “anchor:cnnsi.com” holds “CNN” at t9; column “anchor:my.look.ca” holds “CNN.com” at t8
  • 18. BIGTABLE: ROWS • Name is an arbitrary string – Access to data in a row is atomic – Row creation is implicit upon storing data • Rows ordered lexicographically – Rows close together lexicographically usually on one or a small number of machines
  • 19. BIGTABLE: TABLETS • Row range for a table is dynamically partitioned into tablets • Tablet holds contiguous range of rows – Reads over short row ranges are efficient – Clients can choose row keys to achieve locality
  • 20. BIGTABLE: COLUMNS • Columns have two-level name structure <column_family>:[<column_qualifier>] • Column family: – Creation must be explicit – Has associated type information and other metadata – Unit of access control • Column qualifier – Unbounded number of columns – Creation of column within a family is implicit at updates • Additional dimensions
  • 21. BIGTABLE: TIMESTAMPS • Used to store different versions of data in a cell – New writes default to current time – Can also be set explicitly by clients • Lookup options – Return all values – Return most recent K values – Return all values in timestamp range • Column families can be marked with attributes – Only retain most recent K values in a cell – Keep values until they are older than K seconds
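A toy model of a versioned cell makes the "retain most recent K values" attribute concrete (class and method names are illustrative, not Bigtable's interface):

```python
import bisect

class Cell:
    """Toy versioned cell: keeps only the most recent max_versions values,
    mirroring Bigtable's per-column-family garbage-collection attribute."""
    def __init__(self, max_versions=3):
        self.max_versions = max_versions
        self.versions = []  # (timestamp, value), kept sorted by timestamp

    def put(self, timestamp, value):
        bisect.insort(self.versions, (timestamp, value))
        # garbage-collect versions beyond the most recent K
        self.versions = self.versions[-self.max_versions:]

    def read(self, k=1):
        """Return the most recent k values, newest first."""
        return [value for _, value in reversed(self.versions[-k:])]

cell = Cell(max_versions=2)
cell.put(3, b"v3"); cell.put(5, b"v5"); cell.put(9, b"v9")
# only the two newest versions survive; v3 has been garbage-collected
```

Timestamp-range lookups would filter `self.versions` by bounds; the real system applies the same policy per column family during compactions.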
  • 22. BIGTABLE: AT GOOGLE • Good match for most of our applications: – Google Earth™ – Google Maps™ – Google Talk™ – Google Finance™ – Orkut™
  • 23. BIGTABLE: OPEN SOURCE ALTERNATIVES • HBase (Java) – http://hadoop.apache.org/hbase/ • Hypertable (C++) – http://www.hypertable.org/
  • 24. PROGRAMMING MODEL FOR PROCESSING LARGE DATA SETS MapReduce
  • 25. MAPREDUCE: REQUIREMENTS • Want to process lots of data ( > 1 TB) • Want to run it on thousands of commodity PCs • Must be robust • … And simple to use
  • 26. MAPREDUCE: DESCRIPTION • A simple programming model that applies to many large-scale computing problems – Based on principles of functional languages – Scalable, robust • Hide messy details in MapReduce runtime library: – automatic parallelization – load balancing – network and disk transfer optimization – handling of machine failures – robustness • Improvements to core library benefit all users of library!
  • 27. MAPREDUCE: FUNCTIONAL PROGRAMMING • Functions don’t change data structures – They always create new ones – Input data remain unchanged • Functions don’t have side effects • Data flows are implicit in program design • Order of operations does not matter z := f(g(x), h(x, y), k(y))
  • 28. MAPREDUCE: TYPICAL EXECUTION FLOW • Read a lot of data • Map: extract something you care about from each record • Shuffle and Sort • Reduce: aggregate, summarize, filter, or transform • Write the results Outline stays the same, map and reduce change to fit the problem
  • 29. MAPREDUCE: PROGRAMMING INTERFACE User must implement two functions Map(input_key, input_value) → (output_key, intermediate_value) Reduce(output_key, intermediate_value_list) → output_value_list
  • 30. MAPREDUCE: MAP • Records from the data source … – lines out of files – rows of a database – etc. • … are fed into the map function as (key, value) pairs – filename, line – etc. • map produces zero, one or more intermediate values along with an output key from the input
  • 31. MAPREDUCE: REDUCE • After the map phase is over, all the intermediate values for a given output key are combined together into a list • reduce combines those intermediate values into zero, one or more final values for that same output key
  • 32. MAPREDUCE: EXAMPLE - WORD FREQUENCY 1/5 • Input is files with one document per record • Specify a map function that takes a key/value pair – key = document name – value = document contents • Output of map function is zero, one or more key/value pairs – In our case, output (word, “1”) once per word in the document
  • 33. MAPREDUCE: EXAMPLE - WORD FREQUENCY 2/5 “To be or not to be?” “document1” “to”, “1” “be”, “1” “or”, “1” …
  • 34. MAPREDUCE: EXAMPLE - WORD FREQUENCY 3/5 • MapReduce library gathers together all pairs with the same key – shuffle/sort • reduce function combines the values for a key – In our case, compute the sum • Output of reduce is zero, one or more values paired with key and saved
  • 35. MAPREDUCE: EXAMPLE - WORD FREQUENCY 4/5 key = “be”, values = “1”, “1” → “be”, “2”; key = “not”, values = “1” → “not”, “1”; key = “or”, values = “1” → “or”, “1”; key = “to”, values = “1”, “1” → “to”, “2”
  • 36. MAPREDUCE: EXAMPLE - WORD FREQUENCY 5/5 Map(String input_key, String input_value): // input_key: document name // input_value: document contents for each word w in input_value: EmitIntermediate(w, "1"); Reduce(String output_key, Iterator intermediate_values): // output_key: a word, same for input and output // intermediate_values: a list of counts int result = 0; for each v in intermediate_values: result += ParseInt(v); Emit(AsString(result));
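The pseudocode above can be run end to end as a minimal single-process sketch; the real runtime distributes the map, shuffle/sort, and reduce phases across machines (function names here are illustrative):

```python
from itertools import groupby

def map_fn(doc_name, contents):
    # emit (word, 1) once per word in the document
    for word in contents.lower().split():
        yield word.strip("?!.,"), 1

def reduce_fn(word, counts):
    # sum the intermediate counts for one word
    yield word, sum(counts)

def run_mapreduce(inputs, map_fn, reduce_fn):
    # map phase
    intermediate = [kv for key, value in inputs for kv in map_fn(key, value)]
    # shuffle/sort: group intermediate pairs by key
    intermediate.sort(key=lambda kv: kv[0])
    # reduce phase
    out = []
    for word, group in groupby(intermediate, key=lambda kv: kv[0]):
        out.extend(reduce_fn(word, (count for _, count in group)))
    return dict(out)

counts = run_mapreduce([("document1", "To be or not to be?")], map_fn, reduce_fn)
# counts == {"be": 2, "not": 1, "or": 1, "to": 2}
```

Swapping in a different `map_fn`/`reduce_fn` pair reuses the same skeleton, which is the point of the abstraction.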
  • 37. MAPREDUCE: DISTRIBUTED EXECUTION
  • 38. MAPREDUCE: LOGICAL FLOW
  • 39. MAPREDUCE: PARALLEL FLOW 1/2 • map functions run in parallel, creating different intermediate values from different input data sets • reduce functions also run in parallel, each working on a different output key – All values are processed independently • Bottleneck – reduce phase can’t start until map phase is completely finished
  • 40. MAPREDUCE: PARALLEL FLOW 2/2
  • 41. MAPREDUCE: WIDELY APPLICABLE • distributed grep • distributed sort • document clustering • machine learning • web access log stats • inverted index construction • statistical machine translation • etc.
  • 42. MAPREDUCE: EXAMPLE - LANGUAGE MODEL STATISTICS • Used in our statistical machine translation system • Need to count # of times every 5-word sequence occurs in large corpus of documents (and keep all those where count >= 4) • map: – extract 5-word sequences => count from document • reduce: – summarize counts – keep those where count >= 4
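The map and reduce steps for the 5-gram job can be sketched as follows (a simplified illustration of the idea, not Google's actual pipeline):

```python
def map_5grams(doc_name, words):
    # emit each 5-word window of the document once
    for i in range(len(words) - 4):
        yield tuple(words[i:i + 5]), 1

def reduce_5grams(gram, counts, threshold=4):
    # sum occurrences; keep only sequences seen at least `threshold` times
    total = sum(counts)
    if total >= threshold:
        yield gram, total

# a 6-word document yields exactly two 5-grams
grams = list(map_5grams("doc1", ["a", "b", "c", "d", "e", "f"]))
```

Grams below the threshold produce no output at all in the reduce step, so the filtered result set stays small.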
  • 43. MAPREDUCE: EXAMPLE - JOINING WITH OTHER DATA • Generate per-doc summary, but include per-host information (e.g. # of pages on host, important terms on host) – per-host information might involve RPC to a set of machines containing data for all sites • map: – extract host name from URL, lookup per-host info, combine with per-doc data and emit • reduce: – identity function (just emit input value directly)
  • 44. MAPREDUCE: FAULT TOLERANCE • Master detects worker failures – Re-executes failed map tasks – Re-executes reduce tasks • Master notices particular input key/values cause crashes in map – Skips those values on re-execution
  • 45. MAPREDUCE: LOCAL OPTIMIZATIONS • Master program divides up tasks based on location of data – tries to have map tasks on same machine as physical file data, or at least same rack
  • 46. MAPREDUCE: SLOW MAP TASKS • reduce phase cannot start before the map phase completes – A slow disk controller can slow down the whole system • Master redundantly starts slow-moving map tasks – Uses results of the first copy to finish
  • 47. MAPREDUCE: COMBINE • combine is a mini-reduce phase that runs on the same machine as the map phase – It aggregates the results of local map phases – Saves network bandwidth
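For the word-count job, a combiner would pre-aggregate each mapper's local (word, 1) pairs before they are shipped over the network (a minimal sketch; the helper name is illustrative):

```python
from collections import Counter

def combine(local_pairs):
    """Run a local mini-reduce on one mapper's output,
    summing counts per word before anything crosses the network."""
    combined = Counter()
    for word, count in local_pairs:
        combined[word] += count
    return list(combined.items())

# one mapper emitted 4 pairs; the combiner ships only 2 of them
shipped = combine([("to", 1), ("be", 1), ("to", 1), ("be", 1)])
```

This only works because word-count's reduce (summation) is associative and commutative, so partial sums can be re-reduced later without changing the result.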
  • 48. MAPREDUCE: CONCLUSION • MapReduce proved to be an extremely useful abstraction – It greatly simplifies the processing of huge amounts of data • MapReduce is easy to use – Programmer can focus on the problem – MapReduce takes care of messy details
  • 49. MAPREDUCE: OPEN SOURCE ALTERNATIVES • Hadoop (Java) – http://hadoop.apache.org/ • Disco (Erlang, Python) – http://discoproject.org/ • etc.
  • 50. LOCK SERVICE FOR LOOSELY-COUPLED DISTRIBUTED SYSTEMS Chubby
  • 51. CHUBBY: SYNCHRONIZED ACCESS TO SHARED RESOURCES • Key element of distributed architecture at Google: – Used by GFS, Bigtable and MapReduce • Interface similar to distributed file system with advisory locks – Access control list – No links • Every Chubby file can hold a small amount of data • Every Chubby file or directory can be used as a read or write lock – Locks are advisory, not mandatory • Clients must be well-behaved • A client that does not hold a lock can still read the content of a Chubby file
  • 52. CHUBBY: DESIGN • Design emphasis not on high performance, but on availability and reliability • Reading and writing is atomic • Chubby service is composed of 5 active replicas – One of them elected as master – Requires the majority of replicas to be alive
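The majority requirement is simple arithmetic worth making explicit: a 5-replica cell stays available as long as 3 replicas are up, i.e. it tolerates 2 failures (an illustrative check, not Chubby's code):

```python
def is_available(total_replicas, alive):
    """A Chubby cell can serve only while a strict majority of its
    replicas is alive, so quorums always overlap."""
    return alive > total_replicas // 2

failures_tolerated = 5 - (5 // 2 + 1)  # 2 failures for a 5-replica cell
```

Any two majorities of the same cell share at least one replica, which is what lets a newly elected master learn every previously committed write.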
  • 53. CHUBBY: EVENTS • Client can subscribe for various events: – file contents modified – child node added, removed, or modified – lock acquired – conflicting lock request from another client – etc.
  • 54. REFERENCES • Bibliography: – Ghemawat, S., Gobioff, H., and Leung, S.-T. (2003). The google file system. In SOSP '03: Proceedings of the nineteenth ACM symposium on Operating systems principles, pages 29-43. ACM Press. – Chang, F., Dean, J., Ghemawat, S., Hsieh, W. C., Wallach, D. A., Burrows, M., Chandra, T., Fikes, A., and Gruber, R. E. (2006). Bigtable: A distributed storage system for structured data. In Operating Systems Design and Implementation, pages 205-218. – Dean, J. and Ghemawat, S. (2004). Mapreduce: Simplified data processing on large clusters. In OSDI '04: Proceedings of the 6th symposium on Operating systems design and implementation, pages 137-150. – Dean, J. (2006). Experiences with MapReduce, an abstraction for large-scale computation. In PACT '06: Proceedings of the 15th international Conference on Parallel Architectures and Compilation Techniques. ACM. – Burrows, M. (2006). The chubby lock service for loosely-coupled distributed systems. In OSDI '06: Proceedings of the 7th symposium on Operating systems design and implementation, pages 335-350. • Partially based on: – Bisciglia, C., Kimball, A., & Michels-Slettvet, S. (2007). Distributed Computing Seminar, Lecture 2: MapReduce Theory and Implementation. Retrieved September 6, 2008, from http://code.google.com/edu/submissions/mapreduce-minilecture/lec2-mapred.ppt – Dean, J. (2006). Experiences with MapReduce, an abstraction for large-scale computation. Retrieved September 6, 2008, from http://www.cs.virginia.edu/~pact2006/program/mapreduce-pact06-keynote.pdf – Bisciglia, C., Kimball, A., & Michels-Slettvet, S. (2007). Distributed Computing Seminar, Lecture 3: Distributed Filesystems. Retrieved September 6, 2008, from http://code.google.com/edu/submissions/mapreduce-minilecture/lec3-dfs.ppt – Stokely, M. (2007). Distributed Computing at Google. Retrieved September 6, 2008, from http://www.swinog.ch/meetings/swinog15/SRE-Recruiting-SwiNOG2007.ppt