Handling Large Datasets at Google:
   Current Systems and Future
            Directions
                  Jeff Dean
                Google Fellow

       http://labs.google.com/people/jeff
Outline
• Hardware infrastructure
• Distributed systems infrastructure:
  – Scheduling system
  – GFS
  – BigTable
  – MapReduce
• Challenges and Future Directions
  – Infrastructure that spans all datacenters
  – More automation
Sample Problem Domains
• Offline batch jobs
   – Large datasets (PBs), bulk reads/writes (MB chunks)
   – Short outages acceptable
   – Web indexing, log processing, satellite imagery, etc.

• Online applications
   – Smaller datasets (TBs), small reads/writes (KBs)
   – Outages immediately visible to users, low latency vital
   – Web search, Orkut, GMail, Google Docs, etc.

• Many areas: IR, machine learning, image/video
  processing, NLP, machine translation, ...
Typical New Engineer
• Never seen a petabyte of data
• Never used a thousand machines
• Never really experienced machine failure
Our software has to make them successful.
Google’s Hardware Philosophy
        Truckloads of low-cost machines
• Workloads are large and easily parallelized
• Care about perf/$, not absolute machine perf
• Even reliable hardware fails at our scale

• Many datacenters, all around the world
  – Intra-DC bandwidth >> Inter-DC bandwidth
  – Speed of light has remained fixed in last 10 yrs :)
Effects of Hardware Philosophy
• Software must tolerate failure
• Application’s particular machine should not matter
• No special machines - just 2 or 3 flavors
[Photo: Google’s hardware, 1999]
Current Design
• In-house rack design
• PC-class motherboards
• Low-end storage and networking hardware
• Linux
• + in-house software
The Joys of Real Hardware
Typical first year for a new cluster:

~0.5 overheating (power down most machines in <5 mins, ~1-2 days to recover)
~1 PDU failure (~500-1000 machines suddenly disappear, ~6 hours to come back)
~1 rack-move (plenty of warning, ~500-1000 machines powered down, ~6 hours)
~1 network rewiring (rolling ~5% of machines down over 2-day span)
~20 rack failures (40-80 machines instantly disappear, 1-6 hours to get back)
~5 racks go wonky (40-80 machines see 50% packet loss)
~8 network maintenances (4 might cause ~30-minute random connectivity losses)
~12 router reloads (takes out DNS and external VIPs for a couple minutes)
~3 router failures (have to immediately pull traffic for an hour)
~dozens of minor 30-second blips for DNS
~1000 individual machine failures
~thousands of hard drive failures
Plus: slow disks, bad memory, misconfigured machines, flaky machines, etc.
Typical Cluster
[Diagram, built up incrementally: a cluster scheduling master, a Chubby
lock service, and a GFS master coordinate Machines 1…N. Each machine runs
Linux with a scheduler slave and a GFS chunkserver; user jobs (app1, app2)
and BigTable servers are scheduled onto the same machines, and a BigTable
master runs in the cluster alongside the other masters.]
File Storage: GFS
[Diagram: clients ask the GFS master for chunk locations, then read and
write chunks (C0, C1, C2, C3, C5, …) directly from Chunkservers 1…N; each
chunk is replicated on multiple chunkservers.]
•    Master: Manages file metadata
•    Chunkserver: Manages 64MB file chunks
•    Clients talk to master to open and find files
•    Clients talk directly to chunkservers for data
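The split described above — metadata through the master, data straight from chunkservers — can be sketched in a few lines. This is purely illustrative: the class and method names are invented, not the real GFS API.

```python
# Illustrative sketch of the GFS read path: metadata from the master,
# data directly from a chunkserver. All names here are hypothetical.

CHUNK_SIZE = 64 * 1024 * 1024  # GFS chunks are 64 MB

class Master:
    """Holds file metadata only: which chunk handles make up each file,
    and which chunkservers hold replicas of each chunk."""
    def __init__(self, file_chunks, chunk_locations):
        self.file_chunks = file_chunks          # path -> [chunk_handle, ...]
        self.chunk_locations = chunk_locations  # chunk_handle -> [chunkserver, ...]

    def lookup(self, path, offset):
        index = offset // CHUNK_SIZE
        handle = self.file_chunks[path][index]
        return handle, self.chunk_locations[handle]

class Chunkserver:
    """Stores chunk contents; serves byte ranges within a chunk."""
    def __init__(self, chunks):
        self.chunks = chunks  # chunk_handle -> bytes

    def read(self, handle, offset, length):
        return self.chunks[handle][offset:offset + length]

def gfs_read(master, path, offset, length):
    # 1. Ask the master where the chunk lives (metadata only).
    handle, replicas = master.lookup(path, offset)
    # 2. Fetch the bytes directly from one replica.
    server = replicas[0]  # a real client would pick a nearby replica
    return server.read(handle, offset % CHUNK_SIZE, length)

# Tiny demo: one file made of one chunk, replicated on two chunkservers.
cs1 = Chunkserver({"c0": b"hello, gfs!"})
cs2 = Chunkserver({"c0": b"hello, gfs!"})
m = Master({"/logs/day1": ["c0"]}, {"c0": [cs1, cs2]})
print(gfs_read(m, "/logs/day1", 0, 5))  # b'hello'
```

Keeping the master out of the data path is what lets a single metadata server front thousands of chunkservers.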
GFS Usage

• 200+ GFS clusters
• Managed by an internal service team
• Largest clusters
  – 5000+ machines
  – 5+ PB of disk usage
  – 10000+ clients
Data Storage: BigTable
What is it, really?
• 10-ft view: Row & column abstraction for storing data
• Reality: Distributed, persistent, multi-level sorted map
BigTable Data Model
• Multi-dimensional sparse sorted map
   (row, column, timestamp) => value
[Diagram, built up incrementally: row “www.cnn.com”, column “contents:”,
and timestamps t3, t11, t17 indexing successive versions of the value
“<html>…”.]
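The (row, column, timestamp) => value map can be mimicked with plain in-memory dicts. A minimal sketch of the abstraction only — the real system is distributed and persistent, and these names are invented:

```python
# Sketch of the BigTable data model: a sparse map keyed by
# (row, column, timestamp). Illustrative only; not the real API.
import bisect

class Table:
    def __init__(self):
        # row -> column -> sorted list of (timestamp, value)
        self.rows = {}

    def put(self, row, column, timestamp, value):
        cells = self.rows.setdefault(row, {}).setdefault(column, [])
        bisect.insort(cells, (timestamp, value))  # keep versions sorted

    def get(self, row, column, timestamp=None):
        """Return the newest value at or before `timestamp`
        (the latest version overall if timestamp is None)."""
        cells = self.rows.get(row, {}).get(column, [])
        if timestamp is not None:
            cells = [c for c in cells if c[0] <= timestamp]
        return cells[-1][1] if cells else None

t = Table()
t.put("www.cnn.com", "contents:", 3, "<html>v1")
t.put("www.cnn.com", "contents:", 11, "<html>v2")
t.put("www.cnn.com", "contents:", 17, "<html>v3")
print(t.get("www.cnn.com", "contents:"))      # <html>v3
print(t.get("www.cnn.com", "contents:", 12))  # <html>v2
```

Timestamped versions are what make the map multi-dimensional: a read can pin a point in time rather than just fetch the latest cell.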
Tablets (cont.)
[Diagram, built up incrementally: a table with rows “aaa.com”, “cnn.com”,
“cnn.com/sports.html”, …, “website.com”, …, “yahoo.com/kids.html”, …,
“zuppa.com/menu.html” sorted by row key, and columns “language:” and
“contents:” (e.g. “cnn.com” → EN, “<html>…”). The sorted row space is
split into tablets, each holding a contiguous row range; split points can
fall between any adjacent rows.]
Bigtable System Structure
[Diagram, built up incrementally: a Bigtable cell contains one Bigtable
master (performs metadata ops + load balancing) and many Bigtable tablet
servers (each serves data). Underneath, the cluster scheduling system
handles failover and monitoring, GFS holds tablet data and logs, and the
lock service holds metadata and handles master election. Clients use the
Bigtable client library: Open() bootstraps via the lock service, metadata
ops go to the Bigtable master, and reads/writes go directly to tablet
servers.]
Some BigTable Features
• Single-row transactions: easy to do read/modify/write operations
• Locality groups: segregate columns into different files
• In-memory columns: random access to small items
• Suite of compression techniques: per-locality group
• Bloom filters: avoid seeks for non-existent data
• Replication: eventual-consistency replication across datacenters,
  between multiple BigTable serving setups (master/slave & multi-master)
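The single-row transaction guarantee means all mutations to one row are serialized, so a read/modify/write needs no cross-row coordination. A toy sketch of those semantics only, using a per-row lock (all names invented, not the BigTable API):

```python
# Illustrative sketch of single-row read/modify/write semantics:
# mutations to one row are atomic because they run under that row's lock.
import threading

class Row:
    def __init__(self):
        self.lock = threading.Lock()
        self.columns = {}

class MiniTable:
    def __init__(self):
        self.rows = {}
        self.rows_lock = threading.Lock()

    def _row(self, key):
        with self.rows_lock:
            return self.rows.setdefault(key, Row())

    def read_modify_write(self, row_key, column, fn):
        """Atomically apply fn to the current value of (row, column)."""
        row = self._row(row_key)
        with row.lock:  # serializes all mutations to this row
            row.columns[column] = fn(row.columns.get(column, 0))
            return row.columns[column]

# Four threads each perform 1000 atomic increments on the same cell.
t = MiniTable()
threads = [threading.Thread(
    target=lambda: [t.read_modify_write("counts", "anchor:", lambda v: v + 1)
                    for _ in range(1000)])
    for _ in range(4)]
for th in threads: th.start()
for th in threads: th.join()
print(t.read_modify_write("counts", "anchor:", lambda v: v))  # 4000
```

No multi-row transactions are offered; keeping the guarantee to a single row is what keeps it cheap at scale.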
BigTable Usage
• 500+ BigTable cells
• Largest cells manage 6000TB+ of data, 3000+ machines
• Busiest cells sustain 500,000+ ops/second around the clock, and peak
  much higher
Data Processing: MapReduce
• Google’s batch processing tool of choice
• Users write two functions:
  – Map: Produces (key, value) pairs from input
  – Reduce: Merges (key, value) pairs from Map
• Library handles data transfer and failures
• Used everywhere: Earth, News, Analytics,
  Search Quality, Indexing, …
Example: Document Indexing
• Input: Set of documents D1, …, DN
• Map
  – Parse document D into terms T1, …, TM
  – Produces (key, value) pairs
     • (T1, D), …, (TM, D)
• Reduce
  – Receives the list of (key, value) pairs for term T
     • (T, D1), …, (T, DN)
  – Emits a single (key, value) pair
     • (T, (D1, …, DN))
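The indexing example above fits in a few lines as an in-process MapReduce. The two user functions are `map_doc` and `reduce_term` (names invented here); the "library" part that groups values by key is simulated with a dict:

```python
# The document-indexing example as a tiny single-process MapReduce.
from collections import defaultdict

def map_doc(doc_id, text):
    """Map: produce a (term, doc_id) pair for each term in the document."""
    for term in text.split():
        yield term, doc_id

def reduce_term(term, doc_ids):
    """Reduce: merge all (term, doc_id) pairs into one posting list."""
    return term, sorted(set(doc_ids))

def mapreduce(docs):
    # Shuffle: group intermediate values by key (the library's job).
    groups = defaultdict(list)
    for doc_id, text in docs.items():
        for key, value in map_doc(doc_id, text):
            groups[key].append(value)
    return dict(reduce_term(k, v) for k, v in groups.items())

docs = {"D1": "the quick fox", "D2": "the lazy dog"}
index = mapreduce(docs)
print(index["the"])  # ['D1', 'D2']
print(index["fox"])  # ['D1']
```

Everything outside the two user functions — partitioning, data movement, retries — is what the real library handles across thousands of machines.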
MapReduce Execution
[Diagram: a MapReduce master coordinates map tasks reading input from GFS.
Map tasks 1–3 apply map() and emit pairs (k1:v, k2:v, k3:v, k4:v); a
shuffle-and-sort phase groups them by key into reduce tasks (task 1:
k1:v,v,v and k3:v; task 2: k2:v and k4:v), which apply reduce() and write
output back to GFS.]
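The shuffle step in the diagram routes every intermediate key to exactly one reduce task, conventionally with a hash partitioner, then sorts each task's input by key. A sketch using CRC32 mod R as a stand-in deterministic partition function (the real library's partitioner is not specified here):

```python
# Sketch of the shuffle-and-sort phase: map outputs are routed to reduce
# tasks by a deterministic partition function, then sorted by key.
import zlib
from collections import defaultdict

NUM_REDUCE_TASKS = 2

def partition(key):
    # CRC32 stands in for the library's hash partitioner.
    return zlib.crc32(key.encode()) % NUM_REDUCE_TASKS

def shuffle(map_outputs):
    """Route each (key, value) pair to a reduce task, then sort each
    task's input by key -- the 'shuffle and sort' of the diagram."""
    per_task = defaultdict(list)
    for key, value in map_outputs:
        per_task[partition(key)].append((key, value))
    return {task: sorted(pairs) for task, pairs in per_task.items()}

map_outputs = [("k1", 1), ("k2", 2), ("k1", 3), ("k3", 4), ("k1", 5)]
shuffled = shuffle(map_outputs)
for task, pairs in sorted(shuffled.items()):
    print(task, pairs)
```

The invariant that matters: every copy of a given key lands in the same reduce task, so reduce() sees the complete value list for each key.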
MapReduce Tricks / Features
• Data locality
• Multiple I/O data types
• Data compression
• Pipelined shuffle stage
• Fast sorter
• Backup copies of tasks
• # tasks >> # machines
• Task re-execution on failure
• Local or cluster execution
• Distributed counters
MapReduce Programs in Google’s Source Tree
[Chart: number of MapReduce programs in the source tree, Jan-03 through
Sep-07, on a y-axis scaled to 12,500; the count rises throughout the
period.]
New MapReduce Programs Per Month
[Chart: new MapReduce programs per month, Jan-03 through Sep-07, on a
y-axis scaled to 700; a seasonal spike is annotated “Summer intern
effect”.]
MapReduce in Google
      Easy to use. Library hides complexity.

                           Mar ’05   Mar ’06   Sep ’07
Number of jobs                 72K      171K    2,217K
Average time (seconds)         934       874       395
Machine years used             981     2,002    11,081
Input data read (TB)        12,571    52,254   403,152
Intermediate data (TB)       2,756     6,743    34,774
Output data written (TB)       941     2,970    14,018
Average worker machines        232       268       394
Current Work
Scheduling system + GFS + BigTable + MapReduce work
  well within single clusters

Many separate instances in different data centers
  – Tools on top deal with cross-cluster issues
  – Each tool solves relatively narrow problem
       – Many tools => lots of complexity

 Can next generation infrastructure do more?
Next Generation Infrastructure
Truly global systems to span all our datacenters
• Global namespace with many replicas of data worldwide
• Support both consistent and inconsistent operations
• Continued operation even with datacenter partitions
• Users specify high-level desires:
       “99%ile latency for accessing this data should be <50ms”
       “Store this data on at least 2 disks in EU, 2 in U.S. & 1 in Asia”


• Increased utilization through automation
• Automatic migration, growing and shrinking of services
• Lower end-user latency
• Provide a high-level programming model for data-intensive interactive
  services
Questions?
Further info:

• The Google File System, Sanjay Ghemawat, Howard Gobioff, Shun-Tak Leung,
  SOSP ’03.

• Web Search for a Planet: The Google Cluster Architecture, Luiz André
  Barroso, Jeffrey Dean, Urs Hölzle, IEEE Micro, 2003.

• Bigtable: A Distributed Storage System for Structured Data, Fay Chang,
  Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike
  Burrows, Tushar Chandra, Andrew Fikes, Robert E. Gruber, OSDI ’06.

• MapReduce: Simplified Data Processing on Large Clusters, Jeffrey Dean and
  Sanjay Ghemawat, OSDI ’04.

• Failure Trends in a Large Disk Drive Population, Eduardo Pinheiro,
  Wolf-Dietrich Weber and Luiz André Barroso, FAST ’07.


   http://labs.google.com/papers
   http://labs.google.com/people/jeff or jeff@google.com

 
Running Applications on the NetBSD Rump Kernel by Justin Cormack
Running Applications on the NetBSD Rump Kernel by Justin Cormack Running Applications on the NetBSD Rump Kernel by Justin Cormack
Running Applications on the NetBSD Rump Kernel by Justin Cormack eurobsdcon
 
MapReduce: Simplified Data Processing on Large Clusters
MapReduce: Simplified Data Processing on Large ClustersMapReduce: Simplified Data Processing on Large Clusters
MapReduce: Simplified Data Processing on Large ClustersAshraf Uddin
 
Cache-partitioning
Cache-partitioningCache-partitioning
Cache-partitioningdavidkftam
 
[Podman Special Event] Kubernetes in Rootless Podman
[Podman Special Event] Kubernetes in Rootless Podman[Podman Special Event] Kubernetes in Rootless Podman
[Podman Special Event] Kubernetes in Rootless PodmanAkihiro Suda
 
"Making Computer Vision Software Run Fast on Your Embedded Platform," a Prese...
"Making Computer Vision Software Run Fast on Your Embedded Platform," a Prese..."Making Computer Vision Software Run Fast on Your Embedded Platform," a Prese...
"Making Computer Vision Software Run Fast on Your Embedded Platform," a Prese...Edge AI and Vision Alliance
 
GTS, Global Trigger and Synchronization system
GTS, Global Trigger and Synchronization systemGTS, Global Trigger and Synchronization system
GTS, Global Trigger and Synchronization systemJoelChavas
 
Automation@Brainly - Polish Linux Autumn 2014
Automation@Brainly - Polish Linux Autumn 2014Automation@Brainly - Polish Linux Autumn 2014
Automation@Brainly - Polish Linux Autumn 2014vespian_256
 
Sweetening Systems Management with Salt
Sweetening Systems Management with SaltSweetening Systems Management with Salt
Sweetening Systems Management with Saltmchesnut
 
Linux beginner's Workshop
Linux beginner's WorkshopLinux beginner's Workshop
Linux beginner's Workshopfutureshocked
 

Similar a 6 Dean Google (20)

Red Hat Global File System (GFS)
Red Hat Global File System (GFS)Red Hat Global File System (GFS)
Red Hat Global File System (GFS)
 
Google file system
Google file systemGoogle file system
Google file system
 
Infrastructure for cloud_computing
Infrastructure for cloud_computingInfrastructure for cloud_computing
Infrastructure for cloud_computing
 
%w(map reduce).first - A Tale About Rabbits, Latency, and Slim Crontabs
%w(map reduce).first - A Tale About Rabbits, Latency, and Slim Crontabs%w(map reduce).first - A Tale About Rabbits, Latency, and Slim Crontabs
%w(map reduce).first - A Tale About Rabbits, Latency, and Slim Crontabs
 
Nuxeo World Session: Scaling Nuxeo Applications
Nuxeo World Session: Scaling Nuxeo ApplicationsNuxeo World Session: Scaling Nuxeo Applications
Nuxeo World Session: Scaling Nuxeo Applications
 
LXDE Presentation at FOSDEM 2009
LXDE Presentation at FOSDEM 2009LXDE Presentation at FOSDEM 2009
LXDE Presentation at FOSDEM 2009
 
Building the World's Largest GPU
Building the World's Largest GPUBuilding the World's Largest GPU
Building the World's Largest GPU
 
Sync in an NFV World (Ram, ITSF 2016)
Sync in an NFV World (Ram, ITSF 2016)Sync in an NFV World (Ram, ITSF 2016)
Sync in an NFV World (Ram, ITSF 2016)
 
Ryu ods2012-spring
Ryu ods2012-springRyu ods2012-spring
Ryu ods2012-spring
 
Running Applications on the NetBSD Rump Kernel by Justin Cormack
Running Applications on the NetBSD Rump Kernel by Justin Cormack Running Applications on the NetBSD Rump Kernel by Justin Cormack
Running Applications on the NetBSD Rump Kernel by Justin Cormack
 
Realtime
RealtimeRealtime
Realtime
 
MapReduce: Simplified Data Processing on Large Clusters
MapReduce: Simplified Data Processing on Large ClustersMapReduce: Simplified Data Processing on Large Clusters
MapReduce: Simplified Data Processing on Large Clusters
 
Cache-partitioning
Cache-partitioningCache-partitioning
Cache-partitioning
 
[Podman Special Event] Kubernetes in Rootless Podman
[Podman Special Event] Kubernetes in Rootless Podman[Podman Special Event] Kubernetes in Rootless Podman
[Podman Special Event] Kubernetes in Rootless Podman
 
"Making Computer Vision Software Run Fast on Your Embedded Platform," a Prese...
"Making Computer Vision Software Run Fast on Your Embedded Platform," a Prese..."Making Computer Vision Software Run Fast on Your Embedded Platform," a Prese...
"Making Computer Vision Software Run Fast on Your Embedded Platform," a Prese...
 
GTS, Global Trigger and Synchronization system
GTS, Global Trigger and Synchronization systemGTS, Global Trigger and Synchronization system
GTS, Global Trigger and Synchronization system
 
Automation@Brainly - Polish Linux Autumn 2014
Automation@Brainly - Polish Linux Autumn 2014Automation@Brainly - Polish Linux Autumn 2014
Automation@Brainly - Polish Linux Autumn 2014
 
RTDroid_Presentation
RTDroid_PresentationRTDroid_Presentation
RTDroid_Presentation
 
Sweetening Systems Management with Salt
Sweetening Systems Management with SaltSweetening Systems Management with Salt
Sweetening Systems Management with Salt
 
Linux beginner's Workshop
Linux beginner's WorkshopLinux beginner's Workshop
Linux beginner's Workshop
 

Más de Frank Cai

把时间当朋友
把时间当朋友把时间当朋友
把时间当朋友Frank Cai
 
高性能Web服务器Nginx及相关新技术的应用实践
高性能Web服务器Nginx及相关新技术的应用实践高性能Web服务器Nginx及相关新技术的应用实践
高性能Web服务器Nginx及相关新技术的应用实践Frank Cai
 
Fotolog.Com.Mashraqi Scaling
Fotolog.Com.Mashraqi ScalingFotolog.Com.Mashraqi Scaling
Fotolog.Com.Mashraqi ScalingFrank Cai
 
高可用数据库平台架构及日常管理经验介绍.ppt
高可用数据库平台架构及日常管理经验介绍.ppt高可用数据库平台架构及日常管理经验介绍.ppt
高可用数据库平台架构及日常管理经验介绍.pptFrank Cai
 
系统性能分析和优化.ppt
系统性能分析和优化.ppt系统性能分析和优化.ppt
系统性能分析和优化.pptFrank Cai
 

Más de Frank Cai (6)

把时间当朋友
把时间当朋友把时间当朋友
把时间当朋友
 
高性能Web服务器Nginx及相关新技术的应用实践
高性能Web服务器Nginx及相关新技术的应用实践高性能Web服务器Nginx及相关新技术的应用实践
高性能Web服务器Nginx及相关新技术的应用实践
 
Fotolog.Com.Mashraqi Scaling
Fotolog.Com.Mashraqi ScalingFotolog.Com.Mashraqi Scaling
Fotolog.Com.Mashraqi Scaling
 
WAP2.0
WAP2.0WAP2.0
WAP2.0
 
高可用数据库平台架构及日常管理经验介绍.ppt
高可用数据库平台架构及日常管理经验介绍.ppt高可用数据库平台架构及日常管理经验介绍.ppt
高可用数据库平台架构及日常管理经验介绍.ppt
 
系统性能分析和优化.ppt
系统性能分析和优化.ppt系统性能分析和优化.ppt
系统性能分析和优化.ppt
 

Último

New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxLoriGlavin3
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxLoriGlavin3
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfLoriGlavin3
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersRaghuram Pandurangan
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfMounikaPolabathina
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningLars Bell
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxLoriGlavin3
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxLoriGlavin3
 
Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rick Flair
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 

Último (20)

New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information Developers
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdf
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
 
Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 

6 Dean Google

  • 1. Handling Large Datasets at Google: Current Systems and Future Directions Jeff Dean Google Fellow http://labs.google.com/people/jeff
  • 2. Outline • Hardware infrastructure • Distributed systems infrastructure: – Scheduling system – GFS – BigTable – MapReduce • Challenges and Future Directions – Infrastructure that spans all datacenters – More automation
  • 3. Sample Problem Domains • Offline batch jobs – Large datasets (PBs), bulk reads/writes (MB chunks) – Short outages acceptable – Web indexing, log processing, satellite imagery, etc. • Online applications – Smaller datasets (TBs), small reads/writes (KBs) – Outages immediately visible to users, low latency vital – Web search, Orkut, GMail, Google Docs, etc. • Many areas: IR, machine learning, image/video processing, NLP, machine translation, ...
  • 4. Typical New Engineer • Never seen a petabyte of data • Never used a thousand machines • Never really experienced machine failure Our software has to make them successful.
  • 5. Google’s Hardware Philosophy Truckloads of low-cost machines • Workloads are large and easily parallelized • Care about perf/$, not absolute machine perf • Even reliable hardware fails at our scale • Many datacenters, all around the world – Intra-DC bandwidth >> Inter-DC bandwidth – Speed of light has remained fixed in last 10 yrs :)
  • 6. Effects of Hardware Philosophy • Software must tolerate failure • Application’s particular machine should not matter • No special machines - just 2 or 3 flavors Google - 1999
  • 7. Current Design • In-house rack design • PC-class motherboards • Low-end storage and networking hardware • Linux • + in-house software
  • 8. The Joys of Real Hardware Typical first year for a new cluster: ~0.5 overheating (power down most machines in <5 mins, ~1-2 days to recover) ~1 PDU failure (~500-1000 machines suddenly disappear, ~6 hours to come back) ~1 rack-move (plenty of warning, ~500-1000 machines powered down, ~6 hours) ~1 network rewiring (rolling ~5% of machines down over 2-day span) ~20 rack failures (40-80 machines instantly disappear, 1-6 hours to get back) ~5 racks go wonky (40-80 machines see 50% packet loss) ~8 network maintenances (4 might cause ~30-minute random connectivity losses) ~12 router reloads (takes out DNS and external vips for a couple minutes) ~3 router failures (have to immediately pull traffic for an hour) ~dozens of minor 30-second blips for dns ~1000 individual machine failures ~thousands of hard drive failures slow disks, bad memory, misconfigured machines, flaky machines, etc.
  • 9.–14. Typical Cluster [diagram, built up incrementally across slides 9–14]: every machine runs Linux plus a scheduler slave and a GFS chunkserver; cluster-wide services (cluster scheduling master, GFS master, Chubby lock service) coordinate them; user applications (app1, app2) and BigTable tablet servers plus a BigTable master are scheduled onto the same machines.
  • 15. File Storage: GFS [diagram: clients, a GFS master, and chunkservers 1..N holding replicated chunks C0–C5] • Master: Manages file metadata • Chunkserver: Manages 64MB file chunks • Clients talk to master to open and find files • Clients talk directly to chunkservers for data
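The read path on the GFS slide above (metadata from the master, data directly from a chunkserver) can be sketched as follows. This is a hypothetical in-memory toy, not the real GFS API; the class and field names (`Master`, `chunk_locations`, `Client.read`) are invented for illustration.

```python
# Toy sketch of the GFS read path from the slide: the client asks the
# master only for metadata (which chunkservers hold which 64 MB chunk),
# then fetches the bytes directly from a chunkserver replica.
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB chunks

class Master:
    """Holds file metadata: (filename, chunk index) -> replica locations."""
    def __init__(self):
        self.chunk_locations = {}  # (filename, chunk_index) -> [chunkserver ids]

    def lookup(self, filename, offset):
        chunk_index = offset // CHUNK_SIZE
        return chunk_index, self.chunk_locations[(filename, chunk_index)]

class Client:
    def __init__(self, master, chunkservers):
        self.master = master
        self.chunkservers = chunkservers  # id -> {(filename, chunk_index): bytes}

    def read(self, filename, offset, length):
        # Step 1: metadata only, from the master.
        chunk_index, replicas = self.master.lookup(filename, offset)
        # Step 2: bytes directly from a replica; no data flows through the master.
        data = self.chunkservers[replicas[0]][(filename, chunk_index)]
        start = offset % CHUNK_SIZE
        return data[start:start + length]

master = Master()
master.chunk_locations[("logs/part-0", 0)] = ["cs1"]
chunkservers = {"cs1": {("logs/part-0", 0): b"hello world"}}
client = Client(master, chunkservers)
```

The point of the split is on the slide: the master stays small and cheap because it never touches file data, only locations.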
  • 16. GFS Usage • 200+ GFS clusters • Managed by an internal service team • Largest clusters – 5000+ machines – 5+ PB of disk usage – 10000+ clients
  • 17. Data Storage: BigTable What is it, really? • 10-ft view: Row & column abstraction for storing data • Reality: Distributed, persistent, multi-level sorted map
  • 18. BigTable Data Model • Multi-dimensional sparse sorted map (row, column, timestamp) => value
  • 19.–24. BigTable Data Model [diagram, built up incrementally across slides 19–24]: rows (e.g. “www.cnn.com”), columns (e.g. “contents:”), and timestamps (t3, t11, t17) index cell values (“&lt;html&gt;…”) in the (row, column, timestamp) => value map.
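The (row, column, timestamp) => value map on the slides above can be sketched as a toy in-memory structure. The class and method names here (`ToyBigtable`, `put`, `get`) are invented for illustration; the real system is distributed and persistent.

```python
from collections import defaultdict

# Toy in-memory version of the slides' data model: a sparse map from
# (row, column, timestamp) to value, where a read returns the newest
# version at or before the requested timestamp.
class ToyBigtable:
    def __init__(self):
        # row -> column -> {timestamp: value}
        self.cells = defaultdict(lambda: defaultdict(dict))

    def put(self, row, column, timestamp, value):
        self.cells[row][column][timestamp] = value

    def get(self, row, column, timestamp=float("inf")):
        versions = self.cells[row][column]
        eligible = [t for t in versions if t <= timestamp]
        if not eligible:
            return None
        return versions[max(eligible)]  # newest version not after `timestamp`

t = ToyBigtable()
t.put("www.cnn.com", "contents:", 3, "<html>v1")
t.put("www.cnn.com", "contents:", 17, "<html>v2")
```

Reading at t11 returns the t3 version, matching the slides' picture of multiple timestamped versions per cell.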
  • 25.–34. Tablets [diagram, built up incrementally across slides 25–34]: a table with rows sorted by key (“aaa.com”, “cnn.com”, “cnn.com/sports.html”, “website.com”, …, “yahoo.com/kids.html”, …, “zuppa.com/menu.html”) and columns “language:” and “contents:”; contiguous row ranges are grouped into tablets, which split at row boundaries (e.g. at “yahoo.com/kids.html”).
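Because tablets partition the sorted row space as shown above, finding which tablet serves a row reduces to a binary search over the tablets' end-row keys. A minimal sketch (the `TabletLocator` name and end-row representation are assumptions for illustration):

```python
import bisect

# Sketch of tablet lookup: tablets cover contiguous, sorted row ranges,
# so the tablet for a row is found by binary search over end-row keys,
# like the split at "yahoo.com/kids.html" on the slides.
class TabletLocator:
    def __init__(self, end_rows):
        # end_rows[i] is the last row key served by tablet i.
        self.end_rows = sorted(end_rows)

    def tablet_for(self, row):
        # The first tablet whose end row is >= `row` serves it.
        i = bisect.bisect_left(self.end_rows, row)
        if i == len(self.end_rows):
            raise KeyError(f"row {row!r} is beyond the last tablet")
        return i

loc = TabletLocator(["cnn.com/sports.html", "yahoo.com/kids.html",
                     "zuppa.com/menu.html"])
```

Splitting a hot tablet is then just inserting a new boundary key, which is why splits happen at row boundaries.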
  • 35.–45. Bigtable System Structure [diagram, built up incrementally across slides 35–45]: a Bigtable cell consists of a Bigtable master (performs metadata ops + load balancing) and many Bigtable tablet servers (serve data), layered on the cluster scheduling system (handles failover, monitoring), GFS (holds tablet data, logs), and the lock service (holds metadata, handles master election); the Bigtable client library Open()s via the lock service, performs metadata ops against the master, and reads/writes directly against tablet servers.
  • 46. Some BigTable Features • Single-row transactions: easy to do read/modify/write operations • Locality groups: segregate columns into different files • In-memory columns: random access to small items • Suite of compression techniques: per-locality group • Bloom filters: avoid seeks for non-existent data • Replication: eventual-consistency replication across datacenters, between multiple BigTable serving setups (master/slave & multi-master)
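The Bloom-filter feature on the slide above (avoiding disk seeks for non-existent data) works because a Bloom filter never gives false negatives: if it says a key is absent, the server can skip the seek entirely. A minimal sketch, with invented names (`BloomFilter`, `might_contain`) and parameters:

```python
import hashlib

# Minimal Bloom filter: k hash positions per key set in a bit array.
# "might_contain" can return a rare false positive, but never a false
# negative, so a negative answer safely skips the disk seek.
class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # bit array packed into one int

    def _positions(self, key):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for p in self._positions(key):
            self.bits |= 1 << p

    def might_contain(self, key):
        return all(self.bits & (1 << p) for p in self._positions(key))

bf = BloomFilter()
bf.add("www.cnn.com")
```

With 1024 bits and a single stored key, the false-positive probability for an absent key is vanishingly small, which is why the filter can be kept per locality group and consulted before every read.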
  • 47. BigTable Usage • 500+ BigTable cells • Largest cells manage 6000TB+ of data, 3000+ machines • Busiest cells sustain 500,000+ ops/second 24 hours/day, and peak much higher
  • 48. Data Processing: MapReduce • Google’s batch processing tool of choice • Users write two functions: – Map: Produces (key, value) pairs from input – Reduce: Merges (key, value) pairs from Map • Library handles data transfer and failures • Used everywhere: Earth, News, Analytics, Search Quality, Indexing, …
  • 49. Example: Document Indexing • Input: Set of documents D1, …, DN • Map – Parse document D into terms T1, …, TN – Produces (key, value) pairs • (T1, D), …, (TN, D) • Reduce – Receives list of (key, value) pairs for term T • (T, D1), …, (T, DN) – Emits single (key, value) pair • (T, (D1, …, DN))
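The document-indexing example above can be written directly as two functions plus an in-process shuffle. This toy runner (`run_mapreduce` and the function names are invented for illustration) shows the data flow only; the real library distributes the phases across machines and handles failures.

```python
from collections import defaultdict

# Toy in-process MapReduce runner for the slides' indexing example.
def map_index(doc_id, text):
    for term in set(text.split()):   # parse document D into terms
        yield term, doc_id           # emit (T, D) pairs

def reduce_index(term, doc_ids):
    return term, sorted(doc_ids)     # merge into (T, (D1, ..., DN))

def run_mapreduce(inputs, mapper, reducer):
    shuffle = defaultdict(list)      # the "shuffle and sort" phase:
    for key, value in inputs:        # group all values by key
        for k, v in mapper(key, value):
            shuffle[k].append(v)
    return dict(reducer(k, vs) for k, vs in sorted(shuffle.items()))

index = run_mapreduce(
    [("D1", "the quick fox"), ("D2", "the lazy dog")],
    map_index, reduce_index)
```

Everything outside `map_index` and `reduce_index` is what the library provides, which is the slide's point: users write only the two functions.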
  • 50. MapReduce Execution [diagram: a MapReduce master coordinates; map tasks read input and emit (key, value) pairs (k1:v, k2:v, …); a shuffle-and-sort phase groups values by key; reduce tasks merge each key's values and write output to GFS]
  • 51. MapReduce Tricks / Features • Data locality • Backup copies of tasks • # tasks >> # machines • Multiple I/O data types • Task re-execution on failure • Data compression • Local or cluster execution • Pipelined shuffle stage • Distributed counters • Fast sorter
  • 52. MapReduce Programs in Google’s Source Tree [bar chart, Jan-03 through Sep-07, y-axis 0–12,500, showing steady growth]
  • 53.–54. New MapReduce Programs Per Month [bar chart, Jan-03 through Sep-07, y-axis 0–700; slide 54 annotates the recurring summer spikes as “Summer intern effect”]
  • 55. MapReduce in Google — Easy to use. Library hides complexity.
      (Mar ’05 / Mar ’06 / Sep ’07)
      Number of jobs: 72K / 171K / 2,217K
      Average time (seconds): 934 / 874 / 395
      Machine years used: 981 / 2,002 / 11,081
      Input data read (TB): 12,571 / 52,254 / 403,152
      Intermediate data (TB): 2,756 / 6,743 / 34,774
      Output data written (TB): 941 / 2,970 / 14,018
      Average worker machines: 232 / 268 / 394
  • 56. Current Work Scheduling system + GFS + BigTable + MapReduce work well within single clusters Many separate instances in different data centers – Tools on top deal with cross-cluster issues – Each tool solves relatively narrow problem – Many tools => lots of complexity Can next generation infrastructure do more?
  • 57. Next Generation Infrastructure Truly global systems to span all our datacenters • Global namespace with many replicas of data worldwide • Support both consistent and inconsistent operations • Continued operation even with datacenter partitions • Users specify high-level desires: “99%ile latency for accessing this data should be <50ms” “Store this data on at least 2 disks in EU, 2 in U.S. & 1 in Asia” – Increased utilization through automation – Automatic migration, growing and shrinking of services – Lower end-user latency – Provide high-level programming model for data-intensive interactive services
  • 58. Questions? Further info: • The Google File System, Sanjay Ghemawat, Howard Gobioff, Shun-Tak Leung, SOSP ’03. • Web Search for a Planet: The Google Cluster Architecture, Luiz André Barroso, Jeffrey Dean, Urs Hölzle, IEEE Micro, 2003. • Bigtable: A Distributed Storage System for Structured Data, Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, Robert E. Gruber, OSDI ’06. • MapReduce: Simplified Data Processing on Large Clusters, Jeffrey Dean and Sanjay Ghemawat, OSDI ’04. • Failure Trends in a Large Disk Drive Population, Eduardo Pinheiro, Wolf-Dietrich Weber and Luiz André Barroso, FAST ’07. http://labs.google.com/papers http://labs.google.com/people/jeff or jeff@google.com