SlideShare una empresa de Scribd logo
1 de 25
Descargar para leer sin conexión
Outline
                Introduction
                      Design
             Implementation
                     Results
                 Conclusions




Bigtable: A Distributed Storage System for
             Structured Data

                 Alvanos Michalis


                    April 6, 2009




            Alvanos Michalis   Bigtable: A Distributed Storage System for Structured Data
Outline
                         Introduction
                               Design
                      Implementation
                              Results
                          Conclusions

1   Introduction
       Motivation
2   Design
       Data model
3   Implementation
       Building blocks
       Tablets
       Compactions
       Refinements
4   Results
       Hardware Environment
       Performance Evaluation
5   Conclusions
       Real applications
       Lessons
       End
                     Alvanos Michalis   Bigtable: A Distributed Storage System for Structured Data
Outline
                           Introduction
                                 Design
                                          Motivation
                        Implementation
                                Results
                            Conclusions


Google!

     Lots of Different kinds of data!
          Crawling system URLs, contents, links, anchors, page-rank etc
          Per-user data: preferences, recent queries/ search history
          Geographic data, images etc ...
     Many incoming requests
     No commercial system is big enough
          Scale is too large for commercial databases
          May not run on their commodity hardware
          No dependence on other vendors
          Optimizations
          Better Price/Performance
          Building internally means the system can be applied across
          many projects for low incremental cost

                       Alvanos Michalis   Bigtable: A Distributed Storage System for Structured Data
Outline
                           Introduction
                                 Design
                                          Motivation
                        Implementation
                                Results
                            Conclusions


Google goals



     Fault-tolerant, persistent
     Scalable
          1000s of servers
          Millions of reads/writes, efficient scans
     Self-managing
     Simple!




                       Alvanos Michalis   Bigtable: A Distributed Storage System for Structured Data
Outline
                            Introduction
                                  Design
                                           Data model
                         Implementation
                                 Results
                             Conclusions


Bigtable




  Definition
  A Bigtable is a sparse, distributed, persistent multidimensional
  sorted map.

  The map is indexed by a row key, column key, and a timestamp;
  each value in the map is an uninterpreted array of bytes.

  (row:string, column:string, time:int64) -> string
                        Alvanos Michalis   Bigtable: A Distributed Storage System for Structured Data
Outline
                            Introduction
                                  Design
                                           Data model
                         Implementation
                                 Results
                             Conclusions


Rows




       The row keys in a table are arbitrary strings
       Every read or write of data under a single row key is atomic
       maintains data in lexicographic order by row key


                        Alvanos Michalis   Bigtable: A Distributed Storage System for Structured Data
Outline
                         Introduction
                               Design
                                        Data model
                      Implementation
                              Results
                          Conclusions


Column Families




     Grouped into sets called column families
     All data stored in a column family is usually of the same type
     A column family must be created before data can be stored
     under any column key in that family
     A column key is named using the following syntax:
     family:qualifier
                     Alvanos Michalis   Bigtable: A Distributed Storage System for Structured Data
Outline
                          Introduction
                                Design
                                         Data model
                       Implementation
                               Results
                           Conclusions


Timestamps


     Each cell in a Bigtable can contain multiple versions of the
     same data; these versions are indexed by timestamp (64-bit
     integers).
     Applications that need to avoid collisions must generate
     unique timestamps themselves.
     To make the management of versioned data less onerous, they
     support two per-column-family settings that tell Bigtable to
     garbage-collect cell versions automatically.




                      Alvanos Michalis   Bigtable: A Distributed Storage System for Structured Data
Outline
                           Introduction   Building blocks
                                 Design   Tablets
                        Implementation    Compactions
                                Results   Refinements
                            Conclusions


Infrastructure


      Google WorkQueue (scheduler)
      GFS: large-scale distributed file system
          Master: responsible for metadata
          Chunk servers: responsible for r/w large chunks of data
          Chunks replicated on 3 machines; master responsible
      Chubby: lock/file/name service
          Coarse-grained locks; can store small amount of data in a lock
          5 replicas; need a majority vote to be active




                       Alvanos Michalis   Bigtable: A Distributed Storage System for Structured Data
Outline
                          Introduction   Building blocks
                                Design   Tablets
                       Implementation    Compactions
                               Results   Refinements
                           Conclusions


SSTable




     Lives in GFS
     Immutable, sorted file of key-value pairs
     Chunks of data plus an index
     Index is of block ranges, not values

                      Alvanos Michalis   Bigtable: A Distributed Storage System for Structured Data
Outline
                         Introduction   Building blocks
                               Design   Tablets
                      Implementation    Compactions
                              Results   Refinements
                          Conclusions


Tablet Design




     Large tables broken into tablets at row boundaries
         Tablets hold contiguous rows
         Approx 100 200 MB of data per tablet
     Approx 100 tablets per machine
         Fast recovery
         Load-balancing
     Built out of multiple SSTables
                     Alvanos Michalis   Bigtable: A Distributed Storage System for Structured Data
Outline
                           Introduction   Building blocks
                                 Design   Tablets
                        Implementation    Compactions
                                Results   Refinements
                            Conclusions


Tablet Location




      Like a B+-tree, but fixed at 3 levels
      How can we avoid creating a bottleneck at the root?
          Aggressively cache tablet locations
          Lookup starts from leaf (bet on it being correct); reverse on
          miss
                       Alvanos Michalis   Bigtable: A Distributed Storage System for Structured Data
Outline
                           Introduction   Building blocks
                                 Design   Tablets
                        Implementation    Compactions
                                Results   Refinements
                            Conclusions


Tablet Assignment
     Each tablet is assigned to one tablet server at a time. The
     master keeps track of the set of live tablet servers, and the
     current assignment of tablets to tablet servers.
     Bigtable uses Chubby to keep track of tablet servers. When a
     tablet server starts, it creates, and acquires an exclusive lock
     on, a uniquely-named file in a specific Chubby directory.
     Tablet server stops serving its tablets if loses its exclusive lock
     The master is responsible for detecting when a tablet server is
     no longer serving its tablets, and for reassigning those tablets
     as soon as possible.
     When a master is started by the cluster management system,
     it needs to discover the current tablet assignments before it
     can change them.
                       Alvanos Michalis   Bigtable: A Distributed Storage System for Structured Data
Outline
                          Introduction   Building blocks
                                Design   Tablets
                       Implementation    Compactions
                               Results   Refinements
                           Conclusions


Serving a Tablet




      Updates are logged
      Each SSTable corresponds to a batch of updates or a
      snapshot of the tablet taken at some earlier time
      Memtable (sorted by key) caches recent updates
      Reads consult both memtable and SSTables
                      Alvanos Michalis   Bigtable: A Distributed Storage System for Structured Data
Outline
                           Introduction   Building blocks
                                 Design   Tablets
                        Implementation    Compactions
                                Results   Refinements
                            Conclusions


Compactions

  As write operations execute, the size of the memtable increases.
      Minor compaction convert the memtable into an SSTable
           Reduce memory usage
           Reduce log traffic on restart
      Merging compaction
           Periodically executed in the background
           Reduce number of SSTables
           Good place to apply policy keep only N versions
      Major compaction
           Merging compaction that results in only one SSTable
           No deletion records, only live data
           Reclaim resources.

                       Alvanos Michalis   Bigtable: A Distributed Storage System for Structured Data
Outline
                          Introduction   Building blocks
                                Design   Tablets
                       Implementation    Compactions
                               Results   Refinements
                           Conclusions


Refinements (1/2)


     Group column families together into an SSTable. Segregating
     column families that are not typically accessed together into
     separate locality groups enables more efficient reads.
     Can compress locality groups, using Bentley and McIlroy’s
     scheme and a fast compression algorithm that looks for
     repetitions.
     Bloom Filters on locality groups allows to ask whether an
     SSTable might contain any data for a specified row/column
     pair. Drastically reduces the number of disk seeks required -
     for non-existent rows or columns do not need to touch disk.


                      Alvanos Michalis   Bigtable: A Distributed Storage System for Structured Data
Outline
                          Introduction   Building blocks
                                Design   Tablets
                       Implementation    Compactions
                               Results   Refinements
                           Conclusions


Refinements (2/2)


     Caching for read performance ( two levels of caching)
         Scan Cache: higher-level cache that caches the key-value pairs
         returned by the SSTable interface to the tablet server code.
         Block Cache: lower-level cache that caches SSTables blocks
         that were read from GFS.
     Commit-log implementation
     Speeding up tablet recovery (log entries)
     Exploiting immutability



                      Alvanos Michalis   Bigtable: A Distributed Storage System for Structured Data
Outline
                          Introduction
                                Design   Hardware Environment
                       Implementation    Performance Evaluation
                               Results
                           Conclusions


Hardware Environment


     Tablet servers were configured to use 1 GB of memory and to
     write to a GFS cell consisting of 1786 machines with two 400
     GB IDE hard drives each.
     Each machine had two dual-core Opteron 2 GHz chips
     Enough physical memory to hold the working set of all
     running processes
     Single gigabit Ethernet link
     Two-level tree-shaped switched network with 100-200 Gbps
     aggregate bandwidth at the root.



                      Alvanos Michalis   Bigtable: A Distributed Storage System for Structured Data
Outline
                          Introduction
                                Design   Hardware Environment
                       Implementation    Performance Evaluation
                               Results
                           Conclusions


Results Per Tablet Server




  Number of 1000-byte values read/written per second.


                      Alvanos Michalis   Bigtable: A Distributed Storage System for Structured Data
Outline
                          Introduction
                                Design   Hardware Environment
                       Implementation    Performance Evaluation
                               Results
                           Conclusions


Results Aggregate Rate




  Number of 1000-byte values read/written per second.

                      Alvanos Michalis   Bigtable: A Distributed Storage System for Structured Data
Outline
                          Introduction
                                Design   Hardware Environment
                       Implementation    Performance Evaluation
                               Results
                           Conclusions


Single tablet-server performance


      The tablet server executes 1200 reads per second ( 75
      MB/s), enough to saturate the tablet server CPUs because of
      overheads in networking stack
      Random and sequential writes perform better than random
      reads (commit log and uses group commit)
      No significant difference between random writes and
      sequential writes (same commit log)
      Sequential reads perform better than random reads (block
      cache)



                      Alvanos Michalis   Bigtable: A Distributed Storage System for Structured Data
Outline
                          Introduction
                                Design   Hardware Environment
                       Implementation    Performance Evaluation
                               Results
                           Conclusions


Scaling


      Aggregate throughput increases dramatically performance of
      random reads from memory increases
      However, performance does not increase linearly
      Drop in per-server throughput
          Imbalance in load: Re-balancing is throttled to reduce the
          number of tablet movement and the load generated by
          benchmarks shifts around as the benchmark progresses
          The random read benchmark: transfer one 64KB block over
          the network for every 1000-byte read and saturates shared 1
          Gigabit links



                      Alvanos Michalis   Bigtable: A Distributed Storage System for Structured Data
Outline
                            Introduction
                                           Real applications
                                  Design
                                           Lessons
                         Implementation
                                           End
                                 Results
                             Conclusions


Timestamps




     Google Analytics
     Google Earth
     Personalized Search

                        Alvanos Michalis   Bigtable: A Distributed Storage System for Structured Data
Outline
                           Introduction
                                          Real applications
                                 Design
                                          Lessons
                        Implementation
                                          End
                                Results
                            Conclusions


Lessons learned
      Large distributed systems are vulnerable to many types of
      failures, not just the standard network partitions and fail-stop
      failures
          Memory and network corruption
          Large clock skew
          Extended and asymmetric network partitions
          Bugs in other systems (Chubby !)
          ...
      Delay adding new features until it is clear how the new
      features will be used
      A practical lesson: the importance of proper system-level
      monitoring
      Keep It Simple!
                       Alvanos Michalis   Bigtable: A Distributed Storage System for Structured Data
Outline
                  Introduction
                                 Real applications
                        Design
                                 Lessons
               Implementation
                                 End
                       Results
                   Conclusions


END!




 QUESTIONS ?




           Alvanos Michalis      Bigtable: A Distributed Storage System for Structured Data

Más contenido relacionado

La actualidad más candente

Case Study: Automating Code Reviews for Custom SAP ABAP Applications with Vir...
Case Study: Automating Code Reviews for Custom SAP ABAP Applications with Vir...Case Study: Automating Code Reviews for Custom SAP ABAP Applications with Vir...
Case Study: Automating Code Reviews for Custom SAP ABAP Applications with Vir...Virtual Forge
 
Lecture4 big data technology foundations
Lecture4 big data technology foundationsLecture4 big data technology foundations
Lecture4 big data technology foundationshktripathy
 
Big Table, H base, Dynamo, Dynamo DB Lecture
Big Table, H base, Dynamo, Dynamo DB LectureBig Table, H base, Dynamo, Dynamo DB Lecture
Big Table, H base, Dynamo, Dynamo DB LectureDr Neelesh Jain
 
Hadoop Distributed File System
Hadoop Distributed File SystemHadoop Distributed File System
Hadoop Distributed File SystemRutvik Bapat
 
Architecture of data mining system
Architecture of data mining systemArchitecture of data mining system
Architecture of data mining systemramya marichamy
 
Scaling Out With Hadoop And HBase
Scaling Out With Hadoop And HBaseScaling Out With Hadoop And HBase
Scaling Out With Hadoop And HBaseAge Mooij
 
5.5 graph mining
5.5 graph mining5.5 graph mining
5.5 graph miningKrish_ver2
 
Hadoop File system (HDFS)
Hadoop File system (HDFS)Hadoop File system (HDFS)
Hadoop File system (HDFS)Prashant Gupta
 
Introduction to distributed file systems
Introduction to distributed file systemsIntroduction to distributed file systems
Introduction to distributed file systemsViet-Trung TRAN
 
Google File System
Google File SystemGoogle File System
Google File Systemguest2cb4689
 
3.5 model based clustering
3.5 model based clustering3.5 model based clustering
3.5 model based clusteringKrish_ver2
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to HadoopApache Apex
 
Map reduce in BIG DATA
Map reduce in BIG DATAMap reduce in BIG DATA
Map reduce in BIG DATAGauravBiswas9
 
Computing Outside The Box
Computing Outside The BoxComputing Outside The Box
Computing Outside The BoxIan Foster
 

La actualidad más candente (20)

Case Study: Automating Code Reviews for Custom SAP ABAP Applications with Vir...
Case Study: Automating Code Reviews for Custom SAP ABAP Applications with Vir...Case Study: Automating Code Reviews for Custom SAP ABAP Applications with Vir...
Case Study: Automating Code Reviews for Custom SAP ABAP Applications with Vir...
 
Lecture4 big data technology foundations
Lecture4 big data technology foundationsLecture4 big data technology foundations
Lecture4 big data technology foundations
 
Big Table, H base, Dynamo, Dynamo DB Lecture
Big Table, H base, Dynamo, Dynamo DB LectureBig Table, H base, Dynamo, Dynamo DB Lecture
Big Table, H base, Dynamo, Dynamo DB Lecture
 
Hadoop Distributed File System
Hadoop Distributed File SystemHadoop Distributed File System
Hadoop Distributed File System
 
Unit 2
Unit 2Unit 2
Unit 2
 
Architecture of data mining system
Architecture of data mining systemArchitecture of data mining system
Architecture of data mining system
 
Data mining primitives
Data mining primitivesData mining primitives
Data mining primitives
 
Scaling Out With Hadoop And HBase
Scaling Out With Hadoop And HBaseScaling Out With Hadoop And HBase
Scaling Out With Hadoop And HBase
 
5.5 graph mining
5.5 graph mining5.5 graph mining
5.5 graph mining
 
Hadoop File system (HDFS)
Hadoop File system (HDFS)Hadoop File system (HDFS)
Hadoop File system (HDFS)
 
Introduction to distributed file systems
Introduction to distributed file systemsIntroduction to distributed file systems
Introduction to distributed file systems
 
Data Mining
Data MiningData Mining
Data Mining
 
Cluster computing
Cluster computingCluster computing
Cluster computing
 
Cs6703 grid and cloud computing unit 3
Cs6703 grid and cloud computing unit 3Cs6703 grid and cloud computing unit 3
Cs6703 grid and cloud computing unit 3
 
Google File System
Google File SystemGoogle File System
Google File System
 
Data clustering
Data clustering Data clustering
Data clustering
 
3.5 model based clustering
3.5 model based clustering3.5 model based clustering
3.5 model based clustering
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to Hadoop
 
Map reduce in BIG DATA
Map reduce in BIG DATAMap reduce in BIG DATA
Map reduce in BIG DATA
 
Computing Outside The Box
Computing Outside The BoxComputing Outside The Box
Computing Outside The Box
 

Destacado

Bigtable
BigtableBigtable
Bigtablenextlib
 
Summary of "Google's Big Table" at nosql summer reading in Tokyo
Summary of "Google's Big Table" at nosql summer reading in TokyoSummary of "Google's Big Table" at nosql summer reading in Tokyo
Summary of "Google's Big Table" at nosql summer reading in TokyoCLOUDIAN KK
 
Dynamo and BigTable in light of the CAP theorem
Dynamo and BigTable in light of the CAP theoremDynamo and BigTable in light of the CAP theorem
Dynamo and BigTable in light of the CAP theoremGrisha Weintraub
 
Dynamo and BigTable - Review and Comparison
Dynamo and BigTable - Review and ComparisonDynamo and BigTable - Review and Comparison
Dynamo and BigTable - Review and ComparisonGrisha Weintraub
 
Big table
Big tableBig table
Big tablePSIT
 
Google Bigtable paper presentation
Google Bigtable paper presentationGoogle Bigtable paper presentation
Google Bigtable paper presentationvanjakom
 
Cloud-native Apps - Architektur, Implementierung, Demo
Cloud-native Apps - Architektur, Implementierung, DemoCloud-native Apps - Architektur, Implementierung, Demo
Cloud-native Apps - Architektur, Implementierung, DemoAndreas Koop
 
ROCKING YOUR SEAT AT THE BIG TABLE - ROB BAILEY
ROCKING YOUR SEAT AT THE BIG TABLE - ROB BAILEYROCKING YOUR SEAT AT THE BIG TABLE - ROB BAILEY
ROCKING YOUR SEAT AT THE BIG TABLE - ROB BAILEYSupport Driven
 
I've (probably) been using Google App Engine for a week longer than you have
I've (probably) been using Google App Engine for a week longer than you haveI've (probably) been using Google App Engine for a week longer than you have
I've (probably) been using Google App Engine for a week longer than you haveSimon Willison
 
Innovation Case Study BitTorrent
Innovation Case Study BitTorrentInnovation Case Study BitTorrent
Innovation Case Study BitTorrentPetter S. Rønning
 
Privacy preserving public auditing for regenerating-code-based cloud storage
Privacy preserving public auditing for regenerating-code-based cloud storagePrivacy preserving public auditing for regenerating-code-based cloud storage
Privacy preserving public auditing for regenerating-code-based cloud storageparry prabhu
 
Privacy preserving public auditing for regenerating-code-based cloud storage
Privacy preserving public auditing for regenerating-code-based cloud storagePrivacy preserving public auditing for regenerating-code-based cloud storage
Privacy preserving public auditing for regenerating-code-based cloud storageNagamalleswararao Tadikonda
 

Destacado (20)

google Bigtable
google Bigtablegoogle Bigtable
google Bigtable
 
Bigtable
BigtableBigtable
Bigtable
 
Summary of "Google's Big Table" at nosql summer reading in Tokyo
Summary of "Google's Big Table" at nosql summer reading in TokyoSummary of "Google's Big Table" at nosql summer reading in Tokyo
Summary of "Google's Big Table" at nosql summer reading in Tokyo
 
Bigtable and Dynamo
Bigtable and DynamoBigtable and Dynamo
Bigtable and Dynamo
 
Dynamo and BigTable in light of the CAP theorem
Dynamo and BigTable in light of the CAP theoremDynamo and BigTable in light of the CAP theorem
Dynamo and BigTable in light of the CAP theorem
 
Dynamo and BigTable - Review and Comparison
Dynamo and BigTable - Review and ComparisonDynamo and BigTable - Review and Comparison
Dynamo and BigTable - Review and Comparison
 
Google's BigTable
Google's BigTableGoogle's BigTable
Google's BigTable
 
Big table
Big tableBig table
Big table
 
Big table
Big tableBig table
Big table
 
Bigtable
BigtableBigtable
Bigtable
 
Google Bigtable paper presentation
Google Bigtable paper presentationGoogle Bigtable paper presentation
Google Bigtable paper presentation
 
Cloud-native Apps - Architektur, Implementierung, Demo
Cloud-native Apps - Architektur, Implementierung, DemoCloud-native Apps - Architektur, Implementierung, Demo
Cloud-native Apps - Architektur, Implementierung, Demo
 
ROCKING YOUR SEAT AT THE BIG TABLE - ROB BAILEY
ROCKING YOUR SEAT AT THE BIG TABLE - ROB BAILEYROCKING YOUR SEAT AT THE BIG TABLE - ROB BAILEY
ROCKING YOUR SEAT AT THE BIG TABLE - ROB BAILEY
 
I've (probably) been using Google App Engine for a week longer than you have
I've (probably) been using Google App Engine for a week longer than you haveI've (probably) been using Google App Engine for a week longer than you have
I've (probably) been using Google App Engine for a week longer than you have
 
Big table
Big tableBig table
Big table
 
Innovation Case Study BitTorrent
Innovation Case Study BitTorrentInnovation Case Study BitTorrent
Innovation Case Study BitTorrent
 
Privacy preserving public auditing for regenerating-code-based cloud storage
Privacy preserving public auditing for regenerating-code-based cloud storagePrivacy preserving public auditing for regenerating-code-based cloud storage
Privacy preserving public auditing for regenerating-code-based cloud storage
 
Big table
Big tableBig table
Big table
 
Privacy preserving public auditing for regenerating-code-based cloud storage
Privacy preserving public auditing for regenerating-code-based cloud storagePrivacy preserving public auditing for regenerating-code-based cloud storage
Privacy preserving public auditing for regenerating-code-based cloud storage
 
Bigtable
BigtableBigtable
Bigtable
 

Similar a Bigtable: A Distributed Storage System for Structured Data

Snowflake Cloning.pdf
Snowflake Cloning.pdfSnowflake Cloning.pdf
Snowflake Cloning.pdfVishnuGone
 
Bigtable
BigtableBigtable
Bigtableptdorf
 
VectorDB Schema Design 101 - Considerations for Building a Scalable and Perfo...
VectorDB Schema Design 101 - Considerations for Building a Scalable and Perfo...VectorDB Schema Design 101 - Considerations for Building a Scalable and Perfo...
VectorDB Schema Design 101 - Considerations for Building a Scalable and Perfo...Zilliz
 
Introduction to Big Data
Introduction to Big DataIntroduction to Big Data
Introduction to Big DataVipin Batra
 
MySQL 8 Tips and Tricks from Symfony USA 2018, San Francisco
MySQL 8 Tips and Tricks from Symfony USA 2018, San FranciscoMySQL 8 Tips and Tricks from Symfony USA 2018, San Francisco
MySQL 8 Tips and Tricks from Symfony USA 2018, San FranciscoDave Stokes
 
Bigtable osdi06
Bigtable osdi06Bigtable osdi06
Bigtable osdi06temp2004it
 
Bigtable osdi06
Bigtable osdi06Bigtable osdi06
Bigtable osdi06mrlonganh
 
MySQL 8 Server Optimization Swanseacon 2018
MySQL 8 Server Optimization Swanseacon 2018MySQL 8 Server Optimization Swanseacon 2018
MySQL 8 Server Optimization Swanseacon 2018Dave Stokes
 
SURVEY ON IMPLEMANTATION OF COLUMN ORIENTED NOSQL DATA STORES ( BIGTABLE & CA...
SURVEY ON IMPLEMANTATION OF COLUMN ORIENTED NOSQL DATA STORES ( BIGTABLE & CA...SURVEY ON IMPLEMANTATION OF COLUMN ORIENTED NOSQL DATA STORES ( BIGTABLE & CA...
SURVEY ON IMPLEMANTATION OF COLUMN ORIENTED NOSQL DATA STORES ( BIGTABLE & CA...IJCERT JOURNAL
 
Data Warehousing Trends, Best Practices, and Future Outlook
Data Warehousing Trends, Best Practices, and Future OutlookData Warehousing Trends, Best Practices, and Future Outlook
Data Warehousing Trends, Best Practices, and Future OutlookJames Serra
 
Cloud Architecture Patterns for Mere Mortals - Bill Wilder - Vermont Code Cam...
Cloud Architecture Patterns for Mere Mortals - Bill Wilder - Vermont Code Cam...Cloud Architecture Patterns for Mere Mortals - Bill Wilder - Vermont Code Cam...
Cloud Architecture Patterns for Mere Mortals - Bill Wilder - Vermont Code Cam...Bill Wilder
 

Similar a Bigtable: A Distributed Storage System for Structured Data (20)

Scaling Up vs. Scaling-out
Scaling Up vs. Scaling-outScaling Up vs. Scaling-out
Scaling Up vs. Scaling-out
 
Snowflake Cloning.pdf
Snowflake Cloning.pdfSnowflake Cloning.pdf
Snowflake Cloning.pdf
 
Bigtable
BigtableBigtable
Bigtable
 
VectorDB Schema Design 101 - Considerations for Building a Scalable and Perfo...
VectorDB Schema Design 101 - Considerations for Building a Scalable and Perfo...VectorDB Schema Design 101 - Considerations for Building a Scalable and Perfo...
VectorDB Schema Design 101 - Considerations for Building a Scalable and Perfo...
 
NOSQL
NOSQLNOSQL
NOSQL
 
Introduction to Big Data
Introduction to Big DataIntroduction to Big Data
Introduction to Big Data
 
MySQL 8 Tips and Tricks from Symfony USA 2018, San Francisco
MySQL 8 Tips and Tricks from Symfony USA 2018, San FranciscoMySQL 8 Tips and Tricks from Symfony USA 2018, San Francisco
MySQL 8 Tips and Tricks from Symfony USA 2018, San Francisco
 
gfs-sosp2003
gfs-sosp2003gfs-sosp2003
gfs-sosp2003
 
gfs-sosp2003
gfs-sosp2003gfs-sosp2003
gfs-sosp2003
 
Bigtable osdi06
Bigtable osdi06Bigtable osdi06
Bigtable osdi06
 
Bigtable osdi06
Bigtable osdi06Bigtable osdi06
Bigtable osdi06
 
Bigtable osdi06
Bigtable osdi06Bigtable osdi06
Bigtable osdi06
 
MySQL 8 Server Optimization Swanseacon 2018
MySQL 8 Server Optimization Swanseacon 2018MySQL 8 Server Optimization Swanseacon 2018
MySQL 8 Server Optimization Swanseacon 2018
 
SURVEY ON IMPLEMANTATION OF COLUMN ORIENTED NOSQL DATA STORES ( BIGTABLE & CA...
SURVEY ON IMPLEMANTATION OF COLUMN ORIENTED NOSQL DATA STORES ( BIGTABLE & CA...SURVEY ON IMPLEMANTATION OF COLUMN ORIENTED NOSQL DATA STORES ( BIGTABLE & CA...
SURVEY ON IMPLEMANTATION OF COLUMN ORIENTED NOSQL DATA STORES ( BIGTABLE & CA...
 
Gfs论文
Gfs论文Gfs论文
Gfs论文
 
The google file system
The google file systemThe google file system
The google file system
 
Data Warehousing Trends, Best Practices, and Future Outlook
Data Warehousing Trends, Best Practices, and Future OutlookData Warehousing Trends, Best Practices, and Future Outlook
Data Warehousing Trends, Best Practices, and Future Outlook
 
Cloud Architecture Patterns for Mere Mortals - Bill Wilder - Vermont Code Cam...
Cloud Architecture Patterns for Mere Mortals - Bill Wilder - Vermont Code Cam...Cloud Architecture Patterns for Mere Mortals - Bill Wilder - Vermont Code Cam...
Cloud Architecture Patterns for Mere Mortals - Bill Wilder - Vermont Code Cam...
 
Sybase IQ Big Data
Sybase IQ Big DataSybase IQ Big Data
Sybase IQ Big Data
 
Sybase IQ ve Big Data
Sybase IQ ve Big DataSybase IQ ve Big Data
Sybase IQ ve Big Data
 

Más de elliando dias

Clojurescript slides
Clojurescript slidesClojurescript slides
Clojurescript slideselliando dias
 
Why you should be excited about ClojureScript
Why you should be excited about ClojureScriptWhy you should be excited about ClojureScript
Why you should be excited about ClojureScriptelliando dias
 
Functional Programming with Immutable Data Structures
Functional Programming with Immutable Data StructuresFunctional Programming with Immutable Data Structures
Functional Programming with Immutable Data Structureselliando dias
 
Nomenclatura e peças de container
Nomenclatura  e peças de containerNomenclatura  e peças de container
Nomenclatura e peças de containerelliando dias
 
Polyglot and Poly-paradigm Programming for Better Agility
Polyglot and Poly-paradigm Programming for Better AgilityPolyglot and Poly-paradigm Programming for Better Agility
Polyglot and Poly-paradigm Programming for Better Agilityelliando dias
 
Javascript Libraries
Javascript LibrariesJavascript Libraries
Javascript Librarieselliando dias
 
How to Make an Eight Bit Computer and Save the World!
How to Make an Eight Bit Computer and Save the World!How to Make an Eight Bit Computer and Save the World!
How to Make an Eight Bit Computer and Save the World!elliando dias
 
A Practical Guide to Connecting Hardware to the Web
A Practical Guide to Connecting Hardware to the WebA Practical Guide to Connecting Hardware to the Web
A Practical Guide to Connecting Hardware to the Webelliando dias
 
Introdução ao Arduino
Introdução ao ArduinoIntrodução ao Arduino
Introdução ao Arduinoelliando dias
 
Incanter Data Sorcery
Incanter Data SorceryIncanter Data Sorcery
Incanter Data Sorceryelliando dias
 
Fab.in.a.box - Fab Academy: Machine Design
Fab.in.a.box - Fab Academy: Machine DesignFab.in.a.box - Fab Academy: Machine Design
Fab.in.a.box - Fab Academy: Machine Designelliando dias
 
The Digital Revolution: Machines that makes
The Digital Revolution: Machines that makesThe Digital Revolution: Machines that makes
The Digital Revolution: Machines that makeselliando dias
 
Hadoop - Simple. Scalable.
Hadoop - Simple. Scalable.Hadoop - Simple. Scalable.
Hadoop - Simple. Scalable.elliando dias
 
Hadoop and Hive Development at Facebook
Hadoop and Hive Development at FacebookHadoop and Hive Development at Facebook
Hadoop and Hive Development at Facebookelliando dias
 
Multi-core Parallelization in Clojure - a Case Study
Multi-core Parallelization in Clojure - a Case StudyMulti-core Parallelization in Clojure - a Case Study
Multi-core Parallelization in Clojure - a Case Studyelliando dias
 

Más de elliando dias (20)

Clojurescript slides
Clojurescript slidesClojurescript slides
Clojurescript slides
 
Why you should be excited about ClojureScript
Why you should be excited about ClojureScriptWhy you should be excited about ClojureScript
Why you should be excited about ClojureScript
 
Functional Programming with Immutable Data Structures
Functional Programming with Immutable Data StructuresFunctional Programming with Immutable Data Structures
Functional Programming with Immutable Data Structures
 
Nomenclatura e peças de container
Nomenclatura  e peças de containerNomenclatura  e peças de container
Nomenclatura e peças de container
 
Geometria Projetiva
Geometria ProjetivaGeometria Projetiva
Geometria Projetiva
 
Polyglot and Poly-paradigm Programming for Better Agility
Polyglot and Poly-paradigm Programming for Better AgilityPolyglot and Poly-paradigm Programming for Better Agility
Polyglot and Poly-paradigm Programming for Better Agility
 
Javascript Libraries
Javascript LibrariesJavascript Libraries
Javascript Libraries
 
How to Make an Eight Bit Computer and Save the World!
How to Make an Eight Bit Computer and Save the World!How to Make an Eight Bit Computer and Save the World!
How to Make an Eight Bit Computer and Save the World!
 
Ragel talk
Ragel talkRagel talk
Ragel talk
 
A Practical Guide to Connecting Hardware to the Web
A Practical Guide to Connecting Hardware to the WebA Practical Guide to Connecting Hardware to the Web
A Practical Guide to Connecting Hardware to the Web
 
Introdução ao Arduino
Introdução ao ArduinoIntrodução ao Arduino
Introdução ao Arduino
 
Minicurso arduino
Minicurso arduinoMinicurso arduino
Minicurso arduino
 
Incanter Data Sorcery
Incanter Data SorceryIncanter Data Sorcery
Incanter Data Sorcery
 
Rango
RangoRango
Rango
 
Fab.in.a.box - Fab Academy: Machine Design
Fab.in.a.box - Fab Academy: Machine DesignFab.in.a.box - Fab Academy: Machine Design
Fab.in.a.box - Fab Academy: Machine Design
 
The Digital Revolution: Machines that makes
The Digital Revolution: Machines that makesThe Digital Revolution: Machines that makes
The Digital Revolution: Machines that makes
 
Hadoop + Clojure
Hadoop + ClojureHadoop + Clojure
Hadoop + Clojure
 
Hadoop - Simple. Scalable.
Hadoop - Simple. Scalable.Hadoop - Simple. Scalable.
Hadoop - Simple. Scalable.
 
Hadoop and Hive Development at Facebook
Hadoop and Hive Development at FacebookHadoop and Hive Development at Facebook
Hadoop and Hive Development at Facebook
 
Multi-core Parallelization in Clojure - a Case Study
Multi-core Parallelization in Clojure - a Case StudyMulti-core Parallelization in Clojure - a Case Study
Multi-core Parallelization in Clojure - a Case Study
 

Último

Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitecturePixlogix Infotech
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Alan Dix
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxOnBoard
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 

Último (20)

Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC Architecture
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptx
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 

Bigtable: A Distributed Storage System for Structured Data

  • 1. Outline Introduction Design Implementation Results Conclusions Bigtable: A Distributed Storage System for Structured Data Alvanos Michalis April 6, 2009 Alvanos Michalis Bigtable: A Distributed Storage System for Structured Data
  • 2. Outline Introduction Design Implementation Results Conclusions 1 Introduction Motivation 2 Design Data model 3 Implementation Building blocks Tablets Compactions Refinements 4 Results Hardware Environment Performance Evaluation 5 Conclusions Real applications Lessons End Alvanos Michalis Bigtable: A Distributed Storage System for Structured Data
  • 3. Outline Introduction Design Motivation Implementation Results Conclusions Google! Lots of Different kinds of data! Crawling system URLs, contents, links, anchors, page-rank etc Per-user data: preferences, recent queries/ search history Geographic data, images etc ... Many incoming requests No commercial system is big enough Scale is too large for commercial databases May not run on their commodity hardware No dependence on other vendors Optimizations Better Price/Performance Building internally means the system can be applied across many projects for low incremental cost Alvanos Michalis Bigtable: A Distributed Storage System for Structured Data
  • 4. Outline Introduction Design Motivation Implementation Results Conclusions Google goals Fault-tolerant, persistent Scalable 1000s of servers Millions of reads/writes, efficient scans Self-managing Simple! Alvanos Michalis Bigtable: A Distributed Storage System for Structured Data
  • 5. Outline Introduction Design Data model Implementation Results Conclusions Bigtable Definition A Bigtable is a sparse, distributed, persistent multidimensional sorted map. The map is indexed by a row key, column key, and a timestamp; each value in the map is an uninterpreted array of bytes. (row:string, column:string, time:int64) -> string Alvanos Michalis Bigtable: A Distributed Storage System for Structured Data
  • 6. Outline Introduction Design Data model Implementation Results Conclusions Rows The row keys in a table are arbitrary strings Every read or write of data under a single row key is atomic maintains data in lexicographic order by row key Alvanos Michalis Bigtable: A Distributed Storage System for Structured Data
  • 7. Outline Introduction Design Data model Implementation Results Conclusions Column Families Grouped into sets called column families All data stored in a column family is usually of the same type A column family must be created before data can be stored under any column key in that family A column key is named using the following syntax: family:qualifier Alvanos Michalis Bigtable: A Distributed Storage System for Structured Data
  • 8. Outline Introduction Design Data model Implementation Results Conclusions Timestamps Each cell in a Bigtable can contain multiple versions of the same data; these versions are indexed by timestamp (64-bit integers). Applications that need to avoid collisions must generate unique timestamps themselves. To make the management of versioned data less onerous, they support two per-column-family settings that tell Bigtable to garbage-collect cell versions automatically. Alvanos Michalis Bigtable: A Distributed Storage System for Structured Data
  • 9. Outline Introduction Building blocks Design Tablets Implementation Compactions Results Refinements Conclusions Infrastructure Google WorkQueue (scheduler) GFS: large-scale distributed file system Master: responsible for metadata Chunk servers: responsible for r/w large chunks of data Chunks replicated on 3 machines; master responsible Chubby: lock/file/name service Coarse-grained locks; can store small amount of data in a lock 5 replicas; need a majority vote to be active Alvanos Michalis Bigtable: A Distributed Storage System for Structured Data
  • 10. Outline Introduction Building blocks Design Tablets Implementation Compactions Results Refinements Conclusions SSTable Lives in GFS Immutable, sorted file of key-value pairs Chunks of data plus an index Index is of block ranges, not values Alvanos Michalis Bigtable: A Distributed Storage System for Structured Data
  • 11. Outline Introduction Building blocks Design Tablets Implementation Compactions Results Refinements Conclusions Tablet Design Large tables broken into tablets at row boundaries Tablets hold contiguous rows Approx 100 200 MB of data per tablet Approx 100 tablets per machine Fast recovery Load-balancing Built out of multiple SSTables Alvanos Michalis Bigtable: A Distributed Storage System for Structured Data
  • 12. Outline Introduction Building blocks Design Tablets Implementation Compactions Results Refinements Conclusions Tablet Location Like a B+-tree, but fixed at 3 levels How can we avoid creating a bottleneck at the root? Aggressively cache tablet locations Lookup starts from leaf (bet on it being correct); reverse on miss Alvanos Michalis Bigtable: A Distributed Storage System for Structured Data
  • 13. Outline Introduction Building blocks Design Tablets Implementation Compactions Results Refinements Conclusions Tablet Assignment Each tablet is assigned to one tablet server at a time. The master keeps track of the set of live tablet servers, and the current assignment of tablets to tablet servers. Bigtable uses Chubby to keep track of tablet servers. When a tablet server starts, it creates, and acquires an exclusive lock on, a uniquely-named file in a specific Chubby directory. Tablet server stops serving its tablets if loses its exclusive lock The master is responsible for detecting when a tablet server is no longer serving its tablets, and for reassigning those tablets as soon as possible. When a master is started by the cluster management system, it needs to discover the current tablet assignments before it can change them. Alvanos Michalis Bigtable: A Distributed Storage System for Structured Data
  • 14. Outline Introduction Building blocks Design Tablets Implementation Compactions Results Refinements Conclusions Serving a Tablet Updates are logged Each SSTable corresponds to a batch of updates or a snapshot of the tablet taken at some earlier time Memtable (sorted by key) caches recent updates Reads consult both memtable and SSTables Alvanos Michalis Bigtable: A Distributed Storage System for Structured Data
  • 15. Outline Introduction Building blocks Design Tablets Implementation Compactions Results Refinements Conclusions Compactions As write operations execute, the size of the memtable increases. Minor compaction convert the memtable into an SSTable Reduce memory usage Reduce log traffic on restart Merging compaction Periodically executed in the background Reduce number of SSTables Good place to apply policy keep only N versions Major compaction Merging compaction that results in only one SSTable No deletion records, only live data Reclaim resources. Alvanos Michalis Bigtable: A Distributed Storage System for Structured Data
  • 16. Outline Introduction Building blocks Design Tablets Implementation Compactions Results Refinements Conclusions Refinements (1/2) Group column families together into an SSTable. Segregating column families that are not typically accessed together into separate locality groups enables more efficient reads. Can compress locality groups, using Bentley and McIlroy’s scheme and a fast compression algorithm that looks for repetitions. Bloom Filters on locality groups allows to ask whether an SSTable might contain any data for a specified row/column pair. Drastically reduces the number of disk seeks required - for non-existent rows or columns do not need to touch disk. Alvanos Michalis Bigtable: A Distributed Storage System for Structured Data
  • 17. Outline Introduction Building blocks Design Tablets Implementation Compactions Results Refinements Conclusions Refinements (2/2) Caching for read performance ( two levels of caching) Scan Cache: higher-level cache that caches the key-value pairs returned by the SSTable interface to the tablet server code. Block Cache: lower-level cache that caches SSTables blocks that were read from GFS. Commit-log implementation Speeding up tablet recovery (log entries) Exploiting immutability Alvanos Michalis Bigtable: A Distributed Storage System for Structured Data
  • 18. Outline Introduction Design Hardware Environment Implementation Performance Evaluation Results Conclusions Hardware Environment Tablet servers were configured to use 1 GB of memory and to write to a GFS cell consisting of 1786 machines with two 400 GB IDE hard drives each. Each machine had two dual-core Opteron 2 GHz chips Enough physical memory to hold the working set of all running processes Single gigabit Ethernet link Two-level tree-shaped switched network with 100-200 Gbps aggregate bandwidth at the root. Alvanos Michalis Bigtable: A Distributed Storage System for Structured Data
  • 19. Outline Introduction Design Hardware Environment Implementation Performance Evaluation Results Conclusions Results Per Tablet Server Number of 1000-byte values read/written per second. Alvanos Michalis Bigtable: A Distributed Storage System for Structured Data
  • 20. Outline Introduction Design Hardware Environment Implementation Performance Evaluation Results Conclusions Results Aggregate Rate Number of 1000-byte values read/written per second. Alvanos Michalis Bigtable: A Distributed Storage System for Structured Data
  • 21. Outline Introduction Design Hardware Environment Implementation Performance Evaluation Results Conclusions Single tablet-server performance The tablet server executes 1200 reads per second ( 75 MB/s), enough to saturate the tablet server CPUs because of overheads in networking stack Random and sequential writes perform better than random reads (commit log and uses group commit) No significant difference between random writes and sequential writes (same commit log) Sequential reads perform better than random reads (block cache) Alvanos Michalis Bigtable: A Distributed Storage System for Structured Data
  • 22. Outline Introduction Design Hardware Environment Implementation Performance Evaluation Results Conclusions Scaling Aggregate throughput increases dramatically performance of random reads from memory increases However, performance does not increase linearly Drop in per-server throughput Imbalance in load: Re-balancing is throttled to reduce the number of tablet movement and the load generated by benchmarks shifts around as the benchmark progresses The random read benchmark: transfer one 64KB block over the network for every 1000-byte read and saturates shared 1 Gigabit links Alvanos Michalis Bigtable: A Distributed Storage System for Structured Data
  • 23. Outline Introduction Real applications Design Lessons Implementation End Results Conclusions Timestamps Google Analytics Google Earth Personalized Search Alvanos Michalis Bigtable: A Distributed Storage System for Structured Data
  • 24. Outline Introduction Real applications Design Lessons Implementation End Results Conclusions Lessons learned Large distributed systems are vulnerable to many types of failures, not just the standard network partitions and fail-stop failures Memory and network corruption Large clock skew Extended and asymmetric network partitions Bugs in other systems (Chubby !) ... Delay adding new features until it is clear how the new features will be used A practical lesson: the importance of proper system-level monitoring Keep It Simple! Alvanos Michalis Bigtable: A Distributed Storage System for Structured Data
  • 25. Outline Introduction Real applications Design Lessons Implementation End Results Conclusions END! QUESTIONS ? Alvanos Michalis Bigtable: A Distributed Storage System for Structured Data