HBASE Backups



Pritam Damania
Software Engineer, Facebook
Aug 1, 2012
Agenda
 1   Introduction to HBASE and HDFS

 2   Introduction to HBASE Backups

 3   Facebook’s Backup solution

 4   Results

 5   Further Work
INTRODUCTION TO HDFS
What is HDFS?

▪  Distributed file system
▪  Runs on top of commodity hardware
▪  Scales to petabytes of data
▪  Tolerates machine failures
HDFS Data Model

▪  Data is logically organized into files and directories
▪  Files are divided into uniform-sized blocks
▪  Blocks are distributed across the nodes of the cluster and are replicated to handle hardware failure
▪  HDFS keeps checksums of data for corruption detection and recovery
▪  HDFS exposes block placement so that computation can be migrated to data
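The block model above can be illustrated with a minimal sketch. It assumes a toy round-robin placement policy and hypothetical node names (`dn1` to `dn4`); real HDFS placement is rack-aware and decided by the Namenode, so this only shows the idea of fixed-size blocks with several replicas each.

```python
def split_into_blocks(file_size, block_size):
    """Return the sizes of the blocks a file of file_size bytes is divided into.

    All blocks have block_size bytes except possibly the last one.
    """
    full, rest = divmod(file_size, block_size)
    return [block_size] * full + ([rest] if rest else [])


def place_replicas(num_blocks, nodes, replication):
    """Toy round-robin placement (an assumption, not the HDFS policy).

    Each block gets `replication` distinct nodes, provided that
    replication <= len(nodes).
    """
    placement = {}
    for block in range(num_blocks):
        placement[block] = [nodes[(block + r) % len(nodes)] for r in range(replication)]
    return placement


blocks = split_into_blocks(300, 128)
placement = place_replicas(len(blocks), ["dn1", "dn2", "dn3", "dn4"], 2)
```

With a replication factor of 2, losing any single node still leaves one copy of every block in this toy layout.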
HDFS Data Model (2)

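MetaInfo (Filename, replicationFactor, block-ids, …)
/users/user1/data/part-0, repl:2, ids: {1,3}, …
/users/user1/data/part-1, repl:3, ids: {2,4,5}, …

[Diagram: Block Storage. Each block is stored on as many datanodes as its file's replication factor: blocks 1 and 3 twice, blocks 2, 4 and 5 three times.]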
HDFS Architecture
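[Diagram: HDFS architecture]
▪  Namenode holds the metadata (Name, #replicas, …), e.g. /users/foo/data, 3, …
▪  Clients send metadata ops to the Namenode
▪  The Namenode sends block ops to the Datanodes
▪  Clients read blocks from and write blocks to the Datanodes
▪  Blocks are replicated between Datanodes across Rack 1 and Rack 2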
INTRODUCTION TO HBASE
HBase in a nutshell

§  distributed, large-scale data store

§  can host very large tables, billions of rows x millions of columns

§  efficient at random reads/writes

§  open source project modeled after Google’s BigTable
HBase Data Model
•  An HBase table is:
 •    a sparse, three-dimensional array of cells, indexed by:
       RowKey, ColumnKey, Timestamp/Version
 •    sharded into regions along an ordered RowKey space
•  Within each region:
 •    Data is grouped into column families
 •    Sort order within each column family:
      •  Row Key (asc), Column Key (asc), Timestamp (desc)
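The sort order within a column family can be sketched with a plain sort key. This is a simplification: the cell tuples and values are made up for illustration, and Python compares strings by code point, whereas HBase compares raw bytes.

```python
# Each cell is (row key, column key, timestamp, value); the data is illustrative.
cells = [
    ("row2", "a", 5, "v1"),
    ("row1", "b", 7, "v2"),
    ("row1", "a", 3, "old"),
    ("row1", "a", 9, "new"),
]


def sort_key(cell):
    row, column, timestamp, _value = cell
    # Row key ascending, column key ascending, timestamp descending.
    return (row, column, -timestamp)


ordered = sorted(cells, key=sort_key)
```

Because timestamps sort in descending order, the newest version of a cell is the first one a reader meets for a given row and column.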
HBase System Overview
[Diagram: HBase system overview]
▪  Database layer (HBase): Master, Backup Master, Region Servers
▪  Storage layer (HDFS): Namenode, Secondary Namenode, Datanodes
▪  Coordination service: Zookeeper Quorum (ZK Peers)
HBase Overview
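[Diagram: HBase Region Server]
▪  A Region Server hosts regions (Region #1, Region #2, ...)
▪  Each region holds column families (ColumnFamily #1, ColumnFamily #2, ...)
▪  Each column family has a Memstore (in-memory data structure) that is flushed to HFiles (in HDFS)
▪  Write Ahead Log (in HDFS)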
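The write path in this overview (Write Ahead Log, Memstore, flush to HFiles) can be mimicked with a toy in-memory class. `ToyColumnFamilyStore` and its `flush_threshold` parameter are hypothetical names for illustration; nothing here is HBase's actual implementation, and the lists merely stand in for files in HDFS.

```python
class ToyColumnFamilyStore:
    """Toy model of one column family: WAL + Memstore + immutable HFiles."""

    def __init__(self, flush_threshold):
        self.wal = []          # stands in for the Write Ahead Log in HDFS
        self.memstore = {}     # in-memory data structure
        self.hfiles = []       # each flush appends one immutable, sorted file
        self.flush_threshold = flush_threshold

    def put(self, key, value):
        self.wal.append((key, value))   # log first, so the write survives a crash
        self.memstore[key] = value
        if len(self.memstore) >= self.flush_threshold:
            self.flush()

    def flush(self):
        if self.memstore:
            self.hfiles.append(tuple(sorted(self.memstore.items())))
            self.memstore = {}

    def get(self, key):
        if key in self.memstore:
            return self.memstore[key]
        for hfile in reversed(self.hfiles):   # newest file first
            for k, v in hfile:
                if k == key:
                    return v
        return None


store = ToyColumnFamilyStore(flush_threshold=2)
store.put("a", 1)
store.put("b", 2)   # reaches the threshold and triggers a flush
store.put("a", 3)   # newer value stays in the Memstore for now
```

The sketch shows why a backup that only copies HFiles misses recent writes: the latest value of `a` lives only in the Memstore and the log until the next flush.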
INTRODUCTION TO HBASE BACKUPS
Why Backups?

▪  Data corruption
▪  Operational error
▪  Hardware failures
▪  Disaster
HBase Backups – The Problem

▪  Need a consistent, point-in-time backup
▪  Issues:
 ▪    Live cluster, with traffic
 ▪    Data in MemStore
 ▪    Flushes and compactions in the background
 ▪    Regionserver death
 ▪    Regions moving
CURRENT OPTIONS – Export Table
▪  Pros:
 ▪    Can export part of a table or the full table
 ▪    Map-Reduce job writes the data to the output path provided
 ▪    Supports start time, end time and versions, so it could provide a consistent backup
 ▪    Can specify which Column Families to export
▪  Cons:
 ▪    Only one table at a time
 ▪    Full scans and random reads
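Why a time range and a version limit can yield a consistent view is easier to see in a small sketch. The function `export_view` and its parameters are hypothetical, and the half-open interval (start inclusive, end exclusive) is an assumption for illustration; this is not the Export tool's code.

```python
from collections import defaultdict


def export_view(cells, start_ts, end_ts, max_versions):
    """Keep cells with start_ts <= timestamp < end_ts, newest versions first,
    at most max_versions per (row, column)."""
    by_column = defaultdict(list)
    for row, column, ts, value in cells:
        if start_ts <= ts < end_ts:
            by_column[(row, column)].append((ts, value))

    result = []
    for row, column in sorted(by_column):
        newest_first = sorted(by_column[(row, column)], reverse=True)
        for ts, value in newest_first[:max_versions]:
            result.append((row, column, ts, value))
    return result


cells = [
    ("r1", "a", 10, "x"),
    ("r1", "a", 20, "y"),
    ("r1", "a", 30, "z"),   # written after the cut-off below
    ("r2", "a", 15, "w"),
]
snapshot = export_view(cells, start_ts=0, end_ts=25, max_versions=1)
```

Writes that arrive after the chosen end time (here timestamp 30) are ignored, so the result does not change while the live cluster keeps taking traffic. The cost is still a full scan of the table.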
CURRENT OPTIONS - Copy Table
▪  Tool to copy an existing table within a cluster or to another cluster
▪  Pros:
 ▪    Provides a parallel, replicated setup to switch over to
 ▪    Supports start time, end time, and versions
 ▪    The destination cluster can have a different setup
 ▪    Can specify which Column Families to copy
▪  Cons:
 ▪    Must keep another HBase cluster up and ready
 ▪    Full scans and random reads
Facebook’s Backup Solution
Backups V1
[Diagram: the Application issues Put A to HBase; two arrows labeled Log(Put A) lead to the Backup Cluster, which runs Dedup and Verify steps.]
Backups V1 – Pros and Cons
▪  Pros:
 ▪    Simple solution
 ▪    Consistency in backup
 ▪    Point-in-time restore
 ▪    Verification of backups
▪  Cons:
 ▪    Requires replay of a large number of transactions
 ▪    Requires double writes and deduplication
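The dedup and replay steps behind these pros and cons can be sketched as follows. The slides do not say how entries are keyed, so the transaction id (`txn_id`) and the entry layout are assumptions made for illustration only.

```python
def dedup(log_entries):
    """Drop repeated log entries, keeping the first copy of each txn_id."""
    seen = set()
    unique = []
    for txn_id, payload in log_entries:
        if txn_id not in seen:
            seen.add(txn_id)
            unique.append((txn_id, payload))
    return unique


def restore(unique_entries, until_txn_id):
    """Point-in-time restore: replay every put up to and including until_txn_id."""
    table = {}
    for txn_id, (row, value) in sorted(unique_entries):
        if txn_id > until_txn_id:
            break
        table[row] = value
    return table


# Double writes mean the same put can show up twice in the backup log.
log = [
    (1, ("r1", "a")),
    (1, ("r1", "a")),
    (2, ("r2", "b")),
    (3, ("r1", "c")),
    (2, ("r2", "b")),
]
unique = dedup(log)
state_at_2 = restore(unique, until_txn_id=2)
```

The restore loop touches every logged transaction up to the target point, which is why replay time grows with the amount of traffic since the start of the log.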
Backups V2

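[Diagram: a Mapper sends Flush Region and Get File List requests to the RegionServer; the RegionServer flushes to HDFS; the Mapper copies the files in HDFS, where .regioninfo is also shown.]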
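The three steps in the Backups V2 diagram (flush the region, get the file list, copy the files) can be walked through on a local temporary directory. Everything here is a stand-in: the directory layout, the file names `hfile-1` and `hfile-2`, and the `toy_flush` callback are hypothetical, and the local filesystem replaces HDFS.

```python
import os
import shutil
import tempfile


def list_files(region_dir):
    """Return every file under region_dir as a sorted list of relative paths."""
    found = []
    for root, _dirs, names in os.walk(region_dir):
        for name in names:
            found.append(os.path.relpath(os.path.join(root, name), region_dir))
    return sorted(found)


def backup_region(region_dir, backup_dir, flush):
    flush()                          # 1. flush so in-memory data reaches disk
    files = list_files(region_dir)   # 2. get the file list after the flush
    for rel in files:                # 3. copy each file to the backup location
        dst = os.path.join(backup_dir, rel)
        os.makedirs(os.path.dirname(dst), exist_ok=True)
        shutil.copy2(os.path.join(region_dir, rel), dst)
    return files


with tempfile.TemporaryDirectory() as tmp:
    region = os.path.join(tmp, "region1")
    backup = os.path.join(tmp, "backup", "region1")
    os.makedirs(os.path.join(region, "cf1"))
    with open(os.path.join(region, ".regioninfo"), "w") as f:
        f.write("region metadata")
    with open(os.path.join(region, "cf1", "hfile-1"), "w") as f:
        f.write("already flushed data")

    def toy_flush():
        # Stand-in for the region flush: pending data becomes a new file.
        with open(os.path.join(region, "cf1", "hfile-2"), "w") as f:
            f.write("data that was still in memory")

    copied = backup_region(region, backup, toy_flush)
    backed_up = list_files(backup)
```

Taking the file list only after the flush is what lets the copy include data that was still in memory when the backup started.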
Backups V2 – Tuning


▪  Locality-based mappers
▪  Use in-rack replication
▪  Increase .Trash retention for HDFS
▪  Fault tolerant
▪  Use Backups V1 for point-in-time restore
Backups V2 – Restore



▪  Rewrite the backed-up .regioninfo
▪  Move the backup copy in place
▪  Add regions to .META using .regioninfo
Backups V2 – Pros and Cons
▪  Pros:
 ▪    Faster restore
 ▪    Backs up the entire data set in hours
 ▪    Consistency in backup
 ▪    Point-in-time restore
 ▪    Resilient to RS death, region moves
▪  Cons:
 ▪    Affects production cluster
 ▪    Not scalable with data growth
Backups V2 – HDFS Improvements


▪  Overhead of copying large files
▪  Use locality of data
▪  HFiles in HDFS are immutable
▪  HDFS blocks are immutable
▪  Hardlinks at block level!
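The reason hardlinks make copying immutable blocks cheap can be shown on a local filesystem. This sketch uses `os.link` on a temporary file as a stand-in for a block file on a datanode disk; the names `blk_1` and `fast_copy_block` are hypothetical, and it assumes a filesystem that supports hard links.

```python
import os
import tempfile


def fast_copy_block(src_block, dst_block):
    """'Copy' a block by creating a second directory entry for the same data."""
    os.link(src_block, dst_block)


with tempfile.TemporaryDirectory() as d:
    src = os.path.join(d, "blk_1")
    dst = os.path.join(d, "blk_1_copy")
    with open(src, "wb") as f:
        f.write(b"x" * 1024)

    fast_copy_block(src, dst)

    # Both names point at the same inode, so no data was duplicated.
    same_inode = os.stat(src).st_ino == os.stat(dst).st_ino
    link_count = os.stat(src).st_nlink

    # Removing the original name leaves the data reachable through the copy.
    os.remove(src)
    with open(dst, "rb") as f:
        survived = f.read() == b"x" * 1024
```

Sharing data between two names is only safe because the block is never modified afterwards, which is the role the immutability bullets play in the list above.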
Fast Copy workflow
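[Diagram: Fast Copy workflow]
▪  The source file consists of blocks B1, B2, …; the destination file of blocks B1', B2', …
▪  The FastCopy Client talks to the NameNode: Get Source, Create Destination, Add Block
▪  The FastCopy Client sends Copy Block requests to the datanodes
▪  On each of Data Node 1, Data Node 2 and Data Node 3, B1' and B2' sit next to B1 and B2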
FastCopy – Pros and Cons

▪  Pros:
 ▪    Extremely fast
 ▪    Large space savings
 ▪    Minimal impact on the production cluster
▪  Cons:
 ▪    NameNode is not aware of the hardlinks
 ▪    Hardlinks are lost on datanode death
 ▪    Balancer is not aware of the hardlinks
Operations


▪  Messages use case:
 ▪    3 stages (same cluster, off cluster, off data center)
 ▪    Stage 1: once a day
 ▪    Stage 2: once every 10 days
 ▪    Stage 3: once every 10 days
 ▪    Retention based on capacity
Results
Backup Numbers

Example:
▪    40 TB table
▪    49 mappers
▪    Normal Copy: 15 hours
▪    Fast Copy: 1.5 hours
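The arithmetic behind these numbers is short enough to spell out. The only inputs are the figures on the slide; the "rates" are effective figures for the whole job, since Fast Copy creates hardlinks rather than moving bytes.

```python
table_tb = 40        # table size from the slide
normal_hours = 15    # Normal Copy duration
fast_hours = 1.5     # Fast Copy duration

speedup = normal_hours / fast_hours            # how many times faster Fast Copy is
normal_rate_tb_per_hour = table_tb / normal_hours
fast_rate_tb_per_hour = table_tb / fast_hours  # effective rate, not bytes moved

print(f"speedup: {speedup:.0f}x")
print(f"normal copy: {normal_rate_tb_per_hour:.2f} TB/hour")
print(f"fast copy (effective): {fast_rate_tb_per_hour:.2f} TB/hour")
```

So the example amounts to a tenfold reduction in backup time for the same 40 TB table.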
Disk Savings - FastCopy

[Chart: disk usage in percent]
Network Traffic - FastCopy
Further Work

▪    Backup HLogs
▪    Point-in-time backups
▪    Namenode-level hardlinks

▪    Code and JIRAs:
     ▪    HBASE-4618
     ▪    HDFS code on GitHub (https://github.com/facebook/hadoop-20)
Acknowledgements

▪    Madhuwanti Vaidya

▪    Ryan Thiessen

▪    Karthik Ranganathan

▪    Paul Tuckfield

▪    Kannan Muthukkaruppan

▪    Hairong Kuang

▪    Dhruba Borthakur

▪    Amitanand Aiyer

▪    Mikhail Bautin
Questions?
(c) 2009 Facebook, Inc. or its licensors. "Facebook" is a registered trademark of Facebook, Inc. All rights reserved.

Source: Facebook's HBase Backups - StampedeCon 2012

Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Paola De la Torre
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...gurkirankumar98700
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsRoshan Dwivedi
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 

Último (20)

Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 

Facebook's HBase Backups - StampedeCon 2012

  • 2. HBASE Backups Pritam Damania Software Engineer, Facebook Aug 1, 2012
  • 3. Agenda 1 Introduction to HBASE and HDFS 2 Introduction to HBASE Backups 3 Facebook’s Backup solution 4 Results 5 Further Work
  • 5. What is HDFS ? ▪  Distributed FileSystem ▪  Runs on top of commodity hardware ▪  Scale to Petabytes of data ▪  Tolerates machine failures
  • 6. HDFS Data Model ▪  Data is logically organized into files and directories ▪  Files are divided into uniform-sized blocks ▪  Blocks are distributed across the nodes of the cluster and are replicated to handle hardware failure ▪  HDFS keeps checksums of data for corruption detection and recovery ▪  HDFS exposes block placement so that computation can be migrated to data
  • 7. HDFS Data Model (2) MetaInfo(Filename, replicationFactor, block-ids, …) /users/user1/data/part-0, repl:2, ids: {1,3}, … /users/user1/data/part-1, repl:3, ids: {2,4,5}, … [Diagram: Block Storage, showing the replicas of blocks 1 to 5 spread across datanodes]
  • 8. HDFS Architecture [Diagram: Clients send metadata ops to the Namenode, which holds metadata (Name, #replicas, …), e.g. /users/foo/data, 3, …; the Namenode sends block ops to the Datanodes; Clients read from and write to Datanodes; blocks are replicated between Datanodes in Rack 1 and Rack 2]
  • 10. HBase in a nutshell §  distributed, large-scale data store §  can host very large tables, billions of rows x millions of columns §  efficient at random reads/writes §  open source project modeled after Google’s BigTable
  • 11. HBase Data Model •  An HBase table is: •  a sparse, three-dimensional array of cells, indexed by: RowKey, ColumnKey, Timestamp/Version •  sharded into regions along an ordered RowKey space •  Within each region: •  Data is grouped into column families ▪  Sort order within each column family: •  Row Key (asc), Column Key (asc), Timestamp (desc)
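The sort order on the HBase Data Model slide can be illustrated with a small sketch. This is a toy model in plain Python, not HBase code: the `Cell` tuple, its field names, and the sample values are assumptions made up for the illustration.

```python
from typing import NamedTuple


class Cell(NamedTuple):
    # Hypothetical stand-in for one cell; not an HBase class.
    row: bytes
    column: bytes
    timestamp: int
    value: bytes


def sort_key(cell: Cell):
    # Row key ascending, column key ascending, timestamp descending.
    # Negating the timestamp puts the newest version of a cell first.
    return (cell.row, cell.column, -cell.timestamp)


cells = [
    Cell(b"row2", b"cf:a", 5, b"v3"),
    Cell(b"row1", b"cf:b", 7, b"v2"),
    Cell(b"row1", b"cf:a", 3, b"old"),
    Cell(b"row1", b"cf:a", 9, b"new"),
]

ordered = sorted(cells, key=sort_key)
for c in ordered:
    print(c.row.decode(), c.column.decode(), c.timestamp, c.value.decode())
```

With this ordering, a reader scanning one row and column meets the most recent version first, which is why a "latest value" lookup can stop at the first match.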
  • 12. HBase System Overview [Diagram: Database Layer: HBASE Master, Backup Master, Region Servers ...; Storage Layer: HDFS (Namenode, Secondary Namenode, Datanodes ...); Coordination Service: Zookeeper Quorum (ZK Peers ...)]
  • 13. HBase Overview [Diagram: an HBASE Region Server hosts Regions (#1, #2, ...); each Region holds ColumnFamilies (#1, #2, ...); each ColumnFamily has a Memstore (in memory data structure) that is flushed to HFiles (in HDFS); the Write Ahead Log is also kept in HDFS]
  • 15. Why Backups ? ▪  Data Corruption ▪  Operational error ▪  Hardware failures ▪  Disaster
  • 16. HBase Backups – The Problem ▪  Need a consistent, point-in-time backup ▪  Issues : ▪  Live cluster, with traffic ▪  Data in MemStore ▪  Flushes and Compactions in the background ▪  Regionserver death ▪  Regions moving
  • 17. CURRENT OPTIONS – Export Table ▪  Pros : ▪  Can export part or full table ▪  Map-Reduce job downloads data to output path provided ▪  Supports start time, end time and versions so could provide a consistent backup ▪  Can specify which Column Families to export ▪  Cons : ▪  Only one table at a time ▪  Full scans and random reads
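Why a start time, end time, and version limit can yield a consistent backup is easier to see with a toy example. The sketch below is plain Python over an in-memory list, not the Export tool itself; the function name `export_cells`, the tuple layout, and the half-open `[start_time, end_time)` range are assumptions of this sketch.

```python
from collections import defaultdict


def export_cells(cells, start_time, end_time, max_versions):
    """Keep cells with start_time <= timestamp < end_time (assumed half-open),
    and at most max_versions newest versions per (row, column)."""
    by_coord = defaultdict(list)
    for row, column, ts, value in cells:
        if start_time <= ts < end_time:
            by_coord[(row, column)].append((ts, value))

    exported = []
    for (row, column), versions in sorted(by_coord.items()):
        versions.sort(reverse=True)  # newest timestamp first
        for ts, value in versions[:max_versions]:
            exported.append((row, column, ts, value))
    return exported


# Toy table: (row, column, timestamp, value)
table = [
    ("r1", "cf:a", 100, "v1"),
    ("r1", "cf:a", 200, "v2"),
    ("r1", "cf:a", 300, "v3"),  # written after the chosen cutoff
    ("r2", "cf:a", 150, "x1"),
]

snapshot = export_cells(table, start_time=0, end_time=250, max_versions=1)
print(snapshot)
```

Because every write after the cutoff is ignored, repeated runs with the same end time see the same data even while the cluster keeps taking traffic. The price, as the slide notes, is a full scan of the table.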
  • 18. CURRENT OPTIONS - Copy Table ▪  Tool to copy an existing table within a cluster or to another cluster ▪  Pros : ▪  Provides another parallel, replicated setup to switch to ▪  Supports start time, end time, and versions ▪  Cluster being copied to could have a different setup ▪  Can specify which Column Families to export ▪  Cons : ▪  Must keep another HBASE cluster up and ready ▪  Full scans and random reads
  • 20. Backups V1 Log(Put A) Application Backup Cluster Log(Put A) Put A Dedup HBase Verify
  • 21. Backups V1 – Pros and Cons ▪  Pros : ▪  Simple solution ▪  Consistency in backup ▪  Point in time restore ▪  Verification of backups ▪  Cons : ▪  Requires replay of large amount of transactions ▪  Requires double writes and deduplication
  • 22. Backups V2 [Diagram labels: Mapper, RegionServer, HDFS, Flush Region, Flush, Get File List, Copy Files, .regioninfo]
  • 23. Backups V2 – Tuning ▪  Locality based mappers ▪  Use in rack replication ▪  Increase .Trash retention for HDFS ▪  Fault tolerant ▪  Use Backups V1 for point in time
  • 24. Backups V2 – Restore ▪  Rewrite backed up .regioninfo ▪  Move backup copy in place ▪  Add regions to .META using .regioninfo
  • 25. Backups V2 – Pros and Cons ▪  Pros : ▪  Faster restore ▪  Backup entire data in hours ▪  Consistency in backup ▪  Point in time restore ▪  Resilient to RS death, region moves ▪  Cons : ▪  Affects production cluster ▪  Not scalable with data growth
  • 26. Backups V2 – HDFS Improvements ▪  Overhead of copying large files ▪  Use locality of data ▪  HDFS HFiles are immutable ▪  HDFS blocks are immutable ▪  Hardlinks at block level!
  • 27. Fast Copy workflow [Diagram: the FastCopy Client talks to the NameNode (Get Source, Create Destination, Add Block) and issues Copy Block; source blocks B1, B2, … map to destination blocks B1’, B2’, …, and each pair (B1/B1’, B2/B2’) is held on DataNode1, DataNode2 and DataNode3]
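The hard-link idea behind FastCopy can be tried out on a local filesystem. The sketch below is only an analogy: it uses ordinary files and `os.link` on one machine, not HDFS or the FastCopy client, and the file names (`blk_1` and so on) are made up. It assumes a filesystem that supports hard links.

```python
import os
import shutil
import tempfile


def normal_copy(src, dst):
    # Reads and rewrites every byte: cost grows with file size.
    shutil.copyfile(src, dst)


def fast_copy(src, dst):
    # Adds a second directory entry for the same data: no bytes are moved.
    # Safe only because the file is never modified in place.
    os.link(src, dst)


workdir = tempfile.mkdtemp()
block = os.path.join(workdir, "blk_1")
with open(block, "wb") as f:
    f.write(b"x" * 1024)

copied = os.path.join(workdir, "blk_1_copy")
linked = os.path.join(workdir, "blk_1_link")
normal_copy(block, copied)
fast_copy(block, linked)

same_inode = os.stat(block).st_ino == os.stat(linked).st_ino
separate_inode = os.stat(block).st_ino != os.stat(copied).st_ino
link_count = os.stat(block).st_nlink
print(same_inode, separate_inode, link_count)

# Deleting the original name does not delete the data the link points to.
os.remove(block)
with open(linked, "rb") as f:
    survived = f.read() == b"x" * 1024
print(survived)

shutil.rmtree(workdir)
```

The linked name shares storage with the original, which is where the space saving comes from. It also shows why immutability matters: if the original could be edited in place, the "backup" would change with it.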
  • 28. FastCopy – Pros and Cons ▪  Pros : ▪  Extremely fast ▪  Lots of space saving ▪  Minimal impact to production cluster ▪  Cons : ▪  NameNode not aware ▪  Hardlinks lost on datanode death ▪  Balancer not aware.
  • 29. Operations ▪  Messages Use Case : ▪  3 stages (same cluster, off cluster, off data center) ▪  Stage 1 : once per day ▪  Stage 2 : once every 10 days ▪  Stage 3 : once every 10 days ▪  Retention based on capacity
  • 31. Backup Numbers Example : ▪  40 TB table ▪  49 Mappers ▪  Normal Copy – 15 hours ▪  Fast Copy – 1.5 hours
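The figures on the Backup Numbers slide can be turned into rates with a few lines of arithmetic. The input numbers come from the slide; the derived quantities (speedup, TB per hour, TB per mapper) are computed here for illustration and assume the work is spread evenly over the mappers.

```python
TABLE_TB = 40        # table size from the slide
MAPPERS = 49         # number of mappers from the slide
NORMAL_HOURS = 15    # normal copy duration
FAST_HOURS = 1.5     # FastCopy duration

speedup = NORMAL_HOURS / FAST_HOURS
normal_rate_tb_per_hour = TABLE_TB / NORMAL_HOURS
fast_rate_tb_per_hour = TABLE_TB / FAST_HOURS
per_mapper_tb = TABLE_TB / MAPPERS  # assumes an even split across mappers

print(f"speedup: {speedup:.1f}x")
print(f"normal copy: {normal_rate_tb_per_hour:.2f} TB/hour")
print(f"fast copy: {fast_rate_tb_per_hour:.2f} TB/hour")
print(f"data per mapper: {per_mapper_tb:.2f} TB")
```

The FastCopy rate is an effective rate: hard-linked blocks are not physically copied, so the figure reflects how quickly the backup completes rather than bytes moved over disk or network.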
  • 32. Disk Savings - FastCopy Disk usage in percent
  • 33. Network Traffic - FastCopy
  • 35. Further Work ▪  Backup HLogs ▪  Point in time backups ▪  Namenode level Hard links ▪  Code and JIRAs : ▪  HBASE 4618 ▪  HDFS code in github (https://github.com/facebook/hadoop-20)
  • 36. Acknowledgements ▪  Madhuwanti Vaidya ▪  Ryan Thiessen ▪  Karthik Ranganathan ▪  Paul Tuckfield ▪  Kannan Muthukkaruppan ▪  Hairong Kuang ▪  Dhruba Borthakur ▪  Amitanand Aiyer ▪  Mikhail Bautin
  • 38. (c) 2009 Facebook, Inc. or its licensors. "Facebook" is a registered trademark of Facebook, Inc. All rights reserved.