HDFS (Hadoop Distributed File System)
Thiru

Agenda
 Typical work flow
 Writing a file into HDFS
 Reading a file from HDFS
 Rack awareness
 Planning for a cluster
 Q&A

Hadoop Server Roles
 Client: the Hadoop Client machine
 Masters
    HDFS: {Name Node}, {Secondary Name Node}
    Map Reduce: {Job Tracker}
 Slaves: every slave machine runs both a Data Node (HDFS) and a
  Task Tracker (Map Reduce)

Hadoop Cluster
 Rack layout: the top slot of the first four racks hosts the Name Node,
  Job Tracker, Secondary NN, and Hadoop Client respectively; every
  remaining slot in every rack is a slave running DN + TT
  (Data Node + Task Tracker).

Sample HDFS Workflow
 Write data into the cluster (HDFS)
 Analyze the data (Map Reduce)
 Store the result into the cluster (HDFS)
 Read the result from the cluster (HDFS)

Sample scenario: how many times did customers call customer care enquiring
about a recently launched product? Compare it against the AD campaign on
television, correlate both, and find the best time to run the AD.
Data flow: CRM data entry → SQOOP → HDFS → Map Reduce → Result

Write data into cluster (HDFS)
 Client: "I want to write file.txt, file size 200 MB."
 Name Node: "Ok! Block size is 64 MB. Split the file into four blocks
  (200 MB / 64 MB rounds up to 4) and write the first block to data
  nodes 1, 4, 5."
 The cycle, repeated for every block:
   1. Client consults the Name Node
   2. Client writes the block's data to one data node
   3. That data node replicates the block as per the replication factor
      and intimates the Name Node

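To make the client side of this flow concrete, here is a minimal write
sketch against the classic (1.x-era) Hadoop FileSystem API; the name node
URI, output path, and the dfs.block.size value are illustrative
assumptions, not details from the slides:

    // Minimal HDFS write sketch (Hadoop 1.x-era API); URI/path are invented.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import java.net.URI;

    public class HdfsWrite {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // 64 MB block size, matching the slide (1.x property name).
            conf.setLong("dfs.block.size", 64L * 1024 * 1024);
            FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000"), conf);
            // The client streams bytes to the first data node of each
            // block's pipeline; that node forwards them to the replicas.
            try (FSDataOutputStream out = fs.create(new Path("/data/file.txt"))) {
                out.writeBytes("hello hdfs\n");
            }
            fs.close();
        }
    }

Note the division of labour the slide describes: the client only talks to
the Name Node for metadata and to one data node per block; replication
happens data node to data node.
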
Rack Awareness
 Twelve data nodes (DN 1-12) sit in three racks; replicas of blocks A, B
  and C are spread so that no single rack holds all copies of a block.
 The Name Node learns the topology from a rack awareness script, e.g.:
    Rack 1: data node 1, data node 2, ...
    Rack 2: data node 5, ...
 Never lose data when a rack is down
 Keep bulky flows within a rack when possible
 Assumption: in-rack links have higher bandwidth and lower latency than
  links between racks

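The mapping itself is pluggable. Below is a hedged sketch of a custom
mapping, assuming the Hadoop 1.x org.apache.hadoop.net.DNSToSwitchMapping
interface (later versions add more methods) and an invented
datanode1..datanode12 naming scheme; production clusters more commonly
point topology.script.file.name at a script instead:

    // Illustrative two-rack mapping matching the slide's layout.
    import org.apache.hadoop.net.DNSToSwitchMapping;
    import java.util.ArrayList;
    import java.util.List;

    public class TwoRackMapping implements DNSToSwitchMapping {
        @Override
        public List<String> resolve(List<String> names) {
            List<String> racks = new ArrayList<String>();
            for (String host : names) {
                // Assumption: datanode1..4 live in rack 1, the rest in rack 2.
                if (host.matches("datanode[1-4]")) {
                    racks.add("/rack1");
                } else {
                    racks.add("/rack2");
                }
            }
            return racks;
        }
    }
    // Wired in via the 1.x property topology.node.switch.mapping.impl.
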
Multi Block Replication
 The Hadoop Client writes File.txt (file size 200 MB) as blocks A, B, C.
 For block A the client writes to data node 1; the Name Node answers
  "Replicate in 3, 8", so copies of A also land on data nodes 3 and 8.
 The same consult-write-replicate exchange happens independently for
  blocks B and C.

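The replication factor is tracked per file, so it can also be changed
after a write. A minimal sketch using the standard
FileSystem.setReplication call, with an illustrative path; the name node
only records the new target and schedules data nodes to copy or drop
replicas:

    // Raise the replica count of one file to three (path is invented).
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class SetReplication {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            // Ask for three replicas of every block of File.txt.
            fs.setReplication(new Path("/data/File.txt"), (short) 3);
            fs.close();
        }
    }
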
Name Node
 Data nodes send heartbeats
 Every 10th heartbeat is a block report
 The Name Node builds its metadata from block reports
 If the Name Node is down, HDFS is down
 Missing heartbeats signify lost nodes:
    the Name Node consults its metadata and finds the affected data
    the Name Node consults the rack awareness script
    the Name Node tells data nodes to re-replicate

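Both cadences are ordinary configuration knobs. A hedged sketch using the
Hadoop 1.x property names; the values shown are the stock defaults, not
numbers from the deck:

    // Heartbeat / block-report intervals (Hadoop 1.x property names).
    import org.apache.hadoop.conf.Configuration;

    public class HeartbeatConf {
        public static void main(String[] args) {
            Configuration conf = new Configuration();
            // Seconds between data node heartbeats (stock default: 3).
            conf.setLong("dfs.heartbeat.interval", 3);
            // Milliseconds between full block reports (stock default: 1 hour).
            conf.setLong("dfs.blockreport.intervalMsec", 60L * 60 * 1000);
        }
    }
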
Name Node & Secondary Name Node
 File system metadata, e.g.:
    File.txt = A0 {1,5,7}
               A1 {1,7,9}
               A2 {5,10,15}
 Secondary Name Node to primary: "It's been 1 hr, give your data."
 Not a hot standby for the Name Node* (a hot standby needs ZooKeeper)
 Connects to the Name Node every one hour* (configurable)
 Housekeeping, backup of Name Node metadata
 Saved metadata can be used to rebuild the Name Node

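The one-hour cadence is the 1.x checkpoint settings; a hedged sketch
(property names from the Hadoop 1.x defaults, values are the stock ones):

    // Secondary name node checkpoint cadence (Hadoop 1.x property names).
    import org.apache.hadoop.conf.Configuration;

    public class CheckpointConf {
        public static void main(String[] args) {
            Configuration conf = new Configuration();
            // Pull edits/fsimage every 3600 seconds by default.
            conf.setLong("fs.checkpoint.period", 3600);
            // Also checkpoint early once the edit log passes this size (bytes).
            conf.setLong("fs.checkpoint.size", 64L * 1024 * 1024);
        }
    }
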
Understanding Secondary Name Node housekeeping
 1. The primary Name Node keeps two files: the edits log and fsimage.
 2. At checkpoint time the Secondary Name Node pulls both, replays edits
    into fsimage, and writes the merged result as Fsimage.ckpt; the
    primary meanwhile records new changes in edits-new.
 3. Fsimage.ckpt is copied back to the primary, where it replaces
    fsimage, and edits-new is rolled over to become the new edits.

Reading data from HDFS cluster
 Client: "I want to read file file.txt."
 Name Node: "Ok! File.txt = blck a {1,5,6}, blck b {8,1,2},
  blck c {5,8,9}."
 1. Client consults the Name Node
 2. Client receives a data node list for each block
 3. Client picks the first node of each list
 4. Client reads the data sequentially, block by block

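A minimal read sketch mirroring those four steps: fetch the per-block host
lists from the name node, then stream the file back (the client library
picks a data node per block). The URI and path are illustrative
assumptions:

    // Inspect block locations, then read the file (URI/path invented).
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import java.net.URI;

    public class HdfsRead {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000"),
                                           new Configuration());
            Path file = new Path("/data/file.txt");

            // Steps 1-2: consult the name node; one host list per block.
            FileStatus status = fs.getFileStatus(file);
            for (BlockLocation loc : fs.getFileBlockLocations(status, 0, status.getLen())) {
                System.out.println(loc.getOffset() + " -> "
                        + String.join(",", loc.getHosts()));
            }

            // Steps 3-4: read the blocks back sequentially.
            try (FSDataInputStream in = fs.open(file)) {
                byte[] buf = new byte[4096];
                int n;
                while ((n = in.read(buf)) > 0) {
                    System.out.write(buf, 0, n);
                }
            }
            fs.close();
        }
    }
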
Choosing right hardware
 The master node is a single point of failure, so for it:
    dual power supply for redundancy
    no commodity hardware
    regular data backup
 # tasks per node: 1 core can run ~1.5 mappers or reducers
 RAM thumb rule: 1 GB per million blocks of data

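A toy calculation applying the two thumb rules; all input numbers are
illustrative assumptions, not recommendations:

    // Apply the sizing thumb rules to an invented 100 TB workload.
    public class SizingRules {
        public static void main(String[] args) {
            long storedBytes = 100L * 1024 * 1024 * 1024 * 1024; // 100 TB
            long blockBytes  = 64L * 1024 * 1024;                // 64 MB blocks
            long blocks      = storedBytes / blockBytes;         // ~1.6 M blocks
            double nameNodeGb = blocks / 1_000_000.0;            // 1 GB per M blocks
            System.out.printf("blocks=%d, name node RAM ~ %.1f GB%n",
                              blocks, nameNodeGb);

            int coresPerNode = 8;                                // 2 quad-core CPUs
            double slots = coresPerNode * 1.5;                   // mapper/reducer slots
            System.out.println("task slots per node ~ " + slots);
        }
    }
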
Practice at Yahoo!
 HDFS clusters at Yahoo! include about 3500 nodes
 A typical cluster node has:
   · 2 quad-core Xeon processors @ 2.5 GHz
   · Red Hat Enterprise Linux Server Release 5.1
   · Sun Java JDK 1.6.0_13-b03
   · 4 directly attached SATA drives (one terabyte each)
   · 16 GB RAM
   · 1-gigabit Ethernet

Practice at Yahoo!
 70 percent of the disk space is allocated to HDFS. The remainder is
  reserved for the operating system (Red Hat Linux), logs, and space to
  spill the output of map tasks. (MapReduce intermediate data are not
  stored in HDFS.)
 For each cluster, the NameNode and the BackupNode hosts are specially
  provisioned with up to 64 GB RAM; application tasks are never assigned
  to those hosts.
 In total, a cluster of 3500 nodes has 9.8 PB of storage available as
  blocks that are replicated three times, yielding a net 3.3 PB of
  storage for user applications. As a convenient approximation, one
  thousand nodes represent one PB of application storage.

Practice at Yahoo!
 Durability of data
    Uncorrelated node failures: replicating data three times is a
     robust guard against loss of data due to uncorrelated node failures.
    Correlated node failures (failure of a rack or core switch): HDFS
     can tolerate losing a rack switch, since each block has a replica
     on some other rack.
    Loss of electrical power to the cluster: a large cluster will lose
     a handful of blocks during a power-on restart.

Practice at Yahoo!
 Benchmarks (figures omitted)
 NameNode throughput benchmark

Future work
 Automated failover
    Plan: use ZooKeeper, Yahoo's distributed consensus technology, to
     build an automated failover solution
 Scalability of the NameNode
    Solution: the near-term solution to scalability is to allow
     multiple namespaces (and NameNodes) to share the physical storage
     within a cluster
    Drawback: the main drawback of multiple independent namespaces is
     the cost of managing them

Thank you


Editor's Notes
  1. Yahoo! adopted Hadoop for internal use at the end of 2006, so the
     data is a little bit out of date.
  2. Why are MapReduce intermediate data not stored in HDFS? First, HDFS
     reads and writes are expensive. Second, if intermediate data is
     lost, the computation can be re-run on the corresponding data node
     to regenerate it; it is deterministic.
  3. For row two: failure is no longer uncommon. For example, if one
     machine in a thousand dies, then among ten thousand machines ten
     may die, which is a lot. So a replication factor of at least two is
     necessary.
  4. Ops/s is operations per second.
  5. Again, out of date.