The Google File System
Published By:
Sanjay Ghemawat,
Howard Gobioff,
Shun-Tak Leung
Google

Presented By:
Manoj Samaraweera (138231B)
Azeem Mumtaz (138218R)
University of Moratuwa
Contents
•   Distributed File Systems
•   Introducing Google File System
•   Design Overview
•   System Interaction
•   Master Operation
•   Fault Tolerance and Diagnosis
•   Measurements and Benchmarks
•   Experience
•   Related Works
•   Conclusion
•   Reference
Distributed File Systems
• Enables programs to store and access remote
  files exactly as they do local ones
• New modes of data organization on disk or
  across multiple servers
• Goals
 ▫   Performance
 ▫   Scalability
 ▫   Reliability
 ▫   Availability
Introducing Google File System
• Growing demand for Google data processing
• Properties
 ▫   A scalable distributed file system
 ▫   For large distributed data-intensive applications
 ▫   Fault tolerance
 ▫   Inexpensive commodity hardware
 ▫   High aggregated performance
• Design is driven by observation of workload and
  technological environment
Design Assumptions
• Component failures are the norm
  ▫ Commodity Hardware
• Files are huge by traditional standards
  ▫ Multi-GB files
  ▫ Small files must also be supported,
      but the system is not optimized for them
• Read Workloads
  ▫ Large streaming reads
  ▫ Small random reads
• Write Workloads
  ▫ Large, sequential writes that append data to file
• Multiple clients concurrently append to one file
  ▫ Consistency Semantics
  ▫ Files are used as producer-consumer queues or for many-way merging
• High sustained bandwidth is more important than low latency
Design Interface
•   Typical File System Interface
•   Hierarchical Directory Organization
•   Files are identified by pathnames
•   Operations
    ▫ Create, delete, open, close, read, write
Architecture (1/2)
• Files are divided into chunks
• Fixed-size chunks (64MB)
• Unique 64-bit chunk handles
  ▫ Immutable and globally unique
• Chunks as Linux files
• Replicated over chunkservers, called replicas
  ▫ 3 replicas by default
  ▫ Different replication levels for different regions of the file namespace
• Single master
• Multiple chunkservers
  ▫ Grouped into Racks
  ▫ Connected through switches
• Multiple clients
• Master/chunkserver coordination
  ▫ HeartBeat Messages
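The architecture above implies that a client translates a byte offset in a file into a chunk index before contacting the master. A minimal sketch of that translation, assuming a hypothetical ask_master() lookup; CHUNK_SIZE and the function names are illustrative, not the real GFS API.

```python
# Illustrative sketch: translating a file offset into a chunk index.
# ask_master() is a hypothetical stub standing in for the master RPC.
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB fixed-size chunks

def locate(file_name, offset, ask_master):
    """Return (chunk_handle, replica_locations, offset_within_chunk)."""
    chunk_index = offset // CHUNK_SIZE      # which chunk of the file
    chunk_offset = offset % CHUNK_SIZE      # position inside that chunk
    handle, replicas = ask_master(file_name, chunk_index)
    return handle, replicas, chunk_offset
```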
Architecture (2/2)
Single Master
• Maintains Metadata
• Controls System Wide Activities
 ▫   Chunk lease management
 ▫   Garbage collection
 ▫   Chunk migration
 ▫   Replication
Chunk Size (1/2)
• 64 MB
• Stored as plain Linux file on a chunkserver
• Advantages
 ▫ Reduces clients' interaction with the single master
 ▫ Clients are likely to perform many operations on
   a large chunk
     Reduces network overhead by keeping a persistent
      TCP connection with the chunkserver
 ▫ Reduces the size of the metadata
    Keep metadata in memory
 ▫ Lazy Space Allocation
Chunk Size (2/2)
• Disadvantages
 ▫ Small files consisting of small chunks may
   become hot spots
 ▫ Solutions
    Higher replication factor
    Stagger application start time
    Allow clients to read from other clients
Metadata (1/5)
• 3 Major Types
 ▫ The file and chunk namespace
 ▫ File-to-chunk mappings
 ▫ The locations of each chunk's replicas
• Namespaces and mappings
 ▫ Persisted by logging mutations to an operation log
   stored on the master's local disk
 ▫ Operation log is replicated
Metadata (2/5)
• Metadata is stored in memory
 ▫ Improves the performance of the master
 ▫ Makes it easy to scan the entire state of the metadata
   periodically
    Chunk garbage collection
    Re-replication in the presence of chunkserver failure
    Chunk migration to balance load and disk space
• Less than 64 bytes of metadata per 64 MB chunk
• File namespace data requires < 64 bytes per file
 ▫ Prefix compression
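A back-of-the-envelope check of the memory cost implied by the figures above (roughly 64 bytes of chunk metadata per 64 MB chunk); the helper name and the example workload size are illustrative, not from the slides.

```python
# Rough estimate of master memory needed for chunk metadata,
# using the ~64 bytes per 64 MB chunk figure above.
CHUNK_SIZE = 64 * 1024 ** 2      # 64 MB
BYTES_PER_CHUNK_META = 64        # approximate metadata per chunk

def metadata_bytes(total_file_bytes):
    chunks = -(-total_file_bytes // CHUNK_SIZE)   # ceiling division
    return chunks * BYTES_PER_CHUNK_META

# ~1 PB of file data needs on the order of 1 GB of chunk metadata in memory.
print(metadata_bytes(1024 ** 5) / 1024 ** 3)      # -> 1.0 (GB)
```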
Metadata (3/5)
• Chunk location information
 ▫ Polled from chunkservers at master startup
     Chunkservers join and leave the cluster
 ▫ Kept up to date afterwards via HeartBeat messages
   from chunkservers
Metadata (4/5)
• Operation Logs
 ▫ Historical record of critical metadata changes
 ▫ Logical timeline that defines the order of
   concurrent operations
 ▫ Not visible to clients until the log record is
   replicated and flushed to disk
 ▫ Flushing and replication are done in batches
     Reduces the impact on system throughput
Metadata (5/5)
• Operation Logs
 ▫ By replaying the operation log, the master recovers its
   file system state
 ▫ Checkpoints
     Keep the operation log from growing beyond a
      threshold
     Checkpointing runs in a separate thread to avoid
      delaying incoming mutations
 ▫ A checkpoint is a compact B-tree-like structure
     Directly mapped into memory and used for
      namespace lookup
     No extra parsing
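A minimal sketch of recovery by checkpoint plus log replay as described on the two Operation Logs slides above; the record format and function names are assumptions for illustration, not GFS's actual on-disk format.

```python
# Illustrative recovery: load the latest checkpoint, then replay the
# operation log records that follow it, in order.
def recover(checkpoint, log_records):
    namespace = dict(checkpoint)                 # e.g. path -> list of chunk handles
    for op in log_records:                       # records logged after the checkpoint
        if op["type"] == "create":
            namespace[op["path"]] = []
        elif op["type"] == "delete":
            namespace.pop(op["path"], None)
        elif op["type"] == "append_chunk":
            namespace[op["path"]].append(op["chunk_handle"])
    return namespace
```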
Consistency Model (1/3)


• Guarantees by GFS
  ▫ File namespace mutations (e.g. file creation) are atomic
     Namespace management and locking guarantees atomicity and
      correctness
     The master’s operation log
  ▫ After a sequence of successful mutations, the mutated file is
    guaranteed to be defined and contain the data written by
    the last mutation. This is obtained by
      Applying mutations to a chunk in the same order on all replicas
      Using chunk version numbers to detect stale replicas
Consistency Model (2/3)
• Relaxed consistency model
• Two types of mutations
  ▫ Writes
      Cause data to be written at an application-specified file offset
  ▫ Record Appends
      Cause data to be appended atomically at least once
      Offset chosen by GFS, not by the client
• States of a file region after a mutation
  ▫ Consistent
      All clients see the same data, regardless of which replica they read from
  ▫ Inconsistent
      Clients see different data at different times
  ▫ Defined
      consistent and all clients see what the mutation writes in its entirety
  ▫ Undefined
      consistent but it may not reflect what any mutation has written
Consistency Model (3/3)
• Implication for Applications
 ▫ Relying on appends rather than on overwrites
 ▫ Checkpointing
    to verify how much data has been successfully
     written
 ▫ Writing self-validating records
    Checksums to detect and remove padding
 ▫ Writing Self-identifying records
    Unique Identifiers to identify and discard duplicates
Lease & Mutation Order
• Master uses leases to maintain a consistent
  mutation order among replicas
• The primary is the chunkserver that is granted a
  chunk lease
 ▫ Master delegates the authority to order mutations
 ▫ All other replicas are secondary replicas
• The primary defines a serial order for all
  mutations
 ▫ Secondary replicas follow this order
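A minimal sketch of how the master could track leases and pick a primary; the GFS paper's default lease timeout is about 60 seconds, but the data structure and method names here are assumptions.

```python
import time

LEASE_DURATION = 60.0   # seconds, per the GFS paper's default

class LeaseManager:
    """Illustrative master-side lease table: chunk_handle -> (primary, expiry)."""
    def __init__(self):
        self.leases = {}

    def primary_for(self, chunk_handle, replicas):
        """Return the current primary, granting a new lease if none is live."""
        primary, expiry = self.leases.get(chunk_handle, (None, 0.0))
        if primary is None or time.time() >= expiry:
            primary = replicas[0]   # any live replica could be chosen
            self.leases[chunk_handle] = (primary, time.time() + LEASE_DURATION)
        return primary
```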
Writes (1/7)
• Step 1
  ▫ Client asks the master which chunkserver holds
    the current lease for the chunk
  ▫ And for the locations of the secondary replicas
Writes (2/7)
• Step 2
  ▫ Master replies with the identities of the primary
    and secondary replicas
  ▫ Clients cache this data for future mutations, until
      Primary is unreachable, or
      Primary no longer holds the lease
Writes (3/7)
• Step 3
  ▫ Client pushes the data to all replicas
  ▫ Each chunkserver stores the data in an internal
    LRU buffer cache
Writes (4/7)
• Step 4
  ▫ Client sends a write request to the primary
  ▫ Primary assigns consecutive serial numbers to
    the mutations
      Serialization
  ▫ Primary applies the mutations to its own state
Writes (5/7)
• Step 5
  ▫ Primary forwards the write request to all
    secondary replicas
  ▫ Each secondary applies mutations in the same
    serial order
Writes (6/7)
• Step 6
  ▫ Secondary replicas reply to the primary after
    completing the mutation
Writes (7/7)
• Step 7
  ▫ Primary replies to the client
  ▫ Steps 3 to 7 are retried in case of errors
    (a client-side sketch follows below)
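A minimal client-side sketch of the seven write steps above, assuming hypothetical RPC stubs (find_lease_holder, push_data, apply_write, apply_write_in_order); this is an illustration of the control flow, not the real GFS client library.

```python
# Client-side view of a write, following the seven steps above.
def write(master, file_name, chunk_index, data):
    # Steps 1-2: learn primary and secondaries (clients cache this).
    primary, secondaries = master.find_lease_holder(file_name, chunk_index)
    replicas = [primary] + secondaries
    for attempt in range(3):
        # Step 3: push data to all replicas (held in LRU buffer caches).
        for r in replicas:
            r.push_data(data)
        # Step 4: ask the primary to apply the write; it assigns a serial number.
        serial = primary.apply_write(data)
        # Steps 5-6: primary forwards the request; secondaries apply it in order
        # and acknowledge.
        ok = all(s.apply_write_in_order(serial, data) for s in secondaries)
        # Step 7: primary replies to the client; on error, retry steps 3-7.
        if ok:
            return True
    return False
```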
Data Flow (1/2)
• Decoupled control flow and data flow
• Data is pushed linearly along a chain of
  chunkservers in a pipelined fashion
 ▫ Fully utilizes each machine's outbound network bandwidth
• Distance is accurately estimated from IP
  addresses
• Minimize latency by pipelining the data
  transmission over TCP
Data Flow (2/2)
• Ideal elapsed time for transmitting B bytes to R
  replicas:
  ▫ B/T + RL
      T – network throughput
      L – latency between two machines
• At Google (see the worked example below):
  ▫ T = 100 Mbps
  ▫ L < 1 ms
  ▫ 1 MB can ideally be distributed in about 80 ms
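A small worked example of the B/T + RL estimate above, using the figures quoted for Google's network; the helper function is illustrative.

```python
# Ideal pipelined transfer time: B/T + R*L
# (data volume over link throughput, plus one latency per replica hop).
def ideal_transfer_seconds(num_bytes, replicas, throughput_bps, latency_s):
    return num_bytes * 8 / throughput_bps + replicas * latency_s

# 1 MB to 3 replicas over 100 Mbps links with 1 ms hops -> ~0.083 s,
# i.e. roughly the 80 ms figure quoted above.
print(ideal_transfer_seconds(1_000_000, 3, 100e6, 1e-3))
```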
Record Append
• In traditional writes
  ▫ Clients specify the offset at which the data is to be written
  ▫ Concurrent writes to the same region are not serializable
• In record append
  ▫ Client specifies only the data
  ▫ Otherwise similar to writes
  ▫ GFS appends the data to the file at least once atomically,
    at an offset of GFS's choosing (sketch below)
      The chunk is padded if appending the record would exceed the
       maximum chunk size
      If a record append fails at any replica, the client retries
       the operation, which can leave duplicate records
      File regions may be defined interspersed with inconsistent regions
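A minimal sketch of the primary-side decision in a record append: pad the chunk and make the caller retry on the next chunk when the record does not fit, otherwise append at an offset chosen by GFS rather than by the client. The Chunk class and function names are toy stand-ins, not the real chunkserver code.

```python
CHUNK_SIZE = 64 * 1024 * 1024   # 64 MB

class Chunk:
    """Toy in-memory stand-in for a chunk replica."""
    def __init__(self):
        self.data = bytearray()

    @property
    def used(self):
        return len(self.data)

def record_append(chunk, record):
    """Return the offset chosen for the record, or None if the record
    does not fit and the caller must retry on the next chunk."""
    if chunk.used + len(record) > CHUNK_SIZE:
        chunk.data.extend(b"\0" * (CHUNK_SIZE - chunk.used))   # pad the chunk
        return None
    offset = chunk.used
    chunk.data.extend(record)   # the same offset is used on all replicas
    return offset
```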
Snapshot (1/2)
• Goals
 ▫ To quickly create branch copies of huge data sets
 ▫ To easily checkpoint the current state
• Copy-on-write technique
 ▫ Master receives the snapshot request
 ▫ Revokes outstanding leases on the chunks of the files to be snapshotted
 ▫ Master logs the operation to disk
 ▫ Applies the log record to its in-memory state by duplicating
   the metadata for the source file or directory tree
 ▫ The newly created snapshot files point to the same chunks as the source files
Snapshot (2/2)
• After the snapshot operation
 ▫ A client sends a request to the master to find the
   current lease holder of a chunk C
 ▫ Master notices that the reference count for chunk C is > 1
 ▫ Master picks a new chunk handle C'
 ▫ Master asks each chunkserver holding a replica of C to create
   a new chunk C'
 ▫ Master grants one of the replicas a lease on the
   new chunk C' and replies to the client (sketch below)
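A minimal sketch of the copy-on-write bookkeeping described on the two Snapshot slides: metadata is duplicated at snapshot time, and a chunk is only physically copied the first time it is written after the snapshot. The dictionaries and names are illustrative, not the master's actual data structures.

```python
# Illustrative copy-on-write metadata on the master.
files = {"/home/user/data": ["C1", "C2"]}     # path -> chunk handles
refcount = {"C1": 1, "C2": 1}                 # chunk handle -> reference count

def snapshot(src, dst):
    files[dst] = list(files[src])             # duplicate metadata only
    for handle in files[dst]:
        refcount[handle] += 1                 # chunks are now shared

def write_to(path, index, new_handle_factory):
    """Return the chunk handle to mutate, copying the chunk first if shared."""
    handle = files[path][index]
    if refcount[handle] > 1:                  # shared because of a snapshot
        new_handle = new_handle_factory()     # e.g. "C1'", created on the same
        refcount[handle] -= 1                 # chunkservers that hold C1
        refcount[new_handle] = 1
        files[path][index] = new_handle
    return files[path][index]                 # the lease is granted on this chunk
```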
Contents
 •   Distributed File Systems
 •   Introducing Google File System
 •   Design Overview
 •   System Interaction
 •   Master Operation
 •   Fault Tolerance and Diagnosis
 •   Measurements and Benchmarks
 •   Experience
 •   Related Works
 •   Conclusion
 •   Reference
Master Operation

•   Namespace Management and Locking
•   Replica Placement
•   Creation, Re-replication, Rebalancing
•   Garbage Collection
•   Stale Replica Detection
Namespace Management and Locking

• Each master operation acquires a set of locks
  before it runs

• Example: prevents /home/user/foo from being created while
  /home/user is being snapshotted to /save/user (see the sketch below)
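A minimal sketch of the lock pattern behind the example above: read locks on every ancestor directory plus a write (or read) lock on the leaf, so creating /home/user/foo conflicts with snapshotting /home/user. This is an illustration, not the master's actual lock implementation.

```python
# Locks needed for an operation on `path`: read locks on all ancestors,
# and a write lock on the leaf for mutations.
def locks_for(path, leaf_write):
    parts = path.strip("/").split("/")
    ancestors = ["/" + "/".join(parts[:i]) for i in range(1, len(parts))]
    leaf = "/" + "/".join(parts)
    return [(p, "read") for p in ancestors] + \
           [(leaf, "write" if leaf_write else "read")]

# Snapshotting /home/user write-locks /home/user, so the read lock that
# creating /home/user/foo needs on /home/user cannot be acquired concurrently.
print(locks_for("/home/user/foo", leaf_write=True))
print(locks_for("/home/user", leaf_write=True))
```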
Replica Placement

• Chunk replica placement policy serves two
  purposes:
 ▫ Maximize data reliability and availability.
 ▫ Maximize network bandwidth utilization
Creation, Re-replication, Rebalancing

• Creation
  ▫ Want to place new replicas on chunkservers with
    below-average disk space utilization
  ▫ Limit the number of “recent” creations on each
    chunkserver
  ▫ Spread replicas of a chunk across racks.
• Re-replication
  ▫ Starts as soon as the number of replicas falls below a user-specified goal
• Rebalancing
  ▫ Moves replicas for better disk space and load
    balancing
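A minimal sketch of the creation-time placement heuristics listed above: prefer chunkservers with below-average disk utilization, limit recent creations per server, and spread replicas across racks. The data shape and thresholds are assumptions for illustration.

```python
# Each server record: {"name": ..., "rack": ..., "util": ..., "recent": ...}
def pick_replicas(servers, want=3, max_recent=5):
    """Choose chunkservers for a new chunk's replicas."""
    avg_util = sum(s["util"] for s in servers) / len(servers)
    candidates = sorted(
        (s for s in servers if s["util"] <= avg_util and s["recent"] < max_recent),
        key=lambda s: s["util"])
    chosen, racks = [], set()
    for s in candidates:                 # first pass: at most one replica per rack
        if s["rack"] not in racks:
            chosen.append(s)
            racks.add(s["rack"])
        if len(chosen) == want:
            return chosen
    for s in candidates:                 # second pass: fill up if racks are scarce
        if s not in chosen:
            chosen.append(s)
        if len(chosen) == want:
            break
    return chosen
```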
Garbage Collection

• Mechanism
 ▫ Master logs the deletion immediately.
 ▫ File is just renamed to a hidden name.
 ▫ Removes any such hidden files if they have existed
   for more than three days.
 ▫ In a regular scan of the chunk namespace, master
   identifies orphaned chunks and erases the
   metadata for those chunks.
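A minimal sketch of the lazy deletion mechanism above: deletion renames the file to a hidden name stamped with the deletion time, and a background scan removes hidden files older than three days. The naming scheme and structures are illustrative.

```python
import time

HIDDEN_TTL = 3 * 24 * 3600   # three days, per the default policy above
namespace = {}               # path -> file metadata (illustrative)

def delete(path):
    meta = namespace.pop(path)
    namespace[f".deleted/{path}/{int(time.time())}"] = meta   # hidden rename

def gc_scan(now=None):
    now = now or time.time()
    for hidden in list(namespace):
        if hidden.startswith(".deleted/"):
            deleted_at = int(hidden.rsplit("/", 1)[1])
            if now - deleted_at > HIDDEN_TTL:
                del namespace[hidden]    # metadata erased; orphaned chunks are
                                         # reclaimed via HeartBeat exchanges
```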
Stale Replica Detection

• Chunk version number to distinguish between
  up-to-date and stale replicas.
• Master removes stale replicas in its regular
  garbage collection.
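A minimal sketch of version-based staleness checking: the master records the latest version number for each chunk, and replicas reporting an older version are flagged as stale and later removed by the regular garbage collection. The structures are assumptions for illustration.

```python
# master_version: chunk_handle -> latest chunk version number on the master.
def find_stale(master_version, reported):
    """`reported` is a list of (chunkserver, chunk_handle, version) from HeartBeats."""
    stale = []
    for server, handle, version in reported:
        if version < master_version.get(handle, 0):
            stale.append((server, handle))   # removed in regular garbage collection
    return stale
```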
Fault Tolerance and Diagnosis

• High Availability
  ▫ Fast Recovery
     Master and the chunkserver are designed to restore their
      state and start in seconds.
  ▫ Chunk Replication
     master clones existing replicas as needed to keep each
      chunk fully replicated
  ▫ Master Replication
      The master state is replicated for reliability
      Operation log and checkpoints are replicated on multiple
       machines
      “Shadow masters” provide read-only access to the file system
       even when the primary master is down
Fault Tolerance and Diagnosis (2)

• Data Integrity
 ▫ Each chunkserver uses checksumming to detect
   corruption of stored data.
 ▫ Each chunk is broken up into 64 KB blocks, each with a
   corresponding 32-bit checksum
 ▫ Checksum computation is heavily optimized for
   writes that append to the end of a chunk
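A minimal sketch of per-block checksumming as described above: each 64 KB block carries a 32-bit checksum that is verified before data is returned to the reader. CRC-32 from Python's zlib stands in here; it is not necessarily the checksum function GFS actually uses.

```python
import zlib

BLOCK = 64 * 1024   # 64 KB checksum blocks

def checksums(chunk_data):
    """Compute a 32-bit checksum for every 64 KB block of a chunk."""
    return [zlib.crc32(chunk_data[i:i + BLOCK])
            for i in range(0, len(chunk_data), BLOCK)]

def verified_read(chunk_data, stored_sums, offset, length):
    """Verify every block overlapping the read range before returning data."""
    first, last = offset // BLOCK, (offset + length - 1) // BLOCK
    for b in range(first, last + 1):
        if zlib.crc32(chunk_data[b * BLOCK:(b + 1) * BLOCK]) != stored_sums[b]:
            raise IOError(f"checksum mismatch in block {b}")   # reported to master
    return chunk_data[offset:offset + length]
```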
Fault Tolerance and Diagnosis (3)

• Diagnostic Tools
 ▫ Extensive and detailed diagnostic logging helps in
   problem isolation, debugging, and performance
   analysis
 ▫ GFS servers generate diagnostic logs that record
   many significant events and all RPC requests and
   replies
Measurements and Benchmarks
• Micro-benchmarks
  ▫ GFS cluster consisting of one master, two master replicas, 16
    chunkservers, and 16 clients
Measurements and Benchmarks (2)

• Real-World Clusters
  ▫ Cluster A is used regularly for research and development
  ▫ Cluster B is primarily used for production data processing
Measurements and Benchmarks (3)
Experience
• Biggest problems were disk and Linux related.
 ▫ Many of the disks claimed to the Linux driver that they
   supported a range of IDE protocol versions but in fact
   responded reliably only to the more recent ones

 ▫ Despite occasional problems, the availability of Linux
   code has helped to explore and understand system
   behavior.
Related Works (1/3)
• Both GFS & AFS provides a location independent
  namespace
  ▫ data to be moved transparently for load balance
  ▫ fault tolerance
• Unlike AFS, GFS spreads a file’s data across
  storage servers in a way more akin to xFS and Swift
  in order to deliver aggregate performance and
  increased fault tolerance
• GFS currently uses replication for redundancy and
  consumes more raw storage than xFS or Swift.
Related Works (2/3)
• In contrast to systems like AFS, xFS, Frangipani,
  and Intermezzo, GFS does not provide any caching
  below the file system interface.
• GFS uses a centralized approach in order to
  simplify the design, increase its reliability, and gain
  flexibility
  ▫ unlike Frangipani, xFS, Minnesota’s GFS and GPFS
  ▫ Makes it easier to implement sophisticated chunk
    placement and replication policies since the master
    already has most of the relevant information and
    controls how it changes.
Related Works (3/3)
• GFS delivers aggregate performance by focusing on
  the needs of Google's applications rather than building a
  POSIX-compliant file system, unlike Lustre
• The NASD architecture is based on network-attached
  disk drives, whereas GFS uses commodity machines
  as chunkservers
• GFS chunkservers use lazily allocated fixed-size
  chunks, whereas NASD uses variable-length objects
• The producer-consumer queues enabled by atomic
  record appends address a similar problem as the
  distributed queues in River
  ▫ River uses memory-based queues distributed across
    machines
Conclusion
• GFS demonstrates the qualities essential for
  supporting large-scale data processing
  workloads on commodity hardware.
• Provides fault tolerance by constant
  monitoring, replicating crucial data, and fast
  and automatic recovery
• Delivers high aggregate throughput to many
  concurrent readers and writers performing a
  variety of tasks
Reference
• Ghemawat, S., Gobioff, H., and Leung, S.-T. 2003. The
  Google file system. In Proceedings of the
  Nineteenth ACM Symposium on Operating
  Systems Principles (SOSP '03). ACM, New York,
  NY, USA, 29-43.
• Coulouris, G., Dollimore, J., and Kindberg, T. 2005.
  Distributed Systems: Concepts and Design (4th
  Edition). Addison-Wesley Longman Publishing
  Co., Inc., Boston, MA, USA.
Thank You

Speaker Notes

  1. Fast recovery and replication: the master and chunkservers do not distinguish between normal and abnormal termination; shadow masters are shadows, not mirrors, in that they may lag the primary slightly.
  2. It is impractical to detect corruption by comparing replicas across chunkservers; when a write overwrites an existing range, the checksums of the first and last blocks are compared.
  3. Diagnostic logs helped immeasurably in problem isolation, debugging, and performance analysis with minimal cost; they record chunkservers going up and down, and RPC logs include the exact requests and responses sent on the wire.
  4. N clients append to a single file.