The Google File System
(GFS)

   Sanjay Ghemawat, Howard Gobioff,
   and Shun-Tak Leung
Introduction

   Design constraints
       Component failures are the norm
           1000s of components
           Bugs, human errors, failures of memory, disk,
            connectors, networking, and power supplies
           Monitoring, error detection, fault tolerance, automatic
            recovery
       Files are huge by traditional standards
           Multi-GB files are common
           Billions of objects
Introduction

   Design constraints
       Most modifications are appends
           Random writes are practically nonexistent
           Many files are written once, and read sequentially
       Two types of reads
           Large streaming reads
           Small random reads (in the forward direction)
       Sustained bandwidth more important than latency
       File system APIs are open to changes
Interface Design

   Not POSIX compliant
   Additional operations
       Snapshot
       Record append
Architectural Design

   A GFS cluster
       A single master
       Multiple chunkservers per master
           Accessed by multiple clients
       Running on commodity Linux machines
   A file
       Represented as fixed-sized chunks
           Labeled with 64-bit unique global IDs
           Stored at chunkservers
            3-way mirrored across chunkservers
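GFS itself is not open source, so the structure implied by this slide can only be sketched. The following is a minimal, hypothetical Python sketch (the names Chunk, FileEntry, and the chunkserver addresses are invented) of a file represented as fixed-size chunks, each with a 64-bit handle and a list of replica locations.

```python
from dataclasses import dataclass, field
from typing import List

CHUNK_SIZE = 64 * 1024 * 1024           # 64 MB, as discussed later in the deck

@dataclass
class Chunk:
    handle: int                         # 64-bit globally unique chunk ID
    replicas: List[str] = field(default_factory=list)   # chunkserver addresses

@dataclass
class FileEntry:
    path: str
    chunks: List[Chunk] = field(default_factory=list)   # ordered chunk list

# A 200 MB file spans ceil(200 / 64) = 4 chunks, each replicated on 3 chunkservers.
f = FileEntry("/logs/web-0001", [
    Chunk(handle=1000 + i, replicas=["cs1", "cs2", "cs3"]) for i in range(4)
])
print(len(f.chunks), f.chunks[0].replicas)   # 4 ['cs1', 'cs2', 'cs3']
```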
Architectural Design (2)

     [Diagram: the application's GFS client asks the GFS master for chunk
      locations, then reads and writes chunk data directly with the GFS
      chunkservers, each of which stores chunks on its local Linux file system]
Architectural Design (3)

   Master server
       Maintains all metadata
           Name space, access control, file-to-chunk mappings,
            garbage collection, chunk migration
   GFS clients
       Consult the master for metadata
       Access data directly from chunkservers
       Do not go through the Linux VFS layer
       No file data caching at clients or chunkservers,
        since streaming workloads gain little from caching
Single-Master Design

   Simple
   Master answers only chunk locations
   A client typically asks for multiple chunk
    locations in a single request
   The master also predictively provides chunk
    locations for the chunks immediately following
    those requested
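A minimal sketch of the batched, prefetching lookup described above, assuming an invented in-memory master_state layout and a hypothetical locate_chunks helper; the actual RPC format is not given in the slides.

```python
# Hypothetical request/response shape for the client <-> master metadata lookup.
# A client asks for several chunk indices of a file in one request, and the
# master answers those plus a few indices that follow, so a sequential reader
# rarely needs to contact the master again.

def locate_chunks(master_state, path, first_index, count, prefetch=2):
    """Return {chunk index: (chunk handle, [replica addresses])}."""
    chunks = master_state[path]                    # ordered list of chunk records
    last = min(first_index + count + prefetch, len(chunks))
    return {
        i: (chunks[i]["handle"], chunks[i]["replicas"])
        for i in range(first_index, last)
    }

# Toy in-memory master state for one file with six chunks:
master_state = {
    "/logs/web-0001": [
        {"handle": 100 + i, "replicas": ["cs1", "cs2", "cs3"]} for i in range(6)
    ]
}
print(locate_chunks(master_state, "/logs/web-0001", first_index=0, count=2))
# locations for chunks 0..3: the 2 requested plus 2 prefetched
```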
Chunk Size

   64 MB
   Fewer chunk location requests to the master
   Reduced overhead to access a chunk
   Fewer metadata entries
       Kept in memory
- Potential internal fragmentation with small files
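A small worked example of what the 64 MB chunk size buys: translating a byte offset into a (chunk index, offset-within-chunk) pair, and counting how few location entries a large sequential read needs. The helper name is illustrative.

```python
CHUNK_SIZE = 64 * 1024 * 1024       # 64 MB

def chunk_coordinates(byte_offset):
    """Translate a file byte offset into (chunk index, offset within chunk)."""
    return byte_offset // CHUNK_SIZE, byte_offset % CHUNK_SIZE

print(chunk_coordinates(200 * 1024 * 1024))   # (3, 8388608)

# A 1 GB sequential read touches only 1 GB / 64 MB = 16 chunks, so the client
# needs at most 16 chunk-location entries from the master (fewer still with
# batching and prefetching).
print((1 << 30) // CHUNK_SIZE)                # 16
```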
Metadata

   Three major types
       File and chunk namespaces
       File-to-chunk mappings
       Locations of a chunk’s replicas
Metadata

   All kept in memory
       Fast!
       Quick global scans
           Garbage collections
           Reorganizations
       64 bytes per 64 MB of data
       Prefix compression
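The ~64 bytes of metadata per 64 MB chunk quoted above makes the in-memory design easy to sanity-check; a back-of-envelope calculation (treating the 64-byte figure as exact and ignoring namespace entries):

```python
BYTES_PER_CHUNK_META = 64
CHUNK_SIZE = 64 * 1024 * 1024

def chunk_metadata_bytes(data_bytes):
    chunks = -(-data_bytes // CHUNK_SIZE)     # ceiling division
    return chunks * BYTES_PER_CHUNK_META

petabyte = 1 << 50
print(chunk_metadata_bytes(petabyte) / (1 << 30), "GiB")   # 1.0 GiB for 1 PiB of data
```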
Chunk Locations

   No persistent state for chunk locations
       Polls chunkservers at startup
       Use heartbeat messages to monitor servers
       Simplicity
       On-demand approach vs. coordination
            On-demand wins when changes (failures) are frequent
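A toy sketch, with invented message shapes, of why not persisting chunk locations stays simple: the master just folds chunkserver heartbeat reports into an in-memory map and drops servers whose heartbeats stop.

```python
from collections import defaultdict

locations = defaultdict(set)      # chunk handle -> set of chunkserver IDs
last_heartbeat = {}               # chunkserver ID -> timestamp of last heartbeat

def on_heartbeat(server_id, reported_handles, now):
    """Fold a chunkserver's report into the in-memory location map."""
    last_heartbeat[server_id] = now
    for handle in reported_handles:
        locations[handle].add(server_id)

def drop_dead_servers(now, timeout=60.0):
    """Forget servers whose heartbeats stopped; they re-report on restart."""
    dead = [s for s, t in last_heartbeat.items() if now - t > timeout]
    for s in dead:
        del last_heartbeat[s]
        for replicas in locations.values():
            replicas.discard(s)

on_heartbeat("cs1", [101, 102], now=0.0)
on_heartbeat("cs2", [101], now=5.0)
drop_dead_servers(now=120.0)      # both time out; the map is rebuilt as they report again
print(dict(locations))            # {101: set(), 102: set()}
```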
Operation Logs

   Metadata updates are logged
       e.g., <old value, new value> pairs
       Log replicated on remote machines
   Take global snapshots (checkpoints) to
    truncate logs
       Memory mapped (no serialization/deserialization)
       Checkpoints can be created while updates arrive
   Recovery
       Latest checkpoint + subsequent log files
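A minimal recovery sketch under assumed formats (a JSON checkpoint plus a JSON-lines operation log; the real on-disk formats are not specified here): load the latest checkpoint, then replay only the log records written after it.

```python
import json, os

def recover(checkpoint_path, log_path):
    """Rebuild master metadata: latest checkpoint + replay of newer log records."""
    metadata, start_seq = {}, 0
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            snap = json.load(f)                   # {"metadata": {...}, "last_seq": N}
        metadata, start_seq = snap["metadata"], snap["last_seq"]
    if os.path.exists(log_path):
        with open(log_path) as f:
            for line in f:
                rec = json.loads(line)            # {"seq": N, "path": ..., "new": ...}
                if rec["seq"] > start_seq:        # skip records the checkpoint covers
                    metadata[rec["path"]] = rec["new"]
    return metadata
```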
Consistency Model

   Relaxed consistency
       Concurrent changes are consistent but undefined
       An append is atomically committed at least once
           Occasional duplicate records are possible
   All changes to a chunk are applied in the
    same order to all replicas
   Use version number to detect missed
    updates
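A tiny illustration of the version-number idea: the master tracks the current version per chunk and bumps it whenever it grants a new lease, so any replica reporting an older number is flagged as stale. The table contents below are made up.

```python
master_version = {"chunk-101": 7}
replica_version = {"cs1": 7, "cs2": 7, "cs3": 6}     # cs3 was down during an update

def stale_replicas(handle):
    current = master_version[handle]
    return [s for s, v in replica_version.items() if v < current]

print(stale_replicas("chunk-101"))   # ['cs3'] -> not used for reads, reclaimed later
```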
System Interactions

   The master grants a chunk lease to one replica (the primary)
   The primary determines the order of updates
    applied at all replicas
   Lease
       60 second timeouts
       Can be extended indefinitely
       Extension requests are piggybacked on heartbeat
        messages
       After a timeout expires, the master can grant new
        leases
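A toy version of the master-side lease bookkeeping described above (class and method names are invented): grant a 60-second lease to one replica, renew it when an extension request arrives, and re-grant only after the old lease expires.

```python
import time

LEASE_SECONDS = 60.0

class LeaseTable:
    """Toy master-side lease bookkeeping: one primary per chunk at a time."""

    def __init__(self):
        self.leases = {}                         # chunk handle -> (primary, expiry)

    def grant_or_get(self, handle, candidate_primary, now=None):
        now = time.monotonic() if now is None else now
        primary, expiry = self.leases.get(handle, (None, 0.0))
        if expiry > now:
            return primary                       # an unexpired lease stays valid
        self.leases[handle] = (candidate_primary, now + LEASE_SECONDS)
        return candidate_primary

    def extend(self, handle, primary, now=None):
        """Handle an extension request piggybacked on a heartbeat."""
        now = time.monotonic() if now is None else now
        holder, expiry = self.leases.get(handle, (None, 0.0))
        if holder == primary and expiry > now:
            self.leases[handle] = (primary, now + LEASE_SECONDS)

table = LeaseTable()
print(table.grant_or_get("chunk-101", "cs1", now=0.0))    # cs1 becomes the primary
print(table.grant_or_get("chunk-101", "cs2", now=30.0))   # still cs1: lease not expired
print(table.grant_or_get("chunk-101", "cs2", now=90.0))   # cs2: previous lease expired
```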
Data Flow

   Separation of control and data flows
       Avoid network bottleneck
   Updates are pushed linearly among replicas
   Pipelined transfers
   13 MB/second with 100 Mbps network
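The bandwidth figure follows from simple arithmetic: with a linear, pipelined push each link carries the data only once, so the cost is roughly the transfer time over one link plus a small per-hop latency. A quick check with an assumed 1 ms hop latency:

```python
def transfer_seconds(nbytes, replicas, link_bps=100e6, hop_latency_s=1e-3):
    """Ideal time to push nbytes through a pipelined chain of `replicas` hops."""
    return nbytes * 8 / link_bps + replicas * hop_latency_s

t = transfer_seconds(1_000_000, replicas=3)
print(f"{t * 1000:.0f} ms per MB -> about {1 / t:.1f} MB/s")   # ~12 MB/s on 100 Mbps
```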
Snapshot

   Copy-on-write approach
       Revoke outstanding leases
       New updates are logged while taking the
        snapshot
       Commit the log to disk
       Apply the log to a copy of the metadata
       A chunk is not copied until the next update
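A metadata-level copy-on-write sketch, with invented structures: the snapshot shares the source file's chunk handles and bumps reference counts, and a chunk is duplicated only when a shared handle is next written.

```python
files = {"/db/table": [101, 102]}        # path -> ordered chunk handles
refcount = {101: 1, 102: 1}
next_handle = 200

def snapshot(src, dst):
    """A snapshot only copies metadata: the two files share chunk handles."""
    files[dst] = list(files[src])
    for h in files[src]:
        refcount[h] += 1

def write_chunk(path, index):
    """Before writing a shared chunk, give this file its own copy."""
    global next_handle
    h = files[path][index]
    if refcount[h] > 1:
        refcount[h] -= 1
        h, next_handle = next_handle, next_handle + 1
        refcount[h] = 1
        files[path][index] = h
    return h

snapshot("/db/table", "/db/table.snap")
print(write_chunk("/db/table", 0))       # 200: the write lands on a fresh copy
print(files["/db/table.snap"])           # [101, 102]: the snapshot is untouched
```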
Master Operation

   No directories
   No hard links and symbolic links
   Full path name to metadata mapping
       With prefix compression
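A sketch of the flat namespace as a single pathname-to-metadata table; "directories" exist only as pathname prefixes, and listing one is just a prefix scan. Prefix compression of the keys is omitted, and all names here are illustrative.

```python
namespace = {
    "/home/a/log1": {"chunks": [1, 2]},
    "/home/a/log2": {"chunks": [3]},
    "/home/b/data": {"chunks": [4, 5, 6]},
}

def list_dir(prefix):
    """A 'directory listing' is just a prefix scan over full pathnames."""
    if not prefix.endswith("/"):
        prefix += "/"
    return sorted(p for p in namespace if p.startswith(prefix))

print(list_dir("/home/a"))    # ['/home/a/log1', '/home/a/log2']
```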
Locking Operations

   A lock per path
       To access /d1/d2/leaf
       Need to lock /d1, /d1/d2, and /d1/d2/leaf
       Files in the same directory can be modified concurrently
           Each operation acquires
               Read locks on the directory path names
               A write lock on the target file name
       Totally ordered locking to prevent deadlocks
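A sketch of the lock set an operation would take under this scheme: read locks on each ancestor path, a write lock on the target, acquired in one consistent (sorted) order so two operations cannot deadlock. The function is hypothetical.

```python
def lock_plan(path):
    """Locks needed to create or write `path`: read locks on every ancestor,
    a write lock on the target, acquired in one consistent (sorted) order."""
    parts = path.strip("/").split("/")
    ancestors = ["/" + "/".join(parts[:i]) for i in range(1, len(parts))]
    return sorted([(p, "read") for p in ancestors] + [(path, "write")])

print(lock_plan("/d1/d2/leaf"))
# [('/d1', 'read'), ('/d1/d2', 'read'), ('/d1/d2/leaf', 'write')]
```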
Replica Placement

   Goals:
       Maximize data reliability and availability
       Maximize network bandwidth
   Need to spread chunk replicas across
    machines and racks
   Higher priority to re-replicating chunks that are
    further below their replication goal
   Limited resources spent on replication
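A simplified sketch of the two placement ideas above, with invented data shapes: repair the chunks furthest below their replication goal first, and place a new replica on a rack that does not already hold one.

```python
def repair_order(chunks, goal=3):
    """chunks: {handle: [(server, rack), ...]} -> under-replicated handles, most urgent first."""
    return sorted((h for h in chunks if len(chunks[h]) < goal),
                  key=lambda h: len(chunks[h]))

def pick_rack(existing_racks, all_racks):
    """Prefer a rack that does not already hold a replica of this chunk."""
    for rack in all_racks:
        if rack not in existing_racks:
            return rack
    return all_racks[0]

chunks = {101: [("cs1", "r1")], 102: [("cs2", "r1"), ("cs3", "r2")]}
print(repair_order(chunks))                    # [101, 102]: 101 is missing two replicas
print(pick_rack({"r1"}, ["r1", "r2", "r3"]))   # 'r2'
```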
Garbage Collection

   Simpler than eager deletion in the face of
       Chunk creations that succeed on only some replicas
       Deletion messages that may be lost
   Deleted files are hidden for three days
   Then they are garbage collected
   Combined with other background operations
    (taking snapshots)
   Safety net against accidents
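A lazy-deletion sketch with assumed naming conventions: deletion just renames the file to a hidden, timestamped name, and a background pass reclaims anything hidden for more than three days.

```python
import time

HIDE_SECONDS = 3 * 24 * 3600        # deleted files stay recoverable for three days

hidden = {}                         # hidden name -> (original path, deletion time)

def delete(namespace, path, now=None):
    """'Delete' = rename to a hidden, timestamped name; nothing is reclaimed yet."""
    now = time.time() if now is None else now
    hidden_name = f"{path}.deleted.{int(now)}"
    hidden[hidden_name] = (path, now)
    namespace[hidden_name] = namespace.pop(path)

def garbage_collect(namespace, now=None):
    """Background pass: drop anything hidden for more than three days."""
    now = time.time() if now is None else now
    for name, (_, deleted_at) in list(hidden.items()):
        if now - deleted_at > HIDE_SECONDS:
            del namespace[name]     # its chunks become orphans and are reclaimed later
            del hidden[name]
```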
Fault Tolerance and Diagnosis

   Fast recovery
       Master and chunkservers are designed to restore
        their state and start in seconds, regardless of
        how they terminated
   Chunk replication
   Master replication
       Shadow masters provide read-only access when
        the primary master is down
Fault Tolerance and Diagnosis

   Data integrity
       A chunk is divided into 64-KB blocks
       Each with its checksum
       Verified at read and write times
       Also background scans for rarely used data
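A sketch of per-block checksumming at the 64 KB granularity mentioned above, using CRC32 purely as a stand-in checksum (the slides do not name the actual algorithm).

```python
import zlib

BLOCK = 64 * 1024                   # checksum granularity from the slide

def checksum_blocks(data: bytes):
    return [zlib.crc32(data[i:i + BLOCK]) for i in range(0, len(data), BLOCK)]

def verify(data: bytes, checksums):
    """Return the index of the first corrupted block, or None if all match."""
    for i, expected in enumerate(checksums):
        if zlib.crc32(data[i * BLOCK:(i + 1) * BLOCK]) != expected:
            return i
    return None

data = bytes(200 * 1024)            # 200 KB -> blocks of 64 + 64 + 64 + 8 KB
sums = checksum_blocks(data)
corrupted = data[:70 * 1024] + b"\x01" + data[70 * 1024 + 1:]
print(verify(data, sums), verify(corrupted, sums))   # None 1
```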
Measurements

   Chunkserver workload
       Bimodal distribution of small and large files
       Ratio of write to append operations: 3:1 to 8:1
       Virtually no overwrites
   Master workload
       Most requests are for chunk locations and file opens
   Reads achieve 75% of the network limit
   Writes achieve 50% of the network limit
Major Innovations

   File system API tailored to stylized workload
   Single-master design to simplify coordination
   Metadata fit in memory
   Flat namespace
