Google File Systems

The Google File System
Published By:
Sanjay Ghemawat,
Howard Gobioff,
Shun-Tak Leung
Google

Presented By:
Manoj Samaraweera (138231B)
Azeem Mumtaz (138218R)
University of Moratuwa

Contents
• Distributed File Systems
• Introducing Google File System
• Design Overview
• System Interaction
• Master Operation
• Fault Tolerance and Diagnosis
• Measurements and Benchmarks
• Experience
• Related Works
• Conclusion
• Reference

Distributed File Systems
• Enables programs to store and access remote
files exactly as they do local ones
• New modes of data organization on disk or
across multiple servers
• Goals
▫ Performance
▫ Scalability
▫ Reliability
▫ Availability

Introducing Google File System
• Growing demand for Google data processing
• Properties
▫ A scalable distributed file system
▫ For large distributed data intensive applications
▫ Fault tolerance
▫ Inexpensive commodity hardware
▫ High aggregated performance
• Design is driven by observation of workload and
technological environment

Design Assumptions
• Component failures are the norm
▫ Commodity Hardware
• Files are huge by traditional standards
▫ Multi-GB files are common
▫ Small files must also be supported
 But the system need not optimize for them
• Read Workloads
▫ Large streaming reads
▫ Small random reads
• Write Workloads
▫ Large, sequential writes that append data to file
• Multiple clients concurrently append to the same file
▫ Requires well-defined consistency semantics, implemented efficiently
▫ Files are used as producer-consumer queues or for many-way merging
• High sustained bandwidth is more important than low latency

Design Interface
• Typical File System Interface
• Hierarchical Directory Organization
• Files are identified by pathnames
• Operations
▫ Create, delete, open, close, read, write

Architecture (1/2)
• Files are divided into chunks
• Fixed-size chunks (64MB)
• Unique 64-bit chunk handles
▫ Immutable and globally unique
• Chunks as Linux files
• Replicated over chunkservers, called replicas
▫ 3 replicas by default
▫ Different replication levels can be set for different regions of the file namespace
• Single master
• Multiple chunkservers
▫ Grouped into Racks
▫ Connected through switches
• Multiple clients
• Master/chunkserver coordination
▫ HeartBeat Messages

Single Master
• Maintains Metadata
• Controls System Wide Activities
▫ Chunk lease management
▫ Garbage collection
▫ Chunk migration
▫ Replication

Chunk Size (1/2)
• 64 MB
• Stored as plain Linux file on a chunkserver
• Advantages
▫ Reduces the client's interaction with the single master
▫ Clients are likely to perform many operations on
a large chunk
 Reduces network overhead by keeping a persistent
TCP connection with the chunkserver
▫ Reduces the size of the metadata
 Keep metadata in memory
▫ Lazy Space Allocation

Chunk Size (2/2)
• Disadvantages
▫ A small file consists of few chunks, perhaps only one;
the chunkservers storing them may become hot spots
when many clients access the same file
▫ Solutions
 Higher replication factor
 Stagger application start time
 Allow clients to read from other clients

Metadata (1/5)
• 3 Major Types
▫ The file and chunk namespace
▫ File-to-chunk mappings
▫ The locations of each chunk's replicas
• Namespaces and mappings
▫ Persisted by logging mutations to an operation log
stored on the master
▫ Operation log is replicated on remote machines

Metadata (2/5)
• Metadata is stored in the master's memory
▫ Makes master operations fast
▫ Easier to scan the entire state of metadata
periodically
 Chunk garbage collection
 Re-replication in the presence of chunkserver failure
 Chunk migration to balance load and disk space
• Less than 64 bytes of metadata per 64 MB chunk
• File namespace data requires < 64 bytes per file
▫ Prefix compression

Metadata (3/5)
• Chunk location information
▫ Not stored persistently; polled from chunkservers at master startup
 And whenever a chunkserver joins the cluster
▫ Kept up-to-date afterwards through regular
HeartBeat messages

Metadata (4/5)
• Operation Logs
▫ Historical record of critical metadata changes
▫ Logical timeline that defines the order of
concurrent operations
▫ Changes are not visible to clients
 Until the log record has been flushed to disk locally and on the remote replicas
▫ Flushing and replication happen in batches
 Reduces the impact on system throughput

Metadata (5/5)
• Operation Logs
▫ The master recovers its file system state by
replaying the operation log
▫ Checkpoints
 Taken whenever the log grows beyond a certain size,
to keep the log small and recovery fast
 Built in a separate thread, so checkpointing does
not delay incoming mutations
▫ A checkpoint is a compact B-tree-like structure
 Directly mapped into memory and used for
namespace lookup
 No extra parsing
Consistency Model (1/3)

• Guarantees by GFS
▫ File namespace mutations (e.g., file creation) are atomic
 Namespace locking guarantees atomicity and
correctness
 The master's operation log defines a global total order of these operations
▫ After a sequence of successful mutations, the mutated file region is
guaranteed to be defined and to contain the data written by
the last mutation. GFS achieves this by
 Applying mutations to a chunk in the same order on all its replicas
 Using chunk version numbers to detect stale replicas

Consistency Model (2/3)
• Relaxed consistency model
• Two types of mutations
▫ Writes
 Cause data to be written at an application-specified file offset
▫ Record Appends
 Cause data to be appended atomically at least once
 Offset chosen by GFS, not by the client
• States of a file region after a mutation
▫ Consistent
 All clients see the same data, regardless of which replica they read from
▫ Inconsistent
 Different clients may see different data at different times
▫ Defined
 Consistent, and all clients see what the mutation wrote in its entirety
▫ Undefined
 Consistent, but the data may not reflect what any one mutation has written

Consistency Model (3/3)
• Implications for Applications
▫ Relying on appends rather than overwrites
▫ Checkpointing
 to verify how much data has been successfully
written
▫ Writing self-validating records
 Checksums let readers detect and discard padding and record fragments
▫ Writing Self-identifying records
 Unique Identifiers to identify and discard duplicates
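
A reader-side sketch of self-validating, self-identifying records. The record layout (length, record id, CRC32, payload) and the byte-by-byte resynchronisation are assumptions made for this example; the slides do not specify how applications encode their records:

```python
import struct
import zlib

HEADER = struct.Struct(">IQI")  # payload length, record id, CRC32 of payload


def encode(record_id: int, payload: bytes) -> bytes:
    """Prefix the payload with a header that makes it self-validating."""
    return HEADER.pack(len(payload), record_id, zlib.crc32(payload)) + payload


def read_records(region: bytes) -> list[bytes]:
    """Return valid payloads, skipping padding and duplicate record ids."""
    seen, out, pos = set(), [], 0
    while pos + HEADER.size <= len(region):
        length, record_id, crc = HEADER.unpack_from(region, pos)
        start = pos + HEADER.size
        payload = region[start:start + length]
        if length > 0 and len(payload) == length and zlib.crc32(payload) == crc:
            if record_id not in seen:  # drop duplicates left by retried appends
                seen.add(record_id)
                out.append(payload)
            pos = start + length
        else:
            pos += 1  # padding or a fragment: resynchronise one byte at a time
    return out


region = encode(1, b"alpha") + b"\0" * 10 + encode(1, b"alpha") + encode(2, b"beta")
records = read_records(region)  # [b"alpha", b"beta"]
```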

Lease & Mutation Order
• Master uses leases to maintain a consistent
mutation order among replicas
• The primary is the chunkserver that has been granted
the chunk lease
▫ Master delegates the authority to order mutations
▫ All other replicas are secondary replicas
• The primary picks a serial order for all mutations
to the chunk
▫ Secondary replicas follow this order

Writes (1/7)
• Step 1
▫ Client asks the master: which chunkserver holds
the current lease for the
chunk?
▫ Client also asks for the locations of the secondary
replicas

Writes (2/7)
• Step 2
▫ Master replies with the identities of the primary and
secondary replicas
▫ Client caches this data for
future mutations, until
 Primary is unreachable
 Primary no longer holds
the lease

Writes (3/7)
• Step 3
▫ Client pushes the data to
all replicas
▫ Chunkserver stores the
data in an internal LRU
buffer cache

Writes (4/7)
• Step 4
▫ Client sends a write
request to the primary
▫ Primary assigns
consecutive serial
numbers to the mutations it receives
 Serialization
▫ Primary applies
mutations to its own state

Writes (5/7)
• Step 5
▫ Primary forwards the write request to all
secondary replicas
▫ Each secondary applies mutations in the same
serial-number order

Writes (6/7)
• Step 6
▫ Secondary replicas
inform primary after
completing the mutation

Writes (7/7)
• Step 7
▫ Primary replies to the
client
▫ On errors, the client retries
steps 3 through 7
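
The control flow of steps 3 through 7 can be sketched with in-memory objects. Class and method names, the `data_id` handle, and the tiny 64-byte "chunk" are illustrative assumptions; networking, leases, and error handling are left out:

```python
class Replica:
    def __init__(self):
        self.buffer = {}            # data_id -> bytes pushed in step 3
        self.chunk = bytearray(64)  # tiny stand-in for a 64 MB chunk
        self.applied = []           # serial numbers in application order

    def push(self, data_id, data):
        self.buffer[data_id] = data

    def apply(self, serial, data_id, offset):
        data = self.buffer.pop(data_id)
        self.chunk[offset:offset + len(data)] = data
        self.applied.append(serial)


class Primary(Replica):
    def __init__(self, secondaries):
        super().__init__()
        self.secondaries = secondaries
        self.next_serial = 0

    def write(self, data_id, offset):
        serial = self.next_serial            # step 4: assign a serial number
        self.next_serial += 1
        self.apply(serial, data_id, offset)  # step 4: apply locally
        for replica in self.secondaries:     # steps 5-6: forward, collect acks
            replica.apply(serial, data_id, offset)
        return serial                        # step 7: reply to the client


s1, s2 = Replica(), Replica()
primary = Primary([s1, s2])
for replica in (primary, s1, s2):            # step 3: push data to all replicas
    replica.push("a", b"AAAA")
    replica.push("b", b"BB")
primary.write("a", 0)
primary.write("b", 2)
```

Since every replica applies the mutations in the primary's serial order, all three copies end with the same bytes.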

Data Flow (1/2)
• Decoupled control flow and data flow
• Data is pushed linearly along a chain of
chunkservers in a pipelined fashion
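▫ Fully utilizes each machine's outbound bandwidth
• Each machine forwards the data to the closest machine
that has not yet received it
▫ Distance is accurately estimated from IP
addresses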
• Minimize latency by pipelining the data
transmission over TCP

Data Flow (2/2)
• Ideal elapsed time for transferring B bytes to R
replicas: $B/T + RL$
 T – network throughput
 L – latency to transfer bytes between two machines
• At Google:
 T = 100 Mbps
 L ≤ 1 ms
 1 MB can ideally be distributed in about 80 ms
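
The ideal-time expression B/T + RL can be checked numerically. The sketch below assumes 1 MB means 10^6 bytes, takes L at its 1 ms upper bound, and uses R = 3 (the default replica count mentioned earlier); the function name is illustrative:

```python
def ideal_transfer_seconds(size_bytes, throughput_bps, replicas, latency_s):
    """Ideal pipelined transfer time B/T + R*L, in seconds."""
    return size_bytes * 8 / throughput_bps + replicas * latency_s


t = ideal_transfer_seconds(1_000_000, 100e6, 3, 0.001)
print(f"{t * 1000:.0f} ms")  # 83 ms: 80 ms of transmission plus 3 ms of latency
```

The B/T term dominates, which is why the slide quotes roughly 80 ms.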

Record Append
• In traditional writes
▫ Client specifies the offset at which the data is written
▫ Concurrent writes to the same region are not serializable
• In record append
▫ Client specifies only the data
▫ Control flow is similar to a write, with extra logic at the primary
▫ GFS appends the data to the file atomically, at least once
 If the record would not fit in the current chunk, the primary pads
the chunk to its maximum size and the client retries on the next chunk
 If a record append fails at any replica, the client retries
the operation, which can leave duplicate records
 Successfully appended regions are defined, interspersed with
inconsistent regions (padding and duplicates)
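
The primary's decision for a single record append can be sketched as a pure function. The one-quarter-of-chunk-size limit on record length comes from the GFS paper rather than from these slides; the function name and return format are illustrative:

```python
CHUNK_SIZE = 64 * 2**20  # 64 MB


def record_append(chunk_used: int, record_len: int):
    """Return (action, offset, new_chunk_used) for one append request."""
    if record_len > CHUNK_SIZE // 4:
        # The paper restricts appends to a quarter of the chunk size,
        # which keeps worst-case padding at an acceptable level.
        raise ValueError("record too large for a single append")
    if chunk_used + record_len > CHUNK_SIZE:
        # Pad the rest of the chunk; the client retries on the next chunk.
        return ("retry_next_chunk", None, CHUNK_SIZE)
    # GFS, not the client, chooses the offset: the current end of the chunk.
    return ("appended", chunk_used, chunk_used + record_len)
```

For example, `record_append(CHUNK_SIZE - 50, 100)` pads the last 50 bytes and asks the client to retry, while `record_append(0, 100)` places the record at offset 0.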

Snapshot (1/2)
• Goals
▫ To quickly create branch copies of huge data sets
▫ To easily checkpoint the current state
• Copy-on-write technique
▫ Master receives a snapshot request
▫ Revokes outstanding leases on the chunks of the files to be snapshotted
▫ Logs the operation to disk
▫ Applies the log record to its in-memory state by duplicating
the metadata for the source file or directory tree
▫ The new snapshot file points to the same chunks as the source

Snapshot (2/2)
• After the snapshot operation
▫ A client sends a request to the master to find the
current lease holder of a chunk C
▫ Master notices that the reference count for chunk C is > 1
▫ Master picks a new chunk handle C'
▫ Master asks each chunkserver holding a replica of C to create
a new chunk C' by copying C locally
▫ Master grants one of the replicas a lease on the
new chunk C' and replies to the client
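
The copy-on-write bookkeeping on the master can be sketched with reference counts. This models metadata only; the class and method names are illustrative, and the actual chunk copy on the chunkservers is represented by a comment:

```python
import itertools


class Master:
    def __init__(self):
        self.files = {}     # path -> list of chunk handles
        self.refcount = {}  # chunk handle -> number of files referencing it
        self._ids = itertools.count(1)

    def create(self, path, n_chunks):
        handles = [next(self._ids) for _ in range(n_chunks)]
        self.files[path] = handles
        for handle in handles:
            self.refcount[handle] = 1

    def snapshot(self, src, dst):
        # Duplicate metadata only; both files now share the same chunks.
        self.files[dst] = list(self.files[src])
        for handle in self.files[src]:
            self.refcount[handle] += 1

    def prepare_write(self, path, index):
        """Return the handle to write to, copying the chunk first if shared."""
        handle = self.files[path][index]
        if self.refcount[handle] > 1:
            new_handle = next(self._ids)  # chunkservers would copy the chunk here
            self.refcount[handle] -= 1
            self.refcount[new_handle] = 1
            self.files[path][index] = new_handle
            handle = new_handle
        return handle


m = Master()
m.create("/a", 2)
m.snapshot("/a", "/snap")
first_write = m.prepare_write("/a", 0)   # shared chunk: a new handle is issued
second_write = m.prepare_write("/a", 0)  # no longer shared: same handle again
```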

Master Operation

• Namespace Management and Locking
• Replica Placement
• Creation, Re-replication, Rebalancing
• Garbage Collection
• Stale Replica Detection

Namespace Management and Locking

• Each master operation acquires a set of locks
before it runs

• Example: creating /home/user/foo is serialized with a
snapshot of /home/user to /save/user, because the two
operations need conflicting locks on /home/user
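
The example can be worked through in code. The locking rule used here (read locks on every ancestor directory, a write lock on the path being created or snapshotted) follows the GFS paper rather than these slides; the helper names are illustrative:

```python
def lock_set(write_paths):
    """Read locks on every ancestor directory, write locks on the targets."""
    locks = {}
    for path in write_paths:
        parts = path.strip("/").split("/")
        for i in range(1, len(parts)):
            locks.setdefault("/" + "/".join(parts[:i]), "r")
        locks[path] = "w"
    return locks


def conflicts(a, b):
    """Paths locked by both operations where at least one side writes."""
    return sorted(p for p in a.keys() & b.keys() if "w" in (a[p], b[p]))


snapshot_locks = lock_set(["/home/user", "/save/user"])
create_locks = lock_set(["/home/user/foo"])
blocked_on = conflicts(snapshot_locks, create_locks)  # ["/home/user"]
```

The snapshot holds a write lock on /home/user while the creation needs a read lock on it, so the two operations cannot run at the same time. Both hold only read locks on /home, which do not conflict.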

Replica Placement

• Chunk replica placement policy serves two
purposes:
▫ Maximize data reliability and availability.
▫ Maximize network bandwidth utilization

Creation, Re-replication, Rebalancing

• Creation
▫ Want to place new replicas on chunkservers with
below-average disk space utilization
▫ Limit the number of “recent” creations on each
chunkserver
▫ Spread replicas of a chunk across racks.
• Re-replication
▫ Triggered as soon as the number of available replicas falls below a user-specified goal
• Rebalancing
▫ Moves replicas for better disk space and load
balancing
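
One way to combine the three creation criteria is sketched below. The ranking (fewest recent creations first, then lowest disk utilization) and the two-pass rack spreading are assumptions for illustration, not the master's actual policy; the server fields are hypothetical:

```python
def pick_chunkservers(servers, copies=3):
    """Choose servers for new replicas, preferring distinct racks."""
    ranked = sorted(servers, key=lambda s: (s["recent_creations"], s["disk_used"]))
    chosen, racks = [], set()
    for server in ranked:  # first pass: at most one replica per rack
        if len(chosen) < copies and server["rack"] not in racks:
            chosen.append(server)
            racks.add(server["rack"])
    for server in ranked:  # second pass: fill up if there are too few racks
        if len(chosen) < copies and server not in chosen:
            chosen.append(server)
    return [server["name"] for server in chosen]


servers = [
    {"name": "cs1", "rack": "r1", "disk_used": 0.2, "recent_creations": 0},
    {"name": "cs2", "rack": "r1", "disk_used": 0.3, "recent_creations": 0},
    {"name": "cs3", "rack": "r2", "disk_used": 0.9, "recent_creations": 0},
    {"name": "cs4", "rack": "r3", "disk_used": 0.5, "recent_creations": 2},
]
placement = pick_chunkservers(servers)  # one server from each of the three racks
```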

Garbage Collection

• Mechanism
▫ Master logs the deletion immediately.
▫ File is just renamed to a hidden name.
▫ During its regular namespace scan, the master removes hidden
files that have existed for more than three days.
▫ In a regular scan of the chunk namespace, master
identifies orphaned chunks and erases the
metadata for those chunks.
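
The lazy deletion mechanism can be sketched as follows. The hidden-name scheme (`.deleted.<timestamp>` suffix) and the class layout are hypothetical; only the rename-then-scan idea and the three-day grace period come from the slide:

```python
GRACE_SECONDS = 3 * 24 * 3600  # three days, as on the slide


class Namespace:
    def __init__(self):
        self.files = {}  # path -> list of chunk handles

    def delete(self, path, now):
        # Deletion only renames the file to a hidden name carrying a timestamp.
        self.files[f"{path}.deleted.{int(now)}"] = self.files.pop(path)

    def scan(self, now):
        """Drop expired hidden files; return the chunk handles still referenced."""
        for name in list(self.files):
            _, sep, stamp = name.rpartition(".deleted.")
            if sep and now - int(stamp) > GRACE_SECONDS:
                del self.files[name]
        return {h for handles in self.files.values() for h in handles}


def orphans(known_chunks, live):
    """Chunks that no file references any more and can be erased."""
    return set(known_chunks) - live


ns = Namespace()
ns.files = {"/a": [1, 2], "/b": [3]}
ns.delete("/a", now=0)
live_early = ns.scan(now=100)                # still within the grace period
live_late = ns.scan(now=GRACE_SECONDS + 1)   # hidden file has expired
```

Until the grace period ends, the hidden file still references its chunks, so the deletion can be undone by renaming the file back.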

Stale Replica Detection

• Chunk version number to distinguish between
up-to-date and stale replicas.
• Master removes stale replicas in its regular
garbage collection.
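
A sketch of version-based staleness detection. According to the GFS paper (not stated on this slide), the master increases the chunk version whenever it grants a new lease and informs the replicas that are currently reachable; the class below folds the master's record and the replicas' reported versions into one object for brevity, and its names are illustrative:

```python
class ChunkRecord:
    def __init__(self, replicas):
        self.version = 1
        self.replica_versions = {name: 1 for name in replicas}

    def grant_lease(self, reachable):
        """Bump the version on the master and on every reachable replica."""
        self.version += 1
        for name in reachable:
            self.replica_versions[name] = self.version

    def stale(self):
        """Replicas that missed a version bump while they were down."""
        return sorted(
            name for name, v in self.replica_versions.items() if v < self.version
        )


record = ChunkRecord(["cs1", "cs2", "cs3"])
record.grant_lease(["cs1", "cs2"])  # cs3 is down and misses the update
stale_replicas = record.stale()     # ["cs3"]
```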

Fault Tolerance and Diagnosis

• High Availability
▫ Fast Recovery
 Both the master and the chunkservers are designed to restore their
state and start in seconds.
▫ Chunk Replication
 Master clones existing replicas as needed to keep each
chunk fully replicated
▫ Master Replication
 The master state is replicated for reliability
 Operation log and checkpoints are replicated on multiple
machines
 “Shadow masters” provide read-only access to the FS even when
the primary master is down

Fault Tolerance and Diagnosis (2)

• Data Integrity
▫ Each chunkserver uses checksumming to detect
corruption of stored data.
▫ Chunk is broken up into 64 KB blocks. Each has a
corresponding 32 bit checksum
▫ Checksum computation is heavily optimized for
writes that append to the end of a chunk

Fault Tolerance and Diagnosis (3)

• Diagnostic Tools
▫ Extensive and detailed diagnostic logging helps with
problem isolation, debugging, and performance
analysis
▫ GFS servers generate diagnostic logs that record
many significant events and all RPC requests and
replies

Measurements and Benchmarks
• Micro-benchmarks
▫ GFS cluster consisting of one master, two master replicas, 16
chunkservers, and 16 clients

Measurements and Benchmarks (2)

• Real World Clusters
• Cluster A is used regularly for research and development
• Cluster B is primarily used for production data processing

Experience
• Biggest problems were disk and Linux related.
▫ Some disks claimed to the Linux driver that they
supported a range of IDE protocol versions but in fact
responded reliably only to the more recent ones.

▫ Despite occasional problems, the availability of Linux
code has helped to explore and understand system
behavior.

Related Works (1/3)
• Both GFS and AFS provide a location-independent
namespace
▫ Allows data to be moved transparently for load balancing
or fault tolerance
• Unlike AFS, GFS spreads a file’s data across
storage servers in a way more akin to xFS and Swift
in order to deliver aggregate performance and
increased fault tolerance
• GFS currently uses replication for redundancy and
consumes more raw storage than xFS or Swift.

Related Works (2/3)
• In contrast to systems like AFS, xFS, Frangipani,
and Intermezzo, GFS does not provide any caching
below the file system interface.
• GFS uses a centralized approach in order to
simplify the design, increase its reliability, and gain
flexibility
▫ unlike Frangipani, xFS, Minnesota’s GFS and GPFS
▫ Makes it easier to implement sophisticated chunk
placement and replication policies since the master
already has most of the relevant information and
controls how it changes.

Related Works (3/3)
• Unlike Lustre, GFS delivers aggregate performance by focusing on
the needs of Google's applications rather than building a
POSIX-compliant file system
• The NASD architecture is based on network-attached
disk drives; similarly, GFS uses commodity machines
as chunkservers
• GFS chunkservers use lazily allocated fixed-size
chunks, whereas NASD uses variable-length objects
• The producer-consumer queues enabled by atomic
record appends address a similar problem as the
distributed queues in River
▫ River uses memory-based queues distributed across
machines

Conclusion
• GFS demonstrates the qualities essential for
supporting large-scale data processing
workloads on commodity hardware.
• Provides fault tolerance by constant
monitoring, replicating crucial data, and fast
and automatic recovery
• Delivers high aggregate throughput to many
concurrent readers and writers performing a
variety of tasks

Reference
• Ghemawat, S., Gobioff, H., and Leung, S.-T. 2003. The
Google File System. In Proceedings of the
Nineteenth ACM Symposium on Operating
Systems Principles (SOSP '03). ACM, New York,
NY, USA, 29-43.
• Coulouris, G., Dollimore, J., and Kindberg, T. 2005.
Distributed Systems: Concepts and Design (4th
Edition). Addison-Wesley Longman Publishing
Co., Inc., Boston, MA, USA.
