Topic 11: Google Filesystem
Zubair Nabi
zubair.nabi@itu.edu.pk
April 20, 2013
2. Outline
1 Introduction
2 Google Filesystem
3 Hadoop Distributed Filesystem
7. Filesystem
The purpose of a filesystem is to:
1 Organize and store data
2 Support sharing of data among users and applications
3 Ensure persistence of data after a reboot
Examples include FAT, NTFS, ext3, and ext4.
13. Distributed filesystem
Self-explanatory: the filesystem is distributed across many machines
The DFS provides a common abstraction over the dispersed files
Each DFS has an associated API that offers clients the normal file
operations, such as create, read, write, etc.
Maintains a namespace which maps logical names to physical locations
Simplifies replication and migration
Examples include the Network File System (NFS), the Andrew File System
(AFS), etc.
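
To make the API concrete, a minimal sketch of such a client-facing
interface; the name and signatures are hypothetical rather than taken
from any particular DFS:

    // Hypothetical client-facing interface of a distributed filesystem.
    // Names and signatures are illustrative only.
    import java.io.IOException;

    public interface DistributedFileSystem {
        void create(String path) throws IOException;          // add a file to the namespace
        byte[] read(String path, long offset, int length) throws IOException;
        void write(String path, long offset, byte[] data) throws IOException;
        void delete(String path) throws IOException;
        String[] list(String directory) throws IOException;   // enumerate a directory
    }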
14. Outline
1 Introduction
2 Google Filesystem
3 Hadoop Distributed Filesystem
17. Introduction
Designed by Google to meet its massive storage needs
Shares many goals with previous distributed filesystems, such as
performance, scalability, reliability, and availability
At the same time, the design was driven by key observations of Google's
workloads and infrastructure, both current and anticipated
20. Design Goals
1 Failure is the norm rather than the exception: GFS must constantly
introspect and automatically recover from failures
2 The system stores a fair number of large files: Optimize for large
files, on the order of GBs, but still support small files
3 Applications prefer large streaming reads of contiguous regions:
Optimize for this case
23. Design Goals (2)
4 Most applications perform large, sequential writes that are mostly
append operations: Support small writes but do not optimize for them
5 Many workloads are producer-consumer queues or many-way merges:
Support concurrent reads and writes by hundreds of clients
simultaneously
6 Applications process data in bulk at a high rate: Favour throughput
over latency
26. Interface
The interface is similar to that of traditional filesystems, but there
is no support for a standard POSIX-like API
Files are organized hierarchically into directories with pathnames
Support for create, delete, open, close, read, and write operations
29. Architecture
Consists of a single master and multiple chunkservers
The system can be accessed by multiple clients
Both the master and chunkservers run as user-space server processes
on commodity Linux machines
35. Files
Files are sliced into fixed-size chunks
Each chunk is identified by an immutable and globally unique 64-bit
handle
Chunks are stored by chunkservers as local Linux files
Reads and writes to a chunk are specified by a handle and a byte range
Each chunk is replicated on multiple chunkservers (three by default)
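
Because chunks are fixed-size, a client can translate a byte offset into
a chunk index with simple arithmetic. A small sketch, assuming the 64 MB
chunk size used in the GFS paper (class and method names are illustrative):

    // Sketch: mapping a byte offset onto a chunk index, assuming the
    // 64 MB chunk size from the GFS paper. Names are illustrative.
    public class ChunkMath {
        static final long CHUNK_SIZE = 64L * 1024 * 1024; // 64 MB

        // Index of the chunk containing a given byte offset within a file.
        static long chunkIndex(long byteOffset) {
            return byteOffset / CHUNK_SIZE;
        }

        // Offset of that byte within its chunk.
        static long offsetInChunk(long byteOffset) {
            return byteOffset % CHUNK_SIZE;
        }

        public static void main(String[] args) {
            long offset = 200L * 1024 * 1024; // byte 200 MB into a file
            System.out.println("chunk index = " + chunkIndex(offset));       // 3
            System.out.println("offset in chunk = " + offsetInChunk(offset)); // 8388608 (8 MB)
        }
    }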
42. Master
In charge of all filesystem metadata: the namespace, access control
information, the mapping between files and chunks, and the current
locations of chunks
Holds this information in memory and regularly syncs it to an operation
log
Also in charge of chunk leasing, garbage collection, and chunk migration
Periodically sends each chunkserver a heartbeat message to check its
state and give it instructions
Clients interact with it to access metadata, but all data-bearing
communication goes directly to the relevant chunkservers
As a result, the master does not become a performance bottleneck
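
A sketch of how this separates metadata from data on a read; all the
types below are hypothetical stand-ins for the RPCs involved:

    // Sketch of the read control flow: metadata from the master, data
    // directly from a chunkserver. All types are illustrative stand-ins.
    import java.util.List;

    interface Master {
        // Returns the chunk handle and replica locations for (file, chunk index).
        ChunkInfo lookup(String path, long chunkIndex);
    }

    interface Chunkserver {
        byte[] readChunk(long chunkHandle, long offset, int length);
    }

    record ChunkInfo(long handle, List<Chunkserver> replicas) {}

    class Client {
        static byte[] read(Master master, String path, long byteOffset, int length) {
            long chunkSize = 64L * 1024 * 1024;
            // Step 1: one small RPC to the master for metadata (cacheable).
            ChunkInfo info = master.lookup(path, byteOffset / chunkSize);
            // Step 2: bulk data flows from a replica, bypassing the master.
            Chunkserver replica = info.replicas().get(0); // e.g. the closest one
            return replica.readChunk(info.handle(), byteOffset % chunkSize, length);
        }
    }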
46. Consistency Model: Master
All namespace mutations (such as file creation) are atomic, as they are
handled exclusively by the master
Namespace locking guarantees atomicity and correctness
The operation log maintained by the master defines a global total order
of these operations
51. Consistency Model: Data
The state of a file region after a mutation depends on:
Mutation type: write or append
Whether it succeeds or fails
Whether there are other concurrent mutations
A file region is consistent if all clients see the same data, regardless
of the replica they read from
A region is defined after a mutation if it is consistent and clients see
the mutation in its entirety
55. Consistency Model: Data (2)
If there are no other concurrent writers, the region is defined and
consistent
Concurrent and successful mutations leave the region undefined but
consistent: clients may see mingled fragments from multiple mutations
A failed mutation makes the region both inconsistent and undefined
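
These cases are summarized by Table 1 of the GFS paper (Ghemawat et al.,
2003), rendered here in plain form:

                           Write                       Record append
    Serial success         defined                     defined, interspersed with
    Concurrent successes   consistent but undefined    inconsistent regions (both rows)
    Failure                inconsistent                inconsistent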
58. Mutation Operations
Each chunk has multiple replicas
The primary replica holds a lease from the master
The primary decides the order of all mutations, which all replicas
follow
63. Write Operation
The client obtains the locations of the replicas and the identity of the
primary replica from the master
It then pushes the data to all replica nodes
The client issues an update request to the primary
The primary forwards the write request to all replicas
It waits for a reply from all replicas before returning to the client
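
The control flow above can be sketched as follows; the interfaces are
hypothetical stand-ins for the RPCs involved, and the chain-wise data
pipelining between chunkservers is simplified to a loop:

    // Sketch of the write control flow. Illustrative interfaces only;
    // error handling and retries are omitted.
    import java.util.List;

    interface Replica {
        void push(byte[] data);        // stage data in a buffer
        void apply(long serialNumber); // apply staged data in serial order
    }

    interface Primary extends Replica {
        // Assigns a serial number and forwards the write to all secondaries,
        // waiting for their replies before returning.
        boolean commit(List<Replica> secondaries);
    }

    class WriteClient {
        static boolean write(Primary primary, List<Replica> all, byte[] data) {
            // Step 1: push data to every replica (in GFS the data is
            // pipelined chain-wise between chunkservers).
            for (Replica r : all) r.push(data);
            // Step 2: ask the primary to commit; it orders the mutation
            // and forwards the request to the secondaries.
            List<Replica> secondaries = all.stream()
                    .filter(r -> r != primary).toList();
            return primary.commit(secondaries);
        }
    }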
69. Record Append Operation
Performed atomically
The append location is chosen by GFS and communicated to the client
The primary forwards the append request to all replicas
It waits for a reply from all replicas before returning to the client
1 If the record fits in the current chunk, it is written and the offset
is communicated to the client
2 If it does not, the chunk is padded and the client is told to retry on
the next chunk
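
A minimal sketch of the primary's append decision, under the 64 MB chunk
size from the paper; class and field names are illustrative:

    // Sketch: append at the chunk's current end if the record fits,
    // otherwise pad the chunk and tell the client to retry.
    class AppendPrimary {
        static final long CHUNK_SIZE = 64L * 1024 * 1024;
        long chunkUsed; // bytes already written in the current chunk

        // Returns the offset the record was written at, or -1 for
        // "retry on the next chunk" after padding.
        long tryAppend(int recordLength) {
            if (chunkUsed + recordLength <= CHUNK_SIZE) {
                long offset = chunkUsed;
                chunkUsed += recordLength; // replicas apply the same decision
                return offset;
            }
            chunkUsed = CHUNK_SIZE;        // pad the remainder of the chunk
            return -1;                     // client retries on a fresh chunk
        }
    }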
73. Application Safeguards
Use record append rather than write
Insert checksums in record headers to detect fragments
Insert sequence numbers to detect duplicates
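
A sketch of the self-describing record framing these safeguards imply:
a header carrying a sequence number and a checksum, so readers can
detect fragments and drop duplicates. The exact layout is an assumption,
not taken from the paper:

    // Sketch of record framing: [seqNo: 8][length: 4][crc32: 8][payload].
    import java.nio.ByteBuffer;
    import java.util.zip.CRC32;

    class RecordFraming {
        static byte[] frame(long seqNo, byte[] payload) {
            CRC32 crc = new CRC32();
            crc.update(payload);
            ByteBuffer buf = ByteBuffer.allocate(20 + payload.length);
            buf.putLong(seqNo).putInt(payload.length).putLong(crc.getValue());
            buf.put(payload);
            return buf.array();
        }

        // Returns true if the framed record is intact (not a fragment).
        static boolean verify(byte[] framed) {
            ByteBuffer buf = ByteBuffer.wrap(framed);
            buf.getLong();              // seqNo, used by readers for de-duplication
            int length = buf.getInt();
            long storedCrc = buf.getLong();
            if (buf.remaining() != length) return false; // truncated fragment
            byte[] payload = new byte[length];
            buf.get(payload);
            CRC32 crc = new CRC32();
            crc.update(payload);
            return crc.getValue() == storedCrc;
        }
    }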
76. Chunk Placement
Place new chunks on chunkservers with below-average disk space usage
Limit the number of “recent” creations on a chunkserver, to ensure that
it does not experience a traffic spike due to its fresh data
For reliability, replicas are spread across racks
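
The heuristics can be sketched as a filter-then-spread selection; all
fields, names, and thresholds below are illustrative:

    // Sketch of placement: below-average disk usage, capped recent
    // creations, replicas spread across racks.
    import java.util.*;
    import java.util.stream.Collectors;

    class Placement {
        record Server(String id, String rack, double diskUsage, int recentCreations) {}

        static List<Server> choose(List<Server> servers, int replicas, int maxRecent) {
            double avgUsage = servers.stream()
                    .mapToDouble(Server::diskUsage).average().orElse(0);
            List<Server> candidates = servers.stream()
                    .filter(s -> s.diskUsage() < avgUsage)        // below-average usage
                    .filter(s -> s.recentCreations() < maxRecent) // avoid creation hotspots
                    .collect(Collectors.toList());
            // Greedily pick replicas on distinct racks for fault tolerance.
            List<Server> chosen = new ArrayList<>();
            Set<String> racks = new HashSet<>();
            for (Server s : candidates) {
                if (racks.add(s.rack())) chosen.add(s);
                if (chosen.size() == replicas) break;
            }
            return chosen;
        }
    }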
81. Garbage Collection
Chunks become garbage when they are orphaned
A lazy reclamation strategy is used: chunks are not reclaimed at delete
time
Each chunkserver communicates a subset of its current chunks to the
master in the heartbeat message
The master pinpoints the chunks which have been orphaned
The chunkserver then reclaims that space
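
A sketch of the heartbeat-driven reclamation loop; the data structures
are illustrative simplifications:

    // Sketch: the chunkserver reports what it stores, the master answers
    // with the handles it no longer maps, and the chunkserver deletes them.
    import java.util.HashSet;
    import java.util.Set;

    class GarbageCollection {
        static Set<Long> heartbeat(Set<Long> masterLiveChunks, Set<Long> reported) {
            Set<Long> orphaned = new HashSet<>(reported);
            orphaned.removeAll(masterLiveChunks); // reported but unmapped => orphaned
            return orphaned;                      // chunkserver may reclaim these
        }

        public static void main(String[] args) {
            Set<Long> live = Set.of(1L, 2L, 3L);
            Set<Long> reported = Set.of(2L, 3L, 4L);       // 4's file was deleted
            System.out.println(heartbeat(live, reported)); // prints [4]
        }
    }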
85. Stale Replica Detection
Each chunk is assigned a version number
Each time a new lease is granted, the version number is incremented
Stale replicas will have outdated version numbers
They are simply garbage collected
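
The version comparison itself is trivial; a replica is stale if its
version lags the master's (names below are illustrative):

    // Sketch of stale-replica detection: the master bumps the chunk
    // version on each new lease; replicas that missed the bump lag behind.
    class VersionCheck {
        static boolean isStale(int masterVersion, int replicaVersion) {
            return replicaVersion < masterVersion; // missed one or more leases
        }

        public static void main(String[] args) {
            System.out.println(isStale(7, 7)); // false: up to date
            System.out.println(isStale(7, 6)); // true: garbage collected
        }
    }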
86. Outline
1 Introduction
2 Google Filesystem
3 Hadoop Distributed Filesystem
91. Introduction
Open-source clone of GFS
Comes packaged with Hadoop
The master is called the NameNode and chunkservers are called
DataNodes
Chunks are known as blocks
Exposes a Java API and a command-line interface
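
A minimal example against the HDFS Java API (org.apache.hadoop.fs); the
path used is a placeholder:

    // Create, check, and delete a file on HDFS via the Java API.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration(); // points at the NameNode
            FileSystem fs = FileSystem.get(conf);

            Path file = new Path("/tmp/example.txt"); // placeholder path
            try (FSDataOutputStream out = fs.create(file)) {
                out.writeUTF("hello, hdfs");          // data goes to DataNodes
            }
            System.out.println("exists: " + fs.exists(file));
            fs.delete(file, false);                   // non-recursive delete
        }
    }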
93. Command-line API
Accessible through: bin/hdfs dfs -command args
Useful commands: cat, copyFromLocal, copyToLocal, cp, ls, mkdir,
moveFromLocal, moveToLocal, mv, rm, etc.1
1 http://hadoop.apache.org/docs/r1.0.4/file_system_shell.html
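
For instance, a session using the commands above might look like the
following; the paths are placeholders:

    bin/hdfs dfs -mkdir /user/data
    bin/hdfs dfs -copyFromLocal local.txt /user/data/
    bin/hdfs dfs -ls /user/data
    bin/hdfs dfs -cat /user/data/local.txt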
94. References
1 Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung. 2003. The
Google File System. In Proceedings of the Nineteenth ACM Symposium on
Operating Systems Principles (SOSP ’03). ACM, New York, NY, USA, 29-43.