2. GFS is a scalable distributed file system
for large, distributed, data-intensive
applications.
3. Google made key observations that led
them to build their own DFS.
Cost-Effectiveness:
› The system is built from inexpensive
commodity components, where component
failure is the norm rather than the exception.
› The system must therefore detect, tolerate,
and recover from failures on a routine basis.
4. File Size:
› Multi-GB files are the common case, so the
system must be optimized for managing large
files.
› Small files are also supported, but there is
no need to optimize for them.
5. Read Operations:
› Large Streaming Reads
An individual operation reads hundreds of KBs,
often 1 MB or more.
Successive operations from the same client
usually read through a contiguous region of the
same file.
› Small Random Reads
An operation reads a few KBs starting at an
arbitrary offset.
Performance-conscious applications usually
batch and sort their small reads to advance
steadily through the file instead of going back
and forth (see the sketch below).
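To make the batching idea concrete, here is a minimal Go sketch of sorting pending small reads by offset so access advances steadily through the file; readReq and sortReads are illustrative names, not part of GFS.

```go
package main

import (
	"fmt"
	"sort"
)

// readReq is a hypothetical small-read request: a byte offset and a length.
type readReq struct {
	offset int64
	length int
}

// sortReads orders pending small reads by offset so the application
// moves steadily forward through the file instead of seeking back and forth.
func sortReads(reqs []readReq) {
	sort.Slice(reqs, func(i, j int) bool { return reqs[i].offset < reqs[j].offset })
}

func main() {
	reqs := []readReq{{offset: 8192, length: 512}, {offset: 0, length: 1024}, {offset: 4096, length: 256}}
	sortReads(reqs)
	for _, r := range reqs {
		fmt.Printf("read %d bytes at offset %d\n", r.length, r.offset)
	}
}
```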
6. Write Operations:
› Writes are similar in size to the large
streaming reads.
› Once written, files are seldom modified.
› Writes typically take the form of large
sequential appends.
› Small random writes are supported but need
not be efficient.
7. Transaction Management:
› Applications typically use GFS in a
producer-consumer pattern.
› Many producers may append to the same
file concurrently.
› Atomicity with minimal synchronization
overhead between producers is essential,
as sketched below.
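A minimal Go sketch of that pattern under an append-style operation where the system, not the caller, chooses where the record lands, so producers need no coordination of their own. The recordLog type is a toy in-memory stand-in for a GFS file, not the real implementation.

```go
package main

import (
	"fmt"
	"sync"
)

// recordLog is a toy stand-in for a GFS file opened for record append.
type recordLog struct {
	mu      sync.Mutex
	records [][]byte
}

// appendRecord mimics record-append semantics: the log, not the caller,
// picks where the record lands, so concurrent producers never clash.
func (l *recordLog) appendRecord(rec []byte) int {
	l.mu.Lock()
	defer l.mu.Unlock()
	l.records = append(l.records, rec)
	return len(l.records) - 1 // position chosen by the system
}

func main() {
	var l recordLog
	var wg sync.WaitGroup
	for p := 0; p < 3; p++ { // three concurrent producers
		wg.Add(1)
		go func(p int) {
			defer wg.Done()
			at := l.appendRecord([]byte(fmt.Sprintf("record from producer %d", p)))
			fmt.Printf("producer %d appended at position %d\n", p, at)
		}(p)
	}
	wg.Wait()
}
```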
8. Latency vs. High Sustained Bandwidth:
› Clients do not have tight SLAs on read and
write response times; instead they care
about processing and moving bulk data at
a high rate.
9. GFS provides an interface with the
following operations (sketched below):
› Create
› Delete
› Open
› Close
› Read
› Write
› Snapshot (Copy)
› Record Append
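The operation names above come straight from the GFS paper. As a sketch of how they might be exposed to applications, here is a hypothetical Go client interface; the signatures and the Handle type are assumptions of this sketch, not a real GFS API.

```go
package gfs

// Handle is an opaque reference to an open file (an assumption of this sketch).
type Handle struct{ id uint64 }

// Client lists the eight operations from the slide above, with guessed signatures.
type Client interface {
	Create(path string) error
	Delete(path string) error
	Open(path string) (Handle, error)
	Close(h Handle) error
	Read(h Handle, offset int64, buf []byte) (int, error)
	Write(h Handle, offset int64, data []byte) (int, error)
	// Snapshot makes a low-cost copy of a file or directory tree.
	Snapshot(src, dst string) error
	// RecordAppend appends data atomically and returns the offset GFS chose.
	RecordAppend(h Handle, data []byte) (int64, error)
}
```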
10. The system is organized into clusters.
Each cluster has the following
components:
› A single cluster master
› Multiple chunk servers
› Multiple clients
Files are divided into fixed-size chunks.
Chunk size is 64 MB (see the offset
translation sketch below).
The master assigns a globally unique 64-bit
identifier called a chunk handle to each
chunk upon creation.
Chunk servers store chunks on their local disks.
For reliability, each chunk is replicated
across multiple chunk servers.
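Because the chunk size is a fixed 64 MB, a client can translate a file byte offset into a chunk index locally, then ask the master for that chunk's handle and replica locations. A minimal sketch of the translation (locate is an illustrative name):

```go
package main

import "fmt"

const chunkSize = 64 << 20 // 64 MB, as stated above

// locate translates a byte offset within a file into the chunk index
// and the offset inside that chunk.
func locate(offset int64) (chunkIndex, chunkOffset int64) {
	return offset / chunkSize, offset % chunkSize
}

func main() {
	idx, off := locate(200 << 20) // byte 200 MB into the file
	fmt.Printf("chunk index %d, offset %d within chunk\n", idx, off)
	// prints: chunk index 3, offset 8388608 within chunk
}
```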
12. The master maintains file system
metadata and coordinates system-wide
activities (sketched below):
› The file and chunk namespaces.
› The mapping from files to chunks.
› The current locations of chunks.
› Chunk lease management.
› Garbage collection.
› Chunk migration between chunk servers.
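A sketch of how the master's in-memory tables might be laid out; field names and types are illustrative, not GFS source code. One detail worth noting from the paper: chunk locations are not persisted, the master learns them from chunk servers at startup and through heartbeat messages.

```go
package main

import "fmt"

// masterState models the metadata described above.
type masterState struct {
	// File namespace: path -> ordered list of chunk handles.
	files map[string][]uint64
	// Chunk handle -> addresses of chunk servers holding a replica
	// (kept in memory only, refreshed via heartbeats).
	locations map[uint64][]string
	// Chunk handle -> chunk server currently holding the lease (the primary).
	leases map[uint64]string
}

func main() {
	m := masterState{
		files:     map[string][]uint64{"/logs/web.0": {0x1a2b, 0x1a2c}},
		locations: map[uint64][]string{0x1a2b: {"cs1:7077", "cs4:7077", "cs9:7077"}},
		leases:    map[uint64]string{0x1a2b: "cs1:7077"},
	}
	fmt.Println("chunks of /logs/web.0:", m.files["/logs/web.0"])
	fmt.Println("replicas of first chunk:", m.locations[0x1a2b])
}
```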
16. Master operations:
› Namespace management and locking.
› Replica placement.
› Replica creation, re-replication, and
rebalancing (sketched below).
› Garbage collection.
› Stale replica detection.
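As an illustration of the re-replication item above: when the number of live replicas of a chunk falls below its replication goal (three by default in GFS), the master schedules a new copy. A minimal Go sketch of that trigger; the names are assumptions, not GFS code.

```go
package main

import "fmt"

// needsReReplication reports whether a chunk has fallen below its
// replication goal and should be copied to another chunk server.
func needsReReplication(liveReplicas, goal int) bool {
	return liveReplicas < goal
}

func main() {
	const goal = 3 // GFS default replication level
	for handle, live := range map[uint64]int{0x1a2b: 3, 0x1a2c: 1} {
		if needsReReplication(live, goal) {
			fmt.Printf("chunk %#x: %d of %d replicas, schedule re-replication\n",
				handle, live, goal)
		}
	}
}
```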