Topic 11: Google Filesystem
Zubair Nabi
zubair.nabi@itu.edu.pk
April 20, 2013
2. Outline
1 Introduction
2 Google Filesystem
3 Hadoop Distributed Filesystem
7. Filesystem
The purpose of a filesystem is to:
1 Organize and store data
2 Support sharing of data among users and applications
3 Ensure persistence of data after a reboot
Examples include FAT, NTFS, ext3, and ext4.
13. Distributed filesystem
Self-explanatory: the filesystem is distributed across many machines
The DFS provides a common abstraction over the dispersed files
Each DFS has an associated API that offers clients the normal file
operations, such as create, read, write, etc.
Maintains a namespace which maps logical names to physical locations
Simplifies replication and migration
Examples include the Network File System (NFS), the Andrew File System
(AFS), etc.
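
To make the API concrete, a minimal sketch of such a client-facing
interface; the name and signatures are hypothetical rather than taken
from any particular DFS:

    // Hypothetical client-facing interface of a distributed filesystem.
    // Names and signatures are illustrative only.
    import java.io.IOException;

    public interface DistributedFileSystem {
        void create(String path) throws IOException;          // add a file to the namespace
        byte[] read(String path, long offset, int length) throws IOException;
        void write(String path, long offset, byte[] data) throws IOException;
        void delete(String path) throws IOException;
        String[] list(String directory) throws IOException;   // enumerate a directory
    }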
14. Outline
1 Introduction
2 Google Filesystem
3 Hadoop Distributed Filesystem
17. Introduction
Designed by Google to meet its massive storage needs
Shares many goals with previous distributed filesystems, such as
performance, scalability, reliability, and availability
At the same time, the design was driven by key observations of Google's
workloads and infrastructure, both current and anticipated
20. Design Goals
1 Failure is the norm rather than the exception: GFS must constantly
introspect and automatically recover from failures
2 The system stores a fair number of large files: Optimize for large
files, on the order of GBs, but still support small files
3 Applications prefer large streaming reads of contiguous regions:
Optimize for this case
23. Design Goals (2)
4 Most applications perform large, sequential writes that are mostly
append operations: Support small writes but do not optimize for them
5 Many workloads are producer-consumer queues or many-way merges:
Support concurrent reads and writes by hundreds of clients
simultaneously
6 Applications process data in bulk at a high rate: Favour throughput
over latency
26. Interface
The interface is similar to that of traditional filesystems, but there
is no support for a standard POSIX-like API
Files are organized hierarchically into directories with pathnames
Support for create, delete, open, close, read, and write operations
29. Architecture
Consists of a single master and multiple chunkservers
The system can be accessed by multiple clients
Both the master and chunkservers run as user-space server processes
on commodity Linux machines
35. Files
Files are sliced into fixed-size chunks
Each chunk is identified by an immutable and globally unique 64-bit
handle
Chunks are stored by chunkservers as local Linux files
Reads and writes to a chunk are specified by a handle and a byte range
Each chunk is replicated on multiple chunkservers (three by default)
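
Because chunks are fixed-size, a client can translate a byte offset into
a chunk index with simple arithmetic. A small sketch, assuming the 64 MB
chunk size used in the GFS paper (class and method names are illustrative):

    // Sketch: mapping a byte offset onto a chunk index, assuming the
    // 64 MB chunk size from the GFS paper. Names are illustrative.
    public class ChunkMath {
        static final long CHUNK_SIZE = 64L * 1024 * 1024; // 64 MB

        // Index of the chunk containing a given byte offset within a file.
        static long chunkIndex(long byteOffset) {
            return byteOffset / CHUNK_SIZE;
        }

        // Offset of that byte within its chunk.
        static long offsetInChunk(long byteOffset) {
            return byteOffset % CHUNK_SIZE;
        }

        public static void main(String[] args) {
            long offset = 200L * 1024 * 1024; // byte 200 MB into a file
            System.out.println("chunk index = " + chunkIndex(offset));       // 3
            System.out.println("offset in chunk = " + offsetInChunk(offset)); // 8388608 (8 MB)
        }
    }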
42. Master
In charge of all filesystem metadata: the namespace, access control
information, the mapping between files and chunks, and the current
locations of chunks
Holds this information in memory and regularly syncs it to an operation
log
Also in charge of chunk leasing, garbage collection, and chunk migration
Periodically sends each chunkserver a heartbeat message to check its
state and give it instructions
Clients interact with it to access metadata, but all data-bearing
communication goes directly to the relevant chunkservers
As a result, the master does not become a performance bottleneck
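
A sketch of how this separates metadata from data on a read; all the
types below are hypothetical stand-ins for the RPCs involved:

    // Sketch of the read control flow: metadata from the master, data
    // directly from a chunkserver. All types are illustrative stand-ins.
    import java.util.List;

    interface Master {
        // Returns the chunk handle and replica locations for (file, chunk index).
        ChunkInfo lookup(String path, long chunkIndex);
    }

    interface Chunkserver {
        byte[] readChunk(long chunkHandle, long offset, int length);
    }

    record ChunkInfo(long handle, List<Chunkserver> replicas) {}

    class Client {
        static byte[] read(Master master, String path, long byteOffset, int length) {
            long chunkSize = 64L * 1024 * 1024;
            // Step 1: one small RPC to the master for metadata (cacheable).
            ChunkInfo info = master.lookup(path, byteOffset / chunkSize);
            // Step 2: bulk data flows from a replica, bypassing the master.
            Chunkserver replica = info.replicas().get(0); // e.g. the closest one
            return replica.readChunk(info.handle(), byteOffset % chunkSize, length);
        }
    }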
46. Consistency Model: Master
All namespace mutations (such as file creation) are atomic, as they are
handled exclusively by the master
Namespace locking guarantees atomicity and correctness
The operation log maintained by the master defines a global total order
of these operations
51. Consistency Model: Data
The state of a file region after a mutation depends on:
Mutation type: write or append
Whether it succeeds or fails
Whether there are other concurrent mutations
A file region is consistent if all clients see the same data, regardless
of the replica they read from
A region is defined after a mutation if it is consistent and clients see
the mutation in its entirety
55. Consistency Model: Data (2)
If there are no other concurrent writers, the region is defined and
consistent
Concurrent and successful mutations leave the region undefined but
consistent: clients may see mingled fragments from multiple mutations
A failed mutation makes the region both inconsistent and undefined
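
These cases are summarized by Table 1 of the GFS paper (Ghemawat et al.,
2003), rendered here in plain form:

                           Write                       Record append
    Serial success         defined                     defined, interspersed with
    Concurrent successes   consistent but undefined    inconsistent regions (both rows)
    Failure                inconsistent                inconsistent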
58. Mutation Operations
Each chunk has multiple replicas
The primary replica holds a lease from the master
The primary decides the order of all mutations, which all replicas
follow
63. Write Operation
The client obtains the locations of the replicas and the identity of the
primary replica from the master
It then pushes the data to all replica nodes
The client issues an update request to the primary
The primary forwards the write request to all replicas
It waits for a reply from all replicas before returning to the client
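
The control flow above can be sketched as follows; the interfaces are
hypothetical stand-ins for the RPCs involved, and the chain-wise data
pipelining between chunkservers is simplified to a loop:

    // Sketch of the write control flow. Illustrative interfaces only;
    // error handling and retries are omitted.
    import java.util.List;

    interface Replica {
        void push(byte[] data);        // stage data in a buffer
        void apply(long serialNumber); // apply staged data in serial order
    }

    interface Primary extends Replica {
        // Assigns a serial number and forwards the write to all secondaries,
        // waiting for their replies before returning.
        boolean commit(List<Replica> secondaries);
    }

    class WriteClient {
        static boolean write(Primary primary, List<Replica> all, byte[] data) {
            // Step 1: push data to every replica (in GFS the data is
            // pipelined chain-wise between chunkservers).
            for (Replica r : all) r.push(data);
            // Step 2: ask the primary to commit; it orders the mutation
            // and forwards the request to the secondaries.
            List<Replica> secondaries = all.stream()
                    .filter(r -> r != primary).toList();
            return primary.commit(secondaries);
        }
    }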
69. Record Append Operation
Performed atomically
The append location is chosen by GFS and communicated to the client
The primary forwards the append request to all replicas
It waits for a reply from all replicas before returning to the client
1 If the record fits in the current chunk, it is written and the offset
is communicated to the client
2 If it does not, the chunk is padded and the client is told to retry on
the next chunk
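
A minimal sketch of the primary's append decision, under the 64 MB chunk
size from the paper; class and field names are illustrative:

    // Sketch: append at the chunk's current end if the record fits,
    // otherwise pad the chunk and tell the client to retry.
    class AppendPrimary {
        static final long CHUNK_SIZE = 64L * 1024 * 1024;
        long chunkUsed; // bytes already written in the current chunk

        // Returns the offset the record was written at, or -1 for
        // "retry on the next chunk" after padding.
        long tryAppend(int recordLength) {
            if (chunkUsed + recordLength <= CHUNK_SIZE) {
                long offset = chunkUsed;
                chunkUsed += recordLength; // replicas apply the same decision
                return offset;
            }
            chunkUsed = CHUNK_SIZE;        // pad the remainder of the chunk
            return -1;                     // client retries on a fresh chunk
        }
    }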
73. Application Safeguards
Use record append rather than write
Insert checksums in record headers to detect fragments
Insert sequence numbers to detect duplicates
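
A sketch of the self-describing record framing these safeguards imply:
a header carrying a sequence number and a checksum, so readers can
detect fragments and drop duplicates. The exact layout is an assumption,
not taken from the paper:

    // Sketch of record framing: [seqNo: 8][length: 4][crc32: 8][payload].
    import java.nio.ByteBuffer;
    import java.util.zip.CRC32;

    class RecordFraming {
        static byte[] frame(long seqNo, byte[] payload) {
            CRC32 crc = new CRC32();
            crc.update(payload);
            ByteBuffer buf = ByteBuffer.allocate(20 + payload.length);
            buf.putLong(seqNo).putInt(payload.length).putLong(crc.getValue());
            buf.put(payload);
            return buf.array();
        }

        // Returns true if the framed record is intact (not a fragment).
        static boolean verify(byte[] framed) {
            ByteBuffer buf = ByteBuffer.wrap(framed);
            buf.getLong();              // seqNo, used by readers for de-duplication
            int length = buf.getInt();
            long storedCrc = buf.getLong();
            if (buf.remaining() != length) return false; // truncated fragment
            byte[] payload = new byte[length];
            buf.get(payload);
            CRC32 crc = new CRC32();
            crc.update(payload);
            return crc.getValue() == storedCrc;
        }
    }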
76. Chunk Placement
Place new chunks on chunkservers with below-average disk space usage
Limit the number of “recent” creations on a chunkserver, to ensure that
it does not experience a traffic spike due to its fresh data
For reliability, replicas are spread across racks
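
The heuristics can be sketched as a filter-then-spread selection; all
fields, names, and thresholds below are illustrative:

    // Sketch of placement: below-average disk usage, capped recent
    // creations, replicas spread across racks.
    import java.util.*;
    import java.util.stream.Collectors;

    class Placement {
        record Server(String id, String rack, double diskUsage, int recentCreations) {}

        static List<Server> choose(List<Server> servers, int replicas, int maxRecent) {
            double avgUsage = servers.stream()
                    .mapToDouble(Server::diskUsage).average().orElse(0);
            List<Server> candidates = servers.stream()
                    .filter(s -> s.diskUsage() < avgUsage)        // below-average usage
                    .filter(s -> s.recentCreations() < maxRecent) // avoid creation hotspots
                    .collect(Collectors.toList());
            // Greedily pick replicas on distinct racks for fault tolerance.
            List<Server> chosen = new ArrayList<>();
            Set<String> racks = new HashSet<>();
            for (Server s : candidates) {
                if (racks.add(s.rack())) chosen.add(s);
                if (chosen.size() == replicas) break;
            }
            return chosen;
        }
    }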
81. Garbage Collection
Chunks become garbage when they are orphaned
A lazy reclamation strategy is used: chunks are not reclaimed at delete
time
Each chunkserver communicates a subset of its current chunks to the
master in the heartbeat message
The master pinpoints the chunks which have been orphaned
The chunkserver then reclaims that space
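
A sketch of the heartbeat-driven reclamation loop; the data structures
are illustrative simplifications:

    // Sketch: the chunkserver reports what it stores, the master answers
    // with the handles it no longer maps, and the chunkserver deletes them.
    import java.util.HashSet;
    import java.util.Set;

    class GarbageCollection {
        static Set<Long> heartbeat(Set<Long> masterLiveChunks, Set<Long> reported) {
            Set<Long> orphaned = new HashSet<>(reported);
            orphaned.removeAll(masterLiveChunks); // reported but unmapped => orphaned
            return orphaned;                      // chunkserver may reclaim these
        }

        public static void main(String[] args) {
            Set<Long> live = Set.of(1L, 2L, 3L);
            Set<Long> reported = Set.of(2L, 3L, 4L);       // 4's file was deleted
            System.out.println(heartbeat(live, reported)); // prints [4]
        }
    }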
85. Stale Replica Detection
Each chunk is assigned a version number
Each time a new lease is granted, the version number is incremented
Stale replicas will have outdated version numbers
They are simply garbage collected
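
The version comparison itself is trivial; a replica is stale if its
version lags the master's (names below are illustrative):

    // Sketch of stale-replica detection: the master bumps the chunk
    // version on each new lease; replicas that missed the bump lag behind.
    class VersionCheck {
        static boolean isStale(int masterVersion, int replicaVersion) {
            return replicaVersion < masterVersion; // missed one or more leases
        }

        public static void main(String[] args) {
            System.out.println(isStale(7, 7)); // false: up to date
            System.out.println(isStale(7, 6)); // true: garbage collected
        }
    }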
86. Outline
1 Introduction
2 Google Filesystem
3 Hadoop Distributed Filesystem
91. Introduction
Open-source clone of GFS
Comes packaged with Hadoop
The master is called the NameNode and chunkservers are called
DataNodes
Chunks are known as blocks
Exposes a Java API and a command-line interface
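
A minimal example against the HDFS Java API (org.apache.hadoop.fs); the
path used is a placeholder:

    // Create, check, and delete a file on HDFS via the Java API.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration(); // points at the NameNode
            FileSystem fs = FileSystem.get(conf);

            Path file = new Path("/tmp/example.txt"); // placeholder path
            try (FSDataOutputStream out = fs.create(file)) {
                out.writeUTF("hello, hdfs");          // data goes to DataNodes
            }
            System.out.println("exists: " + fs.exists(file));
            fs.delete(file, false);                   // non-recursive delete
        }
    }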
93. Command-line API
Accessible through: bin/hdfs dfs -command args
Useful commands: cat, copyFromLocal, copyToLocal, cp, ls, mkdir,
moveFromLocal, moveToLocal, mv, rm, etc.1
1 http://hadoop.apache.org/docs/r1.0.4/file_system_shell.html
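
For instance, a session using the commands above might look like the
following; the paths are placeholders:

    bin/hdfs dfs -mkdir /user/data
    bin/hdfs dfs -copyFromLocal local.txt /user/data/
    bin/hdfs dfs -ls /user/data
    bin/hdfs dfs -cat /user/data/local.txt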
94. References
1 Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung. 2003. The
Google File System. In Proceedings of the Nineteenth ACM Symposium on
Operating Systems Principles (SOSP ’03). ACM, New York, NY, USA, 29-43.