This document is a lecture slide deck surveying the core Google infrastructure behind cloud computing: the Google File System (GFS) for storage, MapReduce for large-scale computation, and Bigtable for structured data. Bigtable is presented as a sparse, multidimensional sorted map indexed by row, column, and timestamp, stored across many tablet servers with a single master coordinating metadata operations such as tablet assignment and load balancing. The master relies on Chubby, a distributed lock service, to track which tablet servers are alive and to reassign tablets when servers become unreachable.
1. INSE 6620 (Cloud Computing Security and Privacy)
Cloud Computing 101
Prof. Lingyu Wang
2. Enabling Technologies
Cloud computing relies on:
1. Hardware advancements
2. Web x.0 technologies
3. Virtualization
4. Distributed file system
Ghemawat et al., The Google File System; Dean et al., MapReduce: Simplified Data Processing on Large Clusters;
Chang et al., Bigtable: A Distributed Storage System for Structured Data
4. How Does it Work?
How are data stored?
The Google File System (GFS)
How are data organized?
The Bigtable
How are computations supported?
MapReduce
5. Google File System (GFS) Motivation
Need a scalable DFS for
Large distributed data-intensive applications
Performance, Reliability, Scalability and Availability
More than a traditional DFS
Component failure is the norm, not the exception
Built from inexpensive commodity components
Files are large (multi-GB)
Workloads: large streaming reads, sequential writes
Co-design applications and file system API
Sustained bandwidth more critical than low latency
6. File Structure
Files are divided into chunks
Fixed-size chunks (64MB)
Replicated over chunkservers, called replicas
3 replicas by default
Unique 64-bit chunk handles
Chunks are stored as Linux files on chunkservers
[Figure: a file is divided into fixed-size chunks; each chunk is stored as blocks of a Linux file]
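As a rough illustration of the fixed 64 MB chunking (a minimal Python sketch, not GFS code; the function name is made up):

# Hypothetical sketch: which chunk does a byte offset of a file fall into,
# assuming the fixed 64 MB chunk size described above?
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB

def locate_chunk(byte_offset):
    """Translate a file byte offset into (chunk index, offset within that chunk)."""
    return byte_offset // CHUNK_SIZE, byte_offset % CHUNK_SIZE

print(locate_chunk(200_000_000))  # -> (2, 65782272): the third chunk of the file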
8. Architecture - Master
Master stores three types of metadata
File & chunk namespaces
Mapping from files to chunks
Location of chunk replicas
Stored in memory
Heartbeats
Having one master
Global knowledge allows better placement / replication
Simplifies design
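A minimal sketch of the master's in-memory metadata (illustrative Python dictionaries, not GFS's actual data structures; the file name and handles are made up):

# Illustrative stand-ins for the three kinds of metadata listed above.
file_to_chunks = {"/logs/web.log": ["chunk-7a1f", "chunk-7a20"]}   # file -> ordered chunk handles
chunk_locations = {"chunk-7a1f": ["cs1", "cs4", "cs9"],            # chunk handle -> chunkservers
                   "chunk-7a20": ["cs2", "cs4", "cs7"]}            # (3 replicas by default)

def replicas_for(path, chunk_index):
    """Answer a client lookup: which chunkservers hold chunk N of this file?"""
    handle = file_to_chunks[path][chunk_index]
    return handle, chunk_locations[handle]

print(replicas_for("/logs/web.log", 0))   # ('chunk-7a1f', ['cs1', 'cs4', 'cs9'])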
9. Mutation Operations
Primary replica
Holds lease assigned by master
Assigns serial order for all mutation
operations performed on replicas
Write operation
1-2: client obtains replica locations
and identity of primary replica
3: client pushes data to replicas
4: client issues update request to
primary
5: primary forwards/performs write request
6: primary receives replies from replicas
7: primary replies to client
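The numbered steps can be condensed into a toy, single-process simulation (a sketch only; the real protocol uses RPCs, leases, and a data-flow pipeline between chunkservers; all names are made up):

# Toy simulation of the write path above: data is first pushed (buffered) at
# every replica, then the primary assigns a serial number and all replicas
# apply the mutation in that order. Purely illustrative.
replicas = {"primary": {}, "secondary1": {}, "secondary2": {}}
next_serial = 0

def client_write(key, value):
    global next_serial
    staged = {name: (key, value) for name in replicas}   # steps 1-3: push data to all replicas
    serial = next_serial                                  # steps 4-5: primary picks the serial order
    next_serial += 1
    for name, (k, v) in staged.items():                  # step 5: forward write to all replicas
        replicas[name][k] = (serial, v)
    return "ok"                                           # steps 6-7: replies flow back to the client

client_write("record-1", "hello")
print(replicas["secondary2"]["record-1"])                 # (0, 'hello')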
10. Fault Tolerance and Diagnosis
Fast Recovery
Both master and chunkserver are designed to
restart in seconds
Chunk replication
Each chunk is replicated on multiple chunkservers
on different racks
Master replication
Master’s state is replicated
Monitoring outside GFS may restart master process
Data integrity
Checksumming to detect corruption of stored data
Each chunkserver independently verifies integrity
same data may look different on different chunkservers
12. MapReduce Motivation
Recall “Cost associativity”: 1k servers*1hr=1server*1k hrs
Nice, but how?
How to run my task on 1k servers?
Distributed computing, many things to worry about
Customized task, can’t use standard applications
MapReduce: a programming model/abstraction
that supports this while hiding messy details:
Parallelization
Data distribution
Fault-tolerance
Load balancing
14. Programming Model
Input & Output: each a set of key/value pairs
Programmer specifies two functions:
map(in_key, in_value) -> list(out_key, intermediate_value)
Processes input key/value pair to generate intermediate
pairs
(transparently, the underlying system groups/sorts
intermediate values based on out_keys)
reduce(out_key, list(intermediate_value)) -> list(out_value)
Given all intermediate values for a particular key,
produces a set of merged output values (usually just one)
Many real-world problems can be represented
using these two functions
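To make the two signatures concrete, here is a minimal single-machine sketch of the model (assumed Python illustration; the real implementation is a distributed C++ library, per the Implementation Overview slide). The word-count, grep, and link-reversal slides below can be run against this driver.

# Minimal in-memory sketch of the MapReduce programming model. map_fn and
# reduce_fn follow the signatures above; the driver performs the grouping/
# sorting of intermediate values that the real system does transparently.
from collections import defaultdict

def run_mapreduce(inputs, map_fn, reduce_fn):
    intermediate = defaultdict(list)
    for in_key, in_value in inputs:                        # map phase
        for out_key, value in map_fn(in_key, in_value):
            intermediate[out_key].append(value)
    return {key: reduce_fn(key, values)                    # reduce phase, one call per key
            for key, values in sorted(intermediate.items())}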
15. Example: Count Word Occurrences
Input consists of (url, contents) pairs
map(key=url, val=contents):
For each word w in contents, emit (w, “1”)
reduce(key=word, values=uniq_counts):
Sum all “1”s in values list
Emit result “(word, sum)”
16. Example: Count Word Occurrences
map(key=url, val=contents):
For each word w in contents, emit (w, “1”)
reduce(key=word, values=uniq_counts):
Sum all “1”s in values list
Emit result “(word, sum)”
Input: “see bob throw”, “see spot run”
Map output: (see, 1) (bob, 1) (throw, 1); (see, 1) (spot, 1) (run, 1)
After grouping/sorting: bob -> [1], run -> [1], see -> [1, 1], spot -> [1], throw -> [1]
Reduce output: (bob, 1) (run, 1) (see, 2) (spot, 1) (throw, 1)
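Using the sketch driver from the Programming Model slide, the example above runs directly (illustrative Python):

# Word count over the sample input above: map emits (word, 1) for every word,
# reduce sums the counts for each word.
def wc_map(url, contents):
    for word in contents.split():
        yield word, 1

def wc_reduce(word, counts):
    return sum(counts)

docs = [("doc1", "see bob throw"), ("doc2", "see spot run")]
print(run_mapreduce(docs, wc_map, wc_reduce))
# {'bob': 1, 'run': 1, 'see': 2, 'spot': 1, 'throw': 1}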
17. Example: Distributed Grep
Input consists of (url+offset, single line)
map(key=url+offset, val=line):
If contents matches regexp, emit (line, “1”)
reduce(key=line, values=uniq_counts):
Don’t do anything; just emit line
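The same sketch driver covers distributed grep (illustrative Python; the pattern and sample lines are made up):

# Distributed grep sketch: map emits a line if it matches the pattern,
# reduce is the identity and just passes the line through.
import re

PATTERN = re.compile(r"error")                    # hypothetical pattern

def grep_map(key, line):                          # key plays the role of url+offset
    if PATTERN.search(line):
        yield line, 1

def grep_reduce(line, counts):
    return line

lines = [("log:0", "all good"), ("log:17", "disk error on cs4")]
print(run_mapreduce(lines, grep_map, grep_reduce))
# {'disk error on cs4': 'disk error on cs4'}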
18. Reverse Web-Link Graph
Map
For each target URL found in page source
Emit a <target, source> pair
Reduce
Concatenate a list of all source URLs
Outputs: <target, list(source)> pairs
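A sketch of the link-graph reversal with the same driver (illustrative Python; page contents are abbreviated to lists of target URLs, and all names are made up):

# Reverse web-link graph sketch: map emits <target, source> for every link
# found in a page; reduce concatenates all sources pointing at a target.
def link_map(source, targets):
    for target in targets:
        yield target, source

def link_reduce(target, sources):
    return list(sources)

pages = [("a.com", ["b.com", "c.com"]), ("b.com", ["c.com"])]
print(run_mapreduce(pages, link_map, link_reduce))
# {'b.com': ['a.com'], 'c.com': ['a.com', 'b.com']}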
20. More Examples
Distributed sort
Map: extracts key from each record, emits a <key,
record>
Reduce: emits all pairs unchanged
Relies on underlying partitioning and ordering functionalities
21. Widely Used at Google
Example uses:
distributed grep, distributed sort, web link-graph reversal
term-vector per host, web access log stats, inverted index construction
document clustering, machine learning, statistical machine translation
...
22. Usage in Aug 2004
Number of jobs: 29,423
Average job completion time: 634 secs
Machine days used: 79,186 days
Input data read: 3,288 TB
Intermediate data produced: 758 TB
Output data written: 193 TB
Average worker machines per job: 157
Average worker deaths per job: 1.2
Average map tasks per job: 3,351
Average reduce tasks per job: 55
Unique map implementations: 395
Unique reduce implementations: 269
Unique map/reduce combinations: 426
23. Implementation Overview
Typical cluster:
100s-1000s of 2-CPU x86 machines, 2-4 GB of
memory
100 Mbps or 1 Gbps, but limited bisection bandwidth
Storage is on local IDE disks
GFS: distributed file system manages data
Job scheduling system: jobs made up of tasks,
scheduler assigns tasks to machines
Implementation is a C++ library linked into
user programs
24. Parallelization
How is task distributed?
Partition input key/value pairs into equal-sized
chunks of 16-64MB, run map() tasks in parallel
After all map()s are complete, consolidate all
emitted values for each unique emitted key
Now partition space of output map keys, and run
reduce() in parallel
Typical setting:
2,000 machines
M = 200,000
R = 5,000
25. Execution Overview
(0) mapreduce(spec, &result) starts the job
The input is split into M pieces of 16-64 MB each
Intermediate keys are partitioned into R regions by the partitioning function: hash(intermediate_key) mod R
Each reduce worker reads all intermediate data for its region and sorts it by intermediate keys
[Figure: execution overview of a MapReduce job]
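The partitioning step can be sketched in a couple of lines (illustrative Python; a deterministic CRC32 stands in for the hash function, which the slides do not specify):

# Sketch of the default partitioning: an intermediate key is routed to one of
# the R reduce regions via hash(intermediate_key) mod R.
import zlib

R = 5000

def partition(intermediate_key, r=R):
    return zlib.crc32(intermediate_key.encode()) % r

print(partition("see"), partition("bob"))   # each key maps to a fixed reduce region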
27. Task Granularity & Pipelining
Fine granularity tasks: map tasks >> machines
Minimizes time for fault recovery
Better dynamic load balancing
Often use 200,000 map & 5000 reduce tasks
Running on 2000 machines
28. Fault Tolerance
Worker failure handled via re-execution
Detect failure via periodic heartbeats
Re-execute completed + in-progress map tasks
Due to inaccessible results (map outputs live on the failed worker's local disk)
Only re-execute in-progress reduce tasks
Results of completed tasks stored in global file system
Robust: lost 80 machines once, finished OK
Master failure not handled
Rare in practice
Abort and re-run at client
29. Refinement: Redundant Execution
Problem: Slow workers may significantly delay completion time when close to the end of a job
Other jobs consuming resources on machine
Bad disks w/ soft errors transfer data slowly
Weird things: processor caches disabled
Solution: Near end of phase, spawn backup
tasks
Whichever one finishes first "wins"
Dramatically shortens job completion time
30. Refinement: Locality Optimization
Network bandwidth is a relatively scarce resource, so to save it:
Input data stored on local disks in GFS
Schedule a map task on a machine hosting a replica
If can’t, schedule it close to a replica (e.g., a host
using the same switch)
Effect
Thousands of machines read input at local disk
speed
Without this, rack switches limit read rate
31. Refinement: Combiner Function
Purpose: reduce data sent over network
Combiner function: performs partial merging of
intermediate data at the map worker
Typically, combiner function == reducer function
Only difference is how to handle output
E.g., word count (see the sketch below)
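For word count, a combiner is essentially the reducer run locally over one map task's output before anything crosses the network (illustrative Python sketch):

# Combiner sketch: partially merge (word, 1) pairs on the map worker. The
# logic matches the reducer; only where the output goes differs (it feeds the
# shuffle rather than the final output files).
from collections import Counter

def combine(map_output):
    combined = Counter()
    for word, count in map_output:
        combined[word] += count
    return list(combined.items())   # e.g. thousands of ("the", 1) pairs -> one ("the", N)

print(combine([("the", 1), ("the", 1), ("dog", 1)]))   # [('the', 2), ('dog', 1)]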
32. Performance
Tests run on cluster of 1800 machines:
4 GB of memory, dual-processor 2 GHz Xeons
Dual 160 GB IDE disks
Gigabit Ethernet NIC, bisection bandwidth 100 Gbps
Two benchmarks:
Grep: scan 10^10 100-byte records to extract records
matching a rare pattern (92K matching records)
M=15,000 (input split size about 64MB)
R=1
Sort: sort 10^10 100-byte records
M=15,000 (input split size about 64MB)
R=4,000
33. Grep
Locality optimization helps:
1800 machines read 1 TB at peak ~31 GB/s
W/out this, rack switches would limit to 10 GB/s
Startup overhead is significant for short jobs
Total time about 150 seconds; 1 minute startup time
35. Experience
Rewrote Google's production indexing system using MapReduce
Set of 10, 14, 17, 21, 24 MapReduce operations
New code is simpler, easier to understand
3800 lines of C++ reduced to 700
Easier to understand and change the indexing process
(from months to days)
Easier to operate
MapReduce handles failures, slow machines
Easy to improve performance
Add more machines
36. Conclusion
MapReduce has proven to be a useful abstraction
Greatly simplifies large-scale computations
Fun to use:
focus on problem,
let library deal w/ messy details
37. Bigtable Motivation
Storage for (semi-)structured data
e.g., Google Earth, Google Finance, Personalized
Search
Scale
Lots of data
Millions of machines
Different projects/applications
Hundreds of millions of users
38. Why Not a DBMS?
Few DBMSs support the requisite scale
Required DB with wide scalability, wide applicability,
high performance and high availability
Couldn't afford it if there was one
Most DBMSs require very expensive infrastructure
DBMSs provide more than Google needs
E.g., full transactions, SQL
Google has highly optimized lower-level
systems that could be exploited
GFS, Chubby, MapReduce, job scheduling, ...
39. Bigtable
“A BigTable is a sparse, distributed, persistent multidimensional sorted map. The map is indexed by a row key, a column key, and a timestamp; each value in the map is an uninterpreted array of bytes.”
40. Data Model
(row, column, timestamp) -> cell contents
Rows
Arbitrary string
Access to data in a row is atomic
Ordered lexicographically
41. Data Model
Column
Two-level name structure: column families and columns
Column family is the unit of access control
42. Data Model
Timestamps
Store different versions of data in a cell
Lookup options
Return most recent K values
Return all values
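A toy sketch of the (row, column, timestamp) -> value map and the "most recent K values" lookup (illustrative Python; the row and column names follow the Bigtable paper's webtable example):

# Toy sketch of the data model: a sparse map from (row, column, timestamp)
# to an uninterpreted byte string, with lookups returning the newest versions.
table = {}

def put(row, column, timestamp, value):
    table[(row, column, timestamp)] = value

def lookup(row, column, k=1):
    """Return the most recent k values stored in this cell, newest first."""
    versions = [(ts, v) for (r, c, ts), v in table.items() if (r, c) == (row, column)]
    return [v for ts, v in sorted(versions, reverse=True)[:k]]

put("com.cnn.www", "contents:", 1, b"<html>v1</html>")
put("com.cnn.www", "contents:", 2, b"<html>v2</html>")
print(lookup("com.cnn.www", "contents:", k=1))   # [b'<html>v2</html>']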
43. Data Model
The row range for a table is dynamically partitioned into “tablets”
A tablet is the unit of distribution and load balancing
44. Building Blocks
Google File System (GFS)
stores persistent data
Scheduler
schedules jobs onto machines
Chubby
Lock service: distributed lock manager
e.g., master election, location bootstrapping
MapReduce (optional)
Data processing
Read/write Bigtable data
45. Implementation
Single-master distributed system
Three major components
Library that is linked into every client
One master server
Assigning tablets to tablet servers
Addition and expiration of tablet servers, balancing tablet-server load
Metadata Operations
Many tablet servers
Tablet servers handle read and write requests to their tablets
Split tablets that have grown too large
47. How to locate a Tablet?
Given a row, how do clients find the location of the tablet whose row range covers the target row?
48. Tablet Assignment
Chubby
Tablet server registers itself by getting a lock in a specific Chubby directory
Chubby gives a “lease” on the lock, which must be renewed periodically
Server loses lock if it gets disconnected
Master monitors this directory to find which servers exist/are alive
If server not contactable/has lost lock, master grabs lock
and reassigns tablets
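A rough sketch of the registration/lease logic described above (illustrative Python; Chubby's real API and directory layout are not shown on the slide, so the names and lease length here are assumptions):

# Illustrative tablet-server liveness tracking via Chubby-style locks/leases.
# A server holds a lock in a "servers" directory and renews it periodically;
# the master treats a lapsed lease as a dead server and reassigns its tablets.
import time

LEASE_SECONDS = 10
servers_dir = {}                 # stand-in for the Chubby directory: server id -> lease expiry

def register(server_id):
    servers_dir[server_id] = time.time() + LEASE_SECONDS       # acquire the lock

def renew(server_id):
    if server_id in servers_dir:
        servers_dir[server_id] = time.time() + LEASE_SECONDS   # periodic renewal

def master_scan(reassign):
    """Master's view: any server whose lease has lapsed loses its tablets."""
    now = time.time()
    for server_id in [s for s, expiry in servers_dir.items() if expiry < now]:
        del servers_dir[server_id]   # master grabs the lock ...
        reassign(server_id)          # ... and reassigns that server's tablets

register("tabletserver-3")
master_scan(lambda s: print("reassigning tablets from", s))     # nothing has lapsed yet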