Designs, Lessons and Advice from Building Large
Distributed Systems
Jeff Dean
Google Fellow
jeff@google.com
Computing shifting to really small and really big devices
UI-centric devices
Large consolidated computing farms
Google’s data center at The Dalles, OR
The Machinery
Servers
• CPUs
• DRAM
• Disks
Racks
• 40-80 servers
• Ethernet switch
Clusters
Architectural view of the storage hierarchy
[Figure: each server has processors with per-core L1 caches and shared L2 caches over local DRAM and disk; servers aggregate through a rack switch into a rack, and racks through a cluster switch into a cluster]
One server
  DRAM: 16GB, 100ns, 20GB/s
  Disk: 2TB, 10ms, 200MB/s
Local rack (80 servers)
  DRAM: 1TB, 300us, 100MB/s
  Disk: 160TB, 11ms, 100MB/s
Cluster (30+ racks)
  DRAM: 30TB, 500us, 10MB/s
  Disk: 4.80PB, 12ms, 10MB/s
Storage hierarchy: a different view
A bumpy ride that has been getting bumpier over time
Reliability & Availability
• Things will crash. Deal with it!
– Assume you could start with super reliable servers (MTBF of 30 years)
– Build computing system with 10 thousand of those
– Watch one fail per day (arithmetic sketched below)
• Fault-tolerant software is inevitable
• Typical yearly flakiness metrics
– 1-5% of your disk drives will die
– Servers will crash at least twice (2-4% failure rate)
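A quick sanity check of the failure-rate bullet above, as a minimal Python sketch under the stated assumptions:

```python
# The arithmetic behind "watch one fail per day": 10,000 servers, each with
# a 30-year MTBF, still produce roughly one failure somewhere every day.
mtbf_years = 30
servers = 10_000

mtbf_days = mtbf_years * 365                 # ~11,000 days between failures per server
failures_per_day = servers / mtbf_days       # expected failures per day across the fleet
print(f"~{failures_per_day:.1f} failures/day")   # ~0.9, i.e. about one per day
```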
The Joys of Real Hardware
Typical first year for a new cluster:
~0.5 overheating (power down most machines in <5 mins, ~1-2 days to recover)
~1 PDU failure (~500-1000 machines suddenly disappear, ~6 hours to come back)
~1 rack-move (plenty of warning, ~500-1000 machines powered down, ~6 hours)
~1 network rewiring (rolling ~5% of machines down over 2-day span)
~20 rack failures (40-80 machines instantly disappear, 1-6 hours to get back)
~5 racks go wonky (40-80 machines see 50% packetloss)
~8 network maintenances (4 might cause ~30-minute random connectivity losses)
~12 router reloads (takes out DNS and external vips for a couple minutes)
~3 router failures (have to immediately pull traffic for an hour)
~dozens of minor 30-second blips for dns
~1000 individual machine failures
~thousands of hard drive failures
slow disks, bad memory, misconfigured machines, flaky machines, etc.
Long distance links: wild dogs, sharks, dead horses, drunken hunters, etc.
Understanding downtime behavior matters
Understanding fault statistics matters
Can we expect faults to be independent or correlated?
Are there common failure patterns we should program around?
Google Cluster Environment
• Cluster is 1000s of machines, typically one or handful of configurations
• File system (GFS) + Cluster scheduling system are core services
• Typically 100s to 1000s of active jobs (some w/1 task, some w/1000s)
• mix of batch and low-latency, user-facing production jobs
[Figure: Machine 1 … Machine N each run Linux on commodity HW with a GFS chunkserver, a scheduling slave, and tasks from several jobs (e.g. job 1, job 3, job 12); cluster-wide services include the scheduling master, the GFS master, and the Chubby lock service]
GFS Design
• Master manages metadata
• Data transfers happen directly between clients/chunkservers
• Files broken into chunks (typically 64 MB)
[Figure: clients and misc. servers talk to the GFS master (with master replicas); chunks C0, C1, C2, C3, C5 are spread and replicated across Chunkserver 1, Chunkserver 2, … Chunkserver N]
GFS Usage @ Google
• 200+ clusters
• Many clusters of 1000s of machines
• Pools of 1000s of clients
• 4+ PB Filesystems
• 40 GB/s read/write load
– (in the presence of frequent HW failures)
Google: Most Systems are Distributed Systems
• Distributed systems are a must:
– data, request volume or both are too large for single machine
• careful design about how to partition problems
• need high capacity systems even within a single datacenter
–multiple datacenters, all around the world
• almost all products deployed in multiple locations
–services used heavily even internally
• a web search touches 50+ separate services, 1000s of machines
Many Internal Services
• Simpler from a software engineering standpoint
– few dependencies, clearly specified (Protocol Buffers)
– easy to test new versions of individual services
– ability to run lots of experiments
• Development cycles largely decoupled
– lots of benefits: small teams can work independently
– easier to have many engineering offices around the world
Protocol Buffers
• Good protocol description language is vital
• Desired attributes:
– self-describing, multiple language support
– efficient to encode/decode (200+ MB/s), compact serialized form
Our solution: Protocol Buffers (in active use since 2000)
message SearchResult {
  required int32 estimated_results = 1;   // (1 is the tag number)
  optional string error_message = 2;
  repeated group Result = 3 {
    required float score = 4;
    required fixed64 docid = 5;
    optional message<WebResultDetails> = 6;
    …
  }
};
Protocol Buffers (cont)
• Automatically generated language wrappers
• Graceful client and server upgrades
– systems ignore tags they don't understand, but pass the information through
(no need to upgrade intermediate servers)
• Serialization/deserialization
– high performance (200+ MB/s encode/decode)
– fairly compact (uses variable length encodings)
– format used to store data persistently (not just for RPCs)
• Also allow service specifications:
service Search {
  rpc DoSearch(SearchRequest) returns (SearchResponse);
  rpc DoSnippets(SnippetRequest) returns (SnippetResponse);
  rpc Ping(EmptyMessage) returns (EmptyMessage) {
    { protocol=udp; };
  };
};
• Open source version: http://code.google.com/p/protobuf/
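As a rough illustration of the open-source library, here is a minimal Python sketch. The module name search_pb2 is hypothetical: it would be generated by protoc from a .proto file defining a SearchResult message like the one above, written in the open-source syntax.

```python
# Hypothetical usage of the open-source protobuf Python API. search_pb2 is
# assumed to be generated by protoc from a .proto defining SearchResult.
from search_pb2 import SearchResult

result = SearchResult()
result.estimated_results = 1000        # tag 1 on the wire
result.error_message = ""              # tag 2

data = result.SerializeToString()      # compact, varint-based binary encoding

decoded = SearchResult()
decoded.ParseFromString(data)          # readers skip tags they don't understand
print(decoded.estimated_results)
```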
Designing Efficient Systems
Given a basic problem definition, how do you choose the "best" solution?
• Best could be simplest, highest performance, easiest to extend, etc.
Important skill: ability to estimate performance of a system design
– without actually having to build it!
Numbers Everyone Should Know
L1 cache reference 0.5 ns
Branch mispredict 5 ns
L2 cache reference 7 ns
Mutex lock/unlock 25 ns
Main memory reference 100 ns
Compress 1K bytes with Zippy 3,000 ns
Send 2K bytes over 1 Gbps network 20,000 ns
Read 1 MB sequentially from memory 250,000 ns
Round trip within same datacenter 500,000 ns
Disk seek 10,000,000 ns
Read 1 MB sequentially from disk 20,000,000 ns
Send packet CA->Netherlands->CA 150,000,000 ns
Back of the Envelope Calculations
How long to generate image results page (30 thumbnails)?
Design 1: Read serially, thumbnail 256K images on the fly
  30 seeks * 10 ms/seek + 30 * 256K / 30 MB/s = 560 ms
Design 2: Issue reads in parallel:
  10 ms/seek + 256K read / 30 MB/s = 18 ms
  (Ignores variance, so really more like 30-60 ms, probably)
Lots of variations:
– caching (single images? whole sets of thumbnails?)
– pre-computing thumbnails
– …
Back of the envelope helps identify most promising…
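A minimal Python sketch of the same arithmetic, assuming 30 thumbnails of 256 KB each and the 30 MB/s per-image read/thumbnail rate used on this slide:

```python
# Back-of-the-envelope comparison of the two designs above.
DISK_SEEK_MS = 10.0
READ_RATE_MB_PER_S = 30.0
NUM_IMAGES = 30
IMAGE_MB = 256 / 1024.0                      # 256 KB in MB

read_ms = IMAGE_MB / READ_RATE_MB_PER_S * 1000.0

serial_ms = NUM_IMAGES * (DISK_SEEK_MS + read_ms)   # Design 1: one disk, in sequence
parallel_ms = DISK_SEEK_MS + read_ms                # Design 2: all reads in parallel

print(f"serial:   ~{serial_ms:.0f} ms")      # ~560 ms
print(f"parallel: ~{parallel_ms:.0f} ms")    # ~18 ms (ignoring variance)
```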
Write Microbenchmarks!
• Great to understand performance
– Builds intuition for back-of-the-envelope calculations
• Reduces cycle time to test performance improvements
Benchmark Time(ns) CPU(ns) Iterations
BM_VarintLength32/0 2 2 291666666
BM_VarintLength32Old/0 5 5 124660869
BM_VarintLength64/0 8 8 89600000
BM_VarintLength64Old/0 25 24 42164705
BM_VarintEncode32/0 7 7 80000000
BM_VarintEncode64/0 18 16 39822222
BM_VarintEncode64Old/0 24 22 31165217
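A rough stand-in for such a microbenchmark (not Google's internal framework): timing a varint-length routine with Python's timeit, in the spirit of the BM_VarintLength32 rows above.

```python
# Time a small routine in a tight loop and report nanoseconds per call.
import timeit

def varint_length32(value: int) -> int:
    """Bytes needed to encode value as a base-128 varint."""
    length = 1
    while value >= 0x80:
        value >>= 7
        length += 1
    return length

iterations = 1_000_000
seconds = timeit.timeit(lambda: varint_length32(300), number=iterations)
print(f"{seconds / iterations * 1e9:.1f} ns per call")
```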
Know Your Basic Building Blocks
Core language libraries, basic data structures,
protocol buffers, GFS, BigTable,
indexing systems, MySQL, MapReduce, …
Not just their interfaces, but understand their
implementations (at least at a high level)
If you don’t know what’s going on, you can’t do
decent back-of-the-envelope calculations!
Encoding Your Data
• CPUs are fast, memory/bandwidth are precious, ergo…
– Variable-length encodings
– Compression
– Compact in-memory representations
• Compression/encoding very important for many systems
– inverted index posting list formats
– storage systems for persistent data
• We have lots of core libraries in this area
– Many tradeoffs: space, encoding/decoding speed, etc. E.g.:
• Zippy: encode@300 MB/s, decode@600MB/s, 2-4X compression
• gzip: encode@25MB/s, decode@200MB/s, 4-6X compression
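A rough illustration of the space-vs-speed tradeoff, using Python's zlib at different compression levels as a stand-in (the Zippy and gzip figures above come from Google's own libraries):

```python
# Compare compression ratio and encode throughput at several zlib levels.
import time
import zlib

data = b"the quick brown fox jumps over the lazy dog " * 50_000   # ~2.2 MB

for level in (1, 6, 9):                        # fastest ... most thorough
    start = time.perf_counter()
    compressed = zlib.compress(data, level)
    elapsed = time.perf_counter() - start
    ratio = len(data) / len(compressed)
    mb_per_s = len(data) / (1 << 20) / elapsed
    print(f"level {level}: {ratio:.1f}x compression, {mb_per_s:.0f} MB/s encode")
```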
Designing & Building Infrastructure
Identify common problems, and build software systems to address them
in a general way
• Important not to try to be all things to all people
– Clients might be demanding 8 different things
– Doing 6 of them is easy
– …handling 7 of them requires real thought
– …dealing with all 8 usually results in a worse system
• more complex, compromises other clients in trying to satisfy everyone
Don't build infrastructure just for its own sake
• Identify common needs and address them
• Don't imagine unlikely potential needs that aren't really there
• Best approach: use your own infrastructure (especially at first!)
– (much more rapid feedback about what works, what doesn't)
Design for Growth
Try to anticipate how requirements will evolve
keep likely features in mind as you design base system
Ensure your design works if scale changes by 10X or 20X
but the right solution for X often not optimal for 100X
Interactive Apps: Design for Low Latency
• Aim for low avg. times (happy users!)
–90%ile and 99%ile also very important
–Think about how much data you’re shuffling around
• e.g. dozens of 1 MB RPCs per user request -> latency will be lousy
• Worry about variance!
–Redundancy or timeouts can help bring in the latency tail (a sketch follows this slide)
• Judicious use of caching can help
• Use higher priorities for interactive requests
• Parallelism helps!
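One common way to rein in the tail is a backup ("hedged") request after a short delay. The sketch below is an illustration of the idea, not Google's implementation; fetch is a stand-in for a real RPC and the timings are made up.

```python
# Send to one replica; if it is slow, race a second replica against it.
import concurrent.futures
import random
import time

pool = concurrent.futures.ThreadPoolExecutor(max_workers=8)

def fetch(replica: str, request: str) -> str:
    """Stand-in RPC: usually fast, occasionally stuck in the tail."""
    time.sleep(random.choice([0.002, 0.002, 0.002, 0.2]))
    return f"{replica}: results for {request!r}"

def hedged_fetch(replicas, request, hedge_after_s=0.01):
    first = pool.submit(fetch, replicas[0], request)
    try:
        return first.result(timeout=hedge_after_s)
    except concurrent.futures.TimeoutError:
        # Primary is slow; issue a backup request and take whichever finishes first.
        second = pool.submit(fetch, replicas[1], request)
        done, _ = concurrent.futures.wait(
            [first, second], return_when=concurrent.futures.FIRST_COMPLETED)
        return done.pop().result()

print(hedged_fetch(["replica-a", "replica-b"], "cute puppies"))
```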
Making Applications Robust Against Failures
Canary requests
Failover to other replicas/datacenters
Bad backend detection (sketched below):
stop using a backend for live requests until its behavior gets better
More aggressive load balancing when imbalance is more severe
Make your apps do something reasonable even if not all is right
– Better to give users limited functionality than an error page
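A toy sketch of bad-backend detection with failover and graceful degradation (an illustration of the ideas above, not Google's system; all names are made up):

```python
# Track errors per backend; take a flaky backend out of rotation for a while,
# and fall back to limited functionality if nothing healthy answers.
import time

class BackendTracker:
    def __init__(self, error_threshold=5, retry_after_s=30.0):
        self.error_threshold = error_threshold
        self.retry_after_s = retry_after_s
        self.errors = 0
        self.disabled_until = 0.0

    def usable(self) -> bool:
        return time.monotonic() >= self.disabled_until

    def record(self, ok: bool) -> None:
        if ok:
            self.errors = 0
            return
        self.errors += 1
        if self.errors >= self.error_threshold:
            # Stop using this backend for live requests for a while.
            self.disabled_until = time.monotonic() + self.retry_after_s
            self.errors = 0

def call_with_failover(backends, do_rpc):
    """Try healthy backends in order; degrade gracefully if all fail."""
    for backend, tracker in backends:          # list of (address, BackendTracker)
        if not tracker.usable():
            continue
        try:
            reply = do_rpc(backend)
            tracker.record(ok=True)
            return reply
        except Exception:
            tracker.record(ok=False)
    return None   # caller shows limited functionality instead of an error page
```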
Add Sufficient Monitoring/Status/Debugging Hooks
All our servers:
• Export HTML-based status pages for easy diagnosis
• Export a collection of key-value pairs via a standard interface
– monitoring systems periodically collect this from running servers
• RPC subsystem collects sample of all requests, all error requests, all
requests >0.0s, >0.05s, >0.1s, >0.5s, >1s, etc.
• Support low-overhead online profiling
– cpu profiling
– memory profiling
– lock contention profiling
If your system is slow or misbehaving, can you figure out why?
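A minimal sketch of exporting key-value status over HTTP so a monitoring system can scrape it periodically; the counter names, port, and JSON format here are assumptions, not Google's internal interface.

```python
# Serve the process's counters as JSON on a status port.
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

COUNTERS = {"requests": 0, "errors": 0, "rpc_latency_ms_p99": 0.0}
LOCK = threading.Lock()

class StatusHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        with LOCK:
            body = json.dumps(COUNTERS).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

def serve_status(port: int = 8080) -> None:
    """Blocks; a real server would run this in a background thread."""
    HTTPServer(("", port), StatusHandler).serve_forever()
```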
MapReduce
• A simple programming model that applies to many large-scale
computing problems
• Hide messy details in MapReduce runtime library:
– automatic parallelization
– load balancing
– network and disk transfer optimizations
– handling of machine failures
– robustness
– improvements to core library benefit all users of library!
Typical problem solved by MapReduce
•Read a lot of data
•Map: extract something you care about from each record
•Shuffle and Sort
•Reduce: aggregate, summarize, filter, or transform
•Write the results
Outline stays the same,
map and reduce change to fit the problem
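A toy, single-machine rendering of that outline, with word count standing in for the problem; the real library runs the same shape of program across thousands of machines.

```python
# Read -> Map -> Shuffle/Sort -> Reduce -> Write, all in one process.
from collections import defaultdict

def map_phase(records):
    for record in records:                 # Read a lot of data
        for word in record.split():        # Map: extract what you care about
            yield word, 1

def shuffle(pairs):
    groups = defaultdict(list)             # Shuffle and sort: group values by key
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate/summarize each group; here, count occurrences.
    return {key: sum(values) for key, values in sorted(groups.items())}

records = ["the cat sat on the mat", "the dog sat"]
print(reduce_phase(shuffle(map_phase(records))))   # {'cat': 1, 'dog': 1, ...}
```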
Example: Rendering Map Tiles
[Figure: Input is a geographic feature list (I-5, Lake Washington, WA-520, I-90, …); Map emits each feature to all overlapping latitude-longitude rectangles, e.g. (0, I-5), (0, Lake Wash.), (0, WA-520), (1, I-90), (1, I-5), (1, Lake Wash.); Shuffle sorts by key (key = rectangle id); Reduce renders each tile using the data for all enclosed features; Output is the rendered tiles]
Parallel MapReduce
[Figure: input data is split across many Map tasks; a Master coordinates; each Shuffle feeds a Reduce task; output is partitioned across the Reduce tasks]
For large enough problems, it’s more about disk and network performance than CPU & DRAM
MapReduce Usage Statistics Over Time
                                  Aug ’04    Mar ’06    Sep ’07    Sep ’09
Number of jobs                        29K       171K     2,217K     3,467K
Average completion time (secs)        634        874        395        475
Machine years used                    217      2,002     11,081     25,562
Input data read (TB)                3,288     52,254    403,152    544,130
Intermediate data (TB)                758      6,743     34,774     90,120
Output data written (TB)              193      2,970     14,018     57,520
Average worker machines               157        268        394        488
MapReduce in Practice
•Abstract input and output interfaces
–lots of MR operations don’t just read/write simple files
• B-tree files
• memory-mapped key-value stores
• complex inverted index file formats
• BigTable tables
• SQL databases, etc.
• ...
•Low-level MR interfaces are in terms of byte arrays
–Hardly ever use textual formats, though: slow, hard to parse
–Most input & output is in encoded Protocol Buffer format
• See “MapReduce: A Flexible Data Processing Tool” (to appear in
upcoming CACM)
BigTable: Motivation
• Lots of (semi-)structured data at Google
– URLs:
  • Contents, crawl metadata, links, anchors, pagerank, …
– Per-user data:
  • User preference settings, recent queries/search results, …
– Geographic locations:
  • Physical entities (shops, restaurants, etc.), roads, satellite image data, user annotations, …
• Scale is large
– billions of URLs, many versions/page (~20K/version)
– Hundreds of millions of users, thousands of q/sec
– 100TB+ of satellite image data
Basic Data Model
• Distributed multi-dimensional sparse map
  (row, column, timestamp) → cell contents
[Figure: row “www.cnn.com”, column “contents:”, holding timestamped versions t3, t11, t17 of “<html>…”]
• Rows are ordered lexicographically
• Good match for most of our applications
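An illustrative in-memory model of the (row, column, timestamp) → contents map above; this is a sketch of the data model only, not of Bigtable's storage engine, and the class and method names are made up.

```python
# Sparse, multi-versioned cells keyed by (row, column, timestamp).
from collections import defaultdict

class ToyTable:
    def __init__(self):
        # Sparse: only cells that were actually written take any space.
        self._cells = defaultdict(dict)    # (row, column) -> {timestamp: value}

    def put(self, row, column, timestamp, value):
        self._cells[(row, column)][timestamp] = value

    def get(self, row, column, timestamp=None):
        versions = self._cells.get((row, column), {})
        if not versions:
            return None
        ts = max(versions) if timestamp is None else timestamp
        return versions.get(ts)

    def scan(self, start_row, end_row):
        """Yield cells for rows in [start_row, end_row), in lexicographic row order."""
        for (row, column), versions in sorted(self._cells.items()):
            if start_row <= row < end_row:
                yield row, column, versions

t = ToyTable()
t.put("www.cnn.com", "contents:", 11, "<older html>")
t.put("www.cnn.com", "contents:", 17, "<html>…")
print(t.get("www.cnn.com", "contents:"))   # latest version: "<html>…"
```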
Tablets & Splitting
[Figure: a table with rows “aaa.com”, “cnn.com”, “cnn.com/sports.html”, …, “website.com”, “yahoo.com/kids.html”, “yahoo.com/kids.html0”, …, “zuppa.com/menu.html” and columns “contents:” (“<html>…”) and “language:” (EN); the sorted row space is partitioned into tablets, and a growing tablet is split at a row boundary such as “yahoo.com/kids.html0”]
BigTable System Structure
[Figure: a Bigtable cell consists of a Bigtable master (performs metadata ops + load balancing) and many Bigtable tablet servers (serve data), built on the cluster scheduling system (handles failover, monitoring), GFS (holds tablet data, logs), and the lock service (holds metadata, handles master-election); Bigtable clients use the Bigtable client library: Open() consults the lock service, metadata ops go to the master, and reads/writes go directly to tablet servers]
BigTable Status
• Design/initial implementation started beginning of 2004
• Production use or active development for 100+ projects:
– Google Print
– My Search History
– Orkut
– Crawling/indexing pipeline
– Google Maps/Google Earth
– Blogger
– …
• Currently ~500 BigTable clusters
• Largest cluster:
–70+ PB data; sustained: 10M ops/sec; 30+ GB/s I/O
BigTable: What’s New Since OSDI’06?
• Lots of work on scaling
• Service clusters, managed by dedicated team
• Improved performance isolation
–fair-share scheduler within each server, better
accounting of memory used per user (caches, etc.)
–can partition servers within a cluster for different users
or tables
• Improved protection against corruption
–many small changes
–e.g. immediately read results of every compaction,
compare with CRC.
• Catches ~1 corruption/5.4 PB of data compacted
BigTable Replication (New Since OSDI’06)
• Configured on a per-table basis
• Typically used to replicate data to multiple bigtable
clusters in different data centers
• Eventual consistency model: writes to table in one cluster
eventually appear in all configured replicas
• Nearly all user-facing production uses of BigTable use
replication
BigTable Coprocessors (New Since OSDI’06)
• Arbitrary code that runs next to each tablet in a table
–as tablets split and move, coprocessor code
automatically splits/moves too
• High-level call interface for clients
–Unlike RPC, calls addressed to rows or ranges of rows
• coprocessor client library resolves to actual locations
–Calls across multiple rows automatically split into multiple
parallelized RPCs
• Very flexible model for building distributed services
– automatic scaling, load balancing, request routing for apps
Example Coprocessor Uses
• Scalable metadata management for Colossus (next gen
GFS-like file system)
• Distributed language model serving for machine
translation system
• Distributed query processing for full-text indexing support
• Regular expression search support for code repository
Current Work: Spanner
• Storage & computation system that spans all our datacenters
– single global namespace
• Names are independent of location(s) of data
• Similarities to Bigtable: tables, families, locality groups, coprocessors, ...
• Differences: hierarchical directories instead of rows, fine-grained replication
• Fine-grained ACLs, replication configuration at the per-directory level
– support mix of strong and weak consistency across datacenters
• Strong consistency implemented with Paxos across tablet replicas
• Full support for distributed transactions across directories/machines
– much more automated operation
• system automatically moves and adds replicas of data and computation based
on constraints and usage patterns
• automated allocation of resources across entire fleet of machines
Design Goals for Spanner
• Future scale: ~10^6 to 10^7 machines, ~10^13 directories, ~10^18 bytes of storage, spread at 100s to 1000s of locations around the world, ~10^9 client machines
– zones of semi-autonomous control
– consistency after disconnected operation
– users specify high-level desires:
“99%ile latency for accessing this data should be <50ms”
“Store this data on at least 2 disks in EU, 2 in U.S. & 1 in Asia”
Adaptivity in World-Wide Systems
• Challenge: automatic, dynamic world-wide placement of
data & computation to minimize latency and/or cost, given
constraints on:
– bandwidth
– packet loss
– power
– resource usage
– failure modes
– ...
• Users specify high-level desires:
“99%ile latency for accessing this data should be <50ms”
“Store this data on at least 2 disks in EU, 2 in U.S. & 1 in Asia”
Building Applications on top of Weakly
Consistent Storage Systems
• Many applications need state replicated across a wide area
– For reliability and availability
• Two main choices:
– consistent operations (e.g. use Paxos)
• often imposes additional latency for common case
– inconsistent operations
• better performance/availability, but apps harder to write and reason about in
this model
• Many apps need to use a mix of both of these:
– e.g. Gmail: marking a message as read is asynchronous, sending a
message is a heavier-weight consistent operation
Building Applications on top of Weakly
Consistent Storage Systems
• Challenge: General model of consistency choices,
explained and codified
– ideally would have one or more “knobs” controlling performance vs.
consistency
– “knob” would provide easy-to-understand tradeoffs
• Challenge: Easy-to-use abstractions for resolving
conflicting updates to multiple versions of a piece of state
– Useful for reconciling client state with servers after disconnected
operation
– Also useful for reconciling replicated state in different data centers
after repairing a network partition
Thanks! Questions...?
Further reading:
• Ghemawat, Gobioff, & Leung. Google File System, SOSP 2003.
• Barroso, Dean, & Hölzle. Web Search for a Planet: The Google Cluster Architecture, IEEE Micro, 2003.
• Dean & Ghemawat. MapReduce: Simplified Data Processing on Large Clusters, OSDI 2004.
• Chang, Dean, Ghemawat, Hsieh, Wallach, Burrows, Chandra, Fikes, & Gruber. Bigtable: A Distributed
Storage System for Structured Data, OSDI 2006.
• Burrows. The Chubby Lock Service for Loosely-Coupled Distributed Systems. OSDI 2006.
• Pinheiro, Weber, & Barroso. Failure Trends in a Large Disk Drive Population. FAST 2007.
• Brants, Popat, Xu, Och, & Dean. Large Language Models in Machine Translation, EMNLP 2007.
• Barroso & Hölzle. The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale
Machines, Morgan & Claypool Synthesis Series on Computer Architecture, 2009.
• Malewicz et al. Pregel: A System for Large-Scale Graph Processing. PODC, 2009.
• Schroeder, Pinheiro, & Weber. DRAM Errors in the Wild: A Large-Scale Field Study. SIGMETRICS 2009.
• Protocol Buffers. http://code.google.com/p/protobuf/
These and many more available at: http://labs.google.com/papers.html

  • 15. Understanding fault statistics matters Can we expect faults to be independent or correlated? Are there common failure patterns we should program around?
  • 16. • Cluster is 1000s of machines, typically one or handful of configurations • File system (GFS) + Cluster scheduling system are core services • Typically 100s to 1000s of active jobs (some w/1 task, some w/1000s) • mix of batch and low-latency, user-facing production jobs Google Cluster Environment ... Linux chunk server scheduling slave Commodity HW job 1 task job 3 task job 12 task Linux chunk server scheduling slave Commodity HW job 7 task job 3 task job 5 task Machine 1 ... Machine N scheduling master GFS master Chubby lock service
  • 17. GFS Design • Master manages metadata • Data transfers happen directly between clients/chunkservers • Files broken into chunks (typically 64 MB). Diagram: clients and misc. servers contact the masters (GFS Master plus replicas) for metadata and read/write chunk data directly from the chunkservers; Chunkserver 1 holds C0 C1 C2 C5, Chunkserver 2 holds C1 C3 C5, …, Chunkserver N holds C0 C2 C5.
  • 18. • 200+ clusters • Many clusters of 1000s of machines • Pools of 1000s of clients • 4+ PB Filesystems • 40 GB/s read/write load – (in the presence of frequent HW failures) GFS Usage @ Google
  • 19. Google: Most Systems are Distributed Systems • Distributed systems are a must: – data, request volume or both are too large for single machine • careful design about how to partition problems • need high capacity systems even within a single datacenter –multiple datacenters, all around the world • almost all products deployed in multiple locations –services used heavily even internally • a web search touches 50+ separate services, 1000s machines
  • 20. Many Internal Services • Simpler from a software engineering standpoint – few dependencies, clearly specified (Protocol Buffers) – easy to test new versions of individual services – ability to run lots of experiments • Development cycles largely decoupled – lots of benefits: small teams can work independently – easier to have many engineering offices around the world
  • 21. Protocol Buffers • Good protocol description language is vital • Desired attributes: – self-describing, multiple language support – efficient to encode/decode (200+ MB/s), compact serialized form Our solution: Protocol Buffers (in active use since 2000)
  message SearchResult {
    required int32 estimated_results = 1;  // (1 is the tag number)
    optional string error_message = 2;
    repeated group Result = 3 {
      required float score = 4;
      required fixed64 docid = 5;
      optional message<WebResultDetails> = 6;
      …
    }
  };
  • 22. Protocol Buffers (cont) • Automatically generated language wrappers • Graceful client and server upgrades – systems ignore tags they don't understand, but pass the information through (no need to upgrade intermediate servers) • Serialization/deserialization – high performance (200+ MB/s encode/decode) – fairly compact (uses variable length encodings) – format used to store data persistently (not just for RPCs) • Also allow service specifications:
  service Search {
    rpc DoSearch(SearchRequest) returns (SearchResponse);
    rpc DoSnippets(SnippetRequest) returns (SnippetResponse);
    rpc Ping(EmptyMessage) returns (EmptyMessage) {
      { protocol=udp; };
    };
  };
  • Open source version: http://code.google.com/p/protobuf/
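To make the "variable length encodings" above concrete, here is a minimal sketch of the base-128 varint scheme that protocol buffers use on the wire for integer fields. It is illustrative Python only; the 200+ MB/s figures above refer to the generated C++ code.

```python
def encode_varint(value: int) -> bytes:
    """Base-128 varint: 7 data bits per byte, MSB set on all but the last byte.
    Assumes value >= 0."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)   # more bytes follow
        else:
            out.append(byte)
            return bytes(out)

def decode_varint(data: bytes) -> tuple[int, int]:
    """Returns (value, number_of_bytes_consumed)."""
    value, shift = 0, 0
    for i, byte in enumerate(data):
        value |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return value, i + 1
        shift += 7
    raise ValueError("truncated varint")

assert encode_varint(300) == b"\xac\x02"       # the classic protobuf example
assert decode_varint(b"\xac\x02") == (300, 2)
```

Small values cost one byte, large values more, which is why variable-length integers keep serialized messages compact.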
  • 23. Designing Efficient Systems Given a basic problem definition, how do you choose the "best" solution? • Best could be simplest, highest performance, easiest to extend, etc. Important skill: ability to estimate performance of a system design – without actually having to build it!
  • 24. Numbers Everyone Should Know
  L1 cache reference                         0.5 ns
  Branch mispredict                            5 ns
  L2 cache reference                           7 ns
  Mutex lock/unlock                           25 ns
  Main memory reference                      100 ns
  Compress 1K bytes with Zippy             3,000 ns
  Send 2K bytes over 1 Gbps network       20,000 ns
  Read 1 MB sequentially from memory     250,000 ns
  Round trip within same datacenter      500,000 ns
  Disk seek                           10,000,000 ns
  Read 1 MB sequentially from disk    20,000,000 ns
  Send packet CA->Netherlands->CA    150,000,000 ns
  • 25. Back of the Envelope Calculations How long to generate image results page (30 thumbnails)? Design 1: Read serially, thumbnail 256K images on the fly 30 seeks * 10 ms/seek + 30 * 256K / 30 MB/s = 560 ms
  • 26. Back of the Envelope Calculations How long to generate image results page (30 thumbnails)? Design 1: Read serially, thumbnail 256K images on the fly 30 seeks * 10 ms/seek + 30 * 256K / 30 MB/s = 560 ms Design 2: Issue reads in parallel: 10 ms/seek + 256K read / 30 MB/s = 18 ms (Ignores variance, so really more like 30-60 ms, probably)
  • 27. Back of the Envelope Calculations How long to generate image results page (30 thumbnails)? Design 1: Read serially, thumbnail 256K images on the fly 30 seeks * 10 ms/seek + 30 * 256K / 30 MB/s = 560 ms Design 2: Issue reads in parallel: 10 ms/seek + 256K read / 30 MB/s = 18 ms (Ignores variance, so really more like 30-60 ms, probably) Lots of variations: – caching (single images? whole sets of thumbnails?) – pre-computing thumbnails – … Back of the envelope helps identify most promising…
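These estimates are easy to script. The sketch below plugs in the same assumed constants (10 ms per seek, 30 MB/s sequential disk read, 256 KB per image, 30 images) and reproduces the figures above.

```python
DISK_SEEK_S = 10e-3          # 10 ms per seek
DISK_READ_BPS = 30e6         # 30 MB/s sequential read, as assumed on the slide
IMAGE_BYTES = 256 * 1024     # 256 KB source image per thumbnail
N_IMAGES = 30

# Design 1: read the 30 images serially, thumbnailing on the fly.
serial = N_IMAGES * (DISK_SEEK_S + IMAGE_BYTES / DISK_READ_BPS)

# Design 2: issue all 30 reads in parallel (ignoring variance and contention).
parallel = DISK_SEEK_S + IMAGE_BYTES / DISK_READ_BPS

print(f"serial:   {serial * 1e3:.0f} ms")    # ~560 ms
print(f"parallel: {parallel * 1e3:.0f} ms")  # ~19 ms; the slide rounds to 18 ms
```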
  • 28. Write Microbenchmarks! • Great to understand performance – Builds intuition for back-of-the-envelope calculations • Reduces cycle time to test performance improvements
  Benchmark                 Time(ns)  CPU(ns)  Iterations
  BM_VarintLength32/0              2        2   291666666
  BM_VarintLength32Old/0           5        5   124660869
  BM_VarintLength64/0              8        8    89600000
  BM_VarintLength64Old/0          25       24    42164705
  BM_VarintEncode32/0              7        7    80000000
  BM_VarintEncode64/0             18       16    39822222
  BM_VarintEncode64Old/0          24       22    31165217
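The table above comes from an internal C++ microbenchmark harness. As a rough stand-in for the idea, a few lines of Python with timeit time one small operation in a tight loop; varint_length here is a hypothetical example function, not the code actually benchmarked above, and the lambda call overhead makes this far cruder than a real harness.

```python
import timeit

def varint_length(value: int) -> int:
    """Number of bytes a base-128 varint encoding of `value` would use."""
    n = 1
    while value >= 0x80:
        value >>= 7
        n += 1
    return n

iterations = 1_000_000
total = timeit.timeit(lambda: varint_length(1 << 40), number=iterations)
print(f"varint_length(1 << 40): {total / iterations * 1e9:.1f} ns/op")
```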
  • 29. Know Your Basic Building Blocks Core language libraries, basic data structures, protocol buffers, GFS, BigTable, indexing systems, MySQL, MapReduce, … Not just their interfaces, but understand their implementations (at least at a high level) If you don’t know what’s going on, you can’t do decent back-of-the-envelope calculations!
  • 30. Encoding Your Data • CPUs are fast, memory/bandwidth are precious, ergo… – Variable-length encodings – Compression – Compact in-memory representations • Compression/encoding very important for many systems – inverted index posting list formats – storage systems for persistent data • We have lots of core libraries in this area – Many tradeoffs: space, encoding/decoding speed, etc. E.g.: • Zippy: encode@300 MB/s, decode@600MB/s, 2-4X compression • gzip: encode@25MB/s, decode@200MB/s, 4-6X compression
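Zippy and gzip are the libraries quoted above. To see the same space-versus-speed tradeoff using nothing but the Python standard library, one can compare zlib at different compression levels; this is a rough illustration only, and the numbers depend on machine and corpus rather than matching the figures above.

```python
import time
import zlib

data = b"some repetitive log line about a request to /search?q=weather\n" * 20000

for level in (1, 6, 9):   # 1 ~ fast/light compression, 9 ~ slow/tight compression
    start = time.perf_counter()
    compressed = zlib.compress(data, level)
    elapsed = time.perf_counter() - start
    mb_per_s = len(data) / 1e6 / elapsed
    print(f"level {level}: {len(data) / len(compressed):4.1f}x smaller, "
          f"{mb_per_s:6.0f} MB/s encode")
```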
  • 31. Designing & Building Infrastructure Identify common problems, and build software systems to address them in a general way • Important not to try to be all things to all people – Clients might be demanding 8 different things – Doing 6 of them is easy – …handling 7 of them requires real thought – …dealing with all 8 usually results in a worse system • more complex, compromises other clients in trying to satisfy everyone Don't build infrastructure just for its own sake • Identify common needs and address them • Don't imagine unlikely potential needs that aren't really there • Best approach: use your own infrastructure (especially at first!) – (much more rapid feedback about what works, what doesn't)
  • 32. Design for Growth Try to anticipate how requirements will evolve keep likely features in mind as you design base system Ensure your design works if scale changes by 10X or 20X but the right solution for X often not optimal for 100X
  • 33. Interactive Apps: Design for Low Latency • Aim for low avg. times (happy users!) –90%ile and 99%ile also very important –Think about how much data you’re shuffling around • e.g. dozens of 1 MB RPCs per user request -> latency will be lousy • Worry about variance! –Redundancy or timeouts can help bring in latency tail • Judicious use of caching can help • Use higher priorities for interactive requests • Parallelism helps!
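One concrete form of "redundancy or timeouts can help bring in latency tail" is a hedged (backup) request: if the primary replica has not answered within roughly the tail-latency budget, issue the same request to another replica and take whichever answer arrives first. A minimal asyncio sketch, with the simulated replica latencies and the 50 ms hedge deadline as made-up placeholders:

```python
import asyncio

async def query_replica(name: str, delay_s: float) -> str:
    # Stand-in for an RPC to one replica; `delay_s` simulates its latency.
    await asyncio.sleep(delay_s)
    return f"result from {name}"

async def hedged_request(hedge_after_s: float = 0.05) -> str:
    primary = asyncio.create_task(query_replica("primary", delay_s=0.20))
    try:
        # Fast path: primary answers before the hedging deadline.
        return await asyncio.wait_for(asyncio.shield(primary), timeout=hedge_after_s)
    except asyncio.TimeoutError:
        backup = asyncio.create_task(query_replica("backup", delay_s=0.03))
        done, pending = await asyncio.wait(
            {primary, backup}, return_when=asyncio.FIRST_COMPLETED)
        for task in pending:
            task.cancel()              # drop the slower request
        return done.pop().result()

print(asyncio.run(hedged_request()))   # typically "result from backup"
```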
  • 34. Making Applications Robust Against Failures Canary requests Failover to other replicas/datacenters Bad backend detection: stop using for live requests until behavior gets better More aggressive load balancing when imbalance is more severe Make your apps do something reasonable even if not all is right – Better to give users limited functionality than an error page
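A simple way to picture "bad backend detection: stop using for live requests until behavior gets better" is a per-backend health tracker that quarantines a backend when its recent error rate is high and retries it after a cool-off. The sketch below is an illustrative policy with made-up thresholds, not a description of the actual mechanism.

```python
import time

class BackendHealth:
    """Toy bad-backend detector: quarantine on high recent error rate."""

    def __init__(self, error_threshold=0.5, window=20, cooloff_s=30.0):
        self.error_threshold = error_threshold
        self.window = window
        self.cooloff_s = cooloff_s
        self.recent = []              # last `window` outcomes: True = error
        self.quarantined_until = 0.0

    def record(self, was_error: bool):
        self.recent.append(was_error)
        self.recent = self.recent[-self.window:]
        if (len(self.recent) == self.window and
                sum(self.recent) / self.window >= self.error_threshold):
            self.quarantined_until = time.monotonic() + self.cooloff_s
            self.recent.clear()

    def usable_for_live_traffic(self) -> bool:
        return time.monotonic() >= self.quarantined_until

backend = BackendHealth()
for _ in range(20):
    backend.record(was_error=True)
print(backend.usable_for_live_traffic())   # False until the cool-off expires
```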
  • 35. Add Sufficient Monitoring/Status/Debugging Hooks All our servers: • Export HTML-based status pages for easy diagnosis • Export a collection of key-value pairs via a standard interface – monitoring systems periodically collect this from running servers • RPC subsystem collects sample of all requests, all error requests, all requests >0.0s, >0.05s, >0.1s, >0.5s, >1s, etc. • Support low-overhead online profiling – cpu profiling – memory profiling – lock contention profiling If your system is slow or misbehaving, can you figure out why?
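As an illustration of "export a collection of key-value pairs via a standard interface", the sketch below serves a plain-text status page from a toy server using only the Python standard library; the /varz path and the counter names are assumptions made for the example.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
import threading

# Counters the serving code would increment; the names here are made up.
metrics = {"requests": 0, "errors": 0, "cache_hits": 0}
metrics_lock = threading.Lock()

class StatusHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/varz":
            self.send_error(404)
            return
        with metrics_lock:
            body = "\n".join(f"{k} {v}" for k, v in sorted(metrics.items()))
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.end_headers()
        self.wfile.write(body.encode())

if __name__ == "__main__":
    # A monitoring system would scrape this page periodically.
    HTTPServer(("", 8080), StatusHandler).serve_forever()
```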
  • 36. MapReduce • A simple programming model that applies to many large-scale computing problems • Hide messy details in MapReduce runtime library: – automatic parallelization – load balancing – network and disk transfer optimizations – handling of machine failures – robustness – improvements to core library benefit all users of library!
  • 37. Typical problem solved by MapReduce •Read a lot of data •Map: extract something you care about from each record •Shuffle and Sort •Reduce: aggregate, summarize, filter, or transform •Write the results Outline stays the same, map and reduce change to fit the problem
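The outline maps directly onto a tiny single-process sketch, with word count as a stand-in problem; the real runtime, of course, distributes the map tasks, the shuffle, and the reduce tasks across thousands of machines and handles their failures.

```python
from collections import defaultdict

def map_phase(record: str):
    # "extract something you care about from each record"
    for word in record.split():
        yield (word.lower(), 1)

def shuffle(mapped_pairs):
    # "shuffle and sort": group all values by key
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return sorted(groups.items())

def reduce_phase(key, values):
    # "aggregate, summarize, filter, or transform"
    return key, sum(values)

records = ["the quick brown fox", "the lazy dog", "the fox"]
mapped = (pair for record in records for pair in map_phase(record))
results = [reduce_phase(k, vs) for k, vs in shuffle(mapped)]
print(results)
# [('brown', 1), ('dog', 1), ('fox', 2), ('lazy', 1), ('quick', 1), ('the', 3)]
```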
  • 38. Example: Rendering Map Tiles. Input: geographic feature list (I-5, Lake Washington, WA-520, I-90). Map: emit each feature to all overlapping latitude-longitude rectangles, e.g. (0, I-5), (0, Lake Wash.), (0, WA-520), (1, I-90), (1, I-5), (1, Lake Wash.). Shuffle: sort by key (key = rectangle id), grouping the pairs for rectangle 0 and rectangle 1. Reduce: render each tile using the data for all enclosed features. Output: rendered tiles.
  • 39.–40. Parallel MapReduce: input data is split across many Map tasks; a Shuffle phase routes each key to one of the Reduce tasks, which write the partitioned output; a Master coordinates the whole job. For large enough problems, it’s more about disk and network performance than CPU & DRAM.
  • 41. MapReduce Usage Statistics Over Time
                                  Aug ’04   Mar ’06   Sep ’07   Sep ’09
  Number of jobs                      29K      171K    2,217K    3,467K
  Average completion time (secs)      634       874       395       475
  Machine years used                  217     2,002    11,081    25,562
  Input data read (TB)              3,288    52,254   403,152   544,130
  Intermediate data (TB)              758     6,743    34,774    90,120
  Output data written (TB)            193     2,970    14,018    57,520
  Average worker machines             157       268       394       488
  • 42. MapReduce in Practice •Abstract input and output interfaces –lots of MR operations don’t just read/write simple files • B-tree files • memory-mapped key-value stores • complex inverted index file formats • BigTable tables • SQL databases, etc. • ... •Low-level MR interfaces are in terms of byte arrays –Hardly ever use textual formats, though: slow, hard to parse –Most input & output is in encoded Protocol Buffer format • See “MapReduce: A Flexible Data Processing Tool” (to appear in upcoming CACM)
  • 43. • Lots of (semi-)structured data at Google – URLs: • Contents, crawl metadata, links, anchors, pagerank, … – Per-user data: • User preference settings, recent queries/search results, … – Geographic locations: • Physical entities (shops, restaurants, etc.), roads, satellite image data, user annotations, … • Scale is large – billions of URLs, many versions/page (~20K/version) – Hundreds of millions of users, thousands of q/sec – 100TB+ of satellite image data BigTable: Motivation
  • 44.–48. Basic Data Model (built up over these slides) • Distributed multi-dimensional sparse map: (row, column, timestamp) → cell contents. Example: row “www.cnn.com”, column “contents:”, with timestamped versions t3, t11, t17 each holding “<html>…” • Rows are ordered lexicographically • Good match for most of our applications
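Ignoring distribution, tablets, SSTables, and garbage collection of old versions, the data model itself can be sketched as a sorted in-memory map keyed by (row, column, timestamp), with newer timestamps returned first. This is an illustrative toy, not Bigtable's implementation.

```python
import bisect

class TinyTable:
    """(row, column, timestamp) -> value, with rows kept in lexicographic order."""

    def __init__(self):
        self._keys = []    # sorted list of (row, column, -timestamp) tuples
        self._cells = {}   # key tuple -> value

    def put(self, row: str, column: str, timestamp: int, value: bytes):
        key = (row, column, -timestamp)        # newest version sorts first
        if key not in self._cells:
            bisect.insort(self._keys, key)
        self._cells[key] = value

    def get(self, row: str, column: str, max_versions: int = 1):
        """Most recent version(s) of one cell, newest first."""
        lo = bisect.bisect_left(self._keys, (row, column, float("-inf")))
        out = []
        for key in self._keys[lo:lo + max_versions]:
            if key[:2] != (row, column):
                break
            out.append((-key[2], self._cells[key]))
        return out

t = TinyTable()
t.put("www.cnn.com", "contents:", timestamp=3, value=b"<html>v3")
t.put("www.cnn.com", "contents:", timestamp=17, value=b"<html>v17")
print(t.get("www.cnn.com", "contents:"))   # [(17, b'<html>v17')]
```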
  • 52.–62. BigTable System Structure (diagram built up over these slides): a Bigtable cell consists of one Bigtable master (performs metadata ops + load balancing) and many Bigtable tablet servers (each serves data). The cell is layered on a cluster scheduling system (handles failover, monitoring), GFS (holds tablet data, logs), and a lock service (holds metadata, handles master election). Clients link the Bigtable client library: Open() goes to the lock service, metadata ops go to the master, and reads/writes go directly to the tablet servers.
  • 63. BigTable Status • Design/initial implementation started beginning of 2004 • Production use or active development for 100+ projects: – Google Print – My Search History – Orkut – Crawling/indexing pipeline – Google Maps/Google Earth – Blogger – … • Currently ~500 BigTable clusters • Largest cluster: –70+ PB data; sustained: 10M ops/sec; 30+ GB/s I/O
  • 64. BigTable: What’s New Since OSDI’06? • Lots of work on scaling • Service clusters, managed by dedicated team • Improved performance isolation –fair-share scheduler within each server, better accounting of memory used per user (caches, etc.) –can partition servers within a cluster for different users or tables • Improved protection against corruption –many small changes –e.g. immediately read results of every compaction, compare with CRC. • Catches ~1 corruption/5.4 PB of data compacted
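"Immediately read results of every compaction, compare with CRC" boils down to write-then-verify. A toy sketch, with zlib.crc32 standing in for the real checksum machinery:

```python
import zlib

def write_compaction_output(path: str, payload: bytes) -> int:
    """Write a 4-byte CRC32 header followed by the payload; return the CRC."""
    crc = zlib.crc32(payload)
    with open(path, "wb") as f:
        f.write(crc.to_bytes(4, "big") + payload)
    return crc

def verify_compaction_output(path: str, expected_crc: int) -> bool:
    """Read the file straight back; check both the stored and recomputed CRCs."""
    with open(path, "rb") as f:
        stored_crc = int.from_bytes(f.read(4), "big")
        payload = f.read()
    return stored_crc == expected_crc == zlib.crc32(payload)

crc = write_compaction_output("/tmp/sstable.out", b"compacted cells ...")
assert verify_compaction_output("/tmp/sstable.out", crc)
```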
  • 65. BigTable Replication (New Since OSDI’06) • Configured on a per-table basis • Typically used to replicate data to multiple bigtable clusters in different data centers • Eventual consistency model: writes to table in one cluster eventually appear in all configured replicas • Nearly all user-facing production uses of BigTable use replication
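A toy model of the eventual consistency described above (purely illustrative, not Bigtable's replication mechanism): apply each write locally right away and ship it to the other clusters from a background loop, so remote replicas converge some time later.

```python
import queue
import threading

class ReplicatedCell:
    """Toy eventually-consistent replication: local apply + asynchronous shipping."""

    def __init__(self, name, peers=()):
        self.name = name
        self.data = {}
        self.peers = list(peers)
        self._outbox = queue.Queue()
        threading.Thread(target=self._ship_loop, daemon=True).start()

    def write(self, key, value):
        self.data[key] = value          # visible locally right away
        self._outbox.put((key, value))  # replicated to peers later

    def _apply_remote(self, key, value):
        self.data[key] = value

    def _ship_loop(self):
        while True:
            key, value = self._outbox.get()
            for peer in self.peers:     # in reality: batched, retried, ordered
                peer._apply_remote(key, value)

us = ReplicatedCell("us")
eu = ReplicatedCell("eu")
us.peers.append(eu)
us.write("row1", b"v1")
# eu.data may briefly lack "row1"; it converges once the shipper has run.
```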
  • 66. BigTable Coprocessors (New Since OSDI’06) • Arbitrary code that runs next to each tablet in the table –as tablets split and move, coprocessor code automatically splits/moves too • High-level call interface for clients –Unlike RPC, calls addressed to rows or ranges of rows • coprocessor client library resolves to actual locations –Calls across multiple rows automatically split into multiple parallelized RPCs • Very flexible model for building distributed services – automatic scaling, load balancing, request routing for apps
  • 67. Example Coprocessor Uses • Scalable metadata management for Colossus (next gen GFS-like file system) • Distributed language model serving for machine translation system • Distributed query processing for full-text indexing support • Regular expression search support for code repository
  • 68. Current Work: Spanner • Storage & computation system that spans all our datacenters – single global namespace • Names are independent of location(s) of data • Similarities to Bigtable: tables, families, locality groups, coprocessors, ... • Differences: hierarchical directories instead of rows, fine-grained replication • Fine-grained ACLs, replication configuration at the per-directory level – support mix of strong and weak consistency across datacenters • Strong consistency implemented with Paxos across tablet replicas • Full support for distributed transactions across directories/machines – much more automated operation • system automatically moves and adds replicas of data and computation based on constraints and usage patterns • automated allocation of resources across entire fleet of machines
  • 69. • Future scale: ~10^6 to 10^7 machines, ~10^13 directories, ~10^18 bytes of storage, spread at 100s to 1000s of locations around the world, ~10^9 client machines – zones of semi-autonomous control – consistency after disconnected operation – users specify high-level desires: “99%ile latency for accessing this data should be <50ms” “Store this data on at least 2 disks in EU, 2 in U.S. & 1 in Asia” Design Goals for Spanner
  • 70. Adaptivity in World-Wide Systems • Challenge: automatic, dynamic world-wide placement of data & computation to minimize latency and/or cost, given constraints on: – bandwidth – packet loss – power – resource usage – failure modes – ... • Users specify high-level desires: “99%ile latency for accessing this data should be <50ms” “Store this data on at least 2 disks in EU, 2 in U.S. & 1 in Asia”
  • 71. Building Applications on top of Weakly Consistent Storage Systems • Many applications need state replicated across a wide area – For reliability and availability • Two main choices: – consistent operations (e.g. use Paxos) • often imposes additional latency for common case – inconsistent operations • better performance/availability, but apps harder to write and reason about in this model • Many apps need to use a mix of both of these: – e.g. Gmail: marking a message as read is asynchronous, sending a message is a heavier-weight consistent operation
  • 72. Building Applications on top of Weakly Consistent Storage Systems • Challenge: General model of consistency choices, explained and codified – ideally would have one or more “knobs” controlling performance vs. consistency – “knob” would provide easy-to-understand tradeoffs • Challenge: Easy-to-use abstractions for resolving conflicting updates to multiple versions of a piece of state – Useful for reconciling client state with servers after disconnected operation – Also useful for reconciling replicated state in different data centers after repairing a network partition
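One simple shape such an abstraction can take, echoing the Gmail example above (illustrative only): a per-field merge function that reconciles two divergent replicas once a partition heals or a disconnected client reconnects, using set union where replaying an operation is safe and last-writer-wins elsewhere.

```python
def merge_mailbox(version_a: dict, version_b: dict) -> dict:
    """Reconcile two divergent replicas of a toy mailbox state.

    read_ids: merged with set union (marking a message read is safe to replay)
    labels:   per-message last-writer-wins using a (timestamp, value) pair
    """
    merged = {
        "read_ids": version_a["read_ids"] | version_b["read_ids"],
        "labels": dict(version_a["labels"]),
    }
    for msg_id, stamped in version_b["labels"].items():
        current = merged["labels"].get(msg_id)
        if current is None or stamped[0] > current[0]:
            merged["labels"][msg_id] = stamped   # newer timestamp wins
    return merged

a = {"read_ids": {1, 2}, "labels": {7: (100, "inbox")}}
b = {"read_ids": {2, 3}, "labels": {7: (120, "archive")}}
print(merge_mailbox(a, b))
# {'read_ids': {1, 2, 3}, 'labels': {7: (120, 'archive')}}
```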
  • 73. Thanks! Questions...? Further reading:
  • Ghemawat, Gobioff, & Leung. The Google File System, SOSP 2003.
  • Barroso, Dean, & Hölzle. Web Search for a Planet: The Google Cluster Architecture, IEEE Micro, 2003.
  • Dean & Ghemawat. MapReduce: Simplified Data Processing on Large Clusters, OSDI 2004.
  • Chang, Dean, Ghemawat, Hsieh, Wallach, Burrows, Chandra, Fikes, & Gruber. Bigtable: A Distributed Storage System for Structured Data, OSDI 2006.
  • Burrows. The Chubby Lock Service for Loosely-Coupled Distributed Systems, OSDI 2006.
  • Pinheiro, Weber, & Barroso. Failure Trends in a Large Disk Drive Population, FAST 2007.
  • Brants, Popat, Xu, Och, & Dean. Large Language Models in Machine Translation, EMNLP 2007.
  • Barroso & Hölzle. The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines, Morgan & Claypool Synthesis Series on Computer Architecture, 2009.
  • Malewicz et al. Pregel: A System for Large-Scale Graph Processing, PODC 2009.
  • Schroeder, Pinheiro, & Weber. DRAM Errors in the Wild: A Large-Scale Field Study, SIGMETRICS 2009.
  • Protocol Buffers. http://code.google.com/p/protobuf/
  These and many more available at: http://labs.google.com/papers.html