6. I assume you know
Storage services
Highly scalable
Highly available
Cloud storage
7. I assume you know
It brings the following data abstractions
Tables
Blobs
Drives
Queues
8. I assume you know
It is NOT a relational database
Storage services are exposed through a RESTful API
With some sugar on top (aka an SDK)
LINQ to tables support might make you think it is a database
9. This talk is based on an academic whitepaper
http://sigops.org/sosp/sosp11/current/2011-Cascais/printable/11-calder.pdf
10. Which discusses how
Storage services defy the rules of the CAP theorem
Distributed & scalable system
Highly available
Strongly Consistent
Partition tolerant
All at the same time!
11. Now how did they do that?
Deep Dive: Ready for a plunge?
14. Global namespace
DNS-based service location
http(s)://AccountName.Service.core.windows.net/PartitionName/ObjectName
DNS resolution
Datacenter (e.g. Western Europe)
Virtual IP of a Stamp
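You can watch this service location happen yourself by resolving a storage hostname; a minimal sketch ("myaccount" is a placeholder account name), where the address returned is the virtual IP of the stamp hosting the account:

using System;
using System.Net;

class ResolveStamp
{
    static void Main()
    {
        // "myaccount" is a placeholder; substitute an existing storage account.
        // DNS maps the account name to the virtual IP of the stamp hosting it.
        IPHostEntry entry = Dns.GetHostEntry("myaccount.blob.core.windows.net");
        foreach (IPAddress address in entry.AddressList)
        {
            Console.WriteLine(address); // the stamp's VIP in some datacenter
        }
    }
}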
15. Stamp
Scale unit
A Windows Azure deployment
3 layers
Typically 20 racks each
Roughly 360 XL instances
With disks attached
2 petabytes (before June 2012)
30 petabytes today
Intra-stamp replication
16. Incoming read request
Arrives at Front End
Routed to partition layer
http(s)://AccountName.Service.core.windows.net/PartitionName/ObjectName
Reads usually from memory
[Diagram: the request flows through the Front End layer (FE nodes) to the Partition layer (PS nodes)]
17. Incoming read request
2 nodes per request
Blazingly fast
Flat “Quantum 10” network topology
50 Tbps of bandwidth
10 Gbps / node
Faster than a typical SSD drive
18. Incoming Write request
Written to stream layer
Synchronous replication
1 Primary Extent Node
2 Secondary Extent Nodes
[Diagram: the write flows from the Front End layer through the Partition layer (PS nodes) into the Stream layer, landing on a primary Extent Node and two secondaries]
22. Front End Nodes
Stateless role instances
Authentication & Authorization
Shared Key (storage key)
Shared Key Lite
Shared Access Signatures
Routing to Partition Servers
Based on the PartitionMap (in-memory cache)
Versioning
x-ms-version: 2011-08-18
Throttling
23. Partition Map
Determine partition server
Used By Front End Nodes
Managed by Partition Manager (Partition Layer)
Ordered Dictionary
Key: http(s)://AccountName.Service.core.windows.net/PartitionName/ObjectName
Value: Partition Server
Split across partition servers
Range Partition
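A minimal sketch of such a range lookup (range boundaries and server names are invented for illustration; the real map keys are the full request URIs):

using System;
using System.Collections.Generic;
using System.Linq;

class PartitionMap
{
    // Inclusive upper bound of each key range -> the partition server serving it.
    // A SortedList keeps the ranges ordered, like the real ordered dictionary.
    private readonly SortedList<string, string> ranges =
        new SortedList<string, string>(StringComparer.Ordinal);

    public void AddRange(string upperBound, string server)
    {
        ranges.Add(upperBound, server);
    }

    // The server for a key is the first range whose upper bound is >= the key.
    public string Lookup(string key)
    {
        return ranges.First(r => string.CompareOrdinal(key, r.Key) <= 0).Value;
    }
}

class Demo
{
    static void Main()
    {
        var map = new PartitionMap();
        map.AddRange("H", "PS1");      // range A-H
        map.AddRange("R", "PS2");      // range H'-R
        map.AddRange("\uffff", "PS3"); // range R'-end
        Console.WriteLine(map.Lookup("Harry/Pictures/Sunrise")); // PS2
        Console.WriteLine(map.Lookup("Richard/Images/Soccer"));  // PS3
    }
}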
24. Practical examples
Blob
Each blob is a partition
Blob URI determines the range
Range only matters for listing
Same container: More likely same range
https://myaccount.blob.core.windows.net/mycontainer/subdir/file1.jpg
https://myaccount.blob.core.windows.net/mycontainer/subdir/file2.jpg
Other container: Less likely
https://myaccount.blob.core.windows.net/othercontainer/subdir/file1.jpg
25. Practical examples
Queue
Each queue is a partition
Queue name determines range
All messages same partition
Range only matters for listing
26. Practical examples
Table
Each group of rows with the same PartitionKey is a partition
Ordering within set by RowKey
Range matters a lot!
Expect continuation tokens!
27. Continuation tokens
Requests across partitions
All on the same partition server
  1 result
Across multiple ranges (or too many results)
  Sequential queries required
  1 result + continuation token
  Repeat the query with the token until done
Will almost never occur in development!
  As the dataset is typically small
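In production the consuming pattern is always the same loop, whatever SDK you use. A self-contained sketch with a fake page source (the real token is an opaque pair of continuation keys returned by the table service, not an int; FetchPage is an invented stand-in):

using System;
using System.Collections.Generic;
using System.Linq;

class ContinuationDemo
{
    // Stand-in for a table query: returns one page of rows plus the token
    // for the next page, or null when the result set is exhausted.
    static List<string> FetchPage(List<string> rows, int? token, int pageSize, out int? nextToken)
    {
        int start = token ?? 0;
        var page = rows.Skip(start).Take(pageSize).ToList();
        nextToken = start + page.Count < rows.Count ? start + page.Count : (int?)null;
        return page;
    }

    static void Main()
    {
        var rows = Enumerable.Range(1, 2500).Select(i => "row" + i).ToList();
        var results = new List<string>();
        int? token = null;
        do
        {
            // Repeat the query, feeding the token back in, until none returns.
            results.AddRange(FetchPage(rows, token, 1000, out token));
        } while (token != null);
        Console.WriteLine(results.Count); // 2500, gathered in three round trips
    }
}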
29. Partition Layer
Dynamic Scalable Object Index
Object Tables
System-defined data structures
Account, Entity, Blob, Message, Schema, PartitionMap
Dynamic Load Balancing
Monitor load to each part of the index to determine hot spots
Index is dynamically split into thousands of Index Range Partitions based on load
Index Range Partitions are automatically load balanced across servers to quickly adapt to changes in load
Updates every 15 seconds
30. Partition layer
Consists of
Partition Master
  Distributes Object Table partitions across Partition Servers
  Dynamic load balancing
Partition Server
  Serves data for exactly 1 Range Partition
  Manages streams
Lock Server
  Master election
  Partition lock
[Diagram: Partition layer with two Partition Masters (PM), two Lock Servers (LS) and a row of Partition Servers (PS)]
31. Dynamic Scalable Object Index
How it works
The Object Table is sorted by (Account, Container, Blob) and split into ranges, each served by one Partition Server:

Range  Account  Container  Blob
A-H    Aaaa     Aaaa       Aaaa
A-H    Harry    Pictures   Sunrise
A-H    Harry    Pictures   Sunset
H'-R   Richard  Images     Soccer
H'-R   Richard  Images     Tennis
R'-Z   Zzzz     Zzzz       zzzz

[Diagram: Front End nodes consult the partition map; PM/LS assign the ranges A-H, H'-R and R'-Z to three PS nodes]
32. Dynamic Load Balancing
Not in the critical path!
[Diagram: load balancing happens within the Partition layer (PM, LS, PS nodes); the Stream layer underneath (EN nodes: primary plus two secondaries) is untouched]
33. Partition Server
Architecture
Log Structured Merge-Tree
  Easy replication
  Append-only persisted streams
    Commit log stream
    Metadata log stream
    Row data stream (snapshots)
    Blob data stream (actual blob data)
  Caches in memory
    Memory table (= commit log)
    Row data
    Index (snapshot locations)
    Bloom filters
    Load metrics
[Diagram: the in-memory structures (memory table, row/index caches, bloom filters, load metrics) sit on top of the four persistent streams]
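The next slides trace reads, writes and checkpoints through this structure. As a toy model of those same paths (not the actual partition server code, just the log-structured idea):

using System;
using System.Collections.Generic;

class ToyLsmStore
{
    private readonly List<string> commitLog = new List<string>(); // stands in for the commit log stream
    private readonly SortedDictionary<string, string> memoryTable =
        new SortedDictionary<string, string>(StringComparer.Ordinal);
    private readonly SortedDictionary<string, string> snapshot =
        new SortedDictionary<string, string>(StringComparer.Ordinal); // stands in for row data (snapshots)

    public void Write(string key, string value)
    {
        commitLog.Add(key + "=" + value); // 1. persist to the log first (durability)
        memoryTable[key] = value;         // 2. only then update the memory table
    }

    public string Read(string key)
    {
        // Hot data: answered entirely from memory.
        string value;
        if (memoryTable.TryGetValue(key, out value)) return value;
        // Cold data: fall back to the persisted snapshot.
        return snapshot.TryGetValue(key, out value) ? value : null;
    }

    public void Checkpoint()
    {
        // Not in the critical path: fold the memory table into the
        // snapshot, then the commit log can be truncated.
        foreach (var kv in memoryTable) snapshot[kv.Key] = kv.Value;
        memoryTable.Clear();
        commitLog.Clear();
    }
}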
34. Incoming Read request
HOT Data
Completely in memory
[Diagram: a hot read is answered from the memory table / row cache; no persistent stream is touched]
35. Incoming Read request
COLD Data
Partially in memory
[Diagram: a cold read consults the index cache and bloom filters to locate the row in the row data stream]
36. Incoming Write Request
Persistent First
Followed by memory
[Diagram: the write is appended to the commit log stream first, then applied to the memory table]
37. Checkpoint process
Not in critical path!
[Diagram: the checkpoint writes the memory table to the row data stream as a snapshot and truncates the commit log]
38. Geo replication process
Not in critical path!
[Diagram: two partition servers in different stamps, each with its own memory structures and persistent streams; changes flow from one stamp to the other]
40. Stream layer
Append-Only Distributed File System
Streams are very large files
File-system-like directory namespace
Stream Operations
Open, Close, Delete Streams
Rename Streams
Concatenate Streams together
Append for writing
Random reads
41. Stream layer
Concepts
Block
  Minimum unit of write/read
  Checksummed
  Up to N bytes (e.g. 4 MB)
Extent
  Unit of replication
  Sequence of blocks
  Size limit (e.g. 1 GB)
  Sealed/unsealed
Stream
  Hierarchical namespace
  Ordered list of pointers to extents
  Append/Concatenate
[Diagram: stream //foo/myfile.data holds pointers 1-3 to three extents, each a sequence of blocks; two extents are sealed, the last one is unsealed]
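How the three concepts compose can be sketched in a couple of types (names invented; block size limits and checksums omitted):

using System.Collections.Generic;

class Extent
{
    public List<byte[]> Blocks = new List<byte[]>(); // a sequence of blocks
    public bool Sealed;                              // sealed extents never change again
}

class StorageStream
{
    // A stream is just an ordered list of pointers to extents.
    public List<Extent> Extents = new List<Extent>();

    public void Append(byte[] block, int blocksPerExtent)
    {
        // Appends only ever go to the last, unsealed extent.
        if (Extents.Count == 0 || Extents[Extents.Count - 1].Sealed)
            Extents.Add(new Extent());
        Extent last = Extents[Extents.Count - 1];
        last.Blocks.Add(block);
        if (last.Blocks.Count >= blocksPerExtent)
            last.Sealed = true; // seal once the size limit is reached
    }

    public void Concatenate(StorageStream other)
    {
        // Concatenation is cheap: only the pointer list grows.
        Extents.AddRange(other.Extents);
    }
}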
42. Stream layer
Consists of
Paxos System
  Distributed consensus protocol
Stream Master
  Manages extents
  Optimization routines
Extent Nodes
  Attached disks
[Diagram: Stream layer with a Paxos ring of Stream Masters (SM) above a row of Extent Nodes (EN)]
43. Extent creation
Allocation process
  Partition Server requests extent/stream creation
  Stream Master chooses 3 random Extent Nodes based on available capacity
  Chooses primary & secondaries
  Allocates the replica set
    Random allocation to reduce Mean Time To Recovery
[Diagram: the PS in the Partition layer asks the SM in the Stream layer, which allocates one primary and two secondary Extent Nodes]
44. Replication Process
Bitwise equality on all nodes
Synchronous
  Append request goes to the primary
  Written to disk, checksummed
  Offset and data forwarded to the secondaries
  Ack back to the partition layer
Journaling
  Dedicated disk (spindle/SSD) for writes
  Simultaneously copied to the data disk (from memory or journal)
[Diagram: the PS appends to the primary EN, which forwards to two secondary ENs before acking]
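In outline the synchronous chain looks like this (a sketch under simplifying assumptions; the real protocol also checksums every block and journals writes as described above):

using System.Collections.Generic;
using System.Linq;

class ExtentNode
{
    public List<byte[]> Blocks = new List<byte[]>();

    public bool Append(long offset, byte[] data)
    {
        if (offset != Blocks.Count) return false; // reject out-of-order appends
        Blocks.Add(data);                         // written to disk in the real system
        return true;
    }
}

class PrimaryExtentNode : ExtentNode
{
    public List<ExtentNode> Secondaries = new List<ExtentNode>();

    // The append is only acked once the primary AND all secondaries
    // hold a bitwise-identical block at the same offset.
    public bool ReplicatedAppend(byte[] data)
    {
        long offset = Blocks.Count;
        return Append(offset, data) && Secondaries.All(s => s.Append(offset, data));
    }
}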
45. Dealing with failure
Ack from primary lost going back to partition layer
Retry from the partition layer can cause multiple blocks to be appended (duplicate records)
Unresponsive/Unreachable Extent Node (EN)
Seal the failed extent & allocate a new extent and append immediately
[Diagram: stream //foo/myfile.data now has four extent pointers; the failed extent was sealed and a new extent was appended to take further writes]
46. Replication Failures (Scenario 1)
[Diagram: the Stream Master seals the extent; both reachable replicas report commit length 120, so the extent is sealed at 120]
47. Replication Failures (Scenario 2)
[Diagram: the reachable replicas report commit lengths 120 and 100, the third is unreachable (?); the Stream Master seals the extent at the smallest reported commit length, 100]
48. Network partitioning during read
Data Stream
  The partition layer only reads from offsets returned by successful appends
    Committed on all replicas
  Row and blob data streams
    The offset is valid on any replica
    Safe to read from any replica
[Diagram: the PS can read the committed prefix from any of the three ENs]
49. Network partitioning during read
Log Stream
  Logs are used on partition load
  Check the commit length first
  Only read from
    an unsealed replica if all replicas have the same commit length
    a sealed replica otherwise
[Diagram: the PS compares commit lengths across the replicas (steps 1, 2); the SM seals the extent]
50. Garbage Collection
Downside to this architecture
  Lots of wasted disk space
  'Forgotten' streams
  'Forgotten' extents
  Unused blocks in extents after write errors
Garbage collection required
  Partition servers mark the streams they use
  Stream Manager cleans up unreferenced streams
  Stream Manager cleans up unreferenced extents
  Stream Manager performs erasure coding
    Reed-Solomon algorithm
    Allows recreation of cold streams after deleting extents
51. Beating the CAP theorem
Layering & co-design
Stream Layer
Availability with Partition/failure tolerance
For Consistency, replicas are bit-wise identical up to the commit length
Partition Layer
Consistency with Partition/failure tolerance
For Availability, RangePartitions can be served by any partition server
And are moved to available servers if a partition server fails
Designed for specific classes of partitioning/failures seen in practice
Process to Disk to Node to Rack failures/unresponsiveness
Node to Rack level network partitioning
53. Partitioning Strategy
Tables: Great for writing data, horrible for complex querying
Data is automatically spread across extent nodes
Indexes are automatically spread across partition servers
Queries to many servers may be required to list data
[Diagram: a listing query fans out from the Front End layer across many Partition Servers (PS) and Extent Nodes (EN)]
54. Partitioning Strategy
Insert all aggregates in a dedicated partition
Aggregate
  Records that belong together as a unit
  E.g. an order with its order lines
Dedicated
  Unique combination of partition parts: account, table, PartitionKey
Best insert performance
  Use Entity Group Transactions
  All aggregate records inserted in 1 go
  Transactional
Use RowKey only for ordering
  Can be blank

Account  Table   PK      RowKey  Properties
Europe   Orders  Order1          Order 1 data
Europe   Orders  Order1  01      Orderline 1 data
Europe   Orders  Order1  02      Orderline 2 data
Europe   Orders  Order2          Order 2 data
Europe   Orders  Order2  01      Orderline 1 data
Europe   Orders  Order2  02      Orderline 2 data
US       Orders  Order3          Order 3 data
US       Orders  Order3  01      Orderline 1 data
US       Orders  Order3  02      Orderline 2 data
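With the storage client library this maps onto a batch operation; a sketch assuming the WindowsAzure.Storage SDK and an invented OrderRow entity type (batches are limited to a single partition, which is exactly why the whole aggregate lives in a dedicated one):

using Microsoft.WindowsAzure.Storage;
using Microsoft.WindowsAzure.Storage.Table;

// Invented entity type covering both order and orderline rows.
public class OrderRow : TableEntity
{
    public string Data { get; set; }
}

class EntityGroupTransactionDemo
{
    static void Main()
    {
        // Assumption: a valid storage connection string.
        var account = CloudStorageAccount.Parse("<connection string>");
        var table = account.CreateCloudTableClient().GetTableReference("Orders");
        table.CreateIfNotExists();

        // All rows share one PartitionKey, so the batch executes as a single
        // atomic entity group transaction on one partition server.
        var batch = new TableBatchOperation();
        batch.Insert(new OrderRow { PartitionKey = "Order1", RowKey = "",   Data = "Order 1 data" });
        batch.Insert(new OrderRow { PartitionKey = "Order1", RowKey = "01", Data = "Orderline 1 data" });
        batch.Insert(new OrderRow { PartitionKey = "Order1", RowKey = "02", Data = "Orderline 2 data" });
        table.ExecuteBatch(batch);
    }
}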
55. Partitioning Strategy
Query By = Store By
Do not query across partitions, it hurts!
Instead store a copy of the data to query in 1 partition
PartitionKey
  Combination of "query by" properties or well-known identifiers
RowKey
  Use for ordering
  Must be unique in combination with the PK
Guarantees fast retrieval
  Every query = 1 partition server

Acct  Table   PK                   RK    Properties
Yves  Orders  0634911534000000000  1     Order 1 data
Yves  Orders  0634911534000000000  1_01  Orderline 1 data
Yves  Orders  0634911534000000000  1_02  Orderline 2 data
Yves  Orders  0634911534000000000  3     Order 3 data
Yves  Orders  0634911534000000000  3_01  Orderline 1 data
Yves  Orders  0634911534000000000  3_02  Orderline 2 data
Yves  Orders  0634911666000000000  2     Order 2 data
Yves  Orders  0634911666000000000  2_01  Orderline 1 data
Yves  Orders  0634911666000000000  2_02  Orderline 2 data
56. Partitioning Strategy
DO Denormalize!
It's not a sin!
Limit the dataset to the values required by the query
Do precompute
Do format up front
Etc.

Acct  Table   PK                   RK  Properties
Yves  Orders  0634911534000000000  1   €200
Yves  Orders  0634911534000000000  3   $500
Yves  Orders  0634911666000000000  2   €1000
57. Bonus Tip
Great partition keys
Any combination of properties of the object
Who
E.g. the current user id
When
string.Format("{0:D19}", DateTime.MaxValue.Ticks - DateTime.UtcNow.Ticks)
(reverse ticks: the newest entries sort first)
Where
Geohash algorithm (geohash.org)
Why
Flow context through operations/messages
Add all partition key values as properties to the original aggregate root!
Or updating will be your next nightmare
58. Change Propagation Strategy
Use queued messaging
  NO TRANSACTION support across partitions
  Split a multi-partition update into separate atomic operations
  Built-in retry mechanism on storage queues
  Use compensation logic if needed
Example
  Update the primary records in an entity group transaction
  Publish an 'Updated' event through the queues
  Maintain the list of interested nodes in storage
  Secondary nodes update their indexes in response
[Diagram: web roles/websites publish to a primary worker role (P); secondary workers (S) consume the events]
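A bare-bones version of that event publication with storage queues (queue and message names invented; the retry comes from queue visibility timeouts, i.e. a message reappears unless it is explicitly deleted):

using Microsoft.WindowsAzure.Storage;
using Microsoft.WindowsAzure.Storage.Queue;

class ChangePropagation
{
    static void Main()
    {
        // Assumption: a valid storage connection string.
        var account = CloudStorageAccount.Parse("<connection string>");
        var queue = account.CreateCloudQueueClient().GetQueueReference("order-updates");
        queue.CreateIfNotExists();

        // Publisher side: the primary records were just updated in an
        // entity group transaction; announce it to the interested nodes.
        queue.AddMessage(new CloudQueueMessage("Updated:Order1"));

        // Subscriber side (worker role): receive, update the secondary
        // copy/index, and only then delete. If the update crashes, the
        // message becomes visible again and the work is retried.
        var message = queue.GetMessage();
        if (message != null)
        {
            // ... update the secondary index for message.AsString here ...
            queue.DeleteMessage(message);
        }
    }
}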
59. Change Propagation Strategy
Sounds hard, but don't worry
  I've gone through all the hard work already
  And contributed it back to the community: NServiceBus
Bonus
  Works very well to update web role caches
  Very nice in combination with SignalR!
As A Service
  www.messagehandler.net
60. Counting
Counting records across partitions is a PITA, because querying across them is…
And there is no built-in support
Do it up front
  I typically count 'created' events in flight
  It's just another handler
BUT… +1 is not an idempotent operation
Keep the count atomic
  No other operation except +1 & save
Use sharded counters
  Every thread on every node 'owns' a shard of the counter (prevents concurrent writers)
  In a single partition
  Roll up when querying and/or at a regular interval
[Diagram: a counting handler (C) runs alongside the secondary workers]
61. Counting
Counting actions in Goeleven.com
public class CountActions : IHandleMessages<ActionCreated>
{
    public void Handle(ActionCreated message)
    {
        // One shard per role instance and thread, so no two writers ever
        // touch the same row and +1 stays free of concurrency conflicts.
        string identifier = RoleEnvironment.CurrentRoleInstance.Id + "." + Thread.CurrentThread.ManagedThreadId;
        var shard = counterShardRepository.TryGetShard("Actions", identifier);
        if (shard == null)
        {
            shard = new CounterShard { PartitionKey = "Actions", RowKey = identifier, Partial = 1 };
            counterShardRepository.Add(shard);
        }
        else
        {
            shard.Partial += 1;
            counterShardRepository.Update(shard);
        }
        counterShardRepository.SaveAllChanges();
    }
}

Acct  Table     PK       RK                       Properties
My    Counters  Actions  RollUp                   2164
My    Counters  Actions  Goeleven.Worker_IN_0.11  12
My    Counters  Actions  Goeleven.Worker_IN_1.56  15
62. Counting
Rollup
public class Counter
{
    public long Value { get; private set; }

    public Counter(IEnumerable<CounterShard> counterShards)
    {
        // Roll up: the counter's value is the sum of all shard partials.
        foreach (var counterShard in counterShards)
        {
            Value += counterShard.Partial;
        }
    }
}

Acct  Table     PK       RK                       Properties
My    Counters  Actions  RollUp                   2164
My    Counters  Actions  Goeleven.Worker_IN_0.11  12
My    Counters  Actions  Goeleven.Worker_IN_1.56  15
Rolled-up value: 2181
63. Master Election
Scheduling in a multi-node environment
  Like the rollup
  All nodes kick in together doing the same thing
  Master election is required to prevent conflicts
Remember the locking service?
  It surfaces through blob storage as a 'Lease'
  So you can use it too!
[Diagram: the Partition layer's Lock Servers (LS) alongside the Partition Masters (PM) and Partition Servers (PS)]
64. Master Election
Try to acquire a lease
If it succeeds, start work, and hold on to the lease until done

// TryAcquireLease, RenewLease and ReleaseLease come from the blob lease
// extension linked below; the lease must be renewed before it expires.
var leaseId = blob.TryAcquireLease();
if (leaseId != null)
{
    var renewalThread = new Thread(() =>
    {
        while (true)
        {
            Thread.Sleep(TimeSpan.FromSeconds(40));
            blob.RenewLease(leaseId);
        }
    });
    renewalThread.Start();

    // do your thing

    renewalThread.Abort();
    blob.ReleaseLease(leaseId);
}

Find the blob lease extension at http://blog.smarx.com/posts/leasing-windows-azure-blobs-using-the-storage-client-library