6. I assume you know
Storage services
Highly scalable
Highly available
Cloud storage
7. I assume you know
It brings the following data abstractions
Tables
Blobs
Drives
Queues
8. I assume you know
It is NOT a relational database
Storage services are exposed through a RESTful API
With some sugar on top (aka an SDK)
LINQ to tables support might make you think it is a database
9. This talk is based on an academic whitepaper
http://sigops.org/sosp/sosp11/current/2011-Cascais/printable/11-calder.pdf
10. Which discusses how
Storage services defy the rules of the CAP theorem
Distributed & scalable system
Highly available
Strongly Consistent
Partition tolerant
All at the same time!
11. Now how did they do that?
Deep Dive: Ready for a plunge?
14. Global namespace
DNS-based service location
http(s)://AccountName.Service.core.windows.net/PartitionName/ObjectName
DNS resolution
Datacenter (e.g. Western Europe)
Virtual IP of a Stamp
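You can watch this service location happen yourself by resolving a storage hostname; a minimal sketch ("myaccount" is a placeholder account name), where the address returned is the virtual IP of the stamp hosting the account:

using System;
using System.Net;

class ResolveStamp
{
    static void Main()
    {
        // "myaccount" is a placeholder; substitute an existing storage account.
        // DNS maps the account name to the virtual IP of the stamp hosting it.
        IPHostEntry entry = Dns.GetHostEntry("myaccount.blob.core.windows.net");
        foreach (IPAddress address in entry.AddressList)
        {
            Console.WriteLine(address); // the stamp's VIP in some datacenter
        }
    }
}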
15. Stamp
Scale unit
A Windows Azure deployment
3 layers
Typically 20 racks each
Roughly 360 XL instances
With disks attached
2 petabytes (before June 2012)
30 petabytes today
Intra-stamp replication
16. Incoming read request
Arrives at Front End
Routed to partition layer
http(s)://AccountName.Service.core.windows.net/PartitionName/ObjectName
Reads usually from memory
[Diagram: the request flows through the Front End layer (FE nodes) to the Partition layer (PS nodes)]
17. Incoming read request
2 nodes per request
Blazingly fast
Flat “Quantum 10” network topology
50 Tbps of bandwidth
10 Gbps / node
Faster than a typical SSD drive
18. Incoming Write request
Written to stream layer
Synchronous replication
1 Primary Extent Node
2 Secondary Extent Nodes
[Diagram: the write flows from the Front End layer through the Partition layer (PS nodes) into the Stream layer, landing on a primary Extent Node and two secondaries]
22. Front End Nodes
Stateless role instances
Authentication & Authorization
Shared Key (storage key)
Shared Key Lite
Shared Access Signatures
Routing to Partition Servers
Based on the PartitionMap (in-memory cache)
Versioning
x-ms-version: 2011-08-18
Throttling
23. Partition Map
Determine partition server
Used By Front End Nodes
Managed by Partition Manager (Partition Layer)
Ordered Dictionary
Key: http(s)://AccountName.Service.core.windows.net/PartitionName/ObjectName
Value: Partition Server
Split across partition servers
Range Partition
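A minimal sketch of such a range lookup (range boundaries and server names are invented for illustration; the real map keys are the full request URIs):

using System;
using System.Collections.Generic;
using System.Linq;

class PartitionMap
{
    // Inclusive upper bound of each key range -> the partition server serving it.
    // A SortedList keeps the ranges ordered, like the real ordered dictionary.
    private readonly SortedList<string, string> ranges =
        new SortedList<string, string>(StringComparer.Ordinal);

    public void AddRange(string upperBound, string server)
    {
        ranges.Add(upperBound, server);
    }

    // The server for a key is the first range whose upper bound is >= the key.
    public string Lookup(string key)
    {
        return ranges.First(r => string.CompareOrdinal(key, r.Key) <= 0).Value;
    }
}

class Demo
{
    static void Main()
    {
        var map = new PartitionMap();
        map.AddRange("H", "PS1");      // range A-H
        map.AddRange("R", "PS2");      // range H'-R
        map.AddRange("\uffff", "PS3"); // range R'-end
        Console.WriteLine(map.Lookup("Harry/Pictures/Sunrise")); // PS2
        Console.WriteLine(map.Lookup("Richard/Images/Soccer"));  // PS3
    }
}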
24. Practical examples
Blob
Each blob is a partition
Blob URI determines the range
Range only matters for listing
Same container: More likely same range
https://myaccount.blob.core.windows.net/mycontainer/subdir/file1.jpg
https://myaccount.blob.core.windows.net/mycontainer/subdir/file2.jpg
Other container: Less likely
https://myaccount.blob.core.windows.net/othercontainer/subdir/file1.jpg
25. Practical examples
Queue
Each queue is a partition
Queue name determines range
All messages same partition
Range only matters for listing
26. Practical examples
Table
Each group of rows with the same PartitionKey is a partition
Ordering within set by RowKey
Range matters a lot!
Expect continuation tokens!
27. Continuation tokens
Requests across partitions
All on the same partition server
  1 result
Across multiple ranges (or too many results)
  Sequential queries required
  1 result + continuation token
  Repeat the query with the token until done
Will almost never occur in development!
  As the dataset is typically small
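In production the consuming pattern is always the same loop, whatever SDK you use. A self-contained sketch with a fake page source (the real token is an opaque pair of continuation keys returned by the table service, not an int; FetchPage is an invented stand-in):

using System;
using System.Collections.Generic;
using System.Linq;

class ContinuationDemo
{
    // Stand-in for a table query: returns one page of rows plus the token
    // for the next page, or null when the result set is exhausted.
    static List<string> FetchPage(List<string> rows, int? token, int pageSize, out int? nextToken)
    {
        int start = token ?? 0;
        var page = rows.Skip(start).Take(pageSize).ToList();
        nextToken = start + page.Count < rows.Count ? start + page.Count : (int?)null;
        return page;
    }

    static void Main()
    {
        var rows = Enumerable.Range(1, 2500).Select(i => "row" + i).ToList();
        var results = new List<string>();
        int? token = null;
        do
        {
            // Repeat the query, feeding the token back in, until none returns.
            results.AddRange(FetchPage(rows, token, 1000, out token));
        } while (token != null);
        Console.WriteLine(results.Count); // 2500, gathered in three round trips
    }
}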
29. Partition Layer
Dynamic Scalable Object Index
Object Tables
System-defined data structures
Account, Entity, Blob, Message, Schema, PartitionMap
Dynamic Load Balancing
Monitor load to each part of the index to determine hot spots
Index is dynamically split into thousands of Index Range Partitions based on load
Index Range Partitions are automatically load balanced across servers to quickly adapt to changes in load
Updates every 15 seconds
30. Partition layer
Consists of
Partition Master
  Distributes Object Table partitions across Partition Servers
  Dynamic load balancing
Partition Server
  Serves data for exactly 1 Range Partition
  Manages streams
Lock Server
  Master election
  Partition lock
[Diagram: Partition layer with two Partition Masters (PM), two Lock Servers (LS) and a row of Partition Servers (PS)]
31. Dynamic Scalable Object Index
How it works
The Object Table is sorted by (Account, Container, Blob) and split into ranges, each served by one Partition Server:

Range  Account  Container  Blob
A-H    Aaaa     Aaaa       Aaaa
A-H    Harry    Pictures   Sunrise
A-H    Harry    Pictures   Sunset
H'-R   Richard  Images     Soccer
H'-R   Richard  Images     Tennis
R'-Z   Zzzz     Zzzz       zzzz

[Diagram: Front End nodes consult the partition map; PM/LS assign the ranges A-H, H'-R and R'-Z to three PS nodes]
32. Dynamic Load Balancing
Not in the critical path!
[Diagram: load balancing happens within the Partition layer (PM, LS, PS nodes); the Stream layer underneath (EN nodes: primary plus two secondaries) is untouched]
33. Partition Server
Architecture
Log Structured Merge-Tree
  Easy replication
  Append-only persisted streams
    Commit log stream
    Metadata log stream
    Row data stream (snapshots)
    Blob data stream (actual blob data)
  Caches in memory
    Memory table (= commit log)
    Row data
    Index (snapshot locations)
    Bloom filters
    Load metrics
[Diagram: the in-memory structures (memory table, row/index caches, bloom filters, load metrics) sit on top of the four persistent streams]
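The next slides trace reads, writes and checkpoints through this structure. As a toy model of those same paths (not the actual partition server code, just the log-structured idea):

using System;
using System.Collections.Generic;

class ToyLsmStore
{
    private readonly List<string> commitLog = new List<string>(); // stands in for the commit log stream
    private readonly SortedDictionary<string, string> memoryTable =
        new SortedDictionary<string, string>(StringComparer.Ordinal);
    private readonly SortedDictionary<string, string> snapshot =
        new SortedDictionary<string, string>(StringComparer.Ordinal); // stands in for row data (snapshots)

    public void Write(string key, string value)
    {
        commitLog.Add(key + "=" + value); // 1. persist to the log first (durability)
        memoryTable[key] = value;         // 2. only then update the memory table
    }

    public string Read(string key)
    {
        // Hot data: answered entirely from memory.
        string value;
        if (memoryTable.TryGetValue(key, out value)) return value;
        // Cold data: fall back to the persisted snapshot.
        return snapshot.TryGetValue(key, out value) ? value : null;
    }

    public void Checkpoint()
    {
        // Not in the critical path: fold the memory table into the
        // snapshot, then the commit log can be truncated.
        foreach (var kv in memoryTable) snapshot[kv.Key] = kv.Value;
        memoryTable.Clear();
        commitLog.Clear();
    }
}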
34. Incoming Read request
HOT Data
Completely in memory
[Diagram: a hot read is answered from the memory table / row cache; no persistent stream is touched]
35. Incoming Read request
COLD Data
Partially in memory
[Diagram: a cold read consults the index cache and bloom filters to locate the row in the row data stream]
36. Incoming Write Request
Persistent First
Followed by memory
[Diagram: the write is appended to the commit log stream first, then applied to the memory table]
37. Checkpoint process
Not in critical path!
[Diagram: the checkpoint writes the memory table to the row data stream as a snapshot and truncates the commit log]
38. Geo replication process
Not in critical path!
[Diagram: two partition servers in different stamps, each with its own memory structures and persistent streams; changes flow from one stamp to the other]
40. Stream layer
Append-Only Distributed File System
Streams are very large files
File-system-like directory namespace
Stream Operations
Open, Close, Delete Streams
Rename Streams
Concatenate Streams together
Append for writing
Random reads
41. Stream layer
Concepts
Block
  Minimum unit of write/read
  Checksummed
  Up to N bytes (e.g. 4 MB)
Extent
  Unit of replication
  Sequence of blocks
  Size limit (e.g. 1 GB)
  Sealed/unsealed
Stream
  Hierarchical namespace
  Ordered list of pointers to extents
  Append/Concatenate
[Diagram: stream //foo/myfile.data holds pointers 1-3 to three extents, each a sequence of blocks; two extents are sealed, the last one is unsealed]
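How the three concepts compose can be sketched in a couple of types (names invented; block size limits and checksums omitted):

using System.Collections.Generic;

class Extent
{
    public List<byte[]> Blocks = new List<byte[]>(); // a sequence of blocks
    public bool Sealed;                              // sealed extents never change again
}

class StorageStream
{
    // A stream is just an ordered list of pointers to extents.
    public List<Extent> Extents = new List<Extent>();

    public void Append(byte[] block, int blocksPerExtent)
    {
        // Appends only ever go to the last, unsealed extent.
        if (Extents.Count == 0 || Extents[Extents.Count - 1].Sealed)
            Extents.Add(new Extent());
        Extent last = Extents[Extents.Count - 1];
        last.Blocks.Add(block);
        if (last.Blocks.Count >= blocksPerExtent)
            last.Sealed = true; // seal once the size limit is reached
    }

    public void Concatenate(StorageStream other)
    {
        // Concatenation is cheap: only the pointer list grows.
        Extents.AddRange(other.Extents);
    }
}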
42. Stream layer
Consists of
Paxos System
  Distributed consensus protocol
Stream Master
  Manages extents
  Optimization routines
Extent Nodes
  Attached disks
[Diagram: Stream layer with a Paxos ring of Stream Masters (SM) above a row of Extent Nodes (EN)]
43. Extent creation
Allocation process
  Partition Server requests extent/stream creation
  Stream Master chooses 3 random Extent Nodes based on available capacity
  Chooses primary & secondaries
  Allocates the replica set
    Random allocation to reduce Mean Time To Recovery
[Diagram: the PS in the Partition layer asks the SM in the Stream layer, which allocates one primary and two secondary Extent Nodes]
44. Replication Process
Bitwise equality on all nodes
Synchronous
  Append request goes to the primary
  Written to disk, checksummed
  Offset and data forwarded to the secondaries
  Ack back to the partition layer
Journaling
  Dedicated disk (spindle/SSD) for writes
  Simultaneously copied to the data disk (from memory or journal)
[Diagram: the PS appends to the primary EN, which forwards to two secondary ENs before acking]
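In outline the synchronous chain looks like this (a sketch under simplifying assumptions; the real protocol also checksums every block and journals writes as described above):

using System.Collections.Generic;
using System.Linq;

class ExtentNode
{
    public List<byte[]> Blocks = new List<byte[]>();

    public bool Append(long offset, byte[] data)
    {
        if (offset != Blocks.Count) return false; // reject out-of-order appends
        Blocks.Add(data);                         // written to disk in the real system
        return true;
    }
}

class PrimaryExtentNode : ExtentNode
{
    public List<ExtentNode> Secondaries = new List<ExtentNode>();

    // The append is only acked once the primary AND all secondaries
    // hold a bitwise-identical block at the same offset.
    public bool ReplicatedAppend(byte[] data)
    {
        long offset = Blocks.Count;
        return Append(offset, data) && Secondaries.All(s => s.Append(offset, data));
    }
}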
45. Dealing with failure
Ack from primary lost going back to partition layer
Retry from the partition layer can cause multiple blocks to be appended (duplicate records)
Unresponsive/Unreachable Extent Node (EN)
Seal the failed extent & allocate a new extent and append immediately
[Diagram: stream //foo/myfile.data now has four extent pointers; the failed extent was sealed and a new extent was appended to take further writes]
46. Replication Failures (Scenario 1)
[Diagram: the Stream Master seals the extent; both reachable replicas report commit length 120, so the extent is sealed at 120]
47. Replication Failures (Scenario 2)
[Diagram: the reachable replicas report commit lengths 120 and 100, the third is unreachable (?); the Stream Master seals the extent at the smallest reported commit length, 100]
48. Network partitioning during read
Data Stream
  The partition layer only reads from offsets returned by successful appends
    Committed on all replicas
  Row and blob data streams
    The offset is valid on any replica
    Safe to read from any replica
[Diagram: the PS can read the committed prefix from any of the three ENs]
49. Network partitioning during read
Log Stream
  Logs are used on partition load
  Check the commit length first
  Only read from
    an unsealed replica if all replicas have the same commit length
    a sealed replica otherwise
[Diagram: the PS compares commit lengths across the replicas (steps 1, 2); the SM seals the extent]
50. Garbage Collection
Downside to this architecture
  Lots of wasted disk space
  'Forgotten' streams
  'Forgotten' extents
  Unused blocks in extents after write errors
Garbage collection required
  Partition servers mark the streams they use
  Stream Manager cleans up unreferenced streams
  Stream Manager cleans up unreferenced extents
  Stream Manager performs erasure coding
    Reed-Solomon algorithm
    Allows recreation of cold streams after deleting extents
51. Beating the CAP theorem
Layering & co-design
Stream Layer
Availability with Partition/failure tolerance
For Consistency, replicas are bit-wise identical up to the commit length
Partition Layer
Consistency with Partition/failure tolerance
For Availability, RangePartitions can be served by any partition server
And are moved to available servers if a partition server fails
Designed for specific classes of partitioning/failures seen in practice
Process to Disk to Node to Rack failures/unresponsiveness
Node to Rack level network partitioning
53. Partitioning Strategy
Tables: Great for writing data, horrible for complex querying
Data is automatically spread across extent nodes
Indexes are automatically spread across partition servers
Queries to many servers may be required to list data
[Diagram: a listing query fans out from the Front End layer across many Partition Servers (PS) and Extent Nodes (EN)]
54. Partitioning Strategy
Insert all aggregates in a dedicated partition
Aggregate
  Records that belong together as a unit
  E.g. an order with its order lines
Dedicated
  Unique combination of partition parts: account, table, PartitionKey
Best insert performance
  Use Entity Group Transactions
  All aggregate records inserted in 1 go
  Transactional
Use RowKey only for ordering
  Can be blank

Account  Table   PK      RowKey  Properties
Europe   Orders  Order1          Order 1 data
Europe   Orders  Order1  01      Orderline 1 data
Europe   Orders  Order1  02      Orderline 2 data
Europe   Orders  Order2          Order 2 data
Europe   Orders  Order2  01      Orderline 1 data
Europe   Orders  Order2  02      Orderline 2 data
US       Orders  Order3          Order 3 data
US       Orders  Order3  01      Orderline 1 data
US       Orders  Order3  02      Orderline 2 data
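With the storage client library this maps onto a batch operation; a sketch assuming the WindowsAzure.Storage SDK and an invented OrderRow entity type (batches are limited to a single partition, which is exactly why the whole aggregate lives in a dedicated one):

using Microsoft.WindowsAzure.Storage;
using Microsoft.WindowsAzure.Storage.Table;

// Invented entity type covering both order and orderline rows.
public class OrderRow : TableEntity
{
    public string Data { get; set; }
}

class EntityGroupTransactionDemo
{
    static void Main()
    {
        // Assumption: a valid storage connection string.
        var account = CloudStorageAccount.Parse("<connection string>");
        var table = account.CreateCloudTableClient().GetTableReference("Orders");
        table.CreateIfNotExists();

        // All rows share one PartitionKey, so the batch executes as a single
        // atomic entity group transaction on one partition server.
        var batch = new TableBatchOperation();
        batch.Insert(new OrderRow { PartitionKey = "Order1", RowKey = "",   Data = "Order 1 data" });
        batch.Insert(new OrderRow { PartitionKey = "Order1", RowKey = "01", Data = "Orderline 1 data" });
        batch.Insert(new OrderRow { PartitionKey = "Order1", RowKey = "02", Data = "Orderline 2 data" });
        table.ExecuteBatch(batch);
    }
}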
55. Partitioning Strategy
Query By = Store By
Do not query across partitions, it hurts!
Instead store a copy of the data to query in 1 partition
PartitionKey
  Combination of "query by" properties or well-known identifiers
RowKey
  Use for ordering
  Must be unique in combination with the PK
Guarantees fast retrieval
  Every query = 1 partition server

Acct  Table   PK                   RK    Properties
Yves  Orders  0634911534000000000  1     Order 1 data
Yves  Orders  0634911534000000000  1_01  Orderline 1 data
Yves  Orders  0634911534000000000  1_02  Orderline 2 data
Yves  Orders  0634911534000000000  3     Order 3 data
Yves  Orders  0634911534000000000  3_01  Orderline 1 data
Yves  Orders  0634911534000000000  3_02  Orderline 2 data
Yves  Orders  0634911666000000000  2     Order 2 data
Yves  Orders  0634911666000000000  2_01  Orderline 1 data
Yves  Orders  0634911666000000000  2_02  Orderline 2 data
56. Partitioning Strategy
DO Denormalize!
It's not a sin!
Limit the dataset to the values required by the query
Do precompute
Do format up front
Etc.

Acct  Table   PK                   RK  Properties
Yves  Orders  0634911534000000000  1   €200
Yves  Orders  0634911534000000000  3   $500
Yves  Orders  0634911666000000000  2   €1000
57. Bonus Tip
Great partition keys
Any combination of properties of the object
Who
E.g. the current user id
When
string.Format("{0:D19}", DateTime.MaxValue.Ticks - DateTime.UtcNow.Ticks)
(reverse ticks: the newest entries sort first)
Where
Geohash algorithm (geohash.org)
Why
Flow context through operations/messages
Add all partition key values as properties to the original aggregate root!
Or updating will be your next nightmare
58. Change Propagation Strategy
Use queued messaging
  NO TRANSACTION support across partitions
  Split a multi-partition update into separate atomic operations
  Built-in retry mechanism on storage queues
  Use compensation logic if needed
Example
  Update the primary records in an entity group transaction
  Publish an 'Updated' event through the queues
  Maintain the list of interested nodes in storage
  Secondary nodes update their indexes in response
[Diagram: web roles/websites publish to a primary worker role (P); secondary workers (S) consume the events]
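A bare-bones version of that event publication with storage queues (queue and message names invented; the retry comes from queue visibility timeouts, i.e. a message reappears unless it is explicitly deleted):

using Microsoft.WindowsAzure.Storage;
using Microsoft.WindowsAzure.Storage.Queue;

class ChangePropagation
{
    static void Main()
    {
        // Assumption: a valid storage connection string.
        var account = CloudStorageAccount.Parse("<connection string>");
        var queue = account.CreateCloudQueueClient().GetQueueReference("order-updates");
        queue.CreateIfNotExists();

        // Publisher side: the primary records were just updated in an
        // entity group transaction; announce it to the interested nodes.
        queue.AddMessage(new CloudQueueMessage("Updated:Order1"));

        // Subscriber side (worker role): receive, update the secondary
        // copy/index, and only then delete. If the update crashes, the
        // message becomes visible again and the work is retried.
        var message = queue.GetMessage();
        if (message != null)
        {
            // ... update the secondary index for message.AsString here ...
            queue.DeleteMessage(message);
        }
    }
}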
59. Change Propagation Strategy
Sounds hard, but don't worry
  I've gone through all the hard work already
  And contributed it back to the community: NServiceBus
Bonus
  Works very well to update web role caches
  Very nice in combination with SignalR!
As A Service
  www.messagehandler.net
60. Counting
Counting records across partitions is a PITA, because querying across them is…
And there is no built-in support
Do it up front
  I typically count 'created' events in flight
  It's just another handler
BUT… +1 is not an idempotent operation
Keep the count atomic
  No other operation except +1 & save
Use sharded counters
  Every thread on every node 'owns' a shard of the counter (prevents concurrent writers)
  In a single partition
  Roll up when querying and/or at a regular interval
[Diagram: a counting handler (C) runs alongside the secondary workers]
61. Counting
Counting actions in Goeleven.com
public class CountActions : IHandleMessages<ActionCreated>
{
    public void Handle(ActionCreated message)
    {
        // One shard per role instance and thread, so no two writers ever
        // touch the same row and +1 stays free of concurrency conflicts.
        string identifier = RoleEnvironment.CurrentRoleInstance.Id + "." + Thread.CurrentThread.ManagedThreadId;
        var shard = counterShardRepository.TryGetShard("Actions", identifier);
        if (shard == null)
        {
            shard = new CounterShard { PartitionKey = "Actions", RowKey = identifier, Partial = 1 };
            counterShardRepository.Add(shard);
        }
        else
        {
            shard.Partial += 1;
            counterShardRepository.Update(shard);
        }
        counterShardRepository.SaveAllChanges();
    }
}

Acct  Table     PK       RK                       Properties
My    Counters  Actions  RollUp                   2164
My    Counters  Actions  Goeleven.Worker_IN_0.11  12
My    Counters  Actions  Goeleven.Worker_IN_1.56  15
62. Counting
Rollup
public class Counter
{
    public long Value { get; private set; }

    public Counter(IEnumerable<CounterShard> counterShards)
    {
        // Roll up: the counter's value is the sum of all shard partials.
        foreach (var counterShard in counterShards)
        {
            Value += counterShard.Partial;
        }
    }
}

Acct  Table     PK       RK                       Properties
My    Counters  Actions  RollUp                   2164
My    Counters  Actions  Goeleven.Worker_IN_0.11  12
My    Counters  Actions  Goeleven.Worker_IN_1.56  15
Rolled-up value: 2181
63. Master Election
Scheduling in a multi-node environment
  Like the rollup
  All nodes kick in together doing the same thing
  Master election is required to prevent conflicts
Remember the locking service?
  It surfaces through blob storage as a 'Lease'
  So you can use it too!
[Diagram: the Partition layer's Lock Servers (LS) alongside the Partition Masters (PM) and Partition Servers (PS)]
64. Master Election
Try to acquire a lease
If it succeeds, start work, and hold on to the lease until done

// TryAcquireLease, RenewLease and ReleaseLease come from the blob lease
// extension linked below; the lease must be renewed before it expires.
var leaseId = blob.TryAcquireLease();
if (leaseId != null)
{
    var renewalThread = new Thread(() =>
    {
        while (true)
        {
            Thread.Sleep(TimeSpan.FromSeconds(40));
            blob.RenewLease(leaseId);
        }
    });
    renewalThread.Start();

    // do your thing

    renewalThread.Abort();
    blob.ReleaseLease(leaseId);
}

Find the blob lease extension at http://blog.smarx.com/posts/leasing-windows-azure-blobs-using-the-storage-client-library