Me?

Yves Goeleven
 Solution Architect Capgemini
 Windows Azure MVP

   Co-founder Azug
   Founder Goeleven.com
   Founder Messagehandler.net
   Contributor NServiceBus
Agenda

Deep Dive Architecture
   Global Namespace
   Front End Layer
   Partition Layer
   Stream Layer

Best Practices
   Table Partitioning Strategy
   Change Propagation Strategy
   Counting Strategy
   Master Election Strategy
Warning! This is a level 400 session!
I assume you know

Storage services
 Highly scalable
 Highly available
 Cloud storage
I assume you know

That brings following data abstractions
   Tables
   Blobs
   Drives
   Queues
I assume you know

It is NOT a relational database
 Storage services are exposed through a RESTful API
 With some sugar on top (aka an SDK)
 LINQ to Tables support might make you think it is a database
This talk is based on an academic whitepaper




   http://sigops.org/sosp/sosp11/current/2011-Cascais/printable/11-calder.pdf
Which discusses how

Storage services defy the rules of the CAP theorem
   Distributed & scalable system
   Highly available
   Strongly Consistent
   Partition tolerant

 All at the same time!
Now how did they do that?
Deep Dive: Ready for a plunge?
Global Namespace
Global namespace

Every request gets sent to
 http(s)://AccountName.Service.core.windows.net/PartitionName/ObjectName
Global namespace

DNS based service location
   http(s)://AccountName.Service.core.windows.net/PartitionName/ObjectName
   DNS resolution
   Datacenter (e.g. Western Europe)
   Virtual IP of a Stamp
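The global-namespace scheme above can be sketched as a small parser; this is an illustrative Python sketch (function name and return shape are my own, not part of any SDK):

```python
from urllib.parse import urlparse

def parse_storage_uri(uri):
    """Split a global-namespace URI into its routing parts:
    http(s)://AccountName.Service.core.windows.net/PartitionName/ObjectName
    The AccountName locates the stamp (via DNS); the PartitionName
    locates the data within the stamp."""
    parsed = urlparse(uri)
    # hostname is AccountName.Service.core.windows.net
    account, service = parsed.hostname.split(".")[:2]
    # first path segment is the PartitionName, the rest is the ObjectName
    partition_name, _, object_name = parsed.path.lstrip("/").partition("/")
    return account, service, partition_name, object_name
```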
Stamp

Scale unit
 A Windows Azure deployment
 3 layers
   Typically 20 racks each
   Roughly 360 XL instances
 With disks attached
   2 petabytes (before June 2012)
   30 petabytes (later)

[Diagram: a stamp, with intra-stamp replication between its nodes]
Incoming read request

Arrives at Front End
 Routed to partition layer
 http(s)://AccountName.Service.core.windows.net/PartitionName/ObjectName
 Reads usually from memory



[Diagram: the request flows through the Front End layer (a row of FE nodes) to the Partition layer (a row of PS nodes)]
Incoming read request

2 nodes per request
 Blazingly fast
 Flat "Quantum 10" network topology
 50 Tbps of bandwidth
   10 Gbps / node
   Faster than a typical SSD
Incoming Write request

Written to stream layer
 Synchronous replication
 1 primary Extent Node
 2 secondary Extent Nodes

[Diagram: a PS in the Partition layer appends to the primary EN, which replicates to two secondary ENs in the Stream layer]
Geo replication


[Diagram: two stamps, each doing intra-stamp replication, linked by inter-stamp replication]
Location Service

Stamp Allocation
 Manages DNS records
 Determines active/passive stamp configuration
     Assigns accounts to stamps
     Geo Replication
     Disaster Recovery
     Load Balancing
     Account Migration
Front End Layer
Front End Nodes

Stateless role instances
 Authentication & Authorization
   Shared Key (storage key)
   Shared Key Lite
   Shared Access Signatures
 Routing to Partition Servers
   Based on PartitionMap (in memory cache)
 Versioning
   x-ms-version: 2011-08-18
 Throttling

[Diagram: the Front End layer, a row of stateless FE nodes]
Partition Map

Determine partition server
 Used By Front End Nodes
 Managed by Partition Manager (Partition Layer)
 Ordered Dictionary
   Key: http(s)://AccountName.Service.core.windows.net/PartitionName/ObjectName
   Value: Partition Server
 Split across partition servers
   Range Partition
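The ordered-dictionary lookup the front ends perform can be sketched with a sorted list plus binary search; this is a hypothetical Python sketch (class and method names are mine):

```python
import bisect

class PartitionMap:
    """Sketch of the range-partitioned index map: each entry is
    (range_start_key, server), and a key is served by the entry with
    the greatest start key <= key."""
    def __init__(self, ranges):
        ranges = sorted(ranges)
        self._starts = [start for start, _ in ranges]
        self._servers = [server for _, server in ranges]

    def lookup(self, key):
        # rightmost range whose start is <= key
        i = bisect.bisect_right(self._starts, key) - 1
        return self._servers[max(i, 0)]
```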
Practical examples

Blob
   Each blob is a partition
   Blob uri determines range
   Range only matters for listing
   Same container: more likely same range
     https://myaccount.blobs.core.windows.net/mycontainer/subdir/file1.jpg
     https://myaccount.blobs.core.windows.net/mycontainer/subdir/file2.jpg
   Other container: less likely
     https://myaccount.blobs.core.windows.net/othercontainer/subdir/file1.jpg
Practical examples

Queue
   Each queue is a partition
   Queue name determines range
   All messages same partition
   Range only matters for listing
Practical examples

Table
   Each group of rows with same PartitionKey is a partition
   Ordering within set by RowKey
   Range matters a lot!
   Expect continuation tokens!
Continuation tokens

Requests across partitions
 All on same partition server
   1 result
 Across multiple ranges (or too many results)
   Sequential queries required
   1 result + continuation token
   Repeat query with token until done
 Will almost never occur in development!
   As dataset is typically small
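The repeat-until-done loop can be sketched like this; `query_page` is a hypothetical stand-in for one table query that returns a page of results plus the continuation token (None when the final range has been read):

```python
def query_all(query_page):
    """Drain a cross-partition query by following continuation tokens.

    `query_page(token)` performs one query (hypothetical callable) and
    returns (results, next_token); repeat with the returned token until
    the service signals completion with next_token = None."""
    results, token = query_page(None)
    while token is not None:
        page, token = query_page(token)
        results.extend(page)
    return results
```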
Partition Layer
Partition Layer

Dynamic Scalable Object Index
 Object Tables
   System defined data structures
   Account, Entity, Blob, Message, Schema, PartitionMap
 Dynamic Load Balancing
   Monitor load to each part of the index to determine hot spots
   Index is dynamically split into thousands of Index Range Partitions based on load
   Index Range Partitions are automatically load balanced across servers to quickly adapt to changes in load
   Updates every 15 seconds
Partition layer

Consists of
 Partition Master (PM)
   Distributes Object Tables
   Partitions across Partition Servers
   Dynamic Load Balancing
 Partition Server (PS)
   Serves data for exactly 1 Range Partition
   Manages Streams
 Lock Server (LS)
   Master Election
   Partition Lock

[Diagram: Partition layer with 2 PMs, 2 LSs and a row of PSs]
Dynamic Scalable Object Index

How it works

[Diagram: the Object Table, ordered from Aaaa/Aaaa/Aaaa to Zzzz/Zzzz/zzzz by (Account, Container, Blob), is split into range partitions A-H (e.g. Harry/Pictures/Sunrise, Harry/Pictures/Sunset), H'-R (e.g. Richard/Images/Soccer, Richard/Images/Tennis) and R'-Z; the Partition Masters keep the map, the Lock Servers coordinate, each range is served by one Partition Server, and Front End nodes route requests using the map]
Dynamic Load Balancing

Not in the critical path!


[Diagram: the Partition Masters rebalance range partitions across the PSs while reads and writes keep flowing through the PSs to the ENs in the Stream layer]
Partition Server

Architecture
 Log Structured Merge-Tree
   Easy replication
 Append-only persisted streams
   Commit log
   Metadata log
   Row Data (snapshots)
   Blob Data (actual blob data)
 Caches in memory
   Memory table (= commit log)
   Row Data
   Index (snapshot location)
   Bloom filter
   Load Metrics

[Diagram: memory data (Memory Table, Row Cache, Index Cache, Bloom Filters, Load Metrics) on top of persistent data (Commit Log, Metadata Log, Row Data and Blob Data streams)]
Incoming Read request

HOT data
 Completely in memory

[Diagram: the read is answered from the Memory Table / Row Cache; the persistent streams are not touched]
Incoming Read request

COLD data
 Partially in memory

[Diagram: the read consults the Index Cache and Bloom Filters, then fetches from the Row Data / Blob Data streams]
Incoming Write Request

Persistent first
 Followed by memory

[Diagram: the write is appended to the Commit Log stream, then applied to the Memory Table]
Checkpoint process

Not in the critical path!

[Diagram: the Memory Table is checkpointed into the Row Data stream, after which the old Commit Log stream can be truncated]
Geo replication process

Not in the critical path!

[Diagram: two partition servers in different stamps with identical memory/stream layouts; committed changes are replayed asynchronously from the primary stamp to the secondary stamp]
Stream Layer
Stream layer

Append-Only Distributed File System
 Streams are very large files
   File system like directory namespace
 Stream Operations
     Open, Close, Delete Streams
     Rename Streams
     Concatenate Streams together
     Append for writing
     Random reads
Stream layer

Concepts
 Block
   Min unit of write/read
   Checksum
   Up to N bytes (e.g. 4 MB)
 Extent
   Unit of replication
   Sequence of blocks
   Size limit (e.g. 1 GB)
   Sealed/unsealed
 Stream
   Hierarchical namespace
   Ordered list of pointers to extents
   Append/Concatenate

[Diagram: Stream //foo/myfile.data holds ordered pointers to three extents, each a sequence of blocks; the older extents are sealed, the last one is unsealed]
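The block/extent/stream containment, with sealing at the size limit, can be modelled in a few lines. A Python sketch with a toy limit of 4 blocks standing in for the ~1 GB extent limit; class names are illustrative:

```python
class Extent:
    """Unit of replication: an append-only sequence of blocks (sketch)."""
    SIZE_LIMIT = 4  # blocks; stands in for the ~1 GB limit

    def __init__(self):
        self.blocks = []
        self.sealed = False

class Stream:
    """Ordered list of pointers to extents; only the last extent accepts appends."""
    def __init__(self):
        self.extents = [Extent()]

    def append(self, block):
        last = self.extents[-1]
        if len(last.blocks) >= Extent.SIZE_LIMIT:
            last.sealed = True        # seal the full extent
            last = Extent()           # allocate a fresh unsealed extent
            self.extents.append(last)
        last.blocks.append(block)
```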
Stream layer

Consists of
 Paxos System
   Distributed consensus protocol
   Stream Master
 Extent Nodes
   Attached disks
   Manage extents
   Optimization routines

[Diagram: Stream layer with three SMs coordinated via Paxos and a row of ENs]
Extent creation

Allocation process
 Partition Server
   Requests extent / stream creation
 Stream Master
   Chooses 3 random extent nodes based on available capacity
   Chooses primary & secondaries
   Allocates replica set
 Random allocation
   To reduce Mean Time To Recovery

[Diagram: the PS asks the SM, which allocates one primary and two secondary ENs]
Replication Process

Bitwise equality on all nodes
 Synchronous
 Append request goes to the primary
 Written to disk, checksummed
 Offset and data forwarded to the secondaries
 Ack

Journaling
 Dedicated disk (spindle/SSD)
 For writes
 Simultaneously copied to the data disk (from memory or journal)

[Diagram: the PS appends via the primary EN, which forwards to the two secondary ENs]
Dealing with failure

Ack from primary lost going back to partition layer
 Retry from partition layer can cause multiple blocks to be appended (duplicate
  records)
Unresponsive/Unreachable Extent Node (EN)
 Seal the failed extent & allocate a new extent and append immediately


[Diagram: after the failure, the stream's last extent is sealed and a new fourth extent is allocated; subsequent appends go to the new unsealed extent of //foo/myfile.data]
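Because the stream is append-only, a retry after a lost ack can leave the same record in an extent twice, so the partition layer has to cope on read. A minimal sketch of such duplicate filtering, assuming each record carries a unique id (the function is illustrative, not the actual implementation):

```python
def read_deduplicated(extent_blocks):
    """Keep only the first occurrence of each record id while reading
    an append-only extent that may contain retried (duplicate) appends.

    `extent_blocks` is a list of (record_id, data) pairs in append order."""
    seen, out = set(), []
    for record_id, data in extent_blocks:
        if record_id not in seen:
            seen.add(record_id)
            out.append((record_id, data))
    return out
```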
Replication Failures (Scenario 1)

[Diagram: the SM seals the failed extent; both reachable replicas report commit length 120, so the extent is sealed at offset 120 and appends continue in a new extent]
Replication Failures (Scenario 2)

[Diagram: the replicas disagree during sealing, one reports commit length 120 and another 100; the SM seals the extent at the smallest reported commit length (100), discarding unacknowledged appends]
Network partitioning during read
Data Stream
 Partition Layer only reads from offsets returned from successful appends
   Committed on all replicas
   Row and Blob Data Streams
   Offset valid on any replica
   Safe to read

[Diagram: the PS can read the committed offset from any of the three ENs]
Network partitioning during read
Log Stream
 Logs are used on partition load
 Check commit length first
 Only read from
   An unsealed replica if all replicas have the same commit length
   A sealed replica

[Diagram: the PS checks the replicas' commit lengths (steps 1, 2); on a mismatch the SM seals the extent before the log is read]
Garbage Collection

Downside to this architecture
 Lots of wasted disk space
   'Forgotten' streams
   'Forgotten' extents
   Unused blocks in extents after write errors
 Garbage collection required
   Partition servers mark used streams
   Stream manager cleans up unreferenced streams
   Stream manager cleans up unreferenced extents
   Stream manager performs erasure coding
     Reed-Solomon algorithm
     Allows recreation of cold streams after deleting extents
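The mark-and-sweep described above can be sketched as follows; stream and extent names are illustrative, and `stream_extents` stands in for the stream manager's pointer metadata:

```python
def collect_garbage(all_streams, stream_extents, marked_streams):
    """Mark-and-sweep sketch: partition servers mark the streams they
    still use; the stream manager deletes unreferenced streams, then any
    extent that no surviving stream points to.

    all_streams: set of stream names
    stream_extents: dict mapping stream name -> list of extent ids
    marked_streams: streams marked as in use"""
    dead_streams = set(all_streams) - set(marked_streams)
    live_extents = set()
    for stream in marked_streams:
        live_extents.update(stream_extents.get(stream, []))
    # an extent referenced by any live stream survives, even if a dead stream also pointed at it
    dead_extents = {e for s in dead_streams for e in stream_extents.get(s, [])} - live_extents
    return dead_streams, dead_extents
```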
Beating the CAP theorem

Layering & co-design
 Stream Layer
   Availability with Partition/failure tolerance
   For Consistency, replicas are bit-wise identical up to the commit length
 Partition Layer
   Consistency with Partition/failure tolerance
   For Availability, RangePartitions can be served by any partition server
   Are moved to available servers if a partition server fails


 Designed for specific classes of partitioning/failures seen in practice
   Process to Disk to Node to Rack failures/unresponsiveness
   Node to Rack level network partitioning
Best practices
Partitioning Strategy

Tables: great for writing data, horrible for complex querying
 Data is automatically spread across extent nodes
 Indexes are automatically spread across partition servers
 Queries to many servers may be required to list data

[Diagram: a listing query fans out from the Front End layer across many PSs and ENs]
Partitioning Strategy

Insert all aggregates in a dedicated partition
 Aggregate
   Records that belong together as a unit
   E.g. an order with its order lines
 Dedicated
   Unique combination of partition parts
    (account, table, partition key)
   Best insert performance
 Use Entity Group Transactions
   All aggregate records inserted in 1 go
   Transactional
 Use RowKey only for ordering
   Can be blank

  Account   Table    PK       RowKey   Properties
  Europe    Orders   Order1            Order 1 data
  Europe    Orders   Order1   01       Orderline 1 data
  Europe    Orders   Order1   02       Orderline 2 data
  Europe    Orders   Order2            Order 2 data
  Europe    Orders   Order2   01       Orderline 1 data
  Europe    Orders   Order2   02       Orderline 2 data
  US        Orders   Order3            Order 3 data
  US        Orders   Order3   01       Orderline 1 data
  US        Orders   Order3   02       Orderline 2 data
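Building the entities of one aggregate so they share a PartitionKey (and can go into a single entity group transaction) can be sketched like this; the property names mirror the table above, but the function itself is illustrative:

```python
def aggregate_entities(order_id, order_data, order_lines):
    """Build the entities for one aggregate (an order plus its order
    lines) sharing a PartitionKey, ready for one entity group transaction."""
    # The order itself: RowKey left blank, it only exists for ordering
    entities = [{"PartitionKey": order_id, "RowKey": "", "Data": order_data}]
    for i, line in enumerate(order_lines, start=1):
        entities.append({"PartitionKey": order_id,
                         "RowKey": f"{i:02d}",   # RowKey used only for ordering
                         "Data": line})
    return entities
```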
Partitioning Strategy

Query By = Store By
 Do not query across partitions, it hurts!
 Instead store a copy of the data to query in 1 partition
 PartitionKey
   Combination of query-bys or well-known identifiers
 RowKey
   Use for order
   Must be unique within the PK
 Guarantees fast retrieval
   Every query = 1 partition server

  Acct   Table    PK                    RK     Properties
  Yves   Orders   0634911534000000000   1      Order 1 data
  Yves   Orders   0634911534000000000   1_01   Orderline 1 data
  Yves   Orders   0634911534000000000   1_02   Orderline 2 data
  Yves   Orders   0634911534000000000   3      Order 3 data
  Yves   Orders   0634911534000000000   3_01   Orderline 1 data
  Yves   Orders   0634911534000000000   3_02   Orderline 2 data
  Yves   Orders   0634911666000000000   2      Order 2 data
  Yves   Orders   0634911666000000000   2_01   Orderline 1 data
  Yves   Orders   0634911666000000000   2_02   Orderline 2 data
Partitioning Strategy

DO denormalize!
 It's not a sin!
 Limit the dataset to the values required by the query
 Do precompute
 Do format up front
 Etc.

  Acct   Table    PK                    RK   Properties
  Yves   Orders   0634911534000000000   1    €200
  Yves   Orders   0634911534000000000   3    $500
  Yves   Orders   0634911666000000000   2    €1000
Bonus Tip

Great partition keys
 Any combination of properties of the object
 Who
   E.g. current user id
 When
   string.Format("{0:D19}", DateTime.MaxValue.Ticks - DateTime.UtcNow.Ticks)
 Where
   Geohash algorithm (geohash.org)
 Why
   Flow context through operations/messages

 Add all partition key values as properties to the original aggregate root!
   Or update will be your next nightmare
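The "When" key can be reproduced outside C#: MAX_TICKS is DateTime.MaxValue.Ticks, and a .NET tick is 100 ns counted from 0001-01-01. A Python sketch (function name is mine) showing why newer timestamps sort first:

```python
from datetime import datetime, timezone

# .NET DateTime.MaxValue.Ticks (ticks are 100 ns intervals since 0001-01-01)
MAX_TICKS = 3155378975999999999

def reverse_tick_key(dt):
    """19-digit, zero-padded, descending time key: newer timestamps
    produce lexicographically smaller keys, so they list first."""
    epoch = datetime(1, 1, 1, tzinfo=timezone.utc)
    ticks = int((dt - epoch).total_seconds() * 10_000_000)
    return f"{MAX_TICKS - ticks:019d}"
```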
Change Propagation Strategy

Use queued messaging
 NO TRANSACTION support across partitions
 Split a multi-partition update into separate atomic operations
 Built-in retry mechanism on storage queues
 Use compensation logic if needed
 Example
   Update primary records in an entity group transaction
   Publish an 'Updated' event through queues
     Maintain the list of interested nodes in storage
   Secondary nodes update their indexes in response

[Diagram: web roles write the primary (P) partition; worker roles consume the queue and update the secondary (S) partitions]
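A toy version of this pattern, with an in-memory deque standing in for a storage queue and two lambdas standing in for the secondary index handlers (all names are illustrative, not any real API):

```python
import collections

def propagate(queue, handlers):
    """Sketch of queued change propagation: the primary update is one
    atomic write (already done), an 'Updated' event sits on the queue,
    and each interested secondary applies its own index update when it
    consumes the event."""
    while queue:
        event = queue.popleft()
        for handler in handlers:
            handler(event)

# usage sketch
primary = {"Order1": "v2"}                      # primary already updated atomically
index_a, index_b = {}, {}
q = collections.deque([("Updated", "Order1", "v2")])
propagate(q, [
    lambda e: index_a.__setitem__(e[1], e[2]),  # secondary index A
    lambda e: index_b.__setitem__(e[1], e[2]),  # secondary index B
])
```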
Change Propagation Strategy

Sounds hard, but don't worry
 I've gone through all the hard work already
 And contributed this back to the community
   NServiceBus
 Bonus
   Works very well to update web role caches
   Very nice in combination with SignalR!
 As A Service
   www.messagehandler.net
Counting

Counting records across partitions is a PITA, because querying is...
 And there is no built-in support
 Do it up front
   I typically count 'created' events in flight
     It's just another handler
 BUT...
   +1 is not an idempotent operation
 Keep the count atomic
   No other operation except +1 & save
 Use sharded counters
   Every thread on every node 'owns' a shard of the counter (prevents concurrency issues)
   In a single partition
   Roll up when querying and/or at a regular interval

[Diagram: a counting handler (C) on the worker roles maintains counter shards alongside the secondary (S) partitions]
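A sharded counter in miniature, assuming a dict stands in for the counter partition and each writer uses its own node+thread id as shard key (illustrative names):

```python
def increment(shards, shard_id):
    """+1 on this node/thread's own shard; no two writers share a shard,
    so the non-idempotent +1 never races."""
    shards[shard_id] = shards.get(shard_id, 0) + 1

def roll_up(shards):
    """Sum all shards in the counter's partition to read the total."""
    return sum(shards.values())
```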
Counting

Counting actions in Goeleven.com

public class CountActions : IHandleMessages<ActionCreated>
{
    public void Handle(ActionCreated message)
    {
        // One shard per role instance + thread, so no two writers share a shard
        string identifier = RoleEnvironment.CurrentRoleInstance.Id + "." + Thread.CurrentThread.ManagedThreadId;
        var shard = counterShardRepository.TryGetShard("Actions", identifier);

        if (shard == null)
        {
            shard = new CounterShard { PartitionKey = "Actions", RowKey = identifier, Partial = 1 };
            counterShardRepository.Add(shard);
        }
        else
        {
            shard.Partial += 1;
            counterShardRepository.Update(shard);
        }
        counterShardRepository.SaveAllChanges();
    }
}

  Acct   Table      PK        RK                        Properties
  My     Counters   Actions   RollUp                    2164
  My     Counters   Actions   Goeleven.Worker_IN_0.11   12
  My     Counters   Actions   Goeleven.Worker_IN_1.56   15
Counting

Rollup

  public class Counter
  {
      public long Value { get; private set; }

      public Counter(IEnumerable<CounterShard> counterShards)
      {
          // sum the partial values of all shards into one total
          foreach (var counterShard in counterShards)
          {
              Value += counterShard.Partial;
          }
      }
  }

  Rolling up the shards yields the current total:

  Acct  Table     PK       RK                        Properties
  My    Counters  Actions  RollUp                    2164
  My    Counters  Actions  Goeleven.Worker_IN_0.11   12
  My    Counters  Actions  Goeleven.Worker_IN_1.56   15

  My    Counters  Actions                            2191
Master Election

Scheduling in a multi-node environment
   Like the rollup
   All nodes kick in together doing the same thing
   Master election is required to prevent conflicts
   Remember the lock service in the partition layer?
   It surfaces through blob storage as a 'Lease'
     So you can use it too!
 (diagram: partition layer with partition masters, lock servers and partition servers)
Master Election

Try to acquire a lease
   If it succeeds, start work and hold on to the lease until done

     leaseId = blob.TryAcquireLease();
     if (leaseId != null)
     {
         // the lease expires if not renewed, so renew from a background thread
         renewalThread = new Thread(() =>
         {
             while (true)
             {
                 Thread.Sleep(TimeSpan.FromSeconds(40));
                 blob.RenewLease(leaseId);
             }
         });
         renewalThread.Start();

         // do your thing

         renewalThread.Abort();
         blob.ReleaseLease(leaseId);
     }

   Find the blob lease extension methods on http://blog.smarx.com/posts/leasing-windows-azure-blobs-using-the-storage-client-library
Wrapup
What we discussed

Deep Dive              Best Practices
   Global Namespace      Table Partitioning Strategy
   Front End Layer       Change Propagation Strategy
   Partition Layer       Counting Strategy
   Stream Layer          Master Election Strategy
Resources

Want to know more?
   White paper: http://sigops.org/sosp/sosp11/current/2011-Cascais/printable/11-calder.pdf
   Video: http://www.youtube.com/watch?v=QnYdbQO0yj4
   More slides: http://sigops.org/sosp/sosp11/current/2011-Cascais/11-calder.pptx
   Blob extensions: http://blog.smarx.com/posts/leasing-windows-azure-blobs-using-the-storage-client-library
   NServiceBus: https://github.com/nservicebus/nservicebus


See it in action
   www.goeleven.com
   www.messagehandler.net
Q&A
