PhD Defense

Scalable and Elastic
Transactional Data Stores for
Cloud Computing Platforms
                    Sudipto Das
            Computer Science, UC Santa Barbara
                     sudipto@cs.ucsb.edu
Committee:
Divy Agrawal (co-chair), Amr El Abbadi (co-chair),
Phil Bernstein, Tim Sherwood

Sponsors: [sponsor logos]
Web replacing Desktop




          Sudipto Das {sudipto@cs.ucsb.edu}   2
Paradigm shift in Infrastructure




Cloud computing
• Computing infrastructure and solutions delivered as a service
    ◦ Industry projected to be worth USD 150 billion by 2014*
• Contributors to success
    ◦ Economies of scale
    ◦ Elasticity and pay-per-use pricing
• Popular paradigms
    ◦ Infrastructure as a Service (IaaS)
    ◦ Platform as a Service (PaaS)
    ◦ Software as a Service (SaaS)
*http://www.crn.com/news/channel-programs/225700984/cloud-computing-services-market-to-near-150-billion-in-2014.htm

Databases for cloud platforms
• Data is central to applications
• DBMSs are a mission-critical component of the cloud software stack
    ◦ Manage petabytes of data, drive revenue
    ◦ Serve a variety of applications (multitenancy)
• Data needs for cloud applications
    ◦ OLTP systems: store and serve data
    ◦ Data analysis systems: decision support, intelligence


Application landscape
• Social gaming
• Rich content and mash-ups
• Managed applications
• Cloud application platforms

Challenges for OLTP systems
• Scalability
    ◦ While ensuring efficient transaction execution!
• Lightweight Elasticity
    ◦ Scale on-demand!

Two approaches to scalability
• Scale-up
    ◦ Preferred in classical enterprise setting (RDBMS)
    ◦ Flexible ACID transactions
    ◦ Transactions access a single node
• Scale-out
    ◦ Cloud friendly (Key value stores)
    ◦ Execution at a single server
        ▪ Limited functionality & guarantees
    ◦ No multi-row or multi-step transactions

Why care about transactions?

confirm_friend_request(user1, user2)
{
    begin_transaction();
    // Both directions of the friendship change atomically:
    // either both updates commit or neither does.
    update_friend_list(user1, user2, status.confirmed);
    update_friend_list(user2, user1, status.confirmed);
    end_transaction();
}

• Simplicity in application design with ACID transactions

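The pseudocode above can be tried out against a real transactional engine. The following is a minimal sketch, assuming SQLite (via Python's standard sqlite3 module) as a stand-in for the DBMS; the table name friend_list and the helpers make_db and statuses are hypothetical and exist only for this illustration. It shows that when the second update fails, the first one is rolled back as well.

```python
import sqlite3


def make_db(rows):
    """Create an in-memory database holding pending friend requests (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE friend_list ("
        "owner TEXT, friend TEXT, status TEXT, PRIMARY KEY (owner, friend))"
    )
    conn.executemany("INSERT INTO friend_list VALUES (?, ?, 'pending')", rows)
    conn.commit()
    return conn


def update_friend_list(conn, owner, friend, status):
    cur = conn.execute(
        "UPDATE friend_list SET status = ? WHERE owner = ? AND friend = ?",
        (status, owner, friend),
    )
    if cur.rowcount != 1:
        raise LookupError(f"no request from {owner} to {friend}")


def confirm_friend_request(conn, user1, user2):
    # Used as a context manager, the connection commits if the block
    # succeeds and rolls back if it raises.
    with conn:
        update_friend_list(conn, user1, user2, "confirmed")
        update_friend_list(conn, user2, user1, "confirmed")


def statuses(conn):
    query = "SELECT owner, friend, status FROM friend_list"
    return {(o, f): s for o, f, s in conn.execute(query)}
```

Calling confirm_friend_request on a database that lacks the reverse row raises LookupError and leaves the first row at 'pending', so the application needs no compensation logic of its own.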
// Without transactions, option A: undo the first update if the second fails.
confirm_friend_request_A(user1, user2) {
    try {
        update_friend_list(user1, user2, status.confirmed);
    } catch (exception e) {
        report_error(e);
        return;
    }
    try {
        update_friend_list(user2, user1, status.confirmed);
    } catch (exception e) {
        revert_friend_list(user1, user2);
        report_error(e);
        return;
    }
}

// Without transactions, option B: queue each failed update for a later retry.
confirm_friend_request_B(user1, user2) {
    try {
        update_friend_list(user1, user2, status.confirmed);
    } catch (exception e) {
        report_error(e);
        add_to_retry_queue(operation.updatefriendlist, user1, user2, current_time());
    }
    try {
        update_friend_list(user2, user1, status.confirmed);
    } catch (exception e) {
        report_error(e);
        add_to_retry_queue(operation.updatefriendlist, user2, user1, current_time());
    }
}

Challenge: Transactions at Scale
[Figure: systems placed by Scale-out (vertical axis) against ACID transactions (horizontal axis). Key Value Stores sit high on scale-out; RDBMSs sit far along ACID transactions.]

Challenge: Lightweight Elasticity
• Provisioning on-demand and not for peak
• Optimize operating cost!
[Figure: two plots of Resources against Time, each with a Capacity curve and a Demand curve. Left: Traditional Infrastructures. Right: Deployment in the Cloud. The gap between capacity and demand is labeled "Unused resources". Slide Credits: Berkeley RAD Lab]

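The difference between provisioning for peak and provisioning on demand can be made concrete with a small calculation. This is a minimal sketch under stated assumptions: demand is a list of per-interval resource needs, capacity can only be added in whole nodes of node_capacity units, and the function names and sample numbers are hypothetical, not taken from the talk.

```python
import math


def provisioned_for_peak(demand, node_capacity):
    """Fixed capacity sized for the highest demand ever observed."""
    peak_nodes = math.ceil(max(demand) / node_capacity)
    return [peak_nodes * node_capacity] * len(demand)


def provisioned_on_demand(demand, node_capacity):
    """Capacity resized every interval to the smallest number of nodes that fits."""
    return [math.ceil(d / node_capacity) * node_capacity for d in demand]


def unused(capacity, demand):
    """Total capacity paid for but not used, summed over all intervals."""
    return sum(c - d for c, d in zip(capacity, demand))


demand = [2, 3, 8, 4, 2]          # made-up demand per interval
peak = provisioned_for_peak(demand, node_capacity=3)
elastic = provisioned_on_demand(demand, node_capacity=3)
print(unused(peak, demand), unused(elastic, demand))  # prints: 26 5
```

The sketch ignores the cost of resizing itself, which is exactly what lightweight elasticity has to keep small.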
Contributions for OLTP systems
• Transactions at Scale
    ◦ ElasTraS [HotCloud 2009, UCSB TR 2010]
    ◦ G-Store [SoCC 2010]
• Lightweight Elasticity
    ◦ Albatross [VLDB 2011]
    ◦ Zephyr [SIGMOD 2011]
• Self-Manageability
    ◦ Pythia [in progress]

Contributions for OLTP systems
It is possible to architect scalable DBMSs that
efficiently support transactional semantics to ease
application design and elastically adapt to fluctuating
operational demands to optimize the operating cost.




Contributions
Data Management
• Analytics
    ◦ Ricardo [SIGMOD ‘10]
    ◦ MD-HBase [MDM ‘11] (Best Paper Runner up)
    ◦ Anonimos [ICDE ‘10], [TKDE]
• Transaction Processing (Dissertation)
    ◦ Dynamic partitioning: G-Store [SoCC ‘10]
    ◦ Static partitioning: ElasTraS [HotCloud ‘09], [TR ‘10]
    ◦ Albatross [VLDB ‘11]
    ◦ Zephyr [SIGMOD ‘11]
• Novel Architectures
    ◦ Hyder [CIDR ‘11] (Best Paper)
    ◦ CoTS [ICDE ‘09], [VLDB ‘09]
    ◦ TCAM [DaMoN ‘08]

Transactions at Scale
[Figure: systems placed by Scale-out (vertical axis) against ACID transactions (horizontal axis). Key Value Stores sit high on scale-out; RDBMSs sit far along ACID transactions.]

Scale-out with static partitioning
• Table-level partitioning (range, hash)
    ◦ Results in distributed transactions
• Partitioning the database schema
    ◦ Co-locate data items accessed together
    ◦ Goal: Minimize distributed transactions
• Scaling-out with static partitioning
    ◦ ElasTraS [HotCloud 2009, TR 2010]
    ◦ Cloud SQL Server [ICDE 2011]
    ◦ MegaStore [CIDR 2011]
    ◦ RelationalCloud [CIDR 2011]

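The contrast between the two partitioning styles above can be sketched in a few lines. This is an illustration only, not the scheme of any of the listed systems: the customer/order schema, the partition count, and the function names are assumptions. Table-level hashing places each row by its own primary key, while schema-level partitioning places every row by the key of the root entity it belongs to, so rows accessed together land on one partition.

```python
import zlib

NUM_PARTITIONS = 4  # hypothetical cluster size


def partition_of(partition_key):
    """Deterministic hash partitioning on an arbitrary key."""
    return zlib.crc32(str(partition_key).encode("utf-8")) % NUM_PARTITIONS


def table_level(table, primary_key):
    """Each table is hashed independently on its own primary key."""
    return partition_of(f"{table}:{primary_key}")


def schema_level(root_key):
    """Every row is placed by the key of its root entity (here: the customer id)."""
    return partition_of(root_key)


# A transaction touching customer 42 and two of that customer's orders.
rows = [("customer", 42, 42), ("order", 1001, 42), ("order", 1002, 42)]
by_table = {table_level(table, pk) for table, pk, _root in rows}
by_schema = {schema_level(root) for _table, _pk, root in rows}
print(len(by_table), len(by_schema))
```

With schema-level placement the transaction always touches exactly one partition; with table-level placement it may touch several, which is what forces a distributed transaction.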
Dynamically formed partitions
• Access patterns change, often rapidly
    ◦ Online multi-player gaming applications
    ◦ Collaboration based applications
    ◦ Scientific computing applications
• Not amenable to static partitioning
• How to get the benefit of partitioning when accesses do not statically partition?
    ◦ Ours is the first solution to allow that

Online Multi-player Games
[Figure: Player Profile table with columns ID, Name, $$$, Score; players joining game instances]
• Execute transactions on player profiles while the game is in progress
• Partitions/groups are dynamic
• Hundreds of thousands of concurrent groups

Data Fusion for dynamic partitions
[G-Store, SoCC 2010]

• Transactional access to a group of data items formed on-demand
• Challenge: Avoid distributed transactions!
• Key Group Abstraction
    ◦ Groups are small
    ◦ Groups have non-trivial lifetime
    ◦ Groups are dynamic and on-demand
• Groups are dynamically formed tenant databases

Transactions on Groups
Without distributed transactions
• One key selected as the leader
• Followers transfer ownership of keys to leader
[Figure: Grouping Protocol. The keys of a Key Group end up with their ownership at a single node.]

Why is group formation hard?
• Guarantee the contract between leaders and followers in the presence of:
    ◦ Leader and follower failures
    ◦ Lost, duplicated, or re-ordered messages
    ◦ Dynamics of the underlying system
• How to ensure efficient and ACID execution of transactions?

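Grouping protocol
[Figure: message timeline between the Leader and the Follower(s), annotated with log entries.
 Messages, in time order: J, JA, JAA (group creation, started by a Create Request); Group Opns; D, DA (group deletion, started by a Delete Request).
 Follower log entries: L(Joining), L(Joined), L(Free).
 Leader log entries: L(Creating), L(Joined), L(Deleting), L(Deleted).]
• Conceptually akin to “locking”
    ◦ Locks held by groups
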
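The handshake in the timeline above can be sketched as two small state machines. This is a simplified illustration, not the G-Store implementation: the message names (J, JA, JAA, D, DA) and log entries come from the slide, while the reading of J as a join request answered by JA and confirmed by JAA, the direct method calls standing in for messages, and the abort path when a key is already taken are all assumptions. Failures and lost or duplicated messages, which the real protocol must handle, are not modeled.

```python
class Follower:
    """Owner of one key; lends the key to a group leader and takes it back."""

    def __init__(self, key):
        self.key = key
        self.state = "free"          # free -> joining -> joined -> free
        self.leader = None
        self.log = []

    def on_join(self, leader_id):
        """Handle J. Returning True stands for sending JA."""
        if self.state != "free":
            return False             # key is already promised to some group
        self.log.append("L(Joining)")
        self.state, self.leader = "joining", leader_id
        return True

    def on_join_ack_ack(self):
        """Handle JAA: the key now belongs to the group."""
        self.log.append("L(Joined)")
        self.state = "joined"

    def on_delete(self):
        """Handle D. Returning True stands for sending DA."""
        self.log.append("L(Free)")
        self.state, self.leader = "free", None
        return True


class Leader:
    def __init__(self, leader_id):
        self.leader_id = leader_id
        self.members = []
        self.log = []

    def create_group(self, followers):
        self.log.append("L(Creating)")
        acked = [f for f in followers if f.on_join(self.leader_id)]
        if len(acked) != len(followers):
            for f in acked:          # assumed abort path: release acquired keys
                f.on_delete()
            self.log.append("L(Deleted)")
            return False
        self.log.append("L(Joined)")
        for f in followers:
            f.on_join_ack_ack()
        self.members = list(followers)
        return True

    def delete_group(self):
        self.log.append("L(Deleting)")
        for f in self.members:
            f.on_delete()
        self.members = []
        self.log.append("L(Deleted)")
```

Because a follower answers J only while free, a key can belong to at most one group at a time, which is the sense in which group membership behaves like a lock.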
Efficient transaction processing
• How does the leader execute transactions?
    ◦ Caches data for group members; the underlying data store is equivalent to a disk
    ◦ Transaction logging for durability
    ◦ Cache asynchronously flushed to propagate updates
    ◦ Guaranteed update propagation
[Figure: the Leader runs a Transaction Manager and a Cache Manager and writes to a Log; updates reach the Followers through asynchronous update propagation.]

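The leader-side execution described above (cache, log, asynchronous flush) can be sketched as follows. This is a minimal illustration under stated assumptions: the class name GroupLeader is hypothetical, a plain dict stands in for the underlying key-value store, a Python list stands in for the durable log, and concurrency control between transactions is left out.

```python
class GroupLeader:
    """Executes transactions on a group's keys against a local cache."""

    def __init__(self, store, group_keys):
        self.store = store                                # underlying key-value store
        self.cache = {k: store[k] for k in group_keys}    # data cached for group members
        self.log = []                                     # stand-in for a durable log
        self.dirty = set()                                # keys updated since the last flush

    def execute(self, updates):
        """Apply a multi-key update as one unit: log first, then update the cache."""
        unknown = set(updates) - set(self.cache)
        if unknown:
            raise KeyError(f"keys outside the group: {sorted(unknown)}")
        self.log.append(dict(updates))
        self.cache.update(updates)
        self.dirty.update(updates)

    def flush(self):
        """Propagate cached updates to the store; called later, off the commit path."""
        for key in sorted(self.dirty):
            self.store[key] = self.cache[key]
        self.dirty.clear()


store = {"p1": 100, "p2": 50, "p3": 7}
leader = GroupLeader(store, ["p1", "p2"])
leader.execute({"p1": 90, "p2": 60})   # committed at the leader
stale = store["p1"]                    # the store still holds the old value
leader.flush()                         # now the store catches up
```

The store lags behind the cache until flush runs, so durability rests on the log entry written before the cache is touched.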
Prototype: G-Store [SoCC 2010]
An implementation over Key-value stores
• Grouping middleware layer resident on top of a key-value store
[Figure: layered architecture. Application Clients issue Transactional Multi-Key Access to G-Store. Each G-Store node runs a Grouping Layer and a Transaction Manager on top of the Key-Value Store Logic; all nodes sit on Distributed Storage.]

G-Store Evaluation
• Implemented using HBase
    ◦ Added the middleware layer
    ◦ ~10,000 LOC
• Experiments in Amazon EC2
• Benchmark: An online multi-player game
• Cluster size: 10 nodes
• Data size: ~1 billion rows (>1 TB)
• For groups with 100 keys
    ◦ Group creation latency: ~10 to 100 ms
    ◦ More than 10,000 groups concurrently created
[Figure: two plots, Group creation latency and Group creation throughput]

Lightweight Elasticity
• Provisioning on-demand and not for peak
• Optimize operating cost!
[Figure: two plots of Resources against Time, each with a Capacity curve and a Demand curve. Left: Traditional Infrastructures. Right: Deployment in the Cloud. The gap between capacity and demand is labeled "Unused resources". Slide Credits: Berkeley RAD Lab]

Elasticity in the Database tier
[Figure: a Load Balancer in front of the Application/Web/Caching tier, which sits on top of the Database tier]

Live database migration
• Migrate a database partition (or tenant) in a live system
    ◦ Optimize operating cost
    ◦ Resource orchestration in multitenant systems
• Different from
    ◦ Migration between software versions
    ◦ Migration in case of schema evolution

VM migration for DB elasticity
• One DB partition-per-VM
    ◦ Pros: allows fine-grained load balancing
    ◦ Cons
        ▪ Performance overhead
        ▪ Poor consolidation ratio [Curino et al., CIDR 2011]
• Multiple DB partitions in a VM
    ◦ Pros: good performance
    ◦ Cons: Migrate all partitions
        ▪ Coarse-grained load balancing
[Figure: several VMs on one Hypervisor (one partition per VM) and a single VM on a Hypervisor (multiple partitions in a VM)]

Live database migration
• Multiple partitions share the same database process
    ◦ Shared process multitenancy
• Migrate individual partitions on-demand in a live system
    ◦ Virtualization in the database tier
• Straightforward solution
    ◦ Stop serving partition at the source
    ◦ Copy to destination
    ◦ Start serving at the destination
    ◦ Expensive!

Migration cost measures
• Service un-availability
    ◦ Time the partition is unavailable
• Number of failed requests
    ◦ Number of operations failing/transactions aborting
• Performance overhead
    ◦ Impact on response times
• Additional data transferred

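The straightforward stop-and-copy approach and two of the cost measures above (failed requests, data transferred) can be sketched together. This is a toy model under stated assumptions: the Cluster class, its routing table, and the during_copy callback that simulates client requests arriving mid-migration are all hypothetical, and time is not modeled at all.

```python
class Unavailable(Exception):
    """Raised when a request reaches a partition nobody is serving."""


class Cluster:
    def __init__(self):
        self.data = {}    # node -> {partition id -> rows}
        self.owner = {}   # partition id -> serving node, or None while migrating

    def add_partition(self, node, pid, rows):
        self.data.setdefault(node, {})[pid] = dict(rows)
        self.owner[pid] = node

    def read(self, pid, key):
        node = self.owner.get(pid)
        if node is None:
            raise Unavailable(pid)
        return self.data[node][pid][key]

    def stop_and_copy(self, pid, dest, during_copy=lambda: None):
        source = self.owner[pid]
        self.owner[pid] = None                            # 1. stop serving at the source
        rows = self.data[source].pop(pid)
        during_copy()                                     # requests arriving in the window
        self.data.setdefault(dest, {})[pid] = dict(rows)  # 2. copy to the destination
        self.owner[pid] = dest                            # 3. start serving there
        return len(rows)                                  # rows transferred


def run_requests(cluster, requests):
    """Issue reads and return how many of them failed."""
    failed = 0
    for pid, key in requests:
        try:
            cluster.read(pid, key)
        except Unavailable:
            failed += 1
    return failed
```

Every request that arrives between steps 1 and 3 fails, and the window grows with the size of the partition, which is why the approach is labeled expensive.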
Two common DBMS architectures
• Decoupled storage architectures
    ◦ ElasTraS, G-Store, Deuteronomy, MegaStore
    ◦ Persistent data is not migrated
    ◦ Albatross [VLDB 2011]
• Shared nothing architectures
    ◦ SQL Azure, Relational Cloud, MySQL Cluster
    ◦ Migrate persistent data
    ◦ Zephyr [SIGMOD 2011]

Why is live DB migration hard?
• Persistent DB image must be migrated (GBs)
    ◦ How to ensure no downtime?
• Nodes can fail during migration
    ◦ How to guarantee correctness during failures?
        ▪ Transaction atomicity and durability
        ▪ Recover migration state after failure
• Transactions execute during migration
    ◦ How to guarantee serializability?
        ▪ Transaction correctness equivalent to normal operation

Our approach: Zephyr [SIGMOD 2011]
• Migration executed in phases
    ◦ Starts with transfer of minimal information to destination (“wireframe”)
• Database pages used as granule of migration
    ◦ Unique page ownership
• Source and destination concurrently execute transactions in one migration phase
• Minimal transaction synchronization
    ◦ Guaranteed serializability
• Logging and handshaking protocols

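The idea of unique page ownership while both nodes execute transactions can be sketched as below. This is an illustration only and not Zephyr's protocol: the class names, the policy of pulling a page to the destination on its first access there, and the refusal of later source accesses to that page are assumptions of the sketch. Logging, handshaking, and failure recovery are omitted.

```python
class PageNotOwned(Exception):
    """Raised when a node is asked for a page it no longer owns."""


class Node:
    def __init__(self, name, pages=None):
        self.name = name
        self.pages = dict(pages or {})   # page id -> contents, owned pages only

    def owns(self, page_id):
        return page_id in self.pages


class DualModeMigration:
    def __init__(self, source, dest):
        self.source, self.dest = source, dest
        # The destination starts with the wireframe only: it knows which
        # pages exist but owns none of them.
        self.wireframe = set(source.pages)

    def read_at_source(self, page_id):
        if not self.source.owns(page_id):
            raise PageNotOwned(page_id)
        return self.source.pages[page_id]

    def read_at_destination(self, page_id):
        if page_id not in self.wireframe:
            raise KeyError(page_id)
        if not self.dest.owns(page_id):
            # Ownership moves together with the page, so at every moment
            # exactly one node owns it.
            self.dest.pages[page_id] = self.source.pages.pop(page_id)
        return self.dest.pages[page_id]


src = Node("source", {"P1": "a", "P2": "b", "P3": "c"})
dst = Node("destination")
migration = DualModeMigration(src, dst)
migration.read_at_destination("P3")   # P3 now lives at the destination
```

Since a page is readable and writable at only one node at a time, transactions on the two nodes never operate on diverging copies of the same page.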
Simplifying assumptions
• For this talk
    ◦ Transactions access a single partition
    ◦ No replication
    ◦ No structural changes to indices
• Extensions in the paper [SIGMOD 2011]
    ◦ Relaxes these assumptions

Design overview
[Figure: the Source holds the Owned Pages P1, P2, P3, …, Pn and runs the active transactions TS1,…,TSk; no pages are shown at the Destination. Legend: page owned by node, page not owned by node.]

Init mode
• Freeze indices and migrate wireframe
[Figure: the Source still holds the Owned Pages P1, P2, P3, …, Pn and runs the active transactions TS1,…,TSk; the Destination lists the same pages P1, P2, P3, …, Pn as Un-owned Pages. Legend: page owned by node, page not owned by node.]

What is an index wireframe?
[Figure: the index at the Source and its wireframe at the Destination]

Dual mode
• Index wireframes remain frozen
[Figure: Source and Destination each list pages P1, P2, P3, …, Pn, with a legend separating pages owned by a node from pages not owned by it. Old, still active transactions TSk+1,…,TSl run at the Source; new transactions TD1,…,TDm run at the Destination. Page P3 is accessed by TDi.]

Dual mode
             Requests for un-owned pages can block

                       P1       P3 accessed by             P1
                        P2            TDi                  P2
                        P3                                 P3



                        Pn                                 Pn
Old, still active    TSk+1,…,                            TD1,…,       New transactions
transactions            TSl                               TDm
                     Source                        Destination
                                                                    Page owned by Node
          Index wireframes remain frozen
                                                                    Page not owned by Node

                                Sudipto Das {sudipto@cs.ucsb.edu}           82
Dual mode
             Requests for un-owned pages can block

                       P1       P3 accessed by             P1
                        P2            TDi                  P2
                        P3                                 P3

                                  P3 pulled
                        Pn       from source               Pn
Old, still active    TSk+1,…,                            TD1,…,       New transactions
transactions            TSl                               TDm
                     Source                        Destination
                                                                    Page owned by Node
          Index wireframes remain frozen
                                                                    Page not owned by Node

                                Sudipto Das {sudipto@cs.ucsb.edu}           83
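The pull in dual mode can be sketched as follows. This is a minimal single-process model, not the prototype's code: `SourceNode`, `DestinationNode`, `hand_over` and `TransactionAborted` are hypothetical names, and the blocking network request is stood in for by a plain method call.

```python
class TransactionAborted(Exception):
    """Raised when a source transaction touches an already migrated page."""


class SourceNode:
    def __init__(self, pages):
        self.owned = dict(pages)  # page_id -> contents

    def read(self, page_id):
        if page_id not in self.owned:
            raise TransactionAborted(f"{page_id} has migrated")
        return self.owned[page_id]

    def hand_over(self, page_id):
        # Ownership moves one way only: the source gives the page up for good.
        return self.owned.pop(page_id)


class DestinationNode:
    def __init__(self, source, wireframe):
        self.source = source
        self.wireframe = set(wireframe)  # page ids known from init mode
        self.owned = {}

    def read(self, page_id):
        if page_id not in self.wireframe:
            raise KeyError(page_id)
        if page_id not in self.owned:
            # A real system would block here until the page arrives.
            self.owned[page_id] = self.source.hand_over(page_id)
        return self.owned[page_id]


source = SourceNode({"P1": "a", "P2": "b", "P3": "c"})
destination = DestinationNode(source, wireframe=source.owned)
value = destination.read("P3")  # TDi touches P3, so P3 is pulled
```

Once P3 has moved, `source.read("P3")` raises `TransactionAborted`, while `source.read("P1")` still succeeds because P1 has not been pulled.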
Finish mode
   Pages can be pulled by the destination, if needed
[Figure: transactions at the Source have completed; the Destination runs transactions TDm+1, …, TDn. Pages P1, P2, … are pushed from the Source.]
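One way to picture how the push and the on-demand pull coexist in finish mode is the sketch below. It assumes that each page moves exactly once and that the pusher skips pages already pulled; `push_remaining` and `pull` are hypothetical helpers, and page tables are plain dictionaries.

```python
def push_remaining(source_owned, dest_owned):
    """Push pages one at a time so that pulls can interleave between steps."""
    for page_id in sorted(source_owned):  # snapshot of the remaining ids
        if page_id in source_owned:  # skip pages pulled in the meantime
            dest_owned[page_id] = source_owned.pop(page_id)
            yield page_id


def pull(page_id, source_owned, dest_owned):
    """Destination fetches a page it needs before the push reaches it."""
    if page_id not in dest_owned:
        dest_owned[page_id] = source_owned.pop(page_id)
    return dest_owned[page_id]


source_owned = {"P1": "a", "P2": "b", "P4": "d"}
dest_owned = {"P3": "c"}  # P3 was already pulled during dual mode

pusher = push_remaining(source_owned, dest_owned)
first = next(pusher)                            # pushes P1
pulled = pull("P4", source_owned, dest_owned)   # a transaction needs P4 now
rest = list(pusher)                             # pushes P2 and skips P4
```

When the generator is exhausted the source owns nothing and the destination owns every page.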
Normal operation
   Index wireframe un-frozen
[Figure: the Destination owns pages P1, P2, P3, …, Pn and runs transactions TDn+1, …, TDp; no pages are shown at the Source.]
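The four modes walked through above can be summarized as a small state table. This is an illustrative sketch: the `Mode` and `ModeInfo` names are hypothetical, and treating the wireframe as still frozen in finish mode is an assumption drawn from the fact that the slides only mention un-freezing under normal operation.

```python
from enum import Enum
from typing import NamedTuple


class Mode(Enum):
    INIT = 1
    DUAL = 2
    FINISH = 3
    NORMAL = 4


class ModeInfo(NamedTuple):
    source_runs_txns: bool
    destination_runs_txns: bool
    wireframe_frozen: bool


MODES = {
    Mode.INIT: ModeInfo(True, False, True),     # TS1..TSk active at the source
    Mode.DUAL: ModeInfo(True, True, True),      # old at source, new at destination
    Mode.FINISH: ModeInfo(False, True, True),   # source completed (frozen: assumed)
    Mode.NORMAL: ModeInfo(False, True, False),  # destination only, un-frozen
}


def next_mode(mode: Mode) -> Mode:
    """Modes advance strictly in order; there is no way back."""
    if mode is Mode.NORMAL:
        raise ValueError("migration already complete")
    return Mode(mode.value + 1)


mode, path = Mode.INIT, [Mode.INIT]
while mode is not Mode.NORMAL:
    mode = next_mode(mode)
    path.append(mode)
```

In every row of the table at least one node runs transactions, so the partition is never without a node that can serve it.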
Artifacts of this design
   Once migrated, pages are never pulled back
    by source
    ◦ Abort transactions at source accessing the
      migrated pages
   No structural changes to indices during
    migration
    ◦ Abort transactions (at both nodes) that make
      structural changes to indices
   Destination “pulls” pages on-demand
    ◦ Transactions at the destination experience higher
      latency compared to normal operation
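
The two abort rules listed above can be written as a single decision function. This is a sketch under assumptions: the `Access` record and `must_abort` are hypothetical names, and how the engine detects a structural index change is not described in the slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Access:
    node: str                      # "source" or "destination"
    page_migrated: bool            # page is already owned by the destination
    changes_index_structure: bool  # the operation would alter a frozen index


def must_abort(access: Access, migration_in_progress: bool = True) -> bool:
    """Return True when the transaction issuing this access has to abort."""
    if not migration_in_progress:
        return False
    if access.changes_index_structure:
        return True  # applies at both nodes while indices are frozen
    if access.node == "source" and access.page_migrated:
        return True  # pages are never pulled back by the source
    return False


examples = {
    "source, page still local": must_abort(Access("source", False, False)),
    "source, page migrated": must_abort(Access("source", True, False)),
    "destination, page migrated": must_abort(Access("destination", True, False)),
    "destination, structural change": must_abort(Access("destination", False, True)),
}
```

Only the second and fourth cases abort; an ordinary access at the destination never aborts under these rules, although it may pay the latency of a pull.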

Implementation
   Prototyped using an open source OLTP
    database H2
    ◦   Supports standard SQL/JDBC API
    ◦   Serializable isolation level
    ◦   Tree Indices
    ◦   Relational data model
   Modified the database engine
    ◦ Added support for freezing indices
    ◦ Page migration status maintained using index
    ◦ ~6000 LOC
   Tungsten SQL Router migrates JDBC
    connections during migration

Results Overview
   Downtime (partition unavailability)
    ◦ S&C: 3 – 8 seconds (needed to migrate, unavailable
      for updates)
    ◦ Zephyr: No downtime. Either source or
      destination is available
   Service interruption (failed operations)
    ◦ S&C: ~100s – 1,000s of failed operations. All
      transactions with updates are aborted
    ◦ Zephyr: ~10s – 100s of failed operations. An order
      of magnitude less interruption
   Minimal operational and data transfer
    overhead

Failed Operations
[Figure: failed operations plot. Annotation: order of magnitude fewer failed operations.]
Concluding Remarks
 Major enabling technologies
  ◦ Transactions at Scale
     ElasTraS
     G-Store
  ◦ Lightweight Elasticity
     Albatross
     Zephyr
Future Directions

   Self-managing controller for large
    multitenant database infrastructures

   Convergence of transactional and analytics
    systems for real-time intelligence

   Putting human-in-the-loop: Leveraging
    crowd-sourcing

Acknowledgements

  My advisors and my committee members
  Computer Science Dept. at UCSB
  Funding sources: NSF, NEC Labs America,
   and AWS in Education
  Colleagues at DSL and at UCSB
  My family




November 16, 2011   Sudipto Das {sudipto@cs.ucsb.edu}   95
Thank you!

Collaborators
UCSB:
Divy Agrawal, Amr El Abbadi, Ömer Eğecioğlu
Shashank Agarwal, Shyam Antony, Aaron Elmore,
Shoji Nishimura (NEC Japan)
Microsoft Research Redmond:
Phil Bernstein, Colin Reid
IBM Almaden:
Yannis Sismanis, Kevin Beyer, Rainer Gemulla,
Peter Haas, John McPherson

Scalable and Elastic Transactional Data Stores for Cloud Computing Platforms

  • 1. PhD Defense Scalable and Elastic Transactional Data Stores for Cloud Computing Platforms Sudipto Das Computer Science, UC Santa Barbara sudipto@cs.ucsb.edu Committee: Divy Agrawal (co-chair), Amr El Abbadi (co-chair), Phil Bernstein, Tim Sherwood Sponsors:
  • 2. Web replacing Desktop Sudipto Das {sudipto@cs.ucsb.edu} 2
  • 3. Paradigm shift in Infrastructure Sudipto Das {sudipto@cs.ucsb.edu} 3
  • 4. Paradigm shift in Infrastructure Sudipto Das {sudipto@cs.ucsb.edu} 4
  • 5. Cloud computing  Computing infrastructure and solutions delivered as a service ◦ Industry worth USD150 billion by 2014*  Contributors to success ◦ Economies of scale ◦ Elasticity and pay-per-use pricing  Popular paradigms ◦ Infrastructure as a Service (IaaS) ◦ Platform as a Service (PaaS) ◦ Software as a Service (SaaS) *http://www.crn.com/news/channel-programs/225700984/cloud-computing-services-market-to-near-150-billion-in-2014.htm Sudipto Das {sudipto@cs.ucsb.edu} 5
  • 6. Databases for cloud platforms  Data is central to applications  DBMSs are mission critical component in cloud software stack ◦ Manage petabytes of data, drive revenue ◦ Serve a variety of applications (multitenancy)  Data needs for cloud applications ◦ OLTP systems: store and serve data ◦ Data analysis systems: decision support, intelligence Sudipto Das {sudipto@cs.ucsb.edu} 6
  • 7. Databases for cloud platforms  Data is central to applications  DBMSs are mission critical component in cloud software stack ◦ Manage petabytes of data, drive revenue ◦ Serve a variety of applications (multitenancy)  Data needs for cloud applications ◦ OLTP systems: store and serve data ◦ Data analysis systems: decision support, intelligence Sudipto Das {sudipto@cs.ucsb.edu} 7
  • 8. Application landscape  Social gaming  Rich content and mash-ups  Managed applications  Cloud application platforms Sudipto Das {sudipto@cs.ucsb.edu} 8
  • 9. Challenges for OLTP systems  Scalability ◦ While ensuring efficient transaction execution!  Lightweight Elasticity ◦ Scale on-demand! Sudipto Das {sudipto@cs.ucsb.edu} 9
  • 10. Two approaches to scalability  Scale-up ◦ Preferred in classical enterprise setting (RDBMS) ◦ Flexible ACID transactions ◦ Transactions access a single node Sudipto Das {sudipto@cs.ucsb.edu} 10
  • 11. Two approaches to scalability  Scale-up ◦ Preferred in classical enterprise setting (RDBMS) ◦ Flexible ACID transactions ◦ Transactions access a single node  Scale-out ◦ Cloud friendly (Key value stores) ◦ Execution at a single server  Limited functionality & guarantees ◦ No multi-row or multi-step transactions Sudipto Das {sudipto@cs.ucsb.edu} 11
  • 12. Why care about transactions? confirm_friend_request(user1, user2) { begin_transaction(); 
 update_friend_list(user1, user2, status.confirmed); 
 update_friend_list(user2, user1, status.confirmed); end_transaction(); } Sudipto Das {sudipto@cs.ucsb.edu} 12
  • 13. Why care about transactions? confirm_friend_request(user1, user2) { begin_transaction(); 
 update_friend_list(user1, user2, status.confirmed); 
 update_friend_list(user2, user1, status.confirmed); end_transaction(); } Simplicity in application design with ACID transactions Sudipto Das {sudipto@cs.ucsb.edu} 13
  • 14. confirm_friend_request_A(user1, user2) { try { update_friend_list(user1, user2, status.confirmed); } catch(exception e) { report_error(e); return; } try { update_friend_list(user2, user1, status.confirmed); 
 } catch(exception e) { revert_friend_list(user1, user2); report_error(e); return; } } confirm_friend_request_B(user1, user2) { try{ 
 update_friend_list(user1, user2, status.confirmed); } catch(exception e) { 
 report_error(e); 
 add_to_retry_queue(operation.updatefriendlist, user1, user2, current_time()); 
} try { 
 update_friend_list(user2, user1, status.confirmed); } catch(exception e) { 
 report_error(e); 
 add_to_retry_queue(operation.updatefriendlist, user2, user1, current_time()); } } Sudipto Das {sudipto@cs.ucsb.edu} 14
  • 15. confirm_friend_request_A(user1, user2) { try { update_friend_list(user1, user2, status.confirmed); } catch(exception e) { report_error(e); return; } try { update_friend_list(user2, user1, status.confirmed); 
 } catch(exception e) { revert_friend_list(user1, user2); report_error(e); return; } } confirm_friend_request_B(user1, user2) { try{ 
 update_friend_list(user1, user2, status.confirmed); } catch(exception e) { 
 report_error(e); 
 add_to_retry_queue(operation.updatefriendlist, user1, user2, current_time()); 
} try { 
 update_friend_list(user2, user1, status.confirmed); } catch(exception e) { 
 report_error(e); 
 add_to_retry_queue(operation.updatefriendlist, user2, user1, current_time()); } } Sudipto Das {sudipto@cs.ucsb.edu} 15
  • 16. Challenge: Transactions at Scale Key Value Stores Scale-out RDBMSs ACID transactions Sudipto Das {sudipto@cs.ucsb.edu} 16
  • 17. Challenge: Lightweight Elasticity Provisioning on-demand and not for peak Optimize operating cost! Capacity Resources Resources Demand Capacity Demand Time Time Traditional Infrastructures Deployment in the Cloud Unused resources Slide Credits: Berkeley RAD Lab Sudipto Das {sudipto@cs.ucsb.edu} 17
  • 18. Contributions for OLTP systems  Transactions at Scale ◦ ElasTraS [HotCloud 2009, UCSB TR 2010] ◦ G-Store [SoCC 2010]  Lightweight Elasticity ◦ Albatross [VLDB 2011] ◦ Zephyr [SIGMOD 2011]  Self-Manageability ◦ Pythia [in progress] Sudipto Das {sudipto@cs.ucsb.edu} 18
  • 19. Contributions for OLTP systems It is possible to architect scalable DBMSs that efficiently support transactional semantics to ease application design and elastically adapt to fluctuating operational demands to optimize the operating cost. Sudipto Das {sudipto@cs.ucsb.edu} 19
  • 20. Contributions for OLTP systems It is possible to architect scalable DBMSs that efficiently support transactional semantics to ease application design and elastically adapt to fluctuating operational demands to optimize the operating cost.  Transactions at Scale ◦ ElasTraS [HotCloud 2009, UCSB TR 2010] ◦ G-Store [SoCC 2010] Sudipto Das {sudipto@cs.ucsb.edu} 20
  • 21. Contributions for OLTP systems It is possible to architect scalable DBMSs that efficiently support transactional semantics to ease application design and elastically adapt to fluctuating operational demands to optimize the operating cost.  Transactions at  Lightweight Scale Elasticity ◦ ElasTraS [HotCloud ◦ Albatross 2009, UCSB TR 2010] [VLDB 2011] ◦ G-Store ◦ Zephyr [SoCC 2010] [SIGMOD 2011] Sudipto Das {sudipto@cs.ucsb.edu} 21
  • 23. Contributions: Data Management ▪ Transaction Processing (Dissertation) ◦ Static partitioning: ElasTraS [HotCloud '09] [TR '10] ◦ Dynamic partitioning: G-Store [SoCC '10] ◦ Albatross [VLDB '11] ◦ Zephyr [SIGMOD '11]
  • 24. Contributions: Data Management ▪ Transaction Processing (Dissertation) ◦ Static partitioning: ElasTraS [HotCloud '09] [TR '10] ◦ Dynamic partitioning: G-Store [SoCC '10] ◦ Albatross [VLDB '11] ◦ Zephyr [SIGMOD '11] ▪ Analytics ◦ Ricardo [SIGMOD '10] ◦ MD-HBase [MDM '11], Best Paper Runner up ◦ Anonimos [ICDE '10], [TKDE]
  • 25. Contributions: Data Management ▪ Transaction Processing (Dissertation) ◦ Static partitioning: ElasTraS [HotCloud '09] [TR '10] ◦ Dynamic partitioning: G-Store [SoCC '10] ◦ Albatross [VLDB '11] ◦ Zephyr [SIGMOD '11] ▪ Analytics ◦ Ricardo [SIGMOD '10] ◦ MD-HBase [MDM '11], Best Paper Runner up ◦ Anonimos [ICDE '10], [TKDE] ▪ Novel Architectures ◦ Hyder [CIDR '11], Best Paper ◦ CoTS [ICDE '09], [VLDB '09] ◦ TCAM [DaMoN '08]
  • 26. Transactions at Scale [Figure: scale-out plotted against support for ACID transactions, locating Key Value Stores and RDBMSs]
  • 27. Scale-out with static partitioning ▪ Table level partitioning (range, hash) ◦ Distributed transactions ▪ Partitioning the Database schema ◦ Co-locate data items accessed together ◦ Goal: Minimize distributed transactions
  • 29. Scale-out with static partitioning ▪ Table level partitioning (range, hash) ◦ Distributed transactions ▪ Partitioning the Database schema ◦ Co-locate data items accessed together ◦ Goal: Minimize distributed transactions ▪ Scaling-out with static partitioning ◦ ElasTraS [HotCloud 2009, TR 2010]
  • 30. Scale-out with static partitioning ▪ Table level partitioning (range, hash) ◦ Distributed transactions ▪ Partitioning the Database schema ◦ Co-locate data items accessed together ◦ Goal: Minimize distributed transactions ▪ Scaling-out with static partitioning ◦ ElasTraS [HotCloud 2009, TR 2010] ◦ Cloud SQL Server [ICDE 2011] ◦ MegaStore [CIDR 2011] ◦ RelationalCloud [CIDR 2011]
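The contrast between table-level and schema-level partitioning on the slides above can be sketched as follows. This is a minimal illustration and not ElasTraS code: the table names, the `customer_id`-style root key, the partition count, and the helper functions are all assumptions made for the example.

```python
import zlib

NUM_PARTITIONS = 4  # assumed cluster size, for illustration only


def bucket(value: str) -> int:
    """Deterministic hash of a string onto a partition number."""
    return zlib.crc32(value.encode("utf-8")) % NUM_PARTITIONS


def table_level(table: str, row_key: str) -> int:
    """Table-level hash partitioning: each row is placed by its own key."""
    return bucket(f"{table}:{row_key}")


def schema_level(root_key: str) -> int:
    """Schema-level partitioning: every row is placed by the key of the
    root entity it belongs to (here a hypothetical customer id)."""
    return bucket(root_key)


# One transaction touching a customer, one of its orders, and an order line.
# Each access is (table, row_key, root_key).
txn = [
    ("customer", "c42", "c42"),
    ("orders", "o9001", "c42"),
    ("order_line", "o9001-1", "c42"),
]

touched_table_level = {table_level(t, k) for t, k, _ in txn}
touched_schema_level = {schema_level(root) for _, _, root in txn}

# Schema-level placement keeps the whole transaction on one partition.
# Table-level placement may spread it over several partitions, which
# would then require a distributed commit.
print("table-level partitions touched:", sorted(touched_table_level))
print("schema-level partitions touched:", sorted(touched_schema_level))
```

Placing rows by a shared root key is only one way to co-locate data accessed together; it works when the access pattern is known in advance, which is the case the static-partitioning slides address.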
  • 31. Dynamically formed partitions ▪ Access patterns change, often rapidly ◦ Online multi-player gaming applications ◦ Collaboration based applications ◦ Scientific computing applications ▪ Not amenable to static partitioning
  • 32. Dynamically formed partitions ▪ Access patterns change, often rapidly ◦ Online multi-player gaming applications ◦ Collaboration based applications ◦ Scientific computing applications ▪ Not amenable to static partitioning ▪ How to get the benefit of partitioning when accesses do not statically partition? ◦ Ours is the first solution to allow that
  • 33. Online Multi-player Games [Figure: Player Profile table with columns ID, Name, $$$, Score]
  • 34. Online Multi-player Games
  • 35. Online Multi-player Games: Execute transactions on player profiles while the game is in progress
  • 36. Online Multi-player Games
  • 37. Online Multi-player Games: Partitions/groups are dynamic
  • 38. Online Multi-player Games: Hundreds of thousands of concurrent groups
  • 39. Data Fusion for dynamic partitions [G-Store, SoCC 2010] ▪ Transactional access to a group of data items formed on-demand ▪ Challenge: Avoid distributed transactions!
  • 40. Data Fusion for dynamic partitions [G-Store, SoCC 2010] ▪ Transactional access to a group of data items formed on-demand ▪ Challenge: Avoid distributed transactions! ▪ Key Group Abstraction ◦ Groups are small ◦ Groups have non-trivial lifetime ◦ Groups are dynamic and on-demand ▪ Groups are dynamically formed tenant databases
  • 41. Transactions on Groups: Without distributed transactions ▪ One key selected as the leader
  • 42. Transactions on Groups: Without distributed transactions ▪ One key selected as the leader ▪ Followers transfer ownership of keys to leader
  • 43. Transactions on Groups: Without distributed transactions [Figure labels: Key Group; Ownership of keys at a single node] ▪ One key selected as the leader ▪ Followers transfer ownership of keys to leader
  • 44. Transactions on Groups: Without distributed transactions [Figure labels: Grouping Protocol; Key Group; Ownership of keys at a single node] ▪ One key selected as the leader ▪ Followers transfer ownership of keys to leader
  • 45. Why is group formation hard? ▪ Guarantee the contract between leaders and followers in the presence of: ◦ Leader and follower failures ◦ Lost, duplicated, or re-ordered messages ◦ Dynamics of the underlying system ▪ How to ensure efficient and ACID execution of transactions?
  • 46. Grouping protocol [Figure: message timeline. Leader: Create Request, L(Creating), L(Joined). Messages between leader and follower(s): J, JA, JAA. Follower(s): L(Joining), L(Joined). Horizontal axis: Time]
  • 47. Grouping protocol [Figure: message timeline. Leader: Create Request, L(Creating), L(Joined), then Group Opns. Messages between leader and follower(s): J, JA, JAA. Follower(s): L(Joining), L(Joined). Horizontal axis: Time]
  • 48. Grouping protocol [Figure: message timeline. Leader: Create Request, L(Creating), L(Joined), Group Opns, Delete Request, L(Deleting), L(Deleted). Messages between leader and follower(s): J, JA, JAA, D, DA. Follower(s): L(Joining), L(Joined), L(Free). Horizontal axis: Time]
  • 49. Grouping protocol [Figure: message timeline. Leader: Create Request, L(Creating), L(Joined), Group Opns, Delete Request, L(Deleting), L(Deleted). Messages between leader and follower(s): J, JA, JAA, D, DA. Follower(s): L(Joining), L(Joined), L(Free). The L(...) items are labeled as log entries. Horizontal axis: Time]
  • 50. Grouping protocol [Figure: message timeline. Leader: Create Request, L(Creating), L(Joined), Group Opns, Delete Request, L(Deleting), L(Deleted). Messages between leader and follower(s): J, JA, JAA, D, DA. Follower(s): L(Joining), L(Joined), L(Free). The L(...) items are labeled as log entries. Horizontal axis: Time] ▪ Conceptually akin to “locking” ◦ Locks held by groups
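The timeline above can be read as an ownership handoff that behaves like lock acquisition and release. The following is a minimal single-process sketch under stated assumptions: that the J, JA, JAA and D, DA messages are exchanged in that order, and that the L(...) labels are log entries written by each side. The class and method names are hypothetical. Failures, retries, and lost or re-ordered messages, which the slides name as the hard part, are deliberately left out.

```python
class Follower:
    """Owns one key until it yields ownership to a group leader."""

    def __init__(self, key):
        self.key = key
        self.owner = None   # None means the key is free (in no group)
        self.log = []       # stand-in for the follower's durable log

    def on_join(self, group_id):
        """Handle J: agree to join if the key is free (the JA reply)."""
        if self.owner is not None:
            return False    # key already belongs to another group
        self.log.append(("Joining", group_id))
        self.owner = group_id
        return True

    def on_join_ack_ack(self, group_id):
        """Handle JAA: the leader confirmed, so the join is complete."""
        self.log.append(("Joined", group_id))

    def on_delete(self, group_id):
        """Handle D: take ownership back (the DA reply)."""
        if self.owner == group_id:
            self.owner = None
            self.log.append(("Free", group_id))
        return True


class Leader:
    """Collects ownership of the follower keys for one group."""

    def __init__(self, group_id):
        self.group_id = group_id
        self.members = []
        self.log = []

    def create(self, followers):
        self.log.append(("Creating", self.group_id))
        joined = [f for f in followers if f.on_join(self.group_id)]  # J, JA
        if len(joined) != len(followers):
            for f in joined:                  # roll back a partial group
                f.on_delete(self.group_id)
            self.log.append(("Deleted", self.group_id))
            return False
        self.members = joined
        self.log.append(("Joined", self.group_id))
        for f in joined:
            f.on_join_ack_ack(self.group_id)  # JAA
        return True

    def delete(self):
        self.log.append(("Deleting", self.group_id))
        for f in self.members:
            f.on_delete(self.group_id)        # D, DA
        self.members = []
        self.log.append(("Deleted", self.group_id))


players = [Follower(k) for k in ("p1", "p2", "p3")]
game_a = Leader("game-a")
game_b = Leader("game-b")

created_a = game_a.create(players)
created_b = game_b.create(players[:2])   # keys are held by game-a: refused
game_a.delete()
created_b_retry = game_b.create(players[:2])
```

The refused second group shows the lock-like behavior: a key can belong to at most one group at a time, and it becomes available again only after the owning group is deleted.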
  • 51. Efficient transaction processing ▪ How does the leader execute transactions? ◦ Caches data for group members; the underlying data store is equivalent to a disk ◦ Transaction logging for durability ◦ Cache asynchronously flushed to propagate updates ◦ Guaranteed update propagation [Figure: the leader runs a Transaction Manager with a Log and a Cache Manager; updates propagate asynchronously to the followers]
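The leader-side execution described above (cache the group's data, log on commit, propagate later) can be sketched as below. This is an illustration only: a Python dict stands in for the key-value store, a list stands in for the durable log, and `flush` is called explicitly where the slides describe asynchronous propagation. The `GroupLeader` class and the `transfer` helper are hypothetical names.

```python
class GroupLeader:
    """Executes transactions on a group's keys entirely at one node."""

    def __init__(self, store, keys):
        self.store = store                        # underlying key-value store
        self.cache = {k: store[k] for k in keys}  # cached group data
        self.log = []                             # stand-in for a durable log
        self.dirty = set()                        # keys not yet propagated

    def execute(self, txn):
        """Run txn against a private copy; commit only if it succeeds."""
        workspace = dict(self.cache)
        try:
            txn(workspace)
        except Exception:
            return False                          # abort: cache is untouched
        writes = {k: v for k, v in workspace.items() if self.cache.get(k) != v}
        self.log.append(writes)                   # log before acknowledging
        self.cache.update(writes)
        self.dirty.update(writes)
        return True

    def flush(self):
        """Propagate committed updates to the store (asynchronous in the
        slides; explicit here to keep the sketch deterministic)."""
        for k in sorted(self.dirty):
            self.store[k] = self.cache[k]
        self.dirty.clear()


def transfer(amount, src, dst):
    """Build a transaction that moves `amount` between two cached keys."""
    def txn(data):
        if data[src] < amount:
            raise ValueError("insufficient funds")
        data[src] -= amount
        data[dst] += amount
    return txn


kv_store = {"alice": 100, "bob": 20, "carol": 5}
leader = GroupLeader(kv_store, ["alice", "bob"])

ok = leader.execute(transfer(30, "alice", "bob"))
rejected = leader.execute(transfer(500, "bob", "alice"))
before_flush = dict(kv_store)   # the store still holds the old values
leader.flush()
```

Because every key in the group is owned by the leader, the transaction commits with a local log write and no cross-node coordination; the store only sees the result once the cache is flushed.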
  • 52. Prototype: G-Store [SoCC 2010]. An implementation over Key-value stores [Figure: Application Clients issue Transactional Multi-Key Access requests to G-Store; each of three G-Store nodes stacks a Grouping Layer and a Transaction Manager on Key-Value Store Logic; all nodes sit on Distributed Storage]
  • 53. Prototype: G-Store [SoCC 2010]. An implementation over Key-value stores. Grouping middleware layer resident on top of a key-value store [Figure: Application Clients issue Transactional Multi-Key Access requests to G-Store; each of three G-Store nodes stacks a Grouping Layer and a Transaction Manager on Key-Value Store Logic; all nodes sit on Distributed Storage]
  • 54. G-Store Evaluation ▪ Implemented using HBase ◦ Added the middleware layer ◦ ~10,000 LOC ▪ Experiments in Amazon EC2 ▪ Benchmark: An online multi-player game ▪ Cluster size: 10 nodes ▪ Data size: ~1 billion rows (>1 TB)
  • 55. G-Store Evaluation ▪ Implemented using HBase ◦ Added the middleware layer ◦ ~10,000 LOC ▪ Experiments in Amazon EC2 ▪ Benchmark: An online multi-player game ▪ Cluster size: 10 nodes ▪ Data size: ~1 billion rows (>1 TB) ▪ For groups with 100 keys ◦ Group creation latency: ~10 – 100 ms ◦ More than 10,000 groups concurrently created
  • 56. G-Store Evaluation [Figures: Group creation latency; Group creation throughput]
  • 57. Lightweight Elasticity. Provisioning on-demand and not for peak. Optimize operating cost! [Figure: two plots of resources against time showing capacity and demand, one for Traditional Infrastructures and one for Deployment in the Cloud, with unused resources highlighted] Slide Credits: Berkeley RAD Lab
  • 58. Elasticity in the Database tier [Figure: Load Balancer in front of the Application/Web/Caching tier, which sits on the Database tier]
  • 65. Live database migration ▪ Migrate a database partition (or tenant) in a live system ◦ Optimize operating cost ◦ Resource orchestration in multitenant systems
  • 66. Live database migration ▪ Migrate a database partition (or tenant) in a live system ◦ Optimize operating cost ◦ Resource orchestration in multitenant systems ▪ Different from ◦ Migration between software versions ◦ Migration in case of schema evolution
  • 67. VM migration for DB elasticity ▪ One DB partition-per-VM ◦ Pros: allows fine-grained load balancing ◦ Cons: performance overhead; poor consolidation ratio [Curino et al., CIDR 2011] [Figure: three VMs on one hypervisor]
  • 68. VM migration for DB elasticity ▪ One DB partition-per-VM ◦ Pros: allows fine-grained load balancing ◦ Cons: performance overhead; poor consolidation ratio [Curino et al., CIDR 2011] [Figure: three VMs on one hypervisor] ▪ Multiple DB partitions in a VM ◦ Pros: good performance ◦ Cons: must migrate all partitions together, so load balancing is coarse-grained [Figure: one VM on one hypervisor]
  • 69. Live database migration ▪ Multiple partitions share the same database process ◦ Shared process multitenancy ▪ Migrate individual partitions on-demand in a live system ◦ Virtualization in the database tier ▪ Straightforward solution ◦ Stop serving partition at the source ◦ Copy to destination ◦ Start serving at the destination ◦ Expensive!
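The "straightforward solution" above can be sketched in a few lines to show where the cost comes from. This is a toy model, not any system's implementation: the `Node` class, the in-memory dicts, and the list of requests arriving during the copy are assumptions for illustration.

```python
class Node:
    """A database node serving zero or more partitions from memory."""

    def __init__(self, name):
        self.name = name
        self.partitions = {}     # partition id -> dict of rows

    def serve(self, pid, key):
        if pid not in self.partitions:
            raise LookupError(f"{self.name} is not serving {pid}")
        return self.partitions[pid][key]


def stop_and_copy(pid, source, destination, requests_during_copy):
    """Move a partition by stopping it at the source, copying it, and
    starting it at the destination. Returns the number of requests that
    failed while the partition was unavailable."""
    data = source.partitions.pop(pid)          # 1. stop serving at the source
    failed = 0
    for key in requests_during_copy:           # requests arriving mid-copy
        try:
            source.serve(pid, key)
        except LookupError:
            failed += 1
    destination.partitions[pid] = dict(data)   # 2. copy to the destination
    return failed                              # 3. destination serves from here


src, dst = Node("source"), Node("destination")
src.partitions["tenant-7"] = {"k1": "v1", "k2": "v2"}
failed = stop_and_copy("tenant-7", src, dst, ["k1", "k2", "k1"])
value_after = dst.serve("tenant-7", "k1")
```

Every request that arrives between the stop and the restart fails, and the window grows with the size of the partition being copied, which is the expense the slide points at.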
  • 70. Migration cost measures ▪ Service un-availability ◦ Time the partition is unavailable ▪ Number of failed requests ◦ Number of operations failing/transactions aborting ▪ Performance overhead ◦ Impact on response times ▪ Additional data transferred
  • 71. Two common DBMS architectures ▪ Decoupled storage architectures ◦ ElasTraS, G-Store, Deuteronomy, MegaStore ◦ Persistent data is not migrated ◦ Albatross [VLDB 2011] ▪ Shared nothing architectures ◦ SQL Azure, Relational Cloud, MySQL Cluster ◦ Migrate persistent data ◦ Zephyr [SIGMOD 2011]
  • 73. Why is live DB migration hard? ▪ Persistent DB image must be migrated (GBs) ◦ How to ensure no downtime? ▪ Nodes can fail during migration ◦ How to guarantee correctness during failures? (Transaction atomicity and durability; recover migration state after failure.) ▪ Transactions execute during migration ◦ How to guarantee serializability? (Transaction correctness equivalent to normal operation.)
  • 74. Our approach: Zephyr [SIGMOD 2011] ▪ Migration executed in phases ◦ Starts with transfer of minimal information to destination (“wireframe”) ▪ Database pages used as granule of migration ◦ Unique page ownership ▪ Source and destination concurrently execute transactions in one migration phase ▪ Minimal transaction synchronization ▪ Guaranteed serializability ▪ Logging and handshaking protocols
  • 75. Simplifying assumptions ▪ For this talk ◦ Transactions access a single partition ◦ No replication ◦ No structural changes to indices ▪ Extensions in the paper [SIGMOD 2011] ◦ Relaxes these assumptions
  • 76. Design overview [Figure: Source node with owned pages P1, P2, P3, …, Pn and active transactions TS1, …, TSk; Destination node. Legend: page owned by node; page not owned by node]
  • 77. Init mode: Freeze indices and migrate wireframe [Figure: Source node with owned pages P1, P2, P3, …, Pn and active transactions TS1, …, TSk; Destination node with un-owned pages P1, P2, P3, …, Pn. Legend: page owned by node; page not owned by node]
  • 78. What is an index wireframe? [Figure labeled: Source]
  • 79. What is an index wireframe? [Figure labeled: Source, Destination]
  • 80. Dual mode: Index wireframes remain frozen [Figure: pages P1, P2, P3, …, Pn shown at both nodes; the Source runs old, still active transactions TSk+1, …, TSl; the Destination runs new transactions TD1, …, TDm. Legend: page owned by node; page not owned by node]
  • 81. Dual mode: Index wireframes remain frozen [Figure: pages P1, P2, P3, …, Pn shown at both nodes; the Source runs old, still active transactions TSk+1, …, TSl; the Destination runs new transactions TD1, …, TDm; P3 accessed by TDi. Legend: page owned by node; page not owned by node]
  • 82. Dual mode: Index wireframes remain frozen. Requests for un-owned pages can block [Figure: pages P1, P2, P3, …, Pn shown at both nodes; the Source runs old, still active transactions TSk+1, …, TSl; the Destination runs new transactions TD1, …, TDm; P3 accessed by TDi. Legend: page owned by node; page not owned by node]
  • 83. Dual mode: Index wireframes remain frozen. Requests for un-owned pages can block [Figure: pages P1, P2, P3, …, Pn shown at both nodes; the Source runs old, still active transactions TSk+1, …, TSl; the Destination runs new transactions TD1, …, TDm; P3 accessed by TDi and pulled from source. Legend: page owned by node; page not owned by node]
  • 84. Finish mode [Figure: pages P1, P2, P3, …, Pn shown at both nodes; P1, P2, … pushed from source; transactions at the Source completed; the Destination runs TDm+1, …, TDn. Legend: page owned by node; page not owned by node]
  • 85. Finish mode: Pages can be pulled by the destination, if needed [Figure: pages P1, P2, P3, …, Pn shown at both nodes; P1, P2, … pushed from source; transactions at the Source completed; the Destination runs TDm+1, …, TDn. Legend: page owned by node; page not owned by node]
  • 86. Normal operation: Index wireframe un-frozen [Figure: Source and Destination nodes; pages P1, P2, P3, …, Pn and transactions TDn+1, …, TDp at the Destination. Legend: page owned by node; page not owned by node]
  • 87. Artifacts of this design ▪ Once migrated, pages are never pulled back by source ◦ Abort transactions at source accessing the migrated pages ▪ No structural changes to indices during migration ◦ Abort transactions (at both nodes) that make structural changes to indices ▪ Destination “pulls” pages on-demand ◦ Transactions at the destination experience higher latency compared to normal operation
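The init, dual, and finish modes above, together with the rule that a page never returns to the source, can be sketched as a small ownership tracker. This is a simplified model under stated assumptions: the wireframe is reduced to the set of page ids, transactions are reduced to single page accesses, and logging, handshaking, and failure recovery are omitted. The class and method names are hypothetical and are not taken from the Zephyr implementation.

```python
class ZephyrSketch:
    """Tracks which node owns each page during a migration."""

    def __init__(self, pages):
        self.source_pages = dict(pages)   # page id -> contents, owned by source
        self.dest_pages = {}              # pages now owned by the destination
        self.wireframe = frozenset()      # page ids known at the destination
        self.mode = "normal-at-source"

    def init_mode(self):
        # Only the wireframe (here: the set of page ids) is sent over;
        # the destination learns which pages exist but owns none of them.
        self.wireframe = frozenset(self.source_pages)
        self.mode = "init"

    def dual_mode(self):
        self.mode = "dual"

    def access_at_destination(self, page_id):
        """A destination transaction pulls a page on demand."""
        if self.mode != "dual":
            raise RuntimeError("destination only pulls pages in dual mode")
        if page_id not in self.wireframe:
            raise KeyError(page_id)
        if page_id not in self.dest_pages:
            # Ownership moves exactly once, from source to destination.
            self.dest_pages[page_id] = self.source_pages.pop(page_id)
        return self.dest_pages[page_id]

    def access_at_source(self, page_id):
        """A source transaction aborts if the page was already migrated."""
        if page_id not in self.source_pages:
            raise RuntimeError(f"abort: {page_id} already migrated")
        return self.source_pages[page_id]

    def finish_mode(self):
        # Remaining pages are pushed from the source to the destination,
        # after which the destination operates normally.
        self.mode = "finish"
        self.dest_pages.update(self.source_pages)
        self.source_pages.clear()
        self.mode = "normal-at-destination"


m = ZephyrSketch({"P1": "a", "P2": "b", "P3": "c"})
m.init_mode()
m.dual_mode()
pulled = m.access_at_destination("P3")   # P3 moves to the destination
still_local = m.access_at_source("P1")   # P1 has not moved yet
try:
    m.access_at_source("P3")
    aborted = False
except RuntimeError:
    aborted = True                       # source transaction must abort
m.finish_mode()
```

At every step each page lives in exactly one of the two dicts, which is the unique-page-ownership property the design relies on; the abort at the source is the price of never pulling a page back.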
  • 88. Implementation ▪ Prototyped using an open source OLTP database H2 ◦ Supports standard SQL/JDBC API ◦ Serializable isolation level ◦ Tree Indices ◦ Relational data model ▪ Modified the database engine ◦ Added support for freezing indices ◦ Page migration status maintained using index ◦ ~6,000 LOC ▪ Tungsten SQL Router migrates JDBC connections during migration
  • 89. Results Overview ▪ Downtime (partition unavailability) ◦ S&C (stop and copy): 3 – 8 seconds (needed to migrate, unavailable for updates) ◦ Zephyr: No downtime. Either source or destination is available ▪ Service interruption (failed operations) ◦ S&C: ~100s – 1,000s of failed operations. All transactions with updates are aborted ◦ Zephyr: ~10s – 100s of failed operations. Order of magnitude less interruption ▪ Minimal operational and data transfer overhead
  • 90. Failed Operations: Order of magnitude fewer failed operations
  • 91. Concluding Remarks
  • 93. Concluding Remarks ▪ Major enabling technologies ◦ Transactions at Scale: ElasTraS, G-Store ◦ Lightweight Elasticity: Albatross, Zephyr
  • 94. Future Directions ▪ Self-managing controller for large multitenant database infrastructures ▪ Convergence of transactional and analytics systems for real-time intelligence ▪ Putting human-in-the-loop: Leveraging crowd-sourcing
  • 95. Acknowledgements ▪ My advisors and my committee members ▪ Computer Science Dept. at UCSB ▪ Funding sources: NSF, NEC Labs America, and AWS in Education ▪ Colleagues at DSL and at UCSB ▪ My family. November 16, 2011
  • 96. Thank you! Collaborators. UCSB: Divy Agrawal, Amr El Abbadi, Ömer Eğecioğlu, Shashank Agarwal, Shyam Antony, Aaron Elmore, Shoji Nishimura (NEC Japan). Microsoft Research Redmond: Phil Bernstein, Colin Reid. IBM Almaden: Yannis Sismanis, Kevin Beyer, Rainer Gemulla, Peter Haas, John McPherson

Editor's notes

  1. In the last few years, we have witnessed a trend in which web applications have been replacing desktop applications, and large numbers of applications are now accessed via the browser.
  2. This shift from the desktop to the web has also changed the application deployment infrastructure, giving rise to the paradigm popularly known as cloud computing.
  4. In its simplest form, cloud computing is computing infrastructure and solutions delivered as a service. Analysts predict that this industry will be worth 150 billion dollars by 2014. Even though almost every aspect of computing can be provided as a service, there have been three popular cloud paradigms. Infrastructure as a service, the lowest level of abstraction, provides raw CPU, storage, and network as a service; popular examples include Amazon Web Services and Rackspace. The next higher level of abstraction is platform as a service, which provides a platform or containers to deploy applications; the platform provider abstracts data management, fault-tolerance, elastic scaling, etc., thus simplifying application deployment. Popular examples include Google AppEngine and Windows Azure. The highest level of abstraction is software as a service, which exposes a simple interface to customize pre-designed application logic; a popular example is Salesforce.com. Major factors that have contributed to the success of cloud platforms are advances on the technology front, such as virtualization and pervasive broadband internet connectivity, as well as business and economic factors, such as economies of scale and transfer of risks. In this talk, we focus on cloud application platforms, in particular the database systems that serve them.
  5. Data is central to all modern applications, and most modern enterprises manage petabytes of data. Hence DBMSs form a mission critical component in the cloud software stack and are key to success as well as to generating revenue. Considering the data needs of web applications, there are two broad categories of systems. On one hand are OLTP systems that store and serve data. On the other hand are OLAP systems that provide intelligence and decision support. In this talk, we will focus on OLTP systems. Bring in the concept of the service provider and the service user, and whose problem we are solving (NEC discussion).
  7. Therefore, in summary, the major challenges for an OLTP database in the cloud are: supporting transactions and scale-out while minimizing the number of distributed transactions, supporting lightweight elastic scaling in a live system, and providing autonomic control with intelligence similar to a human controller.
  8. Stress the ACID properties of transactions and how applications benefit from them through a simpler design.
  10. Therefore, if we consider scale-out as the vertical axis and functionality (or support for transactions) as the horizontal axis, at one extreme are the RDBMSs, which support rich functionality but are hard to scale out, and at the other extreme are Key-Value stores, which allow scaling out to thousands of servers but support limited functionality. There exists a big chasm between the two types of systems, and the challenge is to bridge this divide by efficiently supporting transactions while scaling out. Cloud platforms are multitenant and must support a variety of applications with varying needs. Therefore, bridging this chasm is important to support a variety of applications. Functionality: whether transactions are a subset.
  11. In addition, when such a database is deployed on an elastic pay-per-use cloud infrastructure that allows for on-demand provisioning rather than static provisioning for the peak load, the challenge is to make the database layer as elastic as the underlying cloud infrastructure without introducing much overhead. Scale vs. elasticity.
  12. To this end, my dissertation makes the following contributions to address these challenges. We propose two different solutions to support transactions at scale for two different application scenarios: ElasTraS allows for elastically scalable transaction execution in databases where partitions are statically defined, while G-Store allows efficient transaction processing where database partitions are dynamically defined. Supporting lightweight elasticity essentially boils down to lightweight migration of database partitions in a live system. To this end, we propose two different techniques for two common database architectures: Albatross is a technique for live migration in databases that use the storage abstraction, while Zephyr is a technique for live migration in shared nothing database architectures. Finally, we are currently working on the design of Pythia, an autonomic controller. In the interest of time, in this talk I will only get into the details of G-Store and Zephyr while providing a very high level overview of ElasTraS.
  17. But before we delve into the details, I would like to spend a couple of minutes giving an overview of my research in the broader area of data management. The current talk, and my thesis, focuses on the OLTP aspect. On the data analysis front, I have worked on multiple projects. As an intern at IBM Almaden, I worked on a project called Ricardo that provides the ability for deep statistical analysis and modeling over large amounts of data. This paper was published in SIGMOD 2010, and parts of the framework ship in IBM InfoSphere BigInsights Enterprise edition. Recently, I worked on a project called MD-HBase that presents the design and implementation of a scalable multi-dimensional indexing mechanism to support efficient high throughput location updates and multi-dimensional analysis queries on top of a Key-value store. Earlier, I also worked on data stream processing systems providing intra-operator parallelism in common data stream operators, such as frequent elements or top-k elements, to efficiently exploit multicore processors. I have also worked on designing systems to exploit novel hardware architectures.
  20-23. The goal of partitioning the schema is to leverage the application semantics and access patterns to minimize the number of distributed transactions.
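
The effect of the partitioning goal stated above can be illustrated with a small sketch. Everything in it is an assumption made for illustration: the `(customer_id, order_id)` row layout, the modulo placement in `partition_of`, and the helper `count_distributed` are hypothetical and are not taken from the talk. It only counts how many transactions span more than one partition under two placement choices.

```python
def partition_of(key, num_partitions):
    # Hypothetical placement rule: integer keys are spread by modulo.
    return key % num_partitions


def count_distributed(transactions, placement_key, num_partitions):
    """Count transactions whose rows land on more than one partition."""
    distributed = 0
    for rows in transactions:
        partitions = {
            partition_of(placement_key(row), num_partitions) for row in rows
        }
        if len(partitions) > 1:
            distributed += 1
    return distributed


# Each row is (customer_id, order_id). In this made-up workload every
# transaction touches the rows of a single customer.
transactions = [
    [(1, 10), (1, 11), (1, 12)],
    [(2, 20), (2, 21)],
    [(3, 30), (3, 31), (3, 32)],
]

# Placing rows by their own order id ignores the access pattern.
by_order = count_distributed(transactions, lambda row: row[1], 4)

# Placing rows by the customer they belong to follows the access pattern.
by_customer = count_distributed(transactions, lambda row: row[0], 4)

print(by_order, by_customer)
```

With this workload, placing rows by customer keeps every transaction on one partition, while placing them by order id makes every transaction distributed.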
  24-25. Now we know how to scale out when the partitions are statically defined. So let's make it a bit more interesting: how do we scale out with transactions on dynamically formed partitions? Recall that a partition, in our sense, is the set of data items that are frequently accessed within the same transaction. For certain applications, that set might change with time. For instance, in online multi-player games, the application needs transactional access to the player profiles that are part of the same game instance, and this set changes with time. Similar behavior is observed in a number of collaboration-based applications (examples?).
  26-27. If the player profiles are part of the same database partition, then transactions on this group of players can be executed efficiently.
  28-29. However, this group of players changes with time, which leads to the concept of dynamically defined database partitions.
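
As a rough illustration of dynamically defined partitions, the sketch below groups player profiles for the lifetime of a game instance and releases them afterwards. It is a single-process toy under stated assumptions: the class `GroupManager` and its methods `form_group`, `transact`, and `dissolve` are hypothetical names, and the sketch does not show how ownership of keys would actually be transferred between nodes in a distributed store.

```python
class GroupManager:
    """Toy model: a key belongs to at most one group at a time."""

    def __init__(self, store):
        self.store = store    # key -> value, e.g. player profiles
        self.group_of = {}    # key -> id of the group that currently owns it
        self.members = {}     # group id -> set of keys in that group

    def form_group(self, group_id, keys):
        keys = set(keys)
        if group_id in self.members:
            raise ValueError(f"group {group_id!r} already exists")
        busy = sorted(k for k in keys if k in self.group_of)
        if busy:
            raise ValueError(f"keys already grouped: {busy}")
        for key in keys:
            self.group_of[key] = group_id
        self.members[group_id] = keys

    def transact(self, group_id, updates):
        # Only keys inside the group may be written by its transactions.
        outside = set(updates) - self.members[group_id]
        if outside:
            raise KeyError(f"keys outside the group: {sorted(outside)}")
        # Single-threaded toy: the update is applied in one step.
        self.store.update(updates)

    def dissolve(self, group_id):
        for key in self.members.pop(group_id):
            del self.group_of[key]


store = {"alice": 100, "bob": 100, "carol": 100}
gm = GroupManager(store)

gm.form_group("game-1", ["alice", "bob"])         # game instance starts
gm.transact("game-1", {"alice": 90, "bob": 110})  # local to the group
gm.dissolve("game-1")                             # game instance ends

gm.form_group("game-2", ["bob", "carol"])         # membership has changed
```

The point of the sketch is only that group membership is created and destroyed at run time, so the set of keys accessed together is not fixed by the schema.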
  30. Scale.
  31-32. The paper has a more detailed evaluation.
  33-39. So what does elasticity in the database tier mean? Mention the cost-performance trade-off, and repeat that it is the cloud infrastructure that allows us to optimize operating cost, something that was not considered important in classical infrastructures.
  40. Define wireframe in this slide. Defer index wireframe definition to the later slide.
  41. Freeze: no structural modifications to the indices. Wireframe: the minimal information needed to start executing transactions at the destination, such as schema information, user authentication, the index wireframes, etc.
  42-43. Just to give a concrete example of a wireframe: if we consider a B+ tree index, then only the internal nodes of the index are migrated as part of the wireframe.
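
The B+ tree example above can be sketched as follows. This is a minimal illustration, not the migration code from the talk: the classes `Leaf`, `Internal`, and `LeafStub` and the function `wireframe` are hypothetical, and how the destination later obtains the leaf pages is outside the sketch.

```python
from dataclasses import dataclass


@dataclass
class Leaf:
    page_id: int
    rows: dict       # key -> row data stored in this leaf page


@dataclass
class LeafStub:
    page_id: int     # placeholder: the page contents stay at the source


@dataclass
class Internal:
    keys: list       # separator keys
    children: list   # Internal, Leaf, or LeafStub nodes


def wireframe(node):
    """Copy only the internal structure; replace each leaf with a stub."""
    if isinstance(node, Leaf):
        return LeafStub(node.page_id)
    return Internal(list(node.keys), [wireframe(c) for c in node.children])


index = Internal(
    keys=[50],
    children=[
        Leaf(page_id=1, rows={10: "a", 20: "b"}),
        Leaf(page_id=2, rows={50: "c"}),
    ],
)

wf = wireframe(index)
print(wf)
```

The copy keeps the separator keys and the shape of the tree, so the destination can tell which page a key would live on, but it carries none of the row data.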
  44-47. Once the destination is initialized with the minimal information, it can start executing transactions. At this point, migration enters the Dual mode, where both the source and the destination are executing transactions: new transactions arrive at the destination, while the source continues executing the transactions that were active at the start of migration.
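
The routing rule of the Dual mode described above can be sketched as a small bookkeeping class. This is a toy under stated assumptions: `DualModeRouter` and its methods are hypothetical names, transactions are represented only by their ids, and the sketch says nothing about how the two nodes coordinate access to the data or what happens once the source has no active transactions left.

```python
class DualModeRouter:
    """Tracks where each transaction runs while migration is in Dual mode."""

    def __init__(self, active_at_source):
        # Transactions that were active when migration started stay put.
        self.at_source = set(active_at_source)
        self.at_destination = set()

    def begin(self, txn_id):
        # Every new transaction is sent to the destination.
        self.at_destination.add(txn_id)
        return "destination"

    def node_for(self, txn_id):
        if txn_id in self.at_source:
            return "source"
        if txn_id in self.at_destination:
            return "destination"
        raise KeyError(f"unknown transaction {txn_id!r}")

    def complete(self, txn_id):
        self.at_source.discard(txn_id)
        self.at_destination.discard(txn_id)

    def source_drained(self):
        # True once every transaction that predates migration has finished.
        return not self.at_source


router = DualModeRouter(active_at_source={"t1", "t2"})
placed = router.begin("t3")        # arrives during Dual mode
where_t1 = router.node_for("t1")   # was active before migration began

router.complete("t1")
router.complete("t2")
drained = router.source_drained()
```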
  48. Make the future more specific.