Cloud computing has emerged as a multi-billion dollar industry and as a successful paradigm for web application deployment. Economies of scale, elasticity, and pay-per-use pricing are the biggest promises of the cloud. Database management systems (DBMSs) serving these web applications form a critical component of the cloud software stack. These DBMSs must be able to scale out to clusters of commodity servers to serve thousands of applications and their huge volumes of data. Moreover, to minimize operating costs, such DBMSs must also be elastic, i.e., possess the ability to increase and decrease the cluster size in a live system. This is in addition to serving a variety of applications (i.e., supporting multitenancy) while being self-managing, fault-tolerant, and highly available.
The overarching goal of my dissertation is to propose abstractions, protocols, and paradigms for designing scalable and elastic database management systems that address the unique set of challenges posed by the cloud. My dissertation shows that with a careful choice of design and features, it is possible to architect scalable DBMSs that efficiently support transactional semantics to ease application design and elastically adapt to fluctuating operational demands to optimize the operating cost. In this talk, I will outline my work that embodies this principle. In the first part, I will present techniques and system architectures to enable efficient and scalable transaction processing on clusters of commodity servers. In the second part, I will present techniques for on-demand database migration in a live system, a primitive operation critical to supporting lightweight elasticity as a first-class feature in DBMSs. I will conclude the talk with a discussion of possible future directions.
Scalable and Elastic Transactional Data Stores for Cloud Computing Platforms
1. PhD Defense
Scalable and Elastic
Transactional Data Stores for
Cloud Computing Platforms
Sudipto Das
Computer Science, UC Santa Barbara
sudipto@cs.ucsb.edu
Committee:
Divy Agrawal (co-chair), Amr El Abbadi (co-chair),
Phil Bernstein, Tim Sherwood
Sponsors:
5. Cloud computing
Computing infrastructure and solutions delivered as a service
◦ Industry worth USD 150 billion by 2014*
Contributors to success
◦ Economies of scale
◦ Elasticity and pay-per-use pricing
Popular paradigms
◦ Infrastructure as a Service (IaaS)
◦ Platform as a Service (PaaS)
◦ Software as a Service (SaaS)
*http://www.crn.com/news/channel-programs/225700984/cloud-computing-services-market-to-near-150-billion-in-2014.htm
Sudipto Das {sudipto@cs.ucsb.edu} 5
6. Databases for cloud platforms
Data is central to applications
DBMSs are a mission-critical component in the cloud software stack
◦ Manage petabytes of data, drive revenue
◦ Serve a variety of applications (multitenancy)
Data needs for cloud applications
◦ OLTP systems: store and serve data
◦ Data analysis systems: decision support, intelligence
Sudipto Das {sudipto@cs.ucsb.edu} 6
7. Databases for cloud platforms
Data is central to applications
DBMSs are a mission-critical component in the cloud software stack
◦ Manage petabytes of data, drive revenue
◦ Serve a variety of applications (multitenancy)
Data needs for cloud applications
◦ OLTP systems: store and serve data
◦ Data analysis systems: decision support, intelligence
Sudipto Das {sudipto@cs.ucsb.edu} 7
8. Application landscape
Figure: the application landscape; social gaming, rich content and mash-ups, and managed applications, all built on cloud application platforms.
Sudipto Das {sudipto@cs.ucsb.edu} 8
9. Challenges for OLTP systems
Scalability
◦ While ensuring efficient transaction execution!
Lightweight Elasticity
◦ Scale on-demand!
Sudipto Das {sudipto@cs.ucsb.edu} 9
10. Two approaches to scalability
Scale-up
◦ Preferred in classical enterprise setting (RDBMS)
◦ Flexible ACID transactions
◦ Transactions access a single node
Sudipto Das {sudipto@cs.ucsb.edu} 10
11. Two approaches to scalability
Scale-up
◦ Preferred in classical enterprise setting (RDBMS)
◦ Flexible ACID transactions
◦ Transactions access a single node
Scale-out
◦ Cloud friendly (Key-value stores)
◦ Execution at a single server
Limited functionality & guarantees
◦ No multi-row or multi-step transactions
Sudipto Das {sudipto@cs.ucsb.edu} 11
12. Why care about transactions?
confirm_friend_request(user1, user2)
{
begin_transaction();
update_friend_list(user1, user2, status.confirmed);
update_friend_list(user2, user1, status.confirmed);
end_transaction();
}
Sudipto Das {sudipto@cs.ucsb.edu} 12
13. Why care about transactions?
confirm_friend_request(user1, user2)
{
begin_transaction();
update_friend_list(user1, user2, status.confirmed);
update_friend_list(user2, user1, status.confirmed);
end_transaction();
}
Simplicity in application design with ACID transactions
Sudipto Das {sudipto@cs.ucsb.edu} 13
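To make the simplicity concrete, here is a minimal JDBC sketch of the confirm_friend_request logic shown above. The table and column names (friend_list, user_id, friend_id, status) are hypothetical and this is only an illustrative sketch, not code from any of the systems in this talk; the point is that both updates commit or roll back together.

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class FriendRequests {
    // Both updates succeed or neither does; the application never sees a half-confirmed friendship.
    public static void confirmFriendRequest(Connection conn, long user1, long user2) throws SQLException {
        conn.setAutoCommit(false);                       // begin_transaction()
        try (PreparedStatement stmt = conn.prepareStatement(
                "UPDATE friend_list SET status = 'confirmed' WHERE user_id = ? AND friend_id = ?")) {
            stmt.setLong(1, user1);
            stmt.setLong(2, user2);
            stmt.executeUpdate();                        // update_friend_list(user1, user2, confirmed)
            stmt.setLong(1, user2);
            stmt.setLong(2, user1);
            stmt.executeUpdate();                        // update_friend_list(user2, user1, confirmed)
            conn.commit();                               // end_transaction()
        } catch (SQLException e) {
            conn.rollback();                             // atomicity: undo any partial update
            throw e;
        }
    }
}

Without the transaction, the application itself would have to detect and repair a half-confirmed friendship after a failure between the two updates.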
16. Challenge: Transactions at Scale
Figure: key-value stores offer scale-out; RDBMSs offer ACID transactions; the challenge is bridging the chasm between the two.
Sudipto Das {sudipto@cs.ucsb.edu} 16
17. Challenge: Lightweight Elasticity
Provisioning on-demand and not for peak
Optimize operating cost!
Figure: resources and demand over time. Traditional infrastructures provision capacity for peak demand, leaving unused resources; deployment in the cloud scales capacity with demand.
Slide Credits: Berkeley RAD Lab
Sudipto Das {sudipto@cs.ucsb.edu} 17
19. Contributions for OLTP systems
It is possible to architect scalable DBMSs that efficiently support transactional semantics to ease application design and elastically adapt to fluctuating operational demands to optimize the operating cost.
Sudipto Das {sudipto@cs.ucsb.edu} 19
20. Contributions for OLTP systems
It is possible to architect scalable DBMSs that efficiently support transactional semantics to ease application design and elastically adapt to fluctuating operational demands to optimize the operating cost.
Transactions at Scale
◦ ElasTraS [HotCloud 2009, UCSB TR 2010]
◦ G-Store [SoCC 2010]
Sudipto Das {sudipto@cs.ucsb.edu} 20
21. Contributions for OLTP systems
It is possible to architect scalable DBMSs that efficiently support transactional semantics to ease application design and elastically adapt to fluctuating operational demands to optimize the operating cost.
Transactions at Scale
◦ ElasTraS [HotCloud 2009, UCSB TR 2010]
◦ G-Store [SoCC 2010]
Lightweight Elasticity
◦ Albatross [VLDB 2011]
◦ Zephyr [SIGMOD 2011]
Sudipto Das {sudipto@cs.ucsb.edu} 21
22. Contributions for OLTP systems
It is possible to architect scalable DBMSs that efficiently support transactional semantics to ease application design and elastically adapt to fluctuating operational demands to optimize the operating cost.
Transactions at Scale
◦ ElasTraS [HotCloud 2009, UCSB TR 2010]
◦ G-Store [SoCC 2010]
Lightweight Elasticity
◦ Albatross [VLDB 2011]
◦ Zephyr [SIGMOD 2011]
Sudipto Das {sudipto@cs.ucsb.edu} 22
24. Contributions
Data Management
Analytics
◦ Ricardo [SIGMOD ‘10]
◦ MD-HBase [MDM ‘11], Best Paper Runner up
◦ Anonimos [ICDE ‘10], [TKDE]
Transaction Processing (the dissertation)
◦ Dynamic partitioning: G-Store [SoCC ‘10]
◦ Static partitioning: ElasTraS [HotCloud ‘09], [TR ‘10]
◦ Albatross [VLDB ‘11], Zephyr [SIGMOD ‘11]
Sudipto Das {sudipto@cs.ucsb.edu} 24
25. Contributions
Data Management
Analytics
◦ Ricardo [SIGMOD ‘10]
◦ MD-HBase [MDM ‘11], Best Paper Runner up
◦ Anonimos [ICDE ‘10], [TKDE]
Transaction Processing (the dissertation)
◦ Dynamic partitioning: G-Store [SoCC ‘10]
◦ Static partitioning: ElasTraS [HotCloud ‘09], [TR ‘10]
◦ Albatross [VLDB ‘11], Zephyr [SIGMOD ‘11]
Novel Architectures
◦ Hyder [CIDR ‘11], Best Paper
◦ CoTS [ICDE ‘09], [VLDB ‘09]
◦ TCAM [DaMoN ‘08]
Sudipto Das {sudipto@cs.ucsb.edu} 25
26. Transactions at Scale
Figure: key-value stores offer scale-out; RDBMSs offer ACID transactions; the challenge is bridging the chasm between the two.
Sudipto Das {sudipto@cs.ucsb.edu} 26
27. Scale-out with static partitioning
Table level partitioning (range, hash)
◦ Distributed transactions
Partitioning the Database schema
◦ Co-locate data items accessed together
◦ Goal: Minimize distributed transactions
Sudipto Das {sudipto@cs.ucsb.edu} 27
28. Scale-out with static partitioning
Table level partitioning (range, hash)
◦ Distributed transactions
Partitioning the Database schema
◦ Co-locate data items accessed together
◦ Goal: Minimize distributed transactions
Sudipto Das {sudipto@cs.ucsb.edu} 28
29. Scale-out with static partitioning
Table level partitioning (range, hash)
◦ Distributed transactions
Partitioning the Database schema
◦ Co-locate data items accessed together
◦ Goal: Minimize distributed transactions
Scaling-out with static partitioning
◦ ElasTraS [HotCloud 2009, TR 2010]
Sudipto Das {sudipto@cs.ucsb.edu} 29
30. Scale-out with static partitioning
Table level partitioning (range, hash)
◦ Distributed transactions
Partitioning the Database schema
◦ Co-locate data items accessed together
◦ Goal: Minimize distributed transactions
Scaling-out with static partitioning
◦ ElasTraS [HotCloud 2009, TR 2010]
◦ Cloud SQL Server [ICDE 2011]
◦ MegaStore [CIDR 2011]
◦ RelationalCloud [CIDR 2011]
Sudipto Das {sudipto@cs.ucsb.edu} 30
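As a rough illustration of how schema-level partitioning co-locates related rows, the sketch below routes every row to a partition by the root key of its schema tree (for example the owning user or tenant) instead of by its own primary key, so that a transaction over that tree touches a single partition. The class and key names are hypothetical assumptions, not taken from ElasTraS or the other systems listed above.

import java.util.Objects;

public class SchemaPartitioner {
    private final int numPartitions;

    public SchemaPartitioner(int numPartitions) { this.numPartitions = numPartitions; }

    // All rows of one schema tree (a user row, her orders, her order lines, ...)
    // carry the same root key, so a transaction over that tree maps to one partition.
    public int partitionFor(String rootKey) {
        return Math.floorMod(Objects.hashCode(rootKey), numPartitions);
    }

    public static void main(String[] args) {
        SchemaPartitioner p = new SchemaPartitioner(16);
        // The user row and any dependent rows keyed by the same root land together.
        System.out.println(p.partitionFor("user:1042"));   // partition of the user row
        System.out.println(p.partitionFor("user:1042"));   // same partition for the user's order rows
    }
}

Range partitioning on the root key would work equally well; the essential design choice is partitioning by the root of the schema tree rather than by each table independently, which is what keeps most transactions local to one node.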
31. Dynamically formed partitions
Access patterns change, often rapidly
◦ Online multi-player gaming applications
◦ Collaboration based applications
◦ Scientific computing applications
Not amenable to static partitioning
Sudipto Das {sudipto@cs.ucsb.edu} 31
32. Dynamically formed partitions
Access patterns change, often rapidly
◦ Online multi-player gaming applications
◦ Collaboration based applications
◦ Scientific computing applications
Not amenable to static partitioning
How to get the benefit of partitioning
when accesses do not statically partition?
◦ Ours is the first solution to allow that
Sudipto Das {sudipto@cs.ucsb.edu} 32
38. Online Multi-player Games
Hundreds of thousands
of concurrent groups
Sudipto Das {sudipto@cs.ucsb.edu} 38
39. Data Fusion for dynamic partitions
[G-Store, SoCC 2010]
Transactional access to a group of data
items formed on-demand
Challenge: Avoid distributed transactions!
Sudipto Das {sudipto@cs.ucsb.edu} 39
40. Data Fusion for dynamic partitions
[G-Store, SoCC 2010]
Transactional access to a group of data
items formed on-demand
Challenge: Avoid distributed transactions!
Key Group Abstraction
◦ Groups are small
◦ Groups have non-trivial lifetime
◦ Groups are dynamic and on-demand
Groups are dynamically formed tenant
databases
Sudipto Das {sudipto@cs.ucsb.edu} 40
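One possible client-facing shape of the Key Group abstraction is sketched below: a group is formed on demand over an arbitrary set of keys, accessed transactionally for its (non-trivial) lifetime, and then dissolved. The interface and method names are illustrative assumptions, not G-Store's actual API.

import java.util.Set;

// Hypothetical interface: what an application sees when using the Key Group abstraction.
public interface KeyGroupStore {

    // Form a group over the given keys; ownership of the keys is transferred to one leader node.
    String createGroup(Set<String> keys);

    // Execute a multi-key ACID transaction, restricted to keys inside a single group.
    <T> T execute(String groupId, GroupTransaction<T> txn);

    // Dissolve the group and return key ownership to the followers.
    void deleteGroup(String groupId);

    interface GroupTransaction<T> {
        T run(GroupSnapshot snapshot);
    }

    interface GroupSnapshot {
        byte[] read(String key);
        void write(String key, byte[] value);
    }
}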
41. Transactions on Groups
Without distributed transactions
One key selected as the leader
Sudipto Das {sudipto@cs.ucsb.edu} 41
42. Transactions on Groups
Without distributed transactions
One key selected as the leader
Followers transfer ownership of keys to leader
Sudipto Das {sudipto@cs.ucsb.edu} 42
43. Transactions on Groups
Without distributed transactions
Figure: the keys of a Key Group, with ownership of all keys consolidated at a single node.
One key selected as the leader
Followers transfer ownership of keys to leader
Sudipto Das {sudipto@cs.ucsb.edu} 43
44. Transactions on Groups
Without distributed transactions
Grouping Protocol
Figure: the keys of a Key Group, with ownership of all keys consolidated at a single node.
One key selected as the leader
Followers transfer ownership of keys to leader
Sudipto Das {sudipto@cs.ucsb.edu} 44
45. Why is group formation hard?
Guarantee the contract between
leaders and followers in the presence of:
◦ Leader and follower failures
◦ Lost, duplicated, or re-ordered messages
◦ Dynamics of the underlying system
How to ensure efficient and ACID
execution of transactions?
Sudipto Das {sudipto@cs.ucsb.edu} 45
46. Grouping protocol
Figure: grouping protocol timeline. On a create request, the leader logs L(Creating) and sends a join request J to the followers; each follower logs L(Joining) and replies JA; the leader logs L(Joined) and acknowledges with JAA, after which the follower logs L(Joined).
Sudipto Das {sudipto@cs.ucsb.edu} 46
47. Grouping protocol
Figure: the same protocol timeline; once the group is formed, group operations execute at the leader.
Sudipto Das {sudipto@cs.ucsb.edu} 47
48. Grouping protocol
Figure: the full protocol timeline including deletion. A delete request causes the leader to log L(Deleting) and send D to the followers; each follower logs L(Free) and replies DA; the leader then logs L(Deleted).
Sudipto Das {sudipto@cs.ucsb.edu} 48
49. Grouping protocol
Figure: the same timeline, with the L(…) state transitions highlighted as log entries.
Sudipto Das {sudipto@cs.ucsb.edu} 49
50. Grouping protocol
Figure: the same timeline, with the L(…) state transitions highlighted as log entries.
Conceptually akin to “locking”
◦ Locks held by groups
Sudipto Das {sudipto@cs.ucsb.edu} 50
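The figures on the preceding slides can be compressed into the leader-side sketch below, under the assumption that every state transition is written to the log before the corresponding message is sent, so a recovering node can re-derive where the protocol stood. The message and state names (J, JA, JAA, D, DA, L(...)) follow the slides; the code structure itself is hypothetical, not G-Store's implementation.

// Leader-side view of the grouping protocol; followers run the mirror image.
public class GroupingProtocolLeader {

    enum State { CREATING, JOINED, DELETING, DELETED }

    private State state = State.CREATING;
    private final Log log;                 // write-ahead log used to recover protocol state
    private final Messenger messenger;     // delivers J, JA, JAA, D, DA (may lose, duplicate, or reorder them)

    public GroupingProtocolLeader(Log log, Messenger messenger) {
        this.log = log;
        this.messenger = messenger;
    }

    // Create request: log the intent, then ask each follower to join (message J).
    public void createGroup(Iterable<String> followers) {
        log.append("L(Creating)");
        for (String f : followers) messenger.send(f, "J");
    }

    // Join-ack (JA) from a follower: the follower has logged L(Joining) and yielded ownership.
    public void onJoinAck(String follower) {
        log.append("L(Joined):" + follower);
        state = State.JOINED;
        messenger.send(follower, "JAA");   // acknowledge the ack; safe to resend if JA is duplicated
    }

    // Delete request: symmetric teardown, returning key ownership to the followers.
    public void deleteGroup(Iterable<String> followers) {
        log.append("L(Deleting)");
        state = State.DELETING;
        for (String f : followers) messenger.send(f, "D");
    }

    // Delete-ack (DA) from a follower: that follower has logged L(Free) again.
    public void onDeleteAck(String follower) {
        log.append("L(Deleted):" + follower);
        state = State.DELETED;
    }

    public State state() { return state; }

    interface Log { void append(String entry); }
    interface Messenger { void send(String node, String message); }
}

Because messages may be lost, duplicated, or re-ordered, the handlers are written so that replaying an acknowledgement only re-appends a log record and resends a message, matching the locking analogy on the slide above: the group effectively holds a lock on its keys until the delete phase releases it.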
51. Efficient transaction processing
How does the leader execute transactions?
◦ Caches data for group members; underlying data store equivalent to a disk
◦ Transaction logging for durability
◦ Cache asynchronously flushed to propagate updates
◦ Guaranteed update propagation
Figure: the leader runs a transaction manager, a log, and a cache manager; updates propagate asynchronously to the followers.
Sudipto Das {sudipto@cs.ucsb.edu} 51
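A minimal sketch of that write path follows: the leader logs a commit record for durability, applies the writes to its cache, and a background task propagates dirty entries to the underlying key-value store, which plays the role of the disk. All class names are hypothetical; G-Store's actual implementation differs.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Write path of a group leader: log for durability, apply to the cache,
// propagate to the underlying key-value store asynchronously.
public class GroupLeader {

    interface WriteAheadLog { void appendAndFlush(String txnRecord); }
    interface KeyValueStore { void put(String key, byte[] value); }

    private final WriteAheadLog log;
    private final KeyValueStore store;                        // the "disk": followers' storage layer
    private final Map<String, byte[]> cache = new ConcurrentHashMap<>();
    private final Map<String, byte[]> dirty = new ConcurrentHashMap<>();
    private final ScheduledExecutorService flusher = Executors.newSingleThreadScheduledExecutor();

    public GroupLeader(WriteAheadLog log, KeyValueStore store) {
        this.log = log;
        this.store = store;
        // Asynchronous update propagation back to the key-value store.
        flusher.scheduleWithFixedDelay(this::flushDirtyEntries, 1, 1, TimeUnit.SECONDS);
    }

    // Commit a transaction's writes: durable once the log record is flushed.
    public synchronized void commit(String txnId, Map<String, byte[]> writes) {
        log.appendAndFlush("COMMIT " + txnId + " " + writes.keySet());   // durability
        cache.putAll(writes);                                            // visible to later transactions
        dirty.putAll(writes);                                            // queued for propagation
    }

    public byte[] read(String key) { return cache.get(key); }

    private void flushDirtyEntries() {
        for (Map.Entry<String, byte[]> e : dirty.entrySet()) {
            store.put(e.getKey(), e.getValue());                         // guaranteed update propagation
            dirty.remove(e.getKey(), e.getValue());                      // keep newer writes queued
        }
    }
}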
52. Prototype: G-Store [SoCC 2010]
An implementation over Key-value stores
Figure: application clients issue transactional multi-key accesses to G-Store; each node runs a grouping layer and a transaction manager on top of the key-value store logic, over a shared distributed storage layer.
Sudipto Das {sudipto@cs.ucsb.edu} 52
53. Prototype: G-Store [SoCC 2010]
An implementation over Key-value stores
Grouping middleware layer resident on top of a key-value store
Figure: application clients issue transactional multi-key accesses to G-Store; each node runs a grouping layer and a transaction manager on top of the key-value store logic, over a shared distributed storage layer.
Sudipto Das {sudipto@cs.ucsb.edu} 53
54. G-Store Evaluation
Implemented using HBase
◦ Added the middleware layer
◦ ~10000 LOC
Experiments in Amazon EC2
Benchmark: An online multi-player game
Cluster size: 10 nodes
Data size: ~1 billion rows (>1 TB)
Sudipto Das {sudipto@cs.ucsb.edu} 54
55. G-Store Evaluation
Implemented using HBase
◦ Added the middleware layer
◦ ~10000 LOC
Experiments in Amazon EC2
Benchmark: An online multi-player game
Cluster size: 10 nodes
Data size: ~1 billion rows (>1 TB)
For groups with 100 keys
◦ Group creation latency: ~10 – 100ms
◦ More than 10,000 groups concurrently created
Sudipto Das {sudipto@cs.ucsb.edu} 55
56. G-Store Evaluation
Figures: group creation latency and group creation throughput.
Sudipto Das {sudipto@cs.ucsb.edu} 56
57. Lightweight Elasticity
Provisioning on-demand and not for peak
Optimize operating cost!
Figure: resources and demand over time. Traditional infrastructures provision capacity for peak demand, leaving unused resources; deployment in the cloud scales capacity with demand.
Slide Credits: Berkeley RAD Lab
Sudipto Das {sudipto@cs.ucsb.edu} 57
58. Elasticity in the Database tier
Figure: a load balancer in front of the application/web/caching tier and the database tier.
Sudipto Das {sudipto@cs.ucsb.edu} 58
65. Live database migration
Migrate a database partition (or tenant)
in a live system
◦ Optimize operating cost
◦ Resource orchestration in multitenant
systems
Sudipto Das {sudipto@cs.ucsb.edu} 65
66. Live database migration
Migrate a database partition (or tenant)
in a live system
◦ Optimize operating cost
◦ Resource orchestration in multitenant
systems
Different from
◦ Migration between software versions
◦ Migration in case of schema evolution
Sudipto Das {sudipto@cs.ucsb.edu} 66
67. VM migration for DB elasticity
One DB partition per VM
◦ Pros: allows fine-grained load balancing
◦ Cons: performance overhead; poor consolidation ratio [Curino et al., CIDR 2011]
Figure: one VM per DB partition on a hypervisor.
Sudipto Das {sudipto@cs.ucsb.edu} 67
68. VM migration for DB elasticity
One DB partition per VM
◦ Pros: allows fine-grained load balancing
◦ Cons: performance overhead; poor consolidation ratio [Curino et al., CIDR 2011]
Multiple DB partitions in a VM
◦ Pros: good performance
◦ Cons: migrate all partitions; coarse-grained load balancing
Figure: one VM per partition versus multiple partitions in one VM, each on a hypervisor.
Sudipto Das {sudipto@cs.ucsb.edu} 68
69. Live database migration
Multiple partitions share the same database process
◦ Shared process multitenancy
Migrate individual partitions on-demand in a live system
◦ Virtualization in the database tier
Straightforward solution
◦ Stop serving partition at the source
◦ Copy to destination
◦ Start serving at the destination
◦ Expensive!
Sudipto Das {sudipto@cs.ucsb.edu} 69
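For contrast with the live migration techniques that follow, the sketch below spells out the straightforward stop-and-copy solution from the slide above; the partition stays unavailable for the entire copy, which is exactly why it is expensive. The interface and names are hypothetical.

public class StopAndCopyMigration {

    interface PartitionServer {
        void stopServing(String partitionId);               // all requests fail from here on
        Iterable<byte[]> readAllPages(String partitionId);  // full persistent image, possibly gigabytes
        void writePages(String partitionId, Iterable<byte[]> pages);
        void startServing(String partitionId);
    }

    // Downtime lasts for the whole copy, so unavailability grows with the size of the partition.
    public static void migrate(String partitionId, PartitionServer source, PartitionServer destination) {
        source.stopServing(partitionId);                     // partition unavailable from this point...
        destination.writePages(partitionId, source.readAllPages(partitionId));
        destination.startServing(partitionId);               // ...until this point
    }
}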
70. Migration cost measures
Service unavailability
◦ Time the partition is unavailable
Number of failed requests
◦ Number of operations failing or transactions aborting
Performance overhead
◦ Impact on response times
Additional data transferred
Sudipto Das {sudipto@cs.ucsb.edu} 70
71. Two common DBMS architectures
Decoupled storage
architectures
◦ ElasTraS, G-Store, Deuteronomy,
MegaStore
◦ Persistent data is not migrated
◦ Albatross [VLDB 2011]
Shared nothing architectures
◦ SQL Azure, Relational Cloud,
MySQL Cluster
◦ Migrate persistent data
◦ Zephyr [SIGMOD 2011]
Sudipto Das {sudipto@cs.ucsb.edu} 71
72. Two common DBMS architectures
Decoupled storage
architectures
◦ ElasTraS, G-Store, Deuteronomy,
MegaStore
◦ Persistent data is not migrated
◦ Albatross [VLDB 2011]
Shared nothing architectures
◦ SQL Azure, Relational Cloud,
MySQL Cluster
◦ Migrate persistent data
◦ Zephyr [SIGMOD 2011]
Sudipto Das {sudipto@cs.ucsb.edu} 72
73. Why is live DB migration hard?
Persistent DB image must be migrated (GBs)
◦ How to ensure no downtime?
Nodes can fail during migration
◦ How to guarantee correctness during
failures?
Transaction atomicity and durability.
Recover migration state after failure.
Transactions execute during migration
◦ How to guarantee serializability?
Transaction correctness equivalent to normal operation
Sudipto Das {sudipto@cs.ucsb.edu} 73
74. Our approach: Zephyr
[SIGMOD 2011]
Migration executed in phases
◦ Starts with transfer of minimal information to destination (“wireframe”)
Database pages used as granule of migration
◦ Unique page ownership
Source and destination concurrently execute transactions in one migration phase
Minimal transaction synchronization
Guaranteed serializability
Logging and handshaking protocols
Sudipto Das {sudipto@cs.ucsb.edu} 74
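The phases can be summarized as a small state machine, sketched below with the mode names used in the talk (Init, Dual, Finish); the transition comments paraphrase the slides that follow, and everything else is an illustrative assumption rather than Zephyr's actual code.

// Zephyr-style migration phases, as described on the slide above.
public class MigrationPhases {

    enum Mode { NORMAL, INIT, DUAL, FINISH }

    private Mode mode = Mode.NORMAL;

    // Init mode: freeze the index structure and ship the wireframe (internal index
    // nodes, schema, metadata) to the destination; no database pages move yet.
    public void startMigration() { advance(Mode.NORMAL, Mode.INIT); }

    // Dual mode: both nodes execute transactions. New transactions start at the
    // destination, transactions active at the start of migration finish at the source;
    // pages move on demand, each owned by exactly one node at a time.
    public void enterDualMode() { advance(Mode.INIT, Mode.DUAL); }

    // Finish mode: the source stops accepting work and pushes the remaining pages.
    public void enterFinishMode() { advance(Mode.DUAL, Mode.FINISH); }

    // Normal operation at the destination: wireframe unfrozen, all pages owned there.
    public void completeMigration() { advance(Mode.FINISH, Mode.NORMAL); }

    public Mode mode() { return mode; }

    private void advance(Mode expected, Mode next) {
        if (mode != expected) {
            throw new IllegalStateException("cannot enter " + next + " from " + mode);
        }
        mode = next;
    }
}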
75. Simplifying assumptions
For this talk
◦ Transactions access a single partition
◦ No replication
◦ No structural changes to indices
Extensions in the paper [SIGMOD 2011]
◦ Relaxes these assumptions
Sudipto Das {sudipto@cs.ucsb.edu} 75
76. Design overview
Figure: the source owns all database pages P1 … Pn and runs the active transactions TS1, …, TSk; the destination holds nothing yet.
Sudipto Das {sudipto@cs.ucsb.edu} 76
77. Init mode
Freeze indices and migrate wireframe
Figure: after the wireframe transfer, the destination holds un-owned copies of pages P1 … Pn; the source still owns all pages and continues executing the active transactions TS1, …, TSk.
Sudipto Das {sudipto@cs.ucsb.edu} 77
78. What is an index wireframe?
Figure: a B+-tree index at the source; its internal nodes constitute the wireframe, while the leaf pages hold the data.
Sudipto Das {sudipto@cs.ucsb.edu} 78
79. What is an index wireframe?
Figure: the internal nodes of the source's index copied to the destination as the wireframe; the leaf pages remain at the source.
Sudipto Das {sudipto@cs.ucsb.edu} 79
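To make the wireframe concrete for a B+-tree index (the example given in the speaker notes), the sketch below collects only the internal routing nodes, which is enough for the destination to navigate the key space, while the leaf pages holding the data stay behind until they are pulled or pushed. The node classes are hypothetical, not H2's.

import java.util.ArrayList;
import java.util.List;

public class WireframeExtractor {

    static class Node {
        boolean leaf;
        List<Node> children = new ArrayList<>();     // empty for leaf pages
        List<String> routingKeys = new ArrayList<>();
    }

    // Collect only the internal nodes of the index: enough for the destination to
    // locate any key, without shipping the (much larger) leaf pages.
    public static List<Node> extractWireframe(Node root) {
        List<Node> wireframe = new ArrayList<>();
        collectInternal(root, wireframe);
        return wireframe;
    }

    private static void collectInternal(Node node, List<Node> out) {
        if (node == null || node.leaf) return;       // leaves are not part of the wireframe
        out.add(node);
        for (Node child : node.children) collectInternal(child, out);
    }
}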
80. Dual mode
Figure: dual mode. Old, still-active transactions TSk+1, …, TSl run at the source; new transactions TD1, …, TDm start at the destination. Index wireframes remain frozen.
Sudipto Das {sudipto@cs.ucsb.edu} 80
81. Dual mode
Figure: dual mode, where transaction TDi at the destination accesses page P3, which is still owned by the source. Index wireframes remain frozen.
Sudipto Das {sudipto@cs.ucsb.edu} 81
82. Dual mode
Requests for un-owned pages can block
Figure: as before, transaction TDi at the destination accesses the un-owned page P3 and blocks.
Sudipto Das {sudipto@cs.ucsb.edu} 82
83. Dual mode
Requests for un-owned pages can block
Figure: page P3 is pulled from the source on demand, and its ownership moves to the destination, allowing TDi to proceed.
Sudipto Das {sudipto@cs.ucsb.edu} 83
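The dual-mode access rule sketched below assumes each page has exactly one owner at any time and that ownership only ever moves from source to destination: an access to an un-owned page at the destination blocks while the page is pulled, and, as a later slide notes, a transaction at the source that touches an already-migrated page must abort. Class and method names are hypothetical, not Zephyr's.

import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Page access during dual mode: each page has exactly one owner, and ownership
// only ever moves from source to destination.
public class DualModePages {

    interface PageTransfer { byte[] pullFromSource(int pageId); }   // blocks until the page arrives

    private final Map<Integer, byte[]> ownedPages = new ConcurrentHashMap<>();  // pages this node owns
    private final Set<Integer> migrated = ConcurrentHashMap.newKeySet();        // source side: pages given away
    private final PageTransfer transfer;

    public DualModePages(PageTransfer transfer) { this.transfer = transfer; }

    // Destination side: an access to an un-owned page blocks while it is pulled from the source.
    public byte[] readAtDestination(int pageId) {
        return ownedPages.computeIfAbsent(pageId, transfer::pullFromSource);
    }

    // Source side: hand a page over exactly once; it is never pulled back.
    public byte[] handOver(int pageId) {
        byte[] page = ownedPages.remove(pageId);
        migrated.add(pageId);
        return page;
    }

    // Source side: a transaction touching a page that has already migrated must abort.
    public byte[] readAtSource(int pageId) {
        if (migrated.contains(pageId)) {
            throw new IllegalStateException("page " + pageId + " already migrated; abort transaction");
        }
        return ownedPages.get(pageId);
    }
}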
84. Finish mode
Figure: finish mode. The remaining pages P1, P2, … are pushed from the source to the destination; transactions at the source have completed, and transactions TDm+1, …, TDn execute at the destination.
Sudipto Das {sudipto@cs.ucsb.edu} 84
85. Finish mode
Pages can be pulled by the destination, if needed
Figure: as in finish mode, the remaining pages are pushed from the source while transactions TDm+1, …, TDn run at the destination.
Sudipto Das {sudipto@cs.ucsb.edu} 85
86. Normal operation
Index wireframe un-frozen
Figure: normal operation. The destination owns all pages P1 … Pn and executes transactions TDn+1, …, TDp; migration is complete.
Sudipto Das {sudipto@cs.ucsb.edu} 86
87. Artifacts of this design
Once migrated, pages are never pulled back
by source
◦ Abort transactions at source accessing the
migrated pages
No structural changes to indices during
migration
◦ Abort transactions (at both nodes) that make
structural changes to indices
Destination “pulls” pages on-demand
◦ Transactions at the destination experience higher
latency compared to normal operation
Sudipto Das {sudipto@cs.ucsb.edu} 87
88. Implementation
Prototyped using an open source OLTP database, H2
◦ Supports standard SQL/JDBC API
◦ Serializable isolation level
◦ Tree indices
◦ Relational data model
Modified the database engine
◦ Added support for freezing indices
◦ Page migration status maintained using index
◦ ~6000 LOC
Tungsten SQL Router migrates JDBC connections during migration
Sudipto Das {sudipto@cs.ucsb.edu} 88
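Two of the engine changes listed above, freezing indices and tracking per-page migration status alongside the index, might look roughly like the following; this is a hedged sketch with hypothetical names, not the actual H2 modifications.

import java.util.Arrays;

public class MigratableIndex {

    // Migration status kept per page, alongside the index metadata.
    enum PageStatus { OWNED, MIGRATED }

    private volatile boolean frozen = false;
    private final PageStatus[] status;

    public MigratableIndex(int pageCount) {
        status = new PageStatus[pageCount];
        Arrays.fill(status, PageStatus.OWNED);
    }

    // Init mode freezes the index: key inserts and deletes may proceed only if they
    // do not split or merge pages; structural changes abort the transaction.
    public void freeze() { frozen = true; }
    public void unfreeze() { frozen = false; }

    public void beforeStructuralChange() {
        if (frozen) {
            throw new IllegalStateException("index frozen during migration; abort transaction");
        }
    }

    public void markMigrated(int pageId) { status[pageId] = PageStatus.MIGRATED; }

    public boolean isOwned(int pageId) { return status[pageId] == PageStatus.OWNED; }
}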
89. Results Overview
Downtime (partition unavailability)
◦ Stop & Copy (S&C): 3 – 8 seconds (time needed to migrate, during which the partition is unavailable for updates)
◦ Zephyr: no downtime; either the source or the destination is available
Service interruption (failed operations)
◦ S&C: ~100s – 1,000s of operations; all transactions with updates are aborted
◦ Zephyr: ~10s – 100s of operations; an order of magnitude less interruption
Minimal operational and data transfer overhead
Sudipto Das {sudipto@cs.ucsb.edu} 89
94. Future Directions
Self-managing controller for large
multitenant database infrastructures
Convergence of transactional and analytics
systems for real-time intelligence
Putting human-in-the-loop: Leveraging
crowd-sourcing
Sudipto Das {sudipto@cs.ucsb.edu} 94
95. Acknowledgements
My advisors and my committee members
Computer Science Dept. at UCSB
Funding sources: NSF, NEC Labs America,
and AWS in Education
Colleagues at DSL and at UCSB
My family
November 16, 2011 Sudipto Das {sudipto@cs.ucsb.edu} 95
96. Thank you!
Collaborators
UCSB:
Divy Agrawal, Amr El Abbadi, Ömer Eğecioğlu
Shashank Agarwal, Shyam Antony, Aaron Elmore,
Shoji Nishimura (NEC Japan)
Microsoft Research Redmond:
Phil Bernstein, Colin Reid
IBM Almaden:
Yannis Sismanis, Kevin Beyer, Rainer Gemulla,
Peter Haas, John McPherson
Editor's notes
In the last few years, we have witnessed a trend where web applications have been replacing desktop applications, and large numbers of applications are now accessed via the browser.
This shift from desktop to the web has also resulted in a paradigm shift in the application deployment infrastructure, resulting in a paradigm popularly known as Cloud Computing.
In its simplest form, cloud computing is essentially computing infrastructure and solutions delivered as a service. Analysts predict that this industry will be worth 150 billion dollars by 2014. Even though almost every aspect of computing can be provided as a service, there have been three popular cloud paradigms. Infrastructure as a Service, the lowest level of abstraction, provides raw CPU, storage, and network as a service; popular examples include Amazon Web Services, Rackspace, etc. The next higher level of abstraction is Platform as a Service, which provides a platform or containers to deploy applications, where the platform provider abstracts data management, fault-tolerance, elastic scaling, etc., thus simplifying application deployment; popular examples include Google AppEngine and Windows Azure. The highest level of abstraction is Software as a Service, which exposes a simple interface to customize pre-designed application logic; a popular example is Salesforce.com. Major factors that have contributed to the success of cloud platforms are advances on the technology front, such as virtualization and pervasive broadband internet connectivity, as well as business and economic factors, such as economies of scale and transfer of risks. In this talk, we focus on cloud application platforms, in particular the database systems that serve them.
Data is central to all modern applications, and most modern enterprises manage petabytes of data. Hence DBMSs form a mission-critical component in the cloud software stack and are key to success as well as to generating revenue. Considering the data needs of web applications, there are two broad categories of systems: on one hand are OLTP systems that store and serve data; on the other hand are OLAP systems that provide intelligence and decision support. In this talk, we will focus on OLTP systems. Bring in the concept of the service provider and the service user, and whose problem we are solving (NEC discussion).
Therefore, in summary, the major challenges for an OLTP database in the cloud are: supporting transactions and scale-out while minimizing the number of distributed transactions, supporting lightweight elastic scaling in a live system, and providing autonomic control with intelligence similar to a human controller.
Stress the ACID properties of transactions and how applications benefit from them by simplifying their design.
Therefore, if we consider scale-out as the vertical axis and functionality (or support for transactions) as the horizontal axis, at one extreme are the RDBMSs that support rich functionality but are hard to scale out, and at the other extreme are key-value stores that allow scaling out to thousands of servers but support limited functionality. There exists a big chasm between the two types of systems, and the challenge is to bridge this divide by efficiently supporting transactions while scaling out. Cloud platforms are multitenant and must support a variety of applications with varying needs; therefore, bridging this chasm is important to support a variety of applications. Functionality: whether transactions are a subset.
In addition, when such a database is deployed on an elastic pay-per-use cloud infrastructure that allows for on-demand provisioning, compared to static provisioning for the peak load, the challenge is to make the database layer as elastic as the underlying cloud infrastructure without introducing a lot of overhead. Scale vs. elasticity.
To this end, my dissertation makes the following contributions to address these challenges. We propose two different solutions to support transactions at scale for two different application scenarios: ElasTraS allows for elastically scalable transaction execution in databases where partitions are statically defined, while G-Store allows efficient transaction processing where database partitions are dynamically defined. Supporting lightweight elasticity essentially boils down to lightweight migration of database partitions in a live system; to this end, we propose two different techniques for two common database architectures: Albatross is a technique for live migration in databases that use the storage abstraction, while Zephyr is a technique for live migration in shared nothing database architectures. Finally, we are currently working on the design of Pythia, an autonomic controller. In the interest of time, in this talk I will only get into the details of G-Store and Zephyr while providing a very high level overview of ElasTraS.
But before we delve into the details, I would like to spend a couple of minutes giving an overview of my research in the broader area of data management. The current talk, and my thesis, focuses on the OLTP aspect. On the data analysis front, I have worked on multiple projects. As an intern at IBM Almaden, I worked on a project called Ricardo that provides the ability for deep statistical analysis and modeling over large amounts of data; this paper was published in SIGMOD 2010, and parts of the framework ship in IBM InfoSphere BigInsights Enterprise Edition. Recently, I worked on a project called MD-HBase that presents the design and implementation of a scalable multi-dimensional indexing mechanism to support efficient high-throughput location updates and multi-dimensional analysis queries on top of a key-value store. Earlier, I have also worked on data stream processing systems, providing intra-operator parallelism in common data stream operators, such as frequent elements or top-k elements, to efficiently exploit multicore processors. I have also worked on designing systems to exploit novel hardware architectures.
The goal of partitioning the schema is to leverage the application semantics and access patterns to minimize the number of distributed transactions.
Now we know how to scale out when the partitions are statically defined. So let's make it a bit more interesting: how do we scale out with transactions on dynamically formed partitions? Recall that our concept of a partition is the set of data items that are frequently accessed within the same transaction. For certain applications, that set might change with time. For instance, in online multi-player games, the application needs transactional access to the player profiles that are part of the same game instance, and this set changes with time. Similar behavior is observed in a number of collaboration-based applications (examples?).
If the player profiles are part of the same database partition, then transactions on this group of players can be executed efficiently.
However, this group of players change with time, thus resulting in the concept of dynamically defined database partitions.
Scale.
Paper has more detailed evaluation
So what does elasticity in the database tier mean? Mention the cost-performance trade-off, and repeat the fact that it is the cloud infrastructure that allows us to optimize operating cost, something that was not thought important in classical infrastructures.
Define wireframe in this slide. Defer index wireframe definition to the later slide.
Freeze: no structural modifications to the indices. Wireframe: the minimal information needed to start executing transactions at the destination, including schema information, user authentication, the index wireframes, etc.
Just to give a concrete example of a wireframe, if we consider a B+ tree index, then only the internal nodes of the indices are migrated as part of the wireframe.
Once the destination is initialized with the minimal information, it can start executing transactions. At this point, migration enters the Dual mode where both the source and destination are executing transactions, new transactions arrive at the destination while the source continues execution of transactions that were active at the start of migration.