Cloud computing has emerged as a multi-billion dollar industry and as a successful paradigm for web application deployment. Economies of scale, elasticity, and pay-per-use pricing are the biggest promises of the cloud. Database management systems (DBMSs) serving these web applications form a critical component of the cloud software stack. These DBMSs must be able to scale out to clusters of commodity servers to serve thousands of applications and their huge volumes of data. Moreover, to minimize operating costs, such DBMSs must also be elastic, i.e., possess the ability to increase and decrease the cluster size in a live system. This is in addition to serving a variety of applications (i.e., supporting multitenancy) while being self-managing, fault-tolerant, and highly available.
The overarching goal of my dissertation is to propose abstractions, protocols, and paradigms for designing scalable and elastic database management systems that address the unique set of challenges posed by the cloud. My dissertation shows that with a careful choice of design and features, it is possible to architect scalable DBMSs that efficiently support transactional semantics to ease application design and elastically adapt to fluctuating operational demands to optimize the operating cost. In this talk, I will outline my work that embodies this principle. In the first part, I will present techniques and system architectures to enable efficient and scalable transaction processing on clusters of commodity servers. In the second part, I will present techniques for on-demand database migration in a live system, a primitive operation critical to supporting lightweight elasticity as a first-class feature in DBMSs. I will conclude the talk with a discussion of possible future directions.
Scalable and Elastic Transactional Data Stores for Cloud Computing Platforms
1. PhD Defense
Scalable and Elastic
Transactional Data Stores for
Cloud Computing Platforms
Sudipto Das
Computer Science, UC Santa Barbara
sudipto@cs.ucsb.edu
Committee:
Divy Agrawal (co-chair), Amr El Abbadi (co-chair),
Phil Bernstein, Tim Sherwood
Sponsors:
5. Cloud computing
Computing infrastructure and solutions delivered as a service
◦ Industry worth USD 150 billion by 2014*
Contributors to success
◦ Economies of scale
◦ Elasticity and pay-per-use pricing
Popular paradigms
◦ Infrastructure as a Service (IaaS)
◦ Platform as a Service (PaaS)
◦ Software as a Service (SaaS)
*http://www.crn.com/news/channel-programs/225700984/cloud-computing-services-market-to-near-150-billion-in-2014.htm
Sudipto Das {sudipto@cs.ucsb.edu} 5
6. Databases for cloud platforms
Data is central to applications
DBMSs are a mission-critical component in the cloud software stack
◦ Manage petabytes of data, drive revenue
◦ Serve a variety of applications (multitenancy)
Data needs for cloud applications
◦ OLTP systems: store and serve data
◦ Data analysis systems: decision support, intelligence
Sudipto Das {sudipto@cs.ucsb.edu} 6
7. Databases for cloud platforms
Data is central to applications
DBMSs are a mission-critical component in the cloud software stack
◦ Manage petabytes of data, drive revenue
◦ Serve a variety of applications (multitenancy)
Data needs for cloud applications
◦ OLTP systems: store and serve data
◦ Data analysis systems: decision support, intelligence
Sudipto Das {sudipto@cs.ucsb.edu} 7
8. Application landscape
Figure: the application landscape; social gaming, rich content and mash-ups, and managed applications, all built on cloud application platforms.
Sudipto Das {sudipto@cs.ucsb.edu} 8
9. Challenges for OLTP systems
Scalability
◦ While ensuring efficient transaction execution!
Lightweight Elasticity
◦ Scale on-demand!
Sudipto Das {sudipto@cs.ucsb.edu} 9
10. Two approaches to scalability
Scale-up
◦ Preferred in classical enterprise setting (RDBMS)
◦ Flexible ACID transactions
◦ Transactions access a single node
Sudipto Das {sudipto@cs.ucsb.edu} 10
11. Two approaches to scalability
Scale-up
◦ Preferred in classical enterprise setting (RDBMS)
◦ Flexible ACID transactions
◦ Transactions access a single node
Scale-out
◦ Cloud friendly (Key-value stores)
◦ Execution at a single server
Limited functionality & guarantees
◦ No multi-row or multi-step transactions
Sudipto Das {sudipto@cs.ucsb.edu} 11
12. Why care about transactions?
confirm_friend_request(user1, user2)
{
begin_transaction();
update_friend_list(user1, user2, status.confirmed);
update_friend_list(user2, user1, status.confirmed);
end_transaction();
}
Sudipto Das {sudipto@cs.ucsb.edu} 12
13. Why care about transactions?
confirm_friend_request(user1, user2)
{
begin_transaction();
update_friend_list(user1, user2, status.confirmed);
update_friend_list(user2, user1, status.confirmed);
end_transaction();
}
Simplicity in application design with ACID transactions
Sudipto Das {sudipto@cs.ucsb.edu} 13
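To make the simplicity concrete, here is a minimal JDBC sketch of the confirm_friend_request logic shown above. The table and column names (friend_list, user_id, friend_id, status) are hypothetical and this is only an illustrative sketch, not code from any of the systems in this talk; the point is that both updates commit or roll back together.

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class FriendRequests {
    // Both updates succeed or neither does; the application never sees a half-confirmed friendship.
    public static void confirmFriendRequest(Connection conn, long user1, long user2) throws SQLException {
        conn.setAutoCommit(false);                       // begin_transaction()
        try (PreparedStatement stmt = conn.prepareStatement(
                "UPDATE friend_list SET status = 'confirmed' WHERE user_id = ? AND friend_id = ?")) {
            stmt.setLong(1, user1);
            stmt.setLong(2, user2);
            stmt.executeUpdate();                        // update_friend_list(user1, user2, confirmed)
            stmt.setLong(1, user2);
            stmt.setLong(2, user1);
            stmt.executeUpdate();                        // update_friend_list(user2, user1, confirmed)
            conn.commit();                               // end_transaction()
        } catch (SQLException e) {
            conn.rollback();                             // atomicity: undo any partial update
            throw e;
        }
    }
}

Without the transaction, the application itself would have to detect and repair a half-confirmed friendship after a failure between the two updates.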
16. Challenge: Transactions at Scale
Figure: key-value stores offer scale-out; RDBMSs offer ACID transactions; the challenge is bridging the chasm between the two.
Sudipto Das {sudipto@cs.ucsb.edu} 16
17. Challenge: Lightweight Elasticity
Provisioning on-demand and not for peak
Optimize operating cost!
Figure: resources and demand over time. Traditional infrastructures provision capacity for peak demand, leaving unused resources; deployment in the cloud scales capacity with demand.
Slide Credits: Berkeley RAD Lab
Sudipto Das {sudipto@cs.ucsb.edu} 17
19. Contributions for OLTP systems
It is possible to architect scalable DBMSs that efficiently support transactional semantics to ease application design and elastically adapt to fluctuating operational demands to optimize the operating cost.
Sudipto Das {sudipto@cs.ucsb.edu} 19
20. Contributions for OLTP systems
It is possible to architect scalable DBMSs that efficiently support transactional semantics to ease application design and elastically adapt to fluctuating operational demands to optimize the operating cost.
Transactions at Scale
◦ ElasTraS [HotCloud 2009, UCSB TR 2010]
◦ G-Store [SoCC 2010]
Sudipto Das {sudipto@cs.ucsb.edu} 20
21. Contributions for OLTP systems
It is possible to architect scalable DBMSs that efficiently support transactional semantics to ease application design and elastically adapt to fluctuating operational demands to optimize the operating cost.
Transactions at Scale
◦ ElasTraS [HotCloud 2009, UCSB TR 2010]
◦ G-Store [SoCC 2010]
Lightweight Elasticity
◦ Albatross [VLDB 2011]
◦ Zephyr [SIGMOD 2011]
Sudipto Das {sudipto@cs.ucsb.edu} 21
22. Contributions for OLTP systems
It is possible to architect scalable DBMSs that efficiently support transactional semantics to ease application design and elastically adapt to fluctuating operational demands to optimize the operating cost.
Transactions at Scale
◦ ElasTraS [HotCloud 2009, UCSB TR 2010]
◦ G-Store [SoCC 2010]
Lightweight Elasticity
◦ Albatross [VLDB 2011]
◦ Zephyr [SIGMOD 2011]
Sudipto Das {sudipto@cs.ucsb.edu} 22
24. Contributions
Data Management
Analytics
◦ Ricardo [SIGMOD ‘10]
◦ MD-HBase [MDM ‘11], Best Paper Runner up
◦ Anonimos [ICDE ‘10], [TKDE]
Transaction Processing (the dissertation)
◦ Dynamic partitioning: G-Store [SoCC ‘10]
◦ Static partitioning: ElasTraS [HotCloud ‘09], [TR ‘10]
◦ Albatross [VLDB ‘11], Zephyr [SIGMOD ‘11]
Sudipto Das {sudipto@cs.ucsb.edu} 24
25. Contributions
Data Management
Analytics
◦ Ricardo [SIGMOD ‘10]
◦ MD-HBase [MDM ‘11], Best Paper Runner up
◦ Anonimos [ICDE ‘10], [TKDE]
Transaction Processing (the dissertation)
◦ Dynamic partitioning: G-Store [SoCC ‘10]
◦ Static partitioning: ElasTraS [HotCloud ‘09], [TR ‘10]
◦ Albatross [VLDB ‘11], Zephyr [SIGMOD ‘11]
Novel Architectures
◦ Hyder [CIDR ‘11], Best Paper
◦ CoTS [ICDE ‘09], [VLDB ‘09]
◦ TCAM [DaMoN ‘08]
Sudipto Das {sudipto@cs.ucsb.edu} 25
26. Transactions at Scale
Figure: key-value stores offer scale-out; RDBMSs offer ACID transactions; the challenge is bridging the chasm between the two.
Sudipto Das {sudipto@cs.ucsb.edu} 26
27. Scale-out with static partitioning
Table level partitioning (range, hash)
◦ Distributed transactions
Partitioning the Database schema
◦ Co-locate data items accessed together
◦ Goal: Minimize distributed transactions
Sudipto Das {sudipto@cs.ucsb.edu} 27
28. Scale-out with static partitioning
Table level partitioning (range, hash)
◦ Distributed transactions
Partitioning the Database schema
◦ Co-locate data items accessed together
◦ Goal: Minimize distributed transactions
Sudipto Das {sudipto@cs.ucsb.edu} 28
29. Scale-out with static partitioning
Table level partitioning (range, hash)
◦ Distributed transactions
Partitioning the Database schema
◦ Co-locate data items accessed together
◦ Goal: Minimize distributed transactions
Scaling-out with static partitioning
◦ ElasTraS [HotCloud 2009, TR 2010]
Sudipto Das {sudipto@cs.ucsb.edu} 29
30. Scale-out with static partitioning
Table level partitioning (range, hash)
◦ Distributed transactions
Partitioning the Database schema
◦ Co-locate data items accessed together
◦ Goal: Minimize distributed transactions
Scaling-out with static partitioning
◦ ElasTraS [HotCloud 2009, TR 2010]
◦ Cloud SQL Server [ICDE 2011]
◦ MegaStore [CIDR 2011]
◦ RelationalCloud [CIDR 2011]
Sudipto Das {sudipto@cs.ucsb.edu} 30
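As a rough illustration of how schema-level partitioning co-locates related rows, the sketch below routes every row to a partition by the root key of its schema tree (for example the owning user or tenant) instead of by its own primary key, so that a transaction over that tree touches a single partition. The class and key names are hypothetical assumptions, not taken from ElasTraS or the other systems listed above.

import java.util.Objects;

public class SchemaPartitioner {
    private final int numPartitions;

    public SchemaPartitioner(int numPartitions) { this.numPartitions = numPartitions; }

    // All rows of one schema tree (a user row, her orders, her order lines, ...)
    // carry the same root key, so a transaction over that tree maps to one partition.
    public int partitionFor(String rootKey) {
        return Math.floorMod(Objects.hashCode(rootKey), numPartitions);
    }

    public static void main(String[] args) {
        SchemaPartitioner p = new SchemaPartitioner(16);
        // The user row and any dependent rows keyed by the same root land together.
        System.out.println(p.partitionFor("user:1042"));   // partition of the user row
        System.out.println(p.partitionFor("user:1042"));   // same partition for the user's order rows
    }
}

Range partitioning on the root key would work equally well; the essential design choice is partitioning by the root of the schema tree rather than by each table independently, which is what keeps most transactions local to one node.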
31. Dynamically formed partitions
Access patterns change, often rapidly
◦ Online multi-player gaming applications
◦ Collaboration based applications
◦ Scientific computing applications
Not amenable to static partitioning
Sudipto Das {sudipto@cs.ucsb.edu} 31
32. Dynamically formed partitions
Access patterns change, often rapidly
◦ Online multi-player gaming applications
◦ Collaboration based applications
◦ Scientific computing applications
Not amenable to static partitioning
How to get the benefit of partitioning
when accesses do not statically partition?
◦ Ours is the first solution to allow that
Sudipto Das {sudipto@cs.ucsb.edu} 32
38. Online Multi-player Games
Hundreds of thousands
of concurrent groups
Sudipto Das {sudipto@cs.ucsb.edu} 38
39. Data Fusion for dynamic partitions
[G-Store, SoCC 2010]
Transactional access to a group of data
items formed on-demand
Challenge: Avoid distributed transactions!
Sudipto Das {sudipto@cs.ucsb.edu} 39
40. Data Fusion for dynamic partitions
[G-Store, SoCC 2010]
Transactional access to a group of data
items formed on-demand
Challenge: Avoid distributed transactions!
Key Group Abstraction
◦ Groups are small
◦ Groups have non-trivial lifetime
◦ Groups are dynamic and on-demand
Groups are dynamically formed tenant
databases
Sudipto Das {sudipto@cs.ucsb.edu} 40
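One possible client-facing shape of the Key Group abstraction is sketched below: a group is formed on demand over an arbitrary set of keys, accessed transactionally for its (non-trivial) lifetime, and then dissolved. The interface and method names are illustrative assumptions, not G-Store's actual API.

import java.util.Set;

// Hypothetical interface: what an application sees when using the Key Group abstraction.
public interface KeyGroupStore {

    // Form a group over the given keys; ownership of the keys is transferred to one leader node.
    String createGroup(Set<String> keys);

    // Execute a multi-key ACID transaction, restricted to keys inside a single group.
    <T> T execute(String groupId, GroupTransaction<T> txn);

    // Dissolve the group and return key ownership to the followers.
    void deleteGroup(String groupId);

    interface GroupTransaction<T> {
        T run(GroupSnapshot snapshot);
    }

    interface GroupSnapshot {
        byte[] read(String key);
        void write(String key, byte[] value);
    }
}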
41. Transactions on Groups
Without distributed transactions
One key selected as the leader
Sudipto Das {sudipto@cs.ucsb.edu} 41
42. Transactions on Groups
Without distributed transactions
One key selected as the leader
Followers transfer ownership of keys to leader
Sudipto Das {sudipto@cs.ucsb.edu} 42
43. Transactions on Groups
Without distributed transactions
Figure: the keys of a Key Group, with ownership of all keys consolidated at a single node.
One key selected as the leader
Followers transfer ownership of keys to leader
Sudipto Das {sudipto@cs.ucsb.edu} 43
44. Transactions on Groups
Without distributed transactions
Grouping Protocol
Figure: the keys of a Key Group, with ownership of all keys consolidated at a single node.
One key selected as the leader
Followers transfer ownership of keys to leader
Sudipto Das {sudipto@cs.ucsb.edu} 44
45. Why is group formation hard?
Guarantee the contract between
leaders and followers in the presence of:
◦ Leader and follower failures
◦ Lost, duplicated, or re-ordered messages
◦ Dynamics of the underlying system
How to ensure efficient and ACID
execution of transactions?
Sudipto Das {sudipto@cs.ucsb.edu} 45
46. Grouping protocol
Figure: grouping protocol timeline. On a create request, the leader logs L(Creating) and sends a join request J to the followers; each follower logs L(Joining) and replies JA; the leader logs L(Joined) and acknowledges with JAA, after which the follower logs L(Joined).
Sudipto Das {sudipto@cs.ucsb.edu} 46
47. Grouping protocol
Figure: the same protocol timeline; once the group is formed, group operations execute at the leader.
Sudipto Das {sudipto@cs.ucsb.edu} 47
48. Grouping protocol
Figure: the full protocol timeline including deletion. A delete request causes the leader to log L(Deleting) and send D to the followers; each follower logs L(Free) and replies DA; the leader then logs L(Deleted).
Sudipto Das {sudipto@cs.ucsb.edu} 48
49. Grouping protocol
Figure: the same timeline, with the L(…) state transitions highlighted as log entries.
Sudipto Das {sudipto@cs.ucsb.edu} 49
50. Grouping protocol
Figure: the same timeline, with the L(…) state transitions highlighted as log entries.
Conceptually akin to “locking”
◦ Locks held by groups
Sudipto Das {sudipto@cs.ucsb.edu} 50
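The figures on the preceding slides can be compressed into the leader-side sketch below, under the assumption that every state transition is written to the log before the corresponding message is sent, so a recovering node can re-derive where the protocol stood. The message and state names (J, JA, JAA, D, DA, L(...)) follow the slides; the code structure itself is hypothetical, not G-Store's implementation.

// Leader-side view of the grouping protocol; followers run the mirror image.
public class GroupingProtocolLeader {

    enum State { CREATING, JOINED, DELETING, DELETED }

    private State state = State.CREATING;
    private final Log log;                 // write-ahead log used to recover protocol state
    private final Messenger messenger;     // delivers J, JA, JAA, D, DA (may lose, duplicate, or reorder them)

    public GroupingProtocolLeader(Log log, Messenger messenger) {
        this.log = log;
        this.messenger = messenger;
    }

    // Create request: log the intent, then ask each follower to join (message J).
    public void createGroup(Iterable<String> followers) {
        log.append("L(Creating)");
        for (String f : followers) messenger.send(f, "J");
    }

    // Join-ack (JA) from a follower: the follower has logged L(Joining) and yielded ownership.
    public void onJoinAck(String follower) {
        log.append("L(Joined):" + follower);
        state = State.JOINED;
        messenger.send(follower, "JAA");   // acknowledge the ack; safe to resend if JA is duplicated
    }

    // Delete request: symmetric teardown, returning key ownership to the followers.
    public void deleteGroup(Iterable<String> followers) {
        log.append("L(Deleting)");
        state = State.DELETING;
        for (String f : followers) messenger.send(f, "D");
    }

    // Delete-ack (DA) from a follower: that follower has logged L(Free) again.
    public void onDeleteAck(String follower) {
        log.append("L(Deleted):" + follower);
        state = State.DELETED;
    }

    public State state() { return state; }

    interface Log { void append(String entry); }
    interface Messenger { void send(String node, String message); }
}

Because messages may be lost, duplicated, or re-ordered, the handlers are written so that replaying an acknowledgement only re-appends a log record and resends a message, matching the locking analogy on the slide above: the group effectively holds a lock on its keys until the delete phase releases it.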
51. Efficient transaction processing
How does the leader execute transactions?
◦ Caches data for group members; underlying data store equivalent to a disk
◦ Transaction logging for durability
◦ Cache asynchronously flushed to propagate updates
◦ Guaranteed update propagation
Figure: the leader runs a transaction manager, a log, and a cache manager; updates propagate asynchronously to the followers.
Sudipto Das {sudipto@cs.ucsb.edu} 51
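A minimal sketch of that write path follows: the leader logs a commit record for durability, applies the writes to its cache, and a background task propagates dirty entries to the underlying key-value store, which plays the role of the disk. All class names are hypothetical; G-Store's actual implementation differs.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Write path of a group leader: log for durability, apply to the cache,
// propagate to the underlying key-value store asynchronously.
public class GroupLeader {

    interface WriteAheadLog { void appendAndFlush(String txnRecord); }
    interface KeyValueStore { void put(String key, byte[] value); }

    private final WriteAheadLog log;
    private final KeyValueStore store;                        // the "disk": followers' storage layer
    private final Map<String, byte[]> cache = new ConcurrentHashMap<>();
    private final Map<String, byte[]> dirty = new ConcurrentHashMap<>();
    private final ScheduledExecutorService flusher = Executors.newSingleThreadScheduledExecutor();

    public GroupLeader(WriteAheadLog log, KeyValueStore store) {
        this.log = log;
        this.store = store;
        // Asynchronous update propagation back to the key-value store.
        flusher.scheduleWithFixedDelay(this::flushDirtyEntries, 1, 1, TimeUnit.SECONDS);
    }

    // Commit a transaction's writes: durable once the log record is flushed.
    public synchronized void commit(String txnId, Map<String, byte[]> writes) {
        log.appendAndFlush("COMMIT " + txnId + " " + writes.keySet());   // durability
        cache.putAll(writes);                                            // visible to later transactions
        dirty.putAll(writes);                                            // queued for propagation
    }

    public byte[] read(String key) { return cache.get(key); }

    private void flushDirtyEntries() {
        for (Map.Entry<String, byte[]> e : dirty.entrySet()) {
            store.put(e.getKey(), e.getValue());                         // guaranteed update propagation
            dirty.remove(e.getKey(), e.getValue());                      // keep newer writes queued
        }
    }
}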
52. Prototype: G-Store [SoCC 2010]
An implementation over Key-value stores
Figure: application clients issue transactional multi-key accesses to G-Store; each node runs a grouping layer and a transaction manager on top of the key-value store logic, over a shared distributed storage layer.
Sudipto Das {sudipto@cs.ucsb.edu} 52
53. Prototype: G-Store [SoCC 2010]
An implementation over Key-value stores
Grouping middleware layer resident on top of a key-value store
Figure: application clients issue transactional multi-key accesses to G-Store; each node runs a grouping layer and a transaction manager on top of the key-value store logic, over a shared distributed storage layer.
Sudipto Das {sudipto@cs.ucsb.edu} 53
54. G-Store Evaluation
Implemented using HBase
◦ Added the middleware layer
◦ ~10000 LOC
Experiments in Amazon EC2
Benchmark: An online multi-player game
Cluster size: 10 nodes
Data size: ~1 billion rows (>1 TB)
Sudipto Das {sudipto@cs.ucsb.edu} 54
55. G-Store Evaluation
Implemented using HBase
◦ Added the middleware layer
◦ ~10000 LOC
Experiments in Amazon EC2
Benchmark: An online multi-player game
Cluster size: 10 nodes
Data size: ~1 billion rows (>1 TB)
For groups with 100 keys
◦ Group creation latency: ~10 – 100ms
◦ More than 10,000 groups concurrently created
Sudipto Das {sudipto@cs.ucsb.edu} 55
56. G-Store Evaluation
Figures: group creation latency and group creation throughput.
Sudipto Das {sudipto@cs.ucsb.edu} 56
57. Lightweight Elasticity
Provisioning on-demand and not for peak
Optimize operating cost!
Figure: resources and demand over time. Traditional infrastructures provision capacity for peak demand, leaving unused resources; deployment in the cloud scales capacity with demand.
Slide Credits: Berkeley RAD Lab
Sudipto Das {sudipto@cs.ucsb.edu} 57
58. Elasticity in the Database tier
Figure: a load balancer in front of the application/web/caching tier and the database tier.
Sudipto Das {sudipto@cs.ucsb.edu} 58
65. Live database migration
Migrate a database partition (or tenant)
in a live system
◦ Optimize operating cost
◦ Resource orchestration in multitenant
systems
Sudipto Das {sudipto@cs.ucsb.edu} 65
66. Live database migration
Migrate a database partition (or tenant)
in a live system
◦ Optimize operating cost
◦ Resource orchestration in multitenant
systems
Different from
◦ Migration between software versions
◦ Migration in case of schema evolution
Sudipto Das {sudipto@cs.ucsb.edu} 66
67. VM migration for DB elasticity
One DB partition per VM
◦ Pros: allows fine-grained load balancing
◦ Cons: performance overhead; poor consolidation ratio [Curino et al., CIDR 2011]
Figure: one VM per DB partition on a hypervisor.
Sudipto Das {sudipto@cs.ucsb.edu} 67
68. VM migration for DB elasticity
One DB partition per VM
◦ Pros: allows fine-grained load balancing
◦ Cons: performance overhead; poor consolidation ratio [Curino et al., CIDR 2011]
Multiple DB partitions in a VM
◦ Pros: good performance
◦ Cons: migrate all partitions; coarse-grained load balancing
Figure: one VM per partition versus multiple partitions in one VM, each on a hypervisor.
Sudipto Das {sudipto@cs.ucsb.edu} 68
69. Live database migration
Multiple partitions share the same database process
◦ Shared process multitenancy
Migrate individual partitions on-demand in a live system
◦ Virtualization in the database tier
Straightforward solution
◦ Stop serving partition at the source
◦ Copy to destination
◦ Start serving at the destination
◦ Expensive!
Sudipto Das {sudipto@cs.ucsb.edu} 69
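For contrast with the live migration techniques that follow, the sketch below spells out the straightforward stop-and-copy solution from the slide above; the partition stays unavailable for the entire copy, which is exactly why it is expensive. The interface and names are hypothetical.

public class StopAndCopyMigration {

    interface PartitionServer {
        void stopServing(String partitionId);               // all requests fail from here on
        Iterable<byte[]> readAllPages(String partitionId);  // full persistent image, possibly gigabytes
        void writePages(String partitionId, Iterable<byte[]> pages);
        void startServing(String partitionId);
    }

    // Downtime lasts for the whole copy, so unavailability grows with the size of the partition.
    public static void migrate(String partitionId, PartitionServer source, PartitionServer destination) {
        source.stopServing(partitionId);                     // partition unavailable from this point...
        destination.writePages(partitionId, source.readAllPages(partitionId));
        destination.startServing(partitionId);               // ...until this point
    }
}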
70. Migration cost measures
Service unavailability
◦ Time the partition is unavailable
Number of failed requests
◦ Number of operations failing or transactions aborting
Performance overhead
◦ Impact on response times
Additional data transferred
Sudipto Das {sudipto@cs.ucsb.edu} 70
71. Two common DBMS architectures
Decoupled storage
architectures
◦ ElasTraS, G-Store, Deuteronomy,
MegaStore
◦ Persistent data is not migrated
◦ Albatross [VLDB 2011]
Shared nothing architectures
◦ SQL Azure, Relational Cloud,
MySQL Cluster
◦ Migrate persistent data
◦ Zephyr [SIGMOD 2011]
Sudipto Das {sudipto@cs.ucsb.edu} 71
72. Two common DBMS architectures
Decoupled storage
architectures
◦ ElasTraS, G-Store, Deuteronomy,
MegaStore
◦ Persistent data is not migrated
◦ Albatross [VLDB 2011]
Shared nothing architectures
◦ SQL Azure, Relational Cloud,
MySQL Cluster
◦ Migrate persistent data
◦ Zephyr [SIGMOD 2011]
Sudipto Das {sudipto@cs.ucsb.edu} 72
73. Why is live DB migration hard?
Persistent DB image must be migrated (GBs)
◦ How to ensure no downtime?
Nodes can fail during migration
◦ How to guarantee correctness during
failures?
Transaction atomicity and durability.
Recover migration state after failure.
Transactions execute during migration
◦ How to guarantee serializability?
Transaction correctness equivalent to normal operation
Sudipto Das {sudipto@cs.ucsb.edu} 73
74. Our approach: Zephyr
[SIGMOD 2011]
Migration executed in phases
◦ Starts with transfer of minimal information to destination (“wireframe”)
Database pages used as granule of migration
◦ Unique page ownership
Source and destination concurrently execute transactions in one migration phase
Minimal transaction synchronization
Guaranteed serializability
Logging and handshaking protocols
Sudipto Das {sudipto@cs.ucsb.edu} 74
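The phases can be summarized as a small state machine, sketched below with the mode names used in the talk (Init, Dual, Finish); the transition comments paraphrase the slides that follow, and everything else is an illustrative assumption rather than Zephyr's actual code.

// Zephyr-style migration phases, as described on the slide above.
public class MigrationPhases {

    enum Mode { NORMAL, INIT, DUAL, FINISH }

    private Mode mode = Mode.NORMAL;

    // Init mode: freeze the index structure and ship the wireframe (internal index
    // nodes, schema, metadata) to the destination; no database pages move yet.
    public void startMigration() { advance(Mode.NORMAL, Mode.INIT); }

    // Dual mode: both nodes execute transactions. New transactions start at the
    // destination, transactions active at the start of migration finish at the source;
    // pages move on demand, each owned by exactly one node at a time.
    public void enterDualMode() { advance(Mode.INIT, Mode.DUAL); }

    // Finish mode: the source stops accepting work and pushes the remaining pages.
    public void enterFinishMode() { advance(Mode.DUAL, Mode.FINISH); }

    // Normal operation at the destination: wireframe unfrozen, all pages owned there.
    public void completeMigration() { advance(Mode.FINISH, Mode.NORMAL); }

    public Mode mode() { return mode; }

    private void advance(Mode expected, Mode next) {
        if (mode != expected) {
            throw new IllegalStateException("cannot enter " + next + " from " + mode);
        }
        mode = next;
    }
}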
75. Simplifying assumptions
For this talk
◦ Transactions access a single partition
◦ No replication
◦ No structural changes to indices
Extensions in the paper [SIGMOD 2011]
◦ Relaxes these assumptions
Sudipto Das {sudipto@cs.ucsb.edu} 75
76. Design overview
Figure: the source owns all database pages P1 … Pn and runs the active transactions TS1, …, TSk; the destination holds nothing yet.
Sudipto Das {sudipto@cs.ucsb.edu} 76
77. Init mode
Freeze indices and migrate wireframe
Figure: after the wireframe transfer, the destination holds un-owned copies of pages P1 … Pn; the source still owns all pages and continues executing the active transactions TS1, …, TSk.
Sudipto Das {sudipto@cs.ucsb.edu} 77
78. What is an index wireframe?
Figure: a B+-tree index at the source; its internal nodes constitute the wireframe, while the leaf pages hold the data.
Sudipto Das {sudipto@cs.ucsb.edu} 78
79. What is an index wireframe?
Figure: the internal nodes of the source's index copied to the destination as the wireframe; the leaf pages remain at the source.
Sudipto Das {sudipto@cs.ucsb.edu} 79
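To make the wireframe concrete for a B+-tree index (the example given in the speaker notes), the sketch below collects only the internal routing nodes, which is enough for the destination to navigate the key space, while the leaf pages holding the data stay behind until they are pulled or pushed. The node classes are hypothetical, not H2's.

import java.util.ArrayList;
import java.util.List;

public class WireframeExtractor {

    static class Node {
        boolean leaf;
        List<Node> children = new ArrayList<>();     // empty for leaf pages
        List<String> routingKeys = new ArrayList<>();
    }

    // Collect only the internal nodes of the index: enough for the destination to
    // locate any key, without shipping the (much larger) leaf pages.
    public static List<Node> extractWireframe(Node root) {
        List<Node> wireframe = new ArrayList<>();
        collectInternal(root, wireframe);
        return wireframe;
    }

    private static void collectInternal(Node node, List<Node> out) {
        if (node == null || node.leaf) return;       // leaves are not part of the wireframe
        out.add(node);
        for (Node child : node.children) collectInternal(child, out);
    }
}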
80. Dual mode
Figure: dual mode. Old, still-active transactions TSk+1, …, TSl run at the source; new transactions TD1, …, TDm start at the destination. Index wireframes remain frozen.
Sudipto Das {sudipto@cs.ucsb.edu} 80
81. Dual mode
Figure: dual mode, where transaction TDi at the destination accesses page P3, which is still owned by the source. Index wireframes remain frozen.
Sudipto Das {sudipto@cs.ucsb.edu} 81
82. Dual mode
Requests for un-owned pages can block
Figure: as before, transaction TDi at the destination accesses the un-owned page P3 and blocks.
Sudipto Das {sudipto@cs.ucsb.edu} 82
83. Dual mode
Requests for un-owned pages can block
Figure: page P3 is pulled from the source on demand, and its ownership moves to the destination, allowing TDi to proceed.
Sudipto Das {sudipto@cs.ucsb.edu} 83
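The dual-mode access rule sketched below assumes each page has exactly one owner at any time and that ownership only ever moves from source to destination: an access to an un-owned page at the destination blocks while the page is pulled, and, as a later slide notes, a transaction at the source that touches an already-migrated page must abort. Class and method names are hypothetical, not Zephyr's.

import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Page access during dual mode: each page has exactly one owner, and ownership
// only ever moves from source to destination.
public class DualModePages {

    interface PageTransfer { byte[] pullFromSource(int pageId); }   // blocks until the page arrives

    private final Map<Integer, byte[]> ownedPages = new ConcurrentHashMap<>();  // pages this node owns
    private final Set<Integer> migrated = ConcurrentHashMap.newKeySet();        // source side: pages given away
    private final PageTransfer transfer;

    public DualModePages(PageTransfer transfer) { this.transfer = transfer; }

    // Destination side: an access to an un-owned page blocks while it is pulled from the source.
    public byte[] readAtDestination(int pageId) {
        return ownedPages.computeIfAbsent(pageId, transfer::pullFromSource);
    }

    // Source side: hand a page over exactly once; it is never pulled back.
    public byte[] handOver(int pageId) {
        byte[] page = ownedPages.remove(pageId);
        migrated.add(pageId);
        return page;
    }

    // Source side: a transaction touching a page that has already migrated must abort.
    public byte[] readAtSource(int pageId) {
        if (migrated.contains(pageId)) {
            throw new IllegalStateException("page " + pageId + " already migrated; abort transaction");
        }
        return ownedPages.get(pageId);
    }
}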
84. Finish mode
Figure: finish mode. The remaining pages P1, P2, … are pushed from the source to the destination; transactions at the source have completed, and transactions TDm+1, …, TDn execute at the destination.
Sudipto Das {sudipto@cs.ucsb.edu} 84
85. Finish mode
Pages can be pulled by the destination, if needed
Figure: as in finish mode, the remaining pages are pushed from the source while transactions TDm+1, …, TDn run at the destination.
Sudipto Das {sudipto@cs.ucsb.edu} 85
86. Normal operation
Index wireframe un-frozen
Figure: normal operation. The destination owns all pages P1 … Pn and executes transactions TDn+1, …, TDp; migration is complete.
Sudipto Das {sudipto@cs.ucsb.edu} 86
87. Artifacts of this design
Once migrated, pages are never pulled back
by source
◦ Abort transactions at source accessing the
migrated pages
No structural changes to indices during
migration
◦ Abort transactions (at both nodes) that make
structural changes to indices
Destination “pulls” pages on-demand
◦ Transactions at the destination experience higher
latency compared to normal operation
Sudipto Das {sudipto@cs.ucsb.edu} 87
88. Implementation
Prototyped using an open source OLTP database, H2
◦ Supports standard SQL/JDBC API
◦ Serializable isolation level
◦ Tree indices
◦ Relational data model
Modified the database engine
◦ Added support for freezing indices
◦ Page migration status maintained using index
◦ ~6000 LOC
Tungsten SQL Router migrates JDBC connections during migration
Sudipto Das {sudipto@cs.ucsb.edu} 88
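Two of the engine changes listed above, freezing indices and tracking per-page migration status alongside the index, might look roughly like the following; this is a hedged sketch with hypothetical names, not the actual H2 modifications.

import java.util.Arrays;

public class MigratableIndex {

    // Migration status kept per page, alongside the index metadata.
    enum PageStatus { OWNED, MIGRATED }

    private volatile boolean frozen = false;
    private final PageStatus[] status;

    public MigratableIndex(int pageCount) {
        status = new PageStatus[pageCount];
        Arrays.fill(status, PageStatus.OWNED);
    }

    // Init mode freezes the index: key inserts and deletes may proceed only if they
    // do not split or merge pages; structural changes abort the transaction.
    public void freeze() { frozen = true; }
    public void unfreeze() { frozen = false; }

    public void beforeStructuralChange() {
        if (frozen) {
            throw new IllegalStateException("index frozen during migration; abort transaction");
        }
    }

    public void markMigrated(int pageId) { status[pageId] = PageStatus.MIGRATED; }

    public boolean isOwned(int pageId) { return status[pageId] == PageStatus.OWNED; }
}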
89. Results Overview
Downtime (partition unavailability)
◦ Stop & Copy (S&C): 3 – 8 seconds (time needed to migrate, during which the partition is unavailable for updates)
◦ Zephyr: no downtime; either the source or the destination is available
Service interruption (failed operations)
◦ S&C: ~100s – 1,000s of operations; all transactions with updates are aborted
◦ Zephyr: ~10s – 100s of operations; an order of magnitude less interruption
Minimal operational and data transfer overhead
Sudipto Das {sudipto@cs.ucsb.edu} 89
94. Future Directions
Self-managing controller for large
multitenant database infrastructures
Convergence of transactional and analytics
systems for real-time intelligence
Putting human-in-the-loop: Leveraging
crowd-sourcing
Sudipto Das {sudipto@cs.ucsb.edu} 94
95. Acknowledgements
My advisors and my committee members
Computer Science Dept. at UCSB
Funding sources: NSF, NEC Labs America,
and AWS in Education
Colleagues at DSL and at UCSB
My family
November 16, 2011 Sudipto Das {sudipto@cs.ucsb.edu} 95
96. Thank you!
Collaborators
UCSB:
Divy Agrawal, Amr El Abbadi, Ömer Eğecioğlu
Shashank Agarwal, Shyam Antony, Aaron Elmore,
Shoji Nishimura (NEC Japan)
Microsoft Research Redmond:
Phil Bernstein, Colin Reid
IBM Almaden:
Yannis Sismanis, Kevin Beyer, Rainer Gemulla,
Peter Haas, John McPherson
Editor's notes
In the last few years, we have witnessed a trend where web applications have been replacing desktop applications, and large numbers of applications are now accessed via the browser.
This shift from desktop to the web has also resulted in a paradigm shift in the application deployment infrastructure, resulting in a paradigm popularly known as Cloud Computing.
In its simplest form, cloud computing is essentially computing infrastructure and solutions delivered as a service. Analysts predict that this industry will be worth 150 billion dollars by 2014. Even though almost every aspect of computing can be provided as a service, there have been three popular cloud paradigms. Infrastructure as a Service, the lowest level of abstraction, provides raw CPU, storage, and network as a service; popular examples include Amazon Web Services, Rackspace, etc. The next higher level of abstraction is Platform as a Service, which provides a platform or containers to deploy applications, where the platform provider abstracts data management, fault-tolerance, elastic scaling, etc., thus simplifying application deployment; popular examples include Google AppEngine and Windows Azure. The highest level of abstraction is Software as a Service, which exposes a simple interface to customize pre-designed application logic; a popular example is Salesforce.com. Major factors that have contributed to the success of cloud platforms are advances on the technology front, such as virtualization and pervasive broadband internet connectivity, as well as business and economic factors, such as economies of scale and transfer of risks. In this talk, we focus on cloud application platforms, in particular the database systems that serve them.
Data is central to all modern applications, and most modern enterprises manage petabytes of data. Hence DBMSs form a mission-critical component in the cloud software stack and are key to success as well as to generating revenue. Considering the data needs of web applications, there are two broad categories of systems: on one hand are OLTP systems that store and serve data; on the other hand are OLAP systems that provide intelligence and decision support. In this talk, we will focus on OLTP systems. Bring in the concept of the service provider and the service user, and whose problem we are solving (NEC discussion).
Therefore, in summary, the major challenges for an OLTP database in the cloud are: supporting transactions and scale-out while minimizing the number of distributed transactions, supporting lightweight elastic scaling in a live system, and providing autonomic control with intelligence similar to a human controller.
Stress the ACID properties of transactions and how applications benefit from them by simplifying their design.
Therefore, if we consider scale-out as the vertical axis and functionality (or support for transactions) as the horizontal axis, at one extreme are the RDBMSs that support rich functionality but are hard to scale out, and at the other extreme are key-value stores that allow scaling out to thousands of servers but support limited functionality. There exists a big chasm between the two types of systems, and the challenge is to bridge this divide by efficiently supporting transactions while scaling out. Cloud platforms are multitenant and must support a variety of applications with varying needs; therefore, bridging this chasm is important to support a variety of applications. Functionality: whether transactions are a subset.
In addition, when such a database is deployed on an elastic pay-per-use cloud infrastructure that allows for on-demand provisioning, compared to static provisioning for the peak load, the challenge is to make the database layer as elastic as the underlying cloud infrastructure without introducing a lot of overhead. Scale vs. elasticity.
To this end, my dissertation makes the following contributions to address these challenges. We propose two different solutions to support transactions at scale for two different application scenarios: ElasTraS allows for elastically scalable transaction execution in databases where partitions are statically defined, while G-Store allows efficient transaction processing where database partitions are dynamically defined. Supporting lightweight elasticity essentially boils down to lightweight migration of database partitions in a live system; to this end, we propose two different techniques for two common database architectures: Albatross is a technique for live migration in databases that use the storage abstraction, while Zephyr is a technique for live migration in shared nothing database architectures. Finally, we are currently working on the design of Pythia, an autonomic controller. In the interest of time, in this talk I will only get into the details of G-Store and Zephyr while providing a very high level overview of ElasTraS.
But before we delve into the details, I would like to spend a couple of minutes giving an overview of my research in the broader area of data management. The current talk, and my thesis, focuses on the OLTP aspect. On the data analysis front, I have worked on multiple projects. As an intern at IBM Almaden, I worked on a project called Ricardo that provides the ability for deep statistical analysis and modeling over large amounts of data; this paper was published in SIGMOD 2010, and parts of the framework ship in IBM InfoSphere BigInsights Enterprise Edition. Recently, I worked on a project called MD-HBase that presents the design and implementation of a scalable multi-dimensional indexing mechanism to support efficient high-throughput location updates and multi-dimensional analysis queries on top of a key-value store. Earlier, I have also worked on data stream processing systems, providing intra-operator parallelism in common data stream operators, such as frequent elements or top-k elements, to efficiently exploit multicore processors. I have also worked on designing systems to exploit novel hardware architectures.
The goal of partitioning the schema is to leverage the application semantics and access patterns to minimize the number of distributed transactions.
Now we know how to scale out when the partitions are statically defined. So let's make it a bit more interesting: how do we scale out with transactions on dynamically formed partitions? Recall that our concept of a partition is the set of data items that are frequently accessed within the same transaction. For certain applications, that set might change with time. For instance, in online multi-player games, the application needs transactional access to the player profiles that are part of the same game instance, and this set changes with time. Similar behavior is observed in a number of collaboration-based applications (examples?).
If the player profiles are part of the same database partition, then transactions on this group of players can be executed efficiently.
However, this group of players change with time, thus resulting in the concept of dynamically defined database partitions.
Scale.
Paper has more detailed evaluation
So what does elasticity in the database tier mean? Mention the cost-performance trade-off, and repeat the fact that it is the cloud infrastructure that allows us to optimize operating cost, something that was not thought important in classical infrastructures.
Define wireframe in this slide. Defer index wireframe definition to the later slide.
Freeze: no structural modifications to the indices. Wireframe: the minimal information needed to start executing transactions at the destination, including schema information, user authentication, the index wireframes, etc.
Just to give a concrete example of a wireframe, if we consider a B+ tree index, then only the internal nodes of the indices are migrated as part of the wireframe.
Once the destination is initialized with the minimal information, it can start executing transactions. At this point, migration enters the Dual mode where both the source and destination are executing transactions, new transactions arrive at the destination while the source continues execution of transactions that were active at the start of migration.