2. Questions We Will Answer
• What is an in-memory database?
• Why do they matter?
• How do you build one?
• How do people use MemSQL?
(c) Ankur Goyal
13. In-Memory Databases...
• Use memory instead of disk
• Do not (need to) save data on disk
• Put the whole dataset in memory
Well, sometimes...
19. In-Memory Databases
• Are durable to disk (and respect ACID)
• Can spill to disk or pin data in memory (and take advantage of it)
• Tradeoffs are suited to systems with lots of memory
• Tend to be distributed systems
• Have a different set of bottlenecks
26. Why?
• Memory is getting cheaper (about 40% every year)
• Cache is the new RAM (RAM is the new disk, disk is the new tape, etc.)
• In-memory databases leverage SSDs (no random writes)
• NVRAM is coming (and could be cheaper than SSD)
In-memory databases are tuned to modern hardware and modern workloads
32. In-Memory Storage Motivation
• Insanely fast random reads & writes
• Atomic writes as granular as a byte
• Working space (RAM) is precious
• Very different for rowstores and columnstores
36. In-Memory Rowstore
• Rowstores have lots of random reads/writes
• Datasets are usually small (< 10 TB)
Solution: keep the whole dataset in memory
• Use memory-optimized data structures (skip list)
39. What is a Skip List?
• Invented in 1989 by William Pugh
• Expected O(log n) lookup, insert, delete
• No pages
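To make the structure concrete, here is a minimal single-threaded skip list sketch in Python. It is purely illustrative (MemSQL's skip list is lock-free and written in C++); the class and constant names are invented for this example.

```python
import random

class Node:
    __slots__ = ("key", "forward")
    def __init__(self, key, height):
        self.key = key
        self.forward = [None] * height  # one "next" pointer per level

class SkipList:
    MAX_HEIGHT = 16  # enough levels for millions of keys

    def __init__(self):
        self.head = Node(None, self.MAX_HEIGHT)  # sentinel head node

    def _random_height(self):
        # Promote to each higher level with probability 1/2,
        # which yields expected O(log n) search paths.
        h = 1
        while h < self.MAX_HEIGHT and random.random() < 0.5:
            h += 1
        return h

    def insert(self, key):
        # Record, at every level, the last node whose key is < key.
        update = [self.head] * self.MAX_HEIGHT
        node = self.head
        for level in range(self.MAX_HEIGHT - 1, -1, -1):
            while node.forward[level] and node.forward[level].key < key:
                node = node.forward[level]
            update[level] = node
        new = Node(key, self._random_height())
        for level in range(len(new.forward)):
            new.forward[level] = update[level].forward[level]
            update[level].forward[level] = new

    def contains(self, key):
        node = self.head
        for level in range(self.MAX_HEIGHT - 1, -1, -1):
            while node.forward[level] and node.forward[level].key < key:
                node = node.forward[level]
        node = node.forward[0]
        return node is not None and node.key == key
```

Note that there are no pages anywhere in this structure: nodes are individually allocated, which is exactly what makes the latch-free designs discussed later possible.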
55. Concurrency Control
• No pages => no latches
• The skip list in MemSQL is lock-free
• Every node is a lock-free linked list
• Row locks are implemented with futexes (4 bytes)
• Read-committed and snapshot isolation
69. Columnstore LSM
• Log-structured merge of sorted runs
• Tunable tradeoffs for read/write amplification
• Enables fast writes to a sorted columnstore
• Smallest sorted run is a skip list
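The core merge step can be sketched in a few lines: each sorted run is read sequentially, and the output is written sequentially, so no random I/O is required. The runs below are invented sample data.

```python
import heapq

# Sorted runs, as an LSM columnstore might hold them;
# the smallest, newest run plays the role of the in-memory skip list.
runs = [
    [1, 4, 9, 12],   # large, old run
    [2, 3, 10],      # medium run
    [5, 7],          # smallest run (in-memory)
]

# heapq.merge consumes each run in order and yields one sorted stream:
# a sequential read of every input, a sequential write of the output.
merged = list(heapq.merge(*runs))
print(merged)  # [1, 2, 3, 4, 5, 7, 9, 10, 12]
```

Merging fewer, larger runs lowers read amplification at the cost of more write amplification; that is the tunable tradeoff the slide refers to.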
75. Durability in an In-Memory System?
• Memory is not a reliable medium (yet)
• There is always a hierarchy
• E.g. EBS -> S3 -> Glacier
• To operate at in-memory speed, all disk I/O must be sequential
80. Durability in the Rowstore
• Indexes are not materialized on disk
• Reconstruct indexes on the fly during recovery
• Only need to log PK data
• Take full database snapshots periodically
• Tunable to be sync/async
86. Durability in the Columnstore
• Metadata uses the ordinary rowstore mechanism
• Segments are huge (several KB or even MB)
• Read/written sequentially
• Columnstore segments are synchronously written to disk
• Memory-speed writes go to a sidecar rowstore
90. Crash Recovery
• Replay the latest snapshot, then every log file since
• No partially written state on disk, so no undos
• Columnstore just replays metadata
• Replication == continuous replay over the network
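The snapshot-plus-log scheme from the last few slides can be sketched as a toy in-memory table. This is not MemSQL's format; every name here is illustrative, and a real system would write the log and snapshot to disk sequentially.

```python
import json

class Database:
    def __init__(self):
        self.rows = {}   # in-memory state (indexes are rebuilt, never logged)
        self.log = []    # stand-in for an append-only log file

    def put(self, pk, value):
        # Log only the primary-key data; commit order makes replay deterministic.
        self.log.append(json.dumps({"op": "put", "pk": pk, "value": value}))
        self.rows[pk] = value

    def snapshot(self):
        # A full snapshot bounds how much log recovery must replay.
        snap = dict(self.rows)
        self.log = []
        return snap

def recover(snapshot, log):
    # Crash recovery: load the latest snapshot, then replay every log
    # record written since. No undo pass is needed because no partially
    # written state ever reaches disk.
    rows = dict(snapshot)
    for record in log:
        entry = json.loads(record)
        if entry["op"] == "put":
            rows[entry["pk"]] = entry["value"]
    return rows
```

Replication falls out of the same machinery: a replica is simply a peer that replays the snapshot and then the log continuously over the network.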
92. import time

class Row(object):
    def __init__(self, a):
        self.a = a

t = [Row(x) for x in range(1000000)]

class State(object):
    def __init__(self):
        self.agg_sum = 0

def loop(state, row):
    state.agg_sum += row.a + 1

def query():
    state = State()
    for r in t:
        loop(state, r)
    return state

if __name__ == '__main__':
    start = time.time()
    state = query()
    end = time.time()
    print("Answer: %d, Time (s): %g" % (state.agg_sum, end - start))
93. #include <cstdio>
#include <ctime>
#include <vector>

struct Row
{
    Row(int a_arg) : a(a_arg) { }
    int a;
};

struct State
{
    State() : agg_sum(0) { }
    int64_t agg_sum;
};

inline void loop(State& state, const Row& row)
{
    state.agg_sum += row.a + 1;
}

inline State query(std::vector<Row>& rows)
{
    State s;
    for (Row& r : rows)
    {
        loop(s, r);
    }
    return s;
}

int main(void)
{
    std::vector<Row> rows;
    for (int i = 0; i < 1000000; i++)
    {
        rows.emplace_back(i);
    }

    clock_t start = clock();
    State state = query(rows);
    clock_t end = clock();
    printf("Answer: %lld, Time (s): %g\n",
           (long long)state.agg_sum, (end - start) * 1.0 / CLOCKS_PER_SEC);
}
96. Comparison
$ python test.py
Answer: 500000500000, Time (s): 0.251049
$ time g++ test.cpp -o test-cpp -std=c++0x
real 0m0.176s
user 0m0.150s
sys 0m0.023s
$ ./test-cpp
Answer: 500000500000, Time (s): 0.006745
37x difference in execution
1.37x even with compilation time
100. Code Generation
• Expression execution
• Inline scans
• Need a powerful plan cache
• OLTP vs. data exploration
101. Plancache Example (1)
SELECT * FROM users WHERE id = 5
SELECT * FROM users WHERE id = 8
=>
SELECT * FROM users WHERE id = @
102. Plancache Example (2)
SELECT * FROM users WHERE id IN (1,2,3,4,5) OR a IN (3,5,7)
SELECT * FROM users WHERE id IN (20) OR a IN (1,2,3,4)
=>
SELECT * FROM users WHERE id IN (@) OR a IN (@)
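The normalization in both examples can be sketched with a tiny function. A real plan cache parameterizes on the parsed query tree, not the text; the regexes below are a deliberately crude stand-in, and `normalize` is an invented name.

```python
import re

def normalize(sql):
    # Collapse IN-lists, then replace remaining numeric literals with '@',
    # so queries differing only in constants map to one compiled plan.
    sql = re.sub(r"IN\s*\([^)]*\)", "IN (@)", sql, flags=re.IGNORECASE)
    sql = re.sub(r"\b\d+\b", "@", sql)
    return sql

q1 = "SELECT * FROM users WHERE id IN (1,2,3,4,5) OR a IN (3,5,7)"
q2 = "SELECT * FROM users WHERE id IN (20) OR a IN (1,2,3,4)"
print(normalize(q1))  # SELECT * FROM users WHERE id IN (@) OR a IN (@)
```

Both queries normalize to the same string, so the second one reuses the plan compiled for the first.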
104. Drill Down Example
SELECT region, SUM(price) FROM sales GROUP BY region;

=> SELECT rep, SUM(price) FROM sales
   WHERE region = "northeast" GROUP BY rep;
=> SELECT rep, SUM(price) FROM sales
   WHERE region = ^ GROUP BY rep;

=> SELECT product, SUM(price) FROM sales
   WHERE region = "northwest" GROUP BY product;
=> SELECT product, SUM(price) FROM sales
   WHERE region = ^ GROUP BY product;

No plancache match!
111. Code Generation is Hard
• Old compilers adage: pick 2 of 3
• Fast execution time
• Fast compile time
• Fast development time
• E.g. Assembly, C++, Python
• JIT compilers turned this on its head
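A miniature version of the idea: generate source code specialized to one query shape, compile it once, and reuse the compiled function for every matching query. This is only a sketch of the technique (MemSQL generates native code, not Python), and `codegen`/`TEMPLATE` are invented names.

```python
# Source template for the aggregation from the earlier benchmark slides,
# with the literal baked in at generation time.
TEMPLATE = """
def compiled_query(rows):
    agg_sum = 0
    for row in rows:
        agg_sum += row + {constant}
    return agg_sum
"""

def codegen(constant):
    # Pay the compile cost once per query shape...
    namespace = {}
    exec(compile(TEMPLATE.format(constant=constant), "<plan>", "exec"), namespace)
    return namespace["compiled_query"]

plan = codegen(1)            # compile once (goes in the plan cache)
print(plan(range(1000000)))  # ...then execute many times: 500000500000
```

The plan cache amortizes the compile time across executions, which is why the 1.37x "even with compilation time" number above understates the steady-state win.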
128. Abstractions
• Distributed query plan created on the aggregator
• Layers of primitive operations glued together
• Full SQL on leaves
• REMOTE tables
• RESULT tables
132. Primitives (SQL)
• Queries over physical indexes
• Hook into global transactional state
• Full SQL on a single partition
• Access to rowstores and columnstores
133. Primitives (SQL)
Example query the aggregator can send to a leaf:
SELECT
t.a, t.b, SUM(t.price)
FROM
t -- This will scan a physical table on the leaf
WHERE
t.c = 1000 -- This will use a local index
GROUP BY
t.a, t.b -- This will produce 1 row per group
136. Primitives (Remote Tables)
• Address data across leaves
• SQL interface + custom shard key
• Parallel execution primitives
• Reshuffling
• Merging on group keys
• Merging data from joins (e.g. left joins)
137. Primitives (Remote Tables)
SELECT
t.a, SUM(s_net.c)
FROM
-- The row in s where s_net.b = t.a may not
-- be on the same node as the local t. REMOTE(s)
-- addresses the table across the cluster.
t, REMOTE(s) AS s_net
WHERE
t.a = s_net.b
GROUP BY
t.a
138. Primitives (Remote Tables)
SELECT
t.a, SUM(s_net.c)
FROM
-- This is a reshuffle operation. It relies on t
-- being sharded on (t.a) and type(t.a) == type(s.b).
-- It will only pull rows in s.b that match the
-- shard key's local values of (t.a).
t, REMOTE(s) WITH (shard_key=(s.b)) AS s_net
WHERE
t.a = s_net.b
GROUP BY
t.a
143. Primitives (Result Tables)
• Shared, cached results of SQL queries
• Shares scans/computations across readers
• Supports streaming semantics
• Technically an optimization
• Similar to an RDD in Spark
144. Primitives (Result Tables)
CREATE RESULT TABLE
t_reshuffled AS
SELECT
t.a, t.b, SUM(t.price)
FROM
t
GROUP BY
t.a, t.b
SHARD BY
t.a, t.b
154. Horizontals and Verticals
• Real-time data processing is everywhere
• Top use cases: Real-Time Analytics and Large-Scale Applications
• Top verticals: Financial Services, Webscale, Telco, Federal, Media
157. Real-time Analytics
• High volumes of data, processed in real time
• Fast updates in the rowstore
• INSERT ... ON DUPLICATE KEY UPDATE
• E.g. 2M update transactions/sec on 10 nodes
• Fast appends, even one row at a time, in the columnstore
• E.g. 1 GB/s on 16 EC2 nodes
162. Real-time Analytics
• Converging with mainline analytics
• No compromises, e.g. limited SQL, limited windows
• Real-time means fast reads as well
• Subsecond queries for dashboards
• Millisecond queries for applications
167. Large-Scale Applications
• Large-scale operational analytics and applications
• Hundreds of nodes for perf and HA
• True "production" workloads
• Existing OLTP databases lack scalability and SQL perf
• Existing OLAP databases lack operational features
173. Take-Aways
• In-memory Database != All-memory Database
• In-memory databases are databases built to modern tradeoffs
• Old problems with new solutions
• Real-time analytics and large-scale applications == new projects
• We are hiring and ❤ Waterloo.
• Come visit us in SF: email ankur@memsql.com