Experiences building a distributed shared log on RADOS - Noah Watkins
1. Experiences building a distributed
shared-log on RADOS
Noah Watkins
UC Santa Cruz
@noahdesu
4. About me
● Graduate student
● UC Santa Cruz
● Data management
● Storage systems
● High-performance computing
● Quality of service
● Ceph shop for prototyping
8. The agenda
● Why should I care about logs?
● How to build a really, really fast log
● Mapping techniques for fast logs onto Ceph
● A few usage examples
12. A log is an ordered list of data elements
● General abstract form of a log
● Ordered list of elements
● New entries are appended to the log
● Log entries can be read directly
[Figure: log positions 0 through 12; clients append new entries at the tail]
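The abstraction on this slide can be sketched as a tiny single-process class. This is only an illustration of the append/read contract; the names `InMemoryLog`, `append`, `read`, and `tail` are hypothetical and are not part of ZLog or CORFU.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Toy single-process log: an ordered list of entries addressed by position.
class InMemoryLog {
 public:
  // Append an entry at the tail and return the position it was assigned.
  uint64_t append(std::string data) {
    entries_.push_back(std::move(data));
    return entries_.size() - 1;
  }

  // Read the entry stored at pos directly. Returns false if nothing is there.
  bool read(uint64_t pos, std::string* out) const {
    if (pos >= entries_.size()) return false;
    *out = entries_[pos];
    return true;
  }

  // Position that the next append will receive.
  uint64_t tail() const { return entries_.size(); }

 private:
  std::vector<std::string> entries_;
};
```

Positions are assigned in append order and never change, which is what lets readers address entries directly.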
13. Everyone is going bananas for logs
● Very high throughput logs
● No global ordering
● Strongly consistent, global ordering
● Write batching
● Single writer
Apache BookKeeper
20. Shared-logs are challenging to scale
● Writes are funneled through a total ordering engine (Paxos)
● Parallel reads supported when log is distributed
● Random read workloads perform poorly on HDDs
[Figure: clients append through a Paxos ordering engine to log positions 0 through 12]
21. CORFU: A Shared Log Design for Flash Clusters
Balakrishnan et al., NSDI 2012
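25. CORFU stripes log across a cluster of flash devices
● Sequencer RPC: Tail, Next
  std::atomic<u64> seq;
  tail = seq;
  next = seq++;
● Sequencer rate: 100K-1M pos/sec
● Parallel appends and reads
[Figure: clients obtain positions 1, 2, 3, 4... from the sequencer, then append to and read from log positions 0 through 12 in parallel]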
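The sequencer pseudocode on this slide can be written as compilable C++ roughly as follows. This is a minimal sketch: the class name and method names are assumptions, the RPC layer is left out, and `u64` is spelled `uint64_t`.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Counter service that hands out log positions. It never touches entry data,
// so appends and reads can go to the storage devices in parallel.
class Sequencer {
 public:
  // Tail: report the next unassigned position without reserving it.
  uint64_t tail() const { return seq_.load(); }

  // Next: atomically reserve one position and advance the counter.
  uint64_t next() { return seq_.fetch_add(1); }

 private:
  std::atomic<uint64_t> seq_{0};
};
```

Because `fetch_add` is a single atomic read-modify-write, concurrent callers of `next()` always receive distinct positions.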
28. You might be thinking...
● It can’t possibly be this simple... you’d be correct
● Isn’t this supposed to be a Ceph talk?
● Decouple append I/O from assignment of ordering
[Figure: clients get positions 1, 2, 3, 4, 5, … from the sequencer (RPC: Tail, Next; 100K-1M pos/sec), then read and append directly against the log]
29. The CORFU I/O path
[Figure: Append() enters through the log interface; the path from there to the cluster of flash devices is marked "??"]
30. CORFU uses a cluster map and striping function
[Figure: log interface (e.g. Append) → striping and addressing → cluster of flash devices]
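To make the striping and addressing layer concrete, here is a minimal round-robin sketch. The `ClusterMap` and `Location` types and the `locate` function are hypothetical; CORFU's actual projection format is not shown in the slides.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical cluster map: the list of devices, in stripe order.
struct ClusterMap {
  std::vector<std::string> devices;
};

// Where a log position lives: which device, and which slot on that device.
struct Location {
  std::string device;
  uint64_t slot;
};

// Round-robin striping: consecutive positions land on consecutive devices.
// The map must contain at least one device.
inline Location locate(const ClusterMap& map, uint64_t pos) {
  assert(!map.devices.empty());
  const uint64_t width = map.devices.size();
  return Location{map.devices[pos % width], pos / width};
}
```

Any client holding the same map computes the same location for a position, so no lookup service sits on the I/O path.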
31. CORFU replicates log entries across the cluster
[Figure: log interface (e.g. Append) → striping and addressing → replication (fault tolerance) → cluster of flash devices]
32. The CORFU protocol relies on custom I/O interface
[Figure: log interface (e.g. Append) → striping and addressing → replication (fault tolerance) → custom hardware interface → cluster of flash devices]
33. A dedicated CORFU cluster is expensive...
● Duplicated services
○ Fault-tolerance
○ Metadata management
● Dedicated / over-provisioned hardware
○ You need a full cluster of flash devices!
● Custom storage interfaces
○ Hardware or software
● Already have a Ceph cluster?
○ Too bad
● Can we use software-defined storage?
34. Mapping the components of CORFU onto Ceph
[Figure: the CORFU stack to be mapped: log interface (e.g. Append), striping and addressing, replication (fault tolerance), custom hardware interface, cluster of flash devices]
35. A cluster of OSDs rather than raw storage devices
[Figure: the same stack, with the cluster of flash devices replaced by a cluster of OSDs]
38. Storage interface built using RADOS object classes
[Figure: log interface (e.g. Append) → striping and addressing (object naming) → RADOS object class → replication → cluster of OSDs]
41. ZLog is an implementation of CORFU on Ceph
● Sequencer RPC: Tail, Next
  std::atomic<u64> seq;
  tail = seq;
  next = seq++;
● Sequencer rate: 100K-1M pos/sec
● Parallel appends and reads
● Transparent changes to hardware, software, tuning
● Integration with features like tiering, erasure coding, and I/O hinting
[Figure: clients obtain positions 1, 2, 3, 4... from the sequencer, then append to and read from log positions 0 through 12 stored on OSDs; backends shown: BlueStore, RocksDB, LMDB]
42. The ZLog project is open-source on GitHub
● https://github.com/cruzdb/zlog
● Development backend (LMDB)
● Tools to build Ceph plugin
● High and low-level APIs
// build a backend instance
auto backend = std::unique_ptr<zlog::Backend>(
    new zlog::storage::ceph::CephBackend(ioctx));
// open the log
zlog::Log log;
int ret = zlog::Log::CreateWithBackend(std::move(backend),
    "mylog", &log);
// append to the log
uint64_t pos;
log.Append(Slice("foo"), &pos);
54. Design decisions for the ZLog I/O path
● Log entry → Object
● Stripe log across objects
● … but not too many objects
● Changes to striping policy
● Coordinated metadata update
● Treat an object like a consensus API
[Figure: I/O stack: ZLog client library (striping policy) → librados → OSD → RADOS object classes (CORFU storage interface)]
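One way to picture a striping policy that can change over time is a list of views, each valid from some log position onward. The sketch below is an assumption-heavy illustration: the `StripeView` layout, the `change_width` call, and the `<log>.<epoch>.<index>` object naming are all hypothetical and are not taken from ZLog.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// One striping configuration, valid from min_pos until the next view starts.
struct StripeView {
  uint64_t epoch;    // bumped on every policy change
  uint64_t min_pos;  // first log position covered by this view
  uint32_t width;    // number of objects the log is striped across
};

class StripingPolicy {
 public:
  StripingPolicy(std::string log_name, uint32_t width)
      : name_(std::move(log_name)) {
    assert(width > 0);
    views_.push_back(StripeView{0, 0, width});
  }

  // Install a new width for positions >= min_pos. In a real system this
  // would be a coordinated metadata update; here it is a local append.
  void change_width(uint64_t min_pos, uint32_t width) {
    assert(width > 0);
    assert(min_pos > views_.back().min_pos);
    views_.push_back(StripeView{views_.back().epoch + 1, min_pos, width});
  }

  // Map a log position to the name of the object that stores it.
  std::string object_for(uint64_t pos) const {
    // Walk back to the newest view that starts at or before pos.
    // The first view starts at position 0, so the loop always terminates.
    std::size_t i = views_.size() - 1;
    while (views_[i].min_pos > pos) --i;
    const StripeView& v = views_[i];
    const uint64_t index = (pos - v.min_pos) % v.width;
    return name_ + "." + std::to_string(v.epoch) + "." + std::to_string(index);
  }

 private:
  std::string name_;
  std::vector<StripeView> views_;
};
```

Keeping old views around means positions written under an earlier policy still resolve to the objects that hold them.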
61. The CORFU storage interface
● write(obj, position, data)
○ Write-once: position must be open
○ Request must be tagged with up-to-date epoch
○ Position must not fall within a GC’d range
○ Optionally update maximum position observed
○ Atomic update
● read(obj, position)
● invalidate(obj, position)
● trim(obj, position)
○ Mark for GC
● seal(obj, epoch)
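The rules listed on this slide can be modeled in memory to see how they interact. This is a sketch under stated assumptions: the `LogObject` class and the `Status` codes are hypothetical, a single-threaded call stands in for the atomic update, and trimming is tracked per position rather than as a range.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>

// Illustrative status codes (assumed names, not ZLog's return values).
enum Status { kOk, kStaleEpoch, kNotWritten, kReadOnly, kUnavailable };

// In-memory stand-in for a single log object.
class LogObject {
 public:
  // Write-once: fails if the epoch is stale or the position is already
  // written, invalidated, or trimmed. Tracks the maximum position seen.
  Status write(uint64_t epoch, uint64_t pos, const std::string& data) {
    if (epoch < epoch_) return kStaleEpoch;
    if (slots_.count(pos)) return kReadOnly;
    slots_[pos] = Slot{true, data};
    if (!has_max_ || pos > max_pos_) { has_max_ = true; max_pos_ = pos; }
    return kOk;
  }

  Status read(uint64_t epoch, uint64_t pos, std::string* out) const {
    if (epoch < epoch_) return kStaleEpoch;
    auto it = slots_.find(pos);
    if (it == slots_.end()) return kNotWritten;
    if (!it->second.has_data) return kUnavailable;  // invalidated or trimmed
    *out = it->second.data;
    return kOk;
  }

  // Mark an unwritten position as unusable so no later write can claim it.
  Status invalidate(uint64_t epoch, uint64_t pos) {
    if (epoch < epoch_) return kStaleEpoch;
    auto it = slots_.find(pos);
    if (it != slots_.end() && it->second.has_data) return kReadOnly;
    slots_[pos] = Slot{false, ""};
    return kOk;
  }

  // Mark a position for garbage collection; its data is dropped.
  Status trim(uint64_t epoch, uint64_t pos) {
    if (epoch < epoch_) return kStaleEpoch;
    slots_[pos] = Slot{false, ""};
    return kOk;
  }

  // Raise the stored epoch so requests tagged with older epochs are rejected.
  Status seal(uint64_t epoch) {
    if (epoch <= epoch_) return kStaleEpoch;
    epoch_ = epoch;
    return kOk;
  }

 private:
  struct Slot { bool has_data; std::string data; };
  std::map<uint64_t, Slot> slots_;
  uint64_t epoch_ = 0;
  bool has_max_ = false;
  uint64_t max_pos_ = 0;
};
```

In the model every check and the state change happen inside one method call, which mirrors the "atomic update" requirement on the slide.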
62. librados won’t [currently] cut it for this job
● The write(obj, position, data) conditions must all be checked and applied as one atomic update
63. CORFU write interface as an object class
[Figure: I/O stack: ZLog client library (striping policy) → librados → OSD → RADOS object classes (CORFU storage interface)]
int write(cls_method_context_t hctx, bufferlist *in, bufferlist *out) {
  // (decoding of op, header, key, and entry from *in is omitted)
  // ensure up-to-date client
  if (epoch_guard(header, op.epoch(), false)) {
    return -ESPIPE;
  }
  // check whether the requested position already holds an entry
  ceph::bufferlist bl;
  int ret = cls_cxx_map_get_val(hctx, key, &bl);
  if (ret < 0 && ret != -ENOENT)
    return ret;
  // write entry if position is not taken
  if (ret == -ENOENT) {
    ret = entry_write_entry(hctx, key, entry);
    if (ret < 0)
      return ret;
    if (!header.has_max_pos() || (op.pos() > header.max_pos()))
      header.set_max_pos(op.pos());
    ret = entry_write_header(hctx, header);
  } else {
    // handle errors associated with write-once semantics
    // (the error code below is an assumption, not taken from the slide)
    ret = -EROFS;
  }
  return ret;
}
71. Getting started with object classes
1. The hello world object class (in Ceph: src/cls/hello/cls_hello.cc)
a. Super well documented, easy examples
2. Build and load the plugin into Ceph
3. Old way to manage plugins
a. Manage your own Ceph fork
4. New way to manage plugins with SDK (credit: Neha Ojha @ b7215b0)
a. dnf install rados-objclass-devel
b. #include <rados/objclass.h>
c. compile cls_hello.cc as object library
5. Distribute your plugin to OSDs at <libdir>/rados-classes/cls_hello.so
6. Start making requests: ioctx::exec("hello", …)
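The register-then-exec pattern behind these steps can be modeled in-process without a Ceph cluster. The sketch below is a toy, not the Ceph API: `ClassRegistry`, `register_method`, and this `exec` signature are hypothetical stand-ins for what the OSD and librados do, and the `say_hello` handler here is written only for illustration.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <string>
#include <utility>

// A method takes an input buffer and fills an output buffer.
using Handler = std::function<int(const std::string& in, std::string* out)>;

// Toy stand-in for an OSD's table of loaded object classes.
class ClassRegistry {
 public:
  // Analogous to a plugin registering a method under a class name at load.
  void register_method(const std::string& cls, const std::string& method,
                       Handler h) {
    methods_[std::make_pair(cls, method)] = std::move(h);
  }

  // Analogous to a client exec call: look up (class, method) and run it.
  int exec(const std::string& cls, const std::string& method,
           const std::string& in, std::string* out) const {
    auto it = methods_.find(std::make_pair(cls, method));
    if (it == methods_.end()) return -1;  // unknown class or method
    return it->second(in, out);
  }

 private:
  std::map<std::pair<std::string, std::string>, Handler> methods_;
};

// Illustrative handler: greets the name passed as input.
inline int say_hello(const std::string& in, std::string* out) {
  *out = "Hello, " + (in.empty() ? std::string("world") : in) + "!";
  return 0;
}
```

In the real system the handler runs on the OSD next to the object's data, which is what lets it check and update state in one step.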
72. The CORFU seal(obj, epoch) interface is tougher
● Semantics: do the following atomically
○ Verify and update the stored epoch
○ Return the largest position written
● RADOS object classes don’t support this mix of ops
○ Currently
● Solution
○ Split the operation
○ Verify
● Luckily this isn’t a fast path
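One plausible reading of "split the operation, then verify" is a two-step protocol: first update the epoch, then read the maximum position while checking that the epoch has not moved in between. The sketch below models that reading only; the `SealedObject` type and the function names are hypothetical and do not come from ZLog.

```cpp
#include <cassert>
#include <cstdint>

// In-memory stand-in for the per-object state that seal touches.
struct SealedObject {
  uint64_t epoch = 0;
  bool has_max = false;
  uint64_t max_pos = 0;

  // Step 1: verify and update the stored epoch.
  bool seal(uint64_t new_epoch) {
    if (new_epoch <= epoch) return false;
    epoch = new_epoch;
    return true;
  }

  // Step 2: report the largest position written, but only if the stored
  // epoch is still the one this caller installed.
  bool max_position(uint64_t expected_epoch, bool* empty,
                    uint64_t* pos) const {
    if (expected_epoch != epoch) return false;
    *empty = !has_max;
    *pos = has_max ? max_pos : 0;
    return true;
  }
};

// Client-side helper: run both steps. A false result means another client
// sealed with a newer epoch and the caller should retry.
inline bool seal_then_query(SealedObject& obj, uint64_t epoch, bool* empty,
                            uint64_t* pos) {
  if (!obj.seal(epoch)) return false;
  return obj.max_position(epoch, empty, pos);
}
```

The extra round trip costs latency, which is acceptable because sealing is not on the fast path.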
74. PostgreSQL logical replication with ZLog
[Figure: INSERT, UPDATE, and DELETE operations are appended to a ZLog log (positions 0 through 12) stored on OSDs; each client (C) replays the log and serves queries]
https://github.com/cruzdb/pg_zlog
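The replay idea in this figure can be sketched as replicas applying the same ordered log to reach the same state. This is a toy model: the `Change` layout and the `Replica` class are hypothetical, a `std::vector` stands in for the shared log, and none of it reflects pg_zlog's actual format.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// One logged change (hypothetical layout).
struct Change {
  enum Kind { Insert, Update, Delete } kind;
  std::string key;
  std::string value;  // unused for Delete
};

// A replica applies log entries in order and remembers where it stopped.
class Replica {
 public:
  // Apply every entry from the last applied position up to the log tail.
  void replay(const std::vector<Change>& log) {
    for (; next_pos_ < log.size(); ++next_pos_) {
      const Change& c = log[next_pos_];
      if (c.kind == Change::Delete) {
        rows_.erase(c.key);
      } else {
        rows_[c.key] = c.value;
      }
    }
  }

  const std::map<std::string, std::string>& rows() const { return rows_; }

 private:
  uint64_t next_pos_ = 0;
  std::map<std::string, std::string> rows_;
};
```

Replicas that replay at different times still converge, because the log fixes a single order for all changes.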
78. CruzDB is an immutable key-value store on ZLog
● Key/Value transactions: GET, PUT, DELETE
● Database is a copy-on-write red-black tree
● Efficient read-only snapshots
● Time travel capabilities
[Figure: GET, PUT, and DELETE operations are appended to a ZLog log (positions 0 through 12) stored on OSDs; each client (C) replays the log]
https://github.com/cruzdb/cruzdb
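Why copy-on-write gives cheap snapshots can be shown with a much simpler structure than a red-black tree. The sketch below swaps in an unbalanced binary search tree with path copying; it is not CruzDB's implementation, and the `Node`, `put`, and `get` names are hypothetical.

```cpp
#include <cassert>
#include <memory>
#include <string>

struct Node;
using NodePtr = std::shared_ptr<const Node>;

// Immutable tree node: once built, a node is never modified.
struct Node {
  std::string key;
  std::string value;
  NodePtr left;
  NodePtr right;
};

inline NodePtr make_node(const std::string& k, const std::string& v,
                         NodePtr l, NodePtr r) {
  return std::make_shared<Node>(Node{k, v, l, r});
}

// Return the root of a new version with key set to value. Only nodes on the
// search path are copied; every other subtree is shared with the old root.
inline NodePtr put(const NodePtr& root, const std::string& key,
                   const std::string& value) {
  if (!root) return make_node(key, value, nullptr, nullptr);
  if (key < root->key)
    return make_node(root->key, root->value, put(root->left, key, value),
                     root->right);
  if (root->key < key)
    return make_node(root->key, root->value, root->left,
                     put(root->right, key, value));
  return make_node(key, value, root->left, root->right);
}

// Look up key in one version of the tree; nullptr if absent.
inline const std::string* get(const NodePtr& root, const std::string& key) {
  const Node* n = root.get();
  while (n) {
    if (key < n->key) n = n->left.get();
    else if (n->key < key) n = n->right.get();
    else return &n->value;
  }
  return nullptr;
}
```

Holding on to an old root is a read-only snapshot, and keeping one root per log position would give the time travel described on the slide.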
79. Wrapping up
● We use Ceph as a prototyping system
○ Application-specific interfaces
● There is broad interest in log-oriented interfaces
● We’ve built a Ceph-based implementation of a high-performance log
○ ZLog @ https://github.com/cruzdb/zlog
● Using it for several real-world use cases
○ PostgreSQL logical replication (https://github.com/cruzdb/pg_zlog)
○ CruzDB immutable database (https://github.com/cruzdb/cruzdb)
80. Thank you and questions
● Noah Watkins (@noahdesu)
● Learning resources
○ Object class SDK documentation
■ http://docs.ceph.com/docs/master/rados/api/objclass-sdk/
○ ZLog is the only user of the SDK I’m aware of
■ https://github.com/cruzdb/zlog/blob/master/src/storage/ceph/cls_zlog.cc
○ The Ceph tree contains a ton of object classes for reference
■ https://github.com/ceph/ceph/tree/master/src/cls
○ Writing object classes using Lua
■ https://ceph.com/geen-categorie/dynamic-object-interfaces-with-lua/
■ https://nwat.xyz/blog/2018/03/13/video-of-my-talk-at-lua-workshop-2017/