2. Storage research at UCSC
● Graduate student
● UC Santa Cruz
● Data management, file systems, HPC, QoS
● Ceph as a prototyping platform
● Data management
● Storage systems
● High-performance computing
● Quality of service
● Real-time systems
4. Storage research at UCSC
● SIRIUS Project (DOE)
● Programmable storage (CROSS, NSF)
● Declarative storage (NSF)
● IRIS-HEP (NSF)
5. SIRIUS: Science-driven Data Management for Multi-tiered Storage (ORNL, Sandia, Brown, Rutgers, UCSC)
● Storage challenges for exascale systems
● (1) Heterogeneity
○ Where should data be stored?
● (2) Predictable performance
○ Millions of processes performing I/O
● Many challenges...
DOE SSIO, “Science-Driven Data Management for Multi-Tiered Storage” with ORNL and Sandia, award DE-SC0016074
6. Malacology
A Programmable Storage System
[Sevilla et al. EuroSys '17]
Michael A. Sevilla, Noah Watkins, Ivo Jimenez,
Peter Alvaro, Shel Finkelstein, Jeff LeFevre, Carlos Maltzahn
University of California, Santa Cruz
7. Malacology: programmable interface research platform
Target storage interface (Goal)
Internal system services (Building blocks)
Composed, generic service glue layer
Malacology: A Programmable Storage System, M. Sevilla, N. Watkins, I. Jimenez, P. Alvaro, S. Finkelstein, J. LeFevre, C. Maltzahn, EuroSys '17
Mantle: A Programmable Metadata Load Balancer for the Ceph File System, M. Sevilla, N. Watkins, C. Maltzahn, I. Nassi, S. Brandt, et al., SC '15
CORFU: A Shared Log Design for Flash Clusters, M. Balakrishnan, D. Malkhi, V. Prabhakaran, T. Wobber, M. Wei, et al., NSDI '12
8. How to grow a database: scale-up
[Diagram: a query (Q) runs on a database node (CPU, RAM) attached to database storage over a network or bus]
https://aws.amazon.com/ec2/instance-types/
Skyhook project
● Elastic database system
● Lead: Jeff LeFevre
● Active CROSS incubator
10. Skyhook: aligns data with storage interfaces
[Diagram: a database node (RAM, CPU) sends queries (Q) through foreign data wrappers to a DB-specific data interface on a Ceph cluster. Each Ceph OSD has its own RAM, CPU, and storage plus index. A table with columns C1, C2, C3 is partitioned into shards, and each shard is stored as an object ({ object.i }).]
Database node
* Indexing
* Projection
* Filtering
* Aggregation
App-specific interface
11. Skyhook experiments with programmable storage
● Real-world dataset
○ TPC lineitem table
○ 1 billion rows
○ 140 GB
● Storage in Ceph objects
○ Table divided into ~10,000 14 MB objects
■ Optimize for workload (e.g. 4MB)
○ Each object contains a dedicated index
■ Index stored in omap (RocksDB)
● Storage hardware (thanks CloudLab!)
○ Modern 20-core Intel CPU
○ 128 GB DRAM, 500 GB SSD
○ 10 Gb/s Ethernet
○ 1 to 16 Ceph nodes
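The partitioning figures above can be checked with a little arithmetic. The sketch below uses only the numbers on the slide (140 GB, 1 billion rows, about 10,000 objects); decimal units (1 GB = 1000 MB) and the helper name `objects_needed` are assumptions made for illustration.

```python
import math

# Figures from the slide. Decimal units (1 GB = 1000 MB) are an assumption.
TABLE_SIZE_GB = 140
NUM_ROWS = 1_000_000_000
NUM_OBJECTS = 10_000

# Average size of one object and number of rows it holds.
object_size_mb = TABLE_SIZE_GB * 1000 / NUM_OBJECTS
rows_per_object = NUM_ROWS // NUM_OBJECTS

# Average width of one lineitem row implied by the totals.
bytes_per_row = TABLE_SIZE_GB * 10**9 / NUM_ROWS


def objects_needed(target_object_mb):
    """Hypothetical helper: object count for a different target object size."""
    return math.ceil(TABLE_SIZE_GB * 1000 / target_object_mb)


print(f"{object_size_mb:.0f} MB per object, {rows_per_object} rows per object")
print(f"{bytes_per_row:.0f} bytes per row on average")
print(f"{objects_needed(4)} objects at a 4 MB target size")
```

Under these assumptions, tuning the object size down to 4 MB would mean roughly 3.5 times as many objects, and therefore as many per-object indexes.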
[Diagram: a database node (CPU, RAM) sends queries (Q) over the network to programmable storage, which exposes a database-specific data interface]
12. Benchmark queries evaluated
Qa: Range query with 10% selectivity:
SELECT * FROM lineitem WHERE extendedprice > 71000.0
Qb: Point query (unique row) issued with and without index:
SELECT extendedprice
FROM lineitem
WHERE orderkey=5 AND linenumber=3
Qc: Regex query with 10% selectivity (CPU intensive):
SELECT * FROM lineitem WHERE comment iLIKE '%uriously%'
13. Range query performance (10% selectivity)
Improved I/O performance
● Local I/O bandwidth
● Local CPU resources
● Reduced network traffic
● CPU parallelism
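The reduced network traffic can be illustrated with a toy model. This is a minimal sketch, not Skyhook's implementation: `StorageObject`, `client_side`, and `server_side` are hypothetical names, in-memory lists stand in for Ceph objects, and "transferred" simply counts rows that would cross the network. The data is built so that the predicate matches 10% of rows, as in the range query.

```python
from dataclasses import dataclass


@dataclass
class StorageObject:
    """Hypothetical stand-in for one Ceph object holding table rows."""
    rows: list  # each row is (orderkey, extendedprice)

    def read_all(self):
        # Plain read: every row leaves the storage node.
        return list(self.rows)

    def scan(self, predicate):
        # Server-side filter: only matching rows leave the storage node.
        return [row for row in self.rows if predicate(row)]


def make_objects(num_objects=4, rows_per_object=100):
    objects = []
    for o in range(num_objects):
        rows = [(o * rows_per_object + i, float(i)) for i in range(rows_per_object)]
        objects.append(StorageObject(rows))
    return objects


def client_side(objects, predicate):
    """Fetch everything, then filter on the database node."""
    result, transferred = [], 0
    for obj in objects:
        rows = obj.read_all()
        transferred += len(rows)
        result.extend(row for row in rows if predicate(row))
    return result, transferred


def server_side(objects, predicate):
    """Filter inside each object, then ship only the matches."""
    result, transferred = [], 0
    for obj in objects:
        matches = obj.scan(predicate)
        transferred += len(matches)
        result.extend(matches)
    return result, transferred


objects = make_objects()
high_price = lambda row: row[1] > 89.0  # matches 10 of every 100 rows

rows_client, sent_client = client_side(objects, high_price)
rows_server, sent_server = server_side(objects, high_price)
print(sent_client, sent_server)
```

Both paths return the same rows, but the server-side path moves only the matching 10% across the network. In a real cluster the per-object scans would also run in parallel on the storage nodes, which this sequential sketch does not model.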
[Figure: range query performance, client-side processing vs. server-side processing. Diagram: a database node (CPU, RAM) and database storage connected by a network.]
14. Point query performance (find unique row)
● Local I/O bandwidth
● Local CPU resources
● Reduced network traffic
● CPU parallelism
● 10,000 index lookups!
● 1 billion rows
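Since each object carries its own index, a point query becomes one index lookup per object instead of a scan of every row. The sketch below assumes that reading of the slide; `IndexedObject` and `point_query` are hypothetical names, and a Python dict stands in for the omap (RocksDB) index.

```python
class IndexedObject:
    """Hypothetical stand-in for one object with a dedicated index."""

    def __init__(self, rows):
        self.rows = rows  # each row is (orderkey, linenumber, extendedprice)
        # A dict stands in for the per-object omap index.
        self.index = {(r[0], r[1]): pos for pos, r in enumerate(rows)}

    def scan_lookup(self, orderkey, linenumber):
        # Without an index, every row in the object is examined.
        matches = [r[2] for r in self.rows if r[0] == orderkey and r[1] == linenumber]
        return matches, len(self.rows)

    def index_lookup(self, orderkey, linenumber):
        # With an index, a single probe finds the row or rules the object out.
        pos = self.index.get((orderkey, linenumber))
        matches = [] if pos is None else [self.rows[pos][2]]
        return matches, 1


def make_objects(num_objects=10, rows_per_object=1000):
    objects = []
    for o in range(num_objects):
        rows = []
        for i in range(rows_per_object):
            gid = o * rows_per_object + i
            # Seven line items per order, so (orderkey, linenumber) is unique.
            rows.append((gid // 7 + 1, gid % 7 + 1, 100.0 + gid))
        objects.append(IndexedObject(rows))
    return objects


def point_query(objects, orderkey, linenumber, use_index):
    result, work = [], 0
    for obj in objects:
        lookup = obj.index_lookup if use_index else obj.scan_lookup
        matches, cost = lookup(orderkey, linenumber)
        result.extend(matches)
        work += cost
    return result, work


objects = make_objects()
scan_result, rows_examined = point_query(objects, 5, 3, use_index=False)
index_result, index_probes = point_query(objects, 5, 3, use_index=True)
print(rows_examined, index_probes)
```

In this toy setup the scan examines all 10,000 rows while the indexed path issues 10 probes, one per object. Scaled to the experiment, that is the difference between touching 1 billion rows and performing about 10,000 index lookups.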
[Figure: point query performance for client-side processing, server-side processing, and server-side processing with index acceleration. Diagram: a database node (CPU, RAM) and database storage connected by a network.]
16. zlog: implementation of CORFU on Ceph
[Diagram: shared log with positions 1, 2, 3, ..., 14, ...]
● Extend the benefits of software-defined storage to log abstraction
○ Transparently select storage media and physical design
○ Take advantage of performance upgrades and new features
○ Offload critical components such as replication and erasure-coding
[Diagram labels: LevelDB, RocksDB, WiredTiger; librados; osd, osd, osd]
CORFU protocol enforced by custom, transactional storage interface.
Balakrishnan et al., “CORFU: A Shared Log Design for Flash Clusters”, NSDI '12
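The core rule of a CORFU-style shared log can be sketched in a few lines: a sequencer hands out log positions, and the storage interface rejects any second write to the same position. This is a simplified illustration, not zlog's code. All class names are hypothetical, in-memory dicts stand in for Ceph objects, no librados calls are made, and details such as replication, hole filling, and sealing are left out.

```python
import itertools


class WriteOnceError(Exception):
    """Raised when a log position has already been written."""


class LogObject:
    """Hypothetical stand-in for one storage object enforcing write-once."""

    def __init__(self):
        self.entries = {}

    def write(self, position, data):
        # The storage interface, not the client, enforces the protocol.
        if position in self.entries:
            raise WriteOnceError(f"position {position} already written")
        self.entries[position] = data

    def read(self, position):
        return self.entries[position]  # KeyError if the position is unwritten


class Sequencer:
    """Hands out unique, increasing log positions, starting at 1."""

    def __init__(self):
        self._counter = itertools.count(1)

    def next_position(self):
        return next(self._counter)


class SharedLog:
    def __init__(self, num_objects=3):
        self.objects = [LogObject() for _ in range(num_objects)]
        self.sequencer = Sequencer()

    def _object_for(self, position):
        # Stripe log positions across objects round-robin.
        return self.objects[position % len(self.objects)]

    def append(self, data):
        position = self.sequencer.next_position()
        self._object_for(position).write(position, data)
        return position

    def read(self, position):
        return self._object_for(position).read(position)


log = SharedLog()
first = log.append(b"create table")
second = log.append(b"insert row")
print(first, second, log.read(second))
```

Because the write-once check lives in the storage object, a client that tries to reuse a position is rejected no matter what other clients are doing, which is the role the slide assigns to the custom, transactional storage interface.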
19. The state of programmable storage is messy
● Large design space
○ High cost of searching this space
● Costs are difficult to predict
○ A simple upgrade can change the calculus!
● Much harder than what we have presented
○ > 500 tunables/settings in Ceph
■ Not counting dependencies
○ Runs on a wide variety of hardware
● No hope of migrating to a new system
○ There are no standards!
More programmability work:
“Data Center Scale Programmable Storage” with Dirk Grunwald (CU Boulder),
NSF #1705021
20. DeclStore: Layering is for the Faint of Heart
Noah Watkins, Michael Sevilla, Ivo Jimenez, Kathryn Dahlgren, Peter Alvaro, Shel Finkelstein, Carlos Maltzahn
[HotStorage, July 2017]
21. Declarative storage
● Automate parts of this process
○ Searching the design space
○ Generating implementations
Query optimization & plan generation
Cost model
● Express interfaces declaratively
○ Eliminate need for storage system expertise
○ High-level abstractions across services / systems
● Prototyping with the language
○ Formal underpinning, demonstrated across domains
○ Can express all of the CORFU semantics
“Declarative Programmable Storage”,
with Peter Alvaro, NSF #1764102
- 2018
22. IRIS-HEP
● $25 million NSF-funded 5-year project with Princeton
● Institute for Research and Innovation (IRIS)
● High Energy Physics (HEP)
● Exascale storage and analysis challenges