TileDB is an open-source storage manager for multi-dimensional sparse and dense array data. It has a novel architecture that addresses some of the pain points in storing array data on “big-data” and “cloud” storage architectures. This talk will highlight TileDB’s design and its ability to integrate with analysis environments relevant to the PyData community such as Python, R, Julia, etc.
2. The Problem
• We focus on persistent storage of massive data
• Plethora of complex formats across many applications
- Genomics (FastQ, BAM, VCF, CRAM, etc.), LiDAR (LAS, LAZ), Databases (proprietary formats, Parquet), …
• Every format is associated with a library responsible for
- Backend support (POSIX, HDFS, AWS S3, …), parallel IO, compression, other filters, …
• Downstream computations (e.g., Linear Algebra) typically work on vectors and arrays
• Two common problems:
- redundant software engineering for high performance (parallel IO, compression, etc.)
- expensive conversion to arrays for downstream computations
5. Storage Module vs. DBMS
• Storage Module
- IO
- Compression
- Access / Slicing
- APIs to higher level modules
- Other filters (e.g., encryption)
• DBMS
- Query language
- Query parser
- Query optimizer
- Query executor
• A storage module can be integrated with other data science tools as well, without ODBC/JDBC
7. TileDB History
Stavros, Jake, Tyler, Seth
2015 TileDB research project kicks off at
2016 VLDB paper on TileDB
2017 TileDB, Inc. is incorporated backed by
2018 - We are hiring!
10. The TileDB Format
Filters
• Tile
- Binary data across an attribute
- Atomic unit of IO
• Chunk
- A tile is split into chunks
- Each chunk fits in L1 cache
- Atomic unit of filtering
- Each chunk passes through the same filter sequence (Filter 1, Filter 2, …)
• Filters
- Compression (gzip, zstd, …)
- Byte/Bit Shuffle
- Encryption
- Delta encoding
- Bit-width reduction
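The tile / chunk / filter relationship on this slide can be illustrated with a small standard-library sketch. It is not TileDB code: the 32 KB chunk size, the function names, and the choice of delta encoding followed by zlib are all assumptions made for illustration. The point is only that a tile is split into chunks, every chunk runs through the same ordered filter list on write, and the filters are undone in reverse order on read.

```python
import struct
import zlib

CHUNK_BYTES = 32 * 1024   # assumed chunk size (hypothetical), a multiple of 4
MASK = 0xFFFFFFFF         # values are treated as unsigned 32-bit integers


def _unpack(raw):
    return struct.unpack(f"<{len(raw) // 4}I", raw)


def _pack(vals):
    return struct.pack(f"<{len(vals)}I", *vals)


def delta_forward(raw):
    """Replace each value by its difference from the previous one (mod 2**32)."""
    out, prev = [], 0
    for v in _unpack(raw):
        out.append((v - prev) & MASK)
        prev = v
    return _pack(out)


def delta_reverse(raw):
    """Undo delta_forward with a running sum (mod 2**32)."""
    out, acc = [], 0
    for d in _unpack(raw):
        acc = (acc + d) & MASK
        out.append(acc)
    return _pack(out)


# Ordered (forward, reverse) filter pairs: "Filter 1", then "Filter 2".
PIPELINE = [(delta_forward, delta_reverse), (zlib.compress, zlib.decompress)]


def write_tile(tile, chunk_bytes=CHUNK_BYTES):
    """Split a tile into chunks and run each chunk through the pipeline."""
    filtered = []
    for start in range(0, len(tile), chunk_bytes):
        chunk = tile[start:start + chunk_bytes]
        for forward, _ in PIPELINE:
            chunk = forward(chunk)
        filtered.append(chunk)
    return filtered


def read_tile(filtered):
    """Undo the filters in reverse order and reassemble the tile."""
    parts = []
    for chunk in filtered:
        for _, reverse in reversed(PIPELINE):
            chunk = reverse(chunk)
        parts.append(chunk)
    return b"".join(parts)
```

Because each chunk is filtered independently, chunks can be processed by different threads and a reader only needs to unfilter the chunks it touches.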
11. The TileDB Format
Cloud
• TileDB works great on AWS S3
- Just use s3://bucket-name/path/to/array instead of my_array
- No concept of directories, natural use of / in the URI
- aws s3 sync just works
- LSM-tree-based updates excellent fit for such an object store
• Adding Azure, Google Cloud and Alibaba Cloud soon
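Why LSM-tree-style updates suit an object store can be shown with a toy model. Everything below is hypothetical (the `ToyObjectStore` class, the `fragment_` key naming, the sequence numbers) and is not TileDB's actual on-disk layout. It only illustrates the idea: every write creates a new immutable object, reads merge the objects with the newest winning, and consolidation replaces many small objects with one.

```python
class ToyObjectStore:
    """Stand-in for an object store: whole-object put/get/list/delete only."""

    def __init__(self):
        self._objects = {}

    def put(self, key, value):
        # Objects are never modified in place.
        if key in self._objects:
            raise KeyError(f"object already exists: {key}")
        self._objects[key] = value

    def get(self, key):
        return self._objects[key]

    def list(self, prefix):
        return sorted(k for k in self._objects if k.startswith(prefix))

    def delete(self, key):
        del self._objects[key]


def write_fragment(store, array_uri, seq, cells):
    """Write one batch of cells as a new immutable object (hypothetical naming)."""
    key = f"{array_uri}/fragment_{seq:08d}"
    store.put(key, dict(cells))
    return key


def read_array(store, array_uri):
    """Merge all fragments in write order; later fragments override earlier ones."""
    merged = {}
    for key in store.list(f"{array_uri}/fragment_"):
        merged.update(store.get(key))
    return merged


def consolidate(store, array_uri, seq):
    """Replace all existing fragments with a single merged fragment."""
    old_keys = store.list(f"{array_uri}/fragment_")
    merged = read_array(store, array_uri)
    write_fragment(store, array_uri, seq, merged)
    for key in old_keys:
        store.delete(key)


store = ToyObjectStore()
uri = "s3://bucket-name/path/to/array"
write_fragment(store, uri, 1, {(0, 0): 1, (0, 1): 2})
write_fragment(store, uri, 2, {(0, 1): 20, (1, 1): 4})   # updates cell (0, 1)
latest = read_array(store, uri)
```

No write ever rewrites part of an existing object, which is the operation object stores do not offer.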
12. TileDB Parallelism
• Fully multi-threaded via Intel TBB
• TileDB does not rely on an external engine for parallelism (e.g., Dask)
• Thread-/Process-safety, no need for locking, multiple reader/writer model
• Parallel IO (good use of S3 multipart upload and byte range requests)
• Parallel filters
• Parallel sorting
• Parallel slicing
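As a rough illustration of the "parallel filters" bullet, independent chunks can be compressed concurrently. TileDB itself does this in its C++ core with Intel TBB; the sketch below merely uses Python threads and zlib as stand-ins, and the function names and worker count are assumptions.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor


def compress_chunks_parallel(chunks, workers=4):
    """Compress independent chunks on a thread pool; output order matches input."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(zlib.compress, chunks))


def decompress_chunks_parallel(chunks, workers=4):
    """Inverse of compress_chunks_parallel."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(zlib.decompress, chunks))


# Eight 10,000-byte chunks, each filled with a single repeated byte value.
chunks = [bytes([i]) * 10_000 for i in range(8)]
compressed = compress_chunks_parallel(chunks)
restored = decompress_chunks_parallel(compressed)
```

This works only because chunks share no state, which is the property that makes the chunk a convenient unit of filtering.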
13. APIs and Integration
• Lightweight interfaces between the TileDB C library and the high-level (HL) APIs
• Zero-copying wherever possible
• Predicate push-down
• Effective partitioning (especially for sparse arrays)
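14. Comparison table
• Formats compared: Zarr and others (remaining column labels not recoverable)
• Criteria: ND arrays, sparse arrays, compression/filters, parallel IO, parallelism, S3 support, updates, APIs
• Surviving cell values (row and column assignment not recoverable):
- LSM-tree-like / chunk-based / chunk-based / file-based
- SWMR / pushed to app / pushed to app
- multiple / multiple / only Python / multiple
- pushed to app / Blosc / pushed to app / pushed to app
- open-source / closed-source / open-source / pushed to app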
15. Apache Arrow
• In-memory columnar format
• DataFrames, limited ND array support
• Designed for fast in-memory operations
• Rich datatype support, complex objects
• Persistence through virtual memory mapping or delegated to external on-disk formats
• TileDB integration with Apache Arrow is on our roadmap!
16. TileDB Value to
• Manage dense and sparse data persistence using a single API
• Get the most from your modern hardware! Concurrent IO, parallel compression, accelerated encryption and more
• Easily interface with multiple different storage backends (including cloud storage) and get performance with little to no code changes
• Common format that can be leveraged by “big data” / SQL platforms and Python, R, Julia, … ecosystems
17. Thank You
We are hiring!
tiledb.workable.com
careers@tiledb.io
https://github.com/TileDB-Inc
pip install tiledb