Data storage made fast and easy
The Problem
• We focus on persistent storage of massive data
• Plethora of complex formats across many applications

- Genomics (FastQ, BAM, VCF, CRAM, etc.), LiDAR (LAS, LAZ), Databases (proprietary formats, Parquet), …

• Every format is associated with a library responsible for

- Backend support (POSIX, HDFS, AWS S3, …), parallel IO, compression, other filters, …

• Downstream computations (e.g., Linear Algebra) typically work on vectors and arrays

• Two common problems:
- redundant software engineering for high performance (parallel IO, compression, etc.)
- expensive conversion to arrays for downstream computations
What is Array Data?
Goals
1) Slicing
2) Compression
Applications
• Genomics
• Time Series
• Tabular
• LiDAR
• Imaging
Source: NYU’s Center for Urban Science and Progress
Storage Module vs. DBMS
Storage Module
- IO
- Compression
- Access / Slicing
- APIs to higher level modules
- Other filters (e.g., encryption)
DBMS
- Query language
- Query optimizer
- Query executor
- Query parser
A storage module can be integrated with other data science tools as well, without ODBC/JDBC
What is TileDB?
Architecture
TileDB is a storage module for a novel multi-dimensional array data format
TileDB History
Stavros, Jake, Tyler, Seth
2015 TileDB research project kicks off at
2016 VLDB paper on TileDB
2017 TileDB, Inc. is incorporated backed by
2018 - We are hiring!
The TileDB Format

Physical Organization
Dense array:
my_array/
    __<uuid>_<timestamp>/
        a1.tdb
        a2.tdb
        a2_var.tdb
        __fragment_metadata.tdb
    __array_schema.tdb
    __lock.tdb
Sparse array (the fragment also stores coordinates in __coords.tdb):
my_array/
    __<uuid>_<timestamp>/
        a1.tdb
        a2.tdb
        a2_var.tdb
        __coords.tdb
        __fragment_metadata.tdb
    __array_schema.tdb
    __lock.tdb
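As a rough illustration of why one layout carries a coordinates file and the other does not, the sketch below assumes the two layouts correspond to dense and sparse arrays. In a dense array the position of a value implies its coordinates, while a sparse array must store coordinates explicitly. The function names and the row-major cell order are assumptions made for this example, not the TileDB API or its on-disk order.

```python
def dense_slice(values, n_cols, row_range, col_range):
    """Slice a dense 2D array stored as a flat row-major list.

    Coordinates are implied by position, so no coordinates are stored.
    Ranges are half-open: (start, stop).
    """
    r0, r1 = row_range
    c0, c1 = col_range
    return [[values[r * n_cols + c] for c in range(c0, c1)]
            for r in range(r0, r1)]


def sparse_slice(coords, values, row_range, col_range):
    """Slice a sparse 2D array stored as parallel lists of coordinates and values.

    Only non-empty cells exist, so each value needs explicit coordinates.
    """
    r0, r1 = row_range
    c0, c1 = col_range
    return [(rc, v) for rc, v in zip(coords, values)
            if r0 <= rc[0] < r1 and c0 <= rc[1] < c1]


# Dense 3x4 array holding 0..11 in row-major order.
dense_values = list(range(12))
dense_result = dense_slice(dense_values, 4, (1, 3), (1, 3))

# Sparse array with three non-empty cells.
sparse_coords = [(0, 0), (1, 2), (2, 3)]
sparse_values = ["a", "b", "c"]
sparse_result = sparse_slice(sparse_coords, sparse_values, (1, 3), (1, 4))

print(dense_result)
print(sparse_result)
```

The sparse slice scans every cell here for simplicity; a real storage engine would use tile-level metadata to skip irrelevant data.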
The TileDB Format

Updates
LSM-tree-like updates and consolidation
my_array/
    __<uuid>_<timestamp1>/
        a1.tdb
        a2.tdb
        a2_var.tdb
        __fragment_metadata.tdb
    __<uuid>_<timestamp2>/
        a1.tdb
        a2.tdb
        a2_var.tdb
        __fragment_metadata.tdb
    __<uuid>_<timestamp3>/
        a1.tdb
        a2.tdb
        a2_var.tdb
        __coords.tdb
        __fragment_metadata.tdb
    __array_schema.tdb
    __lock.tdb
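The LSM-tree-like update model can be sketched in a few lines, under the assumption that each write produces a new immutable, timestamped fragment, that reads resolve conflicts in favor of the newest fragment, and that consolidation merges fragments into one. The `Fragment` class and both functions are hypothetical names for this illustration and say nothing about how TileDB implements these steps internally.

```python
from dataclasses import dataclass, field


@dataclass
class Fragment:
    """One immutable batch of writes (hypothetical model of a fragment directory)."""
    timestamp: int
    cells: dict = field(default_factory=dict)  # coordinates -> value


def read_cell(fragments, coords):
    """Return the value from the most recent fragment that wrote `coords`."""
    for frag in sorted(fragments, key=lambda f: f.timestamp, reverse=True):
        if coords in frag.cells:
            return frag.cells[coords]
    return None


def consolidate(fragments):
    """Merge all fragments into one; later timestamps overwrite earlier ones."""
    merged = {}
    for frag in sorted(fragments, key=lambda f: f.timestamp):
        merged.update(frag.cells)
    newest = max(f.timestamp for f in fragments)
    return Fragment(timestamp=newest, cells=merged)


# Three writes at increasing timestamps; the second overwrites cell (0, 1).
fragments = [
    Fragment(1, {(0, 0): 10, (0, 1): 20}),
    Fragment(2, {(0, 1): 21}),
    Fragment(3, {(5, 5): 99}),
]

latest = read_cell(fragments, (0, 1))  # newest write wins: 21
single = consolidate(fragments)

print(latest)
print(single.cells)
```

Because writers only ever add new fragments and never modify existing files, this model maps naturally onto object stores where objects are immutable.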
The TileDB Format

Filters
• Binary data across an attribute is divided into tiles, and each tile into chunks
• Tile: atomic unit of IO
• Chunk: atomic unit of filtering; each chunk fits in L1 cache
• Filters
- Compression (gzip, zstd, …)
- Byte/Bit Shuffle
- Encryption
- Delta encoding
- Bit-width reduction
• Every chunk passes through the same filter pipeline (Filter 1, Filter 2)
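A minimal sketch of a per-chunk filter pipeline follows, assuming a two-stage pipeline of delta encoding followed by zlib (gzip-style) compression. The chunk length of 4096 64-bit values (32 KiB) is an assumed stand-in for "fits in L1 cache", and all function names are hypothetical; this is not the TileDB filter API.

```python
import struct
import zlib


def delta_encode(values):
    """Filter 1: replace each value with its difference from the previous one."""
    out, prev = [], 0
    for v in values:
        out.append(v - prev)
        prev = v
    return out


def delta_decode(deltas):
    """Inverse of delta_encode: running sum of the differences."""
    out, total = [], 0
    for d in deltas:
        total += d
        out.append(total)
    return out


def filter_chunk(values):
    """Run one chunk through the pipeline: delta encoding, then compression."""
    deltas = delta_encode(values)
    raw = struct.pack(f"<{len(deltas)}q", *deltas)  # little-endian int64
    return zlib.compress(raw)  # Filter 2


def unfilter_chunk(blob):
    """Reverse the pipeline in the opposite order."""
    raw = zlib.decompress(blob)
    deltas = struct.unpack(f"<{len(raw) // 8}q", raw)
    return delta_decode(deltas)


def filter_tile(values, chunk_len=4096):
    """Split a tile into chunks and filter each chunk independently."""
    return [filter_chunk(values[i:i + chunk_len])
            for i in range(0, len(values), chunk_len)]


def unfilter_tile(chunks):
    """Unfilter every chunk and concatenate the results."""
    out = []
    for blob in chunks:
        out.extend(unfilter_chunk(blob))
    return out


# A tile of 10,000 slowly increasing values; the deltas are all small.
tile = list(range(1000, 21000, 2))
chunks = filter_tile(tile)
stored_bytes = sum(len(c) for c in chunks)

print(len(chunks), stored_bytes, 8 * len(tile))
```

Filtering chunks independently means any chunk can be decoded on its own, which is also what allows chunks to be processed in parallel.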
The TileDB Format

Cloud
• TileDB works great on AWS S3

- Just use s3://bucket-name/path/to/array instead of my_array

- No concept of directories, natural use of / in the URI

- aws s3 sync just works

- LSM-tree-based updates excellent fit for such an object store

• Adding Azure, Google Cloud and Alibaba Cloud soon
TileDB Parallelism
• Fully multi-threaded via Intel TBB

• TileDB does not rely on an external engine for parallelism (e.g., Dask)

• Thread-/Process-safety, no need for locking, multiple reader/writer model

• Parallel IO (good use of S3 multipart upload and byte range requests)

• Parallel filters

• Parallel sorting

• Parallel slicing
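The "parallel filters" idea can be illustrated with a thread pool that compresses independent chunks concurrently. This sketch uses Python's `concurrent.futures` and `zlib` purely as stand-ins; the slides state that TileDB itself uses Intel TBB inside its own library, and the chunk size and worker count below are arbitrary assumptions.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(data, chunk_size):
    """Cut a byte string into fixed-size chunks (the last one may be shorter)."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def compress_chunks_parallel(chunks, max_workers=4):
    """Compress each chunk on a worker thread; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(zlib.compress, chunks))


# 1 MiB of repetitive data, split into 64 KiB chunks.
data = bytes(range(256)) * 4096
chunks = split_into_chunks(data, 64 * 1024)
compressed = compress_chunks_parallel(chunks)

# Round trip: decompress in order and reassemble.
restored = b"".join(zlib.decompress(c) for c in compressed)
print(len(chunks), restored == data)
```

Since each chunk is filtered independently, no coordination between workers is needed beyond preserving the chunk order on output.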
APIs and Integration
• Lightweight interfaces between the TileDB C library and the high-level (HL) APIs

• Zero-copying wherever possible

• Predicate push-down

• Effective partitioning (especially for sparse arrays)
ND arrays
Sparse arrays
Compression/Filters
Parallel IO
Parallelism
S3 support
Updates
Zarr
APIs
LSM-tree-like chunk-based chunk-based file-based
SWMR pushed to app pushed to app
multiple multiple only Python multiple
pushed to app Blosc / pushed to app pushed to app
open-source closed-source open-source pushed to app
Apache Arrow
• In-memory columnar format

• DataFrames, limited ND array support

• Designed for fast in-memory operations

• Rich datatype support, complex objects

• Persistence through virtual memory mapping or delegated to external on-disk formats

• TileDB integration with Apache Arrow is on our roadmap!
TileDB Value to
• Manage dense and sparse data persistence using a single API

• Get the most from your modern hardware! Concurrent IO, parallel compression, accelerated encryption and more

• Easily interface with multiple different storage backends (including cloud storage) and get performance with little to no code changes

• Common format that can be leveraged by “big data” / SQL platforms and Python, R, Julia, … ecosystems
Thank You
We are hiring!
tiledb.workable.com
careers@tiledb.io
https://github.com/TileDB-Inc
pip install tiledb

The TileDB Array Data Storage Manager - Stavros Papadopoulos, Jake Bolewski
