RocksDB State Store in Structured Streaming
Vikram Agrawal
Qubole
State Storage Example
Running Count
● Batch 1: records [1-4], Count: 10, State = 10
● Batch 2: records [5-8], Count: 10 + 26 = 36, State = 36
● Batch 3: records [9-10], Count: 36 + 19 = 55, State = 55
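The running total above can be reproduced with a small sketch. It assumes each batch contributes the sum of its record values (1+2+3+4 = 10, 5+6+7+8 = 26, 9+10 = 19), which matches the numbers on the slide; the function name `run_batches` is illustrative and not part of Spark.

```python
def run_batches(batches, state=0):
    """Fold each micro-batch into the running state.

    Returns the state after every batch, so the state carried from
    one batch to the next is visible.
    """
    history = []
    for batch in batches:
        # New state = previous state + contribution of the current batch.
        state = state + sum(batch)
        history.append(state)
    return history


# Batches [1-4], [5-8], [9-10] from the example.
batches = [range(1, 5), range(5, 9), range(9, 11)]
states = run_batches(batches)
print(states)  # [10, 36, 55]
```

The only value that has to survive between batches is `state`, which is what the state store persists.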
Current Implementation
● Versioned key-value store for each shuffle partition
● In-memory HashMap as the key-value store
● For each batch
○ Load the previous version (batch) of the state checkpointed in HDFS/S3
○ Create the new state from the previous version and the current batch's records
○ Write a delta file for the updated state to the checkpoint folder
● Maintenance task runs every x minutes to create snapshot files from the delta files
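A minimal sketch of the delta-plus-snapshot scheme described above, not Spark's actual implementation. The `checkpoint` dict stands in for the HDFS/S3 checkpoint folder, the file names are made up, and using `None` as a delete marker is an assumption made for illustration.

```python
class VersionedStore:
    """Toy versioned key-value store for one shuffle partition."""

    def __init__(self):
        # Hypothetical stand-in for the checkpoint folder: file name -> contents.
        self.checkpoint = {}

    def load(self, version):
        """Rebuild the state for `version` from the latest snapshot plus later deltas."""
        state, start = {}, 0
        # Find the newest snapshot at or before the requested version.
        for v in range(version, 0, -1):
            if f"{v}.snapshot" in self.checkpoint:
                state, start = dict(self.checkpoint[f"{v}.snapshot"]), v
                break
        # Replay every delta written after that snapshot.
        for v in range(start + 1, version + 1):
            for key, value in self.checkpoint[f"{v}.delta"].items():
                if value is None:  # assumed delete marker
                    state.pop(key, None)
                else:
                    state[key] = value
        return state

    def commit(self, version, updates):
        """Write only the keys changed in this batch as a delta file."""
        self.checkpoint[f"{version}.delta"] = dict(updates)

    def snapshot(self, version):
        """Maintenance task: collapse the deltas up to `version` into one file."""
        self.checkpoint[f"{version}.snapshot"] = self.load(version)


store = VersionedStore()
store.commit(1, {"a": 1})
store.commit(2, {"a": 2, "b": 5})
store.snapshot(2)
store.commit(3, {"b": None, "c": 7})
print(store.load(3))  # {'a': 2, 'c': 7}
```

Loading version 3 reads one snapshot and one delta instead of three deltas, which is the reason for the periodic maintenance task.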
Issues in the current implementation
● Scalability
○ State is stored in executor JVM memory, so the state store size is limited by the size of the executor memory.
○ Executor JVM memory is shared between state storage and other task operations, so a large state store degrades task execution performance.
● Latency
○ GC pauses, executor failures, and OOM errors become common as the state store grows, which increases the overall latency of a micro-batch.
RocksDB Based Implementation
● Use RocksDB instead of a HashMap for each shuffle partition
● RocksDB
○ A storage engine with a key/value interface, based on LevelDB
○ New writes are inserted into the memtable; when the memtable fills up, it is flushed to local storage
○ Supports both point lookups and range scans, and provides different types of ACID guarantees
○ Optimized for flash storage
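The memtable-then-flush write path can be illustrated with a toy model. This is not the RocksDB API: the class `TinyLsm`, the entry-count limit, and the in-memory list standing in for files on local storage are all assumptions made for the sketch.

```python
class TinyLsm:
    """Toy write path: writes go to a memtable, which is flushed when full."""

    def __init__(self, memtable_limit=2):
        self.memtable_limit = memtable_limit
        self.memtable = {}
        # Stand-in for files on local storage, oldest first.
        # Each flushed run is a sorted list of (key, value) pairs.
        self.flushed = []

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            # Memtable is full: write it out as a sorted, immutable run.
            self.flushed.append(sorted(self.memtable.items()))
            self.memtable = {}

    def get(self, key):
        """Point lookup: newest data wins, so check the memtable first."""
        if key in self.memtable:
            return self.memtable[key]
        for run in reversed(self.flushed):
            for k, v in run:
                if k == key:
                    return v
        return None

    def scan(self, start, end):
        """Range scan over [start, end), merging runs from oldest to newest."""
        merged = {}
        for run in self.flushed:
            merged.update(run)
        merged.update(self.memtable)
        return [(k, v) for k, v in sorted(merged.items()) if start <= k < end]


db = TinyLsm(memtable_limit=2)
db.put("a", 1)
db.put("b", 2)   # second entry fills the memtable and triggers a flush
db.put("a", 3)   # newer value for "a" lives in the fresh memtable
print(db.get("a"), db.get("b"))  # 3 2
```

Because only the memtable lives in memory, the amount of state a partition can hold is bounded by local disk rather than by the executor heap.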
End Goals
● Consistency
○ On successful batch completion, all state updates should be committed
○ On failure, none of the updates should be committed
○ Solution => use transactional RocksDB
● Isolation
○ The thread that opens the transaction in write mode should be able to read its own updates
○ Another thread that opens the DB in read-only mode should not see those updates
○ Solution => back up the RocksDB instance to a separate folder after every batch completion
● Durability
○ Checkpoint the delta in an S3 folder
○ Periodically compact all delta files into a snapshot
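The consistency and isolation goals above can be sketched with plain dictionaries. This only models the intended semantics and does not use RocksDB's transaction API; the class and method names are hypothetical, and `backup` stands in for the separate backup folder.

```python
class TransactionalStore:
    """Toy model: all-or-nothing batch updates plus an isolated read-only view."""

    def __init__(self):
        self.committed = {}   # state visible to the writer between batches
        self.backup = {}      # copy taken after each batch, read by other threads
        self.pending = None   # uncommitted writes of the open transaction

    def begin(self):
        self.pending = {}

    def put(self, key, value):
        self.pending[key] = value

    def get(self, key):
        """Writer-side read: sees its own uncommitted updates."""
        if self.pending is not None and key in self.pending:
            return self.pending[key]
        return self.committed.get(key)

    def read_only_get(self, key):
        """Reader-side read: served from the last backup only."""
        return self.backup.get(key)

    def commit(self):
        """Batch succeeded: apply every update, then refresh the backup."""
        self.committed.update(self.pending)
        self.backup = dict(self.committed)
        self.pending = None

    def abort(self):
        """Batch failed: discard every update."""
        self.pending = None


store = TransactionalStore()
store.begin()
store.put("campaign-1", 10)
print(store.get("campaign-1"))            # 10 (writer sees its update)
print(store.read_only_get("campaign-1"))  # None (reader does not)
store.commit()
print(store.read_only_get("campaign-1"))  # 10
```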
RocksDB State Store Implementation
[Figure: architecture diagram]
● Worker node: Executor 1 runs the task for partition P1 and batch V3 and the task for partition P2 and batch V3, each with its own RocksDB client, plus a maintenance thread
● Local Hard Disk: one directory per partition (/P1, /P2)
● External FileSystem: delta files for P1 (P1v1.delta, P1v2.delta, P1v3.delta) and for P2 (P2v1.delta, P2v2.delta, P2v3.delta); snapshot files for P1 and P2 (both labelled P1V2.snapshot in the figure)
● Maintenance thread: writes the state as a snapshot file
Implementation
● Creation of new state
For batch x and partition Pi
○ If Node(Pi, x) = Node(Pi, x-1): the state is already loaded in RocksDB
○ Else if Node(Pi, x) = Node(Pi, x-2): update the RocksDB state using the downloaded Delta(Pi, x-1)
○ Otherwise, create a new RocksDB store from the checkpointed data (snapshot + delta)
● During batch execution
○ Open RocksDB in transactional mode
○ On successful completion of the batch
■ Commit the transaction
■ Upload the delta file to the checkpoint folder (S3/HDFS)
■ Create a backup of the current DB state in local storage
○ Abort the transaction on any error
● Snapshot creation (maintenance task)
○ Create a tarball of the last backed-up DB state and upload it to the checkpoint folder
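The three-way decision for creating the new state can be written as a small function. This is a sketch under assumptions: `node_for_batch` is a hypothetical mapping from batch number to the node that ran partition Pi, and the returned labels are illustrative names for the three cases.

```python
def load_strategy(node_for_batch, x):
    """Decide how to obtain the state for batch x of one partition.

    node_for_batch: dict mapping batch number -> node that ran this partition.
    """
    current = node_for_batch[x]
    if node_for_batch.get(x - 1) == current:
        # Same node as the previous batch: local RocksDB is already up to date.
        return "reuse"
    if node_for_batch.get(x - 2) == current:
        # Local state is one batch behind: download and apply Delta(Pi, x-1).
        return "apply_delta"
    # No usable local state: rebuild from the checkpointed snapshot and deltas.
    return "rebuild"


print(load_strategy({1: "A", 2: "A"}, 2))          # reuse
print(load_strategy({1: "A", 2: "B", 3: "A"}, 3))  # apply_delta
print(load_strategy({1: "A", 2: "B", 3: "C"}, 3))  # rebuild
```

The first two cases avoid a full download from the checkpoint folder, which is the expensive path.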
Performance Analysis
● Setup
○ Master: r3.xlarge; Executor: r3.xlarge
○ Driver size: 8 GB; Executor size: 12.5 GB
○ Campaign data set: 10,000 unique campaigns
○ Ingestion rate: 5k events per second
● Sink
○ 3-node MKS (Kafka) cluster
● Query
○ Sliding window aggregation on event time and campaign-id
○ For each key, ~150 bytes of state data is generated
● Config
○ shuffle partitions = 200 and 8
○ spark.dynamicAllocation.maxExecutors = 2
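To show why a sliding window aggregation builds up state, here is a plain-Python sketch of how one event maps to several (window, campaign-id) keys. The window length and slide interval used in the benchmark are not given, so the values below are placeholders, and the function names are illustrative rather than Spark APIs.

```python
from collections import Counter


def windows_for(event_time, window, slide):
    """Return every [start, end) window that contains event_time.

    Window starts are assumed to be aligned to multiples of `slide`.
    """
    last_start = event_time - event_time % slide
    # Walk back one slide at a time while the window still covers the event.
    return [(s, s + window) for s in range(last_start, event_time - window, -slide)]


def aggregate(events, window, slide):
    """Count events per (window, campaign_id) key."""
    counts = Counter()
    for event_time, campaign_id in events:
        for w in windows_for(event_time, window, slide):
            counts[(w, campaign_id)] += 1
    return counts


# Placeholder parameters: 60-second windows sliding every 30 seconds.
events = [(125, "c1"), (130, "c1"), (155, "c2")]
counts = aggregate(events, window=60, slide=30)
print(counts[((120, 180), "c1")])  # 2
```

Each distinct (window, campaign-id) pair is one state key, so the key count grows with both the number of campaigns and the number of overlapping windows.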
Comparison
[Figure: Executor's GC time and heap usage, memory-based state storage]
[Figure: Executor's GC time and heap usage, RocksDB-based state storage]
