2. State Storage Example
[Diagram: a running count kept as state across micro-batches. Batch 1 carries values [1-4]: count 10, state = 10. Batch 2 carries [5-8]: count 10 + 26 = 36, state = 36. Batch 3 carries [9-10]: count 36 + 19 = 55, state = 55.]
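The running count above can be sketched in a few lines of Python: the state carried between micro-batches is just the accumulated total.

```python
# Toy version of the running-count example: the "state" is the total so
# far, and each micro-batch folds its values into it.

def process_batch(state, batch):
    """Add this batch's values to the running total kept as state."""
    return state + sum(batch)

state = process_batch(0, [1, 2, 3, 4])      # Batch 1: state = 10
state = process_batch(state, [5, 6, 7, 8])  # Batch 2: 10 + 26 = 36
state = process_batch(state, [9, 10])       # Batch 3: 36 + 19 = 55
print(state)  # 55
```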
3. Current Implementation
● Versioned key-value store for each shuffle partition
● In-memory hashmap as the key-value store
● For each batch:
  ○ Load the previous version's (batch's) state checkpointed in HDFS/S3
  ○ Create the new state from the previous version and the current batch's data records
  ○ Write a delta file for the updated state into the checkpoint folder
● A maintenance task runs every x minutes to compact delta files into snapshot files
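The per-batch flow above can be illustrated with a small, purely in-memory sketch (dicts stand in for delta and snapshot files on HDFS/S3; all names are illustrative, not the actual Spark implementation):

```python
# Hypothetical sketch of the versioned state store: each batch commits only
# its changed keys as a "delta"; a maintenance task compacts snapshot +
# deltas into a new snapshot.

deltas = {}     # version -> {key: value}, stands in for delta files
snapshots = {}  # version -> full state map, stands in for snapshot files

def load_state(version):
    """Rebuild state for `version` from the latest snapshot plus later deltas."""
    state = {}
    base = max((v for v in snapshots if v <= version), default=0)
    if base:
        state.update(snapshots[base])
    for v in range(base + 1, version + 1):
        state.update(deltas.get(v, {}))
    return state

def commit_batch(version, updates):
    """Record this batch's changed keys as the delta for `version`."""
    deltas[version] = dict(updates)

def maintenance(version):
    """Compact snapshot + deltas up to `version` into a new snapshot."""
    snapshots[version] = load_state(version)

commit_batch(1, {"campaign-1": 10})
commit_batch(2, {"campaign-1": 36, "campaign-2": 5})
maintenance(2)                      # snapshot written at version 2
commit_batch(3, {"campaign-1": 55})
print(load_state(3)["campaign-1"])  # 55, via snapshot v2 + delta v3
```

Recovery of any version only needs the most recent snapshot at or below it plus the deltas after it, which is why periodic compaction keeps state loads cheap.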
4. Issues in the Current Implementation
● Scalability
  ○ Uses executor JVM memory to store the states, so state store size is limited by the size of executor memory.
  ○ Executor JVM memory is shared between state storage and other task operations, so state store size impacts the performance of task execution.
● Latency
  ○ GC pauses, executor failures, and OOM errors become common as the state store grows, increasing the overall latency of a micro-batch.
5. RocksDB-Based Implementation
● Use RocksDB instead of a HashMap for each shuffle partition
● RocksDB
  ○ A storage engine with a key/value interface, based on LevelDB
  ○ New writes are inserted into a memtable; when the memtable fills up, it flushes the data to local storage
  ○ Supports both point lookups and range scans, and provides different types of ACID guarantees
  ○ Optimized for flash storage
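The memtable/flush behaviour can be pictured with a toy LSM-style store (not real RocksDB code; the class and threshold are invented for illustration):

```python
# Toy illustration of the memtable idea: writes land in an in-memory map,
# which is frozen into a sorted immutable "file" (like an SSTable) once it
# exceeds a size threshold. Reads check the memtable first, then the
# flushed files from newest to oldest.

MEMTABLE_LIMIT = 2  # tiny threshold so the example flushes quickly

class ToyLSMStore:
    def __init__(self):
        self.memtable = {}
        self.sstables = []  # sorted key/value lists, newest last

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) > MEMTABLE_LIMIT:
            # Flush: freeze the memtable as a sorted immutable "file".
            self.sstables.append(sorted(self.memtable.items()))
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:               # point lookup: memtable first
            return self.memtable[key]
        for table in reversed(self.sstables):  # then newest flushed file
            for k, v in table:
                if k == key:
                    return v
        return None

store = ToyLSMStore()
for i in range(5):
    store.put(f"k{i}", i)
print(store.get("k0"))  # 0 (served from a flushed "SSTable")
print(store.get("k4"))  # 4 (still in the memtable)
```

Because flushed files are sorted, range scans and point lookups both stay efficient, and writes never block on disk-resident data.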
6. End Goals
● Consistency
  ○ On successful batch completion, all state updates should be committed
  ○ On failure, none of the updates should be committed
  ○ Solution => use transactional RocksDB
● Isolation
  ○ The thread that opens the transaction in write mode should see its own updates
  ○ Another thread that opens the DB in read-only mode should not see any uncommitted updates
  ○ Solution => back up the RocksDB into a separate folder after every batch completion
● Durability
  ○ Checkpoint the delta into an S3 folder
  ○ Compact all delta files into a snapshot once in a while
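The consistency and isolation goals amount to all-or-nothing batch commits. A minimal sketch (plain Python, not the RocksDB transaction API) of those semantics:

```python
# Hedged sketch: buffer a batch's writes in a transaction and apply them to
# the store only on commit, so a failed batch leaves no partial updates.

class StoreTxn:
    def __init__(self, store: dict):
        self.store = store
        self.pending = {}

    def put(self, key, value):
        self.pending[key] = value

    def get(self, key):
        # The writing thread sees its own uncommitted updates first.
        return self.pending.get(key, self.store.get(key))

    def commit(self):
        self.store.update(self.pending)  # all updates applied atomically
        self.pending = {}

    def abort(self):
        self.pending = {}                # none of the updates survive

store = {"campaign-1": 36}
txn = StoreTxn(store)
txn.put("campaign-1", 55)
print(txn.get("campaign-1"))    # 55 inside the transaction
print(store.get("campaign-1"))  # 36 for a read-only view, until commit
txn.commit()
print(store.get("campaign-1"))  # 55 after commit
```

Backing up the committed DB into a separate folder after each batch gives readers a stable view that never includes an in-flight transaction.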
7. RocksDb State Store Implementation
[Architecture diagram: Executor 1 on a worker node runs tasks for partition P1 (batch V3) and partition P2 (batch V3), each through a RocksDB client backed by local hard disk directories /P1 and /P2. Delta files per batch (P1v1.delta, P1v2.delta, P1v3.delta; P2v1.delta, P2v2.delta, P2v3.delta) are uploaded to the external filesystem, and a maintenance thread writes snapshot files (e.g. P1V2.snapshot) for each partition.]
8. Implementation
● Creation of new state
  For batch x and partition Pi:
  ○ If Node(Pi, x) = Node(Pi, x-1): state is already loaded in RocksDB
  ○ Else if Node(Pi, x) = Node(Pi, x-2): update the RocksDB state using the downloaded Delta(Pi, x-1)
  ○ Otherwise, create a new RocksDB store from the checkpointed data (snapshot + deltas)
● During batch execution
  ○ Open RocksDB in transactional mode
  ○ On successful completion of the batch:
    ■ Commit the transaction
    ■ Upload the delta file into the checkpoint folder (S3/HDFS)
    ■ Create a backup of the current DB state in local storage
  ○ Abort the transaction on any error
● Snapshot creation (maintenance task)
  ○ Create a tarball of the last backed-up DB state and upload it to the checkpoint folder
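The state-creation decision above can be sketched as a small function (names like `node_for` and the placement table are illustrative, not the actual Spark implementation):

```python
# Sketch of the state-loading decision: pick the cheapest way to get the
# RocksDB state for partition `partition` in batch `batch`, given
# `node_for(p, v)`, the executor that ran partition p in batch v.

def load_strategy(node_for, partition, batch):
    here = node_for(partition, batch)
    if node_for(partition, batch - 1) == here:
        # Same node as the previous batch: state is already in local RocksDB.
        return "reuse-local"
    if node_for(partition, batch - 2) == here:
        # Node from two batches ago: apply only the missing delta file.
        return "apply-delta"
    # Otherwise rebuild from checkpointed snapshot + deltas.
    return "rebuild-from-checkpoint"

# Example placement: P1 stayed put, P2 moved away and came back, P3 is new
# to this executor.
placement = {("P1", 1): "exec-1", ("P1", 2): "exec-1", ("P1", 3): "exec-1",
             ("P2", 1): "exec-1", ("P2", 2): "exec-2", ("P2", 3): "exec-1",
             ("P3", 1): "exec-2", ("P3", 2): "exec-2", ("P3", 3): "exec-1"}
node = lambda p, v: placement.get((p, v))

print(load_strategy(node, "P1", 3))  # reuse-local
print(load_strategy(node, "P2", 3))  # apply-delta
print(load_strategy(node, "P3", 3))  # rebuild-from-checkpoint
```

The two fast paths avoid downloading a full snapshot whenever the scheduler keeps (or quickly returns) a partition to the same executor.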
9. Performance Analysis
● Setup
  ○ Master: r3.xlarge; Executor: r3.xlarge
  ○ Driver size: 8 GB; Executor size: 12.5 GB
  ○ Campaign dataset: 10,000 unique campaigns
  ○ Ingestion rate: 5k events per second
● Sink
  ○ 3-node MSK (Kafka) cluster
● Query
  ○ Sliding-window aggregation on event time and campaign-id
  ○ For each key, ~150 bytes of state data is generated
● Config
  ○ shuffle partitions = 200 and 8
  ○ spark.dynamicAllocation.maxExecutors = 2
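As a config fragment, the benchmark settings above might be expressed in a PySpark session like this (the application name is invented; the option keys are standard Spark settings, values taken from the slide):

```python
# Hypothetical session config mirroring the benchmark setup above.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("rocksdb-state-store-benchmark")          # name is illustrative
         .config("spark.sql.shuffle.partitions", 200)       # also run with 8
         .config("spark.dynamicAllocation.maxExecutors", 2)
         .getOrCreate())
```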