2. State Storage Example
[Diagram: a running count kept as state across micro-batches. Batch 1 carries values [1-4]: count 10, state = 10. Batch 2 carries [5-8]: count 10 + 26 = 36, state = 36. Batch 3 carries [9-10]: count 36 + 19 = 55, state = 55.]
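The running count above can be sketched in a few lines of Python: the state carried between micro-batches is just the accumulated total.

```python
# Toy version of the running-count example: the "state" is the total so
# far, and each micro-batch folds its values into it.

def process_batch(state, batch):
    """Add this batch's values to the running total kept as state."""
    return state + sum(batch)

state = process_batch(0, [1, 2, 3, 4])      # Batch 1: state = 10
state = process_batch(state, [5, 6, 7, 8])  # Batch 2: 10 + 26 = 36
state = process_batch(state, [9, 10])       # Batch 3: 36 + 19 = 55
print(state)  # 55
```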
3. Current Implementation
● Versioned key-value store for each shuffle partition
● In-memory hashmap as the key-value store
● For each batch:
  ○ Load the previous version's (batch's) state checkpointed in HDFS/S3
  ○ Create the new state from the previous version and the current batch's data records
  ○ Write a delta file for the updated state into the checkpoint folder
● A maintenance task runs every x minutes to compact delta files into snapshot files
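The per-batch flow above can be illustrated with a small, purely in-memory sketch (dicts stand in for delta and snapshot files on HDFS/S3; all names are illustrative, not the actual Spark implementation):

```python
# Hypothetical sketch of the versioned state store: each batch commits only
# its changed keys as a "delta"; a maintenance task compacts snapshot +
# deltas into a new snapshot.

deltas = {}     # version -> {key: value}, stands in for delta files
snapshots = {}  # version -> full state map, stands in for snapshot files

def load_state(version):
    """Rebuild state for `version` from the latest snapshot plus later deltas."""
    state = {}
    base = max((v for v in snapshots if v <= version), default=0)
    if base:
        state.update(snapshots[base])
    for v in range(base + 1, version + 1):
        state.update(deltas.get(v, {}))
    return state

def commit_batch(version, updates):
    """Record this batch's changed keys as the delta for `version`."""
    deltas[version] = dict(updates)

def maintenance(version):
    """Compact snapshot + deltas up to `version` into a new snapshot."""
    snapshots[version] = load_state(version)

commit_batch(1, {"campaign-1": 10})
commit_batch(2, {"campaign-1": 36, "campaign-2": 5})
maintenance(2)                      # snapshot written at version 2
commit_batch(3, {"campaign-1": 55})
print(load_state(3)["campaign-1"])  # 55, via snapshot v2 + delta v3
```

Recovery of any version only needs the most recent snapshot at or below it plus the deltas after it, which is why periodic compaction keeps state loads cheap.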
4. Issues in the Current Implementation
● Scalability
  ○ Uses executor JVM memory to store the states, so state store size is limited by the size of executor memory.
  ○ Executor JVM memory is shared between state storage and other task operations, so state store size impacts the performance of task execution.
● Latency
  ○ GC pauses, executor failures, and OOM errors become common as the state store grows, increasing the overall latency of a micro-batch.
5. RocksDB-Based Implementation
● Use RocksDB instead of a HashMap for each shuffle partition
● RocksDB
  ○ A storage engine with a key/value interface, based on LevelDB
  ○ New writes are inserted into a memtable; when the memtable fills up, it flushes the data to local storage
  ○ Supports both point lookups and range scans, and provides different types of ACID guarantees
  ○ Optimized for flash storage
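The memtable/flush behaviour can be pictured with a toy LSM-style store (not real RocksDB code; the class and threshold are invented for illustration):

```python
# Toy illustration of the memtable idea: writes land in an in-memory map,
# which is frozen into a sorted immutable "file" (like an SSTable) once it
# exceeds a size threshold. Reads check the memtable first, then the
# flushed files from newest to oldest.

MEMTABLE_LIMIT = 2  # tiny threshold so the example flushes quickly

class ToyLSMStore:
    def __init__(self):
        self.memtable = {}
        self.sstables = []  # sorted key/value lists, newest last

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) > MEMTABLE_LIMIT:
            # Flush: freeze the memtable as a sorted immutable "file".
            self.sstables.append(sorted(self.memtable.items()))
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:               # point lookup: memtable first
            return self.memtable[key]
        for table in reversed(self.sstables):  # then newest flushed file
            for k, v in table:
                if k == key:
                    return v
        return None

store = ToyLSMStore()
for i in range(5):
    store.put(f"k{i}", i)
print(store.get("k0"))  # 0 (served from a flushed "SSTable")
print(store.get("k4"))  # 4 (still in the memtable)
```

Because flushed files are sorted, range scans and point lookups both stay efficient, and writes never block on disk-resident data.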
6. End Goals
● Consistency
  ○ On successful batch completion, all state updates should be committed
  ○ On failure, none of the updates should be committed
  ○ Solution => use transactional RocksDB
● Isolation
  ○ The thread that opens the transaction in write mode should see its own updates
  ○ Another thread that opens the DB in read-only mode should not see any uncommitted updates
  ○ Solution => back up the RocksDB into a separate folder after every batch completion
● Durability
  ○ Checkpoint the delta into an S3 folder
  ○ Compact all delta files into a snapshot once in a while
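The consistency and isolation goals amount to all-or-nothing batch commits. A minimal sketch (plain Python, not the RocksDB transaction API) of those semantics:

```python
# Hedged sketch: buffer a batch's writes in a transaction and apply them to
# the store only on commit, so a failed batch leaves no partial updates.

class StoreTxn:
    def __init__(self, store: dict):
        self.store = store
        self.pending = {}

    def put(self, key, value):
        self.pending[key] = value

    def get(self, key):
        # The writing thread sees its own uncommitted updates first.
        return self.pending.get(key, self.store.get(key))

    def commit(self):
        self.store.update(self.pending)  # all updates applied atomically
        self.pending = {}

    def abort(self):
        self.pending = {}                # none of the updates survive

store = {"campaign-1": 36}
txn = StoreTxn(store)
txn.put("campaign-1", 55)
print(txn.get("campaign-1"))    # 55 inside the transaction
print(store.get("campaign-1"))  # 36 for a read-only view, until commit
txn.commit()
print(store.get("campaign-1"))  # 55 after commit
```

Backing up the committed DB into a separate folder after each batch gives readers a stable view that never includes an in-flight transaction.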
7. RocksDb State Store Implementation
[Architecture diagram: Executor 1 on a worker node runs tasks for partition P1 (batch V3) and partition P2 (batch V3), each through a RocksDB client backed by local hard disk directories /P1 and /P2. Delta files per batch (P1v1.delta, P1v2.delta, P1v3.delta; P2v1.delta, P2v2.delta, P2v3.delta) are uploaded to the external filesystem, and a maintenance thread writes snapshot files (e.g. P1V2.snapshot) for each partition.]
8. Implementation
● Creation of new state
  For batch x and partition Pi:
  ○ If Node(Pi, x) = Node(Pi, x-1): state is already loaded in RocksDB
  ○ Else if Node(Pi, x) = Node(Pi, x-2): update the RocksDB state using the downloaded Delta(Pi, x-1)
  ○ Otherwise, create a new RocksDB store from the checkpointed data (snapshot + deltas)
● During batch execution
  ○ Open RocksDB in transactional mode
  ○ On successful completion of the batch:
    ■ Commit the transaction
    ■ Upload the delta file into the checkpoint folder (S3/HDFS)
    ■ Create a backup of the current DB state in local storage
  ○ Abort the transaction on any error
● Snapshot creation (maintenance task)
  ○ Create a tarball of the last backed-up DB state and upload it to the checkpoint folder
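The state-creation decision above can be sketched as a small function (names like `node_for` and the placement table are illustrative, not the actual Spark implementation):

```python
# Sketch of the state-loading decision: pick the cheapest way to get the
# RocksDB state for partition `partition` in batch `batch`, given
# `node_for(p, v)`, the executor that ran partition p in batch v.

def load_strategy(node_for, partition, batch):
    here = node_for(partition, batch)
    if node_for(partition, batch - 1) == here:
        # Same node as the previous batch: state is already in local RocksDB.
        return "reuse-local"
    if node_for(partition, batch - 2) == here:
        # Node from two batches ago: apply only the missing delta file.
        return "apply-delta"
    # Otherwise rebuild from checkpointed snapshot + deltas.
    return "rebuild-from-checkpoint"

# Example placement: P1 stayed put, P2 moved away and came back, P3 is new
# to this executor.
placement = {("P1", 1): "exec-1", ("P1", 2): "exec-1", ("P1", 3): "exec-1",
             ("P2", 1): "exec-1", ("P2", 2): "exec-2", ("P2", 3): "exec-1",
             ("P3", 1): "exec-2", ("P3", 2): "exec-2", ("P3", 3): "exec-1"}
node = lambda p, v: placement.get((p, v))

print(load_strategy(node, "P1", 3))  # reuse-local
print(load_strategy(node, "P2", 3))  # apply-delta
print(load_strategy(node, "P3", 3))  # rebuild-from-checkpoint
```

The two fast paths avoid downloading a full snapshot whenever the scheduler keeps (or quickly returns) a partition to the same executor.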
9. Performance Analysis
● Setup
  ○ Master: r3.xlarge; Executor: r3.xlarge
  ○ Driver size: 8 GB; Executor size: 12.5 GB
  ○ Campaign dataset: 10,000 unique campaigns
  ○ Ingestion rate: 5k events per second
● Sink
  ○ 3-node MSK (Kafka) cluster
● Query
  ○ Sliding-window aggregation on event time and campaign-id
  ○ For each key, ~150 bytes of state data is generated
● Config
  ○ shuffle partitions = 200 and 8
  ○ spark.dynamicAllocation.maxExecutors = 2
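As a config fragment, the benchmark settings above might be expressed in a PySpark session like this (the application name is invented; the option keys are standard Spark settings, values taken from the slide):

```python
# Hypothetical session config mirroring the benchmark setup above.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("rocksdb-state-store-benchmark")          # name is illustrative
         .config("spark.sql.shuffle.partitions", 200)       # also run with 8
         .config("spark.dynamicAllocation.maxExecutors", 2)
         .getOrCreate())
```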