The document discusses rewriting a claims reimbursement system using Spark. It describes how Spark provides better performance, scalability and cost savings compared to the previous Oracle-based system. Key points include using Spark for ETL to load data into a Delta Lake data lake, implementing the business logic in a reusable Java library, and seeing significant increases in processing volumes and speeds compared to the prior system. Challenges and tips for adoption are also provided.
5. Why Spark
▪ Performant and Scalable
▪ Works in both high and low volume scenarios.
▪ Applicable even for applications that don’t have the three Vs: Volume, Velocity, and Variety.
▪ Stability
▪ Easy and mature API.
▪ A mature set of tools is available for development and support.
▪ Cost
▪ The free open-source version is production grade.
▪ Possibility of cost savings with dynamic scaling on public cloud.
6. Why Spark
▪ Easy to Adopt
▪ Few concepts to master – distributed nature of execution, lazy evaluation.
▪ The API is very fluent.
▪ Java and Python are common skill sets.
▪ Easy IDE based development – no VM needed as in Hadoop.
▪ Code can be unit tested.
▪ Easy to apply Object Oriented and functional programming.
▪ Code can be reused for Batch and Streaming mode.
▪ Compatibility
▪ It is compatible with almost all technologies.
▪ A large number of data source drivers are available.
▪ Active Community Support
▪ There is plenty of support on Stack Overflow, Spark’s official documentation, and Databricks blogs.
9. Claim ETL Rewrite
▪ Spark 2.4.2, open source.
▪ Data lake in Parquet format.
▪ Plan to migrate to Delta Lake.
▪ Spark standalone cluster.
▪ On premise cluster.
▪ Plan to move to Cloud.
▪ Production support tools.
▪ Zeppelin notebook.
▪ Spark Shell.
11. Gains
▪ Deprecation of old codebase and technical debt.
▪ Cost Savings
▪ Realized in the Cloud with dynamic scaling.
▪ Data will be available in a compressed, splittable format for any other processing.
▪ It is a great infrastructure for ad hoc data analysis, machine learning, etc.
12. Challenges
▪ Operationalization
▪ New toolset to interact with data.
▪ Data-lake can’t be updated easily.
▪ Debugging takes longer as accessing data from data-lake is slow.
▪ New skills to learn to support the new system.
13. Challenges
▪ Custom tools and scripts developed by the support team need to be refactored.
▪ Cost Savings
▪ Old infrastructure cost may not phase out immediately.
▪ Development
▪ New skill sets.
▪ Adoption
▪ Possible data consistency issues if you choose a dual-write architecture.
19. Implementation Highlights
▪ The same claim reimbursement library is used across streaming, batch, and REST API frameworks.
▪ The main business logic is implemented as a library written in Java.
▪ Java is chosen so the core library can be called from both Scala and Java code.
▪ Spark-SQL API is not used for writing business logic so that the core library is reusable across other integration frameworks.
▪ Conversion is needed between Java and Scala objects, either by copying values explicitly or by using ‘import collection.JavaConverters._’ to convert collections between Scala and Java.
▪ Spark is used for its scalability
▪ Spark handles the consumption and processing of data.
▪ The Dataset API is used for the Domain-Driven implementation.
▪ The claim data structure is modeled as an aggregate object so that join operations can be avoided. This lets the reimbursement calculation run as a ‘map’ operation (see the sketch below).
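To make this concrete, below is a minimal sketch of the batch driver under these assumptions. Every name in it (Claim, ClaimLine, ReimbursementCalculator, the paths and fields) is hypothetical; the calculator object merely stands in for the plain-Java reimbursement library so the snippet is self-contained.

```scala
// Minimal sketch, not the actual implementation. It shows the shape of the approach:
// an aggregate Claim object (header plus nested lines) so no join is needed, a typed
// Dataset, and the reimbursement calculation as a per-claim 'map' that calls a
// Spark-free library, converting Scala collections to Java at the boundary.
import scala.collection.JavaConverters._
import org.apache.spark.sql.{Dataset, SparkSession}

// Hypothetical aggregate claim: header fields plus nested line items in one record.
case class ClaimLine(lineNo: Int, procedureCode: String, billedAmount: BigDecimal)
case class Claim(claimId: String, memberId: String, lines: Seq[ClaimLine])
case class ReimbursedClaim(claimId: String, reimbursedAmount: BigDecimal)

// Stand-in for the plain-Java reimbursement library; it knows nothing about Spark.
object ReimbursementCalculator {
  def calculate(lines: java.util.List[ClaimLine]): BigDecimal =
    lines.asScala.map(_.billedAmount).sum * BigDecimal("0.8") // placeholder rule
}

object ClaimBatchJob {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("claim-reimbursement-batch").getOrCreate()
    import spark.implicits._

    // Read the aggregate claims; no join with a separate line-item table is needed.
    val claims: Dataset[Claim] = spark.read.parquet("/datalake/claims").as[Claim]

    // Reimbursement as a 'map': each claim is processed independently and in parallel.
    val reimbursed: Dataset[ReimbursedClaim] = claims.map { claim =>
      val javaLines = claim.lines.asJava // Scala -> Java at the library boundary
      ReimbursedClaim(claim.claimId, ReimbursementCalculator.calculate(javaLines))
    }

    reimbursed.write.mode("overwrite").parquet("/datalake/reimbursed_claims")
    spark.stop()
  }
}
```

Because each claim is a self-contained aggregate, the map step parallelizes trivially, and the same library call can back the streaming and REST API paths.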
20. Results
▪ We have not seen this volume getting processed in our PL/SQL-based system.
▪ The DB time marked with * is for the ‘insert’ operation; Update and Delete operations are far slower.
Process | Volume | Time | Throughput (Spark) | Throughput (Baseline)
Claim Reimbursement (batch mode, file system to file system) | 80M | 86 minutes | 1M claims/minute | –
Pushing this result to an Oracle DB | 80M | *160 minutes | 0.5M claims/minute | –
Total | 80M | ~4 hrs | 333K claims/minute (20 vCPU, 100GB memory) | 400K claims/hr (20 vCPU, 50GB memory, Oracle)
21. Cost
Technology | Cost
Spark on Azure Databricks @ $4/hour (DS15v2, 20 CPUs, 140 GB, Premium Tier, Data Engineering) | $20 + storage
Oracle on cloud | Dynamic scaling is not available for non-Exadata workloads. We have not seen such a large volume go through, so the time is not comparable.
PG on Azure (with support) @ $4/hour of a G5 server | Given the complexity of the calculation process, I am not sure the job time is comparable.
22. Delta lake adoption
▪ We are using open source Delta Lake, not Databricks’ managed Delta Lake.
▪ We plan to use Databricks’ managed delta-lake when we migrate to Azure.
▪ Data is partitioned by one key and ordered by another key.
▪ Z-order doesn’t work outside Databricks’ environment. Ordering data by a single key is good enough for us.
▪ However, the more keys you use in Z-order, the less effective it becomes.
▪ The OPTIMIZE command doesn’t work outside Databricks’ environment either; the Delta lake needs to be rebuilt to compensate for that.
▪ The partition and order keys are chosen based on the most frequent access patterns so that the best performance gain can be achieved (see the rebuild sketch below).
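Since OPTIMIZE and Z-ordering are unavailable outside Databricks, rebuilding the table approximates the same layout: repartition on the partition key, sort within partitions by the order key, and overwrite. A minimal sketch follows, assuming open-source delta-core on the classpath; the paths and the claim_date/member_id keys are hypothetical placeholders for the real partition and order keys.

```scala
// Minimal sketch: rebuild the Parquet data lake as a Delta table partitioned by one key
// and ordered within each partition by another, as a stand-in for OPTIMIZE/Z-order
// outside the Databricks environment. Paths and column names are hypothetical.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object DeltaLakeRebuild {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("claim-delta-rebuild").getOrCreate()

    val claims = spark.read.parquet("/datalake/claims_parquet")

    claims
      .repartition(col("claim_date"))     // cluster rows of each partition value together
      .sortWithinPartitions("member_id")  // order by the secondary access key
      .write
      .format("delta")
      .partitionBy("claim_date")          // the most frequently filtered key
      .mode("overwrite")
      .save("/datalake/claims_delta")
  }
}
```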
23. Delta lake adoption
▪ All queries having a filter condition on these keys run extremely fast compared to the Parquet-based data lake.
▪ As you might know, Delta Lake improves performance by skipping data.
▪ All queries not having a filter condition on these keys run at the same speed as before (illustrated below).
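To illustrate the two cases with hypothetical column names: a filter on the partition/order keys lets Delta prune partitions and skip files via their min/max statistics, while a filter on any other column scans everything and runs at roughly Parquet speed.

```scala
// Illustrative only: the same Delta table, one query that can skip data and one that cannot.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object DataSkippingDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("delta-data-skipping").getOrCreate()
    val claims = spark.read.format("delta").load("/datalake/claims_delta")

    // Filter on the partition/order keys: partitions are pruned and files are skipped
    // via min/max statistics, so only a small slice of the data is read.
    claims.filter(col("claim_date") === "2019-10-01" && col("member_id") === "M12345").count()

    // Filter on a column that is neither partitioned nor ordered: full scan,
    // roughly the same speed as the plain Parquet data lake.
    claims.filter(col("provider_npi") === "1234567890").count()
  }
}
```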
24. Performance Comparison
Process (20 vCPUs, 100GB memory, 97 GB data, 87M records) | Data lake (Parquet) | Data lake (Delta)
FILTER on keys on which the Delta lake has been partitioned and/or ordered, SELECT all columns | 18m | 20s
FILTER on keys on which the Delta lake has been partitioned and/or ordered, SELECT 1 column | 3s | <1s
UPDATE using partitioned and/or ordered columns | 1h 10m | 37s
MERGE 6,000 records | 1h 10m | 23m
Left-outer join between 87M and 6K records on partitioned and/or ordered keys | 36m | 20s
FILTER on keys on which the Delta lake has NOT been partitioned or ordered | 18m | 18m
25. Expanding the Horizon
▪ By rewriting an RDBMS-based application using Spark, we are looking forward to expanding our capabilities much further.
▪ High volume operation has become a matter of hardware scaling.
▪ Not depending on SQL optimization for complicated business logic.
▪ Integration with streaming workload has become a reality.
▪ Applying machine learning on our dataset starts to look easier.
▪ Saving cost on license and support looks closer.
26. Tips
▪ Running Spark on premises doesn’t save a ton of cost. Plan for cloud migration as a core component of the rewrite project.
▪ Consider changing production support procedures, ad hoc tool sets, the inertia of people used to the RDBMS, debug time on the data lake, etc. as part of the rewrite project.
▪ Use Delta Lake. There is hardly any reason to use a plain Parquet-based data lake.
27. Tips
▪ An optimum schema design will provide the biggest performance gain and implementation simplicity. This needs to be thought out properly.
▪ The Dataset API makes the development process much easier.
▪ Easy to implement complicated logic.
▪ A team new to Spark can use a language like Java/Scala to write the bulk of the logic.
▪ Business logic needn’t depend on Spark syntax, which would tie its execution to Spark.
▪ Existing libraries can be used.
▪ It is slower than the DataFrame API (see the comparison sketch below).
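A small, hypothetical contrast of the two styles: the typed Dataset version keeps the logic in ordinary code that is easy to reuse and unit test but is a black box to Catalyst, while the DataFrame version expresses the same aggregate as Spark SQL expressions the optimizer can see, which is generally faster. The schema, path, and 0.8 factor are placeholders.

```scala
// Sketch only: the same aggregate computed with the typed Dataset API and with
// DataFrame/Spark SQL expressions, to show the reuse-vs-speed tradeoff.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, sum}

case class LineAmount(claimId: String, billedAmount: Double)

object DatasetVsDataFrame {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("dataset-vs-dataframe").getOrCreate()
    import spark.implicits._

    val lines = spark.read.parquet("/datalake/claim_lines").as[LineAmount]

    // Dataset style: plain code carries the logic; Catalyst sees only a black-box lambda.
    val typedTotal = lines.map(l => l.billedAmount * 0.8).reduce(_ + _)

    // DataFrame style: the same aggregate as Spark SQL expressions, fully optimizable.
    val untypedTotal = lines
      .select((col("billedAmount") * 0.8).as("reimbursed"))
      .agg(sum("reimbursed"))
      .first()
      .getDouble(0)

    println(s"typed = $typedTotal, untyped = $untypedTotal")
  }
}
```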
28. Tips
▪ Be cognizant of all pieces of the data pipeline. Removing a bottleneck from one component may shift it to other components. Ideally, separate the scalable components from the non-scalable ones.
▪ Open-source Spark and Delta Lake are pretty good and production grade. They could be a cheaper option for you.
▪ A Spark application is a practical solution even if you don’t have those 6 Vs calling for a big data application.