In this data age, business applications generate big data, and data models are the key to generating value from it. Data models serve various purposes, and it is essential that they deliver reliable insights in a timely fashion. This session covers the technical aspects of leveraging Spark's distributed engine to process big data and generate insights, including several ways to optimize processes with Spark SQL. Come join me to explore the process of making data interesting!
2. About me
● Jayesh Patel
● Currently working at Rockstar Games
● Over 14 years building data-driven processes for Healthcare, Pharmaceuticals, Games, Marketing Services, and Public Transportation
● Expertise in developing end-to-end analytic solutions, big data ingestion, data
modeling, and machine learning pipelines
● IEEE Senior Member and contributor
3. Agenda
● Data Modeling Overview
● Traditional Data Modeling Challenges
● Big Data Models
● Apache Spark and Spark SQL
● Data Modeling with Big Data
● Case Study
● Spark SQL Demo
● Q & A
4. Data Modeling: OLTP to OLAP to Big Data
● Data Model: A method to organize and store data
● OLTP
● OLAP
● ER Modeling
● Dimensional Modeling
5. Challenges of Scaling an RDBMS
● Expensive
● Excessive machine time and long query response times
● Cannot handle data variety
6. Big Data: 5V
● Volume: Data size
● Velocity: Speed of change
● Variety: Different forms and sources of data
● Veracity: Uncertainty of data
● Value: Business value
7. Traditional vs Big Data Models
● Traditional: design first, then implement
● Big Data: discover first, then analyze

Traditional                                  | Big Data
Top-Down, Hierarchical                       | Distributed, Democratic
Passive, Push                                | Collaborative, Interactive
Manageable volume with steady growth of data | Massive volume with exponential growth of data
Main purpose -> BI                           | Main purpose -> Statistical Analysis & ML
Design -> Implement                          | Discover -> Analyze
8. Current: Big Data Models
● Too many models
● Which model should I use for my task?
● This model shows X and another model shows Y for the same metric. Why?
● Why did this model stop refreshing after January 2019?
● My query doesn't respond when I try to join these models.
9. What to expect?
● Performance: Fast queries and reduced I/O
● Cost: Reusability of insights
● Efficiency: Value added through data utilization
● Quality: Consistent metrics and fewer computing errors
10. Apache Spark
● Powerful open source processing engine built around speed & ease of use
● Unified analytics engine for big data and machine learning
● One of the largest open source communities in big data
12. Spark SQL
● Integrates relational processing with Spark’s functional programming
● Offers distributed in-memory computation at massive scale (see the sketch below)
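A minimal sketch of that integration in PySpark, assuming a registered claims table whose claim_id, provider_id, and amount columns are purely illustrative:

    # Mixing relational (SQL) and functional (DataFrame) processing.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("spark-sql-intro").getOrCreate()

    # Relational side: a plain SQL query over a registered table.
    claims = spark.sql("SELECT claim_id, provider_id, amount FROM claims")

    # Functional side: chain DataFrame transformations on the SQL result.
    totals = (claims
              .groupBy("provider_id")
              .agg(F.sum("amount").alias("total_amount")))
    totals.show()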
13. Spark SQL
● Supports HiveQL and SQL.
● Offers standard functions, aggregate functions, and window functions for DataFrames (see the example below)
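An illustrative example of the aggregate and window support; the claims table and its provider_id, claim_date, and amount columns are assumptions:

    # A window function (row_number) and a windowed aggregate (running sum).
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F
    from pyspark.sql.window import Window

    spark = SparkSession.builder.appName("window-functions").getOrCreate()

    # Order each provider's claims by date.
    w = Window.partitionBy("provider_id").orderBy("claim_date")

    ranked = (spark.table("claims")
              .withColumn("claim_seq", F.row_number().over(w))        # window function
              .withColumn("running_total", F.sum("amount").over(w)))  # windowed aggregate
    ranked.show()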
14. Spark vs Hive: Data Modeling
Reference: Apache Spark @Scale: A 60 TB+ production use case
15. Big Data Modeling
● Still think dimensionally
● Integrate disparate data sources using conformed dimensions
● Expect to integrate structured, semi-structured, and unstructured data
● Divide and Conquer with distributed processing
○ Avoid joining large fact tables (see the sketch below).
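A sketch of the divide-and-conquer point under assumed table names (claim_facts, payment_facts, dim_date): reduce each large fact to a shared grain first, then join the small summaries, broadcasting the conformed dimension.

    # Integrate two large facts through a conformed date dimension instead
    # of joining the facts to each other directly. Names are illustrative.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("conformed-dims").getOrCreate()

    dim_date = spark.table("dim_date")        # small conformed dimension
    claims   = spark.table("claim_facts")     # large fact table
    payments = spark.table("payment_facts")   # large fact table

    # Aggregate each fact to the daily grain; broadcasting the small
    # dimension avoids shuffling the large fact tables for the join.
    daily_claims = (claims.join(F.broadcast(dim_date), "date_key")
                    .groupBy("date_key")
                    .agg(F.count("*").alias("no_of_claims")))
    daily_paid = (payments.join(F.broadcast(dim_date), "date_key")
                  .groupBy("date_key")
                  .agg(F.sum("amount").alias("total_paid")))

    # The only remaining join is between two small, pre-aggregated models.
    daily = daily_claims.join(daily_paid, "date_key")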
16. How?
● One grain = One model: Store all measures for the same grain in one model
● De-normalize: One huge, wide "fat" table is better than multiple large tables
● Batch Model: Data volume for a day may be too high; break it down into smaller batches (a sketch follows this list)
● Transactional Models: To avoid large table scans, an intermediate transactional model can keep data ready for analytical models
● Data Model Lineage: Very important for dependency management
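A sketch combining the de-normalize and batch-model ideas, assuming hypothetical raw_events and dim_user tables and hour-level batching:

    # Process one hourly batch at a time and append it to a de-normalized
    # "fat" model table. Table, column, and partition names are hypothetical.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("batch-model").getOrCreate()

    def load_batch(day: str, hour: int) -> None:
        batch = (spark.table("raw_events")
                 .where((F.col("event_date") == day) &
                        (F.col("event_hour") == hour)))
        # De-normalize: fold dimension attributes into one wide table so
        # downstream analytical queries avoid repeated large joins.
        wide = batch.join(F.broadcast(spark.table("dim_user")), "user_id")
        (wide.write.mode("append")
             .partitionBy("event_date", "event_hour")
             .saveAsTable("events_wide"))

    # Break one day's volume into 24 smaller batches.
    for h in range(24):
        load_batch("2019-01-01", h)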
17. Case Study: Healthcare Provider Analytics
● Modeling Healthcare Provider Analytics
○ Claims analytics: no. of claims per day, claims denied, claims rejected
○ Payment stats: as-of balance, outstanding AR
○ Patient stats: no. of patients per day, no. of repeat patients, no. of new patients
18. Case Study: Healthcare Provider Analytics
● Modeling Healthcare Provider Analytics for a SaaS provider
○ Provides an electronic medical records (EMR), practice management (PM), and revenue cycle management (RCM) platform for healthcare providers
○ Using Apache Spark, Python and Kudu
● Metrics
○ Claims Facts: no. of claims per day, claims denied, claims rejected (see the sketch below)
○ Payment Facts: as-of balance, outstanding AR
○ Patient Facts: no. of patients per day, no. of repeat patients, no. of new patients
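An illustration of how the daily claims facts might be computed with Spark SQL; the claims schema (service_date, status) is an assumption, not the actual production model:

    # Daily claims facts: counts of total, denied, and rejected claims.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("claims-facts").getOrCreate()

    claims_daily = spark.sql("""
        SELECT service_date,
               COUNT(*)                                              AS no_of_claims,
               SUM(CASE WHEN status = 'DENIED'   THEN 1 ELSE 0 END)  AS claims_denied,
               SUM(CASE WHEN status = 'REJECTED' THEN 1 ELSE 0 END)  AS claims_rejected
        FROM claims
        GROUP BY service_date
    """)
    claims_daily.show()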