In this data age, business applications generate big data, and data models are the key to generating value from it. Data models serve various purposes, and it is essential that they deliver reliable insights in a timely fashion. This session covers the technical aspects of leveraging Spark's distributed engine to process big data and generate insights, including several ways to optimize processes with Spark SQL. Come join me to explore the process of making data interesting!
2. About me
● Jayesh Patel
● Currently working at Rockstar Games
● Over 14 years building data-driven processes for Healthcare, Pharmaceuticals, Games, Marketing Services, and Public Transportation
● Expertise in developing end-to-end analytic solutions, big data ingestion, data
modeling, and machine learning pipelines
● IEEE Senior Member and contributor
3. Agenda
● Data Modeling Overview
● Traditional Data Modeling Challenges
● Big Data Models
● Apache Spark and Spark SQL
● Data Modeling with Big Data
● Case Study
● Spark SQL Demo
● Q & A
4. Data Modeling: OLTP to OLAP to Big Data
● Data Model: A method to organize and store data
● OLTP
● OLAP
● ER Modeling
● Dimensional Modeling
5. Challenges of Scaling an RDBMS
● Expensive
● Excessive machine time and long query response times
● Cannot handle data variety
6. Big Data: 5V
● Volume: Data size
● Velocity: Speed of change
● Variety: Different forms and sources of data
● Veracity: Uncertainty of data
● Value: Business value
7. Traditional vs Big Data Models
● Traditional: design first, then implement
● Big Data: discover first, then analyze

Traditional                                  | Big Data
Top-Down, Hierarchical                       | Distributed, Democratic
Passive, Push                                | Collaborative, Interactive
Manageable volume with steady growth of data | Massive volume with exponential growth of data
Main purpose -> BI                           | Main purpose -> Statistical Analysis & ML
Design -> Implement                          | Discover -> Analyze
8. Current: Big Data Models
● Too many models
● Which model should I use for my task?
● This model shows X and another model shows Y for the same metric. Why?
● Why did this model stop refreshing after January 2019?
● My query doesn't respond when I try to join these models.
9. What to expect?
● Performance: Fast queries and reduced I/O
● Cost: Reusability of insights
● Efficiency: Value added through data utilization
● Quality: Consistent metrics and fewer computing errors
10. Apache Spark
● Powerful open source processing engine built around speed & ease of use
● Unified analytics engine for big data and machine learning
● One of the largest open source communities in big data
12. Spark SQL
● Integrates relational processing with Spark’s functional programming
● Offers distributed in-memory computation at massive scale (see the sketch below)
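A minimal sketch of that integration in PySpark, assuming a registered claims table whose claim_id, provider_id, and amount columns are purely illustrative:

    # Mixing relational (SQL) and functional (DataFrame) processing.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("spark-sql-intro").getOrCreate()

    # Relational side: a plain SQL query over a registered table.
    claims = spark.sql("SELECT claim_id, provider_id, amount FROM claims")

    # Functional side: chain DataFrame transformations on the SQL result.
    totals = (claims
              .groupBy("provider_id")
              .agg(F.sum("amount").alias("total_amount")))
    totals.show()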
13. Spark SQL
● Supports HiveQL and SQL.
● Offers standard functions, aggregate functions, and window functions for DataFrames (see the example below)
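An illustrative example of the aggregate and window support; the claims table and its provider_id, claim_date, and amount columns are assumptions:

    # A window function (row_number) and a windowed aggregate (running sum).
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F
    from pyspark.sql.window import Window

    spark = SparkSession.builder.appName("window-functions").getOrCreate()

    # Order each provider's claims by date.
    w = Window.partitionBy("provider_id").orderBy("claim_date")

    ranked = (spark.table("claims")
              .withColumn("claim_seq", F.row_number().over(w))        # window function
              .withColumn("running_total", F.sum("amount").over(w)))  # windowed aggregate
    ranked.show()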
14. Spark vs Hive: Data Modeling
Reference: Apache Spark @Scale: A 60 TB+ production use case
15. Big Data Modeling
● Still think dimensionally
● Integrate disparate data sources using conformed dimensions
● Expect to integrate structured, semi-structured, and unstructured data
● Divide and Conquer with distributed processing
○ Avoid joining large fact tables (see the sketch below).
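A sketch of the divide-and-conquer point under assumed table names (claim_facts, payment_facts, dim_date): reduce each large fact to a shared grain first, then join the small summaries, broadcasting the conformed dimension.

    # Integrate two large facts through a conformed date dimension instead
    # of joining the facts to each other directly. Names are illustrative.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("conformed-dims").getOrCreate()

    dim_date = spark.table("dim_date")        # small conformed dimension
    claims   = spark.table("claim_facts")     # large fact table
    payments = spark.table("payment_facts")   # large fact table

    # Aggregate each fact to the daily grain; broadcasting the small
    # dimension avoids shuffling the large fact tables for the join.
    daily_claims = (claims.join(F.broadcast(dim_date), "date_key")
                    .groupBy("date_key")
                    .agg(F.count("*").alias("no_of_claims")))
    daily_paid = (payments.join(F.broadcast(dim_date), "date_key")
                  .groupBy("date_key")
                  .agg(F.sum("amount").alias("total_paid")))

    # The only remaining join is between two small, pre-aggregated models.
    daily = daily_claims.join(daily_paid, "date_key")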
16. How?
● One grain = One model: Store all measures for the same grain in one model
● De-normalize: One huge, wide "fat" table is better than multiple large tables
● Batch Model: Data volume for a day may be too high; break it down into smaller batches (a sketch follows this list)
● Transactional Models: To avoid large table scans, an intermediate transactional model can keep data ready for analytical models
● Data Model Lineage: Very important for dependency management
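A sketch combining the de-normalize and batch-model ideas, assuming hypothetical raw_events and dim_user tables and hour-level batching:

    # Process one hourly batch at a time and append it to a de-normalized
    # "fat" model table. Table, column, and partition names are hypothetical.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("batch-model").getOrCreate()

    def load_batch(day: str, hour: int) -> None:
        batch = (spark.table("raw_events")
                 .where((F.col("event_date") == day) &
                        (F.col("event_hour") == hour)))
        # De-normalize: fold dimension attributes into one wide table so
        # downstream analytical queries avoid repeated large joins.
        wide = batch.join(F.broadcast(spark.table("dim_user")), "user_id")
        (wide.write.mode("append")
             .partitionBy("event_date", "event_hour")
             .saveAsTable("events_wide"))

    # Break one day's volume into 24 smaller batches.
    for h in range(24):
        load_batch("2019-01-01", h)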
17. Case Study: Healthcare Provider Analytics
● Modeling Healthcare Provider Analytics
○ Claims analytics: no. of claims per day, claims denied, claims rejected
○ Payment stats: as-of balance, outstanding AR
○ Patient stats: no. of patients per day, no. of repeat patients, no. of new patients
18. Case Study: Healthcare Provider Analytics
● Modeling Healthcare Provider Analytics for a SaaS provider
○ Provides an electronic medical records (EMR), practice management (PM), and revenue cycle management (RCM) platform for healthcare providers
○ Using Apache Spark, Python and Kudu
● Metrics
○ Claims Facts: no. of claims per day, claims denied, claims rejected (see the sketch below)
○ Payment Facts: as-of balance, outstanding AR
○ Patient Facts: no. of patients per day, no. of repeat patients, no. of new patients
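An illustration of how the daily claims facts might be computed with Spark SQL; the claims schema (service_date, status) is an assumption, not the actual production model:

    # Daily claims facts: counts of total, denied, and rejected claims.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("claims-facts").getOrCreate()

    claims_daily = spark.sql("""
        SELECT service_date,
               COUNT(*)                                              AS no_of_claims,
               SUM(CASE WHEN status = 'DENIED'   THEN 1 ELSE 0 END)  AS claims_denied,
               SUM(CASE WHEN status = 'REJECTED' THEN 1 ELSE 0 END)  AS claims_rejected
        FROM claims
        GROUP BY service_date
    """)
    claims_daily.show()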