Big Data Modeling with Spark SQL
Make data valuable
About me
● Jayesh Patel
● Currently working at Rockstar Games
● Over 14 years of building data-driven processes for Healthcare,
Pharmaceuticals, Games, Marketing Services, and Public Transportation
● Expertise in developing end-to-end analytic solutions, big data ingestion, data
modeling, and machine learning pipelines
● Senior IEEE member and contributor
Agenda
● Data Modeling Overview
● Traditional Data Modeling Challenges
● Big Data Models
● Apache Spark and Spark SQL
● Data Modeling with Big Data
● Case Study
● Spark SQL Demo
● Q & A
Data Modeling: OLTP to OLAP to Big Data
● Data Model: A method to organize and store data
● OLTP
● OLAP
● ER Modeling
● Dimensional Modeling
Challenges on Scaling RDBMS
● Expensive
● Too much machine time and long query response time
● Cannot handle data variety
Big Data: 5V
● Volume: Data size
● Velocity: Speed of change
● Variety: Different forms of sources
● Veracity: Uncertainty of data
● Value: Business value
Traditional Vs Big Data Models
● Traditional: Design first and then implement
○ Top-down, hierarchical
○ Passive, push
○ Manageable volume with steady growth of data
○ Main purpose: BI
● Big Data: Discover and then analyze
○ Distributed, democratic
○ Collaborative, interactive
○ Massive volume with exponential growth of data
○ Main purpose: Statistical analysis & ML
Current: Big Data Models
● Too many models
● Which models should I use for my task?
● This model shows X and another model shows Y for the same metric. Why?
● Why did this model stop refreshing after January 2019?
● Why doesn’t my query respond when I try to join these models?
What to expect?
● Performance: Quick queries and reduced I/O
● Cost: Reusability of insights
● Efficiency: Value addition through data utilization
● Quality: Consistent metrics and fewer computation errors
Apache Spark
● Powerful open source processing engine built around speed & ease of use
● Unified analytics engine for big data and machine learning
● One of the largest open source communities in big data
Apache Spark
● Core: Resilient Distributed Datasets
● Parallelize transformations and computations
● Fault Tolerant
● Evaluates Lazily
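The lazy-evaluation point can be illustrated without a cluster. The following is a minimal sketch in plain Python, where generators stand in for RDD transformations; it is an analogy rather than Spark code, and the names `events`, `double`, and `calls` are made up for illustration.

```python
# Plain-Python analogy for lazy evaluation (not Spark code).
calls = []  # records which input elements were actually processed


def double(x):
    calls.append(x)
    return x * 2


events = range(1, 1_000_001)  # hypothetical input of one million values

# "Transformations": building the pipeline does no work yet.
doubled = map(double, events)
multiples_of_three = filter(lambda v: v % 3 == 0, doubled)

work_before_action = len(calls)  # nothing has been evaluated so far

# "Action": asking for results pulls only as many elements as needed.
first_two = [next(multiples_of_three) for _ in range(2)]
work_after_action = len(calls)
```

Only the first six inputs are touched to produce two results, which is the same idea that lets an engine plan a whole chain of transformations before reading any data.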
Spark SQL
● Integrates relational processing with Spark’s functional programming
● Offers distributed in-memory computations on a massive scale
Spark SQL
● Supports HiveQL and SQL
● Offers standard functions, aggregation and window functions for DataFrames
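As a rough illustration of combining aggregation with a window function, here is a small sketch. It uses Python's built-in `sqlite3` module as a stand-in engine so it runs without a cluster (window functions need SQLite 3.25 or newer); the SQL is standard and is expected, though not verified here, to carry over to Spark SQL. The `claims` table and its columns are hypothetical.

```python
import sqlite3

# In-memory database standing in for a SQL engine (assumption: not Spark).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE claims (provider TEXT, claim_date TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO claims VALUES (?, ?, ?)",
    [
        ("A", "2019-01-01", 100.0),
        ("A", "2019-01-01", 50.0),
        ("A", "2019-01-02", 25.0),
        ("B", "2019-01-01", 10.0),
    ],
)

query = """
WITH daily AS (
    -- Aggregation: one row per provider per day
    SELECT provider, claim_date, SUM(amount) AS total
    FROM claims
    GROUP BY provider, claim_date
)
SELECT provider, claim_date, total,
       -- Window function: running total per provider, ordered by day
       SUM(total) OVER (PARTITION BY provider ORDER BY claim_date) AS running_total
FROM daily
ORDER BY provider, claim_date
"""
rows = conn.execute(query).fetchall()
```

The window function keeps one row per provider and day while adding the running total, so no self-join is needed.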
Spark vs Hive: Data Modeling
Reference: Apache Spark @Scale: A 60 TB+ production use case
Big Data Modeling
● Still think dimensionally
● Integrate disparate data sources using conformed dimensions
● Expect to integrate structured, semi-structured and unstructured data
● Divide and Conquer with distributed processing
○ Avoid joining large fact tables.
How?
● One grain = One model: Store all measures for the same grain in one model
● Denormalize: One huge, wide table is better than multiple large tables that must be joined
● Batch Model: Data volume for a day may be too high. Break it down into smaller
batches
● Transactional Models: To avoid large table scans, an intermediate transactional
model can keep data ready for analytical models
● Data Model Lineage: Very important for dependency management
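One way to make lineage usable for dependency management is to record, for each model, the models it reads from, and derive the refresh order from that. The sketch below uses Python's standard `graphlib` module (Python 3.9+); the model names are hypothetical and only loosely inspired by the case study.

```python
from graphlib import TopologicalSorter

# Hypothetical lineage: each model maps to the set of models it reads from.
lineage = {
    "claims_txn": {"raw_claims"},
    "payments_txn": {"raw_payments"},
    "claims_stats": {"claims_txn"},
    "payment_stats": {"payments_txn", "claims_txn"},
    "provider_analytics": {"claims_stats", "payment_stats"},
}

# static_order() yields every model after all of its upstream dependencies.
refresh_order = list(TopologicalSorter(lineage).static_order())


def downstream_of(model):
    """Return every model that directly or indirectly reads from `model`."""
    affected, frontier = set(), [model]
    while frontier:
        current = frontier.pop()
        for name, upstream in lineage.items():
            if current in upstream and name not in affected:
                affected.add(name)
                frontier.append(name)
    return affected
```

With the lineage recorded this way, `downstream_of("raw_claims")` lists the models to re-run after a late or corrected source load, and `refresh_order` gives a safe order in which to run them.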
Case Study: Healthcare Provider Analytics
● Modeling Healthcare Provider Analytics
○ Claims analytics: number of claims per day, claims denied, claims rejected
○ Payment stats: as-of balance, outstanding AR
○ Patient stats: number of patients per day, number of repeat patients, number of new patients
Case Study: Healthcare Provider Analytics
● Modeling Healthcare Provider Analytics for a SaaS provider
○ Provides electronic medical records (EMR), practice management (PM) and revenue cycle
management (RCM) platform for healthcare providers
○ Using Apache Spark, Python and Kudu
● Metrics
○ Claims Facts: number of claims per day, claims denied, claims rejected
○ Payment Facts: as-of balance, outstanding AR
○ Patient Facts: number of patients per day, number of repeat patients, number of new patients
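The claims metrics above all share one grain (provider, day), so they can live in a single model. The following is a minimal sketch of that idea, again using Python's built-in `sqlite3` as a stand-in for the Spark and Kudu stack named in the slides. The table names `claims_txn` and `claims_facts`, the columns, and the status values are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical transactional model: one row per claim.
conn.execute(
    "CREATE TABLE claims_txn "
    "(claim_id INTEGER, provider_id TEXT, claim_date TEXT, status TEXT)"
)
conn.executemany(
    "INSERT INTO claims_txn VALUES (?, ?, ?, ?)",
    [
        (1, "P1", "2019-08-01", "accepted"),
        (2, "P1", "2019-08-01", "denied"),
        (3, "P1", "2019-08-01", "rejected"),
        (4, "P1", "2019-08-02", "accepted"),
        (5, "P2", "2019-08-01", "denied"),
    ],
)

# One grain = one model: every claims measure for (provider, day) in one table.
conn.execute(
    """
    CREATE TABLE claims_facts AS
    SELECT provider_id,
           claim_date,
           COUNT(*) AS claims,
           SUM(CASE WHEN status = 'denied' THEN 1 ELSE 0 END) AS claims_denied,
           SUM(CASE WHEN status = 'rejected' THEN 1 ELSE 0 END) AS claims_rejected
    FROM claims_txn
    GROUP BY provider_id, claim_date
    """
)

facts = conn.execute(
    "SELECT * FROM claims_facts ORDER BY provider_id, claim_date"
).fetchall()
```

Because all three measures sit in one row per provider and day, a dashboard query reads a single table instead of joining separate claim, denial, and rejection models.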
Case Study: Healthcare Provider Analytics
● Provider Analytics
○ Claims Stats
○ Payments Stats
○ Patient Stats
Results
● Near real time refresh
● Easy to maintain & backfill
● Independent of delays in source data processing
● Fewer joins
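Backfills tend to be easier when each run covers a small, well-defined window that can be re-run on its own. The sketch below splits a time range into fixed-size batches; the six-hour batch size and the function name `batch_windows` are assumptions, not details from the talk.

```python
from datetime import datetime, timedelta


def batch_windows(start, end, hours=6):
    """Yield half-open [batch_start, batch_end) windows covering start..end."""
    step = timedelta(hours=hours)
    current = start
    while current < end:
        # The last window is clipped so it never runs past `end`.
        yield current, min(current + step, end)
        current += step


# One day of data broken into four six-hour batches.
windows = list(batch_windows(datetime(2019, 8, 1), datetime(2019, 8, 2)))
```

Each window can then be passed as a filter to the job that refreshes a model, so a failed or late batch is re-run without touching the rest of the day.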
Spark SQL Demo
Military Network Interaction Data from UCI
Q & A
Connect on LinkedIn
Thanks

Data Con LA 2019 - Big Data Modeling with Spark SQL: Make data valuable by Jayesh Patel
