SlideShare una empresa de Scribd logo
1 de 24
Descargar para leer sin conexión
WIFI SSID:SparkAISummit | Password: UnifiedAnalytics
Peter James, Healthdirect Australia
How Australia’s National Health Services
Directory Improved Data Quality, Reliability, and
Integrity with Databricks Delta and Structured
Streaming
#UnifiedAnalytics #SparkAISummit
About Us
Healthdirect Australia designs and delivers
innovative services for governments to provide
every Australian with 24/7 access to the trusted
information and advice they need to manage
their own health and health-related issues.
3#UnifiedAnalytics #SparkAISummit
Our Impact
4#UnifiedAnalytics #SparkAISummit
National Health Services Directory
The National Health Services Directory (NHSD) holds core
information about healthcare organisations, services and
practitioners.
• Hospital, EMR, Registry Data sharing Arrangements
• Clinical Pathway Applications
• Telehealth and Contact Centre Services
• Support eHealth Secure Messaging Delivery
• API Integration using Fast Healthcare Interoperability Resources (FHIR)
• Consumer API’s and Web Widgets
• Sector Analytics and Reporting
5#UnifiedAnalytics #SparkAISummit
Supporting Health Planning
NHSD is used widely for analysis & planning
6#UnifiedAnalytics #SparkAISummit
healthmap.com.au Royal Flying Doctors Service’s, SPOT
platform
Landscape Changes
A Strategic Assessment defined a long-range vision of
success
• Australian Health Sector promotes adoption of HL7/FHIR, Provider
Directory Federation and Open Data standards
• Business drivers pushing for Cost reduction and operational efficiencies
• Architecture needed to scale to meet the objectives defined for future
success
• Need to develop new Analytics capabilities in support of Sector Health
Planning
• Improve Information Security to support with broader usage scenarios and
regulatory changes
7#UnifiedAnalytics #SparkAISummit
Challenges
Data Quality and Governance
• Availability and access to Systems Of Record
• Often no Single Source Of Truth
• High variety of data sources from Federal, State, Public/Private Hospitals,
EMR, and other Commercial Vendors
• Health Domains are complex - Ontologies, Taxonomies, Thesauri, Code
sets (e.g. MeSH, SNOMED-CT, ANZSIC)
• Record Linking Issues, identity harmonisation, quasi-identifiers
• Disjoint Data Governance (Network of Data Governance participants)
8#UnifiedAnalytics #SparkAISummit
Challenges
Data Silos
• Data stored in multiple subsystems
– File Systems, ETL and workflow transient storage systems
– Multiple RDBMS, Multiple Schemas
– Search Engine Indexes
• Read Inconsistency
– Data out-of-synch between Search, Databases and Analytics storage
– Large batch updates of low quality data leading to high error rates and process inefficiencies
– Transactional boundaries per Data Silo no holistic end-to-end consistency
• Data Access
– High operational overheads processes
– Incompatible with Security boundaries
9#UnifiedAnalytics #SparkAISummit
Data Silos to Data Aggregation
Single point of access to the component pieces
10#UnifiedAnalytics #SparkAISummit
AHPRA
(Registries)
Medicare
Jurisdictions
Aged
Care
Secure
Message
Vendors
Desktop
Vendors
Hospital
Networks
Others
NHSD
Challenges
Data Scale
• Processes need to scale to ~1 Billion Data Points
• New demands include Bookings, Appointments, Referrals,
Pricing and eHealth Transaction Activity – est. 1TB p.a.
• Support Data federation and Interoperability requirements with
requests growing > 58% p.a.
• Existing systems were already under pressure with batch
overruns and significant administration overheads
11#UnifiedAnalytics #SparkAISummit
Architecture Overview
12#UnifiedAnalytics #SparkAISummit
Architecture Improvements
Databricks DELTA to create Logical Data Zones
• LANDING, RAW, STAGE, GOLD (i.e. Bronze, Silver, Gold)
• Store data ‘as-is’ either structured and/or unstructured data in DELTA Tables
and Physical Partitions
• Read Consistency, Data integrity - ACID transactions
• Used DELTA Cache for frequent queries with transparent, automated cache
control
• Databricks Operational Control Plane for Cluster Administration,
Management functions like, Access Control, Jobs and Schedules
• Data Control Plane runs within our AWS Accounts under our Security Policy
and regulatory compliance
13#UnifiedAnalytics #SparkAISummit
Architecture Improvements
SPARK Structured Streaming for Continuous Applications
• Create a Streaming versions of existing batch ETL jobs.
• Move to Event based Data flows via Streaming Input, Kinesis, S3, and
DELTA
• Running more frequently lead to smaller and more manageable data
volumes
• Automation of release processes through Continuous Delivery and use
of Databricks REST API
• Recoverability through Checkpoints and Reliability through Streaming
Sinks to DELTA tables
14#UnifiedAnalytics #SparkAISummit
Databricks Cluster
Continuous Applications –
Continued
• Keeps 'in-memory’ state of object sets
including comparable versions for object
changes
• Stateful transformations
(flatMapGroupsWithState)
• Aggregations for Data Quality
measurements
• Built User-defined libraries for Data
Cleansing, Validation and complex logic
15#UnifiedAnalytics #SparkAISummit
def appendAttributeStats(
self, sourceDF,
attribute_name,
metrics = ['mean', 'min',
'max', 'median',
'num_missing',
'num_unique']):
Data Plane & Pipeline
16#UnifiedAnalytics #SparkAISummit
Architecture Improvements
Unified Analytics and Operations
• Data cleaning & matching algorithms reliable, predictable and measurable
• Data Quality Analytics calculated at runtime, attached to object attributes,
populate analysts dashboards
• Track Data Lineage and Lifecycles providing Operational visibility of where our
data is at any point
• Complex processes easily broken down into simpler steps with ‘checks and
balances’ prior to execution
• Significant cost reduction through decommissioning complex Administration
UI in favour of notebook environment
17#UnifiedAnalytics #SparkAISummit
Improved Algorithm Development &
Usage
18#UnifiedAnalytics #SparkAISummit
Data Sharing Agreements Data Integration Algos &
Processes
Data Quality
Auditing
Algorithms
• sound design, goal-focused
• verified against trusted data
• version-controlled, documented
• archived with training & validation
data
Data
• permanent, versioned
• use provenance to enhance algos,
revise confidence estimate
Data Quality
Reporting
Algorithm Evaluation
19#UnifiedAnalytics #SparkAISummit
FPR TPR threshold
0.014433 0.800855 1.00
0.014433 0.801709 0.99
0.014433 0.831054 0.98
0.014433 0.855840 0.97
0.016495 0.895726 0.95
0.026804 0.950997 0.90
• interpretation
– the algorithm is appropriate to the domain
– the data is of good quality
– un-matchable names are rare exceptions
– a matching threshold of 0.95 gives
TPR = 95.1%
FPR = 2.7%
ROC curve for the set-based, fuzzy
name-matching algorithm
Improved Data Processing
• 95% fuzzy name matches vs. <80% and dependent
on manual verification
• ~30,000 Automated Updates / mth vs. ~30,000 bi-
annual & Manual
• 20 mins full load vs. > 24 hrs
• 20 million records @ ~1 million/min
20#UnifiedAnalytics #SparkAISummit
Improved Data Security
• Databricks Platform provided foundational Security Accreditations
• Able to cross-walk these security compliance frameworks and localize
to Australian Regulatory frameworks
• This provided significant cost/effort reduction of GRC
• Continuous Data Assurance - Security Guard-Rails monitor changes to
access privileges, e.g. role elevation/conditional access, data security
metadata data and auditing against data leakage and spills.
21#UnifiedAnalytics #SparkAISummit
Securing it ALL
22#UnifiedAnalytics #SparkAISummit
Other Benefits
• Don’t have to be big to see benefits, strategic value and a new business model
• Strategic momentum through realizing business vision with technology transformation
• Dataset Transparency and Explainable Models with well-documented lineage and
quality
• Broader participation - no one holds a single parts of knowledge anymore, we extract
more value in our data when everyone was able to access it
• Governance transitioned focus on operational improvements
• Inked Agreement in August 2018 completed migration to Databricks December–
Live March 2019
23#UnifiedAnalytics #SparkAISummit
DON’T FORGET TO RATE
AND REVIEW THE SESSIONS
SEARCH SPARK + AI SUMMIT

Más contenido relacionado

Más de Databricks

Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
Databricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Databricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
Databricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
Databricks
 
Jeeves Grows Up: An AI Chatbot for Performance and Quality
Jeeves Grows Up: An AI Chatbot for Performance and QualityJeeves Grows Up: An AI Chatbot for Performance and Quality
Jeeves Grows Up: An AI Chatbot for Performance and Quality
Databricks
 
Intuitive & Scalable Hyperparameter Tuning with Apache Spark + Fugue
Intuitive & Scalable Hyperparameter Tuning with Apache Spark + FugueIntuitive & Scalable Hyperparameter Tuning with Apache Spark + Fugue
Intuitive & Scalable Hyperparameter Tuning with Apache Spark + Fugue
Databricks
 
Infrastructure Agnostic Machine Learning Workload Deployment
Infrastructure Agnostic Machine Learning Workload DeploymentInfrastructure Agnostic Machine Learning Workload Deployment
Infrastructure Agnostic Machine Learning Workload Deployment
Databricks
 

Más de Databricks (20)

Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
 
Machine Learning CI/CD for Email Attack Detection
Machine Learning CI/CD for Email Attack DetectionMachine Learning CI/CD for Email Attack Detection
Machine Learning CI/CD for Email Attack Detection
 
Jeeves Grows Up: An AI Chatbot for Performance and Quality
Jeeves Grows Up: An AI Chatbot for Performance and QualityJeeves Grows Up: An AI Chatbot for Performance and Quality
Jeeves Grows Up: An AI Chatbot for Performance and Quality
 
Intuitive & Scalable Hyperparameter Tuning with Apache Spark + Fugue
Intuitive & Scalable Hyperparameter Tuning with Apache Spark + FugueIntuitive & Scalable Hyperparameter Tuning with Apache Spark + Fugue
Intuitive & Scalable Hyperparameter Tuning with Apache Spark + Fugue
 
Infrastructure Agnostic Machine Learning Workload Deployment
Infrastructure Agnostic Machine Learning Workload DeploymentInfrastructure Agnostic Machine Learning Workload Deployment
Infrastructure Agnostic Machine Learning Workload Deployment
 
Improving Apache Spark for Dynamic Allocation and Spot Instances
Improving Apache Spark for Dynamic Allocation and Spot InstancesImproving Apache Spark for Dynamic Allocation and Spot Instances
Improving Apache Spark for Dynamic Allocation and Spot Instances
 
Importance of ML Reproducibility & Applications with MLfLow
Importance of ML Reproducibility & Applications with MLfLowImportance of ML Reproducibility & Applications with MLfLow
Importance of ML Reproducibility & Applications with MLfLow
 
Hyperspace for Delta Lake
Hyperspace for Delta LakeHyperspace for Delta Lake
Hyperspace for Delta Lake
 
How We Optimize Spark SQL Jobs With parallel and sync IO
How We Optimize Spark SQL Jobs With parallel and sync IOHow We Optimize Spark SQL Jobs With parallel and sync IO
How We Optimize Spark SQL Jobs With parallel and sync IO
 

Último

Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
amitlee9823
 
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
amitlee9823
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
AroojKhan71
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
amitlee9823
 
Probability Grade 10 Third Quarter Lessons
Probability Grade 10 Third Quarter LessonsProbability Grade 10 Third Quarter Lessons
Probability Grade 10 Third Quarter Lessons
JoseMangaJr1
 
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
amitlee9823
 
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
amitlee9823
 

Último (20)

Carero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxCarero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptx
 
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
 
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
 
Anomaly detection and data imputation within time series
Anomaly detection and data imputation within time seriesAnomaly detection and data imputation within time series
Anomaly detection and data imputation within time series
 
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
 
Smarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxSmarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptx
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
 
Predicting Loan Approval: A Data Science Project
Predicting Loan Approval: A Data Science ProjectPredicting Loan Approval: A Data Science Project
Predicting Loan Approval: A Data Science Project
 
Probability Grade 10 Third Quarter Lessons
Probability Grade 10 Third Quarter LessonsProbability Grade 10 Third Quarter Lessons
Probability Grade 10 Third Quarter Lessons
 
April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysis
 
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Research
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptx
 
ALSO dropshipping via API with DroFx.pptx
ALSO dropshipping via API with DroFx.pptxALSO dropshipping via API with DroFx.pptx
ALSO dropshipping via API with DroFx.pptx
 
VidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxVidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptx
 
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signals
 
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
 
Sampling (random) method and Non random.ppt
Sampling (random) method and Non random.pptSampling (random) method and Non random.ppt
Sampling (random) method and Non random.ppt
 

How Australia’s National Health Services Directory Improved Data Quality, Reliability, and Integrity with Databricks Delta and Structured Streaming

  • 1. WIFI SSID:SparkAISummit | Password: UnifiedAnalytics
  • 2. Peter James, Healthdirect Australia How Australia’s National Health Services Directory Improved Data Quality, Reliability, and Integrity with Databricks Delta and Structured Streaming #UnifiedAnalytics #SparkAISummit
  • 3. About Us Healthdirect Australia designs and delivers innovative services for governments to provide every Australian with 24/7 access to the trusted information and advice they need to manage their own health and health-related issues. 3#UnifiedAnalytics #SparkAISummit
  • 5. National Health Services Directory The National Health Services Directory (NHSD) holds core information about healthcare organisations, services and practitioners. • Hospital, EMR, Registry Data sharing Arrangements • Clinical Pathway Applications • Telehealth and Contact Centre Services • Support eHealth Secure Messaging Delivery • API Integration using Fast Healthcare Interoperability Resources (FHIR) • Consumer API’s and Web Widgets • Sector Analytics and Reporting 5#UnifiedAnalytics #SparkAISummit
  • 6. Supporting Health Planning NHSD is used widely for analysis & planning 6#UnifiedAnalytics #SparkAISummit healthmap.com.au Royal Flying Doctors Service’s, SPOT platform
  • 7. Landscape Changes A Strategic Assessment defined a long-range vision of success • Australian Health Sector promotes adoption of HL7/FHIR, Provider Directory Federation and Open Data standards • Business drivers pushing for Cost reduction and operational efficiencies • Architecture needed to scale to meet the objectives defined for future success • Need to develop new Analytics capabilities in support of Sector Health Planning • Improve Information Security to support with broader usage scenarios and regulatory changes 7#UnifiedAnalytics #SparkAISummit
  • 8. Challenges Data Quality and Governance • Availability and access to Systems Of Record • Often no Single Source Of Truth • High variety of data sources from Federal, State, Public/Private Hospitals, EMR, and other Commercial Vendors • Health Domains are complex - Ontologies, Taxonomies, Thesauri, Code sets (e.g. MeSH, SNOMED-CT, ANZSIC) • Record Linking Issues, identity harmonisation, quasi-identifiers • Disjoint Data Governance (Network of Data Governance participants) 8#UnifiedAnalytics #SparkAISummit
  • 9. Challenges Data Silos • Data stored in multiple subsystems – File Systems, ETL and workflow transient storage systems – Multiple RDBMS, Multiple Schemas – Search Engine Indexes • Read Inconsistency – Data out-of-synch between Search, Databases and Analytics storage – Large batch updates of low quality data leading to high error rates and process inefficiencies – Transactional boundaries per Data Silo no holistic end-to-end consistency • Data Access – High operational overheads processes – Incompatible with Security boundaries 9#UnifiedAnalytics #SparkAISummit
  • 10. Data Silos to Data Aggregation Single point of access to the component pieces 10#UnifiedAnalytics #SparkAISummit AHPRA (Registries) Medicare Jurisdictions Aged Care Secure Message Vendors Desktop Vendors Hospital Networks Others NHSD
  • 11. Challenges Data Scale • Processes need to scale to ~1 Billion Data Points • New demands include Bookings, Appointments, Referrals, Pricing and eHealth Transaction Activity – est. 1TB p.a. • Support Data federation and Interoperability requirements with requests growing > 58% p.a. • Existing systems were already under pressure with batch overruns and significant administration overheads 11#UnifiedAnalytics #SparkAISummit
  • 13. Architecture Improvements Databricks DELTA to create Logical Data Zones • LANDING, RAW, STAGE, GOLD (i.e. Bronze, Silver, Gold) • Store data ‘as-is’ either structured and/or unstructured data in DELTA Tables and Physical Partitions • Read Consistency, Data integrity - ACID transactions • Used DELTA Cache for frequent queries with transparent, automated cache control • Databricks Operational Control Plane for Cluster Administration, Management functions like, Access Control, Jobs and Schedules • Data Control Plane runs within our AWS Accounts under our Security Policy and regulatory compliance 13#UnifiedAnalytics #SparkAISummit
  • 14. Architecture Improvements SPARK Structured Streaming for Continuous Applications • Create a Streaming versions of existing batch ETL jobs. • Move to Event based Data flows via Streaming Input, Kinesis, S3, and DELTA • Running more frequently lead to smaller and more manageable data volumes • Automation of release processes through Continuous Delivery and use of Databricks REST API • Recoverability through Checkpoints and Reliability through Streaming Sinks to DELTA tables 14#UnifiedAnalytics #SparkAISummit
  • 15. Databricks Cluster Continuous Applications – Continued • Keeps 'in-memory’ state of object sets including comparable versions for object changes • Stateful transformations (flatMapGroupsWithState) • Aggregations for Data Quality measurements • Built User-defined libraries for Data Cleansing, Validation and complex logic 15#UnifiedAnalytics #SparkAISummit def appendAttributeStats( self, sourceDF, attribute_name, metrics = ['mean', 'min', 'max', 'median', 'num_missing', 'num_unique']):
  • 16. Data Plane & Pipeline 16#UnifiedAnalytics #SparkAISummit
  • 17. Architecture Improvements Unified Analytics and Operations • Data cleaning & matching algorithms reliable, predictable and measurable • Data Quality Analytics calculated at runtime, attached to object attributes, populate analysts dashboards • Track Data Lineage and Lifecycles providing Operational visibility of where our data is at any point • Complex processes easily broken down into simpler steps with ‘checks and balances’ prior to execution • Significant cost reduction through decommissioning complex Administration UI in favour of notebook environment 17#UnifiedAnalytics #SparkAISummit
  • 18. Improved Algorithm Development & Usage 18#UnifiedAnalytics #SparkAISummit Data Sharing Agreements Data Integration Algos & Processes Data Quality Auditing Algorithms • sound design, goal-focused • verified against trusted data • version-controlled, documented • archived with training & validation data Data • permanent, versioned • use provenance to enhance algos, revise confidence estimate Data Quality Reporting
  • 19. Algorithm Evaluation 19#UnifiedAnalytics #SparkAISummit FPR TPR threshold 0.014433 0.800855 1.00 0.014433 0.801709 0.99 0.014433 0.831054 0.98 0.014433 0.855840 0.97 0.016495 0.895726 0.95 0.026804 0.950997 0.90 • interpretation – the algorithm is appropriate to the domain – the data is of good quality – un-matchable names are rare exceptions – a matching threshold of 0.95 gives TPR = 95.1% FPR = 2.7% ROC curve for the set-based, fuzzy name-matching algorithm
  • 20. Improved Data Processing • 95% fuzzy name matches vs. <80% and dependent on manual verification • ~30,000 Automated Updates / mth vs. ~30,000 bi- annual & Manual • 20 mins full load vs. > 24 hrs • 20 million records @ ~1 million/min 20#UnifiedAnalytics #SparkAISummit
  • 21. Improved Data Security • Databricks Platform provided foundational Security Accreditations • Able to cross-walk these security compliance frameworks and localize to Australian Regulatory frameworks • This provided significant cost/effort reduction of GRC • Continuous Data Assurance - Security Guard-Rails monitor changes to access privileges, e.g. role elevation/conditional access, data security metadata data and auditing against data leakage and spills. 21#UnifiedAnalytics #SparkAISummit
  • 23. Other Benefits • Don’t have to be big to see benefits, strategic value and a new business model • Strategic momentum through realizing business vision with technology transformation • Dataset Transparency and Explainable Models with well-documented lineage and quality • Broader participation - no one holds a single parts of knowledge anymore, we extract more value in our data when everyone was able to access it • Governance transitioned focus on operational improvements • Inked Agreement in August 2018 completed migration to Databricks December– Live March 2019 23#UnifiedAnalytics #SparkAISummit
  • 24. DON’T FORGET TO RATE AND REVIEW THE SESSIONS SEARCH SPARK + AI SUMMIT