Data Scientist Allison Baker and Development Manager of Data Products Cody Hall work with a team of data scientists, software engineers, and web developers to build the framework and infrastructure for a real-time prediction application that can scale across the entire company. Central to this effort is integrating the production software architecture with the predictive models generated by H2O. This talk reviews how HCA is building a pipeline to predict patient outcomes in real time, relying heavily on H2O’s POJO scoring API, with data processing implemented in Clojure.
1. Predicting Patient Outcomes in Real-Time at HCA
Presentation by Allison Baker and Cody Hall
Hospital Corporation of America
Department of Data and Analytics, Clinical Services Group
July 20, 2016
2. Overview
CONFIDENTIAL - Contains proprietary information. Not intended for external distribution.
• Introduction to HCA
• Introduction to our team
• Data science pipeline
• Near real-time architecture
• Real-time architecture
• Current POC goals
3. Hospital Corporation of America
“Above all else, we are committed to the care and improvement of human life. In recognition of this commitment, we strive to deliver high-quality, cost-effective healthcare in the communities we serve.” – HCA Mission Statement
• Hospital Corporation of America (HCA) is the leading healthcare provider in the country
– 169 hospitals
– 116 freestanding surgery centers in 20 states and the U.K.
• Approximately 233,000 employees across the company
• Over 26 million patient encounters each year
• More than 8 million emergency room visits each year
• About 2 million inpatients treated annually
4. Where We Are
5. Data Science and Data Products Teams
• Dr. Martin Tobias, Data Scientist
• Sandeepkumar Kothiwale, Data Scientist
• Allison Baker, Data Scientist
• Dr. Nan Chen, Data Scientist
• Kunal Marwah, Data Scientist
• Gerardo Castro, Data Scientist
• Chris Cate, Data Scientist
• Igor Ges, Data Product Engineer
• Josh Wolter, BI Developer
• Dr. Jesse Spencer-Smith, Director of Data Science
• Dr. Edmund Jackson, Chief Data Scientist, VP of Data and Analytics
• Warren Sadler, Data Product Engineer
• Cody Hall, Development Manager of Data Products
• Nick Selleh, Application Engineer
6. CRISP-DM and Data Science
7. Problem Definition and Data Sourcing
• Begin by asking stakeholders and business owners “What business decisions will be made with the analysis results?”
• Document all project and product features, timelines and code using GitHub
• Source historical data using Teradata SQL
• Log all data sourcing and data extract steps using DRAKE
• Options
– Continuous integration
– Jenkins to monitor DRAKE builds
8. Data Manipulation
• Run preliminary visualization
• QA data testing for coverage, outliers, abnormalities, format and structural issues, frequency, duplication and accuracy
• Pre-process data
– Balance outcomes
– Filter patients
– Remove non-data
• Engineer features
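The "Balance outcomes" pre-processing step could look roughly like the sketch below. It is not taken from the talk: it assumes simple downsampling of every outcome class to the size of the rarest class, and the field names `patient_id` and `outcome` are hypothetical.

```python
import random
from collections import defaultdict


def balance_outcomes(rows, label_key="outcome", seed=0):
    """Downsample every outcome class to the size of the rarest class.

    `rows` is a list of dicts; `label_key` names the outcome column
    (a hypothetical name chosen for this illustration).
    """
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[label_key]].append(row)
    if not by_label:
        return []

    target = min(len(group) for group in by_label.values())
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    balanced = []
    for group in by_label.values():
        balanced.extend(rng.sample(group, target))
    rng.shuffle(balanced)
    return balanced


# Hypothetical toy data: 2 encounters with the outcome, 8 without.
rows = [{"patient_id": i, "outcome": 1 if i < 2 else 0} for i in range(10)]
balanced = balance_outcomes(rows)
```

Downsampling discards data from the common class. Upsampling the rare class or weighting observations are alternatives, and the slides do not say which approach the team used.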
9. Modeling
• Analytic server
– 64 cores
– 4 Terabytes of hard disk
– 1.5 Terabytes of RAM
• Iterate models
• Evaluate statistics
10. Interpretation and Reporting
• Consider
– Re-defining the problem
– Additional modeling
– Additional data sourcing
• Discuss results with clinical owners and business stakeholders
– Consider additional features
11. What Now?
• We can effectively engineer thousands of clinically and statistically relevant features.
• We can successfully build accurate, complex and sophisticated predictive models.
• How do we take these models to the patient bedside?
12. Delivering Value to the Business
13. Near Real-Time Tool
• Consists of 3 main components
– Data source (different from the historical training source)
– Scoring engine
– User interface
• Shows early value using a minimum viable product approach
• Phases the POC to include development time for the real-time architecture
• Updates in 15-minute batches
• Provides near real-time predictions
• Solicits feedback from facilities, focusing on accuracy and usefulness
14. Data Sources are Constantly Changing
15. Prediction Product
[Architecture diagram; the labels below are regrouped from the flattened figure text.]
• Inputs
– Facility + Team, Patient
– Measures + Events: vitals, lab results, orders, demographics, surgery times, nursing documentation
• Near Real-Time System
– Data Source: 15-minute batches, SQL defined
– ETL: independent SQL process
– Predictive Model: single POJO .jar, Clojure (FE library)
• Real-Time System
– Data Source: streaming, HL7QL defined
– Predictive Model + ETL: Clojure (FE library)/Spark job, PowderKeg
• Flow labels
– Measures + Events (HL-7), HL7QL (Spark), Kafka topics, EDN, Prediction
• Data Persistence
– OpenGate, MS SQL, PostgreSQL
– Analytic Store (HDFS cluster)
• Supporting Infrastructure
– GitHub & Nexus
– Jenkins
– Tableau
– PostgreSQL administration & monitoring
– Docker with Node JS (UI)
• User Interface (UI)
– Displays measures + events
– Notifications of predictions
– Prompt for acknowledgement or dismissal
– On acknowledgement, disable notifications for 12 hours
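The user-interface rule "on acknowledgement, disable notifications for 12 hours" can be illustrated with a small sketch. The class name `NotificationGate` and its methods are hypothetical. The slide does not say what happens on dismissal, so this sketch treats a dismissal as having no effect on future notifications.

```python
from datetime import datetime, timedelta

# Window taken from the slide: notifications are disabled for 12 hours
# after a clinician acknowledges a prediction.
SUPPRESSION_WINDOW = timedelta(hours=12)


class NotificationGate:
    """Tracks acknowledgements per patient (hypothetical helper class)."""

    def __init__(self):
        self._acknowledged_at = {}  # patient_id -> datetime of last acknowledgement

    def acknowledge(self, patient_id, now):
        """Record that a notification for this patient was acknowledged."""
        self._acknowledged_at[patient_id] = now

    def should_notify(self, patient_id, now):
        """Return True unless the patient was acknowledged within the window."""
        last_ack = self._acknowledged_at.get(patient_id)
        if last_ack is None:
            return True
        return now - last_ack >= SUPPRESSION_WINDOW


gate = NotificationGate()
t0 = datetime(2016, 7, 20, 8, 0)
gate.acknowledge("patient-1", t0)
```

In a deployed tool this state would presumably live in the persistence layer rather than in memory, so that it survives restarts.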
16. Real-Time Infrastructure
• Continuously consumes HL7 messages from a Kafka topic and parses them via Spark and HL7QL
• Processes (producers) publish messages to Kafka topics (categories), and consumers subscribe to those topics to process the message feeds
• Apache Spark provides the application interface for distributed (cluster) computing
• HL7 Query Language (HL7QL) parses the messages
• Scores (predicts) on new streaming information
– Runs a .jar file, compiled from Clojure code and the H2O POJO, via a Spark process
• Deploys with Docker
– Container-based application architecture
• Continuously monitors with Jenkins
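The consume, parse and score loop described on this slide can be sketched in plain Python. This is an illustration only, not HCA's implementation: an in-memory `Queue` stands in for the Kafka topic, a hand-written parser stands in for HL7QL, and the `score` function is a placeholder rule rather than an H2O POJO. The observation code `HR` and the sample message are invented for the example.

```python
from queue import Queue


def parse_hl7(message):
    """Split an HL7 v2 message into {segment name: [list of field lists]}.

    HL7 v2 separates segments with carriage returns and fields with "|".
    """
    segments = {}
    for line in message.strip().split("\r"):
        fields = line.split("|")
        segments.setdefault(fields[0], []).append(fields)
    return segments


def extract_features(segments):
    """Pull the patient identifier (PID-3) and numeric observations (OBX-3/OBX-5)."""
    patient_id = segments["PID"][0][3]
    observations = {}
    for obx in segments.get("OBX", []):
        code = obx[3].split("^")[0]  # first component of the observation identifier
        observations[code] = float(obx[5])
    return patient_id, observations


def score(observations):
    """Placeholder rule standing in for the real model; NOT a trained model."""
    return 1.0 if observations.get("HR", 0.0) > 100 else 0.0


def consume(topic, publish):
    """Drain the stand-in topic, scoring each message and publishing the result."""
    while not topic.empty():
        patient_id, observations = extract_features(parse_hl7(topic.get()))
        publish({"patient_id": patient_id, "prediction": score(observations)})


# Invented sample message with a heart-rate observation of 112 bpm.
SAMPLE = "\r".join([
    "MSH|^~\\&|SENDER|FACILITY|RECEIVER|FACILITY|20160720120000||ORU^R01|MSG0001|P|2.3",
    "PID|1||12345",
    "OBX|1|NM|HR^Heart rate||112|bpm",
])

topic = Queue()
topic.put(SAMPLE)
predictions = []
consume(topic, predictions.append)
```

In the architecture on the slide, the queue would be a Kafka consumer subscription, the parsing would be done by HL7QL on Spark, and `score` would call the compiled model inside the .jar.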
18. A Proof of Concept Use Case and Goals
Primary:
1. Assess clinical workflow to identify how the model can support the current clinical processes for treating negative patient outcomes
2. Determine the model’s capability to extract meaningful information from existing and available patient data and identify patterns that predict the outcome
3. Determine the usefulness of an early prediction model within a clinical workflow
Secondary:
1. Improve the prediction model through incorporation of feedback provided by the clinical team
2. Maximize the utility of the prediction tool to improve a clinical workflow for the facility staff
20. Questions
Editor's notes
Really focusing on the use of Tools
Architecture
Deployment
Add number of inpatients (~1.8 million)
Real-time: prediction is used to lengthen the intervention window for therapy.
Batch: for operational use.
Ask the right question
Gather data to support your hypotheses
Test your assumptions
- Get through this loop as quickly as possible -> H2O makes the modeling component straightforward and pain-free.
Don’t get caught up on this slide
Cross Industry Standard Process for Data Mining, commonly known by its acronym CRISP-DM, is a data mining process model that describes the overall approach to solving business (or clinical) problems with predictive analytics. Working through this process requires keeping both Business Understanding and Data Understanding at the forefront of everything.
Data preparation
Modeling
Evaluation
Deployment
The overarching goal is to extract knowledge from data, using predictive modeling to visualize and present data with an intelligent awareness of the clinical and/or business consequences
Data science projects begin by asking a clearly defined business question
What business decisions will be made using the results of the analysis?
What does “done” look like?
Establish that the project falls within one of five defined analysis types:
Type 1. Classification: Is this A or B?
Type 2. Anomaly Detection: Is this unusual?
Type 3. Regression: How much/how many?
Type 4. Unsupervised Learning: How is it organized?
Type 5. Prescriptive: What should I do next?
GitHub: web-based tool allowing for version control and SCM
Teradata SQL Assistant: Windows-based tool for building and running SQL queries against our EDW
DRAKE: workflow tool
SQL, R, Clojure
Balancing
Center and scale
Sampling
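As a rough illustration of the "center and scale" step in the notes above: subtract the training mean and divide by the training standard deviation, then reuse those same two numbers when scoring new data. The notes suggest the team does its pre-processing with SQL, R and Clojure; Python and the function names `fit_scaler` and `transform` are used here only for illustration.

```python
from statistics import mean, pstdev


def fit_scaler(values):
    """Learn the centering and scaling parameters from training data."""
    mu = mean(values)
    sigma = pstdev(values)
    # Guard against constant features, which would otherwise divide by zero.
    return mu, (sigma if sigma > 0 else 1.0)


def transform(values, mu, sigma):
    """Apply previously learned parameters to any data, training or new."""
    return [(v - mu) / sigma for v in values]


# Hypothetical training values for a single numeric feature.
train = [60.0, 70.0, 80.0]
mu, sigma = fit_scaler(train)
scaled = transform(train, mu, sigma)
```

Keeping `fit_scaler` separate from `transform` matters for deployment: the parameters learned on historical data have to travel with the model so that live patients are scaled the same way.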
Why do we use R vs. H2O?
Engineering features -> we do feature engineering (FE) outside of H2O, as a pre-processing step
Historically we were restricted by the computational availability of our laptops.
Nice visualizations for eval results!!!
Weak signal?
Apply the model to real live data and gain clinical feedback on patients we are seeing in our hospitals now
Build out infrastructure and architecture to score patients in real-time
Preventing negative patient outcomes and saving lives
H2O is the harness that runs on the JVM, bringing predictive models to the patients’ bedsides
Tableau helps you work with business to solve problems, quickly.
Want to use the model in real life and gain clinical feedback
Create a way for model to capture feedback through an application
See if the model fits into clinical workflow.
Near real-time does not scale
real-time in healthcare means HL7 based messaging.
Clojure encapsulates the pojo
Cloudera resilient distributed dataset
Doing all of this on every single commit
4 times an hour (05, 20, 35, 50) the job is started
• A Docker container is spun up, and a jar is executed
• Data is retrieved from OpenGate, aggregated and transformed
• Predictive model is applied
• Predictions are written to PostgreSQL
• Logs are stored and execution results are reported
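The schedule in the note above (minutes 05, 20, 35 and 50 of each hour) can be checked with a short sketch that computes the next start time. The function name `next_run` is hypothetical, and the talk does not say which scheduler actually triggers the job.

```python
from datetime import datetime, timedelta

# Start minutes taken from the speaker note: four runs per hour, 15 minutes apart.
RUN_MINUTES = (5, 20, 35, 50)


def next_run(now):
    """Return the first scheduled start time strictly after `now`."""
    base = now.replace(second=0, microsecond=0)
    for minute in RUN_MINUTES:
        candidate = base.replace(minute=minute)
        if candidate > now:
            return candidate
    # Past the last slot of this hour: roll over to the first slot of the next hour.
    return base.replace(minute=RUN_MINUTES[0]) + timedelta(hours=1)


upcoming = next_run(datetime(2016, 7, 20, 12, 7, 30))
```

In standard cron syntax the same schedule would be written `5,20,35,50 * * * *`.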
GOAL: The model accurately predicts patient outcomes earlier than those identified through current clinical processes