Working with Amazon Web Services (AWS) and 1Strategy, an Advanced AWS Consulting Partner, the Cambia Health data science teams have deployed HIPAA-compliant, secured Amazon EMR data pipelines in the cloud. In this session, we dive deep into the architectural components of this solution, and you will learn how AWS services have helped Cambia decrease processing time for analytics, increase application flexibility, and accelerate speed to production. The second part of the session covers machine learning and its role in reducing cost and improving quality of care. The healthcare community must rely on advanced analytics and machine learning to analyze multiple facets of healthcare data at scale and gain insights on the things that matter. You will learn why AWS is a well-suited platform for machine learning, and we will walk through the steps of building a model with Amazon Machine Learning for a real-world problem: predicting patient readmissions.
2. What to Expect from the Session
• Benefits from large-scale analytics with PHI - Arnoud
• Securing Amazon EMR & Elasticsearch - Rich
• Additional solution components for HIPAA compliance [demo] - Rich
• Reducing cost and improving quality of care with Amazon Machine Learning [demo] - Ujjwal
NOTE: This is a deep dive session on HOW rather than WHAT. We will show
implementation details.
• This session expects familiarity with:
• AWS services - EMR and S3
  • BDM401 - Deep Dive: Amazon EMR Best Practices & Design Patterns
  • BDA206 - Building Big Data Applications with the AWS Big Data Platform
• Encryption and distributed systems like Hadoop and Elasticsearch
4. Cambia Health Solutions
• Our Roots – Born from an inspired idea
• Our Cause – Becoming catalysts for transformation
• Our Vision – Delivering a reimagined health care experience
7. Master Data Management
Are these the same people?

Pair 1:
            Source A      Source B
First Name  John          John
Last Name   Doe           Doe
DOB         1970-01-01    2016-11-28
Street      105 Main St   105 Main St
City        Portland      Portland
State       OR            OR

Pair 2:
            Source A      Source B
First Name  Jillian       Jill
Last Name   Doe           Doe-Doe
SSN         123-45-6789   123-45-6789
Street      605 Oak Dr    105 Main Street
City        PDX           Portland
State       OR            Oregon

Pair 1: No. Father and son. Pair 2: Yes. Married, changed name, and moved.
This is artificial data fabricated for illustration purposes only.
8. Master Data Management – Approach
Source data: demographics, laboratory, pharmaceutics, geography, claims
Cambia runs match and merge on Amazon EMR to produce a composite record of best values.
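As a rough illustration of the match-and-merge idea, the sketch below scores record pairs by weighted field agreement and merges matches into a composite record. The field names, weights, and scoring logic are hypothetical; Cambia's production matching on EMR is more sophisticated.

```python
def normalize(value):
    """Lowercase and strip a field value for comparison."""
    return str(value).strip().lower()

def match_score(rec_a, rec_b, weights):
    """Weighted fraction of agreeing fields between two records (0.0-1.0)."""
    score = 0.0
    for field, weight in weights.items():
        if normalize(rec_a.get(field, "")) == normalize(rec_b.get(field, "")):
            score += weight
    return score / sum(weights.values())

def merge(rec_a, rec_b):
    """Composite record: prefer non-empty values from A, fall back to B."""
    merged = dict(rec_b)
    merged.update({k: v for k, v in rec_a.items() if v})
    return merged

# Illustrative weights: a shared SSN counts far more than a shared street.
weights = {"first_name": 1.0, "last_name": 1.0, "ssn": 3.0, "street": 0.5}
a = {"first_name": "Jillian", "last_name": "Doe",
     "ssn": "123-45-6789", "street": "605 Oak Dr"}
b = {"first_name": "Jill", "last_name": "Doe-Doe",
     "ssn": "123-45-6789", "street": "105 Main Street"}
score = match_score(a, b, weights)
```

On EMR, the same pairwise scoring would run distributed (e.g. as a Spark job over blocked candidate pairs) rather than record by record.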
9. Master Data Management – Quality

Match Correctness:  Vendor 98.50% · Cambia V1 99.90% · Cambia V1.1 99.99%
Match Completeness: Vendor 98.80% · Cambia V1 84.30% · Cambia V1.1 98.10%

Benchmark: 7,000+ records containing 1,600+ matches, manually checked and confirmed in the real world.
10. Master Data Management – Performance

Run time: Vendor 2,160 minutes (36 hours) · Cambia V1 90 minutes · Cambia V1.1 40 minutes

Workload: 17.7M records containing 1.8M matches.
11. Next Steps
• Scale in and out or up and down – Amazon EMR
• Build out healthcare data science models – Amazon Machine Learning
• HIPAA-compliant search on data – Amazon EC2
13. Definition of Terms
• At Rest – when data is in a stored location
• In Transit – when data is moving to and from storage
• In Process – when data is in temporary space during processing
17. Encryption at Rest
• EMRFS on S3 – achieved via S3 client-side encryption with AWS KMS.
• HDFS on EMR cluster – via Hadoop File System (HDFS) transparent data encryption, as described in the Apache docs, with the encryption settings held in a config file.
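On current EMR releases, at-rest settings like these can be captured in an EMR security configuration. The sketch below builds one the way a boto3 call might consume it; the names and KMS key ARN are placeholders, and the JSON field names should be verified against the EMR SecurityConfiguration documentation.

```python
import json

# Sketch of an EMR security configuration covering the at-rest settings above:
# S3 client-side encryption with KMS for EMRFS, plus local-disk encryption.
# Key ARN is a placeholder.
security_config = {
    "EncryptionConfiguration": {
        "EnableAtRestEncryption": True,
        "EnableInTransitEncryption": False,
        "AtRestEncryptionConfiguration": {
            "S3EncryptionConfiguration": {
                "EncryptionMode": "CSE-KMS",
                "AwsKmsKey": "arn:aws:kms:us-west-2:111122223333:key/EXAMPLE",
            },
            "LocalDiskEncryptionConfiguration": {
                "EncryptionKeyProviderType": "AwsKms",
                "AwsKmsKey": "arn:aws:kms:us-west-2:111122223333:key/EXAMPLE",
            },
        },
    }
}
config_json = json.dumps(security_config)

# With boto3 installed and AWS credentials configured, this would register it:
# boto3.client("emr").create_security_configuration(
#     Name="hipaa-at-rest",
#     SecurityConfiguration=config_json,
# )
```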
24. Encryption in Transit – EMRFS on S3, HDFS on EMR cluster

<configuration>
<!-- Client Certificate Store -->
<property>
<name>ssl.client.keystore.type</name>
<value>jks</value>
</property>
<property>
<name>ssl.client.keystore.location</name>
<value>/etc/emr/security/ssl/keystore.jks</value>
</property>
<property>
<name>ssl.client.keystore.password</name>
<value>changeit</value>
</property>
<!-- Client Trust Store -->
<property>
<name>ssl.client.truststore.type</name>
<value>jks</value>
</property>
<property>
<name>ssl.client.truststore.location</name>
<value>/etc/emr/security/ssl/truststore.jks</value>
</property>
<property>
<name>ssl.client.truststore.password</name>
<value>changeit</value>
</property>
<property>
<name>ssl.client.truststore.reload.interval</name>
<value>10000</value>
</property>
</configuration>
25. Encryption in Transit – Cluster (HDFS on EMR cluster)
Three areas to address:
1. Hadoop RPC – used by API clients of MapReduce
2. HDFS DTP – with HDFS transparent encryption enabled, this traffic is automatically encrypted
3. Hadoop MapReduce shuffle – MapReduce shuffles and sorts the output of each map task to reducers on different nodes
26. Encryption in Transit – Cluster (Hadoop RPC)
Hadoop RPC – used by API clients of MapReduce, between the RPC client, the EMR cluster, and EMRFS on S3.
27. Encryption in Transit – Cluster (RPC client configuration)
Note: to actually encrypt Hadoop RPC traffic, hadoop.rpc.protection must be set to privacy; the value authentication shown below provides authentication only.
<property>
<name>hadoop.security.service.user.name.key</name>
<value></value>
<description>
For those cases where the same RPC protocol is implemented by multiple
servers, this configuration is required for specifying the principal
name to use for the service when the client wishes to make an RPC call.
</description>
</property>
<property>
<name>hadoop.rpc.protection</name>
<value>authentication</value>
<description>A comma-separated list of protection values for secured sasl
connections. Possible values are authentication, integrity and privacy.
authentication means authentication only and no integrity or privacy;
integrity implies authentication and integrity are enabled; and privacy
implies all of authentication, integrity and privacy are enabled.
hadoop.security.saslproperties.resolver.class can be used to override
the hadoop.rpc.protection for a connection at the server side.
</description>
</property>
28. Encryption in Transit – Cluster (HDFS DTP)
HDFS Data Transfer Protocol (DTP) – with HDFS transparent encryption enabled, data transfer traffic between the EMR cluster and EMRFS on S3 is automatically encrypted. The Hadoop KMS manages the data encryption key (DEK) and the envelope data encryption key (EDEK).
29. Encryption in Transit – Cluster (DTP configuration)
<property>
<name>dfs.encrypt.data.transfer</name>
<value>true</value>
<description>
Whether or not actual block data that is read/written from/to HDFS should
be encrypted on the wire. This only needs to be set on the NN and DNs,
clients will deduce this automatically. It is possible to override this setting
per connection by specifying custom logic via dfs.trustedchannel.resolver.class.
</description>
</property>
<property>
<name>dfs.encrypt.data.transfer.algorithm</name>
<value></value>
<description>
This value may be set to either "3des" or "rc4". If nothing is set, then
the configured JCE default on the system is used (usually 3DES.) It is
widely believed that 3DES is more cryptographically secure, but RC4 is
substantially faster.
</description>
</property>
Hadoop Data Transfer Protocol (DTP) is configured on startup with a bootstrap script; the Hadoop KMS manages the data encryption key (DEK) and the envelope data encryption key (EDEK).
30. Encryption in Transit – Cluster (Hadoop encrypted shuffle and sort)
Hadoop MapReduce shuffle – in the shuffle phase, Hadoop MapReduce (MRv2) shuffles the output of each map task to reducers on different nodes, using HTTP by default.
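One way to turn on the encrypted shuffle, sketched as an EMR configuration classification: the Hadoop property is mapreduce.shuffle.ssl.enabled, and the keystores from the earlier ssl-client/ssl-server setup must already exist on every node. The classification shape follows the EMR configuration API; treat it as a sketch, not the exact production setting.

```python
# Sketch: enable SSL for the MapReduce shuffle via an EMR configuration
# classification (mapred-site). Requires the node keystores/truststores
# configured earlier in this deck.
shuffle_ssl_config = [
    {
        "Classification": "mapred-site",
        "Properties": {"mapreduce.shuffle.ssl.enabled": "true"},
    }
]
```

This list would be passed as the Configurations parameter when launching the cluster (e.g. via run_job_flow in boto3).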
32. Encryption in Transit – Cluster (Spark)
Spark block transfer service – this can be encrypted using SASL encryption in Spark 1.5.1 and later, covering traffic within the EMR cluster and to EMRFS on S3.
36. Bootstrap Script
function encrypt_disk() {
local dev=$1
local dir=$2
local cryptname="crypt_${dir:1}"
# Unmount the drive
sudo umount "$dev"
# Encrypt the drive
sudo cryptsetup luksFormat -q --key-file "$PWD_FILE" "$dev"
sudo cryptsetup luksOpen -q --key-file "$PWD_FILE" "$dev" "$cryptname"
# Format the drive
sudo mkfs -t xfs "/dev/mapper/$cryptname"
sudo mount -o defaults,noatime,inode64 "/dev/mapper/$cryptname" "$dir"
sudo rm -rf "$dir/lost+found"
sudo mkdir -p "$dir/encrypted"
sudo chown -R hadoop:hadoop "$dir"
echo "/dev/mapper/$cryptname $dir xfs defaults,noatime,inode64 0 0" |
sudo tee -a /etc/fstab
echo "$cryptname $dev $PWD_FILE" | sudo tee -a /etc/crypttab
}
This encrypts the temporary space on EBS volumes – encryption in process.
37. Summary of the EMR Encryption Process
• At rest – HDFS on EMR cluster, EMRFS on S3
• In transit – RPC, native DTP, Hadoop encrypted shuffle and sort
• In process – temporary space on EBS volumes
38. EMR Updates
• AWS EMR encryption documentation (September 21st, 2016): amzn.to/2g0JJIN
• 1Strategy blog: bit.ly/1strategy_emr
42. Elasticsearch Encryption Process Summary
• EMRFS on S3
• Temporary space on EBS volumes
• Elasticsearch on EC2 instances
43. HIPAA is more than encryption
Auditing & custom tools:
• Audit script to show that only limited users have access to encrypted S3 data
• Show that S3 buckets are encrypted
• Show that S3 objects are encrypted
*Working with Cambia to open source these tools: bit.ly/1strategy_emr_code
46. Machine Learning inside Healthcare
Analyzing Medical Images
Prescription Compliance Prediction
Evidence Based & Precision Medicine
Text classification and mining
Medicare and Medicaid Fraud
Hospital Bed Utilization
Treatment Queries and Suggestions
Drug Discovery and Clinical Trials
Population Health
Vaccination and Immunization
Omics and Clinical Data Integration
Patient Outcomes
Our focus: patient readmission prediction through risk stratification
47. Real World Problem – Hospital Readmissions
• Hospital Readmission Reduction
Program (HRRP) part of the Affordable
Care Act.
• Centers for Medicare & Medicaid
Services (CMS) required to reduce
payments to hospitals with excess
readmissions.
• Not all readmissions can be prevented.
• Facilities with high readmission rates had their Medicare payments cut by 1% in 2013, rising to 2% in 2014.
Source - www.ncbi.nlm.nih.gov/pmc/articles/PMC3558794
48. Our Focus
Utilizing AWS for Machine Learning (ML)
Continuum of machine learning solutions:

Amazon Machine Learning:
• Limited ML options – binary, multiclass, regression
• Simple to train
• Easy to evaluate
• Quick to deploy

Amazon EMR + Spark ML:
• Comprehensive ML options
• Requires work to train
• No support for evaluation
• Additional work to deploy
• Scalable
• Customizable
49. Introducing Amazon Machine Learning (AML)
• Easy to use, managed machine learning
service built for developers
• Robust, powerful machine learning
technology based on Amazon’s internal
systems
• Use your data already stored in the
AWS cloud
• Models in production within seconds
50. Machine Learning – Proactive Prediction of Readmission
Inputs: patient demographics, patient history, admission attributes, and other features
Output: each patient stratified as a high-risk, moderate-risk, or low-risk patient
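Risk stratification on top of a predicted readmission probability (e.g. the predictedScore of a binary Amazon ML model, in the 0-1 range) can be sketched as a simple tiering function. The cut-offs below are illustrative only, not Cambia's actual thresholds.

```python
def risk_tier(predicted_score, low_cut=0.3, high_cut=0.7):
    """Map a predicted readmission probability to a risk tier.

    Cut-offs are hypothetical; in practice they are tuned against the
    model's score distribution and the cost of missed readmissions.
    """
    if predicted_score >= high_cut:
        return "high"
    if predicted_score >= low_cut:
        return "moderate"
    return "low"
```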
55. Data Load and Standardization

Data Load:
COPY <Redshift_Table_Name> FROM 's3://<file_path.csv>'
CREDENTIALS 'aws_access_key_id=<>;aws_secret_access_key=<>'
DELIMITER ',' IGNOREHEADER 1;
Data Standardization:
• Update NULL values
• Change attribute values that do not comply with standard patterns (e.g., Phone = (206) XXX-XXXX)
• Complete geographical data where possible
• Include timeline values if possible
• Group granular attributes into sets (e.g., ages 0 to 20 as youth, 20 to 40 as adult, and so on)
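Two of the standardization steps above, masking phone numbers to a standard pattern and grouping granular ages into buckets, can be sketched as follows; the patterns and bucket boundaries are illustrative, not the exact production rules.

```python
import re

def mask_phone(phone):
    """Keep the area code, mask the rest: '(206) 555-0100' -> '(206) XXX-XXXX'.

    Returns None when the value does not match the expected pattern,
    flagging it for correction.
    """
    match = re.match(r"\((\d{3})\)", phone.strip())
    return f"({match.group(1)}) XXX-XXXX" if match else None

def age_group(age):
    """Group granular ages into coarse sets (boundaries are illustrative)."""
    if age < 20:
        return "youth"
    if age < 40:
        return "adult"
    if age < 65:
        return "middle-aged"
    return "senior"
```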
56. Create AML Data Source with Redshift
• Via the CreateDataSourceFromRedshift API
• Via the console
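A boto3 sketch of what the CreateDataSourceFromRedshift call takes: every identifier below is a placeholder, and the parameter names should be checked against the boto3 machinelearning reference before use.

```python
# Sketch of the DataSpec for CreateDataSourceFromRedshift.
# Database, cluster, credentials, query, and bucket are all placeholders.
redshift_data_spec = {
    "DatabaseInformation": {
        "DatabaseName": "claims",            # placeholder
        "ClusterIdentifier": "my-cluster",   # placeholder
    },
    "SelectSqlQuery": "SELECT * FROM readmissions",
    "DatabaseCredentials": {"Username": "<user>", "Password": "<password>"},
    "S3StagingLocation": "s3://my-bucket/aml-staging/",
}

# With boto3 and AWS credentials configured, the call would look like:
# boto3.client("machinelearning").create_data_source_from_redshift(
#     DataSourceId="ds-readmissions",
#     DataSpec=redshift_data_spec,
#     RoleARN="arn:aws:iam::111122223333:role/AmazonMLRedshiftRole",
#     ComputeStatistics=True,
# )
```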
57. Real-time Predictions Using the API
• Synchronous, low-latency, high-throughput prediction generation
• Request through the service API or the server or mobile SDKs
• Best for interactive applications that deal with individual data records
>>> import boto
>>> ml = boto.connect_machinelearning()
>>> ml.predict(
        ml_model_id='my_model',
        predict_endpoint='example_endpoint',
        record={'key1': 'value1', 'key2': 'value2'})
{
'Prediction': {
'predictedValue': 13.284348,
'details': {
'Algorithm': 'SGD',
'PredictiveModelType': 'REGRESSION'
}
}
}
59. Application Website Hosted on S3
var machinelearning = new AWS.MachineLearning({apiVersion: '2014-12-12'});
var params = {
    MLModelId: '<AML Model ID>',
    PredictEndpoint: '<AML Model Real Time End Point>',
    Record: <Selected Attributes record set>
};
var request = machinelearning.predict(params);
Application calls the Predict() API using necessary parameters
Website hosting in S3 without web servers eliminates complexities of
scaling hardware based on traffic routed to your application.
bit.ly/aml_demo - Demo bit.ly/hcl301_blog - Blog
60. Expanded Architecture
• Data flows from the corporate data center over the Internet into an Amazon S3 data lake (DB schemas, CSV files, unstructured files).
• Amazon EMR processes the unstructured and semi-structured data, making it suitable to act as an ML data source.
• An Amazon Machine Learning model is created with Amazon Redshift as the data source.
• Amazon EC2 serves as a frontend for the AML endpoint, serving users.
• Batch predictions are generated and stored in S3; an RDS schema acts as a source for Amazon QuickSight, which generates BI reports on the prediction data.
62. Join us tonight at the Health Care happy hour sponsored by Cambia Health Solutions, 8KMiles.com, and AWS at the Japonais restaurant in the Mirage on Monday 11/28 from 6-8 PM.

Do you want to know more about how to secure health data? AWS and Cambia are co-presenting:
SEC305 – Scaling Security Resources for Your First 10 Million Customers
Tuesday, Nov 29, 12:30 PM - 1:30 PM