2019 Triangle Machine Learning Day - Biomedical Image Understanding and EHRs at LifeOmic: Harnessing the Power of the Cloud - Matthew Phillips, September 20, 2019

Biomedical Image Understanding and EHRs at LifeOmic: Harnessing the Power of the Cloud
Duke Triangle ML Day, 9/20/2019

Overview
• My background
• LifeOmic and the Precision Health Cloud
• Cleaning up Clinical Notes
• Biomedical image segmentation
• Future work: Multimodal AI with EHRs

My Background
• Experimental Neuroscientist (CUNY, Columbia)
• Computational Neuroscientist (Duke)
• 3D CAM C++ Software Engineer (Align Technology)
• ML Researcher (Kitware, LifeOmic)
 ▪ Mainly computer vision
 ▪ More recently, applying NLP-like models to EHRs
 ▪ At LifeOmic for about 9 months

LifeOmic Team Overview
PEOPLE
• 35 cloud software developers (architecture, UX/UI, analytics, ML/AI)
• 15 mobile software developers
• 8 scientific experts (genetics and data science)
• 5 security experts
• 7 marketing and administration
CORE COMPETENCIES
• Enterprise cloud software development
• Large-scale architectures
• Global AWS deployment
• Machine learning and AI
• Security
• Genomic data processing, interpretation, and analytics
• Mobile application development
• iOS and Android
LOCATIONS
• Indianapolis (HQ)
• Research Triangle Park
• Salt Lake City

The Precision Health Cloud Ecosystem
• Data Ingestion: electronic medical records, medical images, REDCap, omics data, patient-acquired data
• Clinicians
• Patients
• Wearables and connected devices
• Researchers

Cloud/Mobile Precision Health Solution
FHIR | REST | GA4GH
• iOS and Android
• Evidence-based lifestyle factors proven to improve health
 ▪ Healthy plants
 ▪ Exercise
 ▪ Mindfulness
 ▪ Sleep
 ▪ Metabolic flexibility (intermittent fasting or time-restricted eating)
• Gamification
• Social interaction
• Based on the enormously successful LIFE app

INDIANA UNIVERSITY
Precision Health Initiative - Disease Focused
• Adult Cancer
• Pediatric - Sarcomas
• Multiple Myeloma
• Diabetes
• Alzheimer’s Disease
Pharmacogenomics

IU Precision Health Architecture
[Architecture diagram; labels recovered from the figure]
• Legend: IU System, LifeOmic System, Non-IU System
• Data Sources: Industry Sequences, CMG Sequences, REDCap, IU Health Clinical, Eskenazi Clinical, INPC Clinical, Imaging (e.g., Pathology), External Data Sources
• IU Data Staging: Data Quality and Standardization (FHIR); UITS DC2: FASTQ/BAM to VCF; UITS SDA Archive (FASTQ, BAM, VCF); Data Commons; Archiving
• LifeOmic PHC Platform: Intake (FHIR; BAM, VCF), Standardized VCFs, Data Storage, Auto Indexing, Data Quality, Analytics, ML / AI, Cohort Builder, Subject Viewer, Insights KB
• Access via API: LifeOmic PHC Apps, Surveys, R Studio, Tableau, 3rd Party Analytics Tools, LIFE Mobile
• Users: IU clinicians and researchers

LifeOmic Task Service – Bring Code to the Data
Data in PHC (e.g. sequencing, images, EHR, mobile) → Execute Docker-based tools against the data → Analyze the results in the PHC
• A Task is a sequence of Docker images that run against data stored in the PHC, with the outputs going back into the PHC.
• All of the data stays in the PHC to reduce transfer times and cost.
• Tasks run on compute that is provisioned within the PHC based on a task’s CPU or GPU and memory requirements.
• Docker images can be pulled from Docker Hub or uploaded to the PHC for use in a task.
• Gnosis provides genomic data sets, like reference genomes, that can be used as inputs to tasks.
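
To make the "sequence of Docker images" idea concrete, here is a minimal sketch of how such a task could be modeled. This is not the LifeOmic Task Service API: the `Step` and `Task` classes, their fields, and the image names are all hypothetical, and the rule that a task needs the maximum of its steps' resources assumes the steps run one after another.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Step:
    """One Docker image to run, with the resources it asks for (hypothetical schema)."""
    image: str
    cpus: int = 1
    memory_gb: int = 2
    gpus: int = 0


@dataclass
class Task:
    """An ordered sequence of steps whose inputs and outputs stay in the data store."""
    name: str
    steps: List[Step] = field(default_factory=list)

    def required_resources(self) -> dict:
        # Assumption: steps run sequentially, so the provisioned compute only
        # has to satisfy the most demanding step in each dimension.
        return {
            "cpus": max(s.cpus for s in self.steps),
            "memory_gb": max(s.memory_gb for s in self.steps),
            "gpus": max(s.gpus for s in self.steps),
        }


# Hypothetical image names, for illustration only.
denoise_task = Task(
    name="denoise-and-ocr",
    steps=[
        Step(image="example/pdf-to-tiles:latest", cpus=2, memory_gb=4),
        Step(image="example/denoiser:latest", cpus=4, memory_gb=16, gpus=1),
    ],
)
print(denoise_task.required_resources())
```

Describing each step by image plus resource needs is what lets the platform pick CPU or GPU compute per task while the data itself never leaves the store.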

OCR as a Service - Broad applicability
• Communication via fax accounts for ~75 percent of all medical communication [1].
• OCR can be applied in real time and retrospectively.
• Relevant to all of healthcare, including consumers. The end user is a non-developer.
• Huge repositories of data currently exist.
[1] https://www.vox.com/health-care/2017/10/30/16228054/american-medical-system-fax-machines-why

Proposed Solution
1. Direct integration with EHRs to load PDF into PHC
2. Task Service: PDF de-noising, then to Textract
3. Apply Ontology Service (for lookup of key medical terms)
4. Display original PDF + OCR Text side by side in Subject Viewer
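
Steps 2 to 4 can be sketched as a small pipeline. This is only an illustration under stated assumptions: the denoiser and the OCR call are passed in as plain functions (in the proposal they would be the denoising model and Textract), the stand-ins below are fakes so the sketch runs offline, and the ontology entries and codes are made up.

```python
from typing import Callable, Dict, List


def lookup_terms(text: str, ontology: Dict[str, str]) -> List[dict]:
    """Return ontology entries whose term occurs in the text (case-insensitive)."""
    lowered = text.lower()
    return [
        {"term": term, "code": code}
        for term, code in ontology.items()
        if term.lower() in lowered
    ]


def process_scan(
    scan: bytes,
    denoise: Callable[[bytes], bytes],
    ocr: Callable[[bytes], str],
    ontology: Dict[str, str],
) -> dict:
    """Denoise, extract text, look up key terms, and keep the original for display."""
    cleaned = denoise(scan)
    text = ocr(cleaned)
    return {"original": scan, "text": text, "terms": lookup_terms(text, ontology)}


# Stand-ins so the sketch runs without a model or an OCR service.
def fake_denoise(scan: bytes) -> bytes:
    return scan.replace(b"#", b"")  # pretend '#' is scanner noise


def fake_ocr(scan: bytes) -> str:
    return scan.decode("ascii")


toy_ontology = {"hemoglobin": "TOY-001", "cisplatin": "TOY-002"}  # made-up codes

result = process_scan(b"Hemo#globin 13.2 g/dL", fake_denoise, fake_ocr, toy_ontology)
print(result["text"])
print(result["terms"])
```

Returning the original scan alongside the extracted text mirrors step 4, where both are shown side by side for review.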

Current workflow: 6.5 – 8 hours
1. Referring medical oncologist faxes clinical notes and lab values to IU.
2. Medical associate scans fax into EMR (PDF image).
3. Medical abstractor pulls out what was given, when, dosage, duration, prior therapy, lab values. Manual entry into REDCap. 4 – 5 hours.
4. Physician manually checks each value. 2.5 – 3 hours per patient.
5. Loaded into PHC.

Proposed workflow: 1 – 2 hours
1. Referring medical oncologist faxes clinical notes and lab values to IU.
2. Medical associate scans fax into EMR (PDF image).
3. Ingest scan from EMR to PHC. 30 min – 1 hour.
4. Physician manually checks each value. 30 min – 1 hour.
5. Loaded into PHC.
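
The two totals follow from adding the per-step time ranges quoted above. A short worked calculation, using only those quoted numbers (the step labels are paraphrased):

```python
# Per-step time ranges in hours, taken from the two workflows above.
manual = {"abstraction into REDCap": (4.0, 5.0), "physician check": (2.5, 3.0)}
assisted = {"ingest scan into PHC": (0.5, 1.0), "physician check": (0.5, 1.0)}


def total(steps):
    """Sum the low ends and the high ends of each step's range."""
    low = sum(lo for lo, _ in steps.values())
    high = sum(hi for _, hi in steps.values())
    return low, high


manual_total = total(manual)      # (6.5, 8.0)
assisted_total = total(assisted)  # (1.0, 2.0)

# The smallest saving pairs the fastest manual case with the slowest assisted
# case; the largest saving does the opposite.
saved = (manual_total[0] - assisted_total[1], manual_total[1] - assisted_total[0])
print(manual_total, assisted_total, saved)  # (6.5, 8.0) (1.0, 2.0) (4.5, 7.0)
```

On these figures the time saved per patient is somewhere between 4.5 and 7 hours.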

Noisy Clinical Notes—Examples
▪ Dither
▪ Cell ‘residue’, dark background
▪ Ghosting (the printing from the other side is faintly visible!)
▪ Linear speckle

So this has already been solved, right?
• There is far less published research on this than you might expect.
• https://www.kaggle.com/c/denoising-dirty-documents (2015)
• Vishwanath D, Rohit Rahul, Gunjan Sehgal, Swati, Arindam Chowdhury, Monika Sharma, Lovekesh Vig, Gautam Shroff, and Ashwin Srinivasan. “Deep Reader: Information Extraction from Document Images via Relation Extraction and Natural Language.” arXiv:1812.04377 [cs], December 11, 2018. http://arxiv.org/abs/1812.04377.
• Older papers, papers on image denoising generally …
• We also couldn’t find an off-the-shelf denoiser specific to documents. There is no entry for this on ‘Papers with Code’, for example.
• AWS Textract fails on all of the examples shown.

Our solution:
▪ Use Attention U-Net (Oktay et al. 2018) and treat it like a segmentation task
▪ Break the document into high-resolution tiles
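
The tiling step can be sketched as computing a grid of boxes that covers the page, so each tile can be denoised at full resolution and pasted back. The talk does not give the tile size or say whether tiles overlap, so `tile` and `overlap` below are assumptions, as is the choice to shift the last tile back inside the page rather than pad it.

```python
def tile_boxes(width, height, tile, overlap=0):
    """Return (left, top, right, bottom) boxes that cover a width x height page.

    Tiles are tile x tile pixels, stepped by (tile - overlap). The last tile in
    each row and column is shifted back so it stays inside the page.
    """
    if not 0 <= overlap < tile:
        raise ValueError("overlap must be in [0, tile)")
    step = tile - overlap

    def starts(extent):
        if extent <= tile:
            return [0]
        positions = list(range(0, extent - tile, step))
        positions.append(extent - tile)  # final tile sits flush with the edge
        return positions

    return [
        (x, y, min(x + tile, width), min(y + tile, height))
        for y in starts(height)
        for x in starts(width)
    ]


# A US Letter page scanned at 300 dpi is 2550 x 3300 pixels.
boxes = tile_boxes(2550, 3300, tile=512)
print(len(boxes), boxes[0], boxes[-1])
```

Each box can be cropped from the scanned page, passed through the model, and written back to the same coordinates; where tiles overlap, the overlapping predictions would need to be merged, for example by averaging.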

Results
▪ Dither: success.
▪ Top: Denoised. Bottom: New Textract output

Results
▪ Residue and dark background eliminated.
▪ Now many items extracted (often imperfectly)
▪ Top: Before/after, bottom: Textract output. (No output at all prior to denoising.)

Results
▪ Ghosting eliminated, but no help with text quality here

Results
▪ Linear speckle turns out to be a tough nut to crack.

Results
▪ The model has had no exposure to the Kaggle dataset…
▪ Let’s see how it does on that

Denoising the Kaggle dataset
https://www.kaggle.com/c/denoising-dirty-documents/

Results Summary
▪ Some kinds of noise the model can handle
▪ Some kinds of noise it can’t (yet)
▪ Stay tuned!

What about the ‘Power of the Cloud’?
▪ “Denoise” Task Service operational
▪ Example observation data for which there may be accompanying clinical notes
▪ Illustrative example: After denoising and text extraction, key terms are looked up automatically.

Projects
• Retinal Fundus Images (PALM 2019)
• Pancreas (KiTS 2019)

Retinal Fundus Images
• Segment/identify optic disc, fovea, atrophy, and detached retina
• Started out as (late-breaking) entry to PALM challenge at ISBI
• Then switched to general research

Retinal Fundus Images

Segmentation Model and Training
▪ Attention U-Net (Oktay et al. 2018)
▪ Output Recycling (simplified version of VoxResNet, Chen et al. 2018).
▪ Critically, leverages the fact that multiple tasks are being performed on the same dataset. This is the novel part.
▪ CoordConv (Liu et al. 2018)

Output Recycling

Segmentation with CoordConv
▪ The coordinate gradients were concatenated channel-wise and broadcast to all convolutional and deconvolutional layers of the U-Net
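
A minimal sketch of the channel-wise concatenation, assuming the "gradients" are the two CoordConv coordinate images from Liu et al. (one ramp across columns, one across rows, each scaled to [-1, 1]). It uses nested lists in channels-first order to stay dependency-free; a real network would do the same thing on tensors at each layer. The function names are illustrative.

```python
def coord_channels(height, width):
    """Build the two coordinate 'gradient' images, each scaled to [-1, 1]."""
    def ramp(n):
        return [0.0] if n == 1 else [2.0 * i / (n - 1) - 1.0 for i in range(n)]

    xs, ys = ramp(width), ramp(height)
    x_channel = [list(xs) for _ in range(height)]         # varies left to right
    y_channel = [[ys[i]] * width for i in range(height)]  # varies top to bottom
    return x_channel, y_channel


def add_coords(feature_map):
    """Concatenate the coordinate channels to a channels-first feature map."""
    height, width = len(feature_map[0]), len(feature_map[0][0])
    x_channel, y_channel = coord_channels(height, width)
    return feature_map + [x_channel, y_channel]


# One 3 x 4 feature channel becomes three channels: features, x ramp, y ramp.
features = [[[0.5] * 4 for _ in range(3)]]
with_coords = add_coords(features)
print(len(with_coords))
print(with_coords[1][0])  # x ramp along the first row
print([row[0] for row in with_coords[2]])  # y ramp down the first column
```

Because every pixel now carries its own position, a convolution applied to a cropped patch can still tell where in the patch it is, which is one plausible reason the technique helps on small, inconspicuous targets.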

Illustrative segmentation results

Output Recycling Improves Performance

CoordConv improves performance on patch-based segmentation
Patch size    DICE score improvement
25%           0.192
50%           0.034
75%           0.064
100%          0.012
▪ Improvement found for fovea segmentation task only
▪ Fovea mask is smallest of all tasks, fovea region visually least conspicuous
▪ In the 25% case, training failed completely unless CoordConv was used
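
For reference, the Dice score behind these numbers measures the overlap between a predicted mask and the ground-truth mask, 2|A∩B| / (|A| + |B|), and ranges from 0 to 1. The sketch below is a plain implementation on small binary masks; returning 1.0 when both masks are empty is a convention chosen here, not something stated in the talk, and the reported improvements are presumably differences between two such scores.

```python
def dice_score(pred, target):
    """Dice coefficient 2|A∩B| / (|A| + |B|) for two binary masks of equal size."""
    flat_pred = [bool(v) for row in pred for v in row]
    flat_target = [bool(v) for row in target for v in row]
    if len(flat_pred) != len(flat_target):
        raise ValueError("masks must have the same number of pixels")
    overlap = sum(p and t for p, t in zip(flat_pred, flat_target))
    total = sum(flat_pred) + sum(flat_target)
    # Convention (an assumption): two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * overlap / total


pred = [[0, 1, 1],
        [0, 1, 0]]
target = [[0, 1, 0],
          [0, 1, 1]]
print(round(dice_score(pred, target), 3))  # 2 * 2 / (3 + 3) = 0.667
```

Small masks such as the fovea are sensitive to this metric: missing a handful of pixels changes the score far more than it would for a large structure.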

Pancreas Segmentation—WIP

▪ Attention U-Net did reasonably well on the first try on a 2D approach
▪ If CoordConv could help even in the 2D image case, it could be a game-changer for volumetric segmentation

Pancreas 3D Segmentation—WIP

▪ Stay tuned!

AI Tech Health Sprint
• Government initiative to “transform federal open data from HHS, the U.S. Department of Veterans Affairs (VA), and other agencies into digital tools.”
• Specifically, “create digital tools that help in finding experimental therapies for patients, and vice versa.”
• Data was a long time coming!
https://digital.gov/2018/11/02/health-tech-sprint-aims-at-improving-care-access-experience/

Precision Cancer Cohort A
• 170 subjects
 ▪ Comprehensive medication, surgical, and observational data
 ▪ 107 VCF files from 107 indexed subjects
 ▪ 70K DICOM images from 37 subjects, 30 indexed
 ▪ 65K images from the indexed subjects

PCCA VCF Files
Simulated data for illustrative purposes

PCCA DICOM Files
Simulated data for illustrative purposes

Collaborators
Baiju Parikh, Director of Business Development
Ananth Iyer, Principal Machine Learning Engineer

Community
• Just down the road in Morrisville!
• Focused on in-depth technical discussions
• And great pizza!

References
▪ KiTS19 Challenge: https://kits19.grand-challenge.org/home/
▪ PALM Challenge: https://palm.grand-challenge.org/
▪ H. Chen, Q. Dou, L. Yu, J. Qin, and P.-A. Heng. VoxResNet: Deep voxelwise residual networks for brain segmentation from 3D MR images. NeuroImage, vol. 170, pp. 446–455, 2018.
▪ Rosanne Liu, Joel Lehman, Piero Molino, Felipe Petroski Such, Eric Frank, Alex Sergeev, and Jason Yosinski. An intriguing failing of convolutional neural networks and the CoordConv solution. arXiv:1807.03247 [cs, stat], Jul 2018. URL http://arxiv.org/abs/1807.03247.
▪ Ozan Oktay, Jo Schlemper, Loic Le Folgoc, Matthew Lee, Mattias Heinrich, Kazunari Misawa, Kensaku Mori, Steven McDonagh, Nils Y. Hammerla, Bernhard Kainz, et al. Attention U-Net: Learning where to look for the pancreas. arXiv:1804.03999 [cs], Apr 2018. URL http://arxiv.org/abs/1804.03999.
▪ Vishwanath D, Rohit Rahul, Gunjan Sehgal, Swati, Arindam Chowdhury, Monika Sharma, Lovekesh Vig, Gautam Shroff, and Ashwin Srinivasan. Deep Reader: Information Extraction from Document Images via Relation Extraction and Natural Language. arXiv:1812.04377 [cs], Dec 2018. URL http://arxiv.org/abs/1812.04377.
