Joel Saltz MD, PhD discusses data and computational challenges in integrative biomedical informatics. His research center analyzes complex patient data like medical images, pathology slides, and "omic" data to characterize diseases at multiple scales. Machine learning is used to automatically segment and classify features in images and identify patterns across different data types that can improve disease classification, predict outcomes, and uncover new biology. Large computing resources are required to handle and analyze huge biomedical datasets.
Data and Computational Challenges in Integrative Biomedical Informatics
1. Data and Computational Challenges in Integrative Biomedical Informatics
Joel Saltz MD, PhD
Chair, Department of Biomedical Informatics
Director, Center for Comprehensive Informatics
Emory University
Adjunct Professor, CSE, CS
College of Computing, Georgia Tech
3. Integrative Biomedical Informatics Analytics
Center for Comprehensive Informatics
• Anatomic/functional characterization at the fine level (Pathology) and gross level (Radiology)
• High-throughput, multi-scale image segmentation, feature extraction, and analysis of features
• Integration of anatomic/functional characterization with multiple types of "omic" information
[Diagram: Radiology Imaging, Pathologic Features, and "Omic" Data linked to Patient Outcome]
5. Quantitative Feature Analysis in Pathology: Emory In Silico Center for Brain Tumor Research (PI = Dan Brat, PD = Joel Saltz)
6. Using TCGA Data to Study Glioblastoma
Diagnostic Improvement
Molecular Classification
Predictors of Progression
7. Millions of Nuclei Defined by n Features
• Top-down analysis: use the features with existing diagnostic constructs
• Bottom-up analysis: let features define and drive the analysis
8. TCGA Whole Slide Images
Step 1: Nuclei Segmentation
• Identify individual nuclei and their boundaries
Jun Kong
9. Nuclear Analysis Workflow
Step 1: Nuclei Segmentation → Step 2: Feature Extraction
• Describe individual nuclei in terms of size, shape, and texture
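The two-step workflow (segment nuclei, then extract per-nucleus features) can be sketched as follows. This is a minimal illustration using simple thresholding and connected components; the actual TCGA pipeline uses far more sophisticated segmentation, and all thresholds, feature definitions, and data here are illustrative assumptions.

```python
import numpy as np
from scipy import ndimage

def segment_and_extract(image, threshold=0.5):
    """Step 1: segment nuclei (crude thresholding + connected components).
    Step 2: extract size, shape, and texture features per nucleus."""
    mask = image > threshold                    # foreground mask (assumed threshold)
    labels, n = ndimage.label(mask)             # each connected component = one nucleus
    features = []
    for i in range(1, n + 1):
        ys, xs = np.nonzero(labels == i)
        area = int(ys.size)                                  # size
        h = int(ys.max() - ys.min() + 1)                     # bounding-box height
        w = int(xs.max() - xs.min() + 1)                     # bounding-box width
        elongation = max(h, w) / min(h, w)                   # crude shape descriptor
        texture = float(image[ys, xs].std())                 # intensity variation proxy
        features.append({"area": area, "elongation": elongation, "texture": texture})
    return features

# toy image with two bright "nuclei"
img = np.zeros((20, 20))
img[2:6, 2:6] = 1.0      # 4x4 blob
img[10:14, 10:18] = 1.0  # 4x8 blob
feats = segment_and_extract(img)  # two nuclei with area/shape/texture features
```

Real whole-slide analysis runs this kind of kernel over millions of tiles, which is what motivates the supercomputing resources discussed later.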
11. Comparison of Machine-based Classification to Human-based Classification
Separation of GBM, Oligo1, and Oligo2 as designated by neuropathologists vs. as designated by machine
13. Gene Expression Correlates of High Oligo-Astro Ratio on Machine-based Classification
Oligo-related genes: Myelin Basic Protein, Proteolipoprotein, HoxD1
Nuclear features most associated with Oligo signature genes: circularity (high), eccentricity (low)
14. Millions of Nuclei Defined by n Features
• Top-down analysis: analyze features in context of existing diagnostic constructs
• Bottom-up analysis: let nuclear features define and drive the analysis
15. Direct Study of Relationship Between
vs
Lee Cooper, Carlos Moreno
16. Clustering identifies three morphological groups
• Analyzed 200 million nuclei from 162 TCGA GBMs (462 slides)
• Groups named for functions of associated genes: Cell Cycle (CC), Chromatin Modification (CM), Protein Biosynthesis (PB)
• Prognostically significant (logrank p = 4.5e-4)
[Figure: heatmap of feature indices for the CC, CM, and PB groups, and survival curves by group over 0–3000 days]
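The bottom-up grouping described above can be sketched as unsupervised clustering of per-tumor nuclear feature profiles. The slide does not specify the algorithm used, so k-means stands in here; the data, feature counts, and cluster separations are all synthetic. A survival comparison (e.g., a logrank test, as reported on the slide) would then be run on the resulting group labels.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Synthetic stand-in: each row is one tumor's aggregated nuclear feature profile
# (e.g., mean circularity, eccentricity, texture); three well-separated groups.
X = np.vstack([
    rng.normal(0.0, 0.1, (50, 5)),
    rng.normal(1.0, 0.1, (50, 5)),
    rng.normal(2.0, 0.1, (50, 5)),
])

# Cluster tumors into three morphological groups (analogous to CC / CM / PB)
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
groups = km.labels_  # group assignment per tumor, to be tested against survival
```

In practice the group labels would be joined to clinical outcome data and compared with a logrank test to assess prognostic significance.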
19. Clinical Phenotype Characterization and the Emory Analytic Information Warehouse
• Example Project: Find hot spots in readmissions within 30 days
– What fraction of patients with a given principal diagnosis will be readmitted within 30 days?
– What fraction of patients with a given set of diseases will be readmitted within 30 days?
– How do the severity and time course of co-morbidities affect readmissions?
– Geographic analyses
• Compare and contrast with the UHC Clinical Data Base
– Repeat analyses across all UHC hospitals
– Are we performing the same?
– How are UHC-curated groupings of patients (e.g., product lines) useful?
• Need a repeatable process that we can apply identically to both local and UHC data
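The first hot-spot question above reduces to a grouped rate computation. A minimal sketch, assuming encounters are records with a principal diagnosis and a 30-day readmission flag (field names are illustrative, not from the actual warehouse schema):

```python
from collections import defaultdict

def readmit_fraction_by_diagnosis(encounters):
    """For each principal diagnosis, compute the fraction of encounters
    followed by a readmission within 30 days."""
    counts = defaultdict(lambda: [0, 0])  # dx -> [readmit count, total count]
    for e in encounters:
        c = counts[e["principal_dx"]]
        c[0] += int(e["readmit_30d"])
        c[1] += 1
    return {dx: r / total for dx, (r, total) in counts.items()}

# toy example (hypothetical diagnoses and flags)
encounters = [
    {"principal_dx": "CHF", "readmit_30d": True},
    {"principal_dx": "CHF", "readmit_30d": False},
    {"principal_dx": "COPD", "readmit_30d": False},
]
rates = readmit_fraction_by_diagnosis(encounters)  # {'CHF': 0.5, 'COPD': 0.0}
```

The same computation, keyed on disease sets or geography instead of a single diagnosis, covers the other questions on the slide; the repeatability requirement is what pushes this into a scripted warehouse process rather than ad hoc queries.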
20. Overall System
[System diagram: components include an i2b2 web server and i2b2 database, a metadata repository maintained by a metadata manager, query specifications from data modelers, data processing and database mapping by data analysts, query tools for investigators, and study-specific databases fed by multiple source data systems]
21. 5-year Datasets from Emory and University Healthcare Consortium
• EUH, EUHM and WW (inpatient encounters)
• Removed encounter pairs with chemotherapy and radiation therapy readmit encounters (CDW data)
• Encounter location (down to unit for Emory)
• Providers (Emory only)
• Discharge disposition
• Primary and secondary ICD9 codes
• Procedure codes
• DRGs
• Medication orders (Emory only)
• Labs (Emory only)
• Vitals (Emory only)
• Geographic information (CDW only, plus US Census and American Community Survey)
22. Using Emory & UHC Data to Find Associations With 30-day Readmits
• Problem: "raw" clinical and administrative variables are difficult to use for associative data mining
– Too many diagnosis codes and procedure codes
– Continuous variables (e.g., labs) require interpretation
– Temporal relationships between variables are implicit
• Solution: transform the data into a much smaller set of variables using heuristic knowledge
– Categorize diagnosis and procedure codes using code hierarchies
– Classify continuous variables using standard interpretations (e.g., high, normal, low)
– Identify temporal patterns (e.g., frequency, duration, sequence)
– Apply standard data mining techniques
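The first two transformations above can be sketched with two small helpers: one bins a continuous lab value into a standard interpretation, and one rolls a detailed ICD-9 code up to its 3-digit category. The reference ranges and the rollup rule here are illustrative assumptions, not the warehouse's actual heuristics.

```python
def lab_category(value, low, high):
    """Classify a continuous lab value against an assumed reference range."""
    if value < low:
        return "low"
    if value > high:
        return "high"
    return "normal"

def icd9_category(code):
    """Collapse a detailed ICD-9 code to its 3-digit category
    (a simple stand-in for hierarchy-based grouping)."""
    return code.split(".")[0]

k = lab_category(3.2, low=3.5, high=5.0)  # "low" for a potassium-style range
cat = icd9_category("428.21")             # "428", the heart failure category
```

Reducing thousands of raw codes and continuous values to a few hundred categorical variables like these is what makes the standard data mining step tractable.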
23. Derived Variables
• 30-day readmit
• The 9 Emory Enhanced Risk Assessment Tool diagnosis categories
• UHC product lines
• Variables derived from a combination of codes and/or laboratory test results
– Obesity
– Diabetes/uncontrolled diabetes
– End-stage renal disease (ESRD)
– Pressure ulcer
– Sickle cell disease/sickle cell crisis
• Temporal variables derived over multiple encounters
– Multiple MI
– Multiple 30-day readmissions
– Chemotherapy within 180 (or 365) days before surgery
– Previous encounter within the last 90 (or 180) days
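A temporal derived variable like "previous encounter within the last 90 (or 180) days" can be computed as a simple window check over a patient's encounter history. A minimal sketch (the function and field shapes are illustrative):

```python
from datetime import date, timedelta

def had_recent_encounter(admit_date, prior_discharge_dates, window_days=90):
    """Derived temporal flag: True if any prior discharge falls within
    `window_days` before this admission."""
    cutoff = admit_date - timedelta(days=window_days)
    return any(cutoff <= d < admit_date for d in prior_discharge_dates)

# toy example: one discharge 42 days before admission, one over a year before
flag = had_recent_encounter(date(2012, 6, 1), [date(2012, 4, 20), date(2011, 1, 3)])
```

The other temporal variables on the slide (multiple MIs, chemotherapy before surgery) follow the same pattern with different event types and window sizes.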
24. 30-Day Readmission Rates for Derived Variables
Emory Health Care
25. Geographic Analyses
UHC Medicine General Product Line (#15)
26. Predictive Modeling for Readmission
• Random forests (ensemble of decision trees)
– Create a decision tree using a random subset of the variables in the dataset
– Generate a large number of such trees
– All trees vote to classify each test example
– Generate a patient-specific readmission risk for each encounter
• Rank the encounters by risk for a subsequent 30-day readmission
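The steps above (grow many trees on random variable subsets, vote, then rank encounters by risk) can be sketched with scikit-learn's random forest as a stand-in for the actual model; the feature matrix and labels here are entirely synthetic.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Synthetic stand-in: rows = encounters, cols = derived variables (slide 23)
X = rng.random((200, 6))
# Synthetic 30-day readmit label, loosely driven by the first variable
y = (X[:, 0] + 0.3 * rng.random(200) > 0.8).astype(int)

# Each tree sees a random subset of variables at each split; trees vote
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Vote fraction for the readmit class = per-encounter readmission risk
risk = forest.predict_proba(X)[:, 1]
ranked = np.argsort(risk)[::-1]           # encounters ranked by descending risk
top_decile = ranked[: len(ranked) // 10]  # flag the highest-risk 10%
```

Ranking by the forest's predicted probability, rather than the hard class vote, is what yields the high/low risk groups compared on the following slides.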
27. Emory Readmission Rates for High and Low Risk Groups Generated with Random Forest
28. Predictive Modeling Applied to 180 UHC Hospitals
[Chart: readmission fraction of the top 10% high-risk patients at each hospital (y-axis 0 to 0.9), comparing an all-hospital model against individual hospital models]
29. Status of Healthcare Data Analytics
• Integrative dataset analysis can leverage patient information gathered over many encounters
• Temporal analyses can generate derived variables that appear to correlate with readmissions
• Predictive modeling holds promise for providing decision support
• Data Analytics arm of the Emory New Care Model Initiative led by Greg Esper
• Ongoing analyses involve characterization of clinical phenotype in GWAS, biomarker, and quality improvement efforts
• Co-lead (with Bill Hersh) of the CTSA CER Informatics taskforce dedicated to this issue
31. Supercomputing – Collaboration with ORNL: Titan
Peak speed: 30,000,000,000,000,000 floating-point operations per second!
33. Core Transformations for Multi-scale Pipelines
• Data Cleaning and Low-Level Transformations
• Data Subsetting, Filtering, Subsampling
• Spatio-temporal Mapping and Registration
• Object Segmentation
• Feature Extraction, Object Classification
• Spatio-temporal Aggregation
• Change Detection, Comparison, and Quantification
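These core transformations compose into pipelines: each stage consumes the previous stage's output. A minimal sketch of that composition pattern, with trivial toy stages standing in for the real cleaning, subsetting, and feature-extraction operators:

```python
from functools import reduce

def make_pipeline(*stages):
    """Compose transformation stages left-to-right into one callable."""
    return lambda data: reduce(lambda d, stage: stage(d), stages, data)

# toy stages standing in for cleaning -> filtering -> feature extraction
clean = lambda xs: [x for x in xs if x is not None]        # data cleaning
keep_positive = lambda xs: [x for x in xs if x > 0]        # subsetting/filtering
extract = lambda xs: [{"value": x, "sq": x * x} for x in xs]  # feature extraction

run = make_pipeline(clean, keep_positive, extract)
out = run([3, None, -1, 2])  # [{'value': 3, 'sq': 9}, {'value': 2, 'sq': 4}]
```

At supercomputer scale each stage would operate on distributed image tiles or spatio-temporal chunks rather than an in-memory list, but the stage-composition structure is the same.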