Integrative analyses of large-scale spatio-temporal datasets play increasingly important roles in many areas of science and engineering. Our recent work in this area is motivated by application scenarios involving complementary digital microscopy, radiology, and "omic" analyses in cancer research. In these scenarios, our objective is to use a coordinated set of image analysis, feature extraction, and machine learning methods to predict disease progression and to aid in targeting new therapies. We describe methods we have developed for extraction, management, and analysis of features, along with the systems software methods for optimizing execution on high-end CPU/GPU platforms. We will also describe biomedical results obtained from these studies and extensions of the computational methods to broader application areas.
2. a.k.a “Big Data”
Center for Comprehensive Informatics
• Integrative Spatio-Temporal Analytics
• Deep Integrative Biomedical Research
• High End Computing/“Big Data” Computers, Systems Software
• Analysis of Patient Populations
3. Application Targets
• Multi-dimensional spatial-temporal datasets
– Radiology and Microscopy Image Analyses
– Oil Reservoir Simulation/Carbon Sequestration/Groundwater Pollution Remediation
– Biomass monitoring and disaster surveillance using multiple types of satellite imagery
– Weather prediction using satellite and ground sensor data
– Analysis of Results from Large Scale Simulations
• Correlative and cooperative analysis of data from multiple sensor modalities and sources
• What-if scenarios and multiple design choices or initial conditions
4. Core Transformations
• Data Cleaning and Low Level Transformations
• Data Subsetting, Filtering, Subsampling
• Spatio-temporal Mapping and Registration
• Object Segmentation
• Feature Extraction, Object Classification
• Spatio-temporal Aggregation
• Change Detection, Comparison, and Quantification
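A minimal sketch of four of the transformations listed above (subsetting, subsampling, spatial aggregation, change detection), assuming the dataset is a dense NumPy array indexed as (time, y, x). The function names and the array layout are illustrative assumptions, not part of the systems described in this talk.

```python
import numpy as np

def subset(cube, t_range, y_range, x_range):
    """Data subsetting: cut a space-time window out of a (time, y, x) array."""
    return cube[slice(*t_range), slice(*y_range), slice(*x_range)]

def subsample(cube, step):
    """Subsampling: keep every `step`-th pixel along both spatial axes."""
    return cube[:, ::step, ::step]

def aggregate_blocks(frame, block):
    """Spatial aggregation: mean over non-overlapping block x block tiles."""
    h, w = frame.shape
    h2, w2 = (h // block) * block, (w // block) * block  # drop ragged edges
    tiles = frame[:h2, :w2].reshape(h2 // block, block, w2 // block, block)
    return tiles.mean(axis=(1, 3))

def detect_change(before, after, threshold):
    """Change detection: boolean mask of pixels whose value moved by more than threshold."""
    return np.abs(after - before) > threshold

# Synthetic example: a 2x2 patch appears in the last time step.
cube = np.zeros((3, 8, 8))
cube[2, 2:4, 2:4] = 10.0
roi = subset(cube, (0, 3), (0, 8), (0, 8))
changed = detect_change(roi[0], roi[-1], threshold=5.0)
coarse = aggregate_blocks(roi[-1], block=2)
```

Real pipelines would add registration and segmentation between these steps; the sketch only shows how the array-level operations compose.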
5. Emory In Silico Center for Brain Tumor Research (PI = Dan Brat, PD = Joel Saltz)
9. National Science Foundation Grand Challenge in Land Cover Dynamics
• Remote sensing analysis of high resolution satellite images.
• Databases of land cover dynamics are essential for global carbon models, biogeochemical cycling, hydrological modeling and ecosystem response modeling
• Maps of the world's tropical rain forest during the past three decades.
Larry Davis, Rama Chellappa, Joel Saltz, Alan Sussman, John Townshend
10. Analysis of Computational Data; Uncertainty Quantification, Comparisons with Experimental Results
Dimitri Mavriplis, Raja Das, Joel Saltz
11. a.k.a “Big Data”
• Integrative Spatio-Temporal Analytics
• Deep Integrative Biomedical Research
• High End Computing/“Big Data” Computers, Systems Software
• Analysis of Patient Populations
14. Computerized Classification System for Grading Neuroblastoma
• Background Identification
• Image Decomposition (Multi-resolution levels)
• Image Segmentation (EMLDA)
• Feature Construction (2nd order statistics, Tonal Features)
• Feature Extraction (LDA) + Classification (Bayesian)
• Multi-resolution Layer Controller (Confidence Region)
[Figure: flowchart of the TRAINING and TESTING pipelines. Testing initializes an image tile at level I = L, labels background tiles, then creates image I(L) and runs segmentation, feature construction, feature extraction, and classification; if the result is not within the confidence region and I > 1, the level is decremented (I = I - 1) and the steps repeat. Training runs down-sampling, segmentation, feature construction, feature extraction, and classifier training on the training tiles.]
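The multi-resolution layer controller can be sketched as a short loop. This is a hedged illustration only: it assumes level L is the coarsest resolution and level 1 the finest, and `classify_at` and the 0.9 threshold are hypothetical stand-ins for the segmentation, feature, and Bayesian classification steps and for the confidence-region test.

```python
def classify_tile(pyramid, classify_at, threshold=0.9):
    """Coarse-to-fine classification of one image tile.

    pyramid:     dict mapping level (1..L) to the tile at that resolution
                 (assumption: L is the coarsest level, 1 the finest).
    classify_at: hypothetical callable (level, image) -> (label, confidence).
    Returns the accepted label and the level at which it was accepted.
    """
    level = max(pyramid)                      # I = L
    while True:
        label, confidence = classify_at(level, pyramid[level])
        # Accept when the decision is confident enough, or when no finer
        # level remains (I > 1 is false).
        if confidence >= threshold or level == 1:
            return label, level
        level -= 1                            # I = I - 1

# Toy classifier whose confidence improves at finer levels.
fake_confidence = {3: 0.5, 2: 0.95, 1: 0.99}
result = classify_tile(
    {1: "fine", 2: "medium", 3: "coarse"},
    lambda level, image: ("stroma-poor", fake_confidence[level]),
)
```

The design intent suggested by the flowchart is that most tiles are decided at a cheap, coarse level and only ambiguous tiles pay for higher-resolution analysis.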
15. Using TCGA Data to Study Glioblastoma
Diagnostic Improvement
Molecular Classification
Predictors of Progression
18. Can we use image analysis of TCGA GBMs to inform diagnostic criteria based on molecular or clinical endpoints?
Nuclear Qualities
Oligodendroglioma Astrocytoma
Application: Oligodendroglioma Component in GBM
19. Millions of Nuclei Defined by n Features
• Bottom-up analysis: let features define and drive the analysis
• Top-down analysis: use the features with existing diagnostic constructs
20. TCGA Whole Slide Images
Step 1: Nuclei Segmentation
• Identify individual nuclei and their boundaries
Jun Kong
21. Nuclear Analysis Workflow
Step 1: Nuclei Segmentation
Step 2: Feature Extraction
• Describe individual nuclei in terms of size, shape, and texture
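As a small illustration of the size and shape part of Step 2, the sketch below computes area, perimeter, and circularity from a nucleus boundary. It assumes segmentation yields each boundary as an ordered list of (x, y) vertices; that representation and the function names are assumptions, and texture features are not covered.

```python
import math

def polygon_area(boundary):
    """Size: area enclosed by the boundary (shoelace formula)."""
    total = 0.0
    for (x1, y1), (x2, y2) in zip(boundary, boundary[1:] + boundary[:1]):
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0

def polygon_perimeter(boundary):
    """Size: total length of the closed boundary."""
    return sum(math.dist(p, q)
               for p, q in zip(boundary, boundary[1:] + boundary[:1]))

def circularity(boundary):
    """Shape: 4*pi*area / perimeter**2, equal to 1 for a perfect circle."""
    perimeter = polygon_perimeter(boundary)
    return 4.0 * math.pi * polygon_area(boundary) / (perimeter * perimeter)

# A unit square stands in for a segmented nucleus boundary.
square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
features = {
    "area": polygon_area(square),
    "perimeter": polygon_perimeter(square),
    "circularity": circularity(square),
}
```

For the unit square the circularity is pi/4; rounder nuclei score closer to 1.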
24. Comparison of Machine-based Classification to Human Based Classification
Separation of GBM, Oligo1, Oligo2 as Designated by Neuropathologists
Separation of GBM, Oligo1 and Oligo2 as Designated by Machine
26. Gene Expression Correlates of High Oligo-Astro Ratio on Machine-based Classification
Oligo Related Genes:
• Myelin Basic Protein
• Proteolipoprotein
• HoxD1
Nuclear features most associated with Oligo signature genes:
• Circularity (high)
• Eccentricity (low)
27. Millions of Nuclei Defined by n Features
• Bottom-up analysis: let nuclear features define and drive the analysis
• Top-down analysis: analyze features in context of existing diagnostic constructs
28. Direct Study of Relationship Between
vs
Lee Cooper,
Carlos Moreno
29. Nuclear Features Used to Classify GBMs
[Figure: silhouette area versus number of clusters (2 to 7), and silhouette values (0 to 1) for clusters 1, 2, and 3]
Consensus clustering of morphological signatures
Study includes 200 million nuclei taken from 480 slides corresponding to 167 distinct patients
Each possibility evaluated using 2000 iterations of K-means to quantify co-clustering
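A hedged sketch of how repeated K-means runs can quantify co-clustering. The slide does not say how the runs were varied, so the random subsampling used here (80% of samples per run) is an assumption, as are the function names; the K-means routine is a plain NumPy version of Lloyd's algorithm rather than the code used in the study.

```python
import numpy as np

def kmeans(X, k, rng, n_iter=50):
    """Plain Lloyd's algorithm; returns one cluster label per row of X."""
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        for j in range(k):
            members = X[labels == j]
            if len(members):              # keep the old center if a cluster empties
                centers[j] = members.mean(axis=0)
    return labels

def consensus_matrix(X, k, n_runs=100, frac=0.8, seed=0):
    """Fraction of runs in which each pair of samples lands in the same cluster,
    counted only over runs where both samples were drawn."""
    rng = np.random.default_rng(seed)
    n = len(X)
    together = np.zeros((n, n))
    sampled = np.zeros((n, n))
    for _ in range(n_runs):
        idx = rng.choice(n, size=int(frac * n), replace=False)
        labels = kmeans(X[idx], k, rng)
        same = labels[:, None] == labels[None, :]
        sampled[np.ix_(idx, idx)] += 1
        together[np.ix_(idx, idx)] += same
    return np.divide(together, sampled,
                     out=np.zeros_like(together), where=sampled > 0)

# Two well-separated synthetic groups of 10 "morphological signatures".
data_rng = np.random.default_rng(1)
X = np.vstack([data_rng.normal(0.0, 0.1, (10, 2)),
               data_rng.normal(5.0, 0.1, (10, 2))])
M = consensus_matrix(X, k=2, n_runs=30)
```

Entries of `M` near 1 or 0 indicate stable assignments; repeating this for each candidate number of clusters is one way to compare the possibilities.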
30. Clustering identifies three morphological groups
• Analyzed 200 million nuclei from 162 TCGA GBMs (462 slides)
• Named for functions of associated genes: Cell Cycle (CC), Chromatin Modification (CM), Protein Biosynthesis (PB)
• Prognostically significant (logrank p=4.5e-4)
[Figure: heat map of feature indices (1 to 50) for the CC, CM, and PB clusters, and survival curves (survival versus days, 0 to 3000) for CC, CM, and PB]
32. Molecular Correlates of MR Features Using TCGA Data
MRIs of TCGA GBMs reviewed by 3-6 neuroradiologists using VASARI feature set and In Vivo Imaging tools
MR Features compared to TCGA Transcriptional Classes and Genetic Alterations
David Gutman
34. VASARI Feature Set
Scott Hwang
Chad Holder
Adam Flanders
35. Prognostic Significance of VASARI Features
Tests Between Groups: 0-33% vs. 34-95% Proportion enhancing
Test      ChiSquare  DF  P-Value
Log-Rank  12.4775    3   0.0059*
Wilcoxon  10.0802    3   0.0179*
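For readers unfamiliar with the log-rank statistic reported in the table, here is a minimal two-group sketch. It is an illustration under stated assumptions, not the analysis behind the slide: it handles exactly two groups (1 degree of freedom, whereas the table reports DF = 3), and the input format (follow-up times plus event indicators, 1 = death observed, 0 = censored) is assumed.

```python
import math
import numpy as np

def logrank_two_groups(time_a, event_a, time_b, event_b):
    """Two-group log-rank test; returns (chi-square statistic, p-value)."""
    time = np.concatenate([time_a, time_b]).astype(float)
    event = np.concatenate([event_a, event_b]).astype(bool)
    in_a = np.concatenate([np.ones(len(time_a), dtype=bool),
                           np.zeros(len(time_b), dtype=bool)])
    observed_minus_expected = 0.0
    variance = 0.0
    for t in np.unique(time[event]):          # each distinct event time
        at_risk = time >= t
        n = at_risk.sum()                     # subjects still at risk
        n_a = (at_risk & in_a).sum()          # of which in group A
        d = (event & (time == t)).sum()       # events at time t
        d_a = (event & (time == t) & in_a).sum()
        observed_minus_expected += d_a - d * n_a / n
        if n > 1:                             # hypergeometric variance term
            variance += d * (n_a / n) * (1 - n_a / n) * (n - d) / (n - 1)
    chi2 = observed_minus_expected ** 2 / variance if variance > 0 else 0.0
    # Chi-square survival function with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return float(chi2), p_value

# Synthetic example: group A fails early, group B fails late.
early = [1, 2, 3, 4, 5]
late = [10, 11, 12, 13, 14]
all_events = [1, 1, 1, 1, 1]
chi2, p = logrank_two_groups(early, all_events, late, all_events)
```

Comparing more than two groups requires the multivariate form of the test with a covariance matrix, which this sketch omits.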
36. a.k.a “Big Data”
• Integrative Spatio-Temporal Analytics
• Deep Integrative Biomedical Research
• High End Computing/“Big Data” Computers, Systems Software
• Analysis of Patient Populations
37. Titan – Peak Speed 30,000,000,000,000,000 floating point operations per second!
39. Extreme DataCutter Prototype
DataCutter
• Pipeline of filters connected through logical streams
• In transit processing
• Flow control between filters and streams
• Developed 1990s-2000s; led to IBM System S
Extreme DataCutter
• Two level hierarchical pipeline framework
• In transit processing
• Coarse grained components overseen by a Manager that coordinates work on pipeline stages between nodes
• Fine grained pipeline operations managed at the node level
• Both levels employ filter/stream paradigm
• Bottom line – everything ends up as DAGs
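The filter/stream paradigm can be illustrated with a toy single-node pipeline. This is not the DataCutter API: the names `run_filter` and `run_pipeline` are hypothetical, each filter is a Python thread, and each stream is a bounded queue, so a full queue blocks the upstream filter and provides simple flow control.

```python
import queue
import threading

_DONE = object()  # end-of-stream marker passed down the pipeline

def run_filter(fn, inbox, outbox):
    """One filter: apply fn to every item on inbox and forward the result."""
    while True:
        item = inbox.get()
        if item is _DONE:
            outbox.put(_DONE)
            return
        outbox.put(fn(item))

def run_pipeline(source, filters, stream_capacity=4):
    """Chain filters with bounded streams and collect the final outputs in order."""
    streams = [queue.Queue(maxsize=stream_capacity)
               for _ in range(len(filters) + 1)]
    workers = [threading.Thread(target=run_filter,
                                args=(fn, streams[i], streams[i + 1]))
               for i, fn in enumerate(filters)]

    def feed():
        for item in source:
            streams[0].put(item)   # blocks while the first stream is full
        streams[0].put(_DONE)

    feeder = threading.Thread(target=feed)
    for thread in workers + [feeder]:
        thread.start()
    results = []
    while True:
        item = streams[-1].get()
        if item is _DONE:
            break
        results.append(item)
    for thread in workers + [feeder]:
        thread.join()
    return results

outputs = run_pipeline(range(10), [lambda x: x + 1, lambda x: x * 2])
```

A linear chain is the simplest DAG; a two-level design would run a pipeline like this inside each node while a manager assigns coarse-grained stages across nodes. The sketch has no error handling, so a filter that raises would stall the pipeline.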
45. Large Scale Data Management
• Represented by a complex data model capturing multi-faceted information including markups, annotations, algorithm provenance, specimen, etc.
• Support for complex relationships and spatial query: multi-level granularities, relationships between markups and annotations, spatial and nested relationships
• Highly optimized spatial query and analyses
• Implemented in a variety of ways including optimized CPU/GPU, Hadoop/HDFS and IBM DB2
46. Spatial Centric – Pathology Imaging “GIS”
• Point query: human marked point inside a nucleus
• Window query: return markups contained in a rectangle
• Containment query: nuclear feature aggregation in tumor regions
• Spatial join query: algorithm validation/comparison
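The query types above can be sketched on simple geometry. The representation is an assumption made for illustration: polygons are lists of (x, y) vertices, markups are axis-aligned boxes (xmin, ymin, xmax, ymax), and the spatial join scores overlapping boxes from two algorithms with the Jaccard index. A production system would use a spatial index rather than these linear scans.

```python
def point_in_polygon(x, y, polygon):
    """Point query: ray-casting test for a point inside a polygon."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):                      # edge straddles the ray
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

def window_query(boxes, window):
    """Window query: markups fully contained in a rectangle."""
    wx0, wy0, wx1, wy1 = window
    return [b for b in boxes
            if b[0] >= wx0 and b[1] >= wy0 and b[2] <= wx1 and b[3] <= wy1]

def box_area(b):
    return max(b[2] - b[0], 0) * max(b[3] - b[1], 0)

def box_jaccard(a, b):
    """Overlap score between two boxes: intersection area / union area."""
    iw = min(a[2], b[2]) - max(a[0], b[0])
    ih = min(a[3], b[3]) - max(a[1], b[1])
    inter = max(iw, 0) * max(ih, 0)
    union = box_area(a) + box_area(b) - inter
    return inter / union if union else 0.0

def spatial_join(boxes_a, boxes_b):
    """Spatial join: every overlapping pair from two result sets with its score."""
    pairs = []
    for i, a in enumerate(boxes_a):
        for j, b in enumerate(boxes_b):
            score = box_jaccard(a, b)
            if score > 0:
                pairs.append((i, j, score))
    return pairs

tumor_region = [(0, 0), (4, 0), (4, 4), (0, 4)]
algorithm_1 = [(0, 0, 2, 2), (10, 10, 12, 12)]
algorithm_2 = [(1, 1, 3, 3)]
matches = spatial_join(algorithm_1, algorithm_2)
```

A containment query can be built from the same pieces: keep the nuclei whose centroids pass `point_in_polygon` for a tumor region, then aggregate their features.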
48. VLDB 2012
Change Detection, Comparison, and Quantification
49. Approach to Integrated Sensor Data Analysis Framework
• Abstract templates specify dataset geometry
• Templates describe collections of space-time regions
• Mapping to memory hierarchies provided by user defined mapping functions
• Leverages Parashar’s DataSpaces
50. a.k.a “Big Data”
• Integrative Spatio-Temporal Analytics
• Deep Integrative Biomedical Research
• High End Computing/“Big Data” Computers, Systems Software
• Analysis of Patient Populations
51. Clinical Phenotype Characterization and the Emory Analytic Information Warehouse
• Example Project: Find hot spots in readmissions within 30 days
– What fraction of patients with a given principal diagnosis will be readmitted within 30 days?
– What fraction of patients with a given set of diseases will be readmitted within 30 days?
– How do severity and time course of co-morbidities affect readmissions?
– Geographic analyses
• Compare and contrast with UHC Clinical Data Base
– Repeat analyses across all UHC hospitals
– Are we performing the same?
– How are UHC-curated groupings of patients (e.g., product lines) useful?
• Need a repeatable process that we can apply identically to both local and UHC data
Andrew Post, Sharath Cholleti, Doris Gao, Michel Monsour, Himanshu Rathod
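The first question on this slide can be made concrete with a small sketch. The record layout (`patient`, `admit`, `discharge`, `dx`) is a hypothetical simplification of encounter data, and the rule used here (next admission within 30 days of discharge, any cause) is an assumption; the project's actual definition, including its exclusions, may differ.

```python
from collections import defaultdict
from datetime import date

def flag_30day_readmits(encounters, window_days=30):
    """Mark each encounter whose patient is admitted again within window_days
    of discharge. Each encounter is a dict with hypothetical keys
    'patient', 'admit', 'discharge', and 'dx' (principal diagnosis)."""
    by_patient = defaultdict(list)
    for enc in encounters:
        by_patient[enc["patient"]].append(enc)
    flagged = []
    for encs in by_patient.values():
        encs.sort(key=lambda e: e["admit"])
        for current, following in zip(encs, encs[1:] + [None]):
            gap = (following["admit"] - current["discharge"]).days if following else None
            readmitted = gap is not None and 0 <= gap <= window_days
            flagged.append({**current, "readmit30": readmitted})
    return flagged

def readmit_rate_by_diagnosis(flagged):
    """Fraction of encounters per principal diagnosis followed by a 30-day readmit."""
    totals, readmits = defaultdict(int), defaultdict(int)
    for enc in flagged:
        totals[enc["dx"]] += 1
        readmits[enc["dx"]] += enc["readmit30"]
    return {dx: readmits[dx] / totals[dx] for dx in totals}

encounters = [
    {"patient": 1, "admit": date(2012, 1, 1), "discharge": date(2012, 1, 5), "dx": "CHF"},
    {"patient": 1, "admit": date(2012, 1, 20), "discharge": date(2012, 1, 25), "dx": "CHF"},
    {"patient": 2, "admit": date(2012, 3, 1), "discharge": date(2012, 3, 3), "dx": "CHF"},
    {"patient": 2, "admit": date(2012, 6, 1), "discharge": date(2012, 6, 4), "dx": "COPD"},
]
rates = readmit_rate_by_diagnosis(flag_30day_readmits(encounters))
```

Because the same two functions run unchanged on any encounter list with this layout, the sketch also hints at what a repeatable process applied to both local and UHC data could look like.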
52. Overall System
[Figure: overall system architecture. Labeled components: Metadata Repository, Metadata Manager, i2b2 Web Server, i2b2 Database, Data Processing, Query Specification, Database Mapper, Study-specific Database, Query tools, and three source data stores. Labeled roles: Investigator, Data Modeler, Data Analyst.]
53. 5-year Datasets from Emory and University Healthcare Consortium
• EUH, EUHM and WW (inpatient encounters)
• Removed encounter pairs with chemotherapy and radiation therapy readmit encounters (CDW data)
• Encounter location (down to unit for Emory)
• Providers (Emory only)
• Discharge disposition
• Primary and secondary ICD9 codes
• Procedure codes
• DRGs
• Medication orders (Emory only)
• Labs (Emory only)
• Vitals (Emory only)
• Geographic information (CDW only + US Census and American Community Survey)
54. Using Emory & UHC Data to Find Associations With 30-day Readmits
• Problem: “Raw” clinical and administrative variables are difficult to use for associative data mining
– Too many diagnosis codes, procedure codes
– Continuous variables (e.g., labs) require interpretation
– Temporal relationships between variables are implicit
• Solution: Transform the data into a much smaller set of variables using heuristic knowledge
– Categorize diagnosis and procedure codes using code hierarchies
– Classify continuous variables using standard interpretations (e.g., high, normal, low)
– Identify temporal patterns (e.g., frequency, duration, sequence)
– Apply standard data mining techniques
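The three transformations in the solution can be sketched as small functions. Everything specific here is an assumption for illustration: the reference ranges in `LAB_RANGES` are placeholder values and not clinical guidance, the code rollup simply truncates a dotted ICD-9 code at the decimal point, and the only temporal pattern shown is a frequency count.

```python
from datetime import date

# Hypothetical reference ranges, for illustration only.
LAB_RANGES = {"potassium": (3.5, 5.0), "sodium": (135.0, 145.0)}

def icd9_category(code):
    """Code hierarchy: roll a dotted ICD-9 code up to its parent category,
    e.g. '428.0' becomes '428'."""
    return code.split(".")[0]

def classify_lab(name, value):
    """Standard interpretation: map a continuous lab value to low/normal/high."""
    low, high = LAB_RANGES[name]
    if value < low:
        return "low"
    if value > high:
        return "high"
    return "normal"

def count_recent(event_dates, as_of, window_days):
    """Temporal pattern (frequency): events in the window_days before as_of."""
    return sum(0 < (as_of - d).days <= window_days for d in event_dates)

category = icd9_category("428.0")
potassium_level = classify_lab("potassium", 5.6)
visits_last_90_days = count_recent(
    [date(2012, 1, 1), date(2012, 3, 1), date(2012, 3, 20)],
    as_of=date(2012, 4, 1),
    window_days=90,
)
```

Each function collapses many raw values into a handful of categories, which is what makes the result tractable for standard data mining.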
55. Derived Variables
• 30-day readmit
• The 9 Emory Enhanced Risk Assessment Tool diagnosis categories
• UHC product lines
• Variables derived from a combination of codes and/or laboratory test results
– Obesity
– Diabetes/uncontrolled diabetes
– End-stage renal disease (ESRD)
– Pressure ulcer
– Sickle cell disease/sickle cell crisis
• Temporal variables derived over multiple encounters
– Multiple MI
– Multiple 30-day readmissions
– Chemotherapy within 180 (or 365) days before surgery
– Previous encounter within the last 90 (or 180) days
56. 30-Day Readmission Rates for Derived Variables
Emory Health Care
57. Geographic Analyses
UHC Medicine General Product Line (#15)
Analytic Information Warehouse
58. Predictive Modeling for Readmission
• Random forests (ensemble of decision trees)
– Create a decision tree using a random subset of the variables in the dataset
– Generate a large number of such trees
– All trees vote to classify each test example in a training dataset
– Generate a patient-specific readmission risk for each encounter
• Rank the encounters by risk for a subsequent 30-day readmission
Sharath Cholleti
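The voting idea can be shown with a deliberately simplified stand-in for a random forest. Assumptions to note: each "tree" here is a one-split stump rather than a full decision tree, each is fit on a bootstrap sample with a random subset of variables, and the risk score is the fraction of trees voting for readmission. The data are synthetic and the function names are hypothetical.

```python
import numpy as np

def fit_stump(X, y, features):
    """Best single-split tree over the given feature subset (y in {0, 1})."""
    best = None
    for f in features:
        for threshold in np.unique(X[:, f]):
            right = X[:, f] > threshold
            for right_label in (0, 1):
                predicted = np.where(right, right_label, 1 - right_label)
                error = np.mean(predicted != y)
                if best is None or error < best[0]:
                    best = (error, f, threshold, right_label)
    return best[1:]

def fit_forest(X, y, n_trees=200, n_sub=None, seed=0):
    """Many stumps, each on a bootstrap sample and a random variable subset."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    n_sub = n_sub or max(1, int(np.sqrt(p)))
    forest = []
    for _ in range(n_trees):
        rows = rng.integers(0, n, size=n)                 # bootstrap sample
        features = rng.choice(p, size=n_sub, replace=False)
        forest.append(fit_stump(X[rows], y[rows], features))
    return forest

def predict_risk(forest, X):
    """Risk per encounter: fraction of trees that vote for class 1."""
    votes = np.zeros(len(X))
    for f, threshold, right_label in forest:
        votes += np.where(X[:, f] > threshold, right_label, 1 - right_label)
    return votes / len(forest)

# Synthetic encounters: only variable 0 actually drives the outcome.
data_rng = np.random.default_rng(0)
X = data_rng.normal(size=(100, 4))
y = (X[:, 0] > 0).astype(int)
forest = fit_forest(X, y, n_trees=40, n_sub=2)
risk = predict_risk(forest, X)
ranking = np.argsort(-risk)   # encounter indices, highest risk first
```

Scoring the training rows, as done here for brevity, overstates accuracy; a real evaluation would score held-out encounters or use out-of-bag votes.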
59. Emory Readmission Rates for High and Low Risk Groups Generated with Random Forest
60. Status of Clinical Phenotype Characterization
• Integrative dataset analysis can leverage patient information gathered over many encounters
• Temporal analyses can generate derived variables that appear to correlate with readmissions
• Predictive modeling has promise of providing decision support
• Data Analytics arm of the Emory New Care Model Initiative led by Greg Esper
• Ongoing analyses involve characterization of clinical phenotype in GWAS, biomarker and quality improvement efforts
• Co-lead (with Bill Hersh) of CTSA CER Informatics taskforce dedicated to this issue
61. a.k.a “Big Data”
• Integrative Spatio-Temporal Analytics
• Deep Integrative Biomedical Research
• High End Computing/“Big Data” Computers, Systems Software
• Analysis of Patient Populations
62. Thanks to:
• In silico center team: Dan Brat (Science PI), Tahsin Kurc, Ashish Sharma, Tony Pan, David Gutman, Jun Kong, Sharath Cholleti, Carlos Moreno, Chad Holder, Erwin Van Meir, Daniel Rubin, Tom Mikkelsen, Adam Flanders, Joel Saltz (Director)
• Digital Pathology R01(s): Foran and Saltz; Jun Kong, Sharath Cholleti, Fusheng Wang, Tony Pan, Tahsin Kurc, Ashish Sharma, David Gutman (Emory), Wenjin Chen, Vicky Chu, Jun Hu, Lin Yang, David J. Foran (Rutgers)
• Analytic Warehouse team: Andrew Post, Sharath Cholleti, Doris Gao, Michel Monsour, Himanshu Rathod
• In vivo imaging Emory team: Tony Pan, Ashish Sharma, Joel Saltz
• NIH/in silico TCGA Imaging Group: Scott Hwang, Bob Clifford, Erich Huang, Dima Hammoud, Manal Jilwan, Prashant Raghavan, Max Wintermark, David Gutman, Carlos Moreno, Lee Cooper, John Freymann, Justin Kirby, Arun Krishnan, Seena Dehkharghani, Carl Jaffe
• ACTSI Biomedical Informatics Program: Marc Overcash, Tim Morris, Tahsin Kurc, Alexander Quarshie, Circe Tsui, Adam Davis, Sharon Mason, Andrew Post, Alfredo Tirado-Ramos
• ORNL HPC collaboration: Scott Klasky, David Pugmire (ORNL)
63. Thanks to
• National Cancer Institute
• National Library of Medicine
• National Science Foundation
• Cardiovascular Research Grid (NHLBI)
• Minority Health Grid (ARRA)
• Emory Health Care
• Kaiser Health Care
• Winship Cancer Institute
• Oak Ridge National Laboratory
• Woodruff Health Sciences