Keynote talk at HP-MICCAI / MICCAI-DCI 2011, The Joint Workshop on High Performance and Distributed Computing for Medical Imaging, featured at MICCAI 2011, held on September 22nd 2011 in Toronto, Canada
2. Squeezing Information from Spatial Datasets
• Leverage exascale data and
Center for Comprehensive Informatics
computer resources to
squeeze the most out of
image, sensor or simulation
data
• Run lots of different
algorithms to derive same
features
• Run lots of algorithms to
derive complementary
features
• Data models and data
management infrastructure
to manage data products,
feature sets and results from
classification and machine
learning algorithms
3. INTEGRATIVE BIOMEDICAL
INFORMATICS ANALYSIS
Center for Comprehensive Informatics
Reproducible anatomic/functional
characterization at gross level (Radiology) and
fine level (Pathology)
Integration of anatomic/functional
characterization with multiple types of “omic”
information
Create categories of jointly classified data to
describe pathophysiology, predict prognosis,
response to treatment
4. In Silico Center for
Brain Tumor Research
Specific Aims:
1. Influence of necrosis/
hypoxia on gene expression and
genetic classification.
2. Molecular correlates of high
resolution nuclear morphometry.
3. Gene expression profiles
that predict glioma progression.
4. Molecular correlates of MRI
enhancement patterns.
5. Integration of heterogeneous multiscale
information
Center for Comprehensive Informatics
•Coordinated initiatives
Pathology, Radiology,
“omics”
Radiology
•Exploit synergies
Imaging
between all initiatives
to improve ability to
forecast survival & Patient
“Omic” Outco
response. me
Data
Pathologic
Features
6. Example: Pathology and Gene Expression
Joint Predictors of Recurrence/Survival
Lee Cooper Carlos Moreno
8. In Silico Center for
Brain Tumor Research
Key Data Sets
REMBRANDT: Gene expression and genomics data
set of all glioma subtypes
The Cancer Genome Atlas (TCGA): Rich “omics” set
of GBM, digitized Pathology and Radiology
Pathology and Radiology Images from Henry Ford
Hospital, Emory, Thomas Jefferson U, MD Anderson
and others
10. TCGA and REMBRANDT Datasets
Center for Comprehensive Informatics
*Some overlap with TCGA main repository but rescanned at 40X Mikkelsen @ H
** Data obtained by generosity of Lisa Scarpace/Tom
** Data obtained by generosity of Flanders/Mark Curtis @ TJU @ HF
***Thanks to Adam Lisa Scarpace/Tom Mikkelsen
***Thanks to Adam Flanders/Mark Curtis @ TJU
11. Neuroimaging/MRI Data
Center for Comprehensive Informatics
*Pending Upload to NBIA portal
** Data obtained by efforts of Lisa Scarpace/Tom Mikkelsen
@ HF
14. Requirements for High Resolution Whole Slide
Feature Characterization
Center for Comprehensive Informatics
• 1010 pixels for each whole slide image
• 10 whole slide images per patient
• 108 image features per whole slide
image
• 10,000 brain tumor patients
• 1015 pixels
• 1013 features
• Hundreds of algorithms
• Annotations and markups from dozens
of humans
15. Distinguishing Characteristic in
Gliomas
Nuclear Qualities
Round shaped with Elongated with
smooth regular texture rough, irregular texture
Oligodendroglioma Astrocytoma
Use image analysis algorithms to segment and classify microanatomic
features (Nuclei, Astrocytoma, Necrosis ...) in whole slide images
Represent the segmentation and classification in a well defined
structured format that can be used to correlate the pathology with
other data modalities
16. Astrocytoma vs Oligodendroglima
Overlap in genetics, gene expression, histology
Center for Comprehensive Informatics
Astrocytoma vs Oligodendroglima
• Assess nuclear size (area and
perimeter), shape (eccentricity,
circularity major axis, minor axis, Fourier
shape descriptor and extent ratio),
intensity (average, maximum, minimum,
standard error) and texture (entropy,
energy, skewness and kurtosis).
17. Machine-based Classification of TCGA GBMs (J Kong)
Whole slide scans from 14 TCGA GBMS (69 slides)
7 purely astrocytic in morphology; 7 with 2+ oligo component
399,233 nuclei analyzed for astro/oligo features
Cases were categorized based on ratio of oligo/astro cells
TCGA Gene
Expression Query:
c-Met overexpression
19. Nuclear Features Used to Classify GBMs
10
Center for Comprehensive Informatics
20
30
40
Feature Indices
50
60
70
80
90
100
110
Clustergram of selected features used in consensus
clustering
20. Nuclear Features Used to Classify GBMs
2 1 3 4
Center for Comprehensive Informatics
1
20
40
60
2
Cluster
80
100
3
120
140
160 4
Consensus clustering of morphological
50 100 150
0 0.2 0.
signatures Silho
Study includes 200 million nuclei taken from 480
slides corresponding to 167 distinct patients.
21. 1
Cluster 1
0.9 Cluster 2
Center for Comprehensive Informatics
Cluster 3
0.8
Cluster 4
0.7
0.6
Survival
0.5
0.4
0.3
0.2
0.1
0
0 500 1000 1500 2000 2500 3000
Days
Survival of morphological clusters
22. 1
Proneural
0.9 Neural
Center for Comprehensive Informatics
Classical
0.8
Mesenchymal
0.7
0.6
Survival
0.5
0.4
0.3
0.2
0.1
0
0 500 1000 1500 2000 2500 3000
Days
Survival of patients by molecular tumor
subtype
23. Cluster 2:
What does this mean? Gemistocytic Signature
Center for Comprehensive Informatics
• No association with
• Transcriptional class
• Age
• Molecular correlates:
Wnt signaling, TP53 signaling,
Oxidative phosphorylation,
mRNA processing
25. Multicore/GPU Implementation
Center for Comprehensive Informatics
• Nuclear/cell segmentation
• Feature extraction
• Generate clusters of nuclear/cell
Features
• Classify nuclei/cells by feature cluster
• Classify patients by populations of
classified nuclei/cells
32. Scheduling for CPU/GPU System
Center for Comprehensive Informatics
• PRIORITY: assigns tasks to CPUs or GPUs based
on estimate of relative performance gain of each
task on GPU vs on CPU and load of the CPUs and
GPUs
• Sorted queue of (task, data element) tuples based
on the relative speedup expected for each tuple
• As more (task, data element) tuples are created,
tuples are inserted into the queue such that the
queue remains sorted
• First Come First Served: simple low overhead
policy
33. Aggregation to reduce kernel invocation and data
movement overheads
Center for Comprehensive Informatics
• Objects are grouped based on which image tile they
are segmented from. Each tuple consists of (objects
in image tile, feature computation operation)
• Perform group of tasks that are applied to the same
data element (or the same chunk of data elements)
• Performance Aware Task Fusion (PATF) -- groups
tasks based on GPU speedup values. Two groups of
tasks: one group containing tasks with good GPU
speedup values and the other group containing the
remaining tasks
37. Multiscale Systems Biology
Employ multi-resolution methods to
characterize necrosis, angiogenesis and
correlate these with “omics”
No enhancement Rim-enhancement
Normal Vessels ? Vascular Changes
Stable lesion Rapid progression
38. Correlation of Necrosis, Angiogenesis and “omics”
Center for Comprehensive Informatics
• GBMs display variable and regionally heterogeneous degrees
of necrosis (asterisk) and angiogenesis
• These factors may impact gene expression profiles
39. Genes Correlated with Necrosis include Transcription Factors
Identified as Regulators of the Mesenchymal Transition
Center for Comprehensive Informatics
•Frozen sections from 88 GBM samples marked
to identify regions of necrosis and angiogenesis
•Extent of both necrosis and angiogenesis calculated as a
percentage of total tissue area
Gene SAM q-value
Symbol (Corrected p-value)
C/EBPB < 0.000001
C/EBPD < 0.000001
FOSL2 < 0.000001
Carro MS, et al. Nature 263: 318-25, 2010
STAT3 0.0047
RUNX1 0.0082
41. Defining Rich Set of Qualitative and
Quantitative Image Biomarkers
Center for Comprehensive Informatics
• Community-driven ontology development
project; collaboration with ASNR
• Imaging features (5 categories)
– Location of lesion
– Morphology of lesion margin (definition, thickness,
enhancement, diffusion)
– Morphology of lesion substance (enhancement, PS
characteristics, focality/multicentricity, necrosis, cysts, midline
invasion, cortical involvement, T1/FLAIR ratio)
– Alterations in vicinity of lesion (edema, edema
crossing midline, hemorrhage, pial invasion, ependymal invasion,
satellites, deep WM invasion, calvarial remodeling)
– Resection features (extent of nCE tissue, CE tissue, resected
components)
42. Imaging Predictors of survival and molecular profiles
in the TCGA Glioblastoma Data set
The TCGA glioma working group
Center for Comprehensive Informatics
1Emory University Hospital, Atlanta, GA 2National Cancer Institute, Bethesda, MD. 3Thomas Jefferson University Hospital, Philadelphia,
PA. 4Henry Ford University Hospital, Detroit, Michigan. 5National Institute of Health, Bethesda, MD. 6Boston University School of
Medicine, Boston, MA. 7SAIC-Frederick, Inc., Frederick, MD. 8University of Virginia, Charlottesville, VA. 9 Northwestern University
Chicago, IL
Emory TJU/CBIT/NCI UVA/Northwestern Henry Ford
David A Gutman1 Adam Flanders3 Max Wintermark8 Lisa Scarpace4
Lee Cooper1 Eric Huang2 Manal Jilwan8 Tom Mikkelsen4
Scott N Hwang1 Robert J Clifford2 Prashant Raghavan8
Chad A Holder1 Dina Hammoud3 Pat Mongkolwat9
Doris Gao1 John Freymann7
Carlos Moreno1 Justin Kirby7
Arun Krishnan1 Carl Jaffe6
Jun Kong1
Seena Dehkharghani1
Joel Saltz1
Dan Brat1
43. Correlative Imaging Results
Center for Comprehensive Informatics
• Minimal enhancing tumor (≤5%)
strongly associated with Proneural
classification (p=0.0006).
• >5% proportion of necrosis and the
presence of microvascular hyperplasia in
pathology slides (p=0.008).
• Greater maximum tumor dimension (T2
signal) associated with
present/abundant microvascular
hyperplasia (p=0.001).
< 5% Enhancement
44. Correlative Imaging Results
Center for Comprehensive Informatics
• TP53 mutant tumors had a smaller mean
tumor sizes (p=0.002) on T2-weighted or
FLAIR images.
• EGFR mutant tumors were significantly
larger than TP53 mutant tumors
(p=0.0005).
• High level EGFR amplification was
associated with >5% enhancement and
>5% proportion of necrosis (p < 0.01).
> 5% Necrosis
46. Large Scale Data Management
Implemented with IBM DB2 for large scale
Center for Comprehensive Informatics
pathology image metadata (~million markups per
slide)
Represented by a complex data model capturing
multi-faceted information including markups,
annotations, algorithm provenance, specimen, etc.
Support for complex relationships and spatial
query: multi-level granularities, relationships
between markups and annotations, spatial and
nested relationships
47. Data Models to Represent Feature Sets
and Experimental Metadata
PAIS |pās| : Pathology Analytical Imaging Standards
• Provide semantically enabled data model to
support pathology analytical imaging
• Data objects, comprehensive data types, and
flexible relationships
• Object-oriented design, easily extensible
• Reuse existing standards
– Reuse relevant classes already defined in AIM
– Follow DICOM WG 26 metadata specifications on WSI
reference
– Specimen information in DICOM Supplement 122 and
caTissue
– Use caDSR for CDE and NCI Thesaurus for ontology
48. Pathology Imaging GIS
class Domain Mo...
WholeSlideImageReference TMAImageReference
Patient
Region
MicroscopyImageReference
Subject
0..1
Center for Comprehensive Informatics
0..1
DICOMImageReference 1
1 Specimen
ImageReference 1 0..*
0..* 0..1
1 0..1 AnatomicEntity
User 0..* 1
1
0..1
1 1 Equipment
Group 0..1
1
0..1 1 PAIS
1 0..*
Project AnnotationReference
1
0..1
0..* 1 1 1
Collection 0..* 0..*
1 0..*
0..* Annotation 0..*
0..* Markup
0..1
0..*
GeometricShape 1
Image analysis
1 1
0..* 0..* 0..*
1
Observation Calculation Inference
Surface Field
1 1
0..1 0..1 0..1
Provenance
0..*
PAIS model PAIS data management
Modeling and management of markup and annotation for
querying and sharing through parallel RDBMS + spatial DBMS
Segmentation
HDFS data staging MapReduce based queries
On the fly data processing for algorithm validation/algorithm
Feature extraction
sensitivity studies, or discovery of preliminary results
49. Workflow, Adaptivity and Quality of Service
• In-transit data processing using
filter/stream systems
• Semantic Workflows
• Hierarchical pipeline design with coarse
and fine grained components
• Adaptivity and Quality of Service
50. Computerized Classification System
for Grading Neuroblastoma
Initialization Yes
Image Tile Background? Label
I=L
• Background Identification
No
Create Image I(L)
• Image Decomposition (Multi-
Training Tiles resolution levels)
Segmentation I = I -1 • Image Segmentation
Down-sampling
(EMLDA)
Segmentation
Feature Construction
• Feature Construction (2nd
Yes
No order statistics, Tonal
Feature Extraction I > 1?
Feature Construction
Features)
Feature Extraction
Classification
• Feature Extraction (LDA) +
Classification (Bayesian)
Classifier Training
No
• Multi-resolution Layer
Within Confidence
Region ? Controller (Confidence
Yes
TRAINING Region)
TESTING
51. Slides’ Preparation
• 64990 x 59412 pixels in full resolution
• Original Size: 10.8 Gb; Compressed Sized: ≈ 833Mb
8x
40x
52. Semantic Workflows (Wings)
Collaborative Work with Yolanda Gil, Mary Hall
Center for Comprehensive Informatics
• A systematic strategy for composing application
components into workflows
• Search for the most appropriate implementation of
both components and workflows
• Component optimization
– Select among implementation variants of the same
computation
– Derive integer values of optimization parameters
– Only search promising code variants and a restricted
parameter space
• Workflow optimization
– Knowledge-rich representation of workflow properties
58. Conclusions
• Increasingly common task: Integrative analysis of
complementary multi-resolution and “omic” data sources
• Elucidation of mechanisms, identify medically important
disease sub-classes
• Cell level whole slide analysis of whole slide images
requires innovations in parallelization, data management
• Integration of Radiology and Pathology has promise but
pose additional challenges
• Role of quality of service/on demand computing
59. Thanks to:
• In silico center team: Dan Brat (Science PI), Tahsin Kurc, Ashish Sharma, Tony Pan, David
Gutman, Jun Kong, Sharath Cholleti, Carlos Moreno, Chad Holder, Erwin Van Meir, Daniel
Rubin, Tom Mikkelsen, Adam Flanders, Joel Saltz (Director)
• caGrid Knowledge Center: Joel Saltz, Mike Caliguiri, Steve Langella co-Directors; Tahsin
Kurc, Himanshu Rathod Emory leads
• caBIG In vivo imaging team: Eliot Siegel, Paul Mulhern, Adam Flanders, David Channon,
Daniel Rubin, Fred Prior, Larry Tarbox and many others
• In vivo imaging Emory team: Tony Pan, Ashish Sharma, Joel Saltz
• Emory ATC Supplement team: Tim Fox, Ashish Sharma, Tony Pan, Edi Schreibmann, Paul
Pantalone
• Digital Pathology R01: Foran and Saltz; Jun Kong, Sharath Cholleti, Fusheng Wang, Tony
Pan, Tahsin Kurc, Ashish Sharma, David Gutman (Emory), Wenjin Chen, Vicky Chu, Jun
Hu, Lin Yang, David J. Foran (Rutgers)
• NIH/in silico TCGA Imaging Group: Scott Hwang, Bob Clifford, Erich Huang, Dima
Hammoud, Manal Jilwan, Prashant Raghavan, Max Wintermark, David Gutman, Carlos
Moreno, Lee Cooper, John Freymann, Justin Kirby, Arun Krishnan, Seena Dehkharghani,
Carl Jaffe
• ACTSI Biomedical Informatics Program: Marc Overcash, Tim Morris, Tahsin Kurc,
Alexander Quarshie, Circe Tsui, Adam Davis, Sharon Mason, Andrew Post, Alfredo Tirado-
Ramos
• NSF Scientific Workflow Collaboration: Vijay Kumar, Yolanda Gil, Mary Hall, Ewa Deelman,
Tahsin Kurc, P. Sadayappan, Gaurang Mehta, Karan Vahi