GenomeSnip: Fragmenting the genomic wheel to augment discovery in cancer research

Maulik R. Kamdar, Aftab Iqbal, Muhammad Saleem,
Helena F. Deus and Stefan Decker
Conference on Semantics in Healthcare and Life Sciences (CSHALS), February 26-28, 2014, Boston

Outline
 Introduction
 Integrative Genomics
 Genomic Visualization
 Motivation

 Methodology

 Integration of Data Sources
 Co-occurrence Data Cubes
 Similarity Measures

 GenomeSnip Platform
 Genomic Wheel
 Genomic Tracks Display

 Comparative Evaluation
 Discussion
 Applicability
 Future Work

Integrative Genomics
 Advanced computational analyses of genomic datasets, such as
genome-wide association studies (GWAS), have identified several
susceptibility loci (point alterations and oncogenes) associated with
various types of cancer, and these loci are catalogued.
 Network-based approaches isolate genes implicated together
(‘Co-occurrence’) in any disease or pathway, i.e. macro-molecular
associations on a higher, abstract level.
 Linked Data and Semantic Web technologies address the challenges of
integrating multiple knowledge bases.
 Advantages: Representation using Resource Description Framework
(RDF), structured querying, integration with other biomedical
datasets, data mining and knowledge extraction.

Genomic Visualization
Genome browsers navigate across the human genome in an intuitive,
linear fashion. Data points are mapped to the genomic coordinates and
datasets are visualized as charts or heatmaps.
Disadvantages:
Human perceptual faculties interpret patterns better in depth
than along a length.
Genome browsers fail to display inter- and intra-chromosomal relations,
which humans could otherwise easily interpret and use.
Circular plots overcome the cognitive barriers to grasp relations
between disjoint genomic features on an abstract level.
Disadvantages:
They focus on the entire human genome or a whole chromosome.
Visualizations are rendered as static images and lack interactivity.

Motivation
 With the rise of high-throughput gene sequencing technologies, data
analysis has replaced data generation as the rate-limiting step for the
interpretation of genomic patterns and discovery.
 Analyzing genomic datasets using the knowledge extracted through
GWAS could help discover new tumour risk hypotheses and
diagnose cancer on a personalized basis.
 Studying Co-occurrence networks of genes could predict hidden
protein-protein interactions (PPI) and functions.
 Improved, interactive, intuitive, genomic visualization solutions,
which infuse knowledge insights into cancer research, need to be
perfected to augment discovery.

GenomeSnip Platform
 A semantic, visual analytics prototype devised to expedite knowledge
exploration and discovery in cancer research.
 Idea: ‘Snip’ the human genome into informative fragments through
interaction with an aggregative, circular visualization (the
‘Genomic Wheel’), and analyze the snipped fragments in detail
in a linear ‘Genomic Tracks’ display.
 Technologies: Web-based client application developed using native
technologies like HTML5 Canvas, JavaScript and JSON.
KineticJS library, an HTML5 Canvas JavaScript framework, is used for
node nesting, layering, caching and event handling.
 Availability: http://srvgal78.deri.ie/genomeSnip/

Integration of Data Sources
UCSC Genome Browser: Chromosome Bands (Ideograms) in the Human
Genome Assembly (GRCh37/hg19, Feb 2009)

CellBase: Coordinates and descriptions of the protein-coding genes,
genomic variants like cancer-related mutations (COSMIC) and
SNPs (dbSNP).

Cancer Gene Census: Catalogued oncogenes, which bear somatic and
germline mutations in their sequences

UniProt: Disease-gene mappings exposed by the EBI RDF platform.

Kyoto Encyclopedia of Genes and Genomes (KEGG): Pathway-gene
Mappings exposed as a SPARQL Endpoint.

Pubmed2Ensembl: Customized service extended to provide gene-related
publication information.

Linked TCGA: Information related to the genomic alterations like DNA
methylations, SNPs, CNVs and gene expression changes, in cancer
patients, mapped to the genomic loci

Co-occurrence Data Cubes
 Genes mapped with diseases, pathways and publications are
extracted a priori to instantiate co-occurrence pairs between all
possible combinations of genes, after identifier conversions.
 A co-occurrence data cube of 3 dimensions (segment 1, segment 2
and data source type) and 2 measures (co-occurrence count and
names of mapped resources) is created between the genes,
ideograms and chromosomes.
 The data cubes are transformed to RDF (N-Triples syntax) using the RDF
Data Cube Vocabulary and are stored in a triple store.
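
The cube construction described above can be sketched in Python. This is a minimal illustration under stated assumptions, not the GenomeSnip implementation: the gene names, the gene-to-chromosome lookup and the mappings are hypothetical, each gene pair is counted once per shared resource, and each pair is stored once in sorted order rather than in both directions.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical gene -> chromosome lookup (illustrative values only).
GENE_TO_CHROMOSOME = {"GENE_A": "Chr1", "GENE_B": "Chr2", "GENE_C": "Chr2"}

# Hypothetical mappings: source type -> resource name -> genes mapped to it.
MAPPINGS = {
    "Dis": {"disease_x": {"GENE_A", "GENE_B"}},
    "Path": {"pathway_y": {"GENE_A", "GENE_B", "GENE_C"}},
}


def build_cube(mappings, gene_to_segment):
    """Aggregate gene-gene co-occurrence pairs into a 3-dimensional cube.

    Key: (segment 1, segment 2, source type).
    Value: {"count": number of pairs, "resources": names of mapped resources}.
    """
    cube = defaultdict(lambda: {"count": 0, "resources": set()})
    for source_type, resources in mappings.items():
        for resource, genes in resources.items():
            # Assumption: every unordered gene pair sharing a resource counts once.
            for gene_1, gene_2 in combinations(sorted(genes), 2):
                seg_1, seg_2 = sorted(
                    (gene_to_segment[gene_1], gene_to_segment[gene_2])
                )
                cell = cube[(seg_1, seg_2, source_type)]
                cell["count"] += 1
                cell["resources"].add(resource)
    return dict(cube)


cube = build_cube(MAPPINGS, GENE_TO_CHROMOSOME)
for key, cell in sorted(cube.items()):
    print(key, cell["count"], sorted(cell["resources"]))
```

The same function would work at ideogram or gene granularity by swapping the lookup table, which is one reason to keep the segment mapping as a parameter.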
        |        Chr1         |        Chr2         |        Chr3         | … |         Total
        | Dis   Path   Pub    | Dis   Path   Pub    | Dis   Path   Pub    | … | Dis   Path    Pub
Chr1    | 28    4049   4522   | 15    4251   2915   | 19    4811   2667   | … | 226   75044   49827
Chr2    | 15    4251   2915   | 14    1662   3589   | 14    2532   1748   | … | 155   42513   31424
Chr3    | 19    4811   2667   | 14    2532   1748   | 16    1338   1390   | … | 224   44310   26817
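
One cell of the data cube, for example the Chr1 and Chr2 disease count of 15, could be written out as N-Triples roughly as follows. The `qb:Observation` class and `qb:dataSet` property come from the RDF Data Cube Vocabulary; the `http://example.org/genomesnip/` namespace and the dimension and measure property names are hypothetical placeholders, not the URIs used by the platform.

```python
QB = "http://purl.org/linked-data/cube#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
XSD_INTEGER = "http://www.w3.org/2001/XMLSchema#integer"
# Hypothetical namespace for this sketch; not the URIs used by GenomeSnip.
EX = "http://example.org/genomesnip/"


def observation_to_ntriples(obs_id, segment_1, segment_2, source_type, count):
    """Serialize one data cube cell as N-Triples lines (one triple per line)."""
    subject = f"<{EX}observation/{obs_id}>"
    triples = [
        (subject, f"<{RDF_TYPE}>", f"<{QB}Observation>"),
        (subject, f"<{QB}dataSet>", f"<{EX}dataset/cooccurrence>"),
        # Three dimensions: segment 1, segment 2 and data source type.
        (subject, f"<{EX}segment1>", f"<{EX}segment/{segment_1}>"),
        (subject, f"<{EX}segment2>", f"<{EX}segment/{segment_2}>"),
        (subject, f"<{EX}sourceType>", f"<{EX}source/{source_type}>"),
        # One of the two measures: the co-occurrence count as a typed literal.
        (subject, f"<{EX}cooccurrenceCount>", f'"{count}"^^<{XSD_INTEGER}>'),
    ]
    return [f"{s} {p} {o} ." for s, p, o in triples]


lines = observation_to_ntriples("chr1-chr2-dis", "Chr1", "Chr2", "Dis", 15)
print("\n".join(lines))
```

The second measure (names of mapped resources) is left out here for brevity; it could be added as further triples on the same subject.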

Similarity Measures
 Inspired by Tversky's feature-based similarity measure.
 Similarity between two entities (chromosomes) is a weighted
summation of their common features (total number of co-occurrence
pairs of contained genes).

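\[
\mathrm{Sim}_{12} = \alpha \cdot \frac{|\mathrm{Chr1} \cap \mathrm{Chr2}|_{\mathrm{Dis}}}{\mathrm{HGTotal}_{\mathrm{Dis}}}
+ \beta \cdot \frac{|\mathrm{Chr1} \cap \mathrm{Chr2}|_{\mathrm{Path}}}{\mathrm{HGTotal}_{\mathrm{Path}}}
+ \gamma \cdot \frac{|\mathrm{Chr1} \cap \mathrm{Chr2}|_{\mathrm{Pub}}}{\mathrm{HGTotal}_{\mathrm{Pub}}}
\tag{1}
\]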

 HGTotalDis = Total number of gene-gene co-occurrence pairs extracted
from the entire Human Genome (HG) for a particular data source.
HGTotalDis = 1435, HGTotalPath = 83970, HGTotalPub = 129035

\[
\mathrm{Sim}_{12} = \alpha \cdot \frac{15}{1435} + \beta \cdot \frac{4251}{83970} + \gamma \cdot \frac{2915}{129035}
\]

Similarity Measures (contd.)
 The maximum similarity for any chromosome (here Chromosome 1) is calculated
by evaluating Equation 1 with the total number of gene-gene co-occurrence
pairs in which genes of that chromosome participate.

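\[
\mathrm{Sim}_{1\mathrm{Max}} = \alpha \cdot \frac{\mathrm{Chr1Total}_{\mathrm{Dis}}}{\mathrm{HGTotal}_{\mathrm{Dis}}}
+ \beta \cdot \frac{\mathrm{Chr1Total}_{\mathrm{Path}}}{\mathrm{HGTotal}_{\mathrm{Path}}}
+ \gamma \cdot \frac{\mathrm{Chr1Total}_{\mathrm{Pub}}}{\mathrm{HGTotal}_{\mathrm{Pub}}}
\]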
 Evaluating with the values in the last (Total) column of the data cube:

\[
\mathrm{Sim}_{1\mathrm{Max}} = \alpha \cdot \frac{226}{1435} + \beta \cdot \frac{75044}{83970} + \gamma \cdot \frac{49827}{129035}
\]
 The relative similarity impact on the side of any chromosome is
calculated by dividing the similarity measure by the maximum
similarity for that chromosome (Sim12/Sim1Max).
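
The similarity measures can be checked numerically with a short Python sketch. It uses the Chr1 and Chr2 counts and the genome-wide totals from the data cube, together with the weights α = 1, β = 0.4, γ = 0.1 given for the Genomic Wheel; the function and variable names are illustrative only.

```python
# Totals over the entire human genome, per data source (from the slides).
HG_TOTAL = {"Dis": 1435, "Path": 83970, "Pub": 129035}
# Weights alpha, beta, gamma as given for the Genomic Wheel.
WEIGHTS = {"Dis": 1.0, "Path": 0.4, "Pub": 0.1}


def weighted_similarity(counts):
    """Weighted sum of per-source counts, each normalized by the genome-wide total."""
    return sum(WEIGHTS[s] * counts[s] / HG_TOTAL[s] for s in HG_TOTAL)


chr1_chr2 = {"Dis": 15, "Path": 4251, "Pub": 2915}      # Chr1 x Chr2 cell
chr1_total = {"Dis": 226, "Path": 75044, "Pub": 49827}  # Chr1 Total column

sim_12 = weighted_similarity(chr1_chr2)
sim_1_max = weighted_similarity(chr1_total)
relative_impact = sim_12 / sim_1_max  # Sim12 / Sim1Max

print(f"Sim12={sim_12:.4f} Sim1Max={sim_1_max:.4f} ratio={relative_impact:.4f}")
```

With these inputs Sim12 comes out at roughly 0.033 and Sim1Max at roughly 0.554, so the chord accounts for about 6% of the maximum similarity on the Chromosome 1 side.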

Genomic Wheel
 The human genome is laid in a
circular layout.
 Chromosomes form the arcs
on the perimeter, whose
length is proportional to the
size of each chromosome.
 The thickness of the connecting chords is proportional to the
relative similarity impact between the connected chromosomes.
 α = 1, β = 0.4, γ = 0.1
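
The proportional arc layout can be sketched as follows. The chromosome lengths are approximate values in megabases for three chromosomes only, and the gap between arcs is a hypothetical styling parameter; the real platform draws on an HTML5 Canvas, so this Python version only illustrates the geometry.

```python
import math

# Approximate chromosome lengths in megabases; illustrative values for three
# chromosomes only, not the full assembly.
CHROMOSOME_SIZES_MB = {"Chr1": 249, "Chr2": 243, "Chr3": 198}


def arc_spans(sizes, gap_radians=0.02):
    """Return {name: (start_angle, end_angle)} with arc length proportional to size."""
    usable = 2 * math.pi - gap_radians * len(sizes)  # angle left after the gaps
    total = sum(sizes.values())
    spans, angle = {}, 0.0
    for name, size in sizes.items():
        sweep = usable * size / total
        spans[name] = (angle, angle + sweep)
        angle += sweep + gap_radians  # leave a small gap before the next arc
    return spans


spans = arc_spans(CHROMOSOME_SIZES_MB)
for name, (start, end) in spans.items():
    print(f"{name}: {math.degrees(start):.1f} to {math.degrees(end):.1f} degrees")
```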

Genomic Wheel (contd.)
 Hovering the mouse over a chord displays a slice of the
co-occurrence data cube.
 The hierarchical categorization (chromosome, ideogram, gene
and cancer point mutations) forms layers in the
representative arc.
 Only the ‘chromosome’ and ‘ideogram’ layers are shown in
the initial layout.

The chord’s thickness is proportional to the relative similarity impact
between Chromosome 11 and 22 on the connected side (tapering).
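
The tapering follows from dividing the same similarity by a different maximum at each end. The sketch below uses the Chr1 and Chr2 values from the data cube, since the slides give no counts for Chromosomes 11 and 22. The linear mapping to pixels and the `MAX_WIDTH_PX` constant are assumptions made for illustration.

```python
HG_TOTAL = {"Dis": 1435, "Path": 83970, "Pub": 129035}
WEIGHTS = {"Dis": 1.0, "Path": 0.4, "Pub": 0.1}  # alpha, beta, gamma
MAX_WIDTH_PX = 40.0  # hypothetical width of a chord with relative impact 1


def weighted_similarity(counts):
    """Weighted sum of per-source counts, each normalized by the genome-wide total."""
    return sum(WEIGHTS[s] * counts[s] / HG_TOTAL[s] for s in HG_TOTAL)


def chord_end_widths(shared, total_a, total_b, max_width=MAX_WIDTH_PX):
    """Chord width at each end: shared similarity over that side's maximum."""
    sim = weighted_similarity(shared)
    return (
        max_width * sim / weighted_similarity(total_a),
        max_width * sim / weighted_similarity(total_b),
    )


width_chr1, width_chr2 = chord_end_widths(
    {"Dis": 15, "Path": 4251, "Pub": 2915},     # Chr1 x Chr2 cell
    {"Dis": 226, "Path": 75044, "Pub": 49827},  # Chr1 Total column
    {"Dis": 155, "Path": 42513, "Pub": 31424},  # Chr2 Total column
)
print(f"Chr1 end: {width_chr1:.2f}px, Chr2 end: {width_chr2:.2f}px")
```

Because Chromosome 2 takes part in fewer co-occurrence pairs overall, the same chord is wider at its Chr2 end than at its Chr1 end.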

 Clicking on an arc highlights the represented segment, which
flares out to display the subsequent layers.
 At the gene level, the relation chords are drawn in distinct
colors (red, green and blue for diseases, pathways and
publications respectively) to enable visual discernibility.

 Genes catalogued in the Cancer Gene Census, whose gene
sequences bear somatic and germline mutations responsible
for cancer, are represented by shades of red.
 Hovering the mouse pointer over any gene displays
additional information.

Genomic Tracks Display

Select any tumor and associated patients from LinkedTCGA datasets

Stacked Display of patients’ Exon Expression (Green) and DNA Methylation (Red) Results

Catalogued Cancer-related mutations extracted from COSMIC

Simultaneous analysis on different genome tracks

Zooming and Automated Scrolling controls
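
A linear tracks display has to map genomic coordinates to screen positions and adjust that mapping when the user zooms. The sketch below shows one common way to do this; the function names, the centre-preserving zoom and the canvas width are assumptions for illustration, not details taken from GenomeSnip.

```python
def position_to_pixel(position, view_start, view_end, canvas_width):
    """Map a genomic coordinate (bp) to an x pixel within the visible window."""
    return (position - view_start) / (view_end - view_start) * canvas_width


def zoom(view_start, view_end, factor):
    """Shrink (factor > 1) or widen (factor < 1) the window around its centre."""
    centre = (view_start + view_end) / 2
    half_width = (view_end - view_start) / (2 * factor)
    return centre - half_width, centre + half_width


# Zoom in 2x on a 1 Mb window, then locate the window's midpoint on screen.
start, end = zoom(1_000_000, 2_000_000, factor=2)
x = position_to_pixel(1_500_000, start, end, canvas_width=800)
print(start, end, x)
```

Applying the same window to every track keeps them aligned, which is what allows simultaneous analysis across tracks.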

Comparative Evaluation
 Empirical evaluation conducted through a questionnaire (http://goo.gl/vnLtX4)
embedded within the GenomeSnip platform.
 4 PhD students and 5 researchers working in Biotechnology (primarily protein
engineering) and Bioinformatics responded.

[Table: feature comparison of genome visualization tools]
Tools compared: GenomeSnip, Circos, Regulome, Genome Maps, UCSC, IGV, Savant
Features compared: Linear Coordinates, Circular Plot, Web Interface, Data Upload,
Third-party modules, TCGA Demonstration, GWAS Insights, Chromosomal Relations,
Simultaneous Analysis, Knowledge Integration, Other Visualizations (Heatmap,
Network), Dynamicity

Slide callouts:
 Integration with JBrowse or GenomeMaps for a better Genome Tracks View
 Data Upload of genomics data of new patients for visual comparison
 Introduce Heatmap plots for exon expression data

Applicability of GenomeSnip
Formulating Improved Hypotheses:
 Integration of the knowledge extracted from GWAS, available
literature and pathways, and interpretation through an intuitive
visualization, allowing isolation and comparative analysis of
interesting genomic segments using the Linked TCGA datasets.
Discovering Protein Interactions:
 Usage and improved visual discernibility of co-occurrence networks
of genes, along with analysis of highly co-occurrent gene pairs using
gene/protein expression datasets.
Predicting Tumour Risk on a personalized level:
 Evaluation of the genomic alterations, expression levels and clinical
data of a new patient against those of patients registered under the
TCGA project, enabling personalized medicine.

Future Work on GenomeSnip
 Integrate prostate cancer prediction models into the platform, for
validation using Linked TCGA datasets as a training set, and
predicting tumour risk in new patients.
 Provide an ‘interaction’ overlay by integrating knowledge on
protein-protein interactions (BioGrid), gene co-expression (CoExpresDB) and
functions (Gene Ontology).
 Extract information like disease variant mutations from UniProt to
interlink the ‘point mutations’ layer in the ‘Genomic Wheel’ with
other segments.
 Improve the granularity of the ‘Genomic Wheel’ and extensively test
the user experience and the usability of this platform by conducting
a user-driven evaluation.
