The document describes GenomeSnip, a semantic visual analytics platform for knowledge exploration and discovery in cancer research. It integrates multiple genomic and biomedical datasets and uses a circular "Genomic Wheel" visualization and linear "Genomic Tracks" display to show relationships between chromosomes and allow analysis of genomic features. An evaluation found it effectively integrated knowledge and enabled simultaneous, dynamic analysis compared to other tools. Future work includes integrating it with genome browsers for improved linear visualization of tracks.
GenomeSnip: Fragmenting the genomic wheel to augment discovery in cancer research
1. GenomeSnip: Fragmenting the
Genomic Wheel to augment discovery
in cancer research
Maulik R. Kamdar, Aftab Iqbal, Muhammad Saleem,
Helena F. Deus and Stefan Decker
Conference on Semantics in Healthcare and Life Sciences (CSHALS), February 26-28, 2014, Boston
3. Integrative Genomics
Advanced computational analyses of genomic datasets, such as
genome-wide association studies (GWAS), identify several
susceptibility loci (point alterations and oncogenes) associated with
various types of cancer, which are then catalogued.
Network-based approaches isolate genes implicated together
(‘Co-occurrence’) in any disease or pathway, i.e. macro-molecular
associations on a higher, abstract level.
Linked Data and Semantic Web Technologies address challenges on
integration of multiple knowledgebases.
Advantages: Representation using Resource Description Framework
(RDF), structured querying, integration with other biomedical
datasets, data mining and knowledge extraction.
4. Genomic Visualization
Genome browsers navigate across the human genome in an intuitive,
linear fashion. Data points are mapped to the genomic coordinates and
datasets are visualized as charts or heatmaps.
Disadvantages:
Human perception is better at interpreting patterns in depth
than along length.
Fail to display inter- and intra-chromosomal relations, which could be
easily interpreted and used by humans.
Circular plots overcome the cognitive barriers to grasp relations
between disjoint genomic features on an abstract level.
Disadvantages:
Focus on the entire human genome or chromosome.
Visualizations rendered as static images and lack interactivity.
5. Motivation
With the rise of high-throughput gene sequencing technologies, data
analysis has replaced data generation as the rate-limiting step for the
interpretation of genomic patterns and discovery.
Analyzing genomic datasets using the knowledge extracted through
GWAS could help discover new tumour risk hypotheses and
diagnose cancer on a personalized basis.
Studying Co-occurrence networks of genes could predict hidden
protein-protein interactions (PPI) and functions.
Improved, interactive, intuitive, genomic visualization solutions,
which infuse knowledge insights into cancer research, need to be
perfected to augment discovery.
6. GenomeSnip Platform
A semantic, visual analytics prototype devised to expedite knowledge
exploration and discovery in cancer research.
Idea: ‘Snip’ the human genome informatively into fragments through
interaction with an aggregative, circular visualization, the
‘Genomic Wheel’, and analyze the snipped fragments in detail in a
linear ‘Genomic Tracks’ display.
Technologies: Web-based client application developed using native
technologies like HTML5 Canvas, JavaScript and JSON.
KineticJS library, an HTML5 Canvas JavaScript framework, is used for
node nesting, layering, caching and event handling.
Availability: http://srvgal78.deri.ie/genomeSnip/
7. Integration of Data Sources
UCSC Genome Browser: Chromosome Bands (Ideograms) in the Human
Genome Assembly (GRCh37/hg19, Feb 2009)
8. Integration of Data Sources
CellBase: Coordinates and descriptions of the protein-coding genes,
genomic variants like cancer-related mutations (COSMIC) and
SNPs (dbSNP).
9. Integration of Data Sources
Cancer Gene Census: Catalogued oncogenes, which bear somatic and
germline mutations in their sequences
10. Integration of Data Sources
UniProt: Disease-gene mappings exposed by the EBI RDF platform.
11. Integration of Data Sources
Kyoto Encyclopedia of Genes and Genomes (KEGG): Pathway-gene
mappings exposed as a SPARQL Endpoint.
12. Integration of Data Sources
Pubmed2Ensembl: Customized service extended to provide
gene-related publication information.
13. Integration of Data Sources
Linked TCGA: Information related to the genomic alterations like DNA
methylations, SNPs, CNVs and gene expression changes, in cancer
patients, mapped to the genomic loci
14. Co-occurrence Data Cubes
Genes mapped with diseases, pathways and publications are
extracted a priori to instantiate co-occurrence pairs between all
possible combinations of genes, after identifier conversions.
A co-occurrence data cube of 3 dimensions (segment 1, segment 2
and data source type) and 2 measures (co-occurrence count and
names of mapped resources) is created between the genes,
ideograms and chromosomes.
The data cubes are transformed to RDF (N-triples syntax) using RDF
Data Cube Vocabulary and are stored in a Triple Store.
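Co-occurrence counts per data source (Dis = diseases, Path = pathways, Pub = publications):
Segment | Chr1 (Dis / Path / Pub) | Chr2 (Dis / Path / Pub) | Chr3 (Dis / Path / Pub) | … | Total (Dis / Path / Pub)
Chr1 | 28 / 4049 / 4522 | 15 / 4251 / 2915 | 19 / 4811 / 2667 | … | 226 / 75044 / 49827
Chr2 | 15 / 4251 / 2915 | 14 / 1662 / 3589 | 14 / 2532 / 1748 | … | 155 / 42513 / 31424
Chr3 | 19 / 4811 / 2667 | 14 / 2532 / 1748 | 16 / 1338 / 1390 | … | 224 / 44310 / 26817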
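The construction of the co-occurrence cube can be sketched roughly as follows. This is a minimal illustration, not the platform's actual pipeline: the gene names, resource names and the `build_cube` helper are hypothetical, and it assumes that a gene pair is counted once for every resource (disease, pathway or publication) in which both genes appear.

```python
from itertools import combinations

# Hypothetical toy inputs: for each data source type, resource -> mapped genes.
gene_sets = {
    "Dis": {"disease_A": ["G1", "G2", "G3"], "disease_B": ["G2", "G4"]},
    "Path": {"pathway_X": ["G1", "G4"]},
}
# Hypothetical gene -> chromosome lookup (after identifier conversion).
chromosome_of = {"G1": "Chr1", "G2": "Chr1", "G3": "Chr2", "G4": "Chr3"}


def build_cube(gene_sets, chromosome_of):
    """Return {(segment1, segment2, source): {"count": n, "resources": set}}.

    The key holds the three dimensions; the value holds the two measures.
    """
    cube = {}
    for source, resources in gene_sets.items():
        for resource, genes in resources.items():
            # Every unordered pair of genes sharing this resource co-occurs.
            for g1, g2 in combinations(sorted(set(genes)), 2):
                seg1, seg2 = sorted((chromosome_of[g1], chromosome_of[g2]))
                cell = cube.setdefault(
                    (seg1, seg2, source), {"count": 0, "resources": set()}
                )
                cell["count"] += 1
                cell["resources"].add(resource)
    return cube


cube = build_cube(gene_sets, chromosome_of)
for key in sorted(cube):
    print(key, cube[key]["count"], sorted(cube[key]["resources"]))
```

The same aggregation could be repeated with ideograms or genes as segments by swapping the lookup table.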
15. Similarity Measures
Inspired by Tversky's feature-based similarity measure.
Similarity between two entities (chromosomes) is a weighted
summation of their common features (total number of co-occurrence
pairs of contained genes).
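$$\mathrm{Sim}_{12} = \alpha \cdot \frac{(\mathrm{Chr1} \cap \mathrm{Chr2})_{\mathrm{Dis}}}{\mathrm{HGTotal}_{\mathrm{Dis}}} + \beta \cdot \frac{(\mathrm{Chr1} \cap \mathrm{Chr2})_{\mathrm{Path}}}{\mathrm{HGTotal}_{\mathrm{Path}}} + \gamma \cdot \frac{(\mathrm{Chr1} \cap \mathrm{Chr2})_{\mathrm{Pub}}}{\mathrm{HGTotal}_{\mathrm{Pub}}} \tag{1}$$
HGTotalDis, HGTotalPath, HGTotalPub = Total number of gene-gene co-occurrence pairs extracted
from the entire Human Genome (HG) for the respective data source.
HGTotalDis = 1435, HGTotalPath = 83970, HGTotalPub = 129035
Evaluating for Chromosomes 1 and 2:
$$\mathrm{Sim}_{12} = \alpha \cdot \frac{15}{1435} + \beta \cdot \frac{4251}{83970} + \gamma \cdot \frac{2915}{129035}$$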
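As a rough numerical check of the similarity measure, the sketch below evaluates it for Chromosomes 1 and 2 using the counts from the data cube and the weights quoted on the Genomic Wheel slide (α = 1, β = 0.4, γ = 0.1). The function and variable names are illustrative only.

```python
# Genome-wide totals of gene-gene co-occurrence pairs per data source.
HG_TOTAL = {"Dis": 1435, "Path": 83970, "Pub": 129035}
# Weights alpha, beta, gamma as quoted for the current version.
WEIGHTS = {"Dis": 1.0, "Path": 0.4, "Pub": 0.1}


def similarity(shared, totals=HG_TOTAL, weights=WEIGHTS):
    """Weighted sum of shared co-occurrence pairs, normalised per data source."""
    return sum(weights[s] * shared[s] / totals[s] for s in weights)


# Co-occurrence pairs shared between Chr1 and Chr2 (from the data cube).
chr1_chr2 = {"Dis": 15, "Path": 4251, "Pub": 2915}
sim_12 = similarity(chr1_chr2)
print(f"Sim12 = {sim_12:.4f}")  # about 0.033
```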
16. Similarity Measures (contd.)
The maximum similarity for any chromosome is calculated by evaluating
Equation 1 with the total number of gene-gene co-occurrence pairs in which the genes of that chromosome (here, Chromosome 1) participate.
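$$\mathrm{Sim}_{1\mathrm{Max}} = \alpha \cdot \frac{\mathrm{Chr1Total}_{\mathrm{Dis}}}{\mathrm{HGTotal}_{\mathrm{Dis}}} + \beta \cdot \frac{\mathrm{Chr1Total}_{\mathrm{Path}}}{\mathrm{HGTotal}_{\mathrm{Path}}} + \gamma \cdot \frac{\mathrm{Chr1Total}_{\mathrm{Pub}}}{\mathrm{HGTotal}_{\mathrm{Pub}}}$$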
Evaluating with the values of last column in the data cube
$$\mathrm{Sim}_{1\mathrm{Max}} = \alpha \cdot \frac{226}{1435} + \beta \cdot \frac{75044}{83970} + \gamma \cdot \frac{49827}{129035}$$
The relative similarity impact on the side of any chromosome is
calculated by dividing the similarity measure by the maximum
similarity for that chromosome (Sim12 / Sim1Max).
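The relative similarity impact differs at the two ends of a chord, because each chromosome has its own maximum similarity. The sketch below works this out for the Chr1 and Chr2 values in the data cube, assuming the weights α = 1, β = 0.4, γ = 0.1; the helper names are hypothetical.

```python
HG_TOTAL = {"Dis": 1435, "Path": 83970, "Pub": 129035}
WEIGHTS = {"Dis": 1.0, "Path": 0.4, "Pub": 0.1}  # alpha, beta, gamma


def weighted(counts):
    """Weighted, normalised sum over the three data sources."""
    return sum(WEIGHTS[s] * counts[s] / HG_TOTAL[s] for s in WEIGHTS)


# Values taken from the data cube: the Chr1/Chr2 cell and the Total column.
shared_12 = {"Dis": 15, "Path": 4251, "Pub": 2915}
chr1_total = {"Dis": 226, "Path": 75044, "Pub": 49827}
chr2_total = {"Dis": 155, "Path": 42513, "Pub": 31424}

sim_12 = weighted(shared_12)
sim_1_max = weighted(chr1_total)
sim_2_max = weighted(chr2_total)

impact_on_chr1 = sim_12 / sim_1_max  # chord thickness on the Chr1 side
impact_on_chr2 = sim_12 / sim_2_max  # chord thickness on the Chr2 side
print(f"Chr1 side: {impact_on_chr1:.3f}, Chr2 side: {impact_on_chr2:.3f}")
```

With these numbers the chord is thicker on the Chr2 side than on the Chr1 side, which is one way the tapering of chords could arise.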
17. Genomic Wheel
The human genome is laid in a
circular layout.
Chromosomes form the arcs
on the perimeter, whose
length is proportional to the
size of each chromosome.
The thickness of the connecting chords is proportional to the relative
similarity impact between connected chromosomes.
α = 1, β = 0.4, γ = 0.1
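The layout rules above can be sketched as follows. This is only an illustration of the proportionality, not the platform's rendering code: the chromosome sizes are approximate values in megabases for three chromosomes, and `arc_spans`, `chord_width` and `max_width` are hypothetical names.

```python
import math

# Approximate chromosome sizes in Mbp (illustrative subset, not the full genome).
sizes_mbp = {"Chr1": 249, "Chr2": 243, "Chr3": 198}


def arc_spans(sizes, gap=0.0):
    """Assign each chromosome a (start, end) angle in radians around the wheel.

    The sweep of each arc is proportional to the chromosome size; `gap` is an
    optional angular spacing inserted after every arc.
    """
    total = sum(sizes.values())
    usable = 2 * math.pi - gap * len(sizes)
    spans, start = {}, 0.0
    for name, size in sizes.items():
        sweep = usable * size / total
        spans[name] = (start, start + sweep)
        start += sweep + gap
    return spans


def chord_width(relative_impact, max_width=40.0):
    """Chord thickness at one end, proportional to the relative similarity impact."""
    return max_width * relative_impact


spans = arc_spans(sizes_mbp)
for name, (start, end) in spans.items():
    print(f"{name}: {math.degrees(start):.1f} to {math.degrees(end):.1f} degrees")
```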
18. Genomic Wheel (contd.)
Hovering the mouse over each
chord displays a slice of the
co-occurrence data cube.
The hierarchical categorization (chromosome, ideogram, gene
and cancer point mutations) forms layers in the
representative arc.
Only the ‘chromosome’ and
‘ideogram’ layers are shown in
the initial layout.
Chord’s thickness is proportional to relative similarity impact
between Chromosome 11 and 22 on the connected side (tapering).
19. Genomic Wheel (contd.)
On clicking an arc, the represented segment is
highlighted and flares out to
display subsequent layers.
At the genetic level the relation chords are
represented using distinct
colors (red, green and blue for
diseases, pathways and
publications respectively) to
enable visual discernibility.
20. Genomic Wheel (contd.)
Genes catalogued in the
Cancer Gene Census, whose
gene sequences bear somatic
and germline mutations
responsible for cancer, are
represented by shades of red.
Hovering the mouse pointer
over any gene displays
additional information.
21. Genomic Tracks Display
Select any tumor and associated patients from
LinkedTCGA datasets
22. Genomic Tracks Display
Stacked Display of patients’ Exon Expression (Green) and
DNA Methylation (Red) Results
23. Genomic Tracks Display
Catalogued Cancer-related mutations extracted from COSMIC
26. Comparative Evaluation
Empirical Evaluation conducted through a questionnaire (http://goo.gl/vnLtX4)
embedded within the GenomeSnip platform.
4 PhD students and 5 researchers working in Biotechnology (primarily protein
engineering) and Bioinformatics responded.
31. Applicability of GenomeSnip
Formulating Improved Hypotheses:
Integration of the knowledge extracted from GWAS, available
literature and pathways, and interpretation through an intuitive
visualization, allowing isolation and comparative analysis of
interesting genomic segments using the Linked TCGA datasets.
Discovering Protein Interactions:
Usage and improved visual discernibility of co-occurrence networks
of genes, along with analysis of highly co-occurrent gene pairs using
gene/protein expression datasets.
Predicting Tumour Risk on a personalized level:
Evaluation of the genomic alterations, expression levels and clinical
data of a new patient against those patients registered under the
TCGA project, enabling personalized medicine.
32. Future Work on GenomeSnip
Integrate prostate cancer prediction models into the platform, for
validation using Linked TCGA datasets as a training set, and
predicting tumour risk in new patients.
Provide an ‘interaction’ overlay by integrating knowledge on protein-protein
interactions (BioGrid), gene co-expression (CoExpresDB) and
functions (Gene Ontology).
Extract information like disease variant mutations from UniProt to
interlink the ‘point mutations’ layer in the ‘Genomic Wheel’ with
other segments.
Improve the granularity of the ‘Genomic Wheel’ and extensively test
the user experience and the usability of this platform by conducting
a user-driven evaluation.
Most of the popular genome browsers provide navigation across the human genome in a linear fashion. Whereas automated mining tools are better suited to linear genomic analysis, human perception is better at interpreting patterns in depth than along length. Moreover, linear visualizations fail to account for inter- and intra-chromosomal relations, which humans can easily interpret and use but which are difficult for machines.
‘Snip’ means ‘Clip’. In genomics, ‘snipping’ is a common verb used in conjunction with the chromosome.
We have combined the salient features of linear genome browsers and circular plots, and present an interactive alternative that displays only those genomic regions, and their inherent relations, which are of actual interest to the cancer expert.
Why these technologies?
No dependence on proprietary frameworks like Adobe Flash and Silverlight for interactivity, and support across standard browsers improves interoperability.
D3.js primarily relies on SVG, which is suitable for developing interactive visualizations of smaller datasets. Performance degrades considerably when rendering larger datasets, because SVG stores the rendered objects directly in the browser DOM.
HTML5 Canvas, by contrast, creates a raster graphic of the entire visualization prior to rendering in the browser window.
Approach:
Knowledge retrieval from various data sources
Determine the co-occurrence matrices between genomic features
Assemble the generated insights into an aggregated, interactive visualization ‘Genomic Wheel’.
SNP – Single Nucleotide Polymorphism
CNV – Copy Number Variation
LinkedTCGA – 20.4 billion triples
Gene identifier conversions were done using HGNC table – this step is required because different data sources use different nomenclatures and ids (EntrezGene, Ensembl or KEGG) to represent the gene.
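A minimal sketch of such an identifier conversion is shown below. It assumes a small in-memory table in the spirit of the HGNC mapping table; the row layout, the namespace labels and the helper names are hypothetical, and the KEGG identifiers are assumed to follow the `hsa:<Entrez ID>` form.

```python
# Illustrative rows only; a real run would load the full HGNC table instead.
HGNC_ROWS = [
    {"symbol": "TP53", "entrez": "7157", "ensembl": "ENSG00000141510"},
    {"symbol": "BRCA1", "entrez": "672", "ensembl": "ENSG00000012048"},
]


def build_index(rows):
    """Map (namespace, identifier) pairs to a single canonical gene symbol."""
    index = {}
    for row in rows:
        index[("entrez", row["entrez"])] = row["symbol"]
        index[("ensembl", row["ensembl"])] = row["symbol"]
        # Assumption: KEGG human gene ids are 'hsa:' plus the Entrez gene id.
        index[("kegg", "hsa:" + row["entrez"])] = row["symbol"]
    return index


def to_symbol(index, namespace, identifier):
    """Return the canonical symbol, or None when the identifier is unknown."""
    return index.get((namespace, identifier))


index = build_index(HGNC_ROWS)
print(to_symbol(index, "kegg", "hsa:7157"))
print(to_symbol(index, "ensembl", "ENSG00000012048"))
```

Normalising every source to one symbol before pairing genes keeps the same gene from being counted under several identifiers.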
HGTotalDis = 1435, HGTotalPath = 83970, HGTotalPub = 129035
Total number of co-occurrence pairs extracted from the entire human genome in current version
α = 1, β = 0.4, γ = 0.1 (in the current version)
We hope to allow users to decide the weights (α, β, γ) using input controls later.
The relative similarity impact is a visualization parameter used in the Genomic Wheel to represent the thickness of the chord connecting two chromosomes on the side of one chromosomal arc.
JBrowse and GenomeMaps provide highly interactive web-based Genomic Tracks viewers with more convenient loading and removal of tracks related to SNPs, exons, etc.
We hope to provide a Data Upload Utility for allowing cancer researchers to upload their own high-throughput data to visually compare against Linked TCGA Datasets.
Prostate adenocarcinoma, which is one of the most common malignancies affecting men, could be diagnosed through a combined evaluation of an individual's genomic and clinical data on a personalized scale [Boyd, 2012].