Use of Open Linked Data in Bioinformatics Space: A Case Study on Pathway-Gene-Disease Relationships
1. Use of Open Linked Data in Bioinformatics Space: A Case Study
Remzi Çelebi
Department of Computer Engineering,
Ege University
İzmir, Turkey
remzi.celebi@ege.edu.tr
Özgür Gümüş
Department of Computer Engineering,
Ege University
İzmir, Turkey
ozgur.gumus@ege.edu.tr
Yeşim Aydın Son
Department of Health Informatics,
Middle East Technical University
Ankara, Turkey
yesim@metu.edu.tr
2. Outline
● Semantic Web – very brief intro
– RDF, SPARQL, Linked Data
● Use Case Scenarios
● Conclusion
● Future work
3. Semantic Web
● The Semantic Web, the next generation of the web, is considered an extension of the current web and provides a framework for integrating data from heterogeneous resources.
● The Semantic Web enables machines to perform more of the tedious work involved in finding, combining and extracting information on the web.
4. Semantic Web – Open Linked Data
● Open linked data is a new approach that uses Semantic Web technology to publish, integrate and analyze open data on the web.
● Open linked data suggests that data on the web should be linked and open for use by practical applications.
● It provides two kinds of advantages: the ability to search multiple datasets through a single framework, and the ability to search relationships and paths of relationships that span different datasets.
5. Semantic Web Technologies - RDF
● The Resource Description Framework (RDF) is the most fundamental way of describing resources and the relationships between them in the Semantic Web.
● An RDF triple is a statement about a resource in the form of a subject-predicate-object expression.
● RDF can be represented in a variety of formats, including XML and JSON.
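As a minimal sketch of the subject-predicate-object idea, the Python snippet below models a triple as a named tuple and writes it out as one N-Triples line. The `Triple` type, the `to_ntriples` helper and the example resource are illustrative assumptions, not part of Bio2RDF or of any RDF library.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str    # URI of the resource being described
    predicate: str  # URI of the property
    obj: str        # URI of another resource, or a plain literal


def to_ntriples(t: Triple) -> str:
    """Serialize one triple as an N-Triples line (URIs in <>, literals quoted)."""
    def term(value: str) -> str:
        if value.startswith(("http://", "https://")):
            return f"<{value}>"
        escaped = value.replace("\\", "\\\\").replace('"', '\\"')
        return f'"{escaped}"'
    return f"{term(t.subject)} {term(t.predicate)} {term(t.obj)} ."


# Hypothetical statement: the resource gene:BRCA1 has the label "BRCA1".
example = Triple(
    "http://www.bio2rdf.org/gene:BRCA1",
    "http://www.w3.org/2000/01/rdf-schema#label",
    "BRCA1",
)
line = to_ntriples(example)
print(line)
```

A real application would normally rely on an RDF library rather than hand-written serialization; the sketch only makes the three-part shape of a statement concrete.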
6. Uniform Resource Identifier - URI
● "The generic set of all names/addresses that are short strings that refer to resources"
– URLs (Uniform Resource Locators) are a particular type of URI, used for resources that can be accessed on the WWW (e.g., web pages)
● In RDF, URIs typically look like "normal" URLs, often with fragment identifiers to point at specific parts of a document.
Example, in shorthand notation: gene:BRCA1
● The PREFIX keyword is used to declare the short form of a resource namespace:
PREFIX gene: <http://www.bio2rdf.org/gene:>
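To illustrate how a shorthand name relates to a full URI, here is a small Python sketch that expands a prefixed name using a prefix table. The `expand` function and the `PREFIXES` dictionary are hypothetical helpers written for this example; only the `gene:` namespace comes from the slide.

```python
# Prefix table; the gene: namespace is the one declared on this slide.
PREFIXES = {
    "gene": "http://www.bio2rdf.org/gene:",
}


def expand(name, prefixes=PREFIXES):
    """Expand a shorthand name such as gene:BRCA1 into a full URI."""
    prefix, sep, local = name.partition(":")
    if not sep or prefix not in prefixes:
        raise ValueError(f"unknown or missing prefix in {name!r}")
    return prefixes[prefix] + local


full_uri = expand("gene:BRCA1")
print(full_uri)
```

A SPARQL engine performs the same substitution when it reads a PREFIX declaration, so the short and long forms identify the same resource.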
7. SPARQL
● SPARQL is a query language for retrieving and manipulating data in RDF format.
● A SPARQL endpoint is a service that provides a SPARQL-queryable interface to a set of RDF statements stored in a triple store.
● SPARQL searches for all subgraphs that match the graph described by the triples in the query.
SELECT * WHERE { ?subject ?predicate ?object . }
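The subgraph-matching idea can be sketched in a few lines of Python. This is a toy matcher over an in-memory list of triples, assuming variables are strings that start with `?`; it is not how a triple store is implemented, and all identifiers in `store` are made up for illustration.

```python
from typing import Dict, List, Optional, Tuple

Triple = Tuple[str, str, str]


def match(pattern: Triple, triple: Triple,
          binding: Dict[str, str]) -> Optional[Dict[str, str]]:
    """Extend `binding` so that `pattern` equals `triple`, or return None."""
    result = dict(binding)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):
            # A variable must keep the value it was bound to earlier.
            if result.setdefault(term, value) != value:
                return None
        elif term != value:
            return None
    return result


def query(patterns: List[Triple], store: List[Triple]) -> List[Dict[str, str]]:
    """Return every variable binding that satisfies all patterns."""
    bindings: List[Dict[str, str]] = [{}]
    for pattern in patterns:
        extended = []
        for binding in bindings:
            for triple in store:
                new_binding = match(pattern, triple, binding)
                if new_binding is not None:
                    extended.append(new_binding)
        bindings = extended
    return bindings


# Toy data with made-up identifiers.
store = [
    ("gene:1", "pathway", "kegg:04520"),
    ("gene:2", "pathway", "kegg:04520"),
    ("gene:1", "phenotype", "omim:100"),
]
all_rows = query([("?subject", "?predicate", "?object")], store)
joined = query([("?g", "pathway", "kegg:04520"), ("?g", "phenotype", "?d")], store)
print(len(all_rows), joined)
```

The first call corresponds to the `SELECT *` query above and returns one row per stored triple; the second shows how a shared variable (`?g`) joins two patterns.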
8. Semantic Web for Health Care and Bioinformatics
● There is a large cloud of data, in diverse formats, that includes information about genes, proteins, gene networks, protein-protein interactions, genetic variations, chemical compounds, diseases and drugs.
● The complexity of the life sciences comes from integrating and analyzing the enormous amount of data produced by research across this variety of domains.
9. Bio2RDF Project
● The project creates a knowledge space of RDF documents that are linked together with normalized URIs and share a common ontology.
● Documents from public bioinformatics databases such as KEGG, PDB, MGI, HGNC and several of NCBI's databases are available in RDF format through a unique URL of the form http://bio2rdf.org/namespace:id.
● Bio2RDF has created an RDF warehouse that serves over 70 million triples describing the human and mouse genomes.
10. Bio2RDF
● Bio2RDF differs in several ways from previous efforts that have been providing the life sciences with linked data, such as Neurocommons, LinkedLifeData, W3C HCLS, Chem2Bio2RDF and BioLOD:
– First, Bio2RDF provides a unique linked data vocabulary and topology.
– Second, Bio2RDF produces syntactically interoperable linked data across all datasets by defining a set of basic guidelines.
– Third, the community can benefit from the Bio2RDF infrastructure, with an expandable global network of mirrors that host Bio2RDF datasets and a federated network of SPARQL endpoints.
– Finally, Bio2RDF is open source and freely available to use, modify or redistribute.
11. Use Case Scenario
● As a case study, to show the capabilities and benefits of the Bio2RDF project, we defined the following question:
For a given pathway, what are the diseases associated with the individual genes in the pathway?
● Answering this question requires a set of data sources: CTD, OMIM and NCBI Gene. These datasets can be queried on the web as part of the Bio2RDF project.
12. Use Case Scenario
a) Query-1: CTD for gene-pathway information
PREFIX ctd_vocabulary: <http://bio2rdf.org/ctd_vocabulary:>
SELECT ?geneID
WHERE {
  ?geneID ctd_vocabulary:pathway <http://bio2rdf.org/kegg:04520> .
}
b) Query-2: OMIM for gene-disease association
PREFIX omim_vocabulary: <http://bio2rdf.org/omim_vocabulary:>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
SELECT ?gene ?pheno
WHERE {
  ?gene omim_vocabulary:phenotype ?pheno .
  ?pheno rdf:type omim_vocabulary:Phenotype .
  ?gene rdf:type omim_vocabulary:Gene .
}
c) Query-3: NCBI Gene for conversion of gene ID to ENSEMBL ID
PREFIX geneid_vocabulary: <http://bio2rdf.org/geneid_vocabulary:>
SELECT ?geneID ?ensemblID
WHERE {
  ?geneID geneid_vocabulary:has_ensembl_gene_identifier ?ensemblID .
}
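The way the three result sets fit together can be sketched in Python as a simple join. This is only an illustration under stated assumptions: the rows below are invented placeholders (not real CTD, OMIM or NCBI Gene data), the gene identifiers are assumed to be already normalized to a common form across the three sources, and `diseases_by_ensembl_gene` is a hypothetical helper rather than the procedure used in the study.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def diseases_by_ensembl_gene(
    pathway_genes: Iterable[str],
    gene_phenotypes: Iterable[Tuple[str, str]],
    gene_to_ensembl: Dict[str, str],
) -> Dict[str, List[str]]:
    """Join the three result sets: pathway genes -> phenotypes -> ENSEMBL IDs."""
    in_pathway = set(pathway_genes)
    grouped = defaultdict(set)
    for gene, pheno in gene_phenotypes:
        # Keep only pathway genes that also have an ENSEMBL mapping.
        if gene in in_pathway and gene in gene_to_ensembl:
            grouped[gene_to_ensembl[gene]].add(pheno)
    return {ens: sorted(grouped[ens]) for ens in sorted(grouped)}


# Placeholder rows shaped like the outputs of Query-1, Query-2 and Query-3.
pathway_genes = ["gene:A", "gene:B", "gene:C"]
gene_phenotypes = [
    ("gene:A", "pheno:2"),
    ("gene:A", "pheno:1"),
    ("gene:C", "pheno:3"),
    ("gene:X", "pheno:9"),  # not in the pathway, so it is dropped
]
gene_to_ensembl = {"gene:A": "ens:1", "gene:B": "ens:2", "gene:C": "ens:3"}

result = diseases_by_ensembl_gene(pathway_genes, gene_phenotypes, gene_to_ensembl)
print(result)
```

In a federated SPARQL setting the same join would be expressed through shared variables in one query instead of being done client-side.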
14. Results
● When the results of the BioMart (after providing KEGG Gene IDs) and Bio2RDF (all-in-one-step) searches are compared for gene ID-OMIM ID matches:
● Bio2RDF matched 27 unique ENSEMBL gene IDs from the KEGG04520 pathway with 59 OMIM IDs, whereas the BioMart results included only 50 of these OMIM IDs for the same query, with no additional matches.
● The difference between the result sets is likely due to the versions of OMIM searched by the two services. The validity of all results was confirmed against the current build of the OMIM database.
15. More Use Cases
Finding important genes (hub genes) through the pathways related to a disease
PREFIX ctd_vocabulary: <http://bio2rdf.org/ctd_vocabulary:>
PREFIX omim_vocabulary: <http://bio2rdf.org/omim_vocabulary:>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
SELECT ?symbol (COUNT(DISTINCT ?pathway) AS ?indirect_num)
WHERE {
  <http://bio2rdf.org/mesh:D006349> ctd_vocabulary:pathway ?pathway .
  ?geneid ctd_vocabulary:pathway ?pathway .
  ?geneid rdf:type ctd_vocabulary:Gene .
  ?geneid ctd_vocabulary:gene-symbol ?symbol .
}
GROUP BY ?symbol
ORDER BY DESC(?indirect_num)
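The grouping and ordering in this query can be mirrored in Python to show what is being counted. The sketch assumes the rows are (gene symbol, pathway) pairs already retrieved for the disease; the symbols and pathway names are placeholders, and `rank_hub_genes` is a hypothetical helper.

```python
from collections import defaultdict
from typing import Iterable, List, Tuple


def rank_hub_genes(rows: Iterable[Tuple[str, str]]) -> List[Tuple[str, int]]:
    """Count distinct pathways per gene symbol, highest count first."""
    pathways_by_symbol = defaultdict(set)
    for symbol, pathway in rows:
        pathways_by_symbol[symbol].add(pathway)  # a set ignores duplicate rows
    counts = [(symbol, len(paths)) for symbol, paths in pathways_by_symbol.items()]
    # Sort by descending count; break ties alphabetically for a stable result.
    return sorted(counts, key=lambda item: (-item[1], item[0]))


# Placeholder rows: (gene symbol, pathway shared with the disease).
rows = [
    ("G1", "p1"), ("G1", "p2"), ("G1", "p1"),
    ("G2", "p1"),
    ("G3", "p2"), ("G3", "p3"), ("G3", "p4"),
]
ranking = rank_hub_genes(rows)
print(ranking)
```

Genes that appear in many of the disease-related pathways rise to the top of the ranking, which is the sense in which they are treated as hub genes here.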
16. More Use Cases
Finding diseases related to a given SNP (by rsID) through gene association
PREFIX ctd_vocabulary: <http://bio2rdf.org/ctd_vocabulary:>
PREFIX omim_vocabulary: <http://bio2rdf.org/omim_vocabulary:>
PREFIX pharmgkb_vocabulary: <http://bio2rdf.org/pharmgkb_vocabulary:>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT DISTINCT ?disease_label
WHERE {
  ?assoc rdf:type pharmgkb_vocabulary:Disease-Gene-Association .
  ?assoc pharmgkb_vocabulary:disease ?disease .
  ?disease rdfs:label ?disease_label .
  ?assoc pharmgkb_vocabulary:gene ?gene .
  ?rsid pharmgkb_vocabulary:gene ?gene .
  FILTER regex( str(?rsid), "rs1801253" ) .
}
17. Conclusion
● Through the pathway-gene-disease use case built here, we have shown that, with Bio2RDF datasets, different queries can be flexibly built, merged and run in a federated fashion to retrieve the correct data in a single run, which is not possible with any other single database or service.
● In this paper, we defined a use case that involves querying multiple distant data sources that are semantically available through Bio2RDF. The results were also compared with, and validated against, traditional search techniques.
18. Future work
● This work will continue in two directions:
– The first direction is to develop a web interface that helps researchers query multiple data sources through visual query templates, without requiring knowledge of RDF or SPARQL.
– The second direction is to develop a monitoring system that keeps researchers aware of updates, across multiple data sources, to the data related to their research.