Integration of Bioinformatics Web Services through the Search Computing Technology
1. Doctoral Minor Research
Project Defense
19th November 2012
Dipartimento di
Elettronica e Informazione
Integration of Bioinformatics Web Services
through the Search Computing Technology
Davide Chicco davide.chicco@elet.polimi.it
2. SUMMARY
1. The problem
• Multidomain questions
2. The proposed solution
• GPDW Data Warehouse
• Search Computing
• Bio-SeCo
3. Developed and added services
• Exploiting GPDW Data Warehouse
• Semantic Similarity
4. Conclusions
3. Data and search service scenario
in the Life Sciences
• In the Life Sciences: numerous data, sparsely distributed in
many heterogeneous sources
• Many are ranked data (or partially ranked) of various
types, representing different phenomena, e.g.:
– physical ordering, e.g. within a genome
– analytical order through algorithmically assigned
scores, e.g. representing levels of sequence similarity
– experimentally measured values, such as gene
expression levels
• The ordering may represent a range of different notions,
such as quantity, confidence, or location
4. Life Science questions and their answering
– Several Life Science questions:
- are complex
- can be answered only by integrating and comprehensively
evaluating different data
– which are often distributed, and many of which are ranked
• Answering complex questions requires integrating vertical
search services to create multi-topic searches
• in which the different topic searches either refine or augment previous
search results
• Bioinformatics data integration platforms exist
– Ordered data are poorly served, or not supported at all, by
current data integration platforms
5. Life Science multidomain question
Example: “Which genes encode proteins in different
organisms with high sequence similarity to a protein X and
have some biomedical features in common e.g. up/down
significantly co-expressed in the same biological tissue or
condition Y and involved in the biological function Z?”
Information to answer such queries is available on the Internet,
but no available software system is capable of computing the
answer
The user has to search in
different, often independent,
resources.
6. GPDW Data Warehouse
• Several integrated databanks, including:
• Entrez Gene, Ensembl
• Homologene
• IPI, UniProt/Swiss-Prot
• Gene Ontology, GOA
• BioCyc, KEGG, Reactome
• InterPro, Pfam
• OMIM, eVOC, …
[Figure: on-line databanks (Entrez Gene, IPI, eVOC, BioCyc, KEGG, Reactome, GOA, Homologene, Gene Ontology), automatic updating procedures, database server, Genomic and Proteomic Data Warehouse]
• Numerous integrated data, including:
• 8,085,152 genes of 8,410 organisms
• 31,347,655 proteins of 367,853 species
• 33,252 Gene Ontology terms and 61,899 relations (is a, part of)
• 27,667 biochemical pathways
• 14,163 protein domains; 7,215 OMIM genetic disorders; …
7. Search Computing project at PoliMi
Search Computing (SeCo) aims at:
1. Developing the informatics framework required for
computing multi-topic searches by combining single-topic
search results from search engines, which are often ranked,
with other data and computational resources
• directly supporting multi-topic ordered data
• taking into account order when the results of several
requests are combined
• enabling exploration and expansion of search results
2. Applying SeCo technology in different fields, including Life
Sciences => Bio-SeCo: supporting the answering of complex
bioinformatics queries
8. Bio-SeCo: SeCo technologies to answer
Life Science questions
Life Science example query:
“Which genes encode proteins in different organisms with
high sequence similarity to a protein X and have some
biomedical features in common, e.g. up/down significantly
co-expressed in the same biological tissue or condition Y
and involved in a biological function Z?”
This multi-topic case study question can be decomposed into
the following four single-topic sub-queries, each of which
can be mapped to an available search service.
9. Bio-SeCo: SeCo technologies to answer
Life Science questions
• “Which proteins in different organisms have high sequence
similarity to a protein X ?”
BLAST, a sequence similarity search program, in one
of its many implementations, e.g. WU-BLAST or
NCBI BLAST
• “Which genes encode which proteins ?”
GPDW (Genomic and Proteomic Data Warehouse), a
query service to a database of genomic and proteomic
data (GPDW_protein2gene)
10. Bio-SeCo: SeCo technologies to answer
Life Science questions
• “Which genes are up/down significantly co-expressed in the
same biological condition / tissue Y ?”
ArrayExpress Gene Expression Atlas, a search
engine of gene expression data
• “Which genes are involved in a biological function Z ?”
GPDW (Genomic and Proteomic Data Warehouse), a
query service to a database of genomic and proteomic
data (GPDW_gene2biologicalFunctionFeature)
11. Bio-SeCo: SeCo technologies to answer
Life Science questions
The answer to each part of the question is integrated with the
others, together with all the ranked results found
[Figure: query plan combining BLAST, GPDW_protein2gene, ArrayExpress and GPDW_gene2biologicalFunctionFeature]
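To make the idea of integrating ranked results concrete, here is a minimal sketch in Python. It is not the Bio-SeCo join strategy: the helper `join_ranked`, the product used to combine scores, and all gene names and scores are assumptions made purely for illustration.

```python
def join_ranked(left, right, combine=lambda a, b: a * b):
    """Join two ranked lists of (key, score) pairs on key and re-rank.

    Hypothetical helper: the combining function and the data layout are
    illustrative assumptions, not taken from Bio-SeCo.
    """
    right_scores = dict(right)
    joined = [(key, combine(score, right_scores[key]))
              for key, score in left if key in right_scores]
    # Highest combined score first
    return sorted(joined, key=lambda pair: pair[1], reverse=True)


# Made-up scores in [0, 1] standing in for two ranked service outputs
blast_hits = [("geneA", 0.95), ("geneB", 0.80), ("geneC", 0.60)]
expression_hits = [("geneB", 0.90), ("geneA", 0.50), ("geneD", 0.70)]

combined = join_ranked(blast_hits, expression_hits)
print([gene for gene, _ in combined])
```

Only genes returned by both services survive the join, and their final position depends on both rankings rather than on either one alone.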
12. What I have done for my Minor Research project
13. Semantic network: before
[Figure: semantic network with the nodes Gene, Protein, Gene Expression and Biological Function Feature, and the edges Has, Is_similar_to, Is_encoded_by and Is_involved_in]
14. Semantic network: now
[Figure: semantic network extended with the nodes Genetic Disorder and Pathway and the edges Codes and Is_functional_similar_to, in addition to Gene, Protein, Gene Expression, Biological Function Feature, Has, Is_similar_to, Is_encoded_by and Is_involved_in]
15. Services I added: GPDW exploitation
A Genetic Disorder is an illness caused by abnormalities in genes
or chromosomes, especially a condition that is present from
before birth.
In biochemistry, Metabolic Pathways are series of chemical
reactions occurring within a cell. In each pathway, a principal
chemical is modified by a series of chemical reactions.
[Figure: semantic network with the Genetic Disorder and Pathway nodes]
16. Services I added: GPDW exploitation
Which Genetic Disorders is the Gene X involved in?
GPDW (Genomic and Proteomic Data Warehouse), a query
service to a database of genomic and proteomic data
(GPDW_gene2geneticDisorder)
[Figure: semantic network with the Genetic Disorder and Pathway nodes]
17. Services I added: GPDW exploitation
Which Genetic Disorders is the Protein Y involved in?
GPDW (Genomic and Proteomic Data Warehouse), a query
service to a database of genomic and proteomic data
(GPDW_protein2geneticDisorder)
[Figure: semantic network with the Genetic Disorder and Pathway nodes]
18. Services I added: GPDW exploitation
Which Genes does the Genetic Disorder X involve?
GPDW (Genomic and Proteomic Data Warehouse), a query
service to a database of genomic and proteomic data
(GPDW_geneticDisorder2gene)
[Figure: semantic network with the Genetic Disorder and Pathway nodes]
19. Services I added: GPDW exploitation
Which Proteins does the Genetic Disorder X involve?
GPDW (Genomic and Proteomic Data Warehouse), a query
service to a database of genomic and proteomic data
(GPDW_geneticDisorder2gene)
[Figure: semantic network with the Genetic Disorder and Pathway nodes]
20. Services I added: GPDW exploitation
Same questions and GPDW services for Metabolic Pathways:
• GPDW_gene2pathway
• GPDW_protein2pathway
• GPDW_pathway2gene
• GPDW_pathway2protein
[Figure: semantic network with the Genetic Disorder and Pathway nodes]
21. Services I added: GPDW exploitation
A Biological Function Feature is an item of information about a
gene or a protein. It defines a certain peculiarity of a biomolecular
entity. E.g.: “is involved in lung cancer”
GPDW_protein2biological_function_feature
[Figure: semantic network with the Genetic Disorder and Pathway nodes]
22. Services I added
• These new services (Genetic Disorder and Pathway) are
very useful and important, but they don’t take advantage of
the main novelty provided by the Search Computing
technology: the Integration of ranked results
• There’s no ranking on “being involved” in a Genetic
Disorder or a Pathway…
23. Services I added: Gene Semantic Similarity
• The other service I integrated into Bio-SeCo (SemSim)
computes the semantic similarity between a gene and the
genes in a list:
[Figure: Is_functional_similar_to relation on the Gene node]
• This service provides ranked results (given a gene X, it
returns a list of genes ranked from the most semantically
similar to X to the least semantically similar one)
• SemSim takes advantage of the Search Computing
capability of integrating ranked results
24. Semantic Similarity?!? What does it mean?
• Key point: given gene X and gene Y, how similar
are they?
• Semantically similar genes can be involved in similar
activities, can be involved in similar pathways, and can have
many annotations in common
• To measure this similarity, I chose the Latent Semantic
Indexing method, based on a matrix built from
gene-related annotations
25. Biomolecular annotation
• The concept of annotation: association of nucleotide or amino
acid sequences with useful information describing their features
• This information is expressed through controlled
vocabularies, sometimes structured as ontologies, where
every controlled term of the vocabulary is associated with a
unique alphanumeric code
• The association of such a code with a gene or protein ID
constitutes an annotation
[Figure: a Gene / Protein linked to a Biological function feature by an Annotation (gene2bff)]
26. Biomolecular annotation
• The association of a piece of information (a feature) with a gene or
protein ID constitutes an annotation
• Annotation example:
• gene: GD4
• feature: “is present in the mitochondrial membrane”
[Figure: a Gene / Protein linked to a Biological function feature by an Annotation (gene2bff)]
27. Latent Semantic Indexing:
Singular Value Decomposition – SVD
– Annotation matrix $A \in \{0, 1\}^{m \times n}$
− m rows: genes / proteins
− n columns: annotation terms
A(i,j) = 1 if gene / protein i is annotated to term j or to any
descendant of j in the considered ontology structure (true
path rule)
A(i,j) = 0 otherwise (it is unknown)
term01 term02 term03 term04 … termN
gene01 0 0 0 0 … 0
gene02 0 1 1 0 … 1
… … … … … … …
geneM 0 0 0 0 … 0
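The construction of the annotation matrix with the true path rule can be sketched as follows. This is only an illustrative sketch: the function name `build_annotation_matrix`, the dictionary-based data layout, and the toy ontology and gene identifiers are assumptions, not the code used in the project.

```python
def build_annotation_matrix(genes, terms, direct, parents):
    """Build the binary matrix A, applying the true path rule.

    genes, terms: ordered identifiers (rows and columns of A).
    direct: dict gene -> set of directly annotated terms.
    parents: dict term -> set of parent terms in the ontology (a DAG).
    All names and structures here are hypothetical.
    """
    column = {term: j for j, term in enumerate(terms)}
    matrix = [[0] * len(terms) for _ in genes]
    for i, gene in enumerate(genes):
        # Collect each direct term plus all of its ancestors
        stack = list(direct.get(gene, ()))
        seen = set()
        while stack:
            term = stack.pop()
            if term in seen:
                continue
            seen.add(term)
            stack.extend(parents.get(term, ()))
        for term in seen:
            if term in column:
                matrix[i][column[term]] = 1
    return matrix


# Toy ontology: mitochondrial_membrane is_a membrane is_a root
terms = ["root", "membrane", "mitochondrial_membrane"]
parents = {"membrane": {"root"}, "mitochondrial_membrane": {"membrane"}}
direct = {"gene01": set(), "gene02": {"mitochondrial_membrane"}}

A = build_annotation_matrix(["gene01", "gene02"], terms, direct, parents)
print(A)
```

A gene annotated only to the most specific term ends up with a 1 in the columns of all its ancestor terms as well, which is what the true path rule requires.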
29. Latent Semantic Indexing:
Singular Value Decomposition – SVD
Compute SVD:
$A = U \Sigma V^T$
Compute reduced rank approximation:
$A_k = U_k \Sigma_k V_k^T$
• An annotation prediction is performed by computing a reduced
rank approximation $A_k$ of the annotation matrix $A$
(where $0 < k < r$, with $r$ the number of non-zero singular values
of $A$, i.e. the rank of $A$)
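As a rough illustration of the reduced rank approximation, the sketch below computes the leading singular triplets with power iteration and deflation, using only the Python standard library. This is a teaching sketch under stated assumptions, not the preprocessing software of the project: the function names are hypothetical, the fixed start vector can miss a component orthogonal to it, and a real implementation would rely on a numerical library SVD.

```python
import math


def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def top_singular_triplets(A, k, iters=200):
    """Return up to k (sigma, u, v) triplets of A (hypothetical helper)."""
    columns = [list(col) for col in zip(*A)]
    n = len(columns)
    # Gram matrix G = A^T A; its eigenvectors are the right singular vectors
    G = [[sum(a * b for a, b in zip(ci, cj)) for cj in columns] for ci in columns]
    triplets = []
    for _ in range(k):
        v = [1.0 / math.sqrt(n)] * n
        for _step in range(iters):
            w = matvec(G, v)
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:
                break
            v = [x / norm for x in w]
        lam = sum(x * y for x, y in zip(v, matvec(G, v)))
        if lam < 1e-12:
            break  # no further non-zero singular value was found
        sigma = math.sqrt(lam)
        u = [x / sigma for x in matvec(A, v)]
        triplets.append((sigma, u, v))
        # Deflation: remove the component just found from G
        G = [[G[i][j] - lam * v[i] * v[j] for j in range(n)] for i in range(n)]
    return triplets


def reduced_rank(A, k):
    """A_k as the sum of the k leading terms sigma * u * v^T."""
    m, n = len(A), len(A[0])
    Ak = [[0.0] * n for _ in range(m)]
    for sigma, u, v in top_singular_triplets(A, k):
        for i in range(m):
            for j in range(n):
                Ak[i][j] += sigma * u[i] * v[j]
    return Ak


A = [[1, 1, 0], [1, 1, 0], [0, 0, 1]]  # toy matrix with rank r = 2
A1 = reduced_rank(A, 1)
A2 = reduced_rank(A, 2)
```

With k = 1 the approximation keeps only the dominant block of the toy matrix, while k = 2 (the full rank) reproduces it up to rounding. The entries of a reduced rank approximation are real values rather than 0/1.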
30. Latent Semantic Indexing:
Singular Value Decomposition – SVD
Compute reduced rank approximation:
$A_k = U_k \Sigma_k V_k^T$
• $A$ : genes – features matrix
• $U_k$ : gene vectors matrix
• $\Sigma_k$ : singular value matrix
• $V_k^T$ : feature vectors matrix
31. Latent Semantic Indexing:
Singular Value Decomposition – SVD
• $U_k$ : gene vectors matrix
• $\Sigma_k$ : singular value matrix
• $V_k^T$ : feature vectors matrix
• These matrices can be used to measure the distances
between objects (genes or features) in the k-dimensional space.
• For example, it is possible to compute the distance between
two gene vectors to understand their similarity level. The same
can be done for features.
32. Latent Semantic Indexing:
Singular Value Decomposition – SVD
• $U_k$ : gene vectors matrix
• $\Sigma_k$ : singular value matrix
• $V_k^T$ : feature vectors matrix
• For our implementation of LSI, we chose the cosine
similarity as the measure of the semantic similarity
between genes.
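A minimal sketch of ranking genes by cosine similarity is given below. The gene vectors are made-up two-dimensional examples, and whether the actual implementation takes them from the rows of the gene vectors matrix alone or scales them by the singular values is not stated in the slides, so the vectors are simply assumed as given. The function names are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine of the angle between two vectors (0.0 if either is null)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_by_similarity(query, gene_vectors):
    """Rank every other gene by cosine similarity to the query gene."""
    scores = [(gene, cosine(gene_vectors[query], vector))
              for gene, vector in gene_vectors.items() if gene != query]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Made-up gene vectors in a k = 2 dimensional space
gene_vectors = {
    "geneX": [1.0, 0.0],
    "gene01": [0.9, 0.1],
    "gene02": [0.0, 1.0],
    "gene03": [0.5, 0.5],
}

ranking = rank_by_similarity("geneX", gene_vectors)
print([gene for gene, _ in ranking])
```

Cosine similarity depends only on the direction of the vectors, so genes with many or few annotations can still be compared on the same scale.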
33. Minor Research Project
• A preprocessing program computes the Singular Value
Decomposition (SVD) of the annotation matrix
• It writes the matrices ($U_k$, $\Sigma_k$, $V_k^T$) to three separate files
• These files are placed in the data directory of the SemSim
REST web application
• SemSim (JSP + Java) computes the Latent Semantic Indexing
(LSI) measures and returns the ranked list of genes
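The hand-off between the preprocessing step and the web application can be pictured with the sketch below. The slides do not describe the file format, so the whitespace-separated text layout, the file name `Uk.txt`, and the helpers `write_matrix` and `read_matrix` are all assumptions. Python is used only for brevity; the actual SemSim service is written in JSP and Java.

```python
import os
import tempfile


def write_matrix(path, matrix):
    """Write one matrix row per line, values separated by spaces (assumed format)."""
    with open(path, "w") as handle:
        for row in matrix:
            handle.write(" ".join(repr(float(x)) for x in row) + "\n")


def read_matrix(path):
    """Read a matrix back from the assumed text format."""
    with open(path) as handle:
        return [[float(x) for x in line.split()] for line in handle if line.strip()]


# Made-up gene vectors matrix standing in for the preprocessing output
Uk = [[0.7, 0.1], [0.7, -0.1]]

# A temporary directory plays the role of the application's data directory
with tempfile.TemporaryDirectory() as data_dir:
    path = os.path.join(data_dir, "Uk.txt")
    write_matrix(path, Uk)
    loaded = read_matrix(path)
```

Computing the decomposition offline and loading only the resulting matrices keeps the expensive step out of the request path of the web service.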
34. Minor Research Project
• Developed with REST technology
• Integrated into Bio-SeCo as an external service, through a wrapper
• Input: gene (ID, name, taxonomy)
35. Minor Research Project
• Output: list of genes ranked by their semantic similarity to the
input gene
36. Minor Research Project
• It is now possible to answer many other biological questions.
For example:
Among the proteins encoded by the chicken genes with the highest
functional semantic similarity to gene X, which have the highest
sequence similarity to protein Y?
[Figure: query plan with Input, SemSim, ProteinByGene, Sequence Alignment and Output]
37. Minor Research Project
• It is now possible to answer many other biological questions
that involve Gene Semantic Similarity computation, Genetic
Disorders or Metabolic Pathways. For example:
Among the proteins encoded by the chicken genes with the highest
functional semantic similarity to gene X, which have the highest
sequence similarity to protein Y?
[Figure: query plan with Input, SemSim, ProteinByGene, Sequence Alignment and Output]
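The example question chains three services, which can be sketched as a simple pipeline. Everything below is a stand-in: the three functions are stubs with made-up identifiers and scores, not the real SemSim, ProteinByGene or sequence alignment services, and the order of the steps is inferred from the wording of the question.

```python
def sem_sim(gene, top=2):
    # Stub: genes ranked by semantic similarity to the input gene (made-up scores)
    ranked = {"geneX": [("gene01", 0.99), ("gene03", 0.71), ("gene02", 0.10)]}
    return ranked.get(gene, [])[:top]


def protein_by_gene(gene):
    # Stub: proteins encoded by a gene (made-up mapping)
    table = {"gene01": ["protA"], "gene02": ["protB"], "gene03": ["protC", "protD"]}
    return table.get(gene, [])


def sequence_alignment(protein, candidates):
    # Stub: fixed similarity scores; a real service would align each
    # candidate against `protein`, which this stub does not use
    scores = {"protA": 0.40, "protC": 0.85, "protD": 0.30}
    hits = [(candidate, scores.get(candidate, 0.0)) for candidate in candidates]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


def answer(gene, protein):
    """Pipe the services: similar genes, then their proteins, then alignment."""
    candidates = [p for g, _ in sem_sim(gene) for p in protein_by_gene(g)]
    return sequence_alignment(protein, candidates)


result = answer("geneX", "protY")
print([protein for protein, _ in result])
```

Each step consumes the ranked output of the previous one, so the final list reflects both the semantic similarity cut-off and the sequence similarity ranking.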
DEMO