Doctoral Minor Research Project Defense
19th November 2012

Dipartimento di Elettronica e Informazione

Integration of Bioinformatics Web Services
through the Search Computing Technology

Davide Chicco            davide.chicco@elet.polimi.it

SUMMARY

1. The problem
    • Multidomain questions

2. The proposed solution
    • GPDW Data Warehouse
    • Search Computing
        • Bio-SeCo

3. Developed and added services
    • Exploiting GPDW Data Warehouse
    • Semantic Similarity

4. Conclusions
Data and search service scenario in the Life Sciences

• In the Life Sciences, data are abundant and scattered across
  many heterogeneous sources

     • Many are ranked (or partially ranked) data of various
       types, representing different phenomena, e.g.:
         – physical ordering, e.g. within a genome
         – analytical ordering through algorithmically assigned
           scores, e.g. representing levels of sequence similarity
         – experimentally measured values, such as gene
           expression levels

     • The ordering may represent a range of different notions,
       such as quantity, confidence, or location

Life Science questions and their answering

• Several Life Science questions:
    – are complex
    – require, to be answered, the integration and comprehensive
      evaluation of different data, often distributed and many of
      them ranked

• Answering complex questions requires integrating vertical
  search services to create multi-topic searches
     • where each topic search either refines or augments previous
       search results

• Bioinformatics data integration platforms exist, but ordered data
  are poorly served, or not supported at all, by current platforms

Life Science multidomain question

Example: “Which genes encode proteins in different
organisms with high sequence similarity to a protein X and
have some biomedical features in common, e.g. up/down
significantly co-expressed in the same biological tissue or
condition Y and involved in the biological function Z?”

The information needed to answer such queries is available on the
Internet, but no available software system is capable of computing
the answer.

The user has to search in different, often independent, resources.

GPDW Data Warehouse

• Several integrated databanks, including:
     • Entrez Gene, Ensembl
     • Homologene
     • IPI, UniProt/Swiss-Prot
     • Gene Ontology, GOA
     • BioCyc, KEGG, Reactome
     • InterPro, Pfam
     • OMIM, eVOC, …

[Figure: on-line databanks (Entrez Gene, Homologene, IPI, eVOC, BioCyc,
KEGG, Reactome, GOA, Gene Ontology), automatic updating procedures,
database server, Genomic and Proteomic Data Warehouse]

• Numerous integrated data, including:
     • 8,085,152 genes of 8,410 organisms
     • 31,347,655 proteins of 367,853 species
     • 33,252 Gene Ontology terms and 61,899 relations (is a, part of)
     • 27,667 biochemical pathways
     • 14,163 protein domains; 7,215 OMIM genetic disorders; …

Search Computing project at PoliMi

Search Computing (SeCo) aims at:
 1. Developing the informatics framework required for
    computing multi-topic searches by combining single-topic
    search results from search engines, which are often ranked,
    with other data and computational resources
      • directly supporting multi-topic ordered data
      • taking into account order when the results of several
        requests are combined
      • enabling exploration and expansion of search results
 2. Applying SeCo technology in different fields, including Life
    Sciences => Bio-SeCo: support for answering complex
    bioinformatics queries

Bio-SeCo: SeCo technologies to answer Life Science questions

Life Science example query:

     “Which genes encode proteins in different organisms with
     high sequence similarity to a protein X and have some
     biomedical features in common, e.g. up/down significantly
     co-expressed in the same biological tissue or condition Y
     and involved in a biological function Z?”

This multi-topic case study question can be decomposed into
four single-topic sub-queries, each of which can be mapped
to an available search service.

Bio-SeCo: SeCo technologies to answer Life Science questions

• “Which proteins in different organisms have high sequence
  similarity to a protein X?”
  → BLAST, a sequence similarity search program, in one
    of its many implementations, e.g. WU-BLAST or
    NCBI-BLAST

• “Which genes encode which proteins?”
  → GPDW (Genomic and Proteomic Data Warehouse), a
    query service to a database of genomic and proteomic
    data (GPDW_protein2gene)

• “Which genes are up/down significantly co-expressed in the
  same biological condition / tissue Y?”
  → ArrayExpress Gene Expression Atlas, a search
    engine of gene expression data

• “Which genes are involved in a biological function Z?”
  → GPDW (Genomic and Proteomic Data Warehouse), a
    query service to a database of genomic and proteomic
    data (GPDW_gene2biologicalFunctionFeature)

Bio-SeCo: SeCo technologies to answer Life Science questions

The answer to each sub-query is integrated with the others, combining
all the ranked results found

[Figure: the four services BLAST, GPDW_protein2gene, ArrayExpress and
GPDW_gene2biologicalFunctionFeature]

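As an illustration of what integrating ranked results can look like, here is a minimal Python sketch that joins two ranked lists on a shared gene ID and re-ranks the matches. It is not the SeCo join algorithm, which the slides do not describe: the function name `join_ranked`, the product used as combined score, and all IDs and scores are assumptions made up for this example.

```python
from typing import Dict, List, Tuple

Ranked = List[Tuple[str, float]]  # (gene ID, score), best first


def join_ranked(left: Ranked, right: Ranked) -> Ranked:
    """Keep the IDs present in both lists and rank them by the product of scores.

    The product is an arbitrary choice for this sketch; any monotone
    aggregation function (sum, min, weighted mean) could be used instead.
    """
    right_scores: Dict[str, float] = dict(right)
    joined = [
        (gene_id, score * right_scores[gene_id])
        for gene_id, score in left
        if gene_id in right_scores
    ]
    return sorted(joined, key=lambda pair: pair[1], reverse=True)


# Hypothetical, hand-made results of two sub-queries (scores in [0, 1]).
sequence_similarity_hits = [("geneA", 0.9), ("geneB", 0.8), ("geneC", 0.4)]
expression_hits = [("geneB", 0.9), ("geneC", 0.5), ("geneD", 0.7)]

combined = join_ranked(sequence_similarity_hits, expression_hits)
for gene_id, score in combined:
    print(f"{gene_id}\t{score:.2f}")
```

Only genes returned by both sub-queries survive the join, and their order depends on both rankings, which is the property the unranked GPDW lookups cannot provide on their own.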
What I have done for my Minor Research project




Semantic network: before

[Figure: semantic network with the nodes Gene, Protein, Gene Expression
and Biological Function Feature, and the relations Has, Is_encoded_by,
Is_similar_to, Is_involved_in]

Semantic network: now

[Figure: semantic network with the nodes Gene, Protein, Gene Expression,
Biological Function Feature, Genetic Disorder and Pathway, and the
relations Has, Codes, Is_encoded_by, Is_similar_to,
Is_functional_similar_to, Is_involved_in]

Services I added: GPDW exploitation

A Genetic Disorder is an illness caused by abnormalities in genes
or chromosomes, especially a condition that is present from
before birth.

In biochemistry, Metabolic Pathways are series of chemical
reactions occurring within a cell. In each pathway, a principal
chemical is modified by a series of chemical reactions.

Services I added: GPDW exploitation

• Which Genetic Disorders is the Gene X involved in?
  → GPDW (Genomic and Proteomic Data Warehouse), a query
    service to a database of genomic and proteomic data
    (GPDW_gene2geneticDisorder)

• Which Genetic Disorders is the Protein Y involved in?
  → GPDW (Genomic and Proteomic Data Warehouse), a query
    service to a database of genomic and proteomic data
    (GPDW_protein2geneticDisorder)

• Which Genes does the Genetic Disorder X involve?
  → GPDW (Genomic and Proteomic Data Warehouse), a query
    service to a database of genomic and proteomic data
    (GPDW_geneticDisorder2gene)

• Which Proteins does the Genetic Disorder X involve?
  → GPDW (Genomic and Proteomic Data Warehouse), a query
    service to a database of genomic and proteomic data
    (GPDW_geneticDisorder2protein)

Services I added: GPDW exploitation

Same questions and GPDW services for Metabolic Pathways:

•   GPDW_gene2pathway
•   GPDW_protein2pathway
•   GPDW_pathway2gene
•   GPDW_pathway2protein

Services I added: GPDW exploitation

A Biological Function Feature is an item of information about a
gene or a protein. It describes a specific characteristic of a
biomolecular entity, e.g. “is involved in lung cancer”.

→ GPDW_protein2biological_function_feature

Services I added

• These new services (Genetic Disorder and Pathway) are
  very useful and important, but they do not take advantage of
  the main novelty provided by the Search Computing
  technology: the integration of ranked results

• There is no ranking on “being involved” in a Genetic
  Disorder or a Pathway…

Services I added: Gene Semantic Similarity

• The other service I integrated into Bio-SeCo (SemSim)
  computes the semantic similarity between a gene and the
  genes in a list (relation: Gene Is_functional_similar_to Gene)

• This service provides ranked results: given a gene X, it
  returns a list of genes ranked from the most semantically
  similar to X to the least semantically similar

• SemSim takes advantage of the ability of Search Computing
  to integrate ranked results

Semantic Similarity?!? What does it mean?

• Key point: given gene X and gene Y, how similar
  are they?

• Semantically similar genes can be involved in similar
  activities and in similar pathways, and can have
  many annotations in common

• To measure this similarity, I chose the Latent Semantic
  Indexing method, based on a matrix built from gene-related
  annotations

Biomolecular annotation

• The concept of annotation: association of nucleotide or amino
  acid sequences with useful information describing their features

• This information is expressed through controlled
  vocabularies, sometimes structured as ontologies, where
  every controlled term of the vocabulary is associated with a
  unique alphanumeric code



• The association of such a code with a gene or protein ID
  constitutes an annotation

[Figure: Gene / Protein linked to Biological function feature by an
Annotation (gene2bff)]

Biomolecular annotation

• The association of a piece of information (a feature) with a gene or
  protein ID constitutes an annotation

• Annotation example:
     • gene: GD4
     • feature: “is present in the mitochondrial membrane”

Latent Semantic Indexing:
Singular Value Decomposition – SVD

     – Annotation matrix $A \in \{0, 1\}^{m \times n}$
       − m rows: genes / proteins
       − n columns: annotation terms
       $A(i,j) = 1$ if gene / protein i is annotated to term j or to any
          descendant of j in the considered ontology structure (true
          path rule)
       $A(i,j) = 0$ otherwise (the annotation is unknown)

                 term01   term02   term03   term04   …    termN
       gene01      0        0        0        0      …      0
       gene02      0        1        1        0      …      1
         …         …        …        …        …      …      …
       geneM       0        0        0        0      …      0

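The construction of the annotation matrix, including the true path rule, can be sketched in a few lines of Python with NumPy. The toy ontology, the gene and term names, and the helper `ancestors_and_self` are all hypothetical and only serve the example; the real matrix is built from GPDW annotations.

```python
import numpy as np

# Hypothetical toy ontology: each term maps to its direct parents.
# term03 is a child of term02, which is a child of term01.
parents = {"term01": [], "term02": ["term01"], "term03": ["term02"]}

# Hypothetical direct annotations (gene -> terms it is annotated to).
direct = {"gene01": [], "gene02": ["term03"], "gene03": ["term02"]}


def ancestors_and_self(term, parents):
    """Return the term together with all of its ancestors in the ontology."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen


genes = sorted(direct)
terms = sorted(parents)
column = {term: j for j, term in enumerate(terms)}

# A[i, j] = 1 if gene i is annotated to term j or to a descendant of term j.
A = np.zeros((len(genes), len(terms)), dtype=np.int8)
for i, gene in enumerate(genes):
    for term in direct[gene]:
        for ancestor in ancestors_and_self(term, parents):
            A[i, column[ancestor]] = 1

print(A)
```

Here gene02 is annotated only to term03, yet its row also has a 1 for term02 and term01, because an annotation to a term implies an annotation to all of its ancestors.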
Latent Semantic Indexing:
Singular Value Decomposition – SVD

Compute SVD:

    $A = U \Sigma V^{T}$

Compute reduced rank approximation:

    $A_k = U_k \Sigma_k V_k^{T}$

  • An annotation prediction is performed by computing a reduced
    rank approximation $A_k$ of the annotation matrix $A$
    (where 0 < k < r, with r the number of non-zero singular values
    of $A$, i.e. the rank of $A$)

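A minimal NumPy sketch of the two steps above, the full SVD and the rank-k truncation, is shown here. The 4 x 4 matrix and the choice k = 2 are invented for illustration; the slides do not state which k or which SVD implementation the preprocessing software uses.

```python
import numpy as np

# Hypothetical toy annotation matrix (4 genes x 4 terms); its rank r is 3
# because the last two rows are identical.
A = np.array(
    [
        [1, 1, 0, 0],
        [1, 1, 1, 0],
        [0, 0, 1, 1],
        [0, 0, 1, 1],
    ],
    dtype=float,
)

# Full (thin) SVD: A = U diag(s) Vt, singular values sorted in decreasing order.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Keep only the k largest singular values and the matching vectors (0 < k < r).
k = 2
U_k = U[:, :k]
Sigma_k = np.diag(s[:k])
Vt_k = Vt[:k, :]

# Reduced rank approximation A_k = U_k Sigma_k V_k^T.
A_k = U_k @ Sigma_k @ Vt_k

print(np.round(A_k, 2))
```

The entries of `A_k` are real numbers rather than 0/1 values; turning them into predicted annotations requires a decision rule (for example a threshold), which is outside this sketch.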
Latent Semantic Indexing:
Singular Value Decomposition – SVD

Compute reduced rank approximation:

    $A_k = U_k \Sigma_k V_k^{T}$

 • $A$ : genes – features matrix
 • $U_k$ : gene vectors matrix
 • $\Sigma_k$ : singular value matrix
 • $V_k^{T}$ : feature vectors matrix

 • These matrices can be used to measure the distances
   between objects (genes or features) in the k-dimensional space.

 • For example, it is possible to compute the distance between
   two gene vectors to assess their level of similarity. The same
   can be done for features.

 • For our implementation of LSI, we chose cosine similarity
   as the measure of the semantic similarity between genes.

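The cosine-similarity ranking can be sketched as follows. This is an illustrative Python version, not the SemSim code (which is JSP + Java): the function `rank_by_cosine`, the toy matrix and k = 2 are assumptions, and so is the choice of taking the rows of $U_k$ as gene vectors (scaling them by $\Sigma_k$ first is a common variant).

```python
import numpy as np


def rank_by_cosine(gene_vectors, query_index):
    """Rank all other genes by cosine similarity to the query gene.

    gene_vectors: array with one k-dimensional row per gene.
    Returns a list of (gene index, similarity), most similar first.
    """
    norms = np.linalg.norm(gene_vectors, axis=1)
    norms[norms == 0] = 1.0  # leave all-zero rows as zero vectors
    unit = gene_vectors / norms[:, None]
    similarities = unit @ unit[query_index]
    order = np.argsort(-similarities, kind="stable")
    return [(int(i), float(similarities[i])) for i in order if i != query_index]


# Hypothetical toy annotation matrix (4 genes x 4 terms).
A = np.array(
    [
        [1, 1, 0, 0],
        [1, 1, 1, 0],
        [0, 0, 1, 1],
        [0, 0, 1, 1],
    ],
    dtype=float,
)
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2

# Rows of U_k are taken as the gene vectors (an assumption of this sketch).
ranking = rank_by_cosine(U[:, :k], query_index=0)
for index, similarity in ranking:
    print(f"gene{index + 1:02d}\t{similarity:.3f}")
```

With this toy data the gene that shares two annotations with the query ranks first, and the two genes that share none rank last.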
Minor Research Project

• A preprocessing program computes the Singular Value
  Decomposition (SVD) of the annotation matrix

• It writes the matrices ($U_k$, $\Sigma_k$, $V_k^{T}$) to three separate files

• These files are placed in the data directory of the SemSim
  REST web application

• SemSim (JSP + Java) computes the Latent Semantic Indexing
  (LSI) measures and returns the ranked list of genes

Minor Research Project

• Developed with REST technology

• Integrated into Bio-SeCo as an external service, through a wrapper

• Input: gene (ID, name, taxonomy)

• Output: list of genes ranked by their semantic similarity to the
  input gene

Minor Research Project

 • It is now possible to answer many other biological questions
   that involve Gene Semantic Similarity computation, Genetic
   Disorders or Metabolic Pathways. For example:

     Among the proteins encoded by the chicken genes with the
     highest functional semantic similarity to gene X, which have
     the highest sequence similarity to protein Y?

[Figure: query plan Input → SemSim → ProteinByGene → Sequence
Alignment → Output]

DEMO

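The chain SemSim → ProteinByGene → Sequence Alignment can be mimicked with a short Python sketch. The three services are replaced by in-memory stubs with invented IDs and scores, and `run_plan` and its `top_genes` cut-off are hypothetical names chosen for this example; Bio-SeCo executes such plans through its own engine and service wrappers.

```python
def run_plan(gene_x, protein_y, sem_sim, protein_by_gene, alignment_score, top_genes=2):
    """Chain the three services and rank proteins by sequence similarity.

    sem_sim(gene) -> list of (gene, semantic similarity), best first
    protein_by_gene(gene) -> list of protein IDs encoded by the gene
    alignment_score(protein, query_protein) -> sequence similarity score
    """
    results = []
    # Step 1: keep only the genes most semantically similar to gene X.
    for gene, sem_score in sem_sim(gene_x)[:top_genes]:
        # Step 2: find the proteins encoded by each of those genes.
        for protein in protein_by_gene(gene):
            # Step 3: score each protein against protein Y.
            seq_score = alignment_score(protein, protein_y)
            results.append((protein, gene, sem_score, seq_score))
    # Final ranking: highest sequence similarity first.
    return sorted(results, key=lambda row: row[3], reverse=True)


# Hypothetical stub data standing in for the real services.
SEMSIM = {"geneX": [("g1", 0.9), ("g2", 0.7), ("g3", 0.1)]}
PROTEINS = {"g1": ["p1a", "p1b"], "g2": ["p2"], "g3": ["p3"]}
ALIGNMENT = {"p1a": 0.3, "p1b": 0.8, "p2": 0.6, "p3": 0.95}

answer = run_plan(
    "geneX",
    "proteinY",
    sem_sim=lambda gene: SEMSIM[gene],
    protein_by_gene=lambda gene: PROTEINS[gene],
    alignment_score=lambda protein, query: ALIGNMENT[protein],
)
for protein, gene, sem_score, seq_score in answer:
    print(protein, gene, sem_score, seq_score)
```

Protein p3 has the best alignment score in the stub data but is never reported, because its gene is cut off at the semantic similarity step; this shows how the ranking of one service constrains the results of the next.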
Thanks for your attention




38

Más contenido relacionado

Similar a Integration of Bioinformatics Web Services through the Search Computing Technology

Reference Data Integration: A Strategy for the Future
Reference Data Integration: A Strategy for the FutureReference Data Integration: A Strategy for the Future
Reference Data Integration: A Strategy for the FutureBarry Smith
 
BioDiscovery Solutions for Future
BioDiscovery Solutions for FutureBioDiscovery Solutions for Future
BioDiscovery Solutions for Futurecontactmeasif
 
UniProt-GOA
UniProt-GOAUniProt-GOA
UniProt-GOAEBI
 
Gene Ontology Project
Gene Ontology ProjectGene Ontology Project
Gene Ontology Projectvaibhavdeoda
 
NetBioSIG2013-Talk Robin Haw
NetBioSIG2013-Talk Robin Haw NetBioSIG2013-Talk Robin Haw
NetBioSIG2013-Talk Robin Haw Alexander Pico
 
Bioinformatics: Building the cornerstones of Sequence Homology and its use fo...
Bioinformatics: Building the cornerstones of Sequence Homology and its use fo...Bioinformatics: Building the cornerstones of Sequence Homology and its use fo...
Bioinformatics: Building the cornerstones of Sequence Homology and its use fo...OECD Environment
 
Genomics and Bioinformatics
Genomics and BioinformaticsGenomics and Bioinformatics
Genomics and BioinformaticsAmit Garg
 
What can you learn from molecular modeling?
What can you learn from molecular modeling?What can you learn from molecular modeling?
What can you learn from molecular modeling?digitalbio
 
Towards a Simple, Standards-Compliant, and Generic Phylogenetic Database
Towards a Simple, Standards-Compliant, and Generic Phylogenetic DatabaseTowards a Simple, Standards-Compliant, and Generic Phylogenetic Database
Towards a Simple, Standards-Compliant, and Generic Phylogenetic DatabaseHilmar Lapp
 
Introduction to Biological database ppt(1).pptx
Introduction to Biological database ppt(1).pptxIntroduction to Biological database ppt(1).pptx
Introduction to Biological database ppt(1).pptxRAJESHKUMAR428748
 
1 introduction to_the_ebi_(katrina_pavelin)
1 introduction to_the_ebi_(katrina_pavelin)1 introduction to_the_ebi_(katrina_pavelin)
1 introduction to_the_ebi_(katrina_pavelin)phdcareers
 
Functional annotation
Functional annotationFunctional annotation
Functional annotationRavi Gandham
 
Bioinformatics Introduction and Use of BLAST Tool
Bioinformatics Introduction and Use of BLAST ToolBioinformatics Introduction and Use of BLAST Tool
Bioinformatics Introduction and Use of BLAST ToolJesminBinti
 
Role of Bioinformatics in Plant Pathology.pptx
Role of Bioinformatics in Plant Pathology.pptxRole of Bioinformatics in Plant Pathology.pptx
Role of Bioinformatics in Plant Pathology.pptxHasanRiaz18
 
BITS: Overview of important biological databases beyond sequences
BITS: Overview of important biological databases beyond sequencesBITS: Overview of important biological databases beyond sequences
BITS: Overview of important biological databases beyond sequencesBITS
 
Bioinformatica t9-t10-biocheminformatics
Bioinformatica t9-t10-biocheminformaticsBioinformatica t9-t10-biocheminformatics
Bioinformatica t9-t10-biocheminformaticsProf. Wim Van Criekinge
 
Introduction to bioinformatics
Introduction to bioinformaticsIntroduction to bioinformatics
Introduction to bioinformaticsmaulikchaudhary8
 
Using ontologies to do integrative systems biology
Using ontologies to do integrative systems biologyUsing ontologies to do integrative systems biology
Using ontologies to do integrative systems biologyChris Evelo
 
bioinformatics simple
bioinformatics simple bioinformatics simple
bioinformatics simple nadeem akhter
 

Similar a Integration of Bioinformatics Web Services through the Search Computing Technology (20)

Reference Data Integration: A Strategy for the Future
Reference Data Integration: A Strategy for the FutureReference Data Integration: A Strategy for the Future
Reference Data Integration: A Strategy for the Future
 
BioDiscovery Solutions for Future
BioDiscovery Solutions for FutureBioDiscovery Solutions for Future
BioDiscovery Solutions for Future
 
UniProt-GOA
UniProt-GOAUniProt-GOA
UniProt-GOA
 
Gene Ontology Project
Gene Ontology ProjectGene Ontology Project
Gene Ontology Project
 
NetBioSIG2013-Talk Robin Haw
NetBioSIG2013-Talk Robin Haw NetBioSIG2013-Talk Robin Haw
NetBioSIG2013-Talk Robin Haw
 
Bioinformatics: Building the cornerstones of Sequence Homology and its use fo...
Bioinformatics: Building the cornerstones of Sequence Homology and its use fo...Bioinformatics: Building the cornerstones of Sequence Homology and its use fo...
Bioinformatics: Building the cornerstones of Sequence Homology and its use fo...
 
Genomics and Bioinformatics
Genomics and BioinformaticsGenomics and Bioinformatics
Genomics and Bioinformatics
 
What can you learn from molecular modeling?
What can you learn from molecular modeling?What can you learn from molecular modeling?
What can you learn from molecular modeling?
 
Towards a Simple, Standards-Compliant, and Generic Phylogenetic Database
Towards a Simple, Standards-Compliant, and Generic Phylogenetic DatabaseTowards a Simple, Standards-Compliant, and Generic Phylogenetic Database
Towards a Simple, Standards-Compliant, and Generic Phylogenetic Database
 
Introduction to Biological database ppt(1).pptx
Introduction to Biological database ppt(1).pptxIntroduction to Biological database ppt(1).pptx
Introduction to Biological database ppt(1).pptx
 
1 introduction to_the_ebi_(katrina_pavelin)
1 introduction to_the_ebi_(katrina_pavelin)1 introduction to_the_ebi_(katrina_pavelin)
1 introduction to_the_ebi_(katrina_pavelin)
 
Functional annotation
Functional annotationFunctional annotation
Functional annotation
 
Bioinformatics Introduction and Use of BLAST Tool
Bioinformatics Introduction and Use of BLAST ToolBioinformatics Introduction and Use of BLAST Tool
Bioinformatics Introduction and Use of BLAST Tool
 
bioinformatics enabling knowledge generation from agricultural omics data
bioinformatics enabling knowledge generation from agricultural omics databioinformatics enabling knowledge generation from agricultural omics data
bioinformatics enabling knowledge generation from agricultural omics data
 
Role of Bioinformatics in Plant Pathology.pptx
Role of Bioinformatics in Plant Pathology.pptxRole of Bioinformatics in Plant Pathology.pptx
Role of Bioinformatics in Plant Pathology.pptx
 
BITS: Overview of important biological databases beyond sequences
BITS: Overview of important biological databases beyond sequencesBITS: Overview of important biological databases beyond sequences
BITS: Overview of important biological databases beyond sequences
 
Bioinformatica t9-t10-biocheminformatics
Bioinformatica t9-t10-biocheminformaticsBioinformatica t9-t10-biocheminformatics
Bioinformatica t9-t10-biocheminformatics
 
Introduction to bioinformatics
Introduction to bioinformaticsIntroduction to bioinformatics
Introduction to bioinformatics
 
Using ontologies to do integrative systems biology
Using ontologies to do integrative systems biologyUsing ontologies to do integrative systems biology
Using ontologies to do integrative systems biology
 
bioinformatics simple
bioinformatics simple bioinformatics simple
bioinformatics simple
 

Integration of Bioinformatics Web Services through the Search Computing Technology

  • 1. Doctoral Minor Research Project Defense 19th November 2012 Dipartimento di Elettronica e Informazione Integration of Bioinformatics Web Services through the Search Computing Technology Davide Chicco davide.chicco@elet.polimi.it
  • 2. SUMMARY 1. The problem • Multidomain questions 2. The proposed solution • GPDW Data Warehouse • Search Computing • Bio-SeCo 3. Developed and added services • Exploiting GPDW Data Warehouse • Semantic Similarity 4. Conclusions
  • 3. Data and search service scenario in the Life Sciences • In the Life Sciences: numerous data, sparsely distributed in many heterogeneous sources • Many are ranked data (or partially ranked) of various types, representing different phenomena, e.g.: – physical ordering, e.g. within a genome – analytical order through algorithmically assigned scores, e.g. representing levels of sequence similarity – experimentally measured values, such as gene expression levels • The ordering may represent a range of different notions, such as quantity, confidence, or location 3
  • 4. Life Science questions and their answering – Several Life Science questions: - are complex - to be answered require integration and comprehensive evaluation of different data – often distributed, many of which ranked • Answering complex questions requires integration of vertical search services to create multi-topic searches • where the different topic searches either refine or augment previous search results • Bioinformatics data integration platforms exist – Ordered data are poorly served or no supported at all by current data integration platforms 4
  • 5. Life Science multidomain question Example: “Which genes encode proteins in different organisms with high sequence similarity to a protein X and have some biomedical features in common e.g. up/down significantly co-expressed in the same biological tissue or condition Y and involved in the biological function Z?” Information to answer such queries is available on the Internet, but no available software system is capable of computing the answer The user should search in different resources, often indipendent. 5
  • 6. GPDW Data Warehouse • Several integrated databanks, including: • Entrez Gene, Ensembl • Homologene • IPI, UniProt/Swiss-Prot • Gene Ontology, GOA • BioCyc, KEGG, Reactome • InterPro, Pfam • OMIM, eVOC, … • Numerous integrated data, including: • 8,085,152 genes of 8,410 organisms • 31,347,655 proteins of 367,853 species • 33,252 Gene Ontology terms and 61,899 relations (is a, part of) • 27,667 biochemical pathways • 14,163 protein domains; 7,215 OMIM genetic disorders; … [Diagram labels: on-line databanks (Entrez Gene, Homologene, IPI, eVOC, BioCyc, KEGG, Reactome, GOA, Gene Ontology); automatic updating procedures; database server; Genomic and Proteomic Data Warehouse]
  • 7. Search Computing project at PoliMi Search Computing (SeCo) aims at: 1. Developing the informatics framework required for computing multi-topic searches by combining single-topic search results from search engines, which are often ranked, with other data and computational resources • directly supporting multi-topic ordered data • taking order into account when the results of several requests are combined • enabling exploration and expansion of search results 2. Applying SeCo technology in different fields, including the Life Sciences => Bio-SeCo: support for answering complex bioinformatics queries
  • 8. Bio-SeCo: SeCo technologies to answer Life Science questions Life Science example query: “Which genes encode proteins in different organisms with high sequence similarity to a protein X and have some biomedical features in common, e.g. up/down significantly co-expressed in the same biological tissue or condition Y and involved in a biological function Z?” This multi-topic case study question can be decomposed into four single-topic sub-queries, each of which can be mapped to an available search service.
  • 9. Bio-SeCo: SeCo technologies to answer Life Science questions • “Which proteins in different organisms have high sequence similarity to a protein X?” → BLAST, a sequence similarity search program, in one of its many implementations, e.g. WU-BLAST or NCBI-BLAST • “Which genes encode which proteins?” → GPDW (Genomic and Proteomic Data Warehouse), a query service to a database of genomic and proteomic data (GPDW_protein2gene)
  • 10. Bio-SeCo: SeCo technologies to answer Life Science questions • “Which genes are up/down significantly co-expressed in the same biological condition / tissue Y?” → ArrayExpress Gene Expression Atlas, a search engine for gene expression data • “Which genes are involved in a biological function Z?” → GPDW (Genomic and Proteomic Data Warehouse), a query service to a database of genomic and proteomic data (GPDW_gene2biologicalFunctionFeature)
  • 11. Bio-SeCo: SeCo technologies to answer Life Science questions The answer to each part of the question is integrated with the others, combining all the ranked results found. [Diagram: services GPDW_protein2gene, BLAST, ArrayExpress, GPDW_gene2biologicalFunctionFeature]
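
The integration of ranked partial answers can be pictured with a small sketch. It assumes two services that each return a score in [0, 1] per gene ID, and it combines them with a product of scores; the function name `join_ranked`, the gene IDs, and the scoring rule are illustrative assumptions, not the join strategy actually used by SeCo.

```python
from typing import Dict, List, Tuple


def join_ranked(left: Dict[str, float], right: Dict[str, float]) -> List[Tuple[str, float]]:
    """Join two scored result sets on their shared keys and re-rank them.

    The combined score is the product of the two scores. This is an
    illustrative choice only; the slides do not state the actual rule.
    """
    shared = left.keys() & right.keys()
    combined = [(key, left[key] * right[key]) for key in shared]
    # Highest combined score first; ties are broken by key for a stable order.
    return sorted(combined, key=lambda item: (-item[1], item[0]))


# Hypothetical scores from two services, keyed by gene ID.
sequence_similarity = {"geneA": 0.9, "geneB": 0.8, "geneC": 0.4}
co_expression = {"geneB": 0.5, "geneC": 0.75, "geneD": 0.7}

ranking = join_ranked(sequence_similarity, co_expression)
for gene, score in ranking:
    print(gene, round(score, 3))
```

Only genes returned by both services survive the join, and their order depends on both rankings, which is the property that distinguishes this kind of integration from a plain set intersection.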
  • 12. What I have done for my Minor Research project
  • 13. Semantic network: before [Diagram: entities Gene, Protein, Gene Expression, and Biological Function Feature, linked by the relations Has, Is_similar_to, Is_encoded_by, and Is_involved_in]
  • 14. Semantic network: now [Diagram: the previous network extended with the entities Genetic Disorder and Pathway, the relations Codes and Is_functional_similar_to, and additional Is_involved_in links]
  • 15. Services I added: GPDW exploitation • A Genetic Disorder is an illness caused by abnormalities in genes or chromosomes, especially a condition that is present from before birth. • In biochemistry, Metabolic Pathways are series of chemical reactions occurring within a cell. In each pathway, a principal chemical is modified by a series of chemical reactions.
  • 16. Services I added: GPDW exploitation • In which Genetic Disorders is the Gene X involved? → GPDW (Genomic and Proteomic Data Warehouse), a query service to a database of genomic and proteomic data (GPDW_gene2geneticDisorder)
  • 17. Services I added: GPDW exploitation • In which Genetic Disorders is the Protein Y involved? → GPDW (Genomic and Proteomic Data Warehouse), a query service to a database of genomic and proteomic data (GPDW_protein2geneticDisorder)
  • 18. Services I added: GPDW exploitation • Which Genes does the Genetic Disorder X involve? → GPDW (Genomic and Proteomic Data Warehouse), a query service to a database of genomic and proteomic data (GPDW_geneticDisorder2gene)
  • 19. Services I added: GPDW exploitation • Which Proteins does the Genetic Disorder X involve? → GPDW (Genomic and Proteomic Data Warehouse), a query service to a database of genomic and proteomic data (GPDW_geneticDisorder2gene)
  • 20. Services I added: GPDW exploitation Same questions and GPDW services for Metabolic Pathways: • GPDW_gene2pathway • GPDW_protein2pathway • GPDW_pathway2gene • GPDW_pathway2protein
  • 21. Services I added: GPDW exploitation • A Biological Function Feature is an item of information about a gene or a protein. It defines a certain peculiarity of a biomolecular entity, e.g. “is involved in lung cancer” • Service: GPDW_protein2biological_function_feature
  • 22. Services I added • These new services (Genetic Disorder and Pathway) are very useful and important, but they don’t take advantage of the main novelty provided by the Search Computing technology: the integration of ranked results • There’s no ranking on “being involved” in a Genetic Disorder or a Pathway…
  • 23. Services I added: Gene Semantic Similarity • The other service (SemSim) I integrated into Bio-SeCo computes the semantic similarity between a gene and the genes in a list (relation: Gene Is_functional_similar_to Gene) • This service provides ranked results: given a gene X, it returns a list of genes ranked from the most semantically similar to X to the least semantically similar • SemSim takes advantage of the Search Computing capability of integrating ranked results
  • 24. Semantic Similarity?!? What does it mean? • Key point: given gene X and gene Y, how similar are they? • Semantically similar genes can be involved in similar activities, can be involved in similar pathways, and can have many annotations in common • To measure this similarity, I chose the Latent Semantic Indexing method, based on a matrix built from gene-related annotations
  • 25. Biomolecular annotation • The concept of annotation: association of nucleotide or amino acid sequences with useful information describing their features • This information is expressed through controlled vocabularies, sometimes structured as ontologies, where every controlled term of the vocabulary is associated with a unique alphanumeric code • The association of such a code with a gene or protein ID constitutes an annotation [Diagram: Gene / Protein linked to Biological function feature by an Annotation (gene2bff)]
  • 26. Biomolecular annotation • The association of a piece of information (a feature) with a gene or protein ID constitutes an annotation • Annotation example: • gene: GD4 • feature: “is present in the mitochondrial membrane” [Diagram: Gene / Protein linked to Biological function feature by an Annotation (gene2bff)]
  • 27. Latent Semantic Indexing: Singular Value Decomposition (SVD) – Annotation matrix $A \in \{0, 1\}^{m \times n}$ − m rows: genes / proteins − n columns: annotation terms – $A(i,j) = 1$ if gene / protein i is annotated to term j or to any descendant of j in the considered ontology structure (true path rule) – $A(i,j) = 0$ otherwise (it is unknown)

             term01  term02  term03  term04  …  termN
    gene01     0       0       0       0     …    0
    gene02     0       1       1       0     …    1
    …          …       …       …       …     …    …
    geneM      0       0       0       0     …    0
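
The construction of the annotation matrix with the true path rule can be sketched as follows. This is a minimal illustration under stated assumptions: the ontology is given as a child-to-parents map, and the names `ancestors`, `annotation_matrix`, and the toy genes and terms are hypothetical, not taken from the project code.

```python
from typing import Dict, List, Set


def ancestors(term: str, parents: Dict[str, Set[str]]) -> Set[str]:
    """Return every ancestor of `term` in the ontology graph."""
    found: Set[str] = set()
    stack = list(parents.get(term, ()))
    while stack:
        current = stack.pop()
        if current not in found:
            found.add(current)
            stack.extend(parents.get(current, ()))
    return found


def annotation_matrix(
    direct: Dict[str, Set[str]],
    parents: Dict[str, Set[str]],
    genes: List[str],
    terms: List[str],
) -> List[List[int]]:
    """Build A with A[i][j] = 1 if gene i is annotated to term j or to a descendant of j."""
    column = {term: j for j, term in enumerate(terms)}
    matrix = [[0] * len(terms) for _ in genes]
    for i, gene in enumerate(genes):
        for term in direct.get(gene, ()):
            # True path rule: an annotation to a term also holds for its ancestors.
            for implied in {term} | ancestors(term, parents):
                if implied in column:
                    matrix[i][column[implied]] = 1
    return matrix


# Toy ontology: t3 is a child of t2, which is a child of t1; t4 is unrelated.
toy_parents = {"t3": {"t2"}, "t2": {"t1"}}
toy_direct = {"gene01": {"t3"}, "gene02": {"t2"}, "gene03": set()}
A = annotation_matrix(toy_direct, toy_parents, ["gene01", "gene02", "gene03"], ["t1", "t2", "t3", "t4"])
for row in A:
    print(row)
```

A zero entry means only that no annotation is known, as the slide notes, so the matrix should not be read as asserting that the gene lacks the feature.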
  • 29. Latent Semantic Indexing: Singular Value Decomposition (SVD) • Compute the SVD: $A = U \Sigma V^T$ • Compute the reduced rank approximation: $A_k = U_k \Sigma_k V_k^T$ • An annotation prediction is performed by computing a reduced rank approximation $A_k$ of the annotation matrix $A$ (where $0 < k < r$, with $r$ the number of non-zero singular values of $A$, i.e. the rank of $A$)
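
The reduced rank approximation can be tried on a toy matrix. The sketch below assumes NumPy is available and uses `numpy.linalg.svd`; the helper name `truncated_svd`, the matrix values, and the choice k = 2 are illustrative assumptions and not the project's preprocessing software.

```python
import numpy as np


def truncated_svd(A: np.ndarray, k: int):
    """Return U_k, Sigma_k, V_k^T: the factors for the k largest singular values of A."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k], np.diag(s[:k]), Vt[:k, :]


# Toy annotation matrix: 4 genes (rows) by 4 terms (columns); its rank r is 3.
A = np.array(
    [
        [1, 1, 1, 0],
        [1, 1, 0, 0],
        [0, 0, 0, 0],
        [1, 0, 0, 1],
    ],
    dtype=float,
)

U_k, S_k, Vt_k = truncated_svd(A, k=2)  # 0 < k < r
A_k = U_k @ S_k @ Vt_k                  # reduced rank approximation of A

print(np.round(A_k, 2))
```

The entries of `A_k` are real-valued rather than binary, which is what makes them usable as scores for candidate annotations.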
  • 30. Latent Semantic Indexing: Singular Value Decomposition (SVD) • Compute the reduced rank approximation: $A_k = U_k \Sigma_k V_k^T$ • $A$: genes-features matrix • $U_k$: gene vectors matrix • $\Sigma_k$: singular value matrix • $V_k^T$: feature vectors matrix
  • 31. Latent Semantic Indexing: Singular Value Decomposition (SVD) • $U_k$: gene vectors matrix • $\Sigma_k$: singular value matrix • $V_k^T$: feature vectors matrix • These matrices can be used to measure the distances between objects (genes or features) in the k-dimensional space. • For example, it is possible to compute the distance between two gene vectors to understand their level of similarity. The same can be done for features.
  • 32. Latent Semantic Indexing: Singular Value Decomposition (SVD) • $U_k$: gene vectors matrix • $\Sigma_k$: singular value matrix • $V_k^T$: feature vectors matrix • For our implementation of LSI, we chose the cosine similarity as the measure of the semantic similarity between genes.
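
Ranking genes by cosine similarity can be sketched as below. The sketch assumes each gene is represented by a k-dimensional vector (for instance its row of the gene vectors matrix; whether the rows are scaled by the singular values is not stated in the slides). The vectors, gene names, and function names are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between u and v; defined as 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_by_similarity(query: str, vectors: Dict[str, Sequence[float]]) -> List[Tuple[str, float]]:
    """Rank every other gene from the most to the least similar to `query`."""
    scores = [
        (gene, cosine_similarity(vectors[query], vector))
        for gene, vector in vectors.items()
        if gene != query
    ]
    return sorted(scores, key=lambda item: (-item[1], item[0]))


# Hypothetical 2-dimensional gene vectors.
gene_vectors = {
    "geneX": [1.0, 0.0],
    "geneA": [2.0, 0.0],
    "geneB": [1.0, 1.0],
    "geneC": [0.0, 3.0],
}

ranking = rank_by_similarity("geneX", gene_vectors)
for gene, score in ranking:
    print(gene, round(score, 3))
```

Cosine similarity ignores vector length, so `geneA` is ranked as identical in direction to `geneX` even though its vector is twice as long.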
  • 33. Minor Research Project • A preprocessing program computes the Singular Value Decomposition (SVD) • It writes the matrices ($U_k$, $\Sigma_k$, $V_k^T$) to three separate files • These files are placed in the data directory of the SemSim REST web application • SemSim (JSP + Java) computes the Latent Semantic Indexing (LSI) measures and returns the ranked list of genes
  • 34. Minor Research Project • Developed with REST technology • Integrated into Bio-SeCo as an external service, through a wrapper • Input: gene (ID, name, taxonomy)
  • 35. Minor Research Project • Output: list of genes ranked by their semantic similarity to the input gene
  • 36. Minor Research Project • It is now possible to answer many other biological questions. For example: Among the proteins encoded by the genes that, in the Chicken organism, have the highest functional semantic similarity to gene X, which are those with the highest sequence similarity to protein Y? [Diagram: Input, SemSim, ProteinByGene, Sequence Alignment, Output]
  • 37. Minor Research Project • It is now possible to answer many other biological questions that involve Gene Semantic Similarity computation, Genetic Disorders, or Metabolic Pathways. For example: Among the proteins encoded by the genes that, in the Chicken organism, have the highest functional semantic similarity to gene X, which are those with the highest sequence similarity to protein Y? [Diagram: Input, SemSim, ProteinByGene, Sequence Alignment, Output] DEMO
  • 38. Thanks for your attention