1. Data Manipulation: Molecular Online
and Server Tools & BioExtract Server
Theme: FXN Gene and Pancreatic Cancer.
Etienne Z. Gnimpieba
BRIN WS 2013
Mount Marty College – June 24th 2013
Etienne.gnimpieba@usd.edu
2. Data Manipulation Molecular Online Tools: BioExtract Server
Review: Databases
Metabolic:
• Sabio-RK (check with Brent)
• KEGG (check with Brent)
• HMDB (hmdb.ca, contact for API)
• SMPDB (http://www.smpdb.ca)
• BioModels
• drugDB
• Brenda (check with Brent)
• [Mathi's project]
Protein:
• ExPASy DB collection (UniProt, …)
• PDB
• SBKB
• STRING
Genomic:
• G.E.O.
• GenBank
• GO
• EBI Array Express & Gene Atlas
Phenomic:
• PhenomicDB
• Phenoscape
3. Data Manipulation Molecular Online Tools: BioExtract Server
Review: Databases
Active Network Extraction & Analysis
Reactome Functional Interaction network
→ Extract mutated, overexpressed, underexpressed, expanded/deleted genes; add linker genes
→ Disease subnetwork
→ Apply community clustering algorithms
→ Disease "modules"
→ Disease gene prediction, sample classification, hypothesis generation
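The extraction steps above can be sketched on a toy network. This is a minimal illustration, not the Reactome FI implementation: the gene names (G1, L1, X1, ...) are placeholders rather than real interactions, the rule that a linker must touch at least two altered genes is an assumption, and connected components stand in for a real community clustering algorithm.

```python
from collections import defaultdict

def build_graph(edges):
    """Build an undirected adjacency map from (gene, gene) pairs."""
    graph = defaultdict(set)
    for a, b in edges:
        graph[a].add(b)
        graph[b].add(a)
    return graph

def add_linkers(graph, altered_genes):
    """Return altered genes plus linkers (assumed rule: >= 2 altered neighbours)."""
    altered = set(altered_genes)
    linkers = {gene for gene, neighbours in graph.items()
               if gene not in altered and len(neighbours & altered) >= 2}
    return altered | linkers

def modules(graph, nodes):
    """Split the subnetwork into connected components, largest first.

    Connected components are a simple stand-in for community clustering.
    """
    remaining, found = set(nodes), []
    while remaining:
        seed = remaining.pop()
        component, stack = {seed}, [seed]
        while stack:
            # Only follow edges that stay inside the disease subnetwork.
            for neighbour in graph[stack.pop()] & remaining:
                remaining.discard(neighbour)
                component.add(neighbour)
                stack.append(neighbour)
        found.append(component)
    return sorted(found, key=len, reverse=True)

# Toy interaction network with placeholder gene names.
edges = [("G1", "L1"), ("G2", "L1"), ("G3", "G4"), ("G5", "X1")]
altered = ["G1", "G2", "G3", "G4", "G5"]

graph = build_graph(edges)
subnetwork = add_linkers(graph, altered)
for module in modules(graph, subnetwork):
    print(sorted(module))
```

Here L1 is kept as a linker because it connects two altered genes, while X1 is dropped because it touches only one.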
4. Data Manipulation Molecular Online Tools: BioExtract Server
Review: Databases
Figure: Pancreatic Cancer Module Map (43 Cases). Credit: Christina Yung / Bioinformatics.ca
Modules shown:
• p53, SMAD, TGFβ, TNF signaling
• KRAS, MAPK signaling
• Heterotrimeric G-protein signaling
• Rho GTPase signaling
• Transcription & translation
• Cell cycle
• Wnt & Cadherin signaling
• Hedgehog signaling
• Transcription
• Zinc fingers
• Ca2+ signaling
Non-silent mutations:
• blue – in primary tumour only
• green – in xenograft only
• red – in primary & xenograft
5. Data Manipulation Molecular Online Tools: BioExtract Server
Molecular Biology Databases
Bibliographic:
• MEDLINE
• PubMed
• EMBASE
• BIOSIS
• CAB International
• AGRICOLA
Taxonomic:
• NEWT
• The Tree of Life
• Species 2000
• IOPI
• ITIS
Nucleotide:
• INSDC
• EMBL
• DDBJ
• NCBI
• GenBank
Genomic:
• SPGP
• AceDB
• HIV-SD
• Ensembl
• WormBase
• FlyBase
• MGD
• SGD
• EBI (Genomes Server, Karyn's Genomes)
• RGD
Protein:
• GOA
• ENZYME
• InterPro
• PDB
• Integr8
• MEROPS
• LIGAND
• EMP
• DCHGR
• PROSITE
• PRINTS
• Pfam
• BLOCKS
• SBASE
• UniProt/Swiss-Prot
• PIR
Metabolic pathway:
• KEGG
• EcoCyc
• BRENDA
• ENZYME
• BioModels
• REACTOME
Review: Databases
6. Sequence type: accession number format
• DNA sequence from GenBank, EMBL or DDBJ: 1 letter + 5 digits (U43752) or 2 letters + 6 digits (AF462052)
• GenPept sequence from GenBank, EMBL or DDBJ: 3 letters + 5 digits (AAF46449)
• Protein sequence from Swiss-Prot: 1 letter + 5 digits or letters (Q16595)
• Protein sequence from the Protein Research Foundation: 6 or 7 digits + 1 letter (2808353A)
• RefSeq sequence: 2 letters + _ + 6 or more digits (mRNA: NM_******, protein: NP_******)
• Protein sequence from the Protein Data Bank (PDB): 1 digit + 3 letters or digits (2EFF)
• Protein sequence from the Molecular Modeling DataBase (MMDB): MMDB ID + more than 4 digits (MMDB ID 767744)
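The accession formats above can be turned into a first-pass classifier. The sketch below is illustrative only: the nucleotide, GenPept, PRF, RefSeq and PDB patterns follow the slide, the UniProtKB pattern is the commonly documented accession regular expression, and the function name `classify` is hypothetical. Because several formats overlap (a letter followed by five digits can be a nucleotide or a Swiss-Prot accession), the function returns every candidate rather than a single answer.

```python
import re

# Patterns based on the formats listed on the slide. Real accession
# formats have more variants, so treat a match as a hint, not a proof.
ACCESSION_PATTERNS = [
    ("RefSeq mRNA", re.compile(r"NM_\d{6,}")),
    ("RefSeq protein", re.compile(r"NP_\d{6,}")),
    ("GenBank/EMBL/DDBJ nucleotide", re.compile(r"[A-Z]\d{5}|[A-Z]{2}\d{6}")),
    ("GenPept protein", re.compile(r"[A-Z]{3}\d{5}")),
    ("UniProtKB/Swiss-Prot protein", re.compile(
        r"[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2}")),
    ("PRF protein", re.compile(r"\d{6,7}[A-Z]")),
    ("PDB structure", re.compile(r"\d[A-Z0-9]{3}")),
]

def classify(accession):
    """Return every sequence type whose pattern matches the whole accession."""
    text = accession.strip().upper()
    return [name for name, pattern in ACCESSION_PATTERNS
            if pattern.fullmatch(text)]

for example in ["U43752", "AF462052", "AAF46449", "Q16595", "2808353A", "2EFF"]:
    print(example, classify(example))
```

Q16595 is reported as both a possible nucleotide and a Swiss-Prot accession, which shows why the database name should travel with the identifier.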
Review: data format
Data Manipulation Molecular Online Tools and BioExtract Server
FASTA header formats: >gi|XXXX|XXX and >sp|XXXX|XXX
Fields labeled on the slide (for both formats): Gene Info number, accession number, species reference
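A pipe-delimited header of this kind can be split programmatically. The sketch below is a minimal parser under simple assumptions: the identifier is everything up to the first space, fields are separated by `|`, and the function name `parse_fasta_header` is hypothetical. The GI number in the second example is made up for illustration.

```python
def parse_fasta_header(header):
    """Split a header such as '>sp|Q16595|FRDA_HUMAN description' into parts."""
    if not header.startswith(">"):
        raise ValueError("a FASTA header line must start with '>'")
    # Everything before the first space is the identifier block.
    identifier, _, description = header[1:].partition(" ")
    fields = identifier.split("|")
    return {
        "database": fields[0],                  # e.g. 'gi' or 'sp'
        "fields": [f for f in fields[1:] if f], # drop empty trailing fields
        "description": description.strip(),
    }

swissprot = parse_fasta_header(">sp|Q16595|FRDA_HUMAN Frataxin, mitochondrial")
genbank = parse_fasta_header(">gi|12345|gb|U43752.1| illustrative record")
print(swissprot)
print(genbank)
```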
7. Data Manipulation Molecular Online Tools: BioExtract Server
Biological sequences and data can be analyzed in many ways with bioinformatics tools.
They can be read, assembled, compared, mapped, predicted, designed, modeled…
1. Nucleotide and protein sequence searching (blastall; from the FASTA suite, SSEARCH for local and GLSEARCH for global)
2. Multiple sequence alignment (ClustalW2, MView, …)
3. Pairwise sequence alignment (Needle for global, LALIGN for local)
4. Protein functional analysis (SMART, Phobius, InterProScan)
5. Functional genomic tools (R-tools, SAIL, EFOtools)
6. Molecular structure analysis (PDBeFold, QuaternaryStructure, …)
7. Scientific literature text mining (EBIMed, Whatizit)
8. Sequence translation (Transeq, readseq, Backtranseq, …)
9. Data retrieval and ID mapping (dbfetch, ENA/SRA, SRS, PICR)
10. Protein structure prediction tools
11. …
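For the data retrieval item in the list above, a request is usually just a URL with a database name, an identifier, and a format. The sketch below only builds such URLs and does not contact any server. The endpoints and parameter names are written as I recall them from the public NCBI E-utilities and EBI dbfetch documentation, so they should be checked against the current documentation before use; the function names are hypothetical.

```python
from urllib.parse import urlencode

# Assumed endpoints; verify against current NCBI and EBI documentation.
EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"
DBFETCH = "https://www.ebi.ac.uk/Tools/dbfetch/dbfetch"

def efetch_fasta_url(accession, db="nuccore"):
    """Build an NCBI efetch URL asking for a record in FASTA text format."""
    params = {"db": db, "id": accession, "rettype": "fasta", "retmode": "text"}
    return EFETCH + "?" + urlencode(params)

def dbfetch_fasta_url(accession, db="uniprotkb"):
    """Build an EBI dbfetch URL asking for a record in raw FASTA format."""
    params = {"db": db, "id": accession, "format": "fasta", "style": "raw"}
    return DBFETCH + "?" + urlencode(params)

print(efetch_fasta_url("U43752"))
print(dbfetch_fasta_url("Q16595"))
```

The printed URLs can be pasted into a browser or passed to `urllib.request.urlopen` once the parameters have been confirmed.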
Review: Online Programs & Algorithms
8. Data Manipulation Molecular Online Tools: BioExtract Server
Review: Databases
Boolean operators and symbols
AND = term1 AND term2 must both exist in the searched documents
OR = at least one of term1 OR term2 must exist
NOT = term1 must not be present in any of the displayed documents
ALL = term1 must be present in all of the displayed documents
+ term1 = document must contain term1
- term1 = document must not contain term1
XXX* = any characters are accepted after XXX
XX?X = any character is accepted in place of the ? (for example, instead of Y in XXYX)
Examples:
FXN [AND] gene [NOT] Frataxin: all data related to the FXN gene except those concerning the Frataxin protein
ataxia + apraxia + gene: all genes related to ataxia and apraxia
Ada* [AUTH]: all authors whose names begin with Ada
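The behaviour of these operators can be sketched with a small in-memory filter. This is not how any particular database implements its search: the function names are hypothetical, the documents are toy strings, and Python's `fnmatch` is used only because its `*` and `?` wildcards behave like the symbols described above.

```python
from fnmatch import fnmatchcase

def has_term(words, pattern):
    """True if any word matches the pattern ('*' and '?' act as wildcards)."""
    return any(fnmatchcase(word, pattern.lower()) for word in words)

def matches(document, must=(), must_not=(), any_of=()):
    """Apply AND (must), NOT (must_not) and OR (any_of) to one document."""
    words = [w.strip(".,;:()").lower() for w in document.split()]
    if not all(has_term(words, term) for term in must):
        return False
    if any(has_term(words, term) for term in must_not):
        return False
    if any_of and not any(has_term(words, term) for term in any_of):
        return False
    return True

docs = [
    "FXN gene maps to chromosome 9",
    "Frataxin is encoded by the FXN gene",
    "ataxia and apraxia gene panel",
]

# FXN AND gene NOT Frataxin
hits = [d for d in docs if matches(d, must=["FXN", "gene"], must_not=["Frataxin"])]
print(hits)
```

Only the first toy document survives the filter: the second mentions Frataxin and the third lacks FXN.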
9. Data Manipulation Molecular Online Tools: BioExtract Server
Review: Databases
Similarity search tools
BLAST (Basic Local Alignment Search Tool): compares a protein or DNA sequence to other sequences
FASTA (FAST-All): fast protein or nucleotide comparison
10. Global match: align all residues of one sequence with all residues of the other sequence
Local match: find a region in one sequence that matches a region of the other
Motif match: find matches of a short sequence in one or more regions internal to another, longer sequence. The match can be a perfect match, or it can include mismatches, insertions, or deletions
Multiple alignment: a mutual alignment of many sequences
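The difference between a global and a local match can be sketched with a small dynamic-programming scorer. The global branch follows the Needleman-Wunsch recurrence and the local branch follows Smith-Waterman; the scoring values (match +1, mismatch -1, gap -2) and the function name are arbitrary choices for illustration, not the defaults of Needle, LALIGN, or BLAST.

```python
def alignment_score(a, b, local=False, match=1, mismatch=-1, gap=-2):
    """Return the best global score, or the best local score if local=True."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    if not local:
        # Global alignment pays for leading gaps; local alignment does not.
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(diagonal, score[i - 1][j] + gap, score[i][j - 1] + gap)
            if local:
                cell = max(cell, 0)  # a local alignment may restart anywhere
            score[i][j] = cell
            best = max(best, cell)
    # Global: score of aligning both sequences end to end.
    # Local: best-scoring region found anywhere in the table.
    return best if local else score[-1][-1]

long_seq = "TTTGATTACATTT"
other = "CCGATTACACC"
print("global:", alignment_score(long_seq, other))
print("local: ", alignment_score(long_seq, other, local=True))
```

The two toy sequences share only the core GATTACA, so the local score rewards that region while the global score is pulled down by the unmatched flanks.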
Review: Sequence Analysis
Data Manipulation Molecular Online Tools and BioExtract Server
11. Review: Sequence Analysis
Data Manipulation Molecular Online Tools and BioExtract Server
Sequence alignment: assignment of residue-residue correspondence
• Determine phylogenetic relationships by analyzing similarity and homology
- Similarity: observation or measurement of resemblance and difference
- Homology: the sequences and the organisms in which they occur are descended from a common ancestor. Homology must be inferred from the observation of similarity
• Determine if a protein (or a gene) is related to a larger group of proteins
• Verify if a mutated residue is conserved within species
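Similarity is something that can be measured directly on an alignment, which is what makes it usable as evidence for homology. The sketch below computes percent identity between two already aligned sequences and checks whether one alignment column is conserved. The sequences are toy strings, the function names are hypothetical, and the identity denominator (columns without gaps) is one of several conventions in use.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent of identical residues over columns where neither row has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    pairs = [(x, y) for x, y in zip(aligned_a, aligned_b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return 100.0 * sum(x == y for x, y in pairs) / len(pairs)

def is_conserved(alignment, column):
    """True if every sequence carries the same residue (no gap) in the column."""
    residues = {sequence[column] for sequence in alignment}
    return len(residues) == 1 and "-" not in residues

print(percent_identity("ACGT-A", "ACCTGA"))
toy_alignment = ["MKV", "MRV", "MKV"]
print(is_conserved(toy_alignment, 0), is_conserved(toy_alignment, 1))
```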
12. Context
0. Specification & Aims
Statement of problem / Case study: The FXN gene provides instructions for making a protein called frataxin. This protein is found in cells throughout the body, with the highest levels in the heart, spinal cord, liver, pancreas, and the muscles used for voluntary movement (skeletal muscles). Within cells, frataxin is found in energy-producing structures called mitochondria. Although its function is not fully understood, frataxin appears to help assemble clusters of iron and sulfur molecules that are critical for the function of many proteins, including those needed for energy production. Mutations in the FXN gene cause Friedreich ataxia. Friedreich ataxia is a genetic condition that affects the nervous system and causes movement problems. Most people with Friedreich ataxia begin to experience the signs and symptoms of the disorder around puberty.
Molecular Online Tools and Server
Keywords:
Bio: FXN, Frataxin, pancreatic cancer, CDKN4
Math: HMM
Informatics: programming, bioinformatics tools, getting and exporting data
Reduced expression of frataxin is the cause of Friedreich's ataxia (FRDA), a lethal neurodegenerative disease. How about liver cancer?
Aim: The purpose of this lab is to introduce online tools for exploring large-scale human biological data (metabolic, protein, genomic, …). We apply them to the FXN gene and pancreatic cancer, to show how a researcher can identify and cross-reference the biological knowledge available in data banks.
Acquired skills
Online and server tools:
- Query biological DBs (FASTA, HTML, txt, figure formats)
- Sequence tools (protein and gene): alignment (showalign, ClustalW2), similarity, …
- Manage data results (select, keep, map, export)
- Build and reuse workflows
Biological Hypothesis
Figure panels: FXN on chromosome 9; frataxin molecule structure (PyMOL); pancreas anatomy; pancreatic cancer (relationship: ?)
Biological DB
Tools
Resolution Process
T1. Metabolomics
Objective: Use metabolic data repositories to understand the frataxin protein mechanism.
T1.1. Finding the Enzyme and Pathway related to Frataxin using KEGG
T1.2. Finding the Reaction involved with Frataxin using Reactome
T1.3. Using BRENDA for enzyme data on Frataxin
T1.4. Using Collected data for Analysis
T1.5. Redo the process with Pancreatic Cancer
T2. Genome exploration
Objective: Use Ensembl to locate FXN on the human genome and identify the genes implicated in pancreatic cancer.
T2.1. Locate a given gene on the human genome
T2.2. Get a genomic sequence from NCBI
T2.3. Get the protein data and sequence from EBI
T2.4. Save the exported sequence data in the data folder
T3. Sequence manipulation
Objective: Find similar sequences using BLAST tools and align the given sequences.
T3.1. Find similar sequences using the BLAST tool
T3.2. Align the generated sequences with the ClustalW tool
T3.3. Visualize the result as a phylogenetic tree in Jalview
T4. Protein Data and Structural Biology Knowledge
Objective: Study frataxin at the protein level and its connection with pancreatic cancer (functional and structural data).
T4.1. Structural Knowledge on Frataxin using SBKB
T4.2. Using UniProt for Frataxin Protein Study
T4.3. Protein-Protein Interaction using STRING
T4.4. Using the same method for Pancreatic Cancer and compare
T5. BioExtract server
Objective: Use server tools to optimize the data manipulation process, applied on the BioExtract server.
T5.1. Server Initialization
T5.2. Pancreatic cancer & Frataxin (FXN)
T5.3. Mapping, Alignment
T5.4. Workflow save & reuse
Results
Editor's Notes
Welcome to this bioinformatics lab on data manipulation using online and server tools. As the theme, we have chosen to study the interaction between frataxin and pancreatic cancer.
Molecular online tools rest on biological databases and consist of many software programs (each program implements a data-manipulation algorithm or process). Databases cover bibliography (since 1960), taxonomy, nucleotide, genomic, protein, microarray, metabolic pathway, sequence, RNA, organism data, … Software generally takes the name of the coded algorithm (next slide).
Bibliographic DB
MEDLINE is accessible through EBI's SRS. PubMed is accessible through NCBI's Entrez. EMBASE is a commercial product for medical literature. BIOSIS, the inheritor of the old Biological Abstracts, covers a broad biological field; Zoological Record indexes the zoological literature. CAB International maintains abstract databases in the fields of agriculture and parasitic diseases. AGRICOLA is for the agricultural field what MEDLINE is for the medical field. The bibliographic databases, with the exception of MEDLINE/PubMed, are only available through commercial database vendors.
Taxonomy DB
• NEWT
• The Tree of Life project
• Species 2000
• IOPI: International Organization for Plant Information
• ITIS: Integrated Taxonomic Information System
Nucleotide DB
In Europe, the vast majority of the nucleotide sequence data produced is collected, organized, and distributed by the EMBL Nucleotide Sequence Database located at the EBI in Cambridge, UK, an outstation of the European Molecular Biology Laboratory (EMBL), which is based in Heidelberg, Germany. The nucleotide sequence databases are data repositories, accepting nucleic acid sequence data from the scientific community and making it freely available.
The databases strive for completeness, with the aim of recording every publicly known nucleic acid sequence. These databases are heterogeneous: they vary with respect to the source of the material (e.g. genomic versus cDNA), the intended quality (e.g. finished versus single-pass sequences), the extent of sequence annotation, and the intended completeness of the sequence relative to its biological target (e.g. complete versus partial coverage of a gene or a genome). The nucleotide databases are distributed free of charge over the internet. DDBJ, GenBank and EMBL-Bank exchange new and updated data on a daily basis to achieve optimal synchronization. The result is that they contain exactly the same information, except for sequences that have been added in the last 24 hours.
Genomic DB
Genomic databases vary greatly in form and content. For organisms of major interest to geneticists, there is a long history of conventionally published catalogues of genes or mutations. In the past few years, most of these have been made available in an electronic form and a variety of new databases have been developed. These databases vary greatly in the classes of data captured and how this data is stored.
• Genomes Server - this server gives access to hundreds of complete genome sequences, including those from archaea, bacteria, eukaryota, organelles, phages, plasmids, viroids, and viruses.
• Proteome Analysis - the Proteome Analysis database has been set up to provide comprehensive statistical and comparative analyses of the predicted proteomes of fully sequenced organisms.
• Ensembl - this is a joint project between the EBI and the Wellcome Trust Sanger Institute that aims at developing a system that maintains automatic annotation of large eukaryotic genomes. Ensembl presents up-to-date sequence data and the best possible automatic annotation for metazoan genomes. Available now are human, mouse, rat, fugu, zebrafish, mosquito, Drosophila, C. elegans, and C. briggsae.
• Karyn's Genomes - contains general information about organisms whose genomes are completely sequenced. The main aim of the database is to provide a short and concise explanation as to why it is important to obtain these organisms' genomic sequences.
• WormBase - this is a repository of mapping, sequencing, and phenotypic information for C. elegans (and some other nematodes).
• FlyBase - the database for Drosophila melanogaster is one of the best-curated genetic databases.
• MGD - the 'Mouse Genome Database' is one of the most comprehensively curated genetic databases.
• RGD - the 'Rat Genome Database' curates and integrates rat genetic and genomic data and provides access to this data to support research using the rat as a genetic model for the study of human disease.
• MIPS - the MIPS yeast database is an important resource for information on the yeast genome and its products.
• SGD - the 'Saccharomyces Genome Database' is another major yeast database.
• SPGP - the 'S. pombe Genome Project' based at the Sanger Institute is the database for genetic data on the fungus Schizosaccharomyces pombe.
• AceDB - this is the database for genetic and molecular data concerning Caenorhabditis elegans. The database management system written for AceDB by R. Durbin and J. Thierry-Mieg has proved very popular and has been used in many other species-specific databases. AceDB is now the name of this database management system, resulting in some confusion relative to the C. elegans database. The entire database can be downloaded from the Sanger Institute.
• HIV-SD - the 'HIV Sequence Database' collects, curates and annotates HIV and SIV sequence data and provides various tools for analyzing this data.
Protein DB
The protein databases are the most comprehensive source of information on proteins.
It is necessary to distinguish between universal databases covering proteins from all species and specialized data collections storing information about specific families or groups of proteins, or about the proteins of a specific organism. Two categories of universal protein databases can be discerned: simple archives of sequence data, and annotated databases where additional information has been added to the sequence record. In the following you will find a short description of the:
• Primary protein sequence databases such as UniProtKB/Swiss-Prot
• Specialised protein sequence databases such as GOA (Gene Ontology Annotation)
• Specialised protein databases such as ENZYME
• Secondary protein databases such as InterPro
• Structure databases such as PDB
Integr8 - The Integr8 web portal provides easy access to integrated information about deciphered genomes and their corresponding proteomes.
ENZYME - this database is an annotated extension of the Enzyme Commission's publication, linked to UniProtKB/Swiss-Prot. There are also databases of enzyme properties such as BRENDA, the Ligand Chemical Database for Enzyme Reactions (LIGAND), and the database of 'Enzymes and Metabolic Pathways' (EMP). LIGAND and EMP are searchable via SRS at the EBI. LIGAND is linked to the metabolic pathways in KEGG.
Two-dimensional gel electrophoresis data - a database is available from ExPASy and the Danish Centre for Human Genome Research (DCHGR).
Mass spectrometry protein data - a useful resource, which includes protein cleavage products, is maintained at Rockefeller University.
Examples of secondary protein databases include:
PROSITE - The special value of this database is the extensive documentation on many protein families, as defined by sequence domains or motifs. PROSITE contains biologically significant sites and patterns formulated in such a way that, with appropriate computational tools, it can rapidly and reliably identify to which family of proteins a new sequence belongs.
The profile structure used in PROSITE is similar to but slightly more general than the one introduced by Gribskov and co-workers (Gribskov et al., 1987). Generalised profiles are remarkably similar to the specific type of Hidden Markov Models (HMMs) used in Pfam.
PRINTS - A different approach to pattern recognition, termed "fingerprinting", is used by this database. Within a sequence alignment, it is usual to find not one, but several motifs that characterize the aligned family. Diagnostically, it makes sense to use many, or all, of the conserved regions to build a family signature. In a database search, there is then a greater chance of identifying a distant relative, whether or not all parts of the signature are matched. The ability to tolerate mismatches, both at the level of residues within individual motifs and at the level of motifs within the fingerprint as a whole, renders fingerprinting a powerful diagnostic technique.
Pfam - Another important secondary protein database is Pfam. The methodology used by Pfam to create protein family or domain signatures is Hidden Markov Models (HMMs). HMMs are closely related to profiles, but are based on probability theory methods. These allow a direct statistical approach to identifying and scoring matches, and also to combining information from a multiple alignment with prior knowledge. One feature that distinguishes HMMs and profiles from regular expressions and fingerprints is that the former allow the full extent of a domain to be identified in a sequence. They are thus particularly useful when analyzing multi-domain proteins. The biggest drawback of Pfam is its lack of biological information (annotation) of the protein families.
BLOCKS - Blocks are multiply aligned un-gapped segments corresponding to the most highly conserved regions of proteins. The blocks for the Blocks Database are made automatically by looking for the most highly conserved regions in groups of proteins documented in InterPro.
SBASE - This is a protein domain library sequences database that contains annotated structural, functional, ligand-binding and topogenic segments of proteins, cross-referenced to all major sequence databases and sequence pattern collections.
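The notes above contrast PROSITE-style patterns, which behave like regular expressions, with profiles and HMMs. As a minimal sketch of the pattern side only, the function below translates the common elements of PROSITE pattern syntax into a Python regular expression. The function name is hypothetical, only a subset of the syntax is handled, the example pattern is written in the style of the C2H2 zinc finger signature and should be treated as illustrative, and the test sequence is synthetic.

```python
import re

def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression.

    Handled elements: x (any residue), [ABC] (one of), {ABC} (none of),
    repetition such as x(3) or x(2,4), and the < and > terminal anchors.
    """
    parts = []
    for element in pattern.rstrip(".").split("-"):
        at_start = element.startswith("<")
        at_end = element.endswith(">")
        element = element.strip("<>")
        core, low, high = re.fullmatch(
            r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element).groups()
        if core == "x":
            token = "."                        # any residue
        elif core.startswith("{"):
            token = "[^" + core[1:-1] + "]"    # any residue except these
        else:
            token = core                       # single residue or [ABC] class
        if low:
            token += "{" + low + ("," + high if high else "") + "}"
        parts.append(("^" if at_start else "") + token + ("$" if at_end else ""))
    return "".join(parts)

zinc_finger_like = "C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H"
regex = prosite_to_regex(zinc_finger_like)
synthetic = "AACAACAAALAAAAAAAAHAAAHAA"
hit = re.search(regex, synthetic)
print(regex)
print(hit.start(), hit.group())
```

A regular expression like this reports only that a short motif is present, which is the limitation the notes point to when they say profiles and HMMs can identify the full extent of a domain.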
Molecular online tools are reposed on biological databases and consist of many software programs (program implementing data manipulating algorithm or process)Databases contain bibliography (since 1960), taxonomy, nucleotide, genomic, protein, microarray, metabolic pathway concerning, sequence, RNA, organism,…Software generally takes the name of the coded algorithm (next slide)Molecular online tools are reposed on biological databases and consist of many software programs (program implementing data manipulating algorithm or process)Databases contain bibliography (since 1960), taxonomy, nucleotide, genomic, protein, microarray, metabolic pathway concerning, sequence, RNA, organism,…Bibliographic DBMEDLINE is accessible through EBI's SRS. PUBMED is accessible through NCBI's ENTREZ.EMBASE is a commercial product formedical literature. BIOSIS, the inheritor of the old Biological Abstracts, covers a broad biological field; Zoological Record indexes and zoological literature. CAB International maintains abstract databases in the fields of agriculture and parasitic diseases. AGRICOLA is for the agricultural field what MEDLINE is for the medical field . The bibliographical databases, with the exception of MEDLINE/PUBMED, are only available through commercial database vendors. Taxonomy DB NEWT The Tree of Life project Species 2000 IOPI: International Organization for Plant Information ITIS: Integrated Taxonomic Information SystemNucleotide DBIn Europe, the vast majority of the nucleotide sequence data produced is collected, organized, and distributed by the EMBL Nucleotide Sequence Database located at the EBI in Cambridge UK. An Outstation of the European Molecular Biology Laboratory (EMBL) is located in Heidelberg, Germany. The nucleotide sequence databases are data repositories, accepting nucleic acid sequence data from the scientific community and making it freely available. 
The databases strive for completeness, with the aim of recording every publicly known nucleic acid sequence. These databases are heterogenous, they vary with respect to the source of the material (e.g. genomic versus cDNA), the intended quality (e.g. finished versus single pass sequences), the extent of sequence annotation, and the intended completeness of the sequence relative to its biological target (e.g. complete versus partial coverage of a gene or a genome). The nucleotide databases are distributed free of charge over the internet.DDBJ, GenBank and EMBL-Bank exchange new and updated data on a daily basis to achieve optimal synchronization. The result is that they contain exactly the same information, except for sequences that have been added in the last 24 hours. - Genomic DBGenomic databases vary greatly in form and contentFor organisms of major interest to geneticists, there is a long history of conventionally published catalogues of genes or mutations. In the past few years, most of these have been made available in an electronic form and a variety of new databases have been developed. These databases vary greatly in the classes of data captured and how this data is stored. Genomes Server - this server gives access to a hundreds of complete genome sequences, including those from archaea, bacteria, eukaryota, organelles, phages, plasmids, viroids, and viruses. Proteome Analysis - the Proteome Analysis database has been set up to provide comprehensive statistical and comparative analyses of the predicted proteomes of fully sequenced organisms. Ensembl - this is a joint project between the EBI and the Wellcome Trust Sanger Institute that aims at developing a system that maintains automatic annotation of large eukaryotic genomes. Ensembl presents up-to-date sequence data and the best possible automatic annotation for metazoan genomes. Available now are human, mouse, rat, fugu, zebrafish, mosquito, Drosophila, C. elegans, and C. briggsae. 
Karyn's Genomes - contains general information about organisms whose genomes are completely sequenced. The main aim of the database is to provide a short and concise explanation as to why it is important to obtain these organisms genomic sequences. WormBase - this is a repository of mapping, sequencing, and phenotypic information for C. elegans (and some other nematodes). FlyBase - the database for Drosophila melanogaster is one of the best-curated genetic databases.MGD - the 'Mouse Genome Database' is one of the most comprehensively curated genetic databases.RGD - the 'Rat Genome Database' curates and integrates rat genetic and genomic data and provides access to this data to support research using the rat as a genetic model for the study of human disease.The MIPS yeast database is an important resource for information on the yeast genome and its products. SGD - the 'Saccharomyces Genome Database' is another major yeast database.SPGP - the 'S. Pombe Genome Project' based at the Sanger Institute is the database for genetic data on the fungus Schizosaccharomycespombe.AceDB - this is the database for genetic and molecular data concerning Caenorhabditiselegans. The database management system written for AceDB by R. Durbin and J. Thierry-Mieg has proved very popular and has been used in many other species-specific databases. AceDB is now the name of this database management system, resulting in some confusion relative to the C. elegans database. The entire database can be downloaded from the Sanger Institute.HIV-SD - the 'HIV Sequence Database' collects, curates and annotates HIV and SIV sequence data and provides various tools for analyzing this data.- Protein DBThe protein databases are the most comprehensive source of information on proteins. 
It is necessary to distinguish between universal databases covering proteins from all species and specialized data collections storing information about specific families or groups of proteins, or about the proteins of a specific organism. Two categories of universal protein databases can be discerned: simple archives of sequence data; and annotated databases where additional information has been added to the sequence record. In the following you will find a short description of the: Primary protein sequence databases such as UniProtKB/Swiss-ProtSpecialised protein sequence databases such as GOA Gene Ontology AnnotationSpecialised protein databases such as ENZYMESecondary protein databases such as InterProStructure databases such as PDBIntegr8 - The Integr8 web portal provides easy access to integrated information about deciphered genomes and their corresponding proteomes. ENZYME - this database is an annotated extension of the Enzyme Commission's publication, linked to UniProtKB/Swiss-Prot. There are also databases of enzyme properties such as BRENDA, Ligand Chemical Database for Enzyme Reactions such as LIGAND, and the database of 'Enzymes and Metabolic Pathways' (EMP). LIGAND and EMP are searchable via SRS at the EBI. LIGAND is linked to the metabolic pathways in KEGG.2 – dimensional gel electrophoresis data - a database is available from Expasy and the Danish Centre for Human Genome Research (DCHGR). Mass spectrometry protein data - a useful resource which includes protein cleavage products, is maintained at Rockefeller University. - Examples of secondary protein databases include: PROSITE - The special value of this database is the extensive documentation on many protein families, as defined by sequence domains or motifs. PROSITE contains biologically significant sites and patterns formulated in such a way that with appropriate computational tools it can rapidly and reliably identify to which family of proteins the new sequence belongs. 
The profile structure used in PROSITE is similar to but slightly more general than the one introduced by Gribskov and co-workers (Gribskov et al.,1987). Generalised profiles are remarkably similar to the specific type of Hidden Markov Models (HMMs) used in Pfam. PRINTS - A different approach to pattern recognition, termed "fingerprinting" is used by this database. Within a sequence alignment, it is usual to find not one, but several motifs that characterize the aligned family. Diagnostically, it makes sense to use many, or all, of the conserved regions to build a family signature. In a database search, there is then a greater chance of identifying a distant relative, whether or not all parts of the signature are matched. The ability to tolerate mismatches, both at the level of residues within individual motifs, and at the level of motifs within the fingerprint as a whole, renders fingerprinting a powerful diagnostic technique.Pfam - Another important secondary protein database is Pfam. The methodology used by Pfam to create protein family or domain signatures is Hidden Markov Models (HMMs). HMMs are closely related to profiles, but are based on probability theory methods. These allow a direct statistical approach to identifying and scoring matches, and also to combining information from a multiple alignment with prior knowledge. One feature that distinguishes HMMs and profiles from regular expressions and fingerprints is that the formers allow the full extent of a domain to be identified in a sequence. They are thus particularly useful when analyzing multi-domain proteins. The biggest drawback of Pfam is its lack of biological information (annotation) of the protein families.BLOCKS - Blocks are multiply aligned un-gapped segments corresponding to the most highly conserved regions of proteins. 
The blocks for the Blocks Database are made automatically by looking for the most highly conserved regions in groups of proteins documented in InterPro.SBASE - This is a protein domain library sequences database that contains annotated structural, functional, ligand-binding and topogenic segments of proteins, cross-referenced to all major sequence databases and sequence pattern collections.Software generally takes the name of the coded algorithm (next slide)
Molecular online tools are reposed on biological databases and consist of many software programs (program implementing data manipulating algorithm or process)Databases contain bibliography (since 1960), taxonomy, nucleotide, genomic, protein, microarray, metabolic pathway concerning, sequence, RNA, organism,…Software generally takes the name of the coded algorithm (next slide)Molecular online tools are reposed on biological databases and consist of many software programs (program implementing data manipulating algorithm or process)Databases contain bibliography (since 1960), taxonomy, nucleotide, genomic, protein, microarray, metabolic pathway concerning, sequence, RNA, organism,…Bibliographic DBMEDLINE is accessible through EBI's SRS. PUBMED is accessible through NCBI's ENTREZ.EMBASE is a commercial product formedical literature. BIOSIS, the inheritor of the old Biological Abstracts, covers a broad biological field; Zoological Record indexes and zoological literature. CAB International maintains abstract databases in the fields of agriculture and parasitic diseases. AGRICOLA is for the agricultural field what MEDLINE is for the medical field . The bibliographical databases, with the exception of MEDLINE/PUBMED, are only available through commercial database vendors. Taxonomy DB NEWT The Tree of Life project Species 2000 IOPI: International Organization for Plant Information ITIS: Integrated Taxonomic Information SystemNucleotide DBIn Europe, the vast majority of the nucleotide sequence data produced is collected, organized, and distributed by the EMBL Nucleotide Sequence Database located at the EBI in Cambridge UK. An Outstation of the European Molecular Biology Laboratory (EMBL) is located in Heidelberg, Germany. The nucleotide sequence databases are data repositories, accepting nucleic acid sequence data from the scientific community and making it freely available. 
The databases strive for completeness, with the aim of recording every publicly known nucleic acid sequence. These databases are heterogeneous: they vary with respect to the source of the material (e.g. genomic versus cDNA), the intended quality (e.g. finished versus single-pass sequences), the extent of sequence annotation, and the intended completeness of the sequence relative to its biological target (e.g. complete versus partial coverage of a gene or a genome). The nucleotide databases are distributed free of charge over the internet.
DDBJ, GenBank and EMBL-Bank exchange new and updated data on a daily basis to achieve optimal synchronization. As a result, they contain exactly the same information, except for sequences that have been added in the last 24 hours.
Genomic DB
Genomic databases vary greatly in form and content. For organisms of major interest to geneticists, there is a long history of conventionally published catalogues of genes or mutations. In the past few years, most of these have been made available in electronic form and a variety of new databases have been developed. These databases vary greatly in the classes of data captured and how this data is stored.
Genomes Server - this server gives access to hundreds of complete genome sequences, including those from archaea, bacteria, eukaryota, organelles, phages, plasmids, viroids, and viruses.
Proteome Analysis - the Proteome Analysis database has been set up to provide comprehensive statistical and comparative analyses of the predicted proteomes of fully sequenced organisms.
Ensembl - this is a joint project between the EBI and the Wellcome Trust Sanger Institute that aims at developing a system that maintains automatic annotation of large eukaryotic genomes. Ensembl presents up-to-date sequence data and the best possible automatic annotation for metazoan genomes. Available now are human, mouse, rat, fugu, zebrafish, mosquito, Drosophila, C. elegans, and C. briggsae.
Karyn's Genomes - contains general information about organisms whose genomes are completely sequenced. The main aim of the database is to provide a short and concise explanation of why it is important to obtain these organisms' genomic sequences.
WormBase - this is a repository of mapping, sequencing, and phenotypic information for C. elegans (and some other nematodes).
FlyBase - the database for Drosophila melanogaster is one of the best-curated genetic databases.
MGD - the 'Mouse Genome Database' is one of the most comprehensively curated genetic databases.
RGD - the 'Rat Genome Database' curates and integrates rat genetic and genomic data and provides access to this data to support research using the rat as a genetic model for the study of human disease.
MIPS - the MIPS yeast database is an important resource for information on the yeast genome and its products.
SGD - the 'Saccharomyces Genome Database' is another major yeast database.
SPGP - the 'S. pombe Genome Project', based at the Sanger Institute, is the database for genetic data on the fungus Schizosaccharomyces pombe.
AceDB - this is the database for genetic and molecular data concerning Caenorhabditis elegans. The database management system written for AceDB by R. Durbin and J. Thierry-Mieg has proved very popular and has been used in many other species-specific databases. AceDB is now the name of this database management system, resulting in some confusion relative to the C. elegans database. The entire database can be downloaded from the Sanger Institute.
HIV-SD - the 'HIV Sequence Database' collects, curates and annotates HIV and SIV sequence data and provides various tools for analyzing this data.
Protein DB
The protein databases are the most comprehensive source of information on proteins.
It is necessary to distinguish between universal databases covering proteins from all species and specialized data collections storing information about specific families or groups of proteins, or about the proteins of a specific organism. Two categories of universal protein databases can be discerned: simple archives of sequence data, and annotated databases where additional information has been added to the sequence record. In the following you will find a short description of:
• Primary protein sequence databases such as UniProtKB/Swiss-Prot
• Specialised protein sequence databases such as GOA (Gene Ontology Annotation)
• Specialised protein databases such as ENZYME
• Secondary protein databases such as InterPro
• Structure databases such as PDB
Integr8 - the Integr8 web portal provides easy access to integrated information about deciphered genomes and their corresponding proteomes.
ENZYME - this database is an annotated extension of the Enzyme Commission's publication, linked to UniProtKB/Swiss-Prot. There are also databases of enzyme properties such as BRENDA, the Ligand Chemical Database for Enzyme Reactions (LIGAND), and the database of 'Enzymes and Metabolic Pathways' (EMP). LIGAND and EMP are searchable via SRS at the EBI. LIGAND is linked to the metabolic pathways in KEGG.
2-dimensional gel electrophoresis data - a database is available from ExPASy and the Danish Centre for Human Genome Research (DCHGR).
Mass spectrometry protein data - a useful resource, which includes protein cleavage products, is maintained at Rockefeller University.
Examples of secondary protein databases include:
PROSITE - the special value of this database is the extensive documentation on many protein families, as defined by sequence domains or motifs. PROSITE contains biologically significant sites and patterns formulated so that, with appropriate computational tools, one can rapidly and reliably identify the family of proteins to which a new sequence belongs.
The profile structure used in PROSITE is similar to but slightly more general than the one introduced by Gribskov and co-workers (Gribskov et al., 1987). Generalised profiles are remarkably similar to the specific type of Hidden Markov Models (HMMs) used in Pfam.
PRINTS - a different approach to pattern recognition, termed "fingerprinting", is used by this database. Within a sequence alignment, it is usual to find not one, but several motifs that characterize the aligned family. Diagnostically, it makes sense to use many, or all, of the conserved regions to build a family signature. In a database search, there is then a greater chance of identifying a distant relative, whether or not all parts of the signature are matched. The ability to tolerate mismatches, both at the level of residues within individual motifs and at the level of motifs within the fingerprint as a whole, renders fingerprinting a powerful diagnostic technique.
Pfam - another important secondary protein database is Pfam. The methodology used by Pfam to create protein family or domain signatures is Hidden Markov Models (HMMs). HMMs are closely related to profiles, but are based on probability theory methods. These allow a direct statistical approach to identifying and scoring matches, and also to combining information from a multiple alignment with prior knowledge. One feature that distinguishes HMMs and profiles from regular expressions and fingerprints is that the former allow the full extent of a domain to be identified in a sequence. They are thus particularly useful when analyzing multi-domain proteins. The biggest drawback of Pfam is its lack of biological information (annotation) of the protein families.
BLOCKS - blocks are multiply aligned un-gapped segments corresponding to the most highly conserved regions of proteins.
The blocks for the Blocks Database are made automatically by looking for the most highly conserved regions in groups of proteins documented in InterPro.
SBASE - this is a library of protein domain sequences that contains annotated structural, functional, ligand-binding and topogenic segments of proteins, cross-referenced to all major sequence databases and sequence pattern collections.
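The notes above contrast pattern-based signatures (regular expressions, as in PROSITE patterns) with profiles and HMMs. As a rough illustration of the pattern-based approach only, the sketch below translates a PROSITE-style pattern into a Python regular expression and scans a sequence with it. The function names and the test sequence are hypothetical and chosen for this example; the pattern shown is the well-known N-glycosylation site pattern, and the translation covers only the basic syntax elements (x, [..], {..}, repetition counts, and the < and > anchors).

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into a Python regular expression.

    Handles only the basic syntax: 'x' (any residue), [ABC] (allowed),
    {ABC} (forbidden), (n) or (n,m) repetition, and '<' / '>' anchors.
    """
    parts = []
    for element in pattern.rstrip(".").split("-"):
        prefix = suffix = ""
        if element.startswith("<"):      # pattern anchored at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):        # pattern anchored at the C-terminus
            suffix, element = "$", element[:-1]
        repeat = ""
        if element.endswith(")") and "(" in element:
            element, count = element[:-1].split("(")
            repeat = "{" + count + "}"
        if element == "x":
            core = "."
        elif element.startswith("{") and element.endswith("}"):
            core = "[^" + element[1:-1] + "]"
        else:
            core = element               # single residue or [ABC] class
        parts.append(prefix + core + repeat + suffix)
    return "".join(parts)


def find_sites(pattern: str, sequence: str):
    """Return (1-based position, matched text) for every hit, overlaps included."""
    # A lookahead lets the scan report overlapping matches.
    regex = re.compile("(?=(" + prosite_to_regex(pattern) + "))")
    return [(m.start() + 1, m.group(1)) for m in regex.finditer(sequence)]


if __name__ == "__main__":
    # Made-up peptide used only to exercise the scan.
    print(find_sites("N-{P}-[ST]-{P}.", "MKNASLNPTQNGTA"))
```

A regular expression of this kind only reports exact matches to a short motif; it does not score partial matches or delimit a whole domain, which is the limitation that profiles and HMMs address.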
To be stored, data need a standard representation. For sequences we have: …………………… And for building alignment tools, we have local match ………..
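The slide only names "local match" without detail, so the following is a minimal sketch of one common way to score a local match between two sequences, using a Smith-Waterman style dynamic programme. The function name and the scoring values (match, mismatch, gap) are assumptions made for illustration and are not taken from the slides.

```python
def local_alignment_score(a: str, b: str,
                          match: int = 2, mismatch: int = -1, gap: int = -2) -> int:
    """Best local alignment score between a and b (Smith-Waterman style).

    Scoring values are illustrative assumptions, not values from the slides.
    Only the score is computed; the alignment itself is not traced back.
    """
    prev = [0] * (len(b) + 1)   # scores for the previous row of the matrix
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Flooring at 0 lets an alignment restart anywhere, which is what
            # makes the match local rather than global.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


if __name__ == "__main__":
    # The shared core "ACGT" is found even though the flanks differ.
    print(local_alignment_score("GGACGTCC", "TTACGTAA"))
```

Keeping only two rows of the matrix is enough when just the score is needed; recovering the aligned region would require storing the full matrix or back-pointers.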
Software generally takes the name of the algorithm it implements. Hundreds of algorithms are available. For details on the algorithm behind each tool, links are usually provided in the website's list of tools. If that fails, typing the name of the tool into a search engine (e.g. Google) will usually lead to the algorithm it refers to. Biological sequences and data can be analyzed in many ways with bioinformatics tools: they can be read, assembled, compared, mapped, predicted, designed, and modeled.
Molecular online tools are reposed on biological databases and consist of many software programs (program implementing data manipulating algorithm or process)Databases contain bibliography (since 1960), taxonomy, nucleotide, genomic, protein, microarray, metabolic pathway concerning, sequence, RNA, organism,…Software generally takes the name of the coded algorithm (next slide)Molecular online tools are reposed on biological databases and consist of many software programs (program implementing data manipulating algorithm or process)Databases contain bibliography (since 1960), taxonomy, nucleotide, genomic, protein, microarray, metabolic pathway concerning, sequence, RNA, organism,…Bibliographic DBMEDLINE is accessible through EBI's SRS. PUBMED is accessible through NCBI's ENTREZ.EMBASE is a commercial product formedical literature. BIOSIS, the inheritor of the old Biological Abstracts, covers a broad biological field; Zoological Record indexes and zoological literature. CAB International maintains abstract databases in the fields of agriculture and parasitic diseases. AGRICOLA is for the agricultural field what MEDLINE is for the medical field . The bibliographical databases, with the exception of MEDLINE/PUBMED, are only available through commercial database vendors. Taxonomy DB NEWT The Tree of Life project Species 2000 IOPI: International Organization for Plant Information ITIS: Integrated Taxonomic Information SystemNucleotide DBIn Europe, the vast majority of the nucleotide sequence data produced is collected, organized, and distributed by the EMBL Nucleotide Sequence Database located at the EBI in Cambridge UK. An Outstation of the European Molecular Biology Laboratory (EMBL) is located in Heidelberg, Germany. The nucleotide sequence databases are data repositories, accepting nucleic acid sequence data from the scientific community and making it freely available. 
The databases strive for completeness, with the aim of recording every publicly known nucleic acid sequence. These databases are heterogenous, they vary with respect to the source of the material (e.g. genomic versus cDNA), the intended quality (e.g. finished versus single pass sequences), the extent of sequence annotation, and the intended completeness of the sequence relative to its biological target (e.g. complete versus partial coverage of a gene or a genome). The nucleotide databases are distributed free of charge over the internet.DDBJ, GenBank and EMBL-Bank exchange new and updated data on a daily basis to achieve optimal synchronization. The result is that they contain exactly the same information, except for sequences that have been added in the last 24 hours. - Genomic DBGenomic databases vary greatly in form and contentFor organisms of major interest to geneticists, there is a long history of conventionally published catalogues of genes or mutations. In the past few years, most of these have been made available in an electronic form and a variety of new databases have been developed. These databases vary greatly in the classes of data captured and how this data is stored. Genomes Server - this server gives access to a hundreds of complete genome sequences, including those from archaea, bacteria, eukaryota, organelles, phages, plasmids, viroids, and viruses. Proteome Analysis - the Proteome Analysis database has been set up to provide comprehensive statistical and comparative analyses of the predicted proteomes of fully sequenced organisms. Ensembl - this is a joint project between the EBI and the Wellcome Trust Sanger Institute that aims at developing a system that maintains automatic annotation of large eukaryotic genomes. Ensembl presents up-to-date sequence data and the best possible automatic annotation for metazoan genomes. Available now are human, mouse, rat, fugu, zebrafish, mosquito, Drosophila, C. elegans, and C. briggsae. 
Karyn's Genomes - contains general information about organisms whose genomes are completely sequenced. The main aim of the database is to provide a short and concise explanation as to why it is important to obtain these organisms genomic sequences. WormBase - this is a repository of mapping, sequencing, and phenotypic information for C. elegans (and some other nematodes). FlyBase - the database for Drosophila melanogaster is one of the best-curated genetic databases.MGD - the 'Mouse Genome Database' is one of the most comprehensively curated genetic databases.RGD - the 'Rat Genome Database' curates and integrates rat genetic and genomic data and provides access to this data to support research using the rat as a genetic model for the study of human disease.The MIPS yeast database is an important resource for information on the yeast genome and its products. SGD - the 'Saccharomyces Genome Database' is another major yeast database.SPGP - the 'S. Pombe Genome Project' based at the Sanger Institute is the database for genetic data on the fungus Schizosaccharomycespombe.AceDB - this is the database for genetic and molecular data concerning Caenorhabditiselegans. The database management system written for AceDB by R. Durbin and J. Thierry-Mieg has proved very popular and has been used in many other species-specific databases. AceDB is now the name of this database management system, resulting in some confusion relative to the C. elegans database. The entire database can be downloaded from the Sanger Institute.HIV-SD - the 'HIV Sequence Database' collects, curates and annotates HIV and SIV sequence data and provides various tools for analyzing this data.- Protein DBThe protein databases are the most comprehensive source of information on proteins. 
It is necessary to distinguish between universal databases covering proteins from all species and specialized data collections storing information about specific families or groups of proteins, or about the proteins of a specific organism. Two categories of universal protein databases can be discerned: simple archives of sequence data; and annotated databases where additional information has been added to the sequence record. In the following you will find a short description of the: Primary protein sequence databases such as UniProtKB/Swiss-ProtSpecialised protein sequence databases such as GOA Gene Ontology AnnotationSpecialised protein databases such as ENZYMESecondary protein databases such as InterProStructure databases such as PDBIntegr8 - The Integr8 web portal provides easy access to integrated information about deciphered genomes and their corresponding proteomes. ENZYME - this database is an annotated extension of the Enzyme Commission's publication, linked to UniProtKB/Swiss-Prot. There are also databases of enzyme properties such as BRENDA, Ligand Chemical Database for Enzyme Reactions such as LIGAND, and the database of 'Enzymes and Metabolic Pathways' (EMP). LIGAND and EMP are searchable via SRS at the EBI. LIGAND is linked to the metabolic pathways in KEGG.2 – dimensional gel electrophoresis data - a database is available from Expasy and the Danish Centre for Human Genome Research (DCHGR). Mass spectrometry protein data - a useful resource which includes protein cleavage products, is maintained at Rockefeller University. - Examples of secondary protein databases include: PROSITE - The special value of this database is the extensive documentation on many protein families, as defined by sequence domains or motifs. PROSITE contains biologically significant sites and patterns formulated in such a way that with appropriate computational tools it can rapidly and reliably identify to which family of proteins the new sequence belongs. 
The profile structure used in PROSITE is similar to but slightly more general than the one introduced by Gribskov and co-workers (Gribskov et al.,1987). Generalised profiles are remarkably similar to the specific type of Hidden Markov Models (HMMs) used in Pfam. PRINTS - A different approach to pattern recognition, termed "fingerprinting" is used by this database. Within a sequence alignment, it is usual to find not one, but several motifs that characterize the aligned family. Diagnostically, it makes sense to use many, or all, of the conserved regions to build a family signature. In a database search, there is then a greater chance of identifying a distant relative, whether or not all parts of the signature are matched. The ability to tolerate mismatches, both at the level of residues within individual motifs, and at the level of motifs within the fingerprint as a whole, renders fingerprinting a powerful diagnostic technique.Pfam - Another important secondary protein database is Pfam. The methodology used by Pfam to create protein family or domain signatures is Hidden Markov Models (HMMs). HMMs are closely related to profiles, but are based on probability theory methods. These allow a direct statistical approach to identifying and scoring matches, and also to combining information from a multiple alignment with prior knowledge. One feature that distinguishes HMMs and profiles from regular expressions and fingerprints is that the formers allow the full extent of a domain to be identified in a sequence. They are thus particularly useful when analyzing multi-domain proteins. The biggest drawback of Pfam is its lack of biological information (annotation) of the protein families.BLOCKS - Blocks are multiply aligned un-gapped segments corresponding to the most highly conserved regions of proteins. 
The blocks for the Blocks Database are made automatically by looking for the most highly conserved regions in groups of proteins documented in InterPro.SBASE - This is a protein domain library sequences database that contains annotated structural, functional, ligand-binding and topogenic segments of proteins, cross-referenced to all major sequence databases and sequence pattern collections.Software generally takes the name of the coded algorithm (next slide)
Molecular online tools are reposed on biological databases and consist of many software programs (program implementing data manipulating algorithm or process)Databases contain bibliography (since 1960), taxonomy, nucleotide, genomic, protein, microarray, metabolic pathway concerning, sequence, RNA, organism,…Software generally takes the name of the coded algorithm (next slide)Molecular online tools are reposed on biological databases and consist of many software programs (program implementing data manipulating algorithm or process)Databases contain bibliography (since 1960), taxonomy, nucleotide, genomic, protein, microarray, metabolic pathway concerning, sequence, RNA, organism,…Bibliographic DBMEDLINE is accessible through EBI's SRS. PUBMED is accessible through NCBI's ENTREZ.EMBASE is a commercial product formedical literature. BIOSIS, the inheritor of the old Biological Abstracts, covers a broad biological field; Zoological Record indexes and zoological literature. CAB International maintains abstract databases in the fields of agriculture and parasitic diseases. AGRICOLA is for the agricultural field what MEDLINE is for the medical field . The bibliographical databases, with the exception of MEDLINE/PUBMED, are only available through commercial database vendors. Taxonomy DB NEWT The Tree of Life project Species 2000 IOPI: International Organization for Plant Information ITIS: Integrated Taxonomic Information SystemNucleotide DBIn Europe, the vast majority of the nucleotide sequence data produced is collected, organized, and distributed by the EMBL Nucleotide Sequence Database located at the EBI in Cambridge UK. An Outstation of the European Molecular Biology Laboratory (EMBL) is located in Heidelberg, Germany. The nucleotide sequence databases are data repositories, accepting nucleic acid sequence data from the scientific community and making it freely available. 
The databases strive for completeness, with the aim of recording every publicly known nucleic acid sequence. These databases are heterogenous, they vary with respect to the source of the material (e.g. genomic versus cDNA), the intended quality (e.g. finished versus single pass sequences), the extent of sequence annotation, and the intended completeness of the sequence relative to its biological target (e.g. complete versus partial coverage of a gene or a genome). The nucleotide databases are distributed free of charge over the internet.DDBJ, GenBank and EMBL-Bank exchange new and updated data on a daily basis to achieve optimal synchronization. The result is that they contain exactly the same information, except for sequences that have been added in the last 24 hours. - Genomic DBGenomic databases vary greatly in form and contentFor organisms of major interest to geneticists, there is a long history of conventionally published catalogues of genes or mutations. In the past few years, most of these have been made available in an electronic form and a variety of new databases have been developed. These databases vary greatly in the classes of data captured and how this data is stored. Genomes Server - this server gives access to a hundreds of complete genome sequences, including those from archaea, bacteria, eukaryota, organelles, phages, plasmids, viroids, and viruses. Proteome Analysis - the Proteome Analysis database has been set up to provide comprehensive statistical and comparative analyses of the predicted proteomes of fully sequenced organisms. Ensembl - this is a joint project between the EBI and the Wellcome Trust Sanger Institute that aims at developing a system that maintains automatic annotation of large eukaryotic genomes. Ensembl presents up-to-date sequence data and the best possible automatic annotation for metazoan genomes. Available now are human, mouse, rat, fugu, zebrafish, mosquito, Drosophila, C. elegans, and C. briggsae. 
• Karyn's Genomes - contains general information about organisms whose genomes are completely sequenced. The main aim of the database is to give a short, concise explanation of why it is important to obtain these organisms' genomic sequences.
• WormBase - a repository of mapping, sequencing, and phenotypic information for C. elegans (and some other nematodes).
• FlyBase - the database for Drosophila melanogaster, one of the best-curated genetic databases.
• MGD - the 'Mouse Genome Database', one of the most comprehensively curated genetic databases.
• RGD - the 'Rat Genome Database' curates and integrates rat genetic and genomic data and provides access to this data to support research using the rat as a genetic model for the study of human disease.
• MIPS - the MIPS yeast database is an important resource for information on the yeast genome and its products.
• SGD - the 'Saccharomyces Genome Database' is another major yeast database.
• SPGP - the 'S. pombe Genome Project', based at the Sanger Institute, is the database for genetic data on the fission yeast Schizosaccharomyces pombe.
• AceDB - the database for genetic and molecular data concerning Caenorhabditis elegans. The database management system written for AceDB by R. Durbin and J. Thierry-Mieg has proved very popular and has been used in many other species-specific databases. AceDB is now also the name of this database management system, which causes some confusion with the C. elegans database. The entire database can be downloaded from the Sanger Institute.
• HIV-SD - the 'HIV Sequence Database' collects, curates and annotates HIV and SIV sequence data and provides various tools for analyzing this data.
Protein DB
The protein databases are the most comprehensive source of information on proteins.
It is necessary to distinguish between universal databases covering proteins from all species and specialized data collections storing information about specific families or groups of proteins, or about the proteins of a specific organism. Two categories of universal protein databases can be discerned: simple archives of sequence data, and annotated databases where additional information has been added to the sequence record. In the following you will find short descriptions of: primary protein sequence databases such as UniProtKB/Swiss-Prot; specialised protein sequence databases such as GOA (Gene Ontology Annotation); specialised protein databases such as ENZYME; secondary protein databases such as InterPro; and structure databases such as PDB.
• Integr8 - the Integr8 web portal provides easy access to integrated information about deciphered genomes and their corresponding proteomes.
• ENZYME - an annotated extension of the Enzyme Commission's publication, linked to UniProtKB/Swiss-Prot. There are also databases of enzyme properties such as BRENDA, the Ligand Chemical Database for Enzyme Reactions (LIGAND), and the database of 'Enzymes and Metabolic Pathways' (EMP). LIGAND and EMP are searchable via SRS at the EBI. LIGAND is linked to the metabolic pathways in KEGG.
• 2-dimensional gel electrophoresis data - a database is available from Expasy and the Danish Centre for Human Genome Research (DCHGR).
• Mass spectrometry protein data - a useful resource, which includes protein cleavage products, is maintained at Rockefeller University.
Examples of secondary protein databases include:
• PROSITE - the special value of this database is its extensive documentation on many protein families, as defined by sequence domains or motifs. PROSITE contains biologically significant sites and patterns formulated in such a way that, with appropriate computational tools, one can rapidly and reliably identify to which family of proteins a new sequence belongs.
The profile structure used in PROSITE is similar to, but slightly more general than, the one introduced by Gribskov and co-workers (Gribskov et al., 1987). Generalised profiles are remarkably similar to the specific type of Hidden Markov Models (HMMs) used in Pfam.
• PRINTS - this database uses a different approach to pattern recognition, termed "fingerprinting". Within a sequence alignment, it is usual to find not one, but several motifs that characterize the aligned family. Diagnostically, it makes sense to use many, or all, of the conserved regions to build a family signature. In a database search, there is then a greater chance of identifying a distant relative, whether or not all parts of the signature are matched. The ability to tolerate mismatches, both at the level of residues within individual motifs and at the level of motifs within the fingerprint as a whole, makes fingerprinting a powerful diagnostic technique.
• Pfam - another important secondary protein database. Pfam uses Hidden Markov Models (HMMs) to create protein family or domain signatures. HMMs are closely related to profiles, but are based on probability theory. This allows a direct statistical approach to identifying and scoring matches, and also to combining information from a multiple alignment with prior knowledge. One feature that distinguishes HMMs and profiles from regular expressions and fingerprints is that the former allow the full extent of a domain to be identified in a sequence. They are thus particularly useful when analyzing multi-domain proteins. The biggest drawback of Pfam is its lack of biological information (annotation) on the protein families.
• BLOCKS - blocks are multiply aligned ungapped segments corresponding to the most highly conserved regions of proteins.
The blocks for the Blocks Database are made automatically by looking for the most highly conserved regions in groups of proteins documented in InterPro.
• SBASE - a protein domain library sequences database that contains annotated structural, functional, ligand-binding and topogenic segments of proteins, cross-referenced to all major sequence databases and sequence pattern collections.
Software generally takes the name of the algorithm it implements (next slide).
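The contrast drawn in these notes between regular expressions and profile or HMM methods can be illustrated with a small sketch. PROSITE patterns use a compact syntax (elements separated by "-", x for any residue, [..] for allowed residues, {..} for excluded residues, (n) or (n,m) for repeats) that translates almost directly into a regular expression. The function names prosite_to_regex and find_sites and the toy sequence are illustrative assumptions, not part of PROSITE or of any tool named on the slides; the pattern N-{P}-[ST]-{P} is the PROSITE N-glycosylation site pattern.

```python
import re


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression.

    Hypothetical helper for illustration; it handles the common syntax only.
    """
    parts = []
    for element in pattern.rstrip(".").split("-"):
        prefix = suffix = ""
        if element.startswith("<"):      # anchored at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):        # anchored at the C-terminus
            suffix, element = "$", element[:-1]
        # Split an element such as x(2,4) into its core and repeat counts.
        m = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        core, low, high = m.group(1), m.group(2), m.group(3)
        if core == "x":                  # any residue
            core = "."
        elif core.startswith("{") and core.endswith("}"):
            core = "[^" + core[1:-1] + "]"   # excluded residues
        # [ABC] sets and single letters carry over unchanged.
        repeat = ""
        if low is not None:
            repeat = "{" + low + ("," + high if high else "") + "}"
        parts.append(prefix + core + repeat + suffix)
    return "".join(parts)


def find_sites(sequence, pattern):
    """Return (1-based start, matched text) for every hit, overlaps included."""
    regex = "(?=(" + prosite_to_regex(pattern) + "))"
    return [(m.start() + 1, m.group(1)) for m in re.finditer(regex, sequence)]


if __name__ == "__main__":
    toy_sequence = "AANGSAANPSANKTA"   # made-up sequence, not a real protein
    print(prosite_to_regex("N-{P}-[ST]-{P}"))
    print(find_sites(toy_sequence, "N-{P}-[ST]-{P}"))
```

A regular expression of this kind reports only the short motif it matches. Unlike a profile or an HMM, it does not delimit the full extent of a domain, which is the limitation the notes point out.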
To be stored, data need a standard representation.
For sequences we have: ……………………
And for building alignment tools, we have local match………..
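The note above leaves the specifics elided, so the following is only a sketch under stated assumptions: it takes FASTA as one widely used text representation of sequences and Smith-Waterman scoring as one classic local alignment method. Neither is confirmed as what the slide showed. The function names, the scoring values (match=2, mismatch=-1, gap=-2), and the toy sequences are all hypothetical.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a {name: sequence} dictionary."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].split()
            name = header[0] if header else "unnamed"
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {key: "".join(chunks) for key, chunks in records.items()}


def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best Smith-Waterman local alignment score with a linear gap penalty.

    Scoring values are assumptions chosen for illustration.
    """
    previous = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        current = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diagonal = previous[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A cell never drops below 0, so an alignment can restart anywhere.
            current[j] = max(0, diagonal, previous[j] + gap, current[j - 1] + gap)
            best = max(best, current[j])
        previous = current
    return best


if __name__ == "__main__":
    toy_fasta = ">seq1 toy example\nTTAC\nGTAA\n>seq2\nGGACGTCC\n"
    sequences = parse_fasta(toy_fasta)
    print(sequences)
    print(local_alignment_score(sequences["seq1"], sequences["seq2"]))
```

Resetting negative scores to zero is what makes the alignment local: the best-scoring shared region (here the common stretch ACGT) is found regardless of the unrelated flanking residues.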
This is the lab template. The context is biological: it is based on a real biological problem and a given hypothesis.
I avoid the term "computer science"; it is a strong word.
When you read this template, you have a different view than a computer scientist. You want to understand the process used to build the tools:
• The architecture of the system
• The algorithm implementation
• The quality of the resulting data
• And so on