Session i overview bioinfo dm and app mmc


USD Bioinformatics


Data Manipulation: Molecular Online
and Server Tools & BioExtract Server
Theme: FXN Gene and Pancreatic Cancer.
Etienne Z. Gnimpieba
BRIN WS 2013
Mount Marty College – June 24th 2013
Etienne.gnimpieba@usd.edu

Data Manipulation Molecular Online Tools: BioExtract Server
Review: Databases
Metabolic:
• SABIO-RK (check with Brent)
• KEGG (check with Brent)
• HMDB (hmdb.ca, contact for API)
• SMPDB (http://www.smpdb.ca)
• BioModels
• drugDB
• BRENDA (check with Brent)
• [Mathi's project]
Protein:
• ExPASy DB collection (UniProt, …)
• PDB
• SBKB
• STRING
Genomic:
• GEO
• GenBank
• GO
• EBI ArrayExpress & Gene Expression Atlas
Phenomic:
• PhenomicDB
• Phenoscape

Review: Databases (Molecular Biology Databases)
Bibliographic: MEDLINE, PubMed, EMBASE, BIOSIS, CAB International, AGRICOLA
Taxonomic: NEWT, The Tree of Life, Species 2000, IOPI, ITIS
Nucleotide: INSDC, EMBL, DDBJ, NCBI, GenBank
Genomic: SPGP, AceDB, HIV-SD, Ensembl, WormBase, FlyBase, MGD, SGD, EBI (Genomes Server, Karyn's Genomes), RGD
Protein: GOA, ENZYME, InterPro, PDB, Integr8, MEROPS, LIGAND, EMP, DCHGR, PROSITE, PRINTS, Pfam, BLOCKS, SBASE, UniProt/Swiss-Prot, PIR
Metabolic pathway: KEGG, EcoCyc, BRENDA, ENZYME, BioModels, Reactome

Sequence type: accession number format : example
• DNA sequence from GenBank, EMBL or DDBJ: 1 letter + 5 digits : U43752, or 2 letters + 6 digits : AF462052
• GenPept sequence from GenBank, EMBL or DDBJ: 3 letters + 5 digits : AAF46449
• Protein sequence from Swiss-Prot: 1 letter + 5 digits : Q16595
• Protein sequence from the Protein Research Foundation: 6/7 digits + 1 letter : 2808353A
• RefSeq sequence: 2 letters + _ + 6 or more digits : mRNA NM_******, protein NP_******
• Protein sequence from the Protein Data Bank (PDB): 1 digit + 3 alphanumeric characters : 2EFF
• Protein sequence from the Molecular Modeling DataBase (MMDB): "MMDB ID" + more than 4 digits : MMDB ID 767744
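The formats listed above can be checked mechanically. The following is a minimal sketch, assuming only the simplified formats given on this slide; the pattern names and the `classify_accession` function are illustrative, and real databases accept more formats than these. Because format alone can be ambiguous (a Swiss-Prot accession and an older nucleotide accession are both 1 letter + 5 digits), the function returns every candidate type rather than a single answer.

```python
import re

# Simplified patterns taken from the slide's list; not an exhaustive
# description of the accession formats used by each database.
ACCESSION_PATTERNS = [
    ("RefSeq mRNA", re.compile(r"^NM_\d{6,}$")),
    ("RefSeq protein", re.compile(r"^NP_\d{6,}$")),
    ("Nucleotide (GenBank/EMBL/DDBJ)", re.compile(r"^(?:[A-Z]\d{5}|[A-Z]{2}\d{6})$")),
    ("GenPept protein", re.compile(r"^[A-Z]{3}\d{5}$")),
    ("Swiss-Prot protein", re.compile(r"^[A-Z]\d{5}$")),
    ("Protein Research Foundation", re.compile(r"^\d{6,7}[A-Z]$")),
    ("PDB structure", re.compile(r"^\d[A-Z0-9]{3}$")),
]


def classify_accession(accession):
    """Return the names of all listed formats that the accession matches."""
    text = accession.strip().upper()
    return [name for name, pattern in ACCESSION_PATTERNS if pattern.match(text)]


if __name__ == "__main__":
    # NM_123456 is a made-up placeholder used only to show the format.
    for example in ["U43752", "AF462052", "AAF46449", "Q16595", "2808353A", "2EFF", "NM_123456"]:
        print(example, "->", classify_accession(example))
```

When several candidates are returned, the source database (or the FASTA header tag) is needed to decide which one applies.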
Review: data format
FASTA header lines: >gi|XXXX|XXX and >sp|XXXX|XXX
Header fields labelled on the slide: GenInfo (GI) number, accession number, species reference

Review: Online Programs & Algorithms
Biological sequences and data can be analyzed in many ways with bioinformatics tools.
They can be read, assembled, compared, mapped, predicted, designed, modeled…
1. Nucleotide and protein sequence searching (blastall; from the FASTA suite, SSEARCH for local and GLSEARCH for global)
2. Multiple sequence alignment (ClustalW2, MView, …)
3. Pairwise sequence alignment (Needle for global, LALIGN for local)
4. Protein functional analysis (SMART, Phobius, InterProScan)
5. Functional genomic tools (R-tools, SAIL, EFOtools, …)
6. Molecular structure analysis (PDBeFold, QuaternaryStructure, …)
7. Scientific literature text mining (EBIMed, Whatizit)
8. Sequence translation (Transeq, readseq, Backtranseq, …)
9. Data retrieval and ID mapping (dbfetch, ENA/SRA, SRS, PICR)
10. Protein structure prediction tools
11. …

Review: Sequence Analysis
Global match: align all residues of one sequence with all residues of the other sequence
Local match: find a region in one sequence that matches a region of the other
Multiple alignment: a mutual alignment of many sequences
Motif match: find matches of a short sequence in one or more regions internal to another, longer sequence; it could be a perfect match, or a match with mismatches, insertions, or deletions
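The difference between a global and a local match can be illustrated with the two classic dynamic-programming scores. This is a minimal sketch, not the implementation used by Needle or LALIGN: it uses a simple match/mismatch/linear-gap scoring scheme chosen for illustration (tools for real data use substitution matrices and affine gaps) and returns only the score, not the alignment itself.

```python
def global_score(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch score: every residue of both sequences is aligned."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def local_score(a, b, match=1, mismatch=-1, gap=-2):
    """Smith-Waterman score: best-scoring pair of subsequences."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best


if __name__ == "__main__":
    short, long_ = "ACGT", "GGACGTGG"
    print("global:", global_score(short, long_))  # penalised for the unmatched flanks
    print("local: ", local_score(short, long_))   # finds the embedded ACGT
```

With these toy sequences the global score is negative because the four flanking residues must be paired with gaps, while the local score rewards only the shared region.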

Sequence alignment: assignment of residue-residue correspondence. It is used to:
- Determine phylogenetic relationships by analyzing similarity and homology
  Similarity: observation or measurement of resemblance and difference
  Homology: the sequences and the organisms in which they occur are descended from a common ancestor → homology must be inferred from observed similarity
- Determine whether a protein (or a gene) is related to a larger group of proteins
- Verify whether a mutated residue is conserved among species
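Similarity, as defined above, is something measured on an alignment. The sketch below shows two such measurements on already-aligned sequences: percent identity between two rows, and whether a given column is conserved across all rows. The function names are illustrative, and percent identity has several definitions in use; this one divides by the number of columns where neither sequence has a gap.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent of gap-free columns in which the two aligned rows agree."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = identical = 0
    for x, y in zip(aligned_a, aligned_b):
        if x == gap or y == gap:
            continue  # columns containing a gap are not counted
        compared += 1
        if x == y:
            identical += 1
    return 100.0 * identical / compared if compared else 0.0


def is_conserved(aligned_seqs, column, gap="-"):
    """True if every sequence has the same (non-gap) residue in the column."""
    residues = {seq[column] for seq in aligned_seqs}
    return len(residues) == 1 and gap not in residues


if __name__ == "__main__":
    print(percent_identity("ACGT-A", "ACCTTA"))
    toy_alignment = ["MKV", "MRV", "MKV"]  # made-up rows for illustration
    print(is_conserved(toy_alignment, 0), is_conserved(toy_alignment, 1))
```

A high identity is an observation; concluding homology from it remains an inference, as the slide stresses.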

Context
0. Specification & Aims
Statement of problem / Case study: The FXN gene provides instructions for making a protein called frataxin. This protein is found in cells throughout the body, with the highest levels in the heart, spinal cord, liver, pancreas, and the muscles used for voluntary movement (skeletal muscles). Within cells, frataxin is found in energy-producing structures called mitochondria. Although its function is not fully understood, frataxin appears to help assemble clusters of iron and sulfur molecules that are critical for the function of many proteins, including those needed for energy production. Mutations in the FXN gene cause Friedreich ataxia. Friedreich ataxia is a genetic condition that affects the nervous system and causes movement problems. Most people with Friedreich ataxia begin to experience the signs and symptoms of the disorder around puberty.
Molecular Online Tools and Server
Keywords:
Bio: FXN, frataxin, pancreatic cancer, CDKN4
Math: HMM
Informatics: programming, bioinformatics tools, getting and exporting data
Reduced expression of frataxin is the cause of Friedreich's ataxia (FRDA), a lethal neurodegenerative disease. How about liver cancer?
Aim: The purpose of this lab is to introduce online tools for exploring large-scale biological data on the human model (metabolic, proteomic, genomic, …). We simulate their application to the FXN gene and pancreatic cancer, to show how a researcher can identify cross-cutting biological knowledge available in data banks.
Acquired skills
Online and server tools:
- Query biological DBs (FASTA, HTML, txt, figure formats)
- Sequence tools (protein and gene): alignment (showalign, ClustalW2), similarity, …
- Manage result data (select, keep, map, export)
- Build and reuse workflows
Biological Hypothesis (figure panels: FXN on chromosome 9; frataxin molecule structure (PyMOL); pancreatic cancer; pancreas anatomy)
Resolution Process (biological DBs and tools), leading to Results
T1. Metabolomics
Objective: Use metabolic data repositories to understand the frataxin protein mechanism.
T1.1. Finding the enzyme and pathway related to frataxin using KEGG
T1.2. Finding the reaction involving frataxin using Reactome
T1.3. Using BRENDA for enzyme data on frataxin
T1.4. Using the collected data for analysis
T1.5. Redo the process with pancreatic cancer
T2. Genome exploration
Objective: Use Ensembl to locate FXN on the human genome and identify the genes implicated in pancreatic cancer.
T2.1. Locate a given gene on the human genome
T2.2. Get a genomic sequence from NCBI
T2.3. Get the protein data and sequence from EBI
T2.4. Save the exported sequence data in the data folder
T3. Sequence manipulation
Objective: Find similar sequences using BLAST tools and align the given sequences.
T3.1. Find similar sequences using the BLAST tool
T3.2. Align the generated sequences with the ClustalW tool
T3.3. Visualize the result as a phylogenetic tree in Jalview
T4. Protein Data and Structural Biology Knowledge
Objective: Study frataxin at the protein level and its connection with pancreatic cancer (functional and structural data).
T4.1. Structural knowledge on frataxin using SBKB
T4.2. Using UniProt for the frataxin protein study
T4.3. Protein-protein interaction using STRING
T4.4. Using the same method for pancreatic cancer and comparing
T5. BioExtract server
Objective: Use a server tool to optimize the data manipulation process, applied on the BioExtract server.
T5.1. Server initialization
T5.2. Pancreatic cancer & frataxin (FXN)
T5.3. Mapping, alignment
T5.4. Workflow save & reuse

3. Review: Databases. Active Network Extraction & Analysis
- Start from the Reactome Functional Interaction network
- Extract mutated, overexpressed, underexpressed, expanded/deleted genes and add linker genes → disease subnetwork
- Apply community clustering algorithms → disease "modules"
- Uses: disease gene prediction, sample classification, hypothesis generation

4. Review: Databases. Pancreatic Cancer Module Map (43 cases) (figure; credit: Christina Yung / Bioinformatics.ca)
Module labels: p53, SMAD, TGFβ, TNF signaling; KRAS, MAPK signaling; heterotrimeric G-protein signaling; Rho GTPase signaling; transcription & translation; cell cycle; Wnt & cadherin signaling; Hedgehog signaling; transcription; zinc fingers; Ca2+ signaling
Non-silent mutations:
• blue: in primary tumour only
• green: in xenograft only
• red: in primary & xenograft

8. Review: Databases. Boolean operators and symbols
AND = term1 AND term2 must both exist in the searched documents
OR = term1 OR term2 must exist
NOT = term1 must not be present in any of the displayed documents
ALL = term1 must be present in all of the displayed documents
+ term1 = document must contain term1
- term1 = document must not contain term1
XXX* = any characters are accepted after XXX
XX?X = any single character is accepted in place of ?
Examples:
FXN [AND] gene [NOT] Frataxin → all data related to the FXN gene except those concerning the frataxin protein
ataxia + apraxia + gene → all genes related to ataxia and apraxia
Ada* [AUTH] → all authors whose names begin with Ada
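The operators above can be mimicked on a small in-memory list of texts to see what each one selects. This is a toy sketch, not how PubMed or SRS evaluate queries: the `search` function and its `all_of` / `any_of` / `none_of` parameters are hypothetical names, matching is done word by word and case-insensitively, and the wildcards `*` and `?` are handled by Python's `fnmatch` module.

```python
import re
from fnmatch import fnmatchcase


def has_term(document, pattern):
    """True if any word of the document matches the (wildcard) pattern."""
    words = re.findall(r"\w+", document.lower())
    return any(fnmatchcase(word, pattern.lower()) for word in words)


def search(documents, all_of=(), any_of=(), none_of=()):
    """Keep documents with every all_of term (AND / +), at least one
    any_of term (OR) and no none_of term (NOT / -)."""
    hits = []
    for doc in documents:
        if not all(has_term(doc, term) for term in all_of):
            continue
        if any_of and not any(has_term(doc, term) for term in any_of):
            continue
        if any(has_term(doc, term) for term in none_of):
            continue
        hits.append(doc)
    return hits


if __name__ == "__main__":
    # Made-up document titles for illustration only.
    docs = [
        "FXN gene expression in pancreas",
        "Frataxin protein and the FXN gene",
        "Adams reports on ataxia",
    ]
    print(search(docs, all_of=["FXN", "gene"], none_of=["Frataxin"]))
    print(search(docs, all_of=["Ada*"]))
    print(search(docs, any_of=["atax?a", "apraxia"]))
```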

9. Review: Databases. Similarity search tools
BLAST (Basic Local Alignment Search Tool): compares a protein or DNA sequence to other sequences
FASTA (FAST-All): fast protein or nucleotide comparison


Editor's Notes

Welcome to this bioinformatics lab on data manipulation using online and server tools. As the theme, we have chosen to study the interaction between frataxin and pancreatic cancer.
Molecular online tools rest on biological databases and consist of many software programs (programs implementing a data manipulation algorithm or process). Databases contain bibliography (since 1960), taxonomy, nucleotide, genomic, protein, microarray, and metabolic pathway data concerning sequences, RNA, organisms, … Software generally takes the name of the coded algorithm (next slide).
Bibliographic DB: MEDLINE is accessible through EBI's SRS. PubMed is accessible through NCBI's Entrez. EMBASE is a commercial product for medical literature. BIOSIS, the inheritor of the old Biological Abstracts, covers a broad biological field; Zoological Record indexes zoological literature. CAB International maintains abstract databases in the fields of agriculture and parasitic diseases. AGRICOLA is for the agricultural field what MEDLINE is for the medical field. The bibliographic databases, with the exception of MEDLINE/PubMed, are only available through commercial database vendors.
Taxonomy DB: NEWT; The Tree of Life project; Species 2000; IOPI (International Organization for Plant Information); ITIS (Integrated Taxonomic Information System).
Nucleotide DB: In Europe, the vast majority of the nucleotide sequence data produced is collected, organized, and distributed by the EMBL Nucleotide Sequence Database located at the EBI in Cambridge, UK, an outstation of the European Molecular Biology Laboratory (EMBL) in Heidelberg, Germany. The nucleotide sequence databases are data repositories, accepting nucleic acid sequence data from the scientific community and making it freely available.
The databases strive for completeness, with the aim of recording every publicly known nucleic acid sequence. These databases are heterogeneous: they vary with respect to the source of the material (e.g. genomic versus cDNA), the intended quality (e.g. finished versus single-pass sequences), the extent of sequence annotation, and the intended completeness of the sequence relative to its biological target (e.g. complete versus partial coverage of a gene or a genome). The nucleotide databases are distributed free of charge over the internet. DDBJ, GenBank and EMBL-Bank exchange new and updated data on a daily basis to achieve optimal synchronization. The result is that they contain exactly the same information, except for sequences that have been added in the last 24 hours.
Genomic DB: Genomic databases vary greatly in form and content. For organisms of major interest to geneticists, there is a long history of conventionally published catalogues of genes or mutations. In the past few years, most of these have been made available in electronic form and a variety of new databases have been developed. These databases vary greatly in the classes of data captured and how this data is stored.
Genomes Server - this server gives access to hundreds of complete genome sequences, including those from archaea, bacteria, eukaryota, organelles, phages, plasmids, viroids, and viruses.
Proteome Analysis - the Proteome Analysis database has been set up to provide comprehensive statistical and comparative analyses of the predicted proteomes of fully sequenced organisms.
Ensembl - this is a joint project between the EBI and the Wellcome Trust Sanger Institute that aims at developing a system that maintains automatic annotation of large eukaryotic genomes. Ensembl presents up-to-date sequence data and the best possible automatic annotation for metazoan genomes. Available now are human, mouse, rat, fugu, zebrafish, mosquito, Drosophila, C. elegans, and C. briggsae.
Karyn's Genomes - contains general information about organisms whose genomes are completely sequenced. The main aim of the database is to provide a short and concise explanation of why it is important to obtain these organisms' genomic sequences.
WormBase - this is a repository of mapping, sequencing, and phenotypic information for C. elegans (and some other nematodes).
FlyBase - the database for Drosophila melanogaster is one of the best-curated genetic databases.
MGD - the 'Mouse Genome Database' is one of the most comprehensively curated genetic databases.
RGD - the 'Rat Genome Database' curates and integrates rat genetic and genomic data and provides access to this data to support research using the rat as a genetic model for the study of human disease.
The MIPS yeast database is an important resource for information on the yeast genome and its products.
SGD - the 'Saccharomyces Genome Database' is another major yeast database.
SPGP - the 'S. pombe Genome Project' based at the Sanger Institute is the database for genetic data on the fungus Schizosaccharomyces pombe.
AceDB - this is the database for genetic and molecular data concerning Caenorhabditis elegans. The database management system written for AceDB by R. Durbin and J. Thierry-Mieg has proved very popular and has been used in many other species-specific databases. AceDB is now the name of this database management system, resulting in some confusion relative to the C. elegans database. The entire database can be downloaded from the Sanger Institute.
HIV-SD - the 'HIV Sequence Database' collects, curates and annotates HIV and SIV sequence data and provides various tools for analyzing this data.
Protein DB: The protein databases are the most comprehensive source of information on proteins.
It is necessary to distinguish between universal databases covering proteins from all species and specialized data collections storing information about specific families or groups of proteins, or about the proteins of a specific organism. Two categories of universal protein databases can be discerned: simple archives of sequence data, and annotated databases where additional information has been added to the sequence record. In the following you will find a short description of: primary protein sequence databases such as UniProtKB/Swiss-Prot; specialised protein sequence databases such as GOA (Gene Ontology Annotation); specialised protein databases such as ENZYME; secondary protein databases such as InterPro; and structure databases such as PDB.
Integr8 - the Integr8 web portal provides easy access to integrated information about deciphered genomes and their corresponding proteomes.
ENZYME - this database is an annotated extension of the Enzyme Commission's publication, linked to UniProtKB/Swiss-Prot. There are also databases of enzyme properties such as BRENDA, the Ligand Chemical Database for Enzyme Reactions (LIGAND), and the database of 'Enzymes and Metabolic Pathways' (EMP). LIGAND and EMP are searchable via SRS at the EBI. LIGAND is linked to the metabolic pathways in KEGG.
2-dimensional gel electrophoresis data - a database is available from ExPASy and the Danish Centre for Human Genome Research (DCHGR).
Mass spectrometry protein data - a useful resource, which includes protein cleavage products, is maintained at Rockefeller University.
Examples of secondary protein databases include:
PROSITE - the special value of this database is the extensive documentation on many protein families, as defined by sequence domains or motifs. PROSITE contains biologically significant sites and patterns formulated in such a way that appropriate computational tools can rapidly and reliably identify to which family of proteins a new sequence belongs.
The profile structure used in PROSITE is similar to but slightly more general than the one introduced by Gribskov and co-workers (Gribskov et al.,1987). Generalised profiles are remarkably similar to the specific type of Hidden Markov Models (HMMs) used in Pfam. PRINTS - A different approach to pattern recognition, termed "fingerprinting" is used by this database. Within a sequence alignment, it is usual to find not one, but several motifs that characterize the aligned family. Diagnostically, it makes sense to use many, or all, of the conserved regions to build a family signature. In a database search, there is then a greater chance of identifying a distant relative, whether or not all parts of the signature are matched. The ability to tolerate mismatches, both at the level of residues within individual motifs, and at the level of motifs within the fingerprint as a whole, renders fingerprinting a powerful diagnostic technique.Pfam - Another important secondary protein database is Pfam. The methodology used by Pfam to create protein family or domain signatures is Hidden Markov Models (HMMs). HMMs are closely related to profiles, but are based on probability theory methods. These allow a direct statistical approach to identifying and scoring matches, and also to combining information from a multiple alignment with prior knowledge. One feature that distinguishes HMMs and profiles from regular expressions and fingerprints is that the formers allow the full extent of a domain to be identified in a sequence. They are thus particularly useful when analyzing multi-domain proteins. The biggest drawback of Pfam is its lack of biological information (annotation) of the protein families.BLOCKS - Blocks are multiply aligned un-gapped segments corresponding to the most highly conserved regions of proteins. 
The blocks for the Blocks Database are made automatically by looking for the most highly conserved regions in groups of proteins documented in InterPro.SBASE - This is a protein domain library sequences database that contains annotated structural, functional, ligand-binding and topogenic segments of proteins, cross-referenced to all major sequence databases and sequence pattern collections.Software generally takes the name of the coded algorithm (next slide)
Molecular online tools rely on biological databases and consist of many software programs (each program implements a data-manipulation algorithm or process).
Databases contain bibliographic (since 1960), taxonomic, nucleotide, genomic, protein, microarray, and metabolic pathway data concerning sequences, RNA, organisms, …
Software generally takes the name of the coded algorithm (next slide).

Bibliographic DB
MEDLINE is accessible through EBI's SRS. PubMed is accessible through NCBI's Entrez.
EMBASE is a commercial product for medical literature.
BIOSIS, the inheritor of the old Biological Abstracts, covers a broad biological field; Zoological Record indexes zoological literature.
CAB International maintains abstract databases in the fields of agriculture and parasitic diseases.
AGRICOLA is for the agricultural field what MEDLINE is for the medical field.
The bibliographic databases, with the exception of MEDLINE/PubMed, are only available through commercial database vendors.

Taxonomy DB
• NEWT
• The Tree of Life project
• Species 2000
• IOPI: International Organization for Plant Information
• ITIS: Integrated Taxonomic Information System

Nucleotide DB
In Europe, the vast majority of the nucleotide sequence data produced is collected, organized, and distributed by the EMBL Nucleotide Sequence Database located at the EBI in Cambridge, UK. The EBI is an outstation of the European Molecular Biology Laboratory (EMBL), which is headquartered in Heidelberg, Germany. The nucleotide sequence databases are data repositories, accepting nucleic acid sequence data from the scientific community and making it freely available.
The databases strive for completeness, with the aim of recording every publicly known nucleic acid sequence. These databases are heterogeneous: they vary with respect to the source of the material (e.g. genomic versus cDNA), the intended quality (e.g. finished versus single-pass sequences), the extent of sequence annotation, and the intended completeness of the sequence relative to its biological target (e.g. complete versus partial coverage of a gene or a genome). The nucleotide databases are distributed free of charge over the internet.
DDBJ, GenBank and EMBL-Bank exchange new and updated data on a daily basis to achieve optimal synchronization. The result is that they contain exactly the same information, except for sequences that have been added in the last 24 hours.

Genomic DB
Genomic databases vary greatly in form and content. For organisms of major interest to geneticists, there is a long history of conventionally published catalogues of genes or mutations. In the past few years, most of these have been made available in electronic form, and a variety of new databases have been developed. These databases vary greatly in the classes of data captured and in how the data are stored.
Genomes Server - this server gives access to hundreds of complete genome sequences, including those from archaea, bacteria, eukaryota, organelles, phages, plasmids, viroids, and viruses.
Proteome Analysis - the Proteome Analysis database has been set up to provide comprehensive statistical and comparative analyses of the predicted proteomes of fully sequenced organisms.
Ensembl - a joint project between the EBI and the Wellcome Trust Sanger Institute that aims to develop a system that maintains automatic annotation of large eukaryotic genomes. Ensembl presents up-to-date sequence data and the best possible automatic annotation for metazoan genomes. Available now are human, mouse, rat, fugu, zebrafish, mosquito, Drosophila, C. elegans, and C. briggsae.
Karyn's Genomes - contains general information about organisms whose genomes are completely sequenced. The main aim of the database is to provide a short and concise explanation of why it is important to obtain these organisms' genomic sequences.
WormBase - a repository of mapping, sequencing, and phenotypic information for C. elegans (and some other nematodes).
FlyBase - the database for Drosophila melanogaster, one of the best-curated genetic databases.
MGD - the 'Mouse Genome Database' is one of the most comprehensively curated genetic databases.
RGD - the 'Rat Genome Database' curates and integrates rat genetic and genomic data and provides access to these data to support research using the rat as a genetic model for the study of human disease.
MIPS - the MIPS yeast database is an important resource for information on the yeast genome and its products.
SGD - the 'Saccharomyces Genome Database' is another major yeast database.
SPGP - the 'S. pombe Genome Project', based at the Sanger Institute, is the database for genetic data on the fungus Schizosaccharomyces pombe.
AceDB - the database for genetic and molecular data concerning Caenorhabditis elegans. The database management system written for AceDB by R. Durbin and J. Thierry-Mieg has proved very popular and has been used in many other species-specific databases. AceDB is now also the name of this database management system, which causes some confusion with the C. elegans database. The entire database can be downloaded from the Sanger Institute.
HIV-SD - the 'HIV Sequence Database' collects, curates and annotates HIV and SIV sequence data and provides various tools for analyzing these data.

Protein DB
The protein databases are the most comprehensive source of information on proteins.
It is necessary to distinguish between universal databases covering proteins from all species and specialized data collections storing information about specific families or groups of proteins, or about the proteins of a specific organism. Two categories of universal protein databases can be discerned: simple archives of sequence data, and annotated databases where additional information has been added to the sequence record.
In the following you will find a short description of the:
• Primary protein sequence databases such as UniProtKB/Swiss-Prot
• Specialised protein sequence databases such as GOA (Gene Ontology Annotation)
• Specialised protein databases such as ENZYME
• Secondary protein databases such as InterPro
• Structure databases such as PDB
Integr8 - the Integr8 web portal provides easy access to integrated information about deciphered genomes and their corresponding proteomes.
ENZYME - this database is an annotated extension of the Enzyme Commission's publication, linked to UniProtKB/Swiss-Prot. There are also databases of enzyme properties such as BRENDA, the Ligand Chemical Database for Enzyme Reactions (LIGAND), and the database of 'Enzymes and Metabolic Pathways' (EMP). LIGAND and EMP are searchable via SRS at the EBI. LIGAND is linked to the metabolic pathways in KEGG.
2-dimensional gel electrophoresis data - a database is available from ExPASy and the Danish Centre for Human Genome Research (DCHGR).
Mass spectrometry protein data - a useful resource, which includes protein cleavage products, is maintained at Rockefeller University.

Examples of secondary protein databases include:
PROSITE - the special value of this database is its extensive documentation on many protein families, as defined by sequence domains or motifs. PROSITE contains biologically significant sites and patterns, formulated in such a way that appropriate computational tools can rapidly and reliably identify the protein family to which a new sequence belongs.
The profile structure used in PROSITE is similar to, but slightly more general than, the one introduced by Gribskov and co-workers (Gribskov et al., 1987). Generalised profiles are remarkably similar to the specific type of Hidden Markov Models (HMMs) used in Pfam.
PRINTS - a different approach to pattern recognition, termed "fingerprinting", is used by this database. Within a sequence alignment, it is usual to find not one but several motifs that characterize the aligned family. Diagnostically, it makes sense to use many, or all, of the conserved regions to build a family signature. In a database search, there is then a greater chance of identifying a distant relative, whether or not all parts of the signature are matched. The ability to tolerate mismatches, both at the level of residues within individual motifs and at the level of motifs within the fingerprint as a whole, renders fingerprinting a powerful diagnostic technique.
Pfam - another important secondary protein database. The methodology used by Pfam to create protein family or domain signatures is Hidden Markov Models (HMMs). HMMs are closely related to profiles but are based on probability theory. This allows a direct statistical approach to identifying and scoring matches, and also to combining information from a multiple alignment with prior knowledge. One feature that distinguishes HMMs and profiles from regular expressions and fingerprints is that the former allow the full extent of a domain to be identified in a sequence. They are thus particularly useful when analyzing multi-domain proteins. The biggest drawback of Pfam is its lack of biological information (annotation) on the protein families.
BLOCKS - blocks are multiply aligned ungapped segments corresponding to the most highly conserved regions of proteins.
The blocks for the Blocks Database are made automatically by looking for the most highly conserved regions in groups of proteins documented in InterPro.
SBASE - a protein domain sequence library that contains annotated structural, functional, ligand-binding and topogenic segments of proteins, cross-referenced to all major sequence databases and sequence pattern collections.

Software generally takes the name of the coded algorithm (next slide).
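The PROSITE entry in these notes says that patterns are formulated so that computational tools can match them against a new sequence. As a rough illustration only, the Python sketch below translates a PROSITE-style pattern into a regular expression and lists the matching sites. The function names `prosite_to_regex` and `find_sites` and the toy sequence are assumptions made for this example, not part of PROSITE's own software; the pattern N-{P}-[ST]-{P} is the N-glycosylation site pattern.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into a Python regular expression.

    Handles the common syntax only: '-' separators, 'x' for any residue,
    [ABC] allowed residues, {ABC} forbidden residues, (n) or (n,m) repeats,
    and the '<' / '>' sequence-end anchors.
    """
    parts = []
    for element in pattern.rstrip(".").split("-"):
        anchor_start = element.startswith("<")
        anchor_end = element.endswith(">")
        element = element.strip("<>")
        # Split off an optional repeat count such as (2) or (2,4).
        m = re.fullmatch(r"([^()]+)(?:\((\d+(?:,\d+)?)\))?", element)
        if m is None:
            raise ValueError(f"cannot parse pattern element: {element!r}")
        core, repeat = m.groups()
        if core == "x":
            regex = "."                      # any residue
        elif core.startswith("{") and core.endswith("}"):
            regex = "[^" + core[1:-1] + "]"  # any residue except these
        else:
            regex = core                     # single residue or [ABC] set
        if repeat:
            regex += "{" + repeat + "}"
        parts.append(("^" if anchor_start else "") + regex + ("$" if anchor_end else ""))
    return "".join(parts)


def find_sites(pattern: str, sequence: str):
    """Return (1-based position, matched text) for every site, overlaps included."""
    # A lookahead lets overlapping sites be reported separately.
    regex = re.compile("(?=(" + prosite_to_regex(pattern) + "))")
    return [(m.start() + 1, m.group(1)) for m in regex.finditer(sequence)]


# Toy sequence invented for this example (not a real protein).
toy_sequence = "MKNGSAANPTQNLTW"
sites = find_sites("N-{P}-[ST]-{P}.", toy_sequence)
print(sites)
```

A plain regular expression like this only says whether a site matches; it does not score partial matches, which is the gap that the profiles and HMMs described in the notes address.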
To be stored, data need a standard representation. For sequences we have: …………………… And for building alignment tools, we have local match ………..
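The notes do not say which local-match method was meant. As an assumption, the sketch below uses the Smith-Waterman recurrence, the classic local alignment method, and returns only the best local score. The scoring values (match 2, mismatch -1, gap -2) and the function name are illustrative choices, not taken from the slides.

```python
def local_alignment_score(seq_a: str, seq_b: str,
                          match: int = 2, mismatch: int = -1, gap: int = -2) -> int:
    """Return the best Smith-Waterman local alignment score with a linear gap penalty."""
    columns = len(seq_b) + 1
    previous_row = [0] * columns  # scores for the previous residue of seq_a
    best_score = 0
    for i in range(1, len(seq_a) + 1):
        current_row = [0] * columns
        for j in range(1, columns):
            substitution = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            diagonal = previous_row[j - 1] + substitution  # align the two residues
            up = previous_row[j] + gap                     # gap in seq_b
            left = current_row[j - 1] + gap                # gap in seq_a
            # A local alignment may restart anywhere, so scores never drop below 0.
            current_row[j] = max(0, diagonal, up, left)
            best_score = max(best_score, current_row[j])
        previous_row = current_row
    return best_score


# Made-up sequences sharing the local stretch "ACG".
print(local_alignment_score("TTACGTTT", "GGACGGG"))
```

Keeping only two rows is enough when the score alone is wanted. Recovering the aligned residues would require storing the full matrix and tracing back from the best cell.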
Software generally takes the name of the implemented algorithm. Hundreds of algorithms are available. For details on the algorithm behind each tool, links are usually available in the website's list of tools. If this fails, it is enough to type the name of the tool into a search engine (Google) to find the algorithm it refers to. For example, biological sequences and data can be analyzed in many ways with bioinformatics tools: they can be read, assembled, compared, mapped, predicted, designed, and modeled.
Molecular online tools rest on biological databases and consist of many software programs (each program implements a data manipulation algorithm or process).
Databases contain bibliography (since 1960), taxonomy, nucleotide, genomic, protein, microarray, and metabolic pathway data concerning sequences, RNA, organisms, …
Software generally takes the name of the coded algorithm (next slide).

Bibliographic DB
MEDLINE is accessible through EBI's SRS. PubMed is accessible through NCBI's Entrez. EMBASE is a commercial product for medical literature. BIOSIS, the inheritor of the old Biological Abstracts, covers a broad biological field; Zoological Record indexes the zoological literature. CAB International maintains abstract databases in the fields of agriculture and parasitic diseases. AGRICOLA is for the agricultural field what MEDLINE is for the medical field. The bibliographical databases, with the exception of MEDLINE/PubMed, are only available through commercial database vendors.

Taxonomy DB
NEWT
The Tree of Life project
Species 2000
IOPI: International Organization for Plant Information
ITIS: Integrated Taxonomic Information System

Nucleotide DB
In Europe, the vast majority of the nucleotide sequence data produced is collected, organized, and distributed by the EMBL Nucleotide Sequence Database located at the EBI in Cambridge, UK, an outstation of the European Molecular Biology Laboratory (EMBL), which is headquartered in Heidelberg, Germany. The nucleotide sequence databases are data repositories, accepting nucleic acid sequence data from the scientific community and making it freely available.
The databases strive for completeness, with the aim of recording every publicly known nucleic acid sequence. These databases are heterogeneous: they vary with respect to the source of the material (e.g. genomic versus cDNA), the intended quality (e.g. finished versus single-pass sequences), the extent of sequence annotation, and the intended completeness of the sequence relative to its biological target (e.g. complete versus partial coverage of a gene or a genome). The nucleotide databases are distributed free of charge over the internet.
DDBJ, GenBank and EMBL-Bank exchange new and updated data on a daily basis to achieve optimal synchronization. The result is that they contain exactly the same information, except for sequences that have been added in the last 24 hours.

Genomic DB
Genomic databases vary greatly in form and content. For organisms of major interest to geneticists, there is a long history of conventionally published catalogues of genes or mutations. In the past few years, most of these have been made available in electronic form and a variety of new databases have been developed. These databases vary greatly in the classes of data captured and in how the data is stored.
Genomes Server - this server gives access to hundreds of complete genome sequences, including those from archaea, bacteria, eukaryota, organelles, phages, plasmids, viroids, and viruses.
Proteome Analysis - the Proteome Analysis database has been set up to provide comprehensive statistical and comparative analyses of the predicted proteomes of fully sequenced organisms.
Ensembl - this is a joint project between the EBI and the Wellcome Trust Sanger Institute that aims at developing a system that maintains automatic annotation of large eukaryotic genomes. Ensembl presents up-to-date sequence data and the best possible automatic annotation for metazoan genomes. Available now are human, mouse, rat, fugu, zebrafish, mosquito, Drosophila, C. elegans, and C. briggsae.
Karyn's Genomes - contains general information about organisms whose genomes are completely sequenced. The main aim of the database is to provide a short and concise explanation of why it is important to obtain these organisms' genomic sequences.
WormBase - this is a repository of mapping, sequencing, and phenotypic information for C. elegans (and some other nematodes).
FlyBase - the database for Drosophila melanogaster is one of the best-curated genetic databases.
MGD - the 'Mouse Genome Database' is one of the most comprehensively curated genetic databases.
RGD - the 'Rat Genome Database' curates and integrates rat genetic and genomic data and provides access to this data to support research using the rat as a genetic model for the study of human disease.
MIPS - the MIPS yeast database is an important resource for information on the yeast genome and its products.
SGD - the 'Saccharomyces Genome Database' is another major yeast database.
SPGP - the 'S. pombe Genome Project', based at the Sanger Institute, is the database for genetic data on the fungus Schizosaccharomyces pombe.
AceDB - this is the database for genetic and molecular data concerning Caenorhabditis elegans. The database management system written for AceDB by R. Durbin and J. Thierry-Mieg has proved very popular and has been used in many other species-specific databases. AceDB is now the name of this database management system, resulting in some confusion with the C. elegans database. The entire database can be downloaded from the Sanger Institute.
HIV-SD - the 'HIV Sequence Database' collects, curates and annotates HIV and SIV sequence data and provides various tools for analyzing this data.

Protein DB
The protein databases are the most comprehensive source of information on proteins.
It is necessary to distinguish between universal databases covering proteins from all species and specialized data collections storing information about specific families or groups of proteins, or about the proteins of a specific organism. Two categories of universal protein databases can be discerned: simple archives of sequence data, and annotated databases where additional information has been added to the sequence record. In the following you will find a short description of:
• Primary protein sequence databases such as UniProtKB/Swiss-Prot
• Specialised protein sequence databases such as GOA (Gene Ontology Annotation)
• Specialised protein databases such as ENZYME
• Secondary protein databases such as InterPro
• Structure databases such as PDB
Integr8 - the Integr8 web portal provides easy access to integrated information about deciphered genomes and their corresponding proteomes.
ENZYME - this database is an annotated extension of the Enzyme Commission's publication, linked to UniProtKB/Swiss-Prot. There are also databases of enzyme properties such as BRENDA, the Ligand Chemical Database for Enzyme Reactions (LIGAND), and the database of 'Enzymes and Metabolic Pathways' (EMP). LIGAND and EMP are searchable via SRS at the EBI. LIGAND is linked to the metabolic pathways in KEGG.
2-dimensional gel electrophoresis data - a database is available from Expasy and the Danish Centre for Human Genome Research (DCHGR).
Mass spectrometry protein data - a useful resource, which includes protein cleavage products, is maintained at Rockefeller University.
Examples of secondary protein databases include:
PROSITE - the special value of this database is the extensive documentation on many protein families, as defined by sequence domains or motifs. PROSITE contains biologically significant sites and patterns formulated in such a way that, with appropriate computational tools, one can rapidly and reliably identify to which family of proteins a new sequence belongs.
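These notes mention that PubMed is reached through NCBI's Entrez. As a minimal sketch of what a programmatic query might look like, the code below builds an ESearch request URL for the NCBI E-utilities without sending it. The search term reuses the workshop theme (FXN gene and pancreatic cancer). The function name build_esearch_url and the retmax value are illustrative assumptions.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities web service.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def build_esearch_url(database: str, term: str, max_results: int = 20) -> str:
    """Build an ESearch URL that would list record IDs matching a query term."""
    query = urlencode({"db": database, "term": term, "retmax": max_results})
    return f"{EUTILS_BASE}/esearch.fcgi?{query}"


# Only the URL is constructed here; no network request is made.
pubmed_url = build_esearch_url("pubmed", "FXN AND pancreatic cancer")
print(pubmed_url)
```

The same helper can point at another Entrez database by changing the db argument, for example "nucleotide" or "protein". NCBI publishes usage guidelines for automated access that should be checked before sending requests.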
This is the lab template. The context is a biological one, based on a real biological problem and a given hypothesis. I don't use the term "computer science"; it is a strong word. When you read this template, you have a different view than a computer scientist. You want to understand the process used to build the tools:
• The architecture of the system
• The algorithm implementation
• The quality of the resulting data
• And so on


Editor's Notes