Databases_CSS2.pptx

Databases
• A data structure that stores organized
information. Most databases contain multiple
tables, which may each include several different
fields.
• A database-management system (DBMS) is a
computer-software application that interacts with
end-users, other applications, and the database
itself to capture and analyze data. A general-
purpose DBMS allows the definition, creation,
querying, update, and administration of
databases.

Biological databases
• Libraries of life sciences information, collected from
scientific experiments, published literature, high-
throughput experiment technology, and computational
analysis. They contain information from research areas
including genomics, proteomics, metabolomics,
microarray gene expression, and phylogenetics.
Information contained in biological databases includes
gene function, structure, localization (both cellular and
chromosomal), clinical effects of mutations as well as
similarities of biological sequences and structures.
• Biological databases can be broadly classified into
sequence, structure and functional databases.

Biological databases
• Contain files or tables, each holding
numerous records and fields
• Simplest form (flat file): either a single large
text file or a collection of text files
• Commonest type (relational): stores the data in a
number of tables (with records and fields).
Tables are linked to one another by a shared
field called a key

Flat file
Relational database model
The operators are written in query-specific languages based on relational algebra
Structured Query Language (SQL) is commonly used
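
A minimal sketch of the relational idea, using Python's built-in sqlite3 module. The table names, column names, and rows are invented for illustration and do not come from any real biological database. Two tables share the key field gene_id, and an SQL join follows that key.

```python
import sqlite3

# Toy schema: all table names, column names, and rows are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE gene (gene_id INTEGER PRIMARY KEY, symbol TEXT)")
conn.execute(
    "CREATE TABLE protein ("
    "protein_id INTEGER PRIMARY KEY, "
    "gene_id INTEGER REFERENCES gene(gene_id), "  # the shared key field
    "name TEXT)"
)
conn.executemany("INSERT INTO gene VALUES (?, ?)", [(1, "geneA"), (2, "geneB")])
conn.executemany(
    "INSERT INTO protein VALUES (?, ?, ?)",
    [(10, 1, "protein A1"), (11, 1, "protein A2"), (12, 2, "protein B1")],
)

# The join links records in the two tables through the shared key.
query = (
    "SELECT gene.symbol, protein.name "
    "FROM gene JOIN protein ON protein.gene_id = gene.gene_id "
    "WHERE gene.symbol = ? ORDER BY protein.name"
)
rows = conn.execute(query, ("geneA",)).fetchall()
for symbol, name in rows:
    print(symbol, name)
conn.close()
```

A flat file would need every protein line to repeat the gene information; the key lets each fact be stored once.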

• XML (eXtensible Markup Language) is now a general tool for
storage of data and information. XHTML is an XML-based language; HTML is closely related but is not itself a subset of XML.
• The key feature is the use of identifiers called tags
• <title> Understanding Bioinformatics </title>
• A <publisher> tag can be defined and used to identify book publishers
• Extracting data from an XML file is similar to querying a database.
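
The tag-based extraction described above can be sketched with Python's standard xml.etree.ElementTree module. The <library> and <book> wrapper tags, the publisher names, and the second book are assumptions made up for this example; only the <title> and <publisher> tags come from the slide.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: wrapper tags and publisher names are invented.
xml_text = """
<library>
  <book>
    <title>Understanding Bioinformatics</title>
    <publisher>Example Press</publisher>
  </book>
  <book>
    <title>Another Book</title>
    <publisher>Other House</publisher>
  </book>
</library>
""".strip()

root = ET.fromstring(xml_text)

# "Query" the document: titles of all books from one publisher.
titles = [
    book.findtext("title")
    for book in root.findall("book")
    if book.findtext("publisher") == "Example Press"
]
print(titles)
```

The list comprehension plays the role of a WHERE clause: tags identify the fields, and the filter selects the records.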

Databases
[Diagram: layers of a database, with examples at each level]
• Information system: the UBC library, Google, Entrez, SRS
• Query system: a list you look at, a catalogue, indexed files, SQL, grep
• Storage system: boxes, bookshelves, Oracle, MySQL, PC binary files, Unix text files
• Data: GenBank flat file, PDB file, interaction record, title of a book, book

Bioinformatics Information Space
July 17, 1999
• Nucleotide sequences: 4,456,822
• Protein sequences: 706,862
• 3D structures: 9,780
• Human Unigene Clusters: 75,832
• Maps and Complete Genomes: 10,870
• Different species nodes: 52,889
• dbSNP: 6,377
• RefGenes: 515
• Human contigs > 250 kb: 341 (4.9 MB)
• PubMed records: 10,372,886
• OMIM records: 10,695

The challenge of the information space:
Feb 10, 2004
• Nucleotide records: 36,653,899
• Protein sequences: 4,436,362
• 3D structures: 19,640
• Interactions & complexes: 52,385
• Human Unigene Clusters: 118,517
• Maps and Complete Genomes: 6,948
• Different taxonomy nodes: 283,121
• Human dbSNP: 13,179,601
• Human RefSeq records: 22,079
• bp in human contigs > 5,000 kb (116): 2,487,920,000
• PubMed records: 12,570,540
• OMIM records: 15,138

Databases
• Primary (archival)
– GenBank/EMBL/DDBJ
– UniProt
– PDB
– Medline (PubMed)
– BIND
• Secondary (curated)
– RefSeq
– Taxon
– UniProt
– OMIM
– SGD

http://nar.oupjournals.org/content/vol31/issue1/

Tools of trade
for the “armchair scientist”
• Databases
– PubMed and other NCBI databases
– Biochemical databases
– Protein domain databases
– Structural databases
– Genome comparison databases
• Tools
– CDD / COGs
– VAST / FSSP

Distribution of database types as classified on the NAR
database web site

Types of databases
• Archival or Primary Data
– Text: PubMed
– DNA Sequence: GenBank
– Protein Sequence: Entrez Proteins, TREMBL
– Protein Structures: PDB
• Curated or Processed Data
– DNA Sequences: RefSeq, LocusLink, OMIM
– Protein Sequences: SWISS-PROT, PIR
– Protein Structures: SCOP, CATH, MMDB
– Genomes: Entrez Genomes, COGs
Nucleic Acids Research: Database Issue each January 1; articles on
~100 different databases

The National Center for Biotechnology Information (NCBI)
• Created as a part of the National Library of Medicine,
National Institutes of Health in 1988
– Establish public databases
– Research in computational biology
– Develop software tools for sequence analysis
– Disseminate biomedical information
• Tools: BLAST (1990), Entrez (1992)
• GenBank (1992)
• Free MEDLINE (PubMed, 1997)
• Other databases: dbEST, dbGSS, dbSTS,
MMDB, OMIM, UniGene, Taxonomy,
GeneMap, SAGE, LocusLink, RefSeq

What is GenBank?
• Archival nucleotide sequence database
• Sample slogans:
“Easy deposits, unlimited withdrawals, high
interest”, “All bases covered”, “Billions and billions
served”
• Data are shared nightly among three
collaborating databases:
• GenBank at NCBI - Bethesda, Maryland, USA
• DNA Database of Japan (DDBJ) at NIG -
Mishima, Japan
• European Molecular Biology Laboratory
Database (EMBL) at EBI - Hinxton, UK

Some guiding principles of working
with GenBank
• GenBank is a nucleotide-centric
view of the information space
• GenBank is a repository of all
publicly available sequences
• In GenBank, records are grouped
for various reasons
• Data in GenBank is only as good
as what you put in

NCBI databases and their links
[Diagram: NCBI databases and the methods used to link related records]
• Databases: Article Abstracts (Medline), Nucleotide Sequences, Protein Sequences, 3-D Structure (MMDB), Genomes, Taxonomy
• Link methods: Word Weight, BLAST, VAST, Phylogeny

PDB
• Protein Data Bank
– Protein and nucleic acid (NA) 3D structures
– Sequence present
– YAFFF

HEADER LEUCINE ZIPPER 15-JUL-93 1DGC  1DGC 2
COMPND GCN4 LEUCINE ZIPPER COMPLEXED WITH SPECIFIC  1DGC 3
COMPND 2 ATF/CREB SITE DNA  1DGC 4
SOURCE GCN4: YEAST (SACCHAROMYCES CEREVISIAE); DNA: SYNTHETIC  1DGC 5
AUTHOR T.J.RICHMOND  1DGC 6
REVDAT 1 22-JUN-94 1DGC 0  1DGC 7
JRNL AUTH P.KONIG,T.J.RICHMOND  1DGC 8
JRNL TITL THE X-RAY STRUCTURE OF THE GCN4-BZIP BOUND TO  1DGC 9
JRNL TITL 2 ATF/CREB SITE DNA SHOWS THE COMPLEX DEPENDS ON DNA  1DGC 10
JRNL TITL 3 FLEXIBILITY  1DGC 11
JRNL REF J.MOL.BIOL. V. 233 139 1993  1DGC 12
JRNL REFN ASTM JMOBAK UK ISSN 0022-2836 0070  1DGC 13
REMARK 1  1DGC 14
REMARK 2  1DGC 15
REMARK 2 RESOLUTION. 3.0 ANGSTROMS.  1DGC 16
REMARK 3  1DGC 17
REMARK 3 REFINEMENT.  1DGC 18
REMARK 3 PROGRAM X-PLOR  1DGC 19
REMARK 3 AUTHORS BRUNGER  1DGC 20
REMARK 3 R VALUE 0.216  1DGC 21
REMARK 3 RMSD BOND DISTANCES 0.020 ANGSTROMS  1DGC 22
REMARK 3 RMSD BOND ANGLES 3.86 DEGREES  1DGC 23

PDB
• HEADER
• COMPND
• SOURCE
• AUTHOR
• DATE
• JRNL
• REMARK
• SEQRES
• ATOM COORDINATES
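
Because each line of a PDB file begins with a record name in its first six columns, the file can be read with very little code. The sketch below groups lines by record type. The helper name group_pdb_records is hypothetical, the sample lines are taken from the 1DGC header shown earlier with the trailing identifier column omitted, and real files need more careful column handling than this.

```python
from collections import defaultdict

def group_pdb_records(lines):
    """Group PDB lines by the record name in columns 1-6 (hypothetical helper)."""
    records = defaultdict(list)
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        record_type = line[:6].strip()
        content = line[6:].strip()
        records[record_type].append(content)
    return dict(records)

# Sample lines adapted from the 1DGC header (trailing ID column omitted).
sample = [
    "HEADER    LEUCINE ZIPPER                          15-JUL-93   1DGC",
    "COMPND    GCN4 LEUCINE ZIPPER COMPLEXED WITH SPECIFIC",
    "COMPND   2 ATF/CREB SITE DNA",
    "AUTHOR    T.J.RICHMOND",
    "REMARK   2 RESOLUTION. 3.0 ANGSTROMS.",
]

records = group_pdb_records(sample)
for record_type, contents in records.items():
    print(record_type, contents)
```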

Accessing
information on
molecular sequences

GenBank Record
[Figure: annotated GenBank record; the rest of the protein and nucleotide sequences deleted for brevity]
Labelled fields: Locus Name, Accession Number, gi Number, Medline ID, GenPept ID, Protein Sequence, Nucleotide Sequence

LOCUS, Accession, NID and protein_id
LOCUS: Unique string of 10 letters and numbers in the database. Not
maintained amongst databases, and is therefore a poor sequence
identifier.
ACCESSION: A unique identifier for the record and a citable entity; does not
change when the record is updated. A good record identifier, ideal for citation
in publications.
VERSION: New system in which the accession and version together play the same
role as the accession and gi number.
Nucleotide gi: GenInfo identifier (gi), a unique integer which will change
every time the sequence changes.
PID: Protein Identifier: g, e or d prefix to gi number. Can have one or two
on one CDS.
Protein gi: GenInfo identifier (gi), a unique integer which will change every
time the sequence changes.
Protein_id: Identifier which has the same structure and function as the
nucleotide Accession.version numbers, but a slightly different format.
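
A small sketch of how an Accession.version identifier might be handled in code, assuming the usual "ACCESSION.N" form in which N is an integer. The function names and the example identifiers are invented for illustration; they are not real GenBank records.

```python
def split_accession_version(identifier):
    """Split 'ACCESSION.N' into (accession, version); hypothetical helper."""
    accession, separator, version = identifier.partition(".")
    if not accession or not separator or not version.isdigit():
        raise ValueError(f"not an accession.version identifier: {identifier!r}")
    return accession, int(version)

def same_record(id_a, id_b):
    """True when two identifiers name the same record, whatever the version."""
    return split_accession_version(id_a)[0] == split_accession_version(id_b)[0]

# Made-up identifiers: the accession stays fixed, the version increases
# each time the sequence itself changes.
first = split_accession_version("AB000001.1")
updated = split_accession_version("AB000001.2")
print(first, updated, same_record("AB000001.1", "AB000001.2"))
```

Citing the bare accession points at the record; citing accession.version pins down one exact sequence.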

Protein sequence motif
is a descriptor of a protein family
• Glutamine amidotransferase class I
[PAS]-[LIVMFYT]-[LIVMFY]-G-[LIVMFY]-C-
[LIVMFYN]-G-x-[QEH]-x-[LIVMFA]
[C is the active site residue]
• Glutamine amidotransferase class II
<x(0,11)-C-[GS]-[IV]-[LIVMFYW]-[AG]
[C is the active site residue]
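
Patterns written this way can be translated into regular expressions and searched against sequences. The sketch below handles only the syntax used on this slide plus a few closely related forms (x, bracketed sets, braces for exclusions, repeat counts, and the < and > anchors). The function name prosite_to_regex and the test sequences are invented for illustration; the sequences are not real enzymes.

```python
import re

def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regex (simplified sketch)."""
    regex_parts = []
    for element in pattern.split("-"):
        element = element.strip()
        prefix, suffix = "", ""
        if element.startswith("<"):   # anchored at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):     # anchored at the C-terminus
            suffix, element = "$", element[:-1]
        match = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        if match is None:
            raise ValueError("cannot parse pattern element: " + repr(element))
        core, low, high = match.groups()
        if core == "x":                                   # any residue
            core = "."
        elif core.startswith("{") and core.endswith("}"):  # excluded residues
            core = "[^" + core[1:-1] + "]"
        repeat = ""
        if low is not None:                               # x(2) or x(0,11)
            repeat = "{" + low + ("," + high if high else "") + "}"
        regex_parts.append(prefix + core + repeat + suffix)
    return "".join(regex_parts)

CLASS_I = "[PAS]-[LIVMFYT]-[LIVMFY]-G-[LIVMFY]-C-[LIVMFYN]-G-x-[QEH]-x-[LIVMFA]"
CLASS_II = "<x(0,11)-C-[GS]-[IV]-[LIVMFYW]-[AG]"

class_ii_regex = prosite_to_regex(CLASS_II)
print(class_ii_regex)

# Made-up sequences, used only to exercise the patterns.
hit_i = re.search(prosite_to_regex(CLASS_I), "MKPLLGICLGAQLLA")
hit_ii = re.search(class_ii_regex, "MCGIVAGKK")
print(hit_i is not None, hit_ii is not None)
```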

Principles of structural alignment
• Dali: http://www.ebi.ac.uk/dali/
Looks for minimal RMSD between Cα atoms.
Calculates Cα-Cα distance matrices, then
identifies the longest alignable segments
• VAST (Vector Alignment Search Tool)
http://www.ncbi.nlm.nih.gov/Structure/
Looks for pairs of secondary structure
elements (α-helices, β-strands) that have
similar orientation and connectivity
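
The two quantities named above, the Cα-Cα distance matrix and the RMSD, are simple to compute. The sketch below is not the Dali or VAST algorithm: it only shows the building blocks, uses invented coordinates, and computes RMSD over already paired atoms without any superposition step.

```python
import math

def distance_matrix(coords):
    """All-against-all distances between points within one structure."""
    return [[math.dist(a, b) for b in coords] for a in coords]

def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between paired points (no superposition)."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    squared = sum(math.dist(a, b) ** 2 for a, b in zip(coords_a, coords_b))
    return math.sqrt(squared / len(coords_a))

# Invented coordinates (in angstroms) standing in for three C-alpha atoms.
structure_a = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0), (3.0, 4.0, 12.0)]
# The same points shifted by 1 angstrom along x.
structure_b = [(x + 1.0, y, z) for x, y, z in structure_a]

matrix_a = distance_matrix(structure_a)
matrix_b = distance_matrix(structure_b)
print(matrix_a[0][1], matrix_a[0][2], rmsd(structure_a, structure_b))
```

The distance matrices of the two toy structures are identical even though the coordinates differ, which is why internal distances are a convenient basis for comparing structures before they are superimposed.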

Dali alignment of Tyr phosphatase

Structure Summary
Cn3D viewer
VAST neighbors
BLAST neighbors

Cn3D: Displaying Structures
Chloroquine

Use of structural alignments
Chloroquine
NADH

UniProt
• New protein sequence database resulting from the
merger of SWISS-PROT and PIR. It will be the
annotated, curated protein sequence database.
• Data in UniProt are primarily derived from coding
sequence annotations in EMBL (GenBank/DDBJ) nucleic
acid sequence data.
• UniProt is a flat-file database, just like EMBL and
GenBank
• The flat-file format is SwissProt-like, or EMBL-like
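
Swiss-Prot-like and EMBL-like flat files start each line with a two-letter line code (for example ID, AC, DE), and entries end with a "//" line. The sketch below reads that layout into dictionaries. The function name parse_flat_file and both sample entries are invented for illustration; they are not real database records, and real entries contain many more line types.

```python
def parse_flat_file(text):
    """Parse two-letter-code flat-file text into one dict per entry (sketch)."""
    entries, current = [], {}
    for line in text.splitlines():
        if line.startswith("//"):          # end-of-entry marker
            if current:
                entries.append(current)
                current = {}
            continue
        code = line[:2]
        if not code.strip():
            continue                        # blank or sequence-data line
        current.setdefault(code, []).append(line[5:].strip())
    if current:
        entries.append(current)
    return entries

# Made-up entries in a Swiss-Prot-like layout.
sample_text = """ID   EXAMPLE1_TOY
AC   X00001;
DE   Hypothetical example protein one.
//
ID   EXAMPLE2_TOY
AC   X00002;
DE   Hypothetical example protein two.
//
"""

entries = parse_flat_file(sample_text)
for entry in entries:
    print(entry["AC"], entry["DE"])
```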

Swiss-Prot
• SWISS-PROT incorporates:
– Function of the protein
– Post-translational modification
– Domains and sites
– Secondary structure
– Quaternary structure
– Similarities to other proteins
– Diseases associated with deficiencies in the
protein
– Sequence conflicts, variants, etc.
