1. Databases
• A data structure that stores organized
information. Most databases contain multiple
tables, which may each include several different
fields.
• A database-management system (DBMS) is a
computer-software application that interacts with
end-users, other applications, and the database
itself to capture and analyze data. A general-
purpose DBMS allows the definition, creation,
querying, update, and administration of
databases.
2. Biological databases
• Libraries of life sciences information, collected from
scientific experiments, published literature, high-
throughput experiment technology, and computational
analysis. They contain information from research areas
including genomics, proteomics, metabolomics,
microarray gene expression, and phylogenetics.
Information contained in biological databases includes
gene function, structure, localization (both cellular and
chromosomal), clinical effects of mutations as well as
similarities of biological sequences and structures.
• Biological databases can be broadly classified into
sequence, structure and functional databases.
3. Biological databases
• Contain files or tables, each containing
numerous records and fields
• Flat file: the simplest form, either a single large text file
or a collection of text files
• Relational: the commonest type, stores the data within a
number of tables (with records and fields).
The tables are linked to one another by a shared
field called a key
6. Flat file
Relational database model
The operators are written in query-specific languages based on relational algebra
Structured Query Language (SQL) is commonly used
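To make the relational model concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table names, column names, and rows are hypothetical and chosen only for illustration; the point is that two tables are linked by a shared key field and queried with SQL.

```python
import sqlite3

# In-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Two hypothetical tables linked by the shared key field gene_id.
cur.execute("CREATE TABLE gene (gene_id INTEGER PRIMARY KEY, symbol TEXT)")
cur.execute(
    "CREATE TABLE protein ("
    "protein_id INTEGER PRIMARY KEY, "
    "gene_id INTEGER REFERENCES gene(gene_id), "
    "name TEXT)"
)

# Made-up rows, for illustration only.
cur.executemany("INSERT INTO gene VALUES (?, ?)", [(1, "geneA"), (2, "geneB")])
cur.executemany(
    "INSERT INTO protein VALUES (?, ?, ?)",
    [(10, 1, "protein A1"), (11, 1, "protein A2"), (12, 2, "protein B1")],
)

# A join follows the key from one table to the other.
rows = cur.execute(
    "SELECT gene.symbol, protein.name "
    "FROM gene JOIN protein ON gene.gene_id = protein.gene_id "
    "WHERE gene.symbol = ? ORDER BY protein.name",
    ("geneA",),
).fetchall()
print(rows)
conn.close()
```

In a flat file the same question would require scanning and parsing the whole text; here the key field lets the query engine do the linking.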
7. • XML (eXtensible Markup Language) is now a general tool for
storage of data and information. XHTML is an XML-based reformulation of HTML; HTML itself derives from SGML rather than XML.
• The key feature is the use of identifiers called tags
• <title>Understanding Bioinformatics</title>
• A <publisher> tag can be defined and used to identify book publishers
• Extraction from an XML file is similar to database querying.
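As a rough illustration of how extraction from XML resembles querying, the sketch below uses Python's standard xml.etree.ElementTree module. The <library> and <book> wrapper elements, the second title, and both publisher names are placeholders invented for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML document; wrapper elements and publishers are placeholders.
xml_text = """
<library>
  <book>
    <title>Understanding Bioinformatics</title>
    <publisher>Example Press</publisher>
  </book>
  <book>
    <title>A Second Hypothetical Title</title>
    <publisher>Another Press</publisher>
  </book>
</library>
"""

root = ET.fromstring(xml_text.strip())

# "Query" 1: every title in the document.
titles = [book.findtext("title") for book in root.findall("book")]

# "Query" 2: titles from one publisher, similar to a WHERE clause.
from_example_press = [
    book.findtext("title")
    for book in root.findall("book")
    if book.findtext("publisher") == "Example Press"
]

print(titles)
print(from_example_press)
```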
12. Bioinformatics Information Space
July 17, 1999
• Nucleotide sequences: 4,456,822
• Protein sequences: 706,862
• 3D structures: 9,780
• Human Unigene Clusters: 75,832
• Maps and Complete Genomes: 10,870
• Different species nodes: 52,889
• dbSNP: 6,377
• RefGenes: 515
• Human contigs > 250 kb: 341
(4.9MB)
• PubMed records: 10,372,886
• OMIM records: 10,695
13. The challenge of the information space:
Feb 10 2004
• Nucleotide records: 36,653,899
• Protein sequences: 4,436,362
• 3D structures: 19,640
• Interactions & complexes: 52,385
• Human Unigene Clusters: 118,517
• Maps and Complete Genomes: 6,948
• Different taxonomy nodes: 283,121
• Human dbSNP: 13,179,601
• Human RefSeq records: 22,079
• bp in human contigs > 5,000 kb (116): 2,487,920,000
• PubMed records: 12,570,540
• OMIM records: 15,138
16. Tools of trade
for the “armchair scientist”
• Databases
– PubMed and other NCBI databases
– Biochemical databases
– Protein domain databases
– Structural databases
– Genome comparison databases
• Tools
– CDD / COGs
– VAST / FSSP
17. Distribution of the types of databases as classified at the NAR
database web site
19. Types of databases
• Archival or Primary Data
– Text: PubMed
– DNA Sequence: GenBank
– Protein Sequence: Entrez Proteins, TREMBL
– Protein Structures: PDB
• Curated or Processed Data
– DNA sequences : RefSeq, LocusLink, OMIM
– Protein Sequences: SWISS-PROT, PIR
– Protein Structures : SCOP, CATH, MMDB
– Genomes: Entrez Genomes, COGs
Nucleic Acids Research publishes a Database Issue each January 1, with articles on
~100 different databases
20. The National Center for Biotechnology Information (NCBI)
• Created as a part of the National Library of Medicine,
National Institutes of Health in 1988
– Establish public databases
– Research in computational biology
– Develop software tools for sequence analysis
– Disseminate biomedical information
• Tools: BLAST (1990), Entrez (1992)
• GenBank (1992)
• Free MEDLINE (PubMed, 1997)
• Other databases: dbEST, dbGSS, dbSTS,
MMDB, OMIM, UniGene, Taxonomy,
GeneMap, SAGE, LocusLink, RefSeq
21. What is GenBank?
• Archival nucleotide sequence database
• Sample slogans:
“Easy deposits, unlimited withdrawals, high
interest”, “All bases covered”, “Billions and billions
served”
• Data are shared nightly among three
collaborating databases:
• GenBank at NCBI - Bethesda, Maryland, USA
• DNA Database of Japan (DDBJ) at NIG -
Mishima, Japan
• European Molecular Biology Laboratory
Database (EMBL) at EBI - Hinxton, UK
22. Some guiding principles of working
with GenBank
• GenBank is a nucleotide-centric
view of the information space
• GenBank is a repository of all
publicly available sequences
• In GenBank, records are grouped
for various reasons
• Data in GenBank is only as good
as what you put in
23. NCBI databases and their links
[Figure: diagram of linked NCBI databases. Nodes: Article Abstracts (Medline), Nucleotide Sequences, Protein Sequences, 3-D Structure (MMDB), Genomes, Taxonomy. Link labels: Word Weight, BLAST, VAST, Phylogeny.]
27. GenBank Record
[Figure: annotated GenBank record. Labelled parts: Locus Name, Accession Number, gi Number, Medline ID, GenPept ID, Protein Sequence, Nucleotide Sequence. The rest of the protein and nucleotide sequences were deleted for brevity.]
28. LOCUS, Accession, NID and protein_id
LOCUS: Unique string of 10 letters and numbers in the database. Not
maintained amongst databases, and is therefore a poor sequence
identifier.
ACCESSION: A unique identifier for the record and a citable entity; it does not
change when the record is updated. A good record identifier, ideal for citation
in publications.
VERSION: New system in which the accession and version play the same
role as the accession and gi number.
Nucleotide gi: GenInfo identifier (gi), a unique integer that changes
every time the sequence changes.
PID: Protein identifier: a g, e or d prefix to the gi number. There can be one or two
on one CDS.
Protein gi: GenInfo identifier (gi), a unique integer that changes every
time the sequence changes.
Protein_id: Identifier with the same structure and function as the
nucleotide accession.version numbers, but a slightly different format.
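The accession.version idea can be sketched in a few lines of Python. The identifiers below are made up for illustration, and the helper names split_version and same_record are hypothetical, not part of any NCBI tool. The sketch assumes only what the slide states: the accession stays fixed for a record, while the version tracks sequence changes.

```python
def split_version(identifier):
    """Split 'ACCESSION.VERSION' into (accession, version number)."""
    accession, sep, version = identifier.rpartition(".")
    if not sep or not accession or not version.isdigit():
        raise ValueError(f"not an accession.version identifier: {identifier!r}")
    return accession, int(version)


def same_record(id_a, id_b):
    """True if two identifiers share an accession, whatever their versions."""
    return split_version(id_a)[0] == split_version(id_b)[0]


# Made-up identifiers: the same record before and after a sequence update.
old_id, new_id = "AB000001.1", "AB000001.2"

print(split_version(new_id))
print(same_record(old_id, new_id))
```

Citing the bare accession points at the record; citing accession.version pins down one exact sequence.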
33. Protein sequence motif
is a descriptor of a protein family
• Glutamine amidotransferase class I
[PAS]-[LIVMFYT]-[LIVMFY]-G-[LIVMFY]-C-
[LIVMFYN]-G-x-[QEH]-x-[LIVMFA]
[C is the active site residue]
• Glutamine amidotransferase class II
<x(0,11)-C-[GS]-[IV]-[LIVMFYW]-[AG]
[C is the active site residue]
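Patterns written in this notation map fairly directly onto regular expressions. The sketch below is a simplified converter covering only the syntax used on this slide plus {..} exclusions; the function name prosite_to_regex and the two test sequences are invented for illustration and are not real proteins.

```python
import re


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression."""
    regex = pattern.replace(" ", "").rstrip(".")
    regex = regex.replace("-", "")                     # '-' only separates positions
    regex = regex.replace("x", ".")                    # x = any residue
    regex = regex.replace("{", "[^").replace("}", "]") # {..} = excluded residues
    regex = regex.replace("(", "{").replace(")", "}")  # x(0,11) -> .{0,11}
    regex = regex.replace("<", "^").replace(">", "$")  # N- / C-terminal anchors
    return regex


class_1 = "[PAS]-[LIVMFYT]-[LIVMFY]-G-[LIVMFY]-C-[LIVMFYN]-G-x-[QEH]-x-[LIVMFA]"
class_2 = "<x(0,11)-C-[GS]-[IV]-[LIVMFYW]-[AG]"

regex_1 = prosite_to_regex(class_1)
regex_2 = prosite_to_regex(class_2)

# Made-up sequences that contain each motif.
hit_1 = re.search(regex_1, "AAPLLGLCLGAQALAA")
hit_2 = re.search(regex_2, "MCGIVGAAA")

print(regex_1, hit_1.group(0))
print(regex_2, hit_2.group(0))
```

The class II pattern starts with "<", so it only matches when the cysteine lies within the first 12 residues of the sequence.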
40. Principles of structural alignment
• Dali: http://www.ebi.ac.uk/dali/
Looks for minimal RMSD between Cα atoms.
Calculates Cα-Cα distance matrices, then
identifies the longest alignable segments
• VAST (Vector Alignment Search Tool)
http://www.ncbi.nlm.nih.gov/Structure/
Looks for pairs of secondary structure
elements (α-helices, β-strands) that have
similar orientation and connectivity
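The distance-matrix idea can be illustrated with a toy example. This is not the Dali algorithm, only a sketch of why intramolecular Cα-Cα distances are useful: they do not change when a structure is moved in space, so two structures can be compared without superimposing them first. The coordinates and the helper names distance_matrix and matrix_rmsd are invented for illustration.

```python
import math


def distance_matrix(coords):
    """All pairwise distances between a list of (x, y, z) points."""
    return [[math.dist(p, q) for q in coords] for p in coords]


def matrix_rmsd(m1, m2):
    """Root-mean-square difference between two equal-sized distance matrices."""
    n = len(m1)
    total = sum((m1[i][j] - m2[i][j]) ** 2 for i in range(n) for j in range(n))
    return math.sqrt(total / (n * n))


# Toy "C-alpha traces" (made-up coordinates, in angstroms).
chain_a = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (7.6, 3.8, 0.0)]
# Same shape, translated in space.
chain_b = [(x + 10.0, y - 5.0, z + 2.0) for x, y, z in chain_a]
# Different shape: the last atom is moved so the chain is straight.
chain_c = chain_a[:3] + [(11.4, 0.0, 0.0)]

dm_a = distance_matrix(chain_a)
same_shape = matrix_rmsd(dm_a, distance_matrix(chain_b))
diff_shape = matrix_rmsd(dm_a, distance_matrix(chain_c))

print(same_shape)  # essentially zero: translation leaves distances unchanged
print(diff_shape)  # clearly above zero: the fold differs
```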
48. UniProt
• New protein sequence database that results from
merging SWISS-PROT and PIR. It will be the
annotated, curated protein sequence database.
• Data in UniProt are primarily derived from coding
sequence annotations in EMBL (GenBank/DDBJ) nucleic
acid sequence data.
• UniProt is a flat-file database, just like EMBL and
GenBank
• The flat-file format is Swiss-Prot-like, or EMBL-like
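A Swiss-Prot-like flat file tags each line with a two-letter code and ends each record with "//". The sketch below parses a simplified, made-up record in that style; the entry name, accession, and sequence are invented, and real entries carry many more line types than are handled here. The function name parse_flat_file is hypothetical.

```python
def parse_flat_file(text):
    """Parse simplified Swiss-Prot-like text into a list of record dicts."""
    records, current = [], {}
    for line in text.splitlines():
        if line.startswith("//"):              # end-of-record marker
            if current:
                records.append(current)
            current = {}
            continue
        code, value = line[:2], line[5:].strip()
        if code == "  ":                       # indented lines carry sequence data
            current["sequence"] = current.get("sequence", "") + value.replace(" ", "")
        elif code.strip():                     # ordinary two-letter line code
            current.setdefault(code, []).append(value)
    return records


# Made-up record, for illustration only.
flat_text = """\
ID   EXAMPLE_HUMAN   Reviewed;   10 AA.
AC   X00001;
DE   Hypothetical example protein.
SQ   SEQUENCE   10 AA;
     MKTAY IAKQR
//
"""

records = parse_flat_file(flat_text)
print(records[0]["AC"], records[0]["sequence"])
```

Because every field has to be found by scanning lines like this, flat files are simple to distribute but slower to query than relational tables.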
50. Swiss-Prot
• SWISS-PROT incorporates:
– Function of the protein
– Post-translational modifications
– Domains and sites
– Secondary structure
– Quaternary structure
– Similarities to other proteins
– Diseases associated with deficiencies in the
protein
– Sequence conflicts, variants, etc.