4. Basic concepts of databases
1. What is a database?
A database is a collection of data which can be used:
• alone, or
• combined / related to other data
to provide answers to the user’s question.
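As a rough illustration of data being used alone versus combined with related data, here is a minimal Python sketch using the standard-library sqlite3 module. The table names, column names, and sample rows are made up for illustration and do not come from any real database.

```python
import sqlite3

# In-memory database; all table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sequences (seq_id TEXT PRIMARY KEY, residues TEXT)")
conn.execute("CREATE TABLE annotations (seq_id TEXT, organism TEXT)")

# Toy rows for illustration only.
conn.executemany(
    "INSERT INTO sequences VALUES (?, ?)",
    [("S1", "ATGGCC"), ("S2", "ATGTAA")],
)
conn.executemany(
    "INSERT INTO annotations VALUES (?, ?)",
    [("S1", "organism A"), ("S2", "organism B")],
)

# Data used alone: one table answers "what is the sequence of S1?"
alone = conn.execute(
    "SELECT residues FROM sequences WHERE seq_id = ?", ("S1",)
).fetchone()[0]

# Data combined with related data: a join answers
# "which sequences come from organism B?"
combined = conn.execute(
    "SELECT s.seq_id, s.residues FROM sequences AS s "
    "JOIN annotations AS a ON a.seq_id = s.seq_id "
    "WHERE a.organism = ?",
    ("organism B",),
).fetchall()

print(alone)
print(combined)
```

The second query can only be answered by relating the two tables through the shared seq_id column, which is the "combined / related to other data" case.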
5. Data types
• Primary data: sequence (DNA, amino acid), e.g., DMPVERILEALAVE… → primary database
• Secondary data: secondary protein structure (e.g., alpha-helices, beta-strands) and "motifs" (regular expressions, blocks, profiles, fingerprints) → secondary database
• Tertiary data: tertiary protein structure (domains, folding units, atomic co-ordinates) → tertiary database
• Interaction data: binary protein-protein interactions/networks, pathways and functional networks → interaction database
6. Primary biological databases
• Nucleic acid databases
– EMBL
– GenBank
– DDBJ (DNA Data Bank of Japan)
• Protein databases
– PIR
– MIPS
– SWISS-PROT
– TrEMBL
– NRL-3D
7. Nucleotide Databases
• EMBL: Nucleotide sequence database
• Ensembl: Automatic annotation of eukaryotic genomes
• Genome Server: Overview of completed genomes at EBI
• Genome-MOT: Genome monitoring table
• EMBL-Align: Multiple sequence alignment database
8. Sequence data = strings of letters
• Nucleotides (bases): Adenine (A), Cytosine (C), Guanine (G), Thymine (T)
• Triplet codons → genetic code → 20 amino acids (A, L, V, S, etc.)
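To make the step from nucleotide letters to amino-acid letters concrete, here is a minimal Python sketch that reads a DNA string three bases at a time and looks each codon up in the standard genetic code. The function name translate and the example sequence are illustrative choices, not part of the slides.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), codons ordered
# TTT, TTC, TTA, TTG, TCT, ... with bases in the order T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def translate(dna: str) -> str:
    """Translate a DNA string codon by codon; '*' marks a stop codon."""
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3  # ignore a trailing partial codon
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))

# Made-up example sequence: ATG (M), GCC (A), TAA (stop).
protein = translate("ATGGCCTAA")
print(protein)
```

This sketch raises a KeyError for any letter other than A, C, G, or T, and it does not search for a start codon or stop at the first stop codon; a real tool would need to handle those cases.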
11. EMBL/GenBank/DDBJ
• These three databases contain essentially the same information (with a few differences in format and syntax)
• Serve as archives containing all sequences (single genes, ESTs, complete genomes, etc.) derived from:
– Genome projects and sequencing centers
– Individual scientists
– Patent offices (e.g., USPTO, EPO)
• Non-confidential data are exchanged daily.
12. Databases related to Genomics
• Contain information on genes, gene location (mapping),
gene nomenclature and links to sequence databases;
• Exist for most organisms important for life science research;
• Examples: MIM, GDB (human), MGD (mouse), FlyBase (Drosophila), SGD (yeast), MaizeDB (maize), SubtiList (B. subtilis), etc.
13. Swiss-Prot
• Annotated protein sequence database established in 1986 and
maintained collaboratively since 1987, by the Department of
Medical Biochemistry of the University of Geneva and EBI
• Complete, curated, non-redundant, and highly cross-referenced (linked to 34 other databases)
• Available from a variety of servers and through sequence analysis
software tools
• More than 8,000 different species
• First 20 species represent about 42% of all sequences in the
database
• More than 129,000 entries with 4.7 × 10^7 amino acids
14. PDB: Protein Data Bank
• Holds 3D models of biological macromolecules (protein, RNA,
DNA).
• All data are available to the public.
• Structures are determined by X-ray crystallography (84%) or NMR spectroscopy (16%).
• Submitted by biologists and biochemists from around the
world.
15. EMBL Nucleotide Sequence Database
• An annotated collection of all publicly available nucleotide
and protein sequences
• Created in 1980 at the European Molecular Biology
Laboratory in Heidelberg.
• Maintained since 1994 by EBI, Cambridge.
16. DDBJ: DNA Data Bank of Japan
• An annotated collection of all publicly available nucleotide and protein sequences
• Started in 1984 at the National Institute of Genetics (NIG) in Mishima.
• Still maintained at this institute by a team led by Takashi Gojobori.
17. Why Protein Structure?
Proteins are fundamental components of all living
cells, performing a variety of biological tasks.
Each protein has a particular 3D structure that determines its
function.
Protein structure is more conserved than protein sequence, and
more closely related to function.
21. Major classes in SCOP
• Classes
– All alpha proteins
– All beta proteins
– Alpha and beta proteins (a/b)
– Alpha and beta proteins (a+b)
– Multi-domain proteins
– Membrane and cell surface proteins
– Small proteins
22. Folds
• Each class may be divided into one or more folds
• Proteins that have the same secondary structure elements arranged in the same order in the protein chain and in three dimensions are classified as having the same fold
23. Superfamilies
• Superfamilies are subdivisions of folds
• A superfamily contains proteins which are thought to be
evolutionarily related due to
– Sequence
– Function
– Special structural features
• Relationships between members of a superfamily may not be
readily recognizable from the sequence alone
24. Families
• Subdivisions of superfamilies
• Contain members whose relationship is readily recognizable from the sequence
• Families are further subdivided into Proteins
• Proteins are divided into Species
– The same protein may be found in several species
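The SCOP hierarchy in slides 21 to 24 (class, fold, superfamily, family, protein, species) can be sketched as an ordered tuple of levels. The following Python sketch, under the assumption that each entry is described by one label per level, finds the most specific level two entries share. The function name and all labels are hypothetical placeholders, not real SCOP entries.

```python
# SCOP levels from most general to most specific, as listed above.
SCOP_LEVELS = ("class", "fold", "superfamily", "family", "protein", "species")

def deepest_shared_level(path_a, path_b):
    """Return the most specific level at which two classification paths agree.

    Each path is a tuple of labels, one per entry in SCOP_LEVELS.
    Returns None if the two paths already differ at the class level.
    """
    shared = None
    for level, a, b in zip(SCOP_LEVELS, path_a, path_b):
        if a != b:
            break
        shared = level
    return shared

# Hypothetical placeholder labels, not real SCOP entries.
entry_1 = ("class X", "fold 1", "superfamily 1a", "family i", "protein P", "species s1")
entry_2 = ("class X", "fold 1", "superfamily 1a", "family ii", "protein Q", "species s1")
entry_3 = ("class Y", "fold 7", "superfamily 7c", "family v", "protein R", "species s2")

print(deepest_shared_level(entry_1, entry_2))
print(deepest_shared_level(entry_1, entry_3))
```

Two entries that share a superfamily but not a family, like entry_1 and entry_2 here, correspond to the case where the evolutionary relationship may not be readily recognizable from the sequence alone.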
28. CATH
• Levels
– Class
– Architecture (this level is unique to CATH)
– Topology (approximately Fold, or Superfamily, in SCOP)
– Homologous Superfamily (approximately Superfamily, or Family, in SCOP)
29. Architecture
• Same overall arrangement of secondary structures
– Example: the architecture "two-layer beta sheet proteins" contains different folds, each with a distinct number and connectivity of strands
31. Abdul Qahar Buneri abdulqahar045@gmail.com
www.slideshare.net/abdulqahar045