FBW
             01-10-2012




Wim Van Criekinge
Course contents: Bioinformatics




                                NO CLASS
Outline


          • Molecular Biology
          • Flat files “sequence” databases
            – DNA
            – Protein
            – Structure
          • Relational Databases
            – What ?
            – Why ?
          • Biological Relational Databases
            – Howto ?
Flat Files


             What is a “flat file”?
             • A flat file stores data in a plain, ordinary
               file on the hard disk
             • Example: RefSeq
               – See CD-ROM
               – FILE: hs.GBFF
                  • Hs: Homo sapiens
                  • GBFF: GenBank Flat File format
                  • (associated with TextPad; use a monospaced
                    font, e.g. Courier)
Sequence entries
gene   10317..12529                     /gene="ZK822.4"
CDS    join(10317..10375,10714..10821,10874..10912,10960..11013,
       11061..11114,11169..11222,11346..11739,11859..11912,
       11962..12195,12242..12529)
       /gene="ZK822.4"                     /codon_start=1
       /protein_id="CAA98068.1"
       /db_xref="PID:g3881817"
       /db_xref="GI:3881817"
       /db_xref="SPTREMBL:Q23615"
       /translation="MHRHTYRKLYWNLGADGFSQGNADASVSAGSSGSNFLSGLQNSS
       FGQAVMGGINTYNQAKNSSGGNWQTAVANSSVGNFFQNGIDFFNGMKNGTQNFLDTDT
       IQETIGNSSFGEVVQTGVEFFNNIKNGNSPFQGDASSVMSQFVPFLANASAEAKAEFY
       TILPNFGNMTIAEFETAVNAWAAKYNLTDEVEAFNERSKNATVVAEEHANVVVMNLPN
       VLNNLKAISSDKNQTVVEMHTRMMAYVNSLDDDTRDIVFIFFRNLLPPQFKKSKCVDQ
       GNFLTNMYNKASDFFAGRNNRTDGEGSFWSGQGQNGNSGGSGFSSFFNNFNGQGNGNG
       NGAQNPMIGMFNNFMKKNNITADEANAAMADGGASIQILPAISAGWGDVAQVKIGGDF
       KIAVEEETKTTKKNKKQQQQANKNKNKNKKKTTIAPEAAIDANIAAEVHTQVL"
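The `join(...)` location on the CDS line above lists the exon segments that are spliced together to give the coding sequence. A minimal Python sketch, assuming a simple join of plain `start..end` ranges (no `complement`, no `<`/`>` boundaries, no references to other entries); the function names `parse_join` and `spliced_length` are illustrative, not part of any library:

```python
import re

def parse_join(location: str) -> list[tuple[int, int]]:
    """Return (start, end) pairs from a simple 'join(a..b,c..d)' or 'a..b' location."""
    return [(int(a), int(b)) for a, b in re.findall(r"(\d+)\.\.(\d+)", location)]

def spliced_length(segments: list[tuple[int, int]]) -> int:
    """Total number of bases covered; both end points are inclusive."""
    return sum(end - start + 1 for start, end in segments)

# Location copied from the ZK822.4 CDS feature shown on the slide.
cds = ("join(10317..10375,10714..10821,10874..10912,10960..11013,"
       "11061..11114,11169..11222,11346..11739,11859..11912,"
       "11962..12195,12242..12529)")

exons = parse_join(cds)
print(len(exons), "segments,", spliced_length(exons), "bases")
# 10 segments, 1338 bases (a multiple of 3, as expected for a CDS with /codon_start=1)
```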
Nucleotide Databases

EMBL Nucleotide Sequence Database (European Molecular Biology
Laboratory) http://www.ebi.ac.uk/ebi_docs/embl_db/ebi/topembl.html
GenBank at NCBI (National Center for Biotechnology Information)
http://www.ncbi.nlm.nih.gov/Web/Genbank/index.html

DDBJ (DNA Database of Japan) http://www.ddbj.nig.ac.jp/
The Center for operating DDBJ, at the National Institute of Genetics (NIG), Japan, was established in
April 1995.


http://www.ncbi.nlm.nih.gov/Genbank/genbankstats.html

Release Notes (ftp://ftp.ncbi.nih.gov/genbank/gbrel.txt)
Genetic Sequence Data Bank - August 15 2003
NCBI-GenBank Flat File Release 137.0
Distribution Release Notes
33 865 022 251 bases, from 27 213 748 reported sequences
GenBank Format
LOCUS        LISOD         756 bp    DNA              BCT   30-JUN-1993
DEFINITION   L.ivanovii sod gene for superoxide dismutase.
ACCESSION    X64011.1 GI:37619753
NID          g44010
KEYWORDS     sod gene; superoxide dismutase.
SOURCE       Listeria ivanovii.
ORGANISM     Listeria ivanovii
             Eubacteria; Firmicutes; Low G+C gram-positive bacteria;
             Bacillaceae; Listeria.
REFERENCE    1 (bases 1 to 756)
  AUTHORS    Haas,A. and Goebel,W.
  TITLE      Cloning of a superoxide dismutase gene from Listeria ivanovii
             by functional complementation in Escherichia coli and
             characterization of the gene product
  JOURNAL    Mol. Gen. Genet. 231 (2), 313-322 (1992)
  MEDLINE    92140371
REFERENCE    2 (bases 1 to 756)
  AUTHORS    Kreft,J.
  TITLE      Direct Submission
  JOURNAL    Submitted (21-APR-1992) J. Kreft, Institut f. Mikrobiologie,
             Universitaet Wuerzburg, Biozentrum Am Hubland, 8700
             Wuerzburg, FRG
FEATURES           Location/Qualifiers
     source        1..756
                   /organism="Listeria ivanovii"
                   /strain="ATCC 19119"
                   /db_xref="taxon:1638"
     RBS           95..100
                   /gene="sod"
     gene          95..746
                   /gene="sod"
     CDS           109..717
                   /gene="sod"
                   /EC_number="1.15.1.1"
                   /codon_start=1
                   /product="superoxide dismutase"
                   /db_xref="PID:g44011"
                   /db_xref="SWISS-PROT:P28763"
                   /transl_table=11
                   /translation="MTYELPKLPYTYDALEPNFDKETMEIHYTKHHNIYVTKL
                   NEAVSGHAELASKPGEELVANLDSVPEEIRGAVRNHGGGHANHTLFWSSLSPN
                   GGGAPTGNLKAAIESEFGTFDEFKEKFNAAAAARFGSGWAWLVVNNGKLEIVS
                   TANQDSPLSEGKTPVLGLDVWEHAYYLKFQNRRPEYIDTFWNVINWDERNKRF
                   DAAK"
      terminator   723..746
                   /gene="sod"
Example of location descriptors
Location       Description

476            Points to a single base in the presented sequence

340..565       Points to a continuous range of bases bounded by and
               including the starting and ending bases

<345..500      The exact lower boundary point of a feature is unknown.

(102.110)      Indicates that the exact location is unknown but that it
               is one of the bases between bases 102 and 110.

(23.45)..600   Specifies that the starting point is one of the bases
               between bases 23 and 45, inclusive, and the end base 600

123^124        Points to a site between bases 123 and 124

145^177        Points to a site anywhere between bases 145 and 177

J00193:hladr   Points to a feature whose location is described in
               another entry: the feature labeled 'hladr' in the
               entry (in this database) with primary accession 'J00193'
BASE COUNT         247 a     136 c      151 g     222 t
ORIGIN
1    cgttatttaa ggtgttacat agttctatgg aaatagggtc tatacctttc gccttacaat
61   gtaatttctt ttcacataaa taataaacaa tccgaggagg aatttttaat gacttacgaa
121 ttaccaaaat taccttatac ttatgatgct ttggagccga attttgataa agaaacaatg
181 gaaattcact atacaaagca ccacaatatt tatgtaacaa aactaaatga agcagtctca
241 ggacacgcag aacttgcaag taaacctggg gaagaattag ttgctaatct agatagcgtt
301 cctgaagaaa ttcgtggcgc agtacgtaac cacggtggtg gacatgctaa ccatacttta
361 ttctggtcta gtcttagccc aaatggtggt ggtgctccaa ctggtaactt aaaagcagca
421 atcgaaagcg aattcggcac atttgatgaa ttcaaagaaa aattcaatgc ggcagctgcg
481 gctcgttttg gttcaggatg ggcatggcta gtagtgaaca atggtaaact agaaattgtt
541 tccactgcta accaagattc tccacttagc gaaggtaaaa ctccagttct tggcttagat
601 gtttgggaac atgcttatta tcttaaattc caaaaccgtc gtcctgaata cattgacaca
661 ttttggaatg taattaactg ggatgaacga aataaacgct ttgacgcagc aaaataatta
721 tcgaaaggct cacttaggtg ggtcttttta tttcta
//
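The BASE COUNT line can be checked against the sequence itself: 247 + 136 + 151 + 222 = 756, which matches the length on the LOCUS line. A small Python sketch of how the ORIGIN block could be turned back into a plain string and counted; only the first two ORIGIN lines are used here, and `parse_origin` is a hypothetical helper name:

```python
from collections import Counter

def parse_origin(lines: list[str]) -> str:
    """Join ORIGIN lines into one sequence, dropping position numbers and spaces."""
    return "".join(ch for line in lines for ch in line if ch.isalpha()).lower()

# First two ORIGIN lines of the LISOD entry, copied from the slide.
origin_lines = [
    "1    cgttatttaa ggtgttacat agttctatgg aaatagggtc tatacctttc gccttacaat",
    "61   gtaatttctt ttcacataaa taataaacaa tccgaggagg aatttttaat gacttacgaa",
]

seq = parse_origin(origin_lines)
counts = Counter(seq)
print(len(seq), dict(counts))

# Positions in the flat file are 1-based, Python slices are 0-based:
# bases 109..111 are the first codon of the CDS feature (109..717).
print(seq[108:111])  # atg
```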
EMBL format
ID LISOD      standard; DNA; PRO; 756 BP.             IDentification
XX
AC X64011; S78972;                                    Accession (Axxxxx, Afxxxxxx), GUID
XX
NI g44010                                              Nucleotide Identifier --> x.x
XX
DT 28-APR-1992 (Rel. 31, Created)                      DaTe
DT 30-JUN-1993 (Rel. 36, Last updated, Version 6)
XX
DE L.ivanovii sod gene for superoxide dismutase        DEscription
XX
KW sod gene; superoxide dismutase.                     KeyWord
XX
OS Listeria ivanovii                                   Organism Species
OC Eubacteria; Firmicutes; Low G+C gram-positive bacteria; Bacillaceae;
OC Listeria.                                          Organism Classification
XX
RN [1]
RA Haas A., Goebel W.;                                Reference
RT "Cloning of a superoxide dismutase gene from Listeria ivanovii by
RT functional complementation in Escherichia coli and
RT characterization of the gene product.";
RL Mol. Gen. Genet. 231:313-322(1992).
XX
GenBank,EMBL & DDBJ: Comments

            • Collaboration GenBank/EMBL/DDBJ
               – Effort: identical within 24 hours
            • Redundant information
            • Historical graveyard
               – BANKIT (responsibility of the submitter)
               – Version conflicts
            • IDIOSYNCRATIC (peculiar to the
              individual)
               – Heterogeneous annotation
               – No consistent quality check
                  • Vectors, sequence errors etc.
Other Genbank Formats

               • ASN1
                   – Computer friendly, human unfriendly
               • FASTA
                   – Brief, loses information
                   – Easy to use
                   – Compatible with multiple sequences
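To make the "brief, loses information" point concrete: a FASTA record keeps only a one-line header and the sequence. The following Python sketch writes and reads such records; the identifier and description are taken from the LISOD entry, while the function names and the 20-character line width are arbitrary choices for illustration:

```python
def to_fasta(seq_id: str, description: str, sequence: str, width: int = 60) -> str:
    """Format one record: a '>' header line followed by fixed-width sequence lines."""
    header = f">{seq_id} {description}".rstrip()
    chunks = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join([header, *chunks]) + "\n"

def read_fasta(text: str) -> dict[str, str]:
    """Parse FASTA text into {identifier: sequence}; several records are allowed."""
    records: dict[str, str] = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            current = line[1:].split()[0]   # identifier = first word of the header
            records[current] = ""
        elif current is not None:
            records[current] += line
    return records

# First line of the /translation qualifier from the LISOD entry.
protein = "MTYELPKLPYTYDALEPNFDKETMEIHYTKHHNIYVTKL"
fasta = to_fasta("P28763", "superoxide dismutase", protein, width=20)
print(fasta)
print(read_fasta(fasta))
```

Everything else in the GenBank entry (references, features, cross-references) has no place in this format, which is why it is easy to use but lossy.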
Web Query tools & Programming Query tools

• NCBI website example:
    – http://www.ncbi.nlm.nih.gov/entrez/query/static/advancedentrez.html



• EBI UniProtKB website example:
    – http://www.ebi.ac.uk/uniprot/index.html
    – http://www.ebi.uniprot.org/search/SearchTools.shtml
batch download (ftp server)


• Data available via a website are usually
  also available via an FTP
  server for download as a complete
  batch.
• Examples:
    –ftp://ftp.ncbi.nih.gov/
    –ftp://ftp.ebi.ac.uk/pub/
Sequence file format tips

                 • When saving a sequence for use in an email
                   message or pasting into a web page, use an
                   unannotated text format such as FASTA
                 • When retrieving from a database or
                   exchanging between programs, use an
                   annotated text format such as Genbank
                 • When using sequence again with the same
                   program, use that program’s annotated binary
                   format (or annotated text if binary not
                   available)
                     – Asn-1 (NCBI)
                     – Gbff (sanger)
                     – XML
Expressed Sequence Tags

            • Sequence that codes for protein is < 5% of the
              genome.
            • Coding sequence can be obtained from mRNA by
              reverse transcription.
            • Tags for that sequence can be obtained by end-
              sequencing.
            • Incyte and HGS gambled on this being the useful
              part:
               – Search for homologies to known proteins, motifs.
               – Search for changed levels of expression and tissue specificity
                 (“virtual/electronic northern” used in GeneCards)
            • ESTs have driven the huge expansion of GenBank:
               – Unigene now contains some sequence from most genes.
               – > 4,000,000 human EST sequences
               – http://www.ncbi.nlm.nih.gov/dbEST/
dbEST release 100303 Summary by Organism - October 3, 2003

                 Number of public entries: 18,762,324

                 Homo sapiens (human)                     5,426,001
                 Mus musculus + domesticus (mouse)        3,881,878
                 Rattus sp. (rat)                           538,073
                 Triticum aestivum (wheat)                  500,898
                 Ciona intestinalis                         492,488
                 Gallus gallus (chicken)                    451,565
                 Zea mays (maize)                          383,416
                 Danio rerio (zebrafish)                   362,362
                 Hordeum vulgare + subsp. vulgare (barley) 348,233
                 Xenopus laevis (African clawed frog)       344,695
                 Glycine max (soybean)                      341,573
                 Bos taurus (cattle)                        322,074
                 Drosophila melanogaster (fruit fly)        261,404
Traces <-> strings

               • Traces contain much more information
                     – TraceDB: http://www.ncbi.nlm.nih.gov/Traces/




Example
Traces <-> strings

               • Phred
                     – base calling, vector trimming, end of sequence
                       read trimming
               • Phrap
                     – Phrap uses Phred’s base calling scores to
                       determine the consensus sequences. Phrap
                       examines all individual sequences at a given
                       position, and uses the highest scoring sequence
                       (if it exists) to extend the consensus sequence
               • Consed
                     – graphical interface extension that controls both
                       Phred and Phrap
What is Phred?

• Phred is a program that reads the base trace, makes
base calls, and assigns a quality value (qv) to each base in the
sequence.
• It then writes the base calls and qv to output files that are
used for Phrap assembly. The qv are used for
consensus sequence construction.
• For example,        ATGCATGC string1
                      ATTCATGC string2
                      AT-CATGC superstring
• Here ‘G’ and ‘T’ mismatch; the qv decide what
replaces the dash in the superstring: the base with the higher
qv replaces the dash.
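The string1/string2 example can be sketched in Python as follows. This only illustrates the "higher qv wins" idea from the slide; it is not Phrap's actual algorithm, the reads are assumed to be already aligned and of equal length, and the quality values are invented for the example:

```python
def consensus(read1: str, qual1: list[int], read2: str, qual2: list[int]) -> str:
    """Per position, keep the agreed base or, on a mismatch, the base with the higher qv."""
    out = []
    for b1, q1, b2, q2 in zip(read1, qual1, read2, qual2):
        out.append(b1 if b1 == b2 or q1 >= q2 else b2)
    return "".join(out)

string1 = "ATGCATGC"
string2 = "ATTCATGC"
# Hypothetical quality values: the 'G' in string1 is a poor call, the 'T' in string2 a good one.
qv1 = [40, 40, 15, 40, 40, 40, 40, 40]
qv2 = [40, 40, 35, 40, 40, 40, 40, 40]

print(consensus(string1, qv1, string2, qv2))  # ATTCATGC
```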
How does Phred calculate qv?

• From the base trace Phred knows the number of peaks
    and the actual peak locations.
•   Phred predicts peak locations.
•   Phred reads the actual peak locations from the base
    trace.
•   Phred matches the actual locations with the
    predicted locations using dynamic
    programming.
•   The qv is related to the base call error probability
    (ep) by the formula qv = -10*log_10(ep)
• Example: an error probability of 1 in 10,000 gives qv 40
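The formula is easy to check numerically. A short Python sketch of the conversion in both directions (the function names are illustrative):

```python
import math

def error_prob_to_qv(ep: float) -> float:
    """qv = -10 * log10(ep)"""
    return -10 * math.log10(ep)

def qv_to_error_prob(qv: float) -> float:
    """Inverse relation: ep = 10 ** (-qv / 10)"""
    return 10 ** (-qv / 10)

for ep in (0.1, 0.01, 0.001, 0.0001):
    print(f"ep = {ep:<7} qv = {error_prob_to_qv(ep):.0f}")
```

Each step of 10 in qv corresponds to a tenfold drop in error probability, so qv 20 means 1 error in 100 and qv 40 means 1 in 10,000.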
Why Phred?


             • Output sequence might contain
                 errors.
             •   Vector contamination might occur.
             •   Dye-terminator reaction might not
                 occur.
             •   Segment migration abnormal in
                 gel electrophoresis.
             •   Weak or variable signal strength
                 of peak corresponding to a base.
Vector Trimming
End of Sequence Cropping




• The ends of sequencing reads commonly
    contain poor data. This is due to the difficulty of
    resolving larger fragments of ~1 kb (it is easier to
    resolve 21 bp from 20 bp than it is to resolve
    1001 bp from 1000 bp).
•   Phred assigns a non-value of ‘x’ to this data by
    comparing peak separation and peak intensity to
    internal standards. If the standard threshold score
    is not reached, the data will not be used.
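As a rough illustration of end cropping, the sketch below masks the low-quality tail of a read with ‘x’. It is a simplification under stated assumptions: it works on per-base quality values rather than on peak separation and intensity, and the threshold of 20 and the example values are arbitrary, so it should not be read as Phred's own procedure:

```python
def mask_low_quality_end(bases: str, quals: list[int], threshold: int = 20) -> str:
    """Replace the trailing run of bases whose qv is below the threshold with 'x'."""
    end = len(bases)
    while end > 0 and quals[end - 1] < threshold:
        end -= 1
    return bases[:end] + "x" * (len(bases) - end)

read = "ACGTACGT"
qv = [30, 32, 35, 31, 28, 12, 9, 5]   # hypothetical values, quality decays toward the end
print(mask_low_quality_end(read, qv))  # ACGTAxxx
```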
Traces <-> strings

• Handle traces
   – Abi-view EMBOSS
   – Bioedit
   – Acembly, …




• EXAMPLE
NCBI reference sequences

RefSeq database is a non-redundant set of
 reference standards that includes
 chromosomes, complete genomic molecules,
 intermediate assembled genomic contigs,
 curated genomic regions, mRNAs, RNAs, and
 proteins.
RefSeq nomenclature

NC_#### complete genomic
NG_#### incomplete genomic
NM_#### mRNA
NR_#### noncoding transcripts
NP_#### proteins
NT_#### intermediate genomic contigs
RefSeq nomenclature - models

XM_#### mRNA
XR_#### RNA
XP_#### protein

Automated Homo sapiens models provided by
 the Genome Annotation process; sequence
 corresponds to the genomic contig.
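The two nomenclature slides amount to a lookup on the accession prefix. A minimal Python sketch of that lookup, using only the prefixes listed above; the accession numbers in the example are placeholders, not real records:

```python
REFSEQ_PREFIXES = {
    "NC": "complete genomic",
    "NG": "incomplete genomic",
    "NM": "mRNA",
    "NR": "noncoding transcript",
    "NP": "protein",
    "NT": "intermediate genomic contig",
    "XM": "mRNA (model)",
    "XR": "RNA (model)",
    "XP": "protein (model)",
}

def classify(accession: str) -> str:
    """Map an accession such as 'NM_123456' to the molecule type of its prefix."""
    prefix, sep, _ = accession.partition("_")
    if not sep:
        return "not a RefSeq-style accession"
    return REFSEQ_PREFIXES.get(prefix.upper(), "unknown RefSeq prefix")

for acc in ("NM_123456", "XP_123456", "X64011"):   # placeholder accessions
    print(acc, "->", classify(acc))
```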
Open reading frame
• Definition:
  – A stretch of triplet codons with an initiator
    codon at one end and a stop codon at the other,
    as identifiable from the nucleotide sequence.
• Example
  – http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=nucleotide&list_uids=6688473&dopt=GenBank&term=Y18948.1&qty=1
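The definition translates directly into a scan over the three forward reading frames. The Python sketch below is deliberately naive: it assumes ATG as the only initiator codon, the standard stop codons, and the forward strand only; the test sequence is made up:

```python
STOPS = {"TAA", "TAG", "TGA"}

def find_orfs(dna: str, min_codons: int = 2) -> list[tuple[int, int]]:
    """Return 1-based inclusive (start, end) of ATG...stop stretches in the 3 forward frames."""
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i                       # first initiator codon in this frame
            elif start is not None and codon in STOPS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start + 1, i + 3))
                start = None                    # look for the next ORF in this frame
    return sorted(orfs)

dna = "CCATGAAATGGTAACC"          # toy sequence: ATG AAA TGG TAA in frame 3
print(find_orfs(dna))             # [(3, 14)]
```

The coordinates use the same 1-based, inclusive convention as GenBank feature locations, so the result could be written as `3..14`.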
Protein sequence database
SWISS-PROT & TREMBL
SwissProt - http://expasy.hcuge.ch/sprot/

 SWISS-PROT is an annotated protein sequence database

 The sequences are translated from the EMBL Nucleotide Sequence Database

 Sequence entries are composed of different lines.
 For standardization purposes the format of SWISS-PROT follows as
 closely as possible that of the EMBL Nucleotide Sequence Database.

 Continuously updated (daily).
Different Features of SWISS-PROT


                    •  Format follows as closely as
                       possible that of EMBL’s
                    • Curated protein sequence database
                    • Three differences:
                    1. Strives to provide a high level of
                       annotations
                    2. Minimal level of redundancy
                    3. High level of integration with
                       other databases
1. Annotation
 Three Distinct Criteria
                  Core data: the sequence data, the citation
                   information (bibliographical
                   references) and the taxonomic data
                   (description of the biological source of
                   the protein).
                  Annotation: protein
                   functions, post-translational
                   modifications, domains and
                   sites, secondary structure, quaternary
                   structure, similarities to other
                   proteins, diseases associated with
                   deficiencies in the protein, sequence
                   conflicts, variants, etc.
2. Minimal Redundancy

                  Many sequence databases contain, for a
                   given protein sequence, separate
                   entries which correspond to
                   different literature reports. SWISS-PROT
                   tries as much as possible to
                   merge all these data so as to
                   minimize the redundancy. If
                   conflicts exist between various
                   sequencing reports, they are
                   indicated in the feature table of the
                   corresponding entry.
3. Integration With Other Databases

               • SWISS-PROT and TrEMBL - Protein
                 sequences
               • PROSITE - Protein families and domains
               • SWISS-2DPAGE - Two-dimensional
                 polyacrylamide gel electrophoresis
               • SWISS-3DIMAGE - 3D images of proteins
                 and other biological macromolecules
               • SWISS-MODEL Repository - Automatically
                 generated protein models
               • CD40Lbase - CD40 ligand defects
               • ENZYME - Enzyme nomenclature
               • SeqAnalRef - Sequence analysis bibliographic
                 references
TREMBL- http://expasy.hcuge.ch/sprot/
 Translated EMBL sequences not (yet) in
Swissprot.

  Updated faster than SWISS-PROT.

TREMBL - two parts
1. SP-TREMBL
      Will eventually be incorporated into Swissprot
      Divided into FUN, HUM, INV, MAM, MHC, ORG, PHG, PLN,
      PRO, ROD, UNC, VRL and VRT.
2. REM-TREMBL (remaining)
      Will NOT be incorporated into Swissprot
      Divided into: immunoglobulins and T-cell receptors, synthetic
      sequences, patent application sequences, small fragments, CDS
      not coding for real proteins
SWISS-PROT/TrEMBL

              • TrEMBL is a computer-annotated
                supplement of SWISS-PROT that contains
                all the translations of EMBL nucleotide
                sequence entries not yet integrated in
                SWISS-PROT
              • SWISS-PROT Release 39.15 of
                19-Mar-2001: 94,152 entries
              • TrEMBL Release 16.2 of
                23-Mar-2001: 436,924 entries
Example of a SwissProt entry
ID    TNFA_HUMAN STANDARD;           PRT; 233 AA.              IDentification
AC     P01375;                                                 ACcession
DT     21-JUL-1986 (REL. 01, CREATED)                           DaTe
DT     21-JUL-1986 (REL. 01, LAST SEQUENCE UPDATE)
DT     15-JUL-1998 (REL. 36, LAST ANNOTATION UPDATE)
DE     TUMOR NECROSIS FACTOR PRECURSOR (TNF-ALPHA) (CACHECTIN).
GN     TNFA.                                                   Gene name
OS     HOMO SAPIENS (HUMAN).                                    Organism Species
OC     EUKARYOTA; METAZOA; CHORDATA; VERTEBRATA; TETRAPODA; MAMMALIA;
OC     EUTHERIA; PRIMATES.                                    Organism Classification
RN     [1]                                                    Reference
RP     SEQUENCE FROM N.A.
RX     MEDLINE; 87217060.
RA     NEDOSPASOV S.A., SHAKHOV A.N., TURETSKAYA R.L., METT V.A.,
RA    AZIZOV M.M., GEORGIEV G.P., KOROBKO V.G., DOBRYNIN V.N.,
RA     FILIPPOV S.A., BYSTROV N.S., BOLDYREVA E.F., CHUVPILO S.A.,
RA     CHUMAKOV A.M., SHINGAROVA L.N., OVCHINNIKOV Y.A.;
RL    COLD SPRING HARB. SYMP. QUANT. BIOL. 51:611-624(1986).
RN     [2]
RP     SEQUENCE FROM N.A.
RX     MEDLINE; 85086244.
RA     PENNICA D., NEDWIN G.E., HAYFLICK J.S., SEEBURG P.H., DERYNCK R.,
RA     PALLADINO M.A., KOHR W.J., AGGARWAL B.B., GOEDDEL D.V.;
RL    NATURE 312:724-729(1984).
...
CC   -!- FUNCTION: CYTOKINE WITH A WIDE VARIETY OF FUNCTIONS: IT CAN
CC       CAUSE CYTOLYSIS OF CERTAIN TUMOR CELL LINES, IT IS IMPLICATED
CC       IN THE INDUCTION OF CACHEXIA, IT IS A POTENT PYROGEN CAUSING
CC       FEVER BY DIRECT ACTION OR BY STIMULATION OF IL-1 SECRETION, IT
CC       CAN STIMULATE CELL PROLIFERATION & INDUCE CELL DIFFERENTIATION
CC       UNDER CERTAIN CONDITIONS.                   Comments
CC   -!- SUBUNIT: HOMOTRIMER.
CC   -!- SUBCELLULAR LOCATION: TYPE II MEMBRANE PROTEIN. ALSO EXISTS AS
CC      AN EXTRACELLULAR SOLUBLE FORM.
CC   -!- PTM: THE SOLUBLE FORM DERIVES FROM THE MEMBRANE FORM BY
CC       PROTEOLYTIC PROCESSING.
CC   -!- DISEASE: CACHEXIA ACCOMPANIES A VARIETY OF DISEASES, INCLUDING
CC       CANCER AND INFECTION, AND IS CHARACTERIZED BY GENERAL ILL
CC       HEALTH AND MALNUTRITION.
CC   -!- SIMILARITY: BELONGS TO THE TUMOR NECROSIS FACTOR FAMILY.
DR   EMBL; X02910; G37210; -.                         Database Cross-references
DR   EMBL; M16441; G339741; -.
DR   EMBL; X01394; G37220; -.
DR   EMBL; M10988; G339738; -.
DR   EMBL; M26331; G339764; -.
DR   EMBL; Z15026; G37212; -.
DR   PIR; B23784; QWHUN.
DR   PIR; A44189; A44189.
DR   PDB; 1TNF; 15-JAN-91.
DR   PDB; 2TUN; 31-JAN-94.
KW    CYTOKINE; CYTOTOXIN; TRANSMEMBRANE; GLYCOPROTEIN; SIGNAL-ANCHOR;
KW    MYRISTYLATION; 3D-STRUCTURE.                         KeyWord
FT   PROPEP      1 76                                      Feature Table
FT   CHAIN    77 233     TUMOR NECROSIS FACTOR.
FT   TRANSMEM 36 56        SIGNAL-ANCHOR (TYPE-II PROTEIN).
FT   LIPID   19 19     MYRISTATE.
FT   LIPID   20 20     MYRISTATE.
FT   DISULFID 145 177
FT   MUTAGEN 105 105        L->S: LOW ACTIVITY.
FT   MUTAGEN 108 108        R->W: BIOLOGICALLY INACTIVE.
FT   MUTAGEN 112 112       L->F: BIOLOGICALLY INACTIVE.
FT   MUTAGEN 162 162        S->F: BIOLOGICALLY INACTIVE.
FT   MUTAGEN 167 167        V->A,D: BIOLOGICALLY INACTIVE.
FT   MUTAGEN 222 222        E->K: BIOLOGICALLY INACTIVE.
FT   CONFLICT 63 63       F -> S (IN REF. 5).
FT   STRAND     89 93
FT   TURN     99 100
FT   TURN    109 110
FT   STRAND    112 113
FT   TURN    115 116
FT   STRAND    118 119
FT   STRAND    124 125
FT STRAND     130 143
FT STRAND     152 159
FT STRAND     166 170
FT STRAND     173 174
FT TURN      183 184
FT STRAND     189 202
FT TURN      204 205
FT STRAND     207 212
FT HELIX    215 217
FT STRAND     218 218
FT STRAND     227 232
SQ SEQUENCE 233 AA; 25644 MW; 666D7069 CRC32;
   MSTESMIRDV ELAEEALPKK TGGPQGSRRC LFLSLFSFLI   VAGATTLFCL   LHFGVIGPQR
   EEFPRDLSLI SPLAQAVRSS SRTPSDKPVA HVVANPQAEG   QLQWLNRRAN   ALLANGVELR
   DNQLVVPSEG LYLIYSQVLF KGQGCPSTHV LLTHTISRIA   VSYQTKVNLL   SAIKSPCQRE
   TPEGAEAKPW YEPIYLGGVF QLEKGDRLSA EINRPDYLDF   AESGQVYFGI   IAL
//
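Because every line of a SWISS-PROT (or EMBL) entry starts with a two-letter line code, a first-pass parser only has to group lines by that code. The Python sketch below does this for a shortened copy of the TNFA_HUMAN entry; it is an assumption-laden illustration (sequence lines, which start with spaces, are simply skipped, and `parse_entry` is a hypothetical name), not a full parser:

```python
from collections import defaultdict

def parse_entry(text: str) -> dict[str, list[str]]:
    """Group the lines of one entry by their two-letter line code; stop at '//'."""
    fields = defaultdict(list)
    for line in text.splitlines():
        if line.startswith("//"):
            break
        code, _, rest = line.partition(" ")
        if len(code) == 2 and code.isupper():
            fields[code].append(rest.strip())
    return dict(fields)

# A few lines copied from the entry on the slide.
entry = """\
ID   TNFA_HUMAN     STANDARD;      PRT;   233 AA.
AC   P01375;
DE   TUMOR NECROSIS FACTOR PRECURSOR (TNF-ALPHA) (CACHECTIN).
DR   PIR; A44189; A44189.
DR   PDB; 1TNF; 15-JAN-91.
DR   PDB; 2TUN; 31-JAN-94.
//
"""

record = parse_entry(entry)
accession = record["AC"][0].rstrip(";")
pdb_ids = [dr.split(";")[1].strip() for dr in record["DR"] if dr.startswith("PDB")]
print(accession, pdb_ids)  # P01375 ['1TNF', '2TUN']
```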
Protein searching
3-levels of Protein Searching


1. Swissprot                            Little Noise
                                        Annotated entries



2. Swissprot + TREMBL                   More Noisy
                                        All probable entries



3. Translated EMBL - tblast or tfasta   Most Noisy
                                        All possible entries
New initiatives


                   • IPI: International Protein Index
                     – http://www.ebi.ac.uk/IPI/IPIhelp.html
                   • UNIPROT: Universal Protein
                     Knowledgebase
                     – http://www.pir.uniprot.org/
                   • HPRD: Human Protein Reference
                     Database
                     – http://www.hprd.org/
UniProt
          UniProt Consortium
             • European Bioinformatics Institute (EBI)
             • Swiss Institute of Bioinformatics (SIB)
             • Protein Information Resource (PIR)

          Uniprot Databases
             •UniProt Knowledgebase (UniProtKB)
             •UniProt Reference Clusters (UniRef)
             •UniProt Archive (UniParc)

          UniprotKB
             •Swiss-Prot (annotated protein sequence db,
             gold standard)
             •TrEMBL (translated EMBL + automated
             electronic annotations)
Understanding molecular
 structure is critical to the
 understanding of biology
 because structure
    determines function
From Structure to Function

• the drug morphine has chemical groups that are functionally equivalent to the natural
endorphins found in the human body




• the receptor molecules located at the synapse (between two neurons)
bind morphine much the same way as endorphins

• therefore, morphine is able to attenuate the pain response
Structure databases
Protein Data Bank (PDB)
Protein Data Bank - http://www.rcsb.org/pdb

Diffraction         7373 structures determined by X-ray diffraction
NMR                 388 structures determined by NMR spectroscopy
Theoretical Model   201 structures proposed by modeling
• PDB holds three-dimensional structures of
  proteins and some nucleic acids
• PDB is operated by RCSB (Research Collaboratory for
  Structural Bioinformatics), funded by NSF, DOE, and
  two units of NIH: NIGMS (National Institute of General
  Medical Sciences) and NLM (National Library of Medicine).
• Established at BNL (Brookhaven National Laboratories) in
  1971 as an archive for biological
  macromolecular crystal structures
• In the 1980s, the number of deposited structures
  began to increase dramatically.
• In October 1998, the management of the PDB
  became the responsibility of RCSB.
• Website: http://www.rcsb.org
PDB Holdings List: 27-Mar-2001




                                                Molecule Type

                           Proteins,     Protein/
                           Peptides,     Nucleic Acid   Nucleic   Carbo-
Exp. Tech.                 and Viruses   Complexes      Acids     hydrates   Total

X-ray diffraction
and other                  11045         526            552       14         12137
NMR                         1832          71            366        4          2273
Theoretical modeling         281          19             21        0           321
Total                      13158         616            939       18         14731

          5032 Structure Factor Files
          968 NMR Restraint Files
PDB Content Growth
PDB Growth in New Folds
Other structure databases
BioMagResBank http://www.bmrb.wisc.edu/
A Repository for Data from NMR Spectroscopy on Proteins, Peptides, and Nucleic
Acids
Biological Macromolecule Crystallization Database (BMCD)
          http://h178133.carb.nist.gov:4400/bmcd/bmcd.html
Contains crystal data and the crystallization conditions, which have been compiled
from literature
Nucleic Acid Database (NDB) http://ndbserver.rutgers.edu:80/
Assembles and distributes structural information about nucleic acids
Structural Classification of Proteins (SCOP) http://scop.mrc-lmb.cam.ac.uk/scop/
Structure similarity search. Hierarchic organization.
MOOSE http://db2.sdsc.edu/moose/
Macromolecular Structure Query
Cambridge Structural Database (CSD) http://www.ccdc.cam.ac.uk/
Small molecules.
Protein Splicing?

• Protein splicing is defined as the excision of
  an intervening protein sequence (the
  INTEIN) from a protein precursor and the
  concomitant ligation of the flanking protein
  fragments (the EXTEINS) to form a mature
  extein protein and the free intein
• http://www.neb.com/inteins/intein_intro.html
Biological databases

                 • NAR Database Issue
                       – Every year: NAR DB Issue
                       – The 2006 update includes 858 databases
                       – Citation top 5 are:
                          •   Pfam
                          •   Gene Ontology
                          •   UniProt
                          •   SMART
                          •   KEGG
                       – Primary Nucleotide DB’s and PDB are
                         not cited anymore
Why biological databases ?

                 • Explosive growth in biological data

                 • Data (sequences, 3D structures, 2D
                   gel analysis, MS analysis….) are no
                   longer published in a conventional
                   manner, but directly submitted to
                   databases

                 • Essential tools for biological research,
                   as classical publications used to be !
Problems with Flat files …

                  •   Wasted storage space
                  •   Wasted processing time
                  •   Data control problems
                  •   Problems caused by changes to data
                      structures
                  •   Access to data difficult
                  •   Data out of date
                  •   Constraints are system based
                  •   Limited querying, e.g. all single-exon
                      GPCRs (<1000 bp)
Relational

             • The relational model is not only very mature; there is
               also a strong body of knowledge on how to make a
               relational back-end fast and reliable, and how to
               exploit different technologies such as massive
               SMP, optical jukeboxes, clustering, etc. Object
               databases are nowhere near this, and I do not
               expect them to get there in the short or medium term.
             • Relational databases have a very well-known and
               proven underlying mathematical theory, a simple one
               (set theory), that makes possible
                – automatic cost-based query optimization,
                – schema generation from high-level models and
                – many other features that are now vital for mission-critical
                  information systems development and operations.
• What is a relational database?
   – Sets of tables and links (the data)
   – A language to query the database (Structured
     Query Language)
   – A program to manage the data (RDBMS)
• Flat files are not relational
   –   Data type (attribute) is part of the data
   –   Record order matters
   –   Multiline records
   –   Massive duplication
        • E.g. Organism: Homo sapiens, Eukaryota, …
   – Some records are hierarchical
        • Xrefs
   – Records contain multiple “sub-records”
   – Implicit “Key”
The Benefits of Databases

                  • Redundancy can be reduced
                  • Inconsistency can be avoided
                  • Conflicting requirements can be
                    balanced
                  • Standards can be enforced
                  • Data can be shared
                  • Data independence
                  • Integrity can be maintained
                  • Security restrictions can be applied
Disadvantages

                •   size
                •   complexity
                •   cost
                •   Additional hardware costs
                •   Higher impact of failure
                •   Recovery more difficult
Relational Terminology



    CUSTOMER Table (Relation)


                         ID NAME                PHONE       EMP_ID
   Row (Tuple)       201   Unisports            55-2066101     12
                     202   Simms Atheletics     81-20101       14
                     203   Delhi Sports         91-10351       14
                     204   Womansport           1-206-104-0103 11

                           Column (Attribute)
Relational Database Terminology

• Each row of data in a table is uniquely identified by a primary key (PK)
• Information in multiple tables can be logically related by foreign keys (FK)




             Table Name: CUSTOMER                          Table Name: EMP

     ID    NAME               PHONE          EMP_ID   ID    LAST_NAME    FIRST_NAME
    201    Unisports          55-2066101     12       10    Havel        Marta
    202    Simms Atheletics   81-20101       14       11    Magee        Colin
    203    Delhi Sports       91-10351       14       12    Giljum       Henry
    204    Womansport         1-206-104-0103 11       14    Nguyen       Mai


          Primary Key            Foreign Key               Primary Key
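The two tables above can be recreated and joined in a few lines. This is a sketch using SQLite via Python's standard library (an assumption made for convenience; any RDBMS would do), with the rows copied from the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE emp (
        id         INTEGER PRIMARY KEY,          -- primary key of EMP
        last_name  TEXT,
        first_name TEXT
    );
    CREATE TABLE customer (
        id     INTEGER PRIMARY KEY,              -- primary key of CUSTOMER
        name   TEXT,
        phone  TEXT,
        emp_id INTEGER REFERENCES emp(id)        -- foreign key into EMP
    );
    INSERT INTO emp VALUES
        (10, 'Havel',  'Marta'),
        (11, 'Magee',  'Colin'),
        (12, 'Giljum', 'Henry'),
        (14, 'Nguyen', 'Mai');
    INSERT INTO customer VALUES
        (201, 'Unisports',        '55-2066101',     12),
        (202, 'Simms Atheletics', '81-20101',       14),
        (203, 'Delhi Sports',     '91-10351',       14),
        (204, 'Womansport',       '1-206-104-0103', 11);
""")

# Follow the foreign key from each customer to its employee.
rows = conn.execute("""
    SELECT customer.name, emp.first_name, emp.last_name
    FROM customer
    INNER JOIN emp ON customer.emp_id = emp.id
    ORDER BY customer.id
""").fetchall()
for row in rows:
    print(row)
```

Employee 14 appears for two customers, yet her name is stored only once in EMP; the foreign key is what links the rows.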
• RDBMS products
  – Free
    • MySQL: very fast, widely used, easy to
      jump into, but limited, non-standard SQL
    • PostgreSQL: full SQL, limited OO,
      higher learning curve than MySQL
  – Commercial
    • MS Access: great query builder, GUI
      interfaces
    • MS SQL Server: full SQL, NT only
    • Oracle: everything, including the kitchen
      sink
    • IBM DB2, Sybase
A simple datamodel (tables and relations)

                    Prot_id   name         seq          Species_id
                    1         GTM1_HUMAN   MGTDHG…      1
                    2         GTM1_RAT     MGHJADSW..   2
                    3         GTM2_HUMAN   MVSDBSVD..   1


                    Species_id   name    Full Lineage
                    1            human   Homo Sapiens …
                    2            rat     Rattus rattus
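A sketch of this data model in code, again assuming SQLite through Python's standard library. The lowercase table and column names are choices made for this example, and the sequence and lineage strings are the truncated placeholders from the slide, not real data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE species (
        species_id   INTEGER PRIMARY KEY,
        name         TEXT,
        full_lineage TEXT
    );
    CREATE TABLE protein (
        prot_id    INTEGER PRIMARY KEY,
        name       TEXT,
        seq        TEXT,
        species_id INTEGER REFERENCES species(species_id)
    );
    -- Placeholder values exactly as truncated on the slide.
    INSERT INTO species VALUES
        (1, 'human', 'Homo Sapiens …'),
        (2, 'rat',   'Rattus rattus');
    INSERT INTO protein VALUES
        (1, 'GTM1_HUMAN', 'MGTDHG…',    1),
        (2, 'GTM1_RAT',   'MGHJADSW..', 2),
        (3, 'GTM2_HUMAN', 'MVSDBSVD..', 1);
""")

# All proteins of one species, found through the species_id relation.
human_proteins = [name for (name,) in conn.execute("""
    SELECT protein.name
    FROM protein
    JOIN species ON protein.species_id = species.species_id
    WHERE species.name = 'human'
    ORDER BY protein.prot_id
""")]
print(human_proteins)
```

The lineage is stored once per species rather than once per protein, which is the duplication the flat-file format cannot avoid.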
Relational Database Fundamentals

                  • Basic SQL
                      –   SELECT
                      –   FROM
                      –   WHERE
                      –   JOIN – NATURAL, INNER, OUTER
                  • Other SQL functions
                      –   COUNT()
                      –   MAX(), MIN(), AVG()
                      –   DISTINCT
                      –   ORDER BY
                      –   GROUP BY
                      –   LIMIT
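The functions listed above can be tried on a toy table. The sketch below assumes SQLite through Python's standard library; the table, the protein names and the sequences are all invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE protein (name TEXT, species TEXT, seq TEXT);
    -- Hypothetical toy rows; the sequences are made up.
    INSERT INTO protein VALUES
        ('P1', 'human', 'MKT'),
        ('P2', 'human', 'MKTAYIAK'),
        ('P3', 'rat',   'MKTAY'),
        ('P4', 'rat',   'MK');
""")

# COUNT, MIN, MAX and AVG computed per species with GROUP BY.
per_species = conn.execute("""
    SELECT species, COUNT(*), MIN(LENGTH(seq)), MAX(LENGTH(seq)), AVG(LENGTH(seq))
    FROM protein
    GROUP BY species
    ORDER BY species
""").fetchall()

# DISTINCT removes the repeated species values.
species_names = [s for (s,) in conn.execute(
    "SELECT DISTINCT species FROM protein ORDER BY species")]

# ORDER BY plus LIMIT gives a "top N" query.
longest_two = [n for (n,) in conn.execute(
    "SELECT name FROM protein ORDER BY LENGTH(seq) DESC LIMIT 2")]

print(per_species)
print(species_names)
print(longest_two)
```

Note that LIMIT is supported by MySQL, PostgreSQL and SQLite but is not spelled the same way in every RDBMS.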
BioSQL
• Query: a request to retrieve data from
  a database is called a query

• eg. MyGPCRdb
  – Bioentry
  – Taxid (include full lineage)
  – Linking table (bioentry_tax)
MyGPCR;

Give me all GPCRs that are shorter than 1000 bp

select * from bioentry;
select count(*) from bioentry;
select * from bioentry inner join biosequence on
   bioentry.bioentry_id=biosequence.bioentry_id ;
select * from bioentry inner join biosequence on
   bioentry.bioentry_id=biosequence.bioentry_id
   where length(biosequence_str)<1000;
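The last query can be tried without a MySQL server. The sketch below is a stand-in, not the real MyGPCR database: it builds two toy tables in SQLite (via Python's standard library) that reuse only the column names visible in the queries above, adds an assumed `name` column, and fills them with synthetic sequences of known length.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Toy stand-ins for the two tables used in the queries above.
    CREATE TABLE bioentry (bioentry_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE biosequence (
        bioentry_id     INTEGER REFERENCES bioentry(bioentry_id),
        biosequence_str TEXT
    );
""")

# Hypothetical entries with synthetic sequences.
toy_entries = [
    (1, "toy_gpcr_short", "ACGT" * 100),   # 400 bases
    (2, "toy_gpcr_long",  "ACGT" * 400),   # 1600 bases
]
for entry_id, name, seq in toy_entries:
    conn.execute("INSERT INTO bioentry VALUES (?, ?)", (entry_id, name))
    conn.execute("INSERT INTO biosequence VALUES (?, ?)", (entry_id, seq))

# Same join and length filter as the query above, with the cutoff as a parameter.
max_length = 1000
short_entries = conn.execute("""
    SELECT bioentry.name, LENGTH(biosequence.biosequence_str)
    FROM bioentry
    INNER JOIN biosequence ON bioentry.bioentry_id = biosequence.bioentry_id
    WHERE LENGTH(biosequence.biosequence_str) < ?
""", (max_length,)).fetchall()
print(short_entries)
```

Passing the cutoff as a bound parameter (`?`) instead of pasting it into the SQL string keeps the query reusable for other thresholds.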
Example 3-tier model in biological database




Example of different interface to the same back-end database (MySQL)
           http://www.bioinformatics.be
Overview

              • Databases
                 – Flat files (FF)
                    • *.txt
                    • Indexed version
                 – Relational (RDBMS)
                    • Access, MySQL, PostgreSQL,
                      Oracle
                 – OO (OODBMS)
                    • AceDB, ObjectStore
                 – Hierarchical
                    • XML
                 – Frame based system
Overview




                    • Eg. DAML+OIL
                 – Hybrid systems
Object

         • The Object paradigm is already proven for application design and
           development, but it may simply not be an adequate paradigm for
           the data store.
         • Object databases are modelled by graphs. Graph theory plays a
           great role in computer science, but it is also a rich source of
           intractable problems, such as the NP-complete class: problems for
           which no computationally efficient solution is known, so there is
           no known way to escape exponential complexity. This is not a current
           technological limit. It is a limit inherent to the problem domain.

         • Hybrid Object-Relational databases will probably be the long term
           solution for the industry. They put a thin object layer above the
           relational structure, thus providing a syntax and semantics closer to
           the object oriented design and programming tools. They simply
           make it easier to build the data layer classes
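A rough sketch of what such a thin object layer might look like, assuming SQLite through Python's standard library. The class names `Protein` and `ProteinStore` and the table layout are invented for this illustration; real object-relational products and mapping libraries are far more elaborate.

```python
import sqlite3
from dataclasses import dataclass

@dataclass
class Protein:
    """Plain object handed to application code."""
    prot_id: int
    name: str
    species: str

class ProteinStore:
    """Thin object layer: callers get Protein objects, the SQL stays inside."""

    def __init__(self, conn):
        self.conn = conn

    def by_species(self, species):
        rows = self.conn.execute(
            """SELECT protein.prot_id, protein.name, species.name
               FROM protein JOIN species USING (species_id)
               WHERE species.name = ?
               ORDER BY protein.prot_id""",
            (species,),
        )
        return [Protein(*row) for row in rows]

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE species (species_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE protein (prot_id INTEGER PRIMARY KEY, name TEXT, species_id INTEGER);
    INSERT INTO species VALUES (1, 'human'), (2, 'rat');
    INSERT INTO protein VALUES
        (1, 'GTM1_HUMAN', 1), (2, 'GTM1_RAT', 2), (3, 'GTM2_HUMAN', 1);
""")

store = ProteinStore(conn)
humans = store.by_species("human")
print(humans)
```

The storage stays relational, with all its optimisation and integrity machinery, while the application works with objects.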
Conclusions

              •   A database is a central component of any
                  contemporary information system
              •   The operations on the database and the maintenance
                  of database consistency are handled by a DBMS
              •   There exist stand alone query languages or
                  embedded languages but both deal with definition
                  (DDL) and manipulation (DML) aspects
              •   The structural properties, constraints and operations
                  permitted within a DBMS are defined by a data
                  model - hierarchical, network, relational
              •   Recovery and concurrency control are essential
              •   Linking of heterogeneous data sources is a central theme
                  in modern bioinformatics
• How do you know which database
  exists ?

• NAR list

• Web links on Nexus
  – Searchable
  – Maintainable
• Tools available in public domain for
  simultaneous access
  – Entrez
  – SRS
• Batch queries for offload in local
  databases for subsequent analysis
  (see further)
• What if you want to search the
  complete human genome (golden path
  coordinates) instead of separate NCBI
  entries ?

• ENSEMBL
BioMart




          • Joint project between EBI and CSHL,
            http://www.biomart.org/
          • Aim is to develop a generic, query-oriented data
            management system capable of integrating
            distributed data sources
          • 3 step system:
             – Start by selecting a dataset to query
             – Filter this dataset by applying the appropriate filters
             – Generate the output by selecting the attributes and output
               format
          • Available public biomart websites:
            http://www.biomart.org/biomart/martview
BioMart - Single access point - Generic interface
BioMart - ‘Out of the box’ website
BioMart – 3 step system




                 Dataset
                 Attribute
                 Filter
BioMart - 3 step system




Attribute        Name, chromosome position, description
Dataset          for all Ensembl genes
Filter           located on chromosome 1, expressed in
                 lung, associated with human
                 homologues
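The dataset / filter / attribute idea can be mimicked in a few lines of plain Python. This is only a conceptual sketch and not the BioMart API: the dataset name `toy_genes`, the function `run_query` and all records are invented for illustration and are not Ensembl data.

```python
# Hypothetical toy "mart" with one dataset; all values are invented.
DATASETS = {
    "toy_genes": [
        {"name": "geneA", "chromosome": "1", "position": 1000,
         "description": "toy gene A", "tissue": "lung"},
        {"name": "geneB", "chromosome": "1", "position": 5000,
         "description": "toy gene B", "tissue": "liver"},
        {"name": "geneC", "chromosome": "2", "position": 7000,
         "description": "toy gene C", "tissue": "lung"},
    ]
}

def run_query(dataset, filters, attributes):
    """Mimic the three steps: pick a dataset, filter it, select attributes."""
    records = DATASETS[dataset]                                  # step 1: dataset
    kept = [r for r in records
            if all(r[field] == value
                   for field, value in filters.items())]        # step 2: filters
    return [tuple(r[a] for a in attributes) for r in kept]      # step 3: attributes

result = run_query(
    dataset="toy_genes",
    filters={"chromosome": "1", "tissue": "lung"},
    attributes=["name", "position", "description"],
)
print(result)
```

In relational terms the dataset corresponds to the FROM clause, the filters to WHERE and the attributes to the SELECT list.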
BioMart - EnsMart

                • The first in line was EnsMart, a powerful data
                  mining toolset for retrieving customized data sets
                  from annotated genomes. EnsMart integrates data
                  from Ensembl and various worldwide data sources.
                • EnsMart provides ....
                    –   Gene and protein annotation
                    –   Disease information
                    –   Cross-species analyses
                    –   SNPs affecting proteins
                    –   Allele frequency data
                    –   Retrieval by external identifiers
                    –   Retrieval by Gene Ontology
                    –   Customized sequence datasets
                    –   Microarray annotation tools
Other BioMart implementations

                • Other data resources also implemented
                  a BioMart interface:
                    –   Wormbase
                    –   Gramene
                    –   HapMap
                    –   DictyBase
                    –   euGenes
Single interface
BioBar


         • A toolbar for browsing biological data
           and databases
           http://biobar.mozdev.org/
         • The following databases are included:
           http://biobar.mozdev.org/Databases.html
         • A toolbar for Mozilla-based browsers,
           including Firefox and Netscape 7+
Weblems
          Weblems Online (example posting)

          W2.1: Which isolate of Tabac was used in record accession
            Z71230, and which human sample in the GenBank entry with
            accession AJ311677 ?
          W2.2: Find all structures of GFP in the Protein Data Bank and
            draw a histogram of their dates of deposition ?
          W2.3: What is the chromosomal location of the human gene for
            insulin ?
          W2.4: How many different human NHRs (nuclear hormone
            receptors) exist ? How many of these are single exon genes
            ? Are there any drugs working on this class of receptors ?
          W2.5: The gene for Berardinelli-Seip syndrome was initially
            localized between two markers on chromosome band 11q13:
            D11S4191 and D11S987.
            a. How many base pairs are there in the interval between
            these two markers ?
            b. How many known genes are there ?
            c. List the gene ontology terms for that region ?

FBW Lessons: Introduction to Bioinformatics Databases

  • 2. FBW 01-10-2012 Wim Van Criekinge
  • 5. Outline • Molecular Biology • Flat files “sequence” databases – DNA – Protein – Structure • Relational Databases – What ? – Why ? • Biological Relational Databases – Howto ?
  • 10. Flat Files What is a “flat file” ? • A flat file is data stored in a plain, ordinary file on the hard disk • Example: RefSeq – See CD-ROM – FILE: hs.GBFF • Hs: Homo sapiens • GBFF: GenBank Flat File format • (associated with TextPad; use a monospaced font, e.g. Courier)
  • 11. Sequence entries gene 10317..12529 /gene="ZK822.4" CDS join(10317..10375,10714..10821,10874..10912,10960..11013, 11061..11114,11169..11222,11346..11739,11859..11912, 11962..12195,12242..12529) /gene="ZK822.4" /codon_start=1 /protein_id="CAA98068.1" /db_xref="PID:g3881817" /db_xref="GI:3881817" /db_xref="SPTREMBL:Q23615" /translation="MHRHTYRKLYWNLGADGFSQGNADASVSAGSSGSNFLSGLQNSS FGQAVMGGINTYNQAKNSSGGNWQTAVANSSVGNFFQNGIDFFNGMKNGTQNFLDTDT IQETIGNSSFGEVVQTGVEFFNNIKNGNSPFQGDASSVMSQFVPFLANASAEAKAEFY TILPNFGNMTIAEFETAVNAWAAKYNLTDEVEAFNERSKNATVVAEEHANVVVMNLPN VLNNLKAISSDKNQTVVEMHTRMMAYVNSLDDDTRDIVFIFFRNLLPPQFKKSKCVDQ GNFLTNMYNKASDFFAGRNNRTDGEGSFWSGQGQNGNSGGSGFSSFFNNFNGQGNGNG NGAQNPMIGMFNNFMKKNNITADEANAAMADGGASIQILPAISAGWGDVAQVKIGGDF KIAVEEETKTTKKNKKQQQQANKNKNKNKKKTTIAPEAAIDANIAAEVHTQVL"
  • 12. Nucleotide Databases • EMBL Nucleotide Sequence Database (European Molecular Biology Laboratory) http://www.ebi.ac.uk/ebi_docs/embl_db/ebi/topembl.html • GenBank at NCBI (National Center for Biotechnology Information) http://www.ncbi.nlm.nih.gov/Web/Genbank/index.html • DDBJ (DNA Data Bank of Japan) http://www.ddbj.nig.ac.jp/ – DDBJ, the Center for operating DDBJ, National Institute of Genetics (NIG), Japan, established in April 1995 • GenBank statistics: http://www.ncbi.nlm.nih.gov/Genbank/genbankstats.html • Release Notes (ftp://ftp.ncbi.nih.gov/genbank/gbrel.txt): Genetic Sequence Data Bank, August 15 2003, NCBI-GenBank Flat File Release 137.0, Distribution Release Notes: 33 865 022 251 bases, from 27 213 748 reported sequences
  • 13. GenBank Format LOCUS LISOD 756 bp DNA BCT 30-JUN-1993 DEFINITION L.ivanovii sod gene for superoxide dismutase. ACCESSION X64011.1 GI:37619753 NID g44010 KEYWORDS sod gene; superoxide dismutase. SOURCE Listeria ivanovii. ORGANISM Listeria ivanovii Eubacteria; Firmicutes; Low G+C gram-positive bacteria; Bacillaceae; Listeria. REFERENCE 1 (bases 1 to 756) AUTHORS Haas,A. and Goebel,W. TITLE Cloning of a superoxide dismutase gene from Listeria ivanovii by functional complementation in Escherichia coli and characterization of the gene product JOURNAL Mol. Gen. Genet. 231 (2), 313-322 (1992) MEDLINE 92140371 REFERENCE 2 (bases 1 to 756) AUTHORS Kreft,J. TITLE Direct Submission JOURNAL Submitted (21-APR-1992) J. Kreft, Institut f. Mikrobiologie, Universitaet Wuerzburg, Biozentrum Am Hubland, 8700 Wuerzburg, FRG
  • 14. FEATURES Location/Qualifiers source 1..756 /organism="Listeria ivanovii" /strain="ATCC 19119" /db_xref="taxon:1638" RBS 95..100 /gene="sod" gene 95..746 /gene="sod" CDS 109..717 /gene="sod" /EC_number="1.15.1.1" /codon_start=1 /product="superoxide dismutase" /db_xref="PID:g44011" /db_xref="SWISS-PROT:P28763" /transl_table=11 /translation="MTYELPKLPYTYDALEPNFDKETMEIHYTKHHNIYVTKL NEAVSGHAELASKPGEELVANLDSVPEEIRGAVRNHGGGHANHTLFWSSLSPN GGGAPTGNLKAAIESEFGTFDEFKEKFNAAAAARFGSGWAWLVVNNGKLEIVS TANQDSPLSEGKTPVLGLDVWEHAYYLKFQNRRPEYIDTFWNVINWDERNKRF DAAK" terminator 723..746 /gene="sod"
  • 15. Example of location descriptors
      Location        Description
      476             Points to a single base in the presented sequence
      340..565        Points to a continuous range of bases bounded by and including the starting and ending bases
      <345..500       The exact lower boundary point of a feature is unknown
      (102.110)       Indicates that the exact location is unknown but that it is one of the bases between bases 102 and 110
      (23.45)..600    Specifies that the starting point is one of the bases between bases 23 and 45, inclusive, and the end point is base 600
      123^124         Points to a site between bases 123 and 124
      145^177         Points to a site anywhere between bases 145 and 177
      J00193:hladr    Points to a feature whose location is described in another entry: the feature labeled 'hladr' in the entry (in this database) with primary accession 'J00193'
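A minimal Python sketch of how the simplest location descriptors could be turned into coordinates. The function names `parse_location` and `feature_length` are hypothetical, and only single bases, plain `a..b` ranges, and `join(...)` lists are handled; fuzzy forms such as `<345..500` or `(102.110)` are deliberately rejected rather than guessed at.

```python
import re


def parse_location(location: str) -> list[tuple[int, int]]:
    """Return (start, end) pairs, 1-based and inclusive, for simple locations."""
    text = location.replace(" ", "")
    # Unwrap join(...) so that each exon-like span is parsed on its own.
    if text.startswith("join(") and text.endswith(")"):
        text = text[len("join("):-1]
    spans = []
    for part in text.split(","):
        match = re.fullmatch(r"(\d+)(?:\.\.(\d+))?", part)
        if match is None:
            raise ValueError(f"unsupported location: {part!r}")
        start = int(match.group(1))
        # A single base such as "476" starts and ends at the same position.
        end = int(match.group(2)) if match.group(2) else start
        spans.append((start, end))
    return spans


def feature_length(location: str) -> int:
    """Total number of bases covered by a simple location descriptor."""
    return sum(end - start + 1 for start, end in parse_location(location))
```

For instance, `feature_length("109..717")` gives the length in bases of the `sod` CDS shown in the GenBank example, and a `join(...)` location sums the lengths of its parts.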
  • 16. BASE COUNT 247 a 136 c 151 g 222 t ORIGIN 1 cgttatttaa ggtgttacat agttctatgg aaatagggtc tatacctttc gccttacaat 61 gtaatttctt ttcacataaa taataaacaa tccgaggagg aatttttaat gacttacgaa 121 ttaccaaaat taccttatac ttatgatgct ttggagccga attttgataa agaaacaatg 181 gaaattcact atacaaagca ccacaatatt tatgtaacaa aactaaatga agcagtctca 241 ggacacgcag aacttgcaag taaacctggg gaagaattag ttgctaatct agatagcgtt 301 cctgaagaaa ttcgtggcgc agtacgtaac cacggtggtg gacatgctaa ccatacttta 361 ttctggtcta gtcttagccc aaatggtggt ggtgctccaa ctggtaactt aaaagcagca 421 atcgaaagcg aattcggcac atttgatgaa ttcaaagaaa aattcaatgc ggcagctgcg 481 gctcgttttg gttcaggatg ggcatggcta gtagtgaaca atggtaaact agaaattgtt 541 tccactgcta accaagattc tccacttagc gaaggtaaaa ctccagttct tggcttagat 601 gtttgggaac atgcttatta tcttaaattc caaaaccgtc gtcctgaata cattgacaca 661 ttttggaatg taattaactg ggatgaacga aataaacgct ttgacgcagc aaaataatta 721 tcgaaaggct cacttaggtg ggtcttttta tttcta //
  • 17. EMBL format ID LISOD standard; DNA; PRO; 756 BP. IDentification XX AC X64011; S78972; Accession (Axxxxx, Afxxxxxx), GUID XX NI g44010 Nucleotide Identifier --> x.x XX DT 28-APR-1992 (Rel. 31, Created) DaTe DT 30-JUN-1993 (Rel. 36, Last updated, Version 6) XX DE L.ivanovii sod gene for superoxide dismutase DEscription XX. KW sod gene; superoxide dismutase. KeyWord XX OS Listeria ivanovii Organism Species OC Eubacteria; Firmicutes; Low G+C gram-positive bacteria; Bacillaceae; OC Listeria. Organism Classification XX RN [1] RA Haas A., Goebel W.; Reference RT "Cloning of a superoxide dismutase gene from Listeria ivanovii by RT functional complementation in Escherichia coli and RT characterization of the gene product."; RL Mol. Gen. Genet. 231:313-322(1992). XX
  • 18. GenBank, EMBL & DDBJ: Comments • Collaboration GenBank/EMBL/DDBJ – Effort: identical within 24 hours • Redundant information • Historical graveyard – BankIt (responsibility of the submitter) – Version conflicts • Idiosyncratic (peculiar to the individual) – Heterogeneous annotation – No consistent quality check • Vectors, sequence errors, etc.
  • 19. Other Genbank Formats • ASN1 – Computer friendly, human unfriendly • FASTA – Brief, loses information – Easy to use – Compatible with multiple sequences
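Because FASTA is just a header line starting with `>` followed by sequence lines, a parser fits in a few lines. The sketch below is an illustration in Python, not a replacement for an established library; the function name `parse_fasta` and the second toy record are assumptions, while the first record reuses the first 20 bases of the LISOD entry shown earlier.

```python
def parse_fasta(text: str) -> dict[str, str]:
    """Map each FASTA header (without '>') to its concatenated sequence."""
    records: dict[str, list[str]] = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header starts a new record; annotation beyond it is lost.
            header = line[1:]
            records[header] = []
        elif header is not None:
            records[header].append(line)
    return {name: "".join(chunks) for name, chunks in records.items()}


example = """>LISOD L.ivanovii sod gene (first 20 bases)
cgttatttaa
ggtgttacat
>toy second record
ATGC
"""

sequences = parse_fasta(example)
```

Several records can sit in one file, which is what makes the format compatible with multiple sequences.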
  • 20. Web Query tools & Programming Query tools • NCBI website example: – http://www.ncbi.nlm.nih.gov/entrez/query/static/advancedentrez.html • EBI UniProtKB website example: – http://www.ebi.ac.uk/uniprot/index.html – http://www.ebi.uniprot.org/search/SearchTools.shtml
  • 21. batch download (ftp server) • Data available via website is most of the time also available via an ftp server to download a complete batch. • Examples: –ftp://ftp.ncbi.nih.gov/ –ftp://ftp.ebi.ac.uk/pub/
  • 22. Sequence file format tips • When saving a sequence for use in an email message or pasting into a web page, use an unannotated text format such as FASTA • When retrieving from a database or exchanging between programs, use an annotated text format such as Genbank • When using sequence again with the same program, use that program’s annotated binary format (or annotated text if binary not available) – Asn-1 (NCBI) – Gbff (sanger) – XML
  • 23. Expressed Sequence Tags • Sequence that codes for protein is < 5% of the genome. • Coding sequence can be obtained from mRNA by reverse transcription. • Tags for that sequence can be obtained by end-sequencing. • Incyte and HGS gambled on this being the useful part: – Search for homologies to known proteins, motifs. – Search for changed levels of expression and tissue specificity (“virtual/electronic northern” used in GeneCards) • ESTs have driven the huge expansion of GenBank: – UniGene now contains some sequence from most genes. – > 4,000,000 human EST sequences – http://www.ncbi.nlm.nih.gov/dbEST/
  • 24. dbEST release 100303 Summary by Organism - October 3, 2003 Number of public entries: 18,762,324
      Homo sapiens (human)                        5,426,001
      Mus musculus + domesticus (mouse)           3,881,878
      Rattus sp. (rat)                              538,073
      Triticum aestivum (wheat)                     500,898
      Ciona intestinalis                            492,488
      Gallus gallus (chicken)                       451,565
      Zea mays (maize)                              383,416
      Danio rerio (zebrafish)                       362,362
      Hordeum vulgare + subsp. vulgare (barley)     348,233
      Xenopus laevis (African clawed frog)          344,695
      Glycine max (soybean)                         341,573
      Bos taurus (cattle)                           322,074
      Drosophila melanogaster (fruit fly)           261,404
  • 25. Traces <-> strings • Traces contain much more information – TraceDB: http://www.ncbi.nlm.nih.gov/Traces/ Example
  • 26. Traces <-> strings • Phred – base calling, vector trimming, end of sequence read trimming • Phrap – Phrap uses Phred’s base calling scores to determine the consensus sequences. Phrap examines all individual sequences at a given position, and uses the highest scoring sequence (if it exists) to extend the consensus sequence • Consed – graphical interface extension that controls both Phred and Phrap
  • 27. What is Phred? • Phred is a program that reads the base trace, makes base calls, and assigns quality values (qv) to the bases in the sequence. • It then writes the base calls and qv to output files that will be used for Phrap assembly. The qv will be useful for consensus sequence construction. • For example, ATGCATGC string1 ATTCATGC string2 AT-CATGC superstring • Here we have a mismatch between ‘G’ and ‘T’; the qv determines which base fills the dash in the superstring: the base with the higher qv replaces the dash.
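The mismatch rule from this slide can be sketched in Python as follows. This is a simplified illustration, not Phrap's actual algorithm: it assumes the two reads are already aligned and of equal length, the function name `consensus_by_quality` is hypothetical, and the quality values are invented for the example.

```python
def consensus_by_quality(read1: str, qual1: list[int],
                         read2: str, qual2: list[int]) -> str:
    """At each position keep the base call with the higher quality value."""
    if not (len(read1) == len(qual1) == len(read2) == len(qual2)):
        raise ValueError("reads and quality lists must have equal length")
    consensus = []
    for base1, q1, base2, q2 in zip(read1, qual1, read2, qual2):
        # Agreeing calls are kept as-is; ties on a mismatch favour read1.
        if base1 == base2 or q1 >= q2:
            consensus.append(base1)
        else:
            consensus.append(base2)
    return "".join(consensus)


# Illustrative qualities: the 'G' in string1 is called with qv 40,
# the conflicting 'T' in string2 with qv 15.
string1, qual1 = "ATGCATGC", [30, 30, 40, 30, 30, 30, 30, 30]
string2, qual2 = "ATTCATGC", [30, 30, 15, 30, 30, 30, 30, 30]
superstring = consensus_by_quality(string1, qual1, string2, qual2)
```

With these values the dash in `AT-CATGC` is filled by `G`; swapping the two quality values would fill it with `T` instead.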
  • 28. How does Phred calculate qv? • From the base trace, Phred knows the number of peaks and the actual peak locations. • Phred predicts peak locations. • Phred reads the actual peak locations from the base trace. • Phred matches the actual locations with the predicted locations using dynamic programming. • The qv is related to the base call error probability (ep) by the formula qv = -10*log_10(ep) • Example: an error probability of 1 in 10,000 gives qv 40
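The relation between quality value and error probability is easy to check numerically. The two helper names below are hypothetical; the formula itself is the one given on the slide.

```python
import math


def error_probability_to_qv(ep: float) -> float:
    """qv = -10 * log10(ep), as stated on the slide."""
    return -10 * math.log10(ep)


def qv_to_error_probability(qv: float) -> float:
    """Inverse of the formula above: ep = 10 ** (-qv / 10)."""
    return 10 ** (-qv / 10)


# The slide's example: one error in 10,000 base calls.
qv_example = error_probability_to_qv(1 / 10000)
```

Each step of 10 in qv corresponds to a tenfold drop in error probability, so qv 20 means one expected error per 100 calls and qv 30 one per 1,000.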
  • 29. Why Phred? • Output sequence might contain errors. • Vector contamination might occur. • Dye-terminator reaction might not occur. • Segment migration abnormal in gel electrophoresis. • Weak or variable signal strength of peak corresponding to a base.
  • 31. End of Sequence Cropping • It is common that the end of sequencing reads have poor data. This is due to the difficulties in resolving larger fragment ~1kb (it is easier to resolve 21bp from 20bp than it is to resolve 1001bp from 1000bp). • Phred assigns a non-value of ‘x’ to this data by comparing peak separation and peak intensity to internal standards. If the standard threshold score is not reached, the data will not be used.
  • 32. Traces <-> strings • Handle traces – Abi-view EMBOSS – Bioedit – Acembly, … • EXAMPLE
  • 33. NCBI reference sequences RefSeq database is a non-redundant set of reference standards that includes chromosomes, complete genomic molecules, intermediate assembled genomic contigs, curated genomic regions, mRNAs, RNAs, and proteins.
  • 34. RefSeq nomenclature • NC_#### complete genomic • NG_#### incomplete genomic • NM_#### mRNA • NR_#### noncoding transcripts • NP_#### proteins • NT_#### intermediate genomic contigs
  • 35. RefSeq nomenclature - models • XM_#### mRNA • XR_#### RNA • XP_#### protein • Automated Homo sapiens models provided by the Genome Annotation process; sequence corresponds to the genomic contig.
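A small Python sketch that maps the RefSeq accession prefixes from the two nomenclature slides to their molecule types. The function name `classify_refseq` is hypothetical, and the accession numbers used with it are made-up placeholders rather than real records.

```python
REFSEQ_PREFIXES = {
    "NC": "complete genomic",
    "NG": "incomplete genomic",
    "NM": "mRNA",
    "NR": "noncoding transcript",
    "NP": "protein",
    "NT": "intermediate genomic contig",
    "XM": "model mRNA",
    "XR": "model RNA",
    "XP": "model protein",
}


def classify_refseq(accession: str):
    """Return the molecule type for a RefSeq-style accession, else None."""
    prefix, separator, _ = accession.partition("_")
    # RefSeq accessions carry an underscore after the two-letter prefix.
    if not separator:
        return None
    return REFSEQ_PREFIXES.get(prefix.upper())
```

Here `classify_refseq("NM_123456")` would report an mRNA, while an accession without the underscore, such as the GenBank accession `X64011`, is not treated as RefSeq.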
  • 39. Open reading frame • Definition: – A stretch of triplet codons with an initiator codon at one end and a stop codon at the other, as identifiable by nucleotide sequences. • Example – http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=nucleotide&list_uids=6688473&dopt=GenBank&term=Y18948.1&qty=1
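The definition above can be sketched as a scan for ATG ... stop in each reading frame. This Python illustration makes several simplifying assumptions: it looks at the forward strand only, uses ATG as the only initiator and the standard stop codons, and reports 1-based inclusive coordinates; the name `find_orfs` and the `min_codons` parameter are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna: str, min_codons: int = 2) -> list[tuple[int, int]]:
    """Return (start, end) of forward-strand ORFs, 1-based and inclusive."""
    seq = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first initiator codon seen in this frame
            elif start is not None and codon in STOP_CODONS:
                # Length counts the start and stop codons themselves.
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start + 1, i + 3))
                start = None
    return orfs
```

On the toy sequence `CCATGAAATAGCC` this reports a single ORF from base 3 to base 11 (ATG AAA TAG). A real ORF finder would also scan the reverse complement.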
  • 40. Protein sequence database SWISS-PROT & TREMBL SwissProt - http://expasy.hcuge.ch/sprot/ SWISS-PROT is an annotated protein sequence database The sequences are translated from the EMBL Nucleotide Sequence Database Sequence entries are composed of different lines. For standardization purposes the format of SWISS-PROT follows as closely as possible that of the EMBL Nucleotide Sequence Database. Continuously updated (daily).
  • 41. Different Features of SWISS-PROT • Format follows as closely as possible that of EMBL’s • Curated protein sequence database • Three differences: 1. Strives to provide a high level of annotations 2. Minimal level of redundancy 3. High level of integration with other databases
  • 42. 1. Annotation • Three kinds of core data: the sequence data; the citation information (bibliographical references); and the taxonomic data (description of the biological source of the protein) • Annotation covers items such as protein functions, post-translational modifications, domains and sites, secondary structure, quaternary structure, similarities to other proteins, diseases associated with deficiencies in the protein, sequence conflicts, variants, etc.
  • 43. 2. Minimal Redundancy Many sequence databases contain, for a given protein sequence, separate entries which correspond to different literature reports. SWISS-PROT tries as much as possible to merge all these data so as to minimize the redundancy. If conflicts exist between various sequencing reports, they are indicated in the feature table of the corresponding entry.
  • 44. 3. Integration With Other Databases • SWISS-PROT and TrEMBL - Protein sequences • PROSITE - Protein families and domains • SWISS-2DPAGE - Two-dimensional polyacrylamide gel electrophoresis • SWISS-3DIMAGE - 3D images of proteins and other biological macromolecules • SWISS-MODEL Repository - Automatically generated protein models • CD40Lbase - CD40 ligand defects • ENZYME - Enzyme nomenclature • SeqAnalRef - Sequence analysis bibliographic references
  • 45. TREMBL - http://expasy.hcuge.ch/sprot/ Translated EMBL sequences not (yet) in Swissprot. Updated faster than SWISS-PROT. TREMBL has two parts: 1. SP-TREMBL – Will eventually be incorporated into Swissprot – Divided into FUN, HUM, INV, MAM, MHC, ORG, PHG, PLN, PRO, ROD, UNC, VRL and VRT. 2. REM-TREMBL (remaining) – Will NOT be incorporated into Swissprot – Divided into: immunoglobulins and T-cell receptors, synthetic sequences, patent application sequences, small fragments, CDS not coding for real proteins
  • 46. SWISS-PROT/TrEMBL • TrEMBL is a computer-annotated supplement of SWISS-PROT that contains all the translations of EMBL nucleotide sequence entries not yet integrated in SWISS-PROT • SWISS-PROT Release 39.15 of 19-Mar-2001: 94,152 entries • TrEMBL Release 16.2 of 23-Mar-2001: 436,924 entries
  • 47. Example of a SwissProt entry ID TNFA_HUMAN STANDARD; PRT; 233 AA. IDentification AC P01375; ACcession DT 21-JUL-1986 (REL. 01, CREATED) DaTe DT 21-JUL-1986 (REL. 01, LAST SEQUENCE UPDATE) DT 15-JUL-1998 (REL. 36, LAST ANNOTATION UPDATE) DE TUMOR NECROSIS FACTOR PRECURSOR (TNF-ALPHA) (CACHECTIN). GN TNFA. Gene name OS HOMO SAPIENS (HUMAN). Organism Species OC EUKARYOTA; METAZOA; CHORDATA; VERTEBRATA; TETRAPODA; MAMMALIA; OC EUTHERIA; PRIMATES. Organism Classification RN [1] Reference RP SEQUENCE FROM N.A. RX MEDLINE; 87217060. RA NEDOSPASOV S.A., SHAKHOV A.N., TURETSKAYA R.L., METT V.A., RA AZIZOV M.M., GEORGIEV G.P., KOROBKO V.G., DOBRYNIN V.N., RA FILIPPOV S.A., BYSTROV N.S., BOLDYREVA E.F., CHUVPILO S.A., RA CHUMAKOV A.M., SHINGAROVA L.N., OVCHINNIKOV Y.A.; RL COLD SPRING HARB. SYMP. QUANT. BIOL. 51:611-624(1986). RN [2] RP SEQUENCE FROM N.A. RX MEDLINE; 85086244. RA PENNICA D., NEDWIN G.E., HAYFLICK J.S., SEEBURG P.H., DERYNCK R., RA PALLADINO M.A., KOHR W.J., AGGARWAL B.B., GOEDDEL D.V.; RL NATURE 312:724-729(1984). ...
  • 48. CC -!- FUNCTION: CYTOKINE WITH A WIDE VARIETY OF FUNCTIONS: IT CAN CC CAUSE CYTOLYSIS OF CERTAIN TUMOR CELL LINES, IT IS IMPLICATED CC IN THE INDUCTION OF CACHEXIA, IT IS A POTENT PYROGEN CAUSING CC FEVER BY DIRECT ACTION OR BY STIMULATION OF IL-1 SECRETION, IT CC CAN STIMULATE CELL PROLIFERATION & INDUCE CELL DIFFERENTIATION CC UNDER CERTAIN CONDITIONS. Comments CC -!- SUBUNIT: HOMOTRIMER. CC -!- SUBCELLULAR LOCATION: TYPE II MEMBRANE PROTEIN. ALSO EXISTS AS CC AN EXTRACELLULAR SOLUBLE FORM. CC -!- PTM: THE SOLUBLE FORM DERIVES FROM THE MEMBRANE FORM BY CC PROTEOLYTIC PROCESSING. CC -!- DISEASE: CACHEXIA ACCOMPANIES A VARIETY OF DISEASES, INCLUDING CC CANCER AND INFECTION, AND IS CHARACTERIZED BY GENERAL ILL CC HEALTH AND MALNUTRITION. CC -!- SIMILARITY: BELONGS TO THE TUMOR NECROSIS FACTOR FAMILY. DR EMBL; X02910; G37210; -. Database Cross-references DR EMBL; M16441; G339741; -. DR EMBL; X01394; G37220; -. DR EMBL; M10988; G339738; -. DR EMBL; M26331; G339764; -. DR EMBL; Z15026; G37212; -. DR PIR; B23784; QWHUN. DR PIR; A44189; A44189. DR PDB; 1TNF; 15-JAN-91. DR PDB; 2TUN; 31-JAN-94.
  • 49. KW CYTOKINE; CYTOTOXIN; TRANSMEMBRANE; GLYCOPROTEIN; SIGNAL-ANCHOR; KW MYRISTYLATION; 3D-STRUCTURE. KeyWord FT PROPEP 1 76 Feature Table FT CHAIN 77 233 TUMOR NECROSIS FACTOR. FT TRANSMEM 36 56 SIGNAL-ANCHOR (TYPE-II PROTEIN). FT LIPID 19 19 MYRISTATE. FT LIPID 20 20 MYRISTATE. FT DISULFID 145 177 FT MUTAGEN 105 105 L->S: LOW ACTIVITY. FT MUTAGEN 108 108 R->W: BIOLOGICALLY INACTIVE. FT MUTAGEN 112 112 L->F: BIOLOGICALLY INACTIVE. FT MUTAGEN 162 162 S->F: BIOLOGICALLY INACTIVE. FT MUTAGEN 167 167 V->A,D: BIOLOGICALLY INACTIVE. FT MUTAGEN 222 222 E->K: BIOLOGICALLY INACTIVE. FT CONFLICT 63 63 F -> S (IN REF. 5). FT STRAND 89 93 FT TURN 99 100 FT TURN 109 110 FT STRAND 112 113 FT TURN 115 116 FT STRAND 118 119 FT STRAND 124 125
  • 50. FT STRAND 130 143 FT STRAND 152 159 FT STRAND 166 170 FT STRAND 173 174 FT TURN 183 184 FT STRAND 189 202 FT TURN 204 205 FT STRAND 207 212 FT HELIX 215 217 FT STRAND 218 218 FT STRAND 227 232 SQ SEQUENCE 233 AA; 25644 MW; 666D7069 CRC32; MSTESMIRDV ELAEEALPKK TGGPQGSRRC LFLSLFSFLI VAGATTLFCL LHFGVIGPQR EEFPRDLSLI SPLAQAVRSS SRTPSDKPVA HVVANPQAEG QLQWLNRRAN ALLANGVELR DNQLVVPSEG LYLIYSQVLF KGQGCPSTHV LLTHTISRIA VSYQTKVNLL SAIKSPCQRE TPEGAEAKPW YEPIYLGGVF QLEKGDRLSA EINRPDYLDF AESGQVYFGI IAL //
  • 51. Protein searching 3 levels of protein searching: 1. Swissprot: little noise, annotated entries 2. Swissprot + TREMBL: more noisy, all probable entries 3. Translated EMBL (tblast or tfasta): most noisy, all possible entries
  • 52. New initiatives • IPI: International Protein Index – http://www.ebi.ac.uk/IPI/IPIhelp.html • UNIPROT: Universal Protein Knowledgebase – http://www.pir.uniprot.org/ • HPRD: Human Protein Reference Database – http://www.hprd.org/
  • 53. UniProt UniProt Consortium • European Bioinformatics Institute (EBI) • Swiss Institute of Bioinformatics (SIB) • Protein Information Resource (PIR) Uniprot Databases •UniProt Knowledgebase (UniProtKB) •UniProt Reference Clusters (UniRef) •UniProt Archive (UniParc) UniprotKB •Swiss-Prot (annotated protein sequence db, golden standard) •trEMBL (translated EMBL + automated electronic annotations)
  • 54. Understanding molecular structure is critical to understanding biology because structure determines function
  • 57. From Structure to Function • the drug morphine has chemical groups that are functionally equivalent to the natural endorphins found in the human body • the receptor molecules located at the synapse (between two neurons) bind morphine much the same way as endorphins • therefore, morphine is able to attenuate the pain response
  • 58. Structure databases • Protein Data Bank (PDB) - http://www.rcsb.org/pdb • Diffraction: 7373 structures determined by X-ray diffraction • NMR: 388 structures determined by NMR spectroscopy • Theoretical Model: 201 structures proposed by modeling
  • 59. • The PDB is an archive of three-dimensional structures of proteins and of some nucleic acids • The PDB is operated by the RCSB (Research Collaboratory for Structural Bioinformatics), funded by NSF, DOE, and two units of NIH: NIGMS (National Institute of General Medical Sciences) and NLM (National Library of Medicine). • Established at BNL (Brookhaven National Laboratory) in 1971 as an archive for biological macromolecular crystal structures • In the 1980s, the number of deposited structures began to increase dramatically. • In October 1998, the management of the PDB became the responsibility of the RCSB. • Website http://www.rcsb.org
  • 60. PDB Holdings List: 27-Mar-2001
      Exp. Tech.                   Proteins, Peptides,  Protein/Nucleic   Nucleic  Carbohydrates  Total
                                   and Viruses          Acid Complexes    Acids
      X-ray Diffraction and other  11045                526               552      14             12137
      NMR                          1832                 71                366      4              2273
      Theoretical Modeling         281                  19                21       0              321
      Total                        13158                616               939      18             14731
      5032 Structure Factor Files; 968 NMR Restraint Files
  • 62. PDB Growth in New Folds
  • 63. Other structure databases • BioMagResBank http://www.bmrb.wisc.edu/ – A repository for data from NMR spectroscopy on proteins, peptides, and nucleic acids • Biological Macromolecule Crystallization Database (BMCD) http://h178133.carb.nist.gov:4400/bmcd/bmcd.html – Contains crystal data and the crystallization conditions, which have been compiled from literature • Nucleic Acid Database (NDB) http://ndbserver.rutgers.edu:80/ – Assembles and distributes structural information about nucleic acids • Structural Classification of Proteins (SCOP) http://scop.mrc-lmb.cam.ac.uk/scop/ – Structure similarity search; hierarchic organization • MOOSE http://db2.sdsc.edu/moose/ – Macromolecular structure query • Cambridge Structural Database (CSD) http://www.ccdc.cam.ac.uk/ – Small molecules
  • 65. Protein Splicing? • Protein splicing is defined as the excision of an intervening protein sequence (the INTEIN) from a protein precursor and the concomitant ligation of the flanking protein fragments (the EXTEINS) to form a mature extein protein and the free intein • http://www.neb.com/inteins/intein_intro.html
  • 66. Biological databases • NAR Database Issue – Every year: NAR DB Issue – The 2006 update includes 858 databases – Citation top 5 are: • Pfam • Gene Ontology • UniProt • SMART • KEGG – Primary Nucleotide DB’s and PDB are not cited anymore
  • 67. Outline • Molecular Biology • Flat files “sequence” databases – DNA – Protein – Structure • Relational Databases – What ? – Why ? • Biological Relational Databases – Howto ?
  • 68. Why biological databases ? • Explosive growth in biological data • Data (sequences, 3D structures, 2D gel analysis, MS analysis….) are no longer published in a conventional manner, but directly submitted to databases • Essential tools for biological research, as classical publications used to be !
  • 69. Problems with Flat files … • Wasted storage space • Wasted processing time • Data control problems • Problems caused by changes to data structures • Access to data difficult • Data out of date • Constraints are system based • Limited querying eg. all single exon GPCRs (<1000 bp)
  • 70. Relational • The Relational model is not only very mature, but it has developed a strong knowledge on how to make a relational back-end fast and reliable, and how to exploit different technologies such as massive SMP, Optical jukeboxes, clustering and etc. Object databases are nowhere near to this, and I do not expect then to get there in the short or medium term. • Relational Databases have a very well-known and proven underlying mathematical theory, a simple one (the set theory) that makes possible – automatic cost-based query optimization, – schema generation from high-level models and – many other features that are now vital for mission-critical Information Systems development and operations.
  • 71. • What is a relational database ? – Sets of tables and links (the data) – A language to query the database (Structured Query Language) – A program to manage the data (RDBMS) • Flat files are not relational – Data type (attribute) is part of the data – Record order matters – Multiline records – Massive duplication • E.g. Organism: Homo sapiens, Eukaryota, … – Some records are hierarchical • Xrefs – Records contain multiple “sub-records” – Implicit “Key”
  • 72. The Benefits of Databases • Redundancy can be reduced • Inconsistency can be avoided • Conflicting requirements can be balanced • Standards can be enforced • Data can be shared • Data independence • Integrity can be maintained • Security restrictions can be applied
  • 73. Disadvantages • size • complexity • cost • Additional hardware costs • Higher impact of failure • Recovery more difficult
  • 74. Relational Terminology CUSTOMER Table (Relation) ID NAME PHONE EMP_ID Row (Tuple) 201 Unisports 55-2066101 12 202 Simms Atheletics 81-20101 14 203 Delhi Sports 91-10351 14 204 Womansport 1-206-104-0103 11 Column (Attribute)
  • 75. Relational Database Terminology • Each row of data in a table is uniquely identified by a primary key (PK) • Information in multiple tables can be logically related by foreign keys (FK) Table Name: CUSTOMER Table Name: EMP ID NAME PHONE EMP_ID ID LAST_NAME FIRST_NAME 201 Unisports 55-2066101 12 10 Havel Marta 202 Simms Atheletics 81-20101 14 11 Magee Colin 203 Delhi Sports 91-10351 14 12 Giljum Henry 204 Womansport 1-206-104-0103 11 14 Nguyen Mai Primary Key Foreign Key Primary Key
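The primary-key and foreign-key relationship between the CUSTOMER and EMP tables can be tried out with Python's built-in `sqlite3` module. This is a sketch under assumptions: the slides do not name a database engine for this example, the column types are guesses, and SQLite only enforces foreign keys once `PRAGMA foreign_keys = ON` is set.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves FK checks off by default
conn.executescript("""
CREATE TABLE emp (
    id         INTEGER PRIMARY KEY,
    last_name  TEXT,
    first_name TEXT
);
CREATE TABLE customer (
    id     INTEGER PRIMARY KEY,
    name   TEXT,
    phone  TEXT,
    emp_id INTEGER REFERENCES emp(id)   -- foreign key into EMP
);
INSERT INTO emp VALUES
    (10, 'Havel', 'Marta'), (11, 'Magee', 'Colin'),
    (12, 'Giljum', 'Henry'), (14, 'Nguyen', 'Mai');
INSERT INTO customer VALUES
    (201, 'Unisports', '55-2066101', 12),
    (202, 'Simms Atheletics', '81-20101', 14),
    (203, 'Delhi Sports', '91-10351', 14),
    (204, 'Womansport', '1-206-104-0103', 11);
""")

# Follow the foreign key from each customer to the employee it refers to.
rows = conn.execute("""
    SELECT customer.name, emp.last_name
    FROM customer
    JOIN emp ON customer.emp_id = emp.id
    ORDER BY customer.id
""").fetchall()
```

Inserting a customer whose `emp_id` has no matching row in `emp` is rejected with `sqlite3.IntegrityError`, which is the kind of integrity check a flat file cannot offer.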
  • 76. • RDBMS products – Free • MySQL: very fast, widely used, easy to jump into, but limited, non-standard SQL • PostgreSQL: full SQL, limited OO, higher learning curve than MySQL – Commercial • MS Access: great query builder, GUI interfaces • MS SQL Server: full SQL, NT only • Oracle: everything, including the kitchen sink • IBM DB2, Sybase
  • 77. A simple datamodel (tables and relations)
      Prot_id  name        seq         Species_id
      1        GTM1_HUMAN  MGTDHG…     1
      2        GTM1_RAT    MGHJADSW..  2
      3        GTM2_HUMAN  MVSDBSVD..  1

      Species_id  name   Full Lineage
      1           human  Homo sapiens …
      2           rat    Rattus rattus
  • 78. Relational Database Fundamentals • Basic SQL – SELECT – FROM – WHERE – JOIN – NATURAL, INNER, OUTER • Other SQL functions – COUNT() – MAX(), MIN(), AVG() – DISTINCT – ORDER BY – GROUP BY – LIMIT
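Several of the SQL keywords listed above can be seen working together on the protein/species data model from the previous slide. The sketch below uses Python's built-in `sqlite3` module as a stand-in engine; the table and column names are lower-cased guesses at the slide's layout, and the sequence strings are the truncated stubs from the slide used as toy values, not real sequences.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE species (
    species_id   INTEGER PRIMARY KEY,
    name         TEXT,
    full_lineage TEXT
);
CREATE TABLE protein (
    prot_id    INTEGER PRIMARY KEY,
    name       TEXT,
    seq        TEXT,
    species_id INTEGER REFERENCES species(species_id)
);
INSERT INTO species VALUES (1, 'human', 'Homo sapiens'), (2, 'rat', 'Rattus rattus');
INSERT INTO protein VALUES
    (1, 'GTM1_HUMAN', 'MGTDHG', 1),
    (2, 'GTM1_RAT', 'MGHJADSW', 2),
    (3, 'GTM2_HUMAN', 'MVSDBSVD', 1);
""")

# SELECT ... FROM ... INNER JOIN ... WHERE ... ORDER BY
human_proteins = conn.execute("""
    SELECT protein.name
    FROM protein
    INNER JOIN species ON protein.species_id = species.species_id
    WHERE species.name = 'human'
    ORDER BY protein.name
""").fetchall()

# COUNT() with GROUP BY: number of proteins per species
per_species = conn.execute("""
    SELECT species.name, COUNT(*)
    FROM protein
    INNER JOIN species ON protein.species_id = species.species_id
    GROUP BY species.name
    ORDER BY species.name
""").fetchall()

# ORDER BY with LIMIT: the longest toy sequence, ties broken by name
longest = conn.execute("""
    SELECT name, LENGTH(seq)
    FROM protein
    ORDER BY LENGTH(seq) DESC, name
    LIMIT 1
""").fetchone()
```

Storing the species once and referring to it by `species_id` is what removes the massive duplication seen in flat files.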
  • 83. • Query: a request to retrieve data from a database is called a query • e.g. MyGPCRdb – Bioentry – Taxid (include full lineage) – Linking table (bioentry_tax)
  • 84. MyGPCR: give me all GPCRs that are shorter than 1000 bp
      select * from bioentry;
      select count(*) from bioentry;
      select * from bioentry inner join biosequence on bioentry.bioentry_id=biosequence.bioentry_id;
      select * from bioentry inner join biosequence on bioentry.bioentry_id=biosequence.bioentry_id where length(biosequence_str)<1000;
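The final query of this slide can be run against a toy stand-in for MyGPCRdb. Everything about the data here is an assumption made for illustration: SQLite replaces the MySQL back end, the tables carry only the columns the query needs plus a hypothetical `name` column, and the two sequences are artificial repeats rather than GPCR sequences.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE bioentry (bioentry_id INTEGER PRIMARY KEY, name TEXT)")
conn.execute(
    "CREATE TABLE biosequence ("
    "bioentry_id INTEGER REFERENCES bioentry(bioentry_id), "
    "biosequence_str TEXT)"
)
conn.executemany(
    "INSERT INTO bioentry VALUES (?, ?)",
    [(1, "toy_gpcr_short"), (2, "toy_gpcr_long")],
)
conn.executemany(
    "INSERT INTO biosequence VALUES (?, ?)",
    [(1, "ACGT" * 100), (2, "ACGT" * 300)],  # 400 bp and 1200 bp
)

# Same join and length filter as on the slide, restricted to two columns.
short_entries = conn.execute("""
    SELECT bioentry.name, length(biosequence.biosequence_str)
    FROM bioentry
    INNER JOIN biosequence
        ON bioentry.bioentry_id = biosequence.bioentry_id
    WHERE length(biosequence.biosequence_str) < 1000
""").fetchall()
```

Only the 400 bp toy entry passes the filter; answering the same question from a flat file would mean parsing every record first.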
  • 85. Example 3-tier model in biological database Example of different interface to the same back-end database (MySQL) http://www.bioinformatics.be
  • 86. Overview • Databases – Flat files (FF) • *.txt • Indexed version – Relational (RDBMS) • Access, MySQL, PostgreSQL, Oracle – OO (OODBMS) • AceDB, ObjectStore – Hierarchical • XML – Frame based system • E.g. DAML+OIL – Hybrid systems
  • 87. Object • The Object paradigm is already proven for application design and development, but it may simply not be an adequate paradigm for the data store. • Object databases are modelled by graphs. Graph theory plays a great role in computer science, but it is also a great source of intractable problems, the NP-complete class: problems for which no computationally efficient solution is known, so there is no apparent way to escape from exponential complexity. This is not a current technological limit. It is a limit inherent to the problem domain. • Hybrid Object-Relational databases will probably be the long-term solution for the industry. They put a thin object layer above the relational structure, thus providing a syntax and semantics closer to the object-oriented design and programming tools. They simply make it easier to build the data layer classes
  • 88. Conclusions • A database is a central component of any contemporary information system • The operations on the database and the maintenance of database consistency are handled by a DBMS • There exist stand-alone query languages and embedded languages, but both deal with definition (DDL) and manipulation (DML) aspects • The structural properties, constraints and operations permitted within a DBMS are defined by a data model - hierarchical, network, relational • Recovery and concurrency control are essential • Linking of heterogeneous data sources is a central theme in modern bioinformatics
  • 90. • How do you know which databases exist ? • NAR list • Web links on Nexus – Searchable – Maintainable
  • 92. • Tools available in public domain for simultaneous access – entrez – srs • Batch queries for offload in local databases for subsequent analysis (see further)
  • 95. • What if you want to search the complete human genome (golden path coordinates) instead of separate NCBI entries ? • ENSEMBL
  • 96. BioMart • Joint project between EBI and CSHL, http://www.biomart.org/ • Aim is to develop a generic, query-oriented data management system capable of integrating distributed data sources • 3 step system: – Start by selecting a dataset to query – Filter this dataset by applying the appropriate filters – Generate the output by selecting the attributes and output format • Available public BioMart websites: http://www.biomart.org/biomart/martview
  • 97. BioMart - Single access point - Generic interface
  • 98. BioMart - ‘Out of the box’ website
  • 99. BioMart – 3 step system Dataset Attribute Filter
  • 100. BioMart - 3 step system • Attribute: name, chromosome position, description • Dataset: all Ensembl genes • Filter: located on chromosome 1, expressed in lung, associated with human homologues
  • 101. BioMart - EnsMart • The first in line was EnsMart, a powerful data mining toolset for retrieving customized data sets from annotated genomes. EnsMart integrates data from Ensembl and various worldwide data sources. • EnsMart provides .... – Gene and protein annotation – Disease information – Cross-species analyses – SNPs affecting proteins – Allele frequency data – Retrieval by external identifiers – Retrieval by Gene Ontology – Customized sequence datasets – Microarray annotation tools
  • 102. Other BioMart implementations • Other data resources also implemented a BioMart interface: – Wormbase – Gramene – HapMap – DictyBase – euGenes
  • 107. BioBar • A toolbar for browsing biological data and databases http://biobar.mozdev.org/ • The following databases are included http://biobar.mozdev.org/Databases.html • A toolbar for Mozilla-based browsers including Firefox and Netscape 7+
  • 108. Weblems Weblems Online (example posting) W2.1: Which isolate of Tabac was used in the record with accession Z71230, and which human sample in the GenBank entry with accession AJ311677 ? W2.2: Find all structures of GFP in the Protein Data Bank and draw a histogram of their dates of deposition. W2.3: What is the chromosomal location of the human gene for insulin ? W2.4: How many different human NHRs (nuclear hormone receptors) exist ? How many of these are single exon genes ? Are there any drugs working on this class of receptors ? W2.5: The gene for Berardinelli-Seip syndrome was initially localized between two markers on chromosome band 11q13: D11S4191 and D11S987. a. How many base pairs are there in the interval between these two markers ? b. How many known genes are there ? c. List the gene ontology terms for that region.