GenBank Databases
Hafiz.M.Zeeshan.Raza
Research Associate_HEC_NRPU
hafizraza26@gmail.com
COMSATS UNIVERSITY SAHIWAL
Overview
• Introduction
• Sections of Database
• Importance of GenBank
Historical background
• The first major bioinformatics project was undertaken by Margaret Dayhoff in 1965,
who developed the first protein sequence database, called the Atlas of Protein Sequence
and Structure.
• Subsequently, in the early 1970s, the Brookhaven National Laboratory established
the Protein Data Bank for archiving three-dimensional protein structures.
• The first sequence alignment algorithm was developed by Needleman and Wunsch
in 1970. This was a fundamental step in the development of the field of
bioinformatics, which paved the way for the routine sequence comparisons and
database searching practiced by modern biologists.
• The 1980s saw the establishment of GenBank and the development of fast database
searching algorithms such as FASTA by William Pearson, followed in 1990 by BLAST
from Stephen Altschul and coworkers.
Introduction
• GenBank is the most complete collection of annotated nucleic acid
sequence data for almost every organism.
• The content includes genomic DNA, mRNA, cDNA, ESTs, high throughput
raw sequence data, and sequence polymorphisms.
• There is also a GenPept database for protein sequences, the majority of
which are conceptual translations from DNA sequences, although a small
number of the amino acid sequences are derived using peptide
sequencing techniques.
How to search GenBank
• There are two ways to search for
sequences in GenBank.
• One is using text-based keywords
similar to a PubMed search.
• The other is using molecular sequences
to search by sequence similarity using
BLAST.
GenBank Sequence Format
• Searching GenBank effectively with the text-based method requires an
understanding of the GenBank sequence format.
• GenBank is a relational database. However, the search output for
sequence files is produced as flat files for easy reading.
• The resulting flat files contain three sections: Header, Features, and
Sequence entry.
• There are many fields in the Header and Features sections. Each field has
a unique identifier for easy indexing by computer software.
• Understanding the structure of the GenBank files helps in designing
effective search strategies.
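The three-section layout can be sketched in a few lines of Python. This is only an illustration: the record below is a made-up toy entry (not a real GenBank record), and the function name `split_sections` is hypothetical. It assumes the Features section starts at a line beginning with FEATURES, the sequence at a line beginning with ORIGIN, and the record ends at "//".

```python
# Toy record invented for illustration; it is NOT a real GenBank entry.
TOY_RECORD = """\
LOCUS       TOY0001                   20 bp    DNA     linear   SYN
DEFINITION  Made-up example sequence, partial.
ACCESSION   TOY0001
FEATURES             Location/Qualifiers
     source          1..20
                     /organism="synthetic construct"
ORIGIN
        1 gaattctaac ggtcccgaaa
//
"""


def split_sections(record: str) -> dict[str, str]:
    """Split a flat-file record into header, features, and sequence text."""
    sections = {"header": [], "features": [], "sequence": []}
    current = "header"
    for line in record.splitlines():
        if line.startswith("FEATURES"):
            current = "features"      # second section begins
        elif line.startswith("ORIGIN"):
            current = "sequence"      # third section begins
        elif line.startswith("//"):
            break                     # end-of-record marker
        sections[current].append(line)
    return {name: "\n".join(lines) for name, lines in sections.items()}


parts = split_sections(TOY_RECORD)
for name, text in parts.items():
    print(f"{name}: {len(text.splitlines())} line(s)")
```

For real work, an established parser (for example the one in Biopython) is likely a safer choice than hand-written splitting.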
1st section…Header Part
• The line, “DEFINITION,” provides the summary information for the
sequence record including the name of the sequence, the name and
taxonomy of the source organism if known, and whether the sequence is
complete or partial.
• This is followed by an accession number for the sequence, which is a
unique number assigned to a piece of DNA when it was first submitted to
GenBank and is permanently associated with that sequence.
• This is the number that should be cited in publications. For nucleotide records it has
two classic formats: one letter with five digits (e.g. U12345) or two letters with six
digits (e.g. AF123456).
Continued…
• For a nucleotide sequence that has been translated into a protein sequence, a new
“accession number” is given in the form of a string of alphanumeric characters.
• In addition to the accession number, there is also a version number and a GenInfo
Identifier (gi) number. The purpose of these numbers is to identify the current version
of the sequence.
• If the sequence is revised at a later date, the accession number remains the same,
but the version number is incremented and a new gi number is assigned.
• A translated protein sequence also has a different gi number from the DNA
sequence it is derived from.
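A small sketch of how the accession and version relate, assuming the common "accession.version" notation (as in E01306.1). The names `SeqId`, `parse_versioned_accession`, and `same_record` are hypothetical helpers, and "E01306.2" is used only as an imagined later version.

```python
from typing import NamedTuple


class SeqId(NamedTuple):
    accession: str   # stays fixed for the life of the record
    version: int     # goes up when the sequence is revised


def parse_versioned_accession(identifier: str) -> SeqId:
    """Split an 'accession.version' identifier such as 'E01306.1'."""
    accession, sep, version = identifier.strip().rpartition(".")
    if not sep or not accession or not version.isdigit():
        raise ValueError(f"not an accession.version identifier: {identifier!r}")
    return SeqId(accession, int(version))


def same_record(a: str, b: str) -> bool:
    """True when two identifiers share an accession, whatever their versions."""
    return (parse_versioned_accession(a).accession
            == parse_versioned_accession(b).accession)


print(parse_versioned_accession("E01306.1"))
print(same_record("E01306.1", "E01306.2"))  # "E01306.2" is hypothetical
```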
Continued…
• The next line in the Header section is the “ORGANISM” field,
which gives the scientific name of the source species and
sometimes the tissue type.
• Along with the scientific name comes the taxonomic
classification of the organism.
• Different levels of the classification are hyperlinked to the
NCBI taxonomy database with more detailed descriptions.
Continued…
• This is followed by the “REFERENCE” field, which provides the publication citation
related to the sequence entry.
• The REFERENCE part includes author and title information of the published work
(or tentative title for unpublished work).
• The “JOURNAL” field includes the citation information as well as the date of
sequence submission.
• The citation is often hyperlinked to the PubMed record for access to the original
literature information.
• The last part of the Header is the contact information of the sequence submitter.
2nd section…Features
• The “Features” section includes annotation information about the gene and gene product, as
well as regions of biological significance reported in the sequence, with identifiers and
qualifiers.
• The “Source” field provides the length of the sequence, the scientific name of the organism,
and the taxonomy identification number. Some optional information includes the clone
source, the tissue type and the cell line.
• The “gene” field gives information about the nucleotide coding sequence and its name. For
DNA entries, there is a “CDS” field, which gives the boundaries of the
sequence that can be translated into amino acids.
• For eukaryotic DNA, this field also gives the locations of exons, and the
translated protein sequence is included.
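The CDS boundaries are usually written as a location string. As a rough sketch, assuming the simple forms "start..end" and "join(start..end,start..end)" with 1-based inclusive coordinates, the exon spans can be pulled out as below. The function names and the example coordinates are invented for illustration; complement() and partial-end markers are deliberately not handled.

```python
import re


def parse_location(location: str) -> list[tuple[int, int]]:
    """Return (start, end) pairs from '10..50' or 'join(10..50,120..200)'."""
    text = location.strip()
    if text.startswith("join(") and text.endswith(")"):
        text = text[len("join("):-1]   # keep only the comma-separated spans
    spans = []
    for part in text.split(","):
        match = re.fullmatch(r"\s*(\d+)\.\.(\d+)\s*", part)
        if match is None:
            raise ValueError(f"unsupported location segment: {part!r}")
        spans.append((int(match.group(1)), int(match.group(2))))
    return spans


def coding_length(spans: list[tuple[int, int]]) -> int:
    """Total number of bases covered; both ends of each span are included."""
    return sum(end - start + 1 for start, end in spans)


exons = parse_location("join(10..50,120..200)")  # made-up coordinates
print(exons, coding_length(exons))
```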
3rd section…Sequence
• The third section of the flat file is the sequence itself starting with the
label “ORIGIN”.
• The format of the sequence display can be changed by choosing options at
a Display pull-down menu at the upper left corner.
• For DNA entries, there is a BASE COUNT report that includes the numbers
of A, G, C, and T in the sequence.
• This section, for both DNA and protein sequences, ends with two forward
slashes (the “//” symbol).
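Reading the sequence section and producing a base count can be sketched as follows. The block below takes the first 40 bases of the FASTA example from this deck and lays them out in the usual ORIGIN style (position number, then groups of ten lowercase bases), which is an assumed layout; `read_origin` is a hypothetical helper name.

```python
from collections import Counter

# First 40 bases of the deck's FASTA example, arranged ORIGIN-style (assumed layout).
ORIGIN_BLOCK = """\
ORIGIN
        1 gaattctaac ggtcccgaaa ctctgtgcgg tgtgaatggt
//
"""


def read_origin(block: str) -> str:
    """Collect the letters between the ORIGIN label and the closing '//'."""
    chunks = []
    in_sequence = False
    for line in block.splitlines():
        if line.startswith("ORIGIN"):
            in_sequence = True
        elif line.startswith("//"):
            break
        elif in_sequence:
            # Drop position numbers and spaces, keep only the bases.
            chunks.append("".join(ch for ch in line if ch.isalpha()))
    return "".join(chunks).upper()


sequence = read_origin(ORIGIN_BLOCK)
counts = Counter(sequence)
print(len(sequence), {base: counts[base] for base in "AGCT"})
```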
Importance
• In retrieving DNA or protein sequences from GenBank, the search can be limited to
different fields of annotation such as “organism,” “accession number,” “authors,”
and “publication date.”
• One can use a combination of the “Limits” and “Preview/Index” options.
Alternatively, a number of search qualifiers can be used, each defining
one of the fields in a GenBank file.
• The qualifiers are similar to but not the same as the field tags in PubMed. For
example, in GenBank, [GENE] represents the field for gene name, [AUTH] for author
name, and [ORGN] for organism name.
• Frequently used GenBank qualifiers such as these have to be in uppercase and in brackets.
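A field-restricted query can be assembled programmatically. The sketch below only builds the query string and a URL; it does not contact any server. It assumes the NCBI E-utilities ESearch endpoint and the "nuccore" database name; `build_term` and `esearch_url` are hypothetical helpers, and the gene symbol IGF1 is chosen to match the FASTA example in this deck.

```python
from urllib.parse import urlencode

# Assumed endpoint: NCBI E-utilities ESearch.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_term(**fields: str) -> str:
    """Join field-restricted terms with AND, e.g. gene='IGF1' -> 'IGF1[GENE]'."""
    return " AND ".join(
        f"{value}[{name.upper()}]" for name, value in fields.items()
    )


def esearch_url(term: str, db: str = "nuccore", retmax: int = 20) -> str:
    """Build (but do not send) a search URL for the given query term."""
    return f"{ESEARCH}?{urlencode({'db': db, 'term': term, 'retmax': retmax})}"


term = build_term(gene="IGF1", orgn="Homo sapiens")
print(term)
print(esearch_url(term))
```

Keeping the qualifier names as keyword arguments makes it easy to add or drop a field without editing the query string by hand.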
Alternative sequence Formats
• In bioinformatics, FASTA format is a text-based format for representing
either nucleotide sequences or peptide sequences, in which nucleotides or amino
acids are represented using single-letter codes.
• FASTA is one of the simplest and the most popular sequence formats because it
contains plain sequence information that is readable by many bioinformatics
analysis programs.
• It has a single definition line that begins with a right angle bracket (>) followed by a
sequence name.
• Sometimes, extra information such as gi number or comments can be given, which
are separated from the sequence name by a “|” symbol.
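As an illustration of the definition line, the sketch below separates the "|"-delimited identifier fields from the free-text description. `parse_defline` is a hypothetical helper, and the second example line (with "gi|000|gb|TOY00001.1|") is invented purely to show the "|" separators.

```python
def parse_defline(defline: str) -> tuple[list[str], str]:
    """Split a FASTA definition line into identifier fields and a description."""
    if not defline.startswith(">"):
        raise ValueError("a FASTA definition line must start with '>'")
    # Everything up to the first space is the name; the rest is the description.
    name, _, description = defline[1:].strip().partition(" ")
    fields = [field for field in name.split("|") if field]
    return fields, description.strip()


print(parse_defline(">E01306.1 DNA encoding human insulin-like growth factor I(IGFI)"))
print(parse_defline(">gi|000|gb|TOY00001.1| toy description"))  # invented line
```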
FASTA Format Sequence
>E01306.1 DNA encoding human insulin-like growth factor I(IGFI)
GAATTCTAACGGTCCCGAAACTCTGTGCGGTGTGAATGGTTGACGCTCTGCAG
TTGTTTGCGGTGACCGTGGTTTTTATTTTAACAAACCCACTGGTTATGGTTCTT
TTCTCGTCGTGCTCCCCAGACTGGTATTGTTGAGAATGCTGCTTTCGTTCTTG
GACCTGCGTCGTCTGGAAATGTATTGCGCTCCCCTGAAACCCGC
• The extra information is considered optional and is ignored by sequence analysis
programs.
• The plain sequence in standard one-letter symbols starts in the second line.
• Each line of sequence data is limited to sixty to eighty characters in width.
• The drawback of this format is that much annotation information is lost.
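A minimal reader and writer for this format might look like the sketch below. It is not a full parser: `read_fasta` and `write_fasta` are hypothetical names, the default width of 60 is one choice within the sixty-to-eighty range mentioned above, and the input is the start of the example record from this slide.

```python
def read_fasta(text: str) -> dict[str, str]:
    """Map each definition line (without '>') to its concatenated sequence."""
    records: dict[str, list[str]] = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            current = line[1:]
            records[current] = []
        elif current is not None:
            records[current].append(line.replace(" ", ""))
    return {name: "".join(chunks) for name, chunks in records.items()}


def write_fasta(name: str, sequence: str, width: int = 60) -> str:
    """Format one record, wrapping the sequence at `width` characters per line."""
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return f">{name}\n" + "\n".join(lines) + "\n"


example = (
    ">E01306.1 DNA encoding human insulin-like growth factor I(IGFI)\n"
    "GAATTCTAACGGTCCCGAAACTCTGTGCGGTG\n"
    "TGAATGGTTGACGCTCTGCAG\n"
)
records = read_fasta(example)
for name, seq in records.items():
    print(write_fasta(name, seq), end="")
```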