Pathema Burkholderia Annotation Jamboree: Prokaryotic Annotation Overview

Prokaryotic Annotation Overview Ramana Madupu 2008

Outline ,[object Object],[object Object],[object Object],[object Object],[object Object]

What is Annotation? ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]

ORFs vs. Protein Coding Genes ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]

Each DNA sequence has 6 possible translations, one for each frame (Very) Short sample sequence TAGATGATTAGCTTGGATGAGCTCATATAGCCCGTAAGA ATCTACTAATCGAACCTACTCGAGTATATCGGGCATTGT Frame and corresponding codons +1 TAG ATG ATT AGC TTG GAT GAG CTC ATA TAG CCC GTA +2 AGA TGA TTA GCT TGG ATG AGC TCA TAT AGC CCG TAA +3 GAT GAT TAG CTT GGA TGA GCT CAT ATA GCC CGT AAG -1 TGT TAC GGG CTA TAT GAG CTC ATC CAA GCT AAG CAT -2 GTT ACG GGC TAT ATG AGC TCA TCC AAG CTA AGC ATC -3 TTA CGG GCT ATA TGA GCT CAT CCA AGC TAA GCA TCT color key translational start translational stop Just one sequence can result in many different potential coding sequence situations: frame +1 and +2 have relatively long ORFs with start codons and therefore are potential genes. Frames +3 and -3 have short ORFs with no starts and therefore no genes. Frame -1 and -2 are all ORF since there are no stops at all, but only one has a start.

More on 6-Frame translations In order to visualize the genes within the context of their neighbors along a DNA sequence, the sequence is often represented as a 6-frame translation. There are 6 possible frames for translation in every sequence of DNA, 3 in the forward (+) direction on the DNA sequence, and 3 in the reverse (-) direction on the DNA sequence. These are represented as horizontal bars with vertical marks for stops (long) and starts (short) in the order shown below. +1 +2 +3 -1 -2 -3 These are some of the many ORFs in this graphic. stop start

Gene Finding with Glimmer ,[object Object],[object Object],[object Object],[object Object],[object Object]

Gene finding with Glimmer: Gathering the Training Set Glimmer built-in system: long ORFs that do not overlap other long ORFs IMM built Glimmer builds a model from the training set One can gather published sequences from the organism sequenced, however, this generally does not provide enough sequence for training - a general guideline is that ~250 kb of total sequence is needed (that is if you line up all of the training genes end to end they total 250 kb) BLAST all ORFs against your favorite protein database, retain only extremely strong matches

Gene Finding with Glimmer What happens during training. ATGCG T AAGGCTTTCACAGT ATGCG A GTAAGCTGCGTCGTAAGG A TGCGT A AGGCTTTCACAGTATGCGAGTAAGC TGCGT C GTAAGG AT GCGTA A GGCTTTCACAGTATGCGAGTAAGCTGCGTCGTAAGG ATG CGTAA G GCTTTCACAGTATGCGAGTAAGCTGCGTCGTAAGG ATGC GTAAG G CTTTCACAGTATGCGA GTAAG C TGCGTC GTAAG G Glimmer moves sequentially through each sequence in the training set, recording the nucleotide that occurs after each possible oligomer up to oligomers of length eight Glimmer then calculates the statistical probability of each pattern appearing in a real gene. These probabilities form the statistical model of what a real gene looks like in the given organism. This model is then run against the complete genome sequence. All ORFs above the chosen minimum length (90 bp at JCVI) are scored according to how closely they match the model of a real gene. Example for a 5-mer:

Candidate ORFs 6-frame translation map for a region of DNA: Each horizontal bar corresponds to one of the 6 translation frames (3 forward, 3 reverse) long vertical lines represent stops (TAA, TAG, TGA) short vertical lines represent starts (ATG, GTG, TTG) = ORFs meeting minimum length In the first panel the ORFs highlighted in blue meet the minimum length requirement of what can be considered a gene. (in our pipeline 90 bp) In the second panel, potential coding sequences within the candidate ORFs are represented by arrows, going from start to stop. A long ORF does not necessarily result in a long putative gene. Green ORFs scored well to the model, red ORFs scored less well. The green ORFs are chosen by Glimmer as the set of likely genes. ORFs in the area of lateral transfer, although real genes, often will not be chosen since their nucleotide patterns likely won’t match the model built from the patterns of the genome as a whole. Below genes are represented as arrows drawn above (forward orientation) or below (reverse orientation) a line representing the DNA molecule. Glimmer numbers the genes sequentially from the beginning of the DNA molecule on which they reside. Genes missed by Glimmer and overlapping genes are resolved by post-Glimmer processes, which will be discussed on later slides. ORF00001 ORF00002 ORF00003 ORF00004 = laterally transferred DNA ? +1 +2 +3 -1 -2 -3

Coordinates Genes are mapped to the underlying genome sequence via coordinates. Each gene is defined by two coordinates: end5 (the 5 prime end of the gene) and end3 (the 3 prime end of the gene). Nucleotide #1 for each molecule in the genome is the beginning of each final assembled molecule. Circular chromosomes are treated as linear for the purposes of annotation. One position is chosen to be nucleotide #1 and that becomes the “beginning” of the molecule. Some genomes have just one DNA molecule, some have several (multiple chromosomes or plasmids). 1 10,000 gene end5 end3 purple 12 527 red 802 675 blue 927 1543 green 9425 7894 pink 9575 9945 Note that forward genes have end5<end3, while reverse genes have end5>end3.

Gathering Evidence Determining How The New Proteins Function

Finding the function of a new protein ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]

Protein Homology ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]

[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],Collecting Sequence Similarity Evidence

Protein Alignment Tools ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]

Sample Alignments Pairwise Multiple -two rows of amino acids compared to each other, the top row is the search protein and the bottom row is the match protein, numbers indicate amino acid position in the sequence -solid lines between amino acids indicate identity -dashed lines (colons) between amino acids indicate similarity -next slide shows a full alignment Different shadings indicate amount of matching

Sample full-length protein alignment

Useful Protein Databases ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]

[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],Other Useful Databases

NIAA ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]

Sample NIAA entry >biotin synthase, Escherichia coli MAHRPRWTLSQVTELFEKPLLDLLFEAQQVHRQHFDPRQVQVSTLLSIKTGACPEDCKYC PQSSRYKTGLEAERLMEVEQVLESARKAKAAGSTRFCMGAAWKNPHERDMPYLEQMVQGV KAMGLEACMTLGTLSESQAQRLANAGLDYYNHNLDTSPEFYGNIITTRTYQERLDTLEKV RDAGIKVCSGGIVGLGETVKDRAGLLLQLANLPTPPESVPINMLVKVKGTPLADNDDVDA FDFIRTIAVARIMMPTSYVRLSAGREQMNEQTQAMCFMAGANSIFYGCKLLTTPNPEEDK DLQLFRKLGLNPQQTAVLAGDNEQQQRLEQALMTPDTDEYYNAAAL Source databases where this protein is found: -Swiss-Prot, accession # SP:P12996 -Protein Information Resource, accession # PIR:JC2517 -NCBI’s GenBank, accession # GB:AAC73862.1 All of these are collapsed into one entry in NIAA that is linked to all three accessions.

BLAST-extend-repraze (BER) Our pairwise protein search tool ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]

BLAST-extend-repraze (BER) niaa BLAST Extend 300 nucleotides on both ends (see later slide) modified Smith- Waterman Alignment genome.pep vs. Significant hits put into mini-dbs for each protein (non-identical amino acid) vs. extended protein mini-database from BLAST search , mini-db for protein #1 mini-db for protein #2 , mini-db for protein #3 ... mini-db for protein #3000

BER Alignment An alignment like this will be generated for every match protein in the mini-database. In the next slides we will look closely at the types of information displayed here.

BER Alignment detail: Boxed Header -The background color of this box will be gold if the protein is in the characterized table and grey if it is not. -The top bar lists the percent identity/similarity and the organism from which the protein comes (if available). -The bottom section lists all of the accession numbers and names for all the instances of the match protein from the source databases (used in building NIAA for the searches.) -The accession numbers are links to pages for the match protein in the source databases. -A particular entry in the list will have colored text (the color corresponding to its characterized status) if that is the accession that is entered into the characterized table - this tells the annotators which link they should follow to find experimental characterization information. Only one accession for the match protein need be in the characterized table for the header to turn gold. -There are links at the end of each line to enter the accession into the characterized table or to edit an already existing entry in the characterized table.

BER Alignment detail: alignment header -It is most important to look at the range over which the alignment stretches and the percent identity -The top line show the amino acid coordinates over which the match extends for our protein -The second line shows the amino acid coordinates over which the match extends for the match protein, along with the name and accession of the match protein -The last line indicates the number of amino acids in the alignment found in each forward frame for the sequence as defined by the coordinates of the gene. The primary frame is the one starting with nucleotide one of the gene. If all is well with the protein, all of the matching amino acids should be in frame 1. -If there is a frameshift in the alignment (see later slide) the phrase “Frame Shifts = #” will flash and indicate how many frameshifts there are. In addition, inside the brackets in the last line will be a count of how many amino acids of match are in each frame.

BER Alignment detail: alignment of amino acids -In these alignments the codons of the DNA sequence read down in columns with the corresponding amino acid underneath. -The numbers refer to amino acid position. Position 1 is the first amino acid of the protein. The first nucleotide of the codon coding for amino acid 1 is nucleotide 1 of the coding sequence. Negative amino acid numbers indicate positions upstream of the predicted start of the protein. -Vertical lines between amino acids of our protein and the match protein (bottom line) indicate exact matches, dotted lines (colons) indicate similar amino acids. -Start sites are color coded: ATG is green, GTG is blue, TTG is red/orange -Stop codons are represented as asterisks in the amino acid sequence. An open reading frame goes from an upstream stop codon to the stop at the end of the protein, while the gene starts at the chosen start codon.

BER Skim A list of best matches from niaa to the search protein with statistics on length of match and BLAST p-value. Colored backgrounds indicate presence in characterized table and corresponding status.

Extensions in BER The extensions help in the detection of frameshifts (FS) and point mutations resulting in in-frame stop codons (PM). This is indicated when similarity extends outside the coordinates of the protein coding sequence. Blue line indicates predicted protein coding sequence, green line indicates up- and downstream extensions. Red line is the match protein. ORFxxxxx 300 bp 300 bp FS PM FS or PM ? two functionally unrelated genes from other species matching one query protein could indicate incorrectly fused ORFs end5 end3 search protein match protein similarity extending through a frameshift upstream or downstream into extensions similarity extending in the same frame through a stop codon ? normal full length match * ! !

H idden M arkov M odels - HMMs ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]

Our HMM-Type Labels Domain: These HMMs describe a region of homology that is not required to be the full length of a protein. The function of the region may or may not be known. Superfamily : This type of HMM describes a group of proteins which have full length protein sequence similarity and have the same domain architecture, but which do not necessarily have the same function. Subfamily: This type of HMM describes a group of proteins which also have full length homology, which represent more specific sub-groupings with a superfamily. Equivalog: The supreme HMM, designed so that all members of the family being modeled and all proteins scoring above trusted share the exact same function. Equivalog_domain: Just like equivalog, only applies to a polypeptide that can be found as a single function protein or part of a multifunctional protein. (The above are just a sampling of the most commonly seen isology types arranged from most general to most specific, there are more types than those listed here.) Pfam: Indicates that no isology type has yet been assigned to the Pfam HMM. The starred HMMs provide the most specific information, look for matches to these first. They are very strong evidence for function.

Annotation attached to HMMs ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]

Building HMMs Collect proteins to be in the “seed” Generate and Curate Multiple Alignment of Seed proteins Run HMM algorithm Choose “noise” and “trusted” cutoff scores based on what scores the “known” vs. “unknown” proteins receive. HMM is ready to go! Region of good alignment and closest similarity (same function/similar domain/ family membership) Computes statistical probabilities for amino acid patterns in the seed Search new model against all proteins this step may need a few iterations

Choosing cutoff scores 100 300 matches (seed members bold) score protein “definitely” 547 protein “absolutely” 501 protein “sure thing” 487 protein “confident” 398 protein “safe bet” 376 protein “very confident” 365 protein “has to be one” 355 protein “could be” 210 protein “maybe” 198 protein “not sure” 150 protein “no way” 74 protein “can’t be” 54 protein “not a chance” 47 =search the new HMM against NIAA =see the range of scores the match proteins receive =do analysis to determine where confirmed members score =do analysis to determine where confirmed non-members score =set the cut-offs accordingly =proteins that score above trusted can be considered part of the protein family modeled by the HMM =proteins that score below noise should not be considered part of the protein family modeled by the HMM =usefulness of an HMM is directly related to the care taken by the person building the HMM since some steps are subjective

HMM Searches HMM hits for protein #1 , , , etc. HMM hits for protein #3 HMM hits for protein #2 genome.pep vs. HMM database TIGRFAMS + Pfams TIGR01234 PF00012 TIGR00005 TIGR01004 NONE Each protein in the genome is searched against all HMMs in our db. Some will not have significant hits to any HMM, some will have significant hits to several HMMs. Multiple HMM hits can arise in many ways, for example: the same protein could hit an equivalog model, a superfamily model to which the equivalog function belongs, and a domain model representing the catalytic domain for the particular equivalog function. There is also overlap between TIGR and Pfam HMMs.

Evaluating HMM scores - if the protein’s total score is….. … above trusted: the protein is a member of family the HMM models … below noise: the protein is not a member of family the HMM models … in-between noise and trusted: the protein MAY be a member of the family the HMM models ...above trusted and some or all scores are negative: the protein is a member of the family the HMM models 0 100 -50 0 100 -50 0 100 -50 T N P 0 100 -50

Genome Properties ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]

Genome Property: “Biotin Biosynthesis”

Genome Property: “Cell Shape”

Paralogous Families genome.pep genome.pep vs. Proteins from the same genome are grouped into families (minimum two members) according to sequence similarity. First, proteins are clustered according to HMM hits, second, other regions of the proteins, not found in HMM hits, are searched and clustered. Those families associated with HMMs get names containing the HMM accession (ex. fam_PF00528), those not associated with HMMs are given sequential numbered names as they are built (ex. fam_11). Reveals expansion/contraction of various families of proteins in one genome verses another. Helps in annotation consistency, frameshift detection, and start site editing.

Paralogous family output in Manatee

[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],Other searches

non-coding RNAs ,[object Object],[object Object],[object Object],[object Object]

Other Information ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]

Gene Model Curation ,[object Object],[object Object],[object Object],[object Object]

Gene Model Curation: Start site edits What to consider: - Start site frequency: ATG >> GTG >> TTG - Ribosome Binding Site (RBS): a string of AG rich sequence located 5-11 bp upstream of the start codon - Similarity to match protein, in BER, Paralogous Family, and multiple alignments - the most important factor. (Remember to note, that the DNA sequence reads down in columns for each codon.) -In the example below (showing just the beginning of one BER alignment), homology starts exactly at the first atg (the current chosen start, aa #1), there is a very favorable RBS beginning 9bp upstream of this atg (gagggaga). There is no reason to consider the ttg, and no justification for moving to the second atg (this would cut off some similarity and it does not have an RBS.) RBS upstream of chosen start 3 possible start sites This ORF’s upstream boundary

Gene Model Curation: Overlaps and InterEvidence Overlap analysis When two ORFs overlap (boxed areas), the one without similarity to anything (another protein, an HMM, etc.) is removed. If both don’t match anything, other considerations such as presence in a putative operon and potential start codon quality are considered. This process has both automated (for the easy ones) and manual (for the hard ones) components. Small regions of overlap are allowed (circle). InterEvidence regions Areas of the genome with no genes and areas within genes without any kind of evidence (no match to another protein, HMM, etc., such regions may include an entire gene in case of “hypothetical proteins”) are translated in all 6 frames and searched against niaa. Results are evaluated by the annotation team.

Functional Assignments Making the annotations: Assigning names and roles to the proteins

Functional Assignments: What we want to accomplish. Name and associated info Descriptive common name for the protein, with as much specificity as the evidence supports; gene symbol. EC number if the protein is an enzyme Role TIGR role Gene Ontology terms from all 3 ontologies (more on this soon) to describe what the protein is doing in the cell and why. Supporting evidence HMMs, Prosite, InterPro Characterized match from BER search Paralogous family membership.

Functional Assignments: What we want to avoid! ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],Genome Rot! ~ ~ ~ ~ ~ ~ ~ ~

[object Object],[object Object],[object Object],[object Object],Are all pairwise matches with equal alignment quality of equal value to annotation?

Experimentally Characterized Proteins ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]

AutoAnnotate Software tool which gives a preliminary name and role assignment to all the proteins in the genome. AutoAnnotate is not designed to give the most accurate annotation, it is designed to give at least some annotation (particularly TIGR roles) to every gene in the genome it can. It has been designed with the knowledge that manual annotation will follow the automatic annotation. This tool would need to be modified if we wanted it to produce stand-alone automatic annotation. Makes decisions based on ranked evidence types Equivalog HMM Other HMMs Characterized BER match Other BER match best evidence

Manual Annotation: Assigning Names to Proteins

Annotation must be grounded with supporting evidence The process of manual annotation involves assessing all available evidence and reaching a conclusion about what you think the protein is doing in the cell and why. Functional annotations should only be as specific as the supporting evidence allows, you must have high confidence in the level of functional annotation you are asserting. We have evolved a layered set of naming conventions that allows us to reflect the level of functional specificity of which we are confident in the names and other annotations assigned to a protein. All evidence that led to the annotation conclusions that were made must be stored so that others can assess the annotation for themselves. In addition, detailed documentation of methodologies and general rules or guidelines used in any annotation process should be provided.

Evaluate the Evidence ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]

Functional Assignments: High Confidence in Precise Function Required evidence: -At least one good alignment (minimum 35% identity, over the full lengths of both proteins) to a protein from another organism that has been experimentally characterized, preferably multiple such alignments. -Hits to appropriate HMMs above the trusted cutoff. -Conservation of active sites, binding sites, appropriate number of membrane spans, etc. Action: Give the protein a specific name and accompanying gene symbol, this is the only confidence level where we assign gene symbols. We default to E. coli gene symbols when possible, for Gram positive genes we use B. subtilis gene symbols. Example: name: “adenylosuccinate lyase” gene symbol: purB EC number: 4.3.2.2

Functional Assignments: High Confidence in General Function, Unsure of Substrate Specificity A good example of this is seen with transporters: If you can definitively identify a specific substrate: “ ribose ABC transporter, permease protein” -Hit scoring above trusted to an equivalog HMM for one substrate -high quality match to an experimentally characterized protein in the BER with that substrate specificity If you can identify a substrate group: “ sugar ABC transporter, permease protein” -Multiple characterized hits to a specific type of transporter in the BER results -Hits to appropriate HMMs for that type of transporter -The substrate identified for in these matches are not all be the same, but may fall into a group, for example they are all sugars. If you can only identify the type of transporter and nothing about substrate specificity: “ ABC transporter, permease protein” -Hits to HMMs defining the transporter family but without substrate specificity -Matches to characterized proteins in the BER where the substrates identified do not fall into a group ------------ Another example of a known function but not exact substrate: “ carbohydrate kinase, FGGY family”

Functional Assignments : Precise Function Unclear The “family” designation: -No matches to specific characterized protein -score above trusted cutoff to an HMM which defines a family, but not a specific function. Example: “CbbY family protein ” The general function name: -no match to a specific enzyme -HMM match to a superfamily of functions Example: “kinase” The “homolog” designation, used when we want to show sequence relationship but not functional relationship: -good match to a function not expected in the organism (like a photosynthesis gene in a non-photosynthetic bug) OR -two matches exist in the genome to a particular function, however, one has much stonger evidence than the other. One is likely to be the “real” one while the other is likely not. Example: “galactokinase homolog” The “putative” designation, used when evidence is missing one small element but is otherwise strong: -has been largely replace by “homolog” and “family” Example: “putative galactokinase”

Functional Assignments: Hypotheticals If a protein has no matches to any protein from another species in BER, HMM, Prosite, or TmHMM it is called: “ hypothetical protein” This term is used since we are not actually sure the gene/protein is actually real. It has been predicted by Glimmer but no evidence exists that it is really expressed. --------------------------------------------------------- If a “hypothetical protein” from one species matches a “hypothetical protein” from another, they both now become: “ conserved hypothetical protein” These are really no longer “hypothetical” because it is very unlikely that two species will both retain a sequence if it is not an actual gene. Other centers use terms for this such as: “ protein of unknown function” “ unknown protein” NOTE: the above names will likely change soon as discussion is underway to agree on new names for these that would be used by the whole community.

Functional Assignments: Frameshifts and Point Mutations When a frameshift or in-frame stop codon is seen within a region of alignment in our BER search results, two possible things may have occurred: a sequencing error, or a mutation in the gene. When these are found, the underlying sequence data is manually reviewed for accuracy. Occasionally we find that the sequence we are using has an error (missed base, extra base, miscalled base). When these are corrected the frameshift/in-frame stop is resolved. If we find that the sequence is correct as reported to us, the protein is annotated to reflect the presence of a disruption in the open reading frame: “ great protein, authentic frameshift” “ fun protein, authentic point mutation” We do NOT use the term “pseudogene” as this implies that some experiment was done to confirm the lack of function of the gene. We have done no such experiments and indeed it may be that the protein may be functional in some truncated form, or that some read-through mechanism allows expression. Also, it may be that mutations we see only occurred in the cells grown for the DNA sample used for the sequencing project.

Functional Assignments: Other ORF disruptions -many FS / PM, “ degenerate” -”truncation” -”deletion” -”insertion” (~20-50aa) -interruption “ interruption-N” “ interruption-C” -”fusion” -”fragment” ! ! ! * * * These are given descriptive terms in the common name and all are put into a “Disrupted reading frame” role category to make them easy to find. Again we do not use the term “pseudogene”. (some genes) N-term C-term

TIGR Roles Amino acid biosynthesis Purines, pyrimidines, nucleosides, and nucleotides Fatty acid and phospholipid metabolism Biosynthesis of cofactors, prosthetic groups, and carriers Central intermediary metabolism Energy metabolism Transport and binding proteins DNA metabolism Transcription Protein synthesis Protein Fate Regulatory Functions Signal Transduction Cell envelope Cellular processes Other categories Unknown Hypothetical Disrupted Reading Frame Unclassified ( not a real role ) Role Notes: Notes written by annotators expert in each role category to aid other annotators in knowing what belongs in that category and the JCVI naming conventions for it. AutoAnnotate makes a first pass at assigning role, based on roles associated with HMMs or with match proteins. Human annotator checks and adjusts as necessary. TIGR bacterial roles were first adapted from Monica Riley’s roles for E. coli - both systems have since undergone much change.

What is the Gene Ontology? and Why is it useful?

GO offers an annotation capture system that allows the unambiguous communication of annotation information in a format readable by both computers and humans and which facilitates data exchange. ,[object Object],[object Object],[object Object],[object Object],[object Object]

Do we use precise, consistent terminology in biology? ,[object Object],[object Object],[object Object],[object Object],[object Object]

Enzymes pyruvate kinase and phosphoenol transphosphorylase both are names for the reaction: ATP + pyruvate = ADP + phosphoenolpyruvate The Enzyme Commission has standardized enzyme reactions and provides id numbers and alternate names

sporulation ,[object Object],[object Object],(both photos from Wikipedia site) Neurospora , photo from NIH

Is a little ambiguity a big problem? ,[object Object],[object Object],[object Object]

Biologists need to: ,[object Object],[object Object],[object Object],[object Object]

GO History ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]

(Thanks to Jen Clark for this slide) The groups that contribute to the GO effort: ZFIN Reactome

GO is a set of 3 controlled vocabularies, stored as ontologies, which are used to tag gene products with information. ,[object Object],[object Object]

What is a controlled vocabulary? ,[object Object],[object Object],[object Object],[object Object],[object Object]

GO consists of 3 separate controlled vocabularies The vocabularies are used to describe 3 aspects of a gene products existence: what it does (molecular function), why it does what it does (biological process), and where it does what it does (cellular component)

What is an ontology? ,[object Object],[object Object],[object Object],[object Object]

GO terms ,[object Object],[object Object],[object Object],[object Object]

GO term relationships ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]

General GO tree structure Root term general term less general term more specific term even more specific term most specific term

Examples Molecular function catalytic activity kinase activity carbohydrate kinase activity ribokinase activity Biological process cellular process metabolism carbohydrate metabolism ribose metabolism transporter activity carbohydrate transporter activity ribose transporter activity lets look at just one branch: now two branches:

Example GO term ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]

A Multi-parent term ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],amine group carboxylic acid group

Tree view in AmiGO of a multi-parent term: Each node in an AmiGO tree can be expanded or collapsed as needed by clicking on the +/- in the box at the beginning of each line. The letter in the green circle (I/P) indicates the relationship between the term and its parent. The numbers in parentheses at the end of each line indicate the number of gene products in the GO repository annotation collection which have been annotated to that term or one of its children.

True Path Rule The pathway from a term all the way up to its top-level parent(s) must always be true for any gene product annotated to that term. cell organelle mitochondrian proton-transporting ATP synthase complex Incorrect for Bacteria cell intracellular proton-transporting ATP synthase complex plasma membrane proton-transporting ATP synthase complex mitochondrial proton-transporting ATP synthase complex membrane plasma membrane plasma membrane proton-transporting ATP synthase complex organelle mitochondrion mitochondrial inner membrane mitochondrial proton-transporting ATP synthase complex Correct for Bacteria (and Eukaryotes) NOTE: these trees are abbreviated versions of the actual ones

[object Object],[object Object],[object Object],[object Object],GO is a work in progress

Getting involved with Ontology Development ,[object Object],[object Object],[object Object],[object Object]

https://sourceforge.net/tracker/?func=add&group_id=36855&atid=440764 The Gene Ontology SourceForge Tracker for Ontology Development

http://www.geneontology.org/GO.interests.shtml?all

Making associations between GO terms and gene products: GO Annotations

What is GO Annotation? ,[object Object],[object Object]

6-phosphofructokinase ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]

Making Annotations ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]

Evaluate the Evidence ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]

Functional confidence captured with GO GO trees (very abbreviated) Function catalytic activity kinase activity carbohydrate kinase activity ribokinase activity glucokinase activity fructokinase activity Process metabolism carbohydrate metabolism monosaccharide metabolism hexose metabolism glucose metabolism fructose metabolism pentose metabolism ribose metabolism Available evidence for 3 genes #1 -good match to an HMM specific for “ribokinase” -a high-quality BER match to an experimentally characterized ribokinase #2 -good match to an HMM for “kinase” -a high-quality BER match to an experimentally characterized “glucokinase’ AND a ‘fructokinase’ #3 -good match to an HMM for “kinase’

Finding the terms to assign to your protein ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]

Full Annotation ,[object Object],[object Object]

chaperone DnaK ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]

Evidence ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]

GO Evidence A key element of a good annotation system is the storage of evidence: ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]

• IDA inferred from direct assay - Enzyme assays - In vitro reconstitution (e.g. transcription) - Immunofluorescence (for cellular component) - Cell fractionation (for cellular component) - Physical interaction/binding • IEP inferred from expression pattern - Transcript levels (e.g. Northerns, microarray data) - Protein levels (e.g. Western blots) • IGI inferred from genetic interaction - "Traditional" genetic interactions such as suppressors, synthetic lethals, etc. - Functional complementation - Rescue experiments - Inference about one gene drawn from the phenotype of a mutation in a different gene. • IMP inferred from mutant phenotype - Any gene mutation/knockout - Overexpression/ectopic expression of wild-type or mutant genes - Anti-sense experiments - RNAi experiments - Specific protein inhibitors • IPI inferred from physical interaction - 2-hybrid interactions - Co-purification/Co-immunoprecipitation - Ion/protein binding experiments • ISS inferred from sequence or structural similarity - Sequence similarity (to characterized proteins, domains, motifs) - Structural similarity • IGC inferred from genomic context -operon structure -syntenic regions -pathway/system reconstruction • ND no biological data available (used when annotating to the root terms) • IC inferred by curator • TAS/NAS traceable/nontraceable author statement • IEA inferred from electronic annotation http://www.geneontology.org/GO.evidence.html GO Evidence codes

GO Standard References GO_REF:0000011 A Hidden Markov Model (HMM) is a statistical representation of patterns found in a data set. When using HMMs with proteins, the HMM is a statistical model of the patterns of the amino acids found in a multiple alignment of a set of proteins called the "seed". Seed proteins are chosen based on sequence similarity to each other. Seed members can be chosen with different levels of relationship to each other. They can be members of a superfamily (ex. ABC transporter, ATP-binding proteins), they can all share the same exact specific function (ex. biotin synthase) or they could share another type of relationship of intermediate specificity (ex. subfamily, domain). New proteins can be scored against the model generated from the seed according to how closely the patterns of amino acids in the new proteins match those in the seed. There are two scores assigned to the HMM which allow annotators to judge how well any new protein scores to the model. Proteins scoring above the "trusted cutoff" score can be assumed to be part of the group defined by the seed. Proteins scoring below the "noise cutoff" score can be assumed to NOT be a part of the group. Proteins scoring between the trusted and noise cutoffs may be part of the group but may not. One of the important features of HMMs is that they are built from a multiple alignment of protein sequences, not a pairwise alignment. This is significant, since shared similarity between many proteins is much more likely to indicate shared functional relationship than sequence similarity between just two proteins. The usefulness of an HMM is directly related to the amount of care that is taken in chosing the seed members, building a good multiple alignment of the seed members, assessing the level of specificity of the model, and choosing the cutoff scores correctly. GO_REF:0000012 Pairwise alignments are generated by taking two sequences and aligning them so that the maximum number of amino acids in each protein match, or are similar to, each other. Tools such as BLAST work by comparing a protein-of-interest individually with every protein in a database of known protein sequences and retaining only those matches with a high probability of being significant. Basic BLAST generates local alignments between proteins for regions of high similarity. Other pairwise alignment tools attempt to generate global (full-length) protein alignments. GO_REF:0000025 Genes in prokaryotic organisms are often arranged in operons. Genes in an operon are all transcribed into one mRNA. Generally the genes in the operons code for proteins that all have related functions. For example, they may be the steps in a biochemical pathway, or they may be the subunits of a protein complex. Often the genes in operons shared between organisms are syntenic; that is, the same genes are in the same order in the operon in different species. When assessing sequence-comparison-based evidence during the process of manual annotation of a genome, it is often the case that some of the genes in the operon will have strong sequence-based evidence while others will have weak evidence. If seen alone, not in the presence of an operon, the weak evidence in question may not be sufficient to make a functional annotation. However, in the presence of an operon in which there is strong evidence for some of the genes, the very presence of the gene in the operon is a strong indication that the gene shares in the process carried out by the operon. If the putative function is one expected to exist for the process in question and particularly if that function has been observed in the same operon in another species, then the annotation should be made. This type of evidence is inferred from the context of the gene in an operon, and therefore the evidence code is IGC "inferred from genomic context."

[object Object],[object Object],[object Object],[object Object],Accession Format All objects (papers, proteins, etc.) used in evidence must be represented according to GO’s format: “ database:accession” where “database” is an abbreviation defined at GO for the database housing the object, and “accession” is the accession number or other identifier for the object Each database must have a registered abbreviation with GO. New abbreviations can be requested at any time. The current file can be viewed at: www.geneontology.org/doc/GO.xrf_abbs

abbreviation: JCVI database: J Craig Venter Institute generic_url: http://www.jcvi.org/ abbreviation: JCVI_CMR database: Comprehensive Microbial Resource object: Locus example_id: JCVI_CMR:VCA0557 generic_url: http://www.jcvi.org/ url_syntax: http://www.jcvi.org/tigr-scripts/CMR2/GenePage.spl?locus= url_example: http://www.jcvi.org/tigr-scripts/CMR2/GenePage.spl?locus=VCA0557 abbreviation: TIGR_TIGRFAMS database: J Craig Venter Institute, TIGRFAMs HMM collection object: Accession example_id: TIGR_TIGRFAMS:TIGR00254 generic_url: http://www.jcvi.org/ url_syntax: http://www.jcvi.org/tigr-scripts/CMR2/hmm_report.spl?acc= url_example: http://www.jcvi.org/tigr-scripts/CMR2/hmm_report.spl?acc=TIGR00254 Example Abbreviations from GO file

From the literature…… Enzymes involved in DNA ligation and end-healing in the radioresistant bacterium Deinococcus radiodurans Blasius, et al, 2007, BMC Molecular Biology “ Results: in this report, we describe the cloning and expression of a Deinococcus radiodurans DNA ligase in Escherichia coli . This enzyme efficiently catalysis DNA ligation in the presence of Mn(II) and NAD+ as cofactors…….” Read on in the text to see how they did the experiments and we find that they performed a direct assay on purified protein. This is IDA for the ligase activity. cellular component biological process molecular function aspect GO:0003909 PMID: 17705817 IC cytoplasm GO:0005737 N/A PMID: 17705817 IDA DNA ligation GO:0006266 N/A PMID: 17705817 IDA DNA ligase activity GO:0003909 with reference ev code term name GO id

Sequence similarity-based Annotations ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]

For example…. … . a high quality alignment to an experimentally characterized triosephosphate isomerase from Vibrio marinus

further down the page Info from Swiss-Prot database on experimentally characterized match protein

… . full-length match, high percent identity (67.8%), conserved active and binding sites (boxed in red). High quality…..

Genome Property for glycolysis

Annotations resulting from this match (and others in the pathway for the biological process annotation) …. cellular component biological process molecular function aspect GO:0004807 GO_REF: 0000012 IC cytoplasm GO:0005737 TIGR_GenProp: GenProp0120 PMID: 15347579 IGC glycolysis GO:0006096 Swiss-Prot: P50921 GO_REF: 0000012 ISS triose-phosphate isomerase activity GO:0004807 with reference ev code term name GO id

Sharing Annotations ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]

Annotation (or Association) file format The annotation file consists of 15 tab-delimited columns: DB - the database submitting the annotation DB_Object_ID - unique identifier for the object being annotated DB_Object_Symbol - gene symbol or other symbol for the gene product Qualifier - one of: NOT, contributes_to, colocalizes_with Goid - the GO id number for the term annotated to the gene product DB:Reference - the reference which describes the annotation source Evidence Code - ISS, IEA, IMP, etc. With/From - accession number of item with sequence similarity, or physical interaction, etc. Aspect - one of: process, function, component stored as P, F, or C DB_Object_Name - common name of gene product (eg. “ribokinase”) Synonym - any alternate names for the gene product, can put gene symbol here too DB_Object_Type - what type of thing is being annotated (protein, gene, etc.) Taxon - the NCBI taxon id for the organism from which the gene product came Date - date the annotation was made Assigned_by - the database that made the annotation (same as the submitting db in most, but not all cases

Some of the Association files at GO

Updating Annotations and Enforcing Standards in the Repository ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]

GO Slims ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]

Using GO for data mining ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]

Annotation Training ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]

JCVI’s Annotation Service ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]

Data Availability ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]

Acknowledgements Director of bioinformatics: Granger Sutton Prokaryotic Annotation: Scott Durkin Bob Dodson Ramana Madupu Sagar Kothari Susmita Shrivastava Yinong Sebastian Derek Harkins And the many other TIGR/JCVI employees, present and past, who have contributed to the development of these tools and to the annotation protocols I have described. Also thanks to the funding agencies that make our work possible including NIH, NSF, and DOE. CMR/Pathema team: Tanja Davidsen Erin Kelsey BRC Project Manager: Lauren Brinkac HMM/Genome Properties team: Dan Haft Jeremy Selengut Building the tools: Nikhat Zafar Alex Richter and many more….

Pathema Burkholderia Annotation Jamboree: Prokaryotic Annotation Overview

Recomendados

Recomendados

Más contenido relacionado

La actualidad más candente

La actualidad más candente (20)

Similar a Pathema Burkholderia Annotation Jamboree: Prokaryotic Annotation Overview

Similar a Pathema Burkholderia Annotation Jamboree: Prokaryotic Annotation Overview (20)

Más de Pathema

Más de Pathema (9)

Último

Último (20)

Pathema Burkholderia Annotation Jamboree: Prokaryotic Annotation Overview