Comparative genome analysis requires high quality annotations of all genomic elements. Today’s sequencing projects face numerous challenges including lower coverage, more frequent assembly errors, and the lack of closely related species with well-annotated genomes. Precise elucidation of the many different biological features encoded in any genome requires careful examination and review. We need genome annotation editing tools to modify and refine the location and structure of the genome elements that predictive algorithms cannot yet resolve automatically. During the manual annotation process, curators identify elements that best represent the underlying biology and eliminate elements that reflect systemic errors of automated analyses.
Apollo is a web-based application that supports and enables collaborative genome curation in real time, analogous to Google Docs, allowing teams of curators to improve on existing automated gene models through an intuitive interface. Researchers from nearly one hundred institutions worldwide are currently using Apollo for distributed curation efforts in over sixty genome projects across the tree of life: from plants to arthropods, to fungi, to species of fish and other vertebrates including human, cattle (bovine), and dog.
Genome Curation using Apollo
1. Curating genes and genomes
Apollo: a collaborative tool for genome curation
Monica Munoz-Torres, PhD | @monimunozto
Berkeley Bioinformatics Open-Source Projects (BBOP)
Lawrence Berkeley National Laboratory | University of California Berkeley | U.S. Department of Energy
BioInfoGenomicsWkshopv2 | Reed College, Portland, Oregon | 10 October, 2015
2. OUTLINE
Web Apollo: Collaborative Curation and Interactive Analysis of Genomes
• Today we will discover how to extract the most valuable information about a genome through curation efforts.
3. APOLLO DEVELOPMENT
APOLLO DEVELOPERS
http://GenomeArchitect.org/
Nathan Dunn
Eric Yao
JBrowse, UC Berkeley
Christine Elsik’s Lab,
University of Missouri
Suzi Lewis
Principal Investigator
BBOP
Moni Munoz-Torres
Stephen Ficklin
GenSAS,
Washington State University
Colin Diesh, Deepak Unni
4. BY THE END OF THIS TALK
What to expect
You will:
• Better understand genome curation in the context of annotation: assembled genome → automated annotation → manual annotation.
• Become familiar with the environment and functionality of the Apollo genome annotation editing tool.
• Learn to identify homologs of known genes of interest in a newly sequenced genome.
• Learn about corroborating and modifying automatically annotated gene models using available evidence in Apollo.
6. Genome Sequencing Project
Anatomy of a genome sequencing project
[Figure: stages of a genome sequencing project. Labels: Experimental design, sampling; Sequencing; Assembly; Automated Annotation; Manual Annotation; Consensus Gene Set; Comparative analyses; Synthesis & dissemination.]
7. CURATING GENOMES
steps involved
1. Generation of gene models: calling ORFs, one or more rounds of gene prediction, etc.
2. Annotation of gene models: describing function, expression patterns, metabolic network memberships.
3. Manual annotation
8. GENOME ANNOTATION
objectives and uses
The gene set of an organism informs a variety of studies:
• Gene number, GC%, TE composition, repetitive regions.
• Functional assignments.
• Molecular evolution, sequence conservation.
• Gene families.
• Metabolic pathways.
• What makes an organism what it is? What makes a bee a “bee”?
Marbach et al. 2011. Nature Methods | Shutterstock.com | Alexander Wild
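One of the summary statistics listed above, GC%, is simple enough to sketch in a few lines. The function name and the choice to ignore ambiguous bases such as N are assumptions made for illustration, not part of any particular pipeline.

```python
def gc_percent(sequence: str) -> float:
    """Percentage of G and C among the unambiguous bases (A, C, G, T)."""
    seq = sequence.upper()
    # Count only unambiguous bases so that runs of N (gaps) do not dilute the value.
    counted = sum(seq.count(base) for base in "ACGT")
    if counted == 0:
        return 0.0
    gc = seq.count("G") + seq.count("C")
    return 100.0 * gc / counted


# Toy sequence: 8 unambiguous bases, 4 of them G or C.
print(gc_percent("ATGCGCNNTA"))
```

Excluding N from the denominator is a design choice; some tools divide by the full sequence length instead, which gives lower values on gapped scaffolds.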
10. WHAT WE NEED TO KNOW
for manual annotation
FOOD FOR THOUGHT
To remember… Biological concepts to better understand manual annotation
• GLOSSARY: from contig to splice site
• CENTRAL DOGMA: in molecular biology
• WHAT IS A GENE? defining your goal
• TRANSCRIPTION: mRNA in detail
• TRANSLATION: and other definitions
• GENOME CURATION: steps involved
13. CURATING GENOMES
CENTRAL “DOGMA”
of molecular biology
• DNA can be copied to DNA (DNA replication),
• DNA information can be copied into mRNA (transcription), and
• Proteins can be synthesized using the information in mRNA as a template (translation).
http://www.wisegeek.com/
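The transcription and translation steps can be illustrated with a toy example. This is a minimal sketch that assumes the standard genetic code and a coding-strand DNA input; the function names are hypothetical and the sequence is made up.

```python
from itertools import product

# Standard genetic code, built from the conventional TCAG ordering of codons.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(coding_strand_dna: str) -> str:
    """Copy coding-strand DNA into mRNA (T becomes U)."""
    return coding_strand_dna.upper().replace("T", "U")


def translate(mrna: str) -> str:
    """Translate an mRNA from its first base, stopping at the first Stop codon."""
    dna = mrna.upper().replace("U", "T")
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE.get(dna[i:i + 3], "X")  # X for unknown codons
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


mrna = transcribe("ATGGCCTAA")
print(mrna, translate(mrna))
```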
14. BIO-REFRESHER
What is a gene?
• The definition of a gene paints a very complex picture of molecular activity, and it is a continuously evolving concept.
• From the Sequence Ontology (SO): “A gene is a locatable region of genomic sequence, corresponding to a unit of inheritance, which is associated with regulatory regions, transcribed regions and/or other functional sequence regions”.
“Evolving Concept” at http://goo.gl/LpsajQ
15. BIO-REFRESHER
What is a gene?
• In your lifetime, the Encyclopedia of DNA Elements (ENCODE) project updated this concept yet again. Long transcripts & dispersed regulation!
“A gene is a DNA segment that contributes to phenotype/function. In the absence of demonstrated function, a gene may be characterized by sequence, transcription or homology.”
https://www.encodeproject.org/
16. BIO-REFRESHER
What is a gene?
let’s think computationally!
• Think of the genome as an operating system for a living being.
• Considering that the nucleotides of the genome are put together into a code that is executed through the process of transcription and translation…
• … think of genes as subroutines that are repetitively called in the process of transcription.
Gerstein et al., 2007. Genome Res.
17. BIO-REFRESHER
What is a gene?
considerations
Also consider:
• A gene is a genomic sequence (DNA or RNA) directly encoding functional product molecules, either RNA or protein.
• If several functional products share overlapping regions, we take the union of all overlapping genomic sequences coding for them.
• This union must be coherent (i.e., processed separately for final protein and RNA products) but does not require that all products necessarily share a common subsequence.
Gerstein et al., 2007. Genome Res.
18. BIO-REFRESHER
What is a gene?
The Gene: a moving target.
“The gene is a union of genomic sequences encoding a coherent set of potentially overlapping functional products.”
Gerstein et al., 2007. Genome Res.
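Read computationally, the “union of genomic sequences” in this definition is an interval union. The sketch below merges the exon intervals of several products into the regions that together make up the gene; the coordinates are invented and the function name is hypothetical.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # (start, end), 1-based and inclusive


def union_of_intervals(intervals: List[Interval]) -> List[Interval]:
    """Merge overlapping intervals into a sorted list of disjoint regions."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Overlaps the previous region: extend it if needed.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


# Two hypothetical products of one locus; they overlap only in part.
product_a = [(100, 200), (400, 500)]
product_b = [(150, 300), (450, 600)]
print(union_of_intervals(product_a + product_b))
```

Note that the two products need not share any single common subsequence for their union to be well defined, which matches the point made on the previous slide.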
19. BIO-REFRESHER
TRANSLATION
reading frame
• A reading frame is a manner of dividing the sequence of nucleotides in mRNA (or DNA) into a set of consecutive, non-overlapping triplets (codons).
• Three frames can be read in the 5’ → 3’ direction. Given that DNA has two anti-parallel strands, an additional three frames can be read on the anti-sense strand. Six total possible reading frames exist.
• In eukaryotes, only one reading frame per section of DNA is biologically relevant at a time: it has the potential to be transcribed into RNA and translated into protein. This is called the OPEN READING FRAME (ORF).
  - ORF = Start signal + coding sequence (divisible by 3) + Stop signal
• The sections of the mature mRNA transcribed with the coding sequence but not translated are called UnTranslated Regions (UTR); one at each end.
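The six reading frames and the ORF rule above can be made concrete with a short sketch. The frame labels (+1 to +3, -1 to -3) and the helper names are illustrative assumptions, and only the canonical ATG Start is considered.

```python
STOPS = {"TAA", "TAG", "TGA"}


def reverse_complement(seq: str) -> str:
    """Return the anti-sense strand, read 5' to 3'."""
    return seq.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def codons(seq: str, offset: int) -> list:
    """Split seq into consecutive, non-overlapping triplets starting at offset."""
    return [seq[i:i + 3] for i in range(offset, len(seq) - 2, 3)]


def six_frames(seq: str) -> dict:
    """Map each of the six reading frames to its list of codons."""
    forward = seq.upper()
    reverse = reverse_complement(forward)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = codons(forward, offset)
        frames[f"-{offset + 1}"] = codons(reverse, offset)
    return frames


def is_orf(seq: str) -> bool:
    """ORF = Start + coding sequence (divisible by 3) + Stop, with no internal Stop."""
    seq = seq.upper()
    if len(seq) < 6 or len(seq) % 3 != 0:
        return False
    triplets = codons(seq, 0)
    has_internal_stop = any(c in STOPS for c in triplets[1:-1])
    return triplets[0] == "ATG" and triplets[-1] in STOPS and not has_internal_stop


for name, frame in six_frames("ATGAAATAG").items():
    print(name, frame)
```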
22. BIO-REFRESHER
TRANSLATION
reading frame: splice sites
• The spliceosome catalyzes the removal of introns and the ligation of flanking exons.
  - introns: spaces inside the gene, not part of the coding sequence
  - exons: expression units (of the coding sequence)
• Splicing “signals” (from the point of view of an intron):
  - There is a 5’ end splice “signal” (site): usually GT (less common: GC)
  - And a 3’ end splice site: usually AG
  - …]5’-GT/AG-3’[…
• It is possible to produce more than one protein (polypeptide) sequence from the same genic region by alternatively bringing exons together: alternative splicing. For example, the gene Dscam (Drosophila) can generate about 38,000 alternatively spliced mRNAs (isoforms).
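Alternative splicing can be pictured as choosing which exons to join. The sketch below builds a made-up three-exon locus with GT…AG introns and derives two isoforms from it; all sequences, coordinates, and names are invented for illustration.

```python
def splice(genomic: str, exons) -> str:
    """Join exons given as (start, end) pairs, 1-based inclusive, forward strand."""
    return "".join(genomic[start - 1:end] for start, end in exons)


# Toy locus: exon1 + intron1 + exon2 + intron2 + exon3 (introns start GT, end AG).
genomic = "ATGGCC" + "GTAAGTTTTCAG" + "AAAGGG" + "GTGAGCCCTTAG" + "TTTTAA"

exon1, exon2, exon3 = (1, 6), (19, 24), (37, 42)

isoform_full = splice(genomic, [exon1, exon2, exon3])  # all three exons
isoform_skip = splice(genomic, [exon1, exon3])         # exon 2 skipped

print(isoform_full)
print(isoform_skip)
```

In this toy case the skipped exon is 6 bases long, so both isoforms stay in frame; skipping an exon whose length is not a multiple of three would shift the downstream reading frame.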
23. BIO-REFRESHER
TRANSLATION
now in your mind
• Although mRNAs are short-lived, understanding them is crucial, as they will become the center of your work.
"Gene structure" by Daycd - Wikimedia Commons
24. BIO-REFRESHER
TRANSLATION
reading frame: phase
• Introns can interrupt the reading frame of a gene by inserting a sequence between two consecutive codons,
• between the first and second nucleotide of a codon,
• or between the second and third nucleotide of a codon.
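These three cases are commonly called intron phase 0, 1, and 2. As a sketch, the phase of each intron can be derived from the cumulative length of the coding exons that precede it; the function name and the exon lengths are hypothetical.

```python
def intron_phases(cds_exon_lengths) -> list:
    """Phase of each intron, from the coding lengths of the exons in transcript order.

    0: intron lies between two codons
    1: intron lies between the first and second nucleotide of a codon
    2: intron lies between the second and third nucleotide of a codon
    """
    phases = []
    coding_bases_so_far = 0
    # The last exon is not followed by an intron, so it is skipped.
    for length in cds_exon_lengths[:-1]:
        coding_bases_so_far += length
        phases.append(coding_bases_so_far % 3)
    return phases


print(intron_phases([6, 4, 5, 3]))
```

This intron phase is not the same quantity as the phase column of a GFF3 CDS line, which counts the bases to skip at the start of a CDS segment to reach the next complete codon.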
"Exon and Intron classes”. Licensed under Fair use via Wikipedia
26. BIO-REFRESHER
HICCUPS
in transcription and translation
• The presence of premature Stop codons in the message is possible. A process called nonsense-mediated decay detects such messages and degrades them. Premature Stop codons can arise from incomplete splicing, DNA mutations, transcription errors, and leaky scanning by the ribosome, which cause changes in the reading frame (frameshifts).
• Insertions and deletions (indels) can cause frameshifts when the indel length is not divisible by three (3). As a result, the peptide can be abnormally long or abnormally short, depending on where the first in-frame Stop signal is located.
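The effect of an indel on the reading frame can be checked with a few lines of code. This is a toy sketch with an invented coding sequence; the helper names are hypothetical.

```python
STOPS = {"TAA", "TAG", "TGA"}


def causes_frameshift(indel_length: int) -> bool:
    """An indel shifts the frame when its length is not divisible by three."""
    return indel_length % 3 != 0


def first_in_frame_stop(cds: str):
    """0-based index of the first in-frame Stop codon, or None if there is none."""
    for i in range(0, len(cds) - 2, 3):
        if cds[i:i + 3] in STOPS:
            return i
    return None


original = "ATGGCCAAAGGGTTTTAA"                    # ATG GCC AAA GGG TTT TAA
one_base_insertion = original[:6] + "T" + original[6:]  # creates an early TAA
one_base_deletion = original[:3] + original[4:]          # Stop is lost from the frame
three_base_deletion = original[:3] + original[6:]        # frame is preserved

for label, seq in [
    ("original", original),
    ("+1 insertion", one_base_insertion),
    ("-1 deletion", one_base_deletion),
    ("-3 deletion", three_base_deletion),
]:
    print(label, first_in_frame_stop(seq))
```

The insertion yields an abnormally short peptide, while the single-base deletion removes the Stop from the frame, so translation would run on past the original end.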
28. GENE PREDICTION
• The identification of structural features of the genome:
  - Primarily focused on protein-coding genes.
  - Also predicts transfer RNAs (tRNA), ribosomal RNAs (rRNA), regulatory motifs, long and small non-coding RNAs (ncRNA), repetitive elements (masked), etc.
  - Two methods for identification.
  - Some are self-trained and some must be trained.
29. GENE PREDICTION
methods for discovery
1) Ab initio:
- based on DNA composition,
- deals strictly with genomic sequences,
- makes use of statistical approaches to search for coding regions and typical gene signals.
• E.g. Augustus, GENSCAN, geneid, fgenesh, etc.
Nat Rev Genet. 2015 Jun;16(6):321-32. doi: 10.1038/nrg3920
30. GENE PREDICTION
methods for discovery (ctd)
2) Homology-based:
- evidence-based,
- finds genes using either similarity searches in the main databases or experimental data including RNAseq, expressed sequence tags (ESTs), full-length complementary DNAs (cDNAs), etc.
• E.g. fgenesh++, Just Annotate My genome (JAMg), SGP2
Nucleic Acids 2003 vol. 31 no. 13 3738-3741
31. GENE ANNOTATION
Integration of data from computational & experimental evidence with data from prediction tools, to generate a reliable set of structural annotations.
Involves:
1) ab initio predictions
2) assessment of biological evidence to drive the gene prediction process
3) synthesis of these results to produce a set of consensus gene models
32. GENE ANNOTATION
Gene models may be organized into “sets” using:
• automatic integration of predicted sets (combiners); e.g. GLEAN, EvidenceModeler
or
• tools packaged into pipelines; e.g. MAKER, PASA, Gnomon, Ensembl, etc.
In some cases, algorithms and metrics used to generate consensus sets may actually reduce the accuracy of the gene’s representation.
33. ANNOTATION
an imperfect art
No one is perfect, least of all automated annotation.
New technology brings new challenges:
• Assembly errors can cause fragmented annotations
• Limited coverage makes precise identification a difficult task
Image: www.BroadInstitute.org
34. MANUAL ANNOTATION
improving predictions
Precise elucidation of biological features encoded in the genome requires careful examination and review.
Automated Predictions
Experimental Evidence: cDNAs, HMM domain searches, RNAseq, genes from other species.
Manual Annotation to the rescue.
Schiex et al. Nucleic Acids 2003 (31) 13: 3738-3741
35. BIOCURATION
structural and functional adjustments
MANUAL ANNOTATION
1. Identifies elements that best represent the underlying biology and eliminates elements that reflect systemic errors of automated analyses.
2. Assigns function through comparative analysis of similar genome elements from closely related species using literature, databases, and experimental data.
http://GeneOntology.org
36. GENOME ANNOTATION
an inherently collaborative task
Researchers often turn to colleagues for second opinions and insight from those with expertise in particular areas (e.g., domains, families).
So many sequences, but not enough hands!
37. APOLLO
collaborative genome annotation editing tool
• Web based, integrated with JBrowse.
• Supports real time collaboration!
• Automatic generation of ready-made computable data.
• Supports annotation of genes, pseudogenes, tRNAs, snRNAs, snoRNAs, ncRNAs, miRNAs, TEs, and repeats.
• Intuitive annotation, gestures, and pull-down menus to create and edit transcripts and exon structures, insert comments (CV, freeform text), associate GO terms, etc.
http://GenomeArchitect.org
38. APOLLO ARCHITECTURE
simple, flexible
Web-based client + annotation-editing engine + server-side data service
[Architecture diagram, Apollo v2.0. Labels:]
• Client: Google Web Toolkit (GWT) / Bootstrap; JBrowse (DOJO / jQuery)
• Communication: REST / JSON; Websockets
• Annotation Engine (Server); Shiro, LDAP, OAuth
• Single Data Store (PostgreSQL, MySQL, MongoDB, ElasticSearch): Annotations, Security, Preferences, Organisms, Tracks, Annotators
• JBrowse Data (Organism 1, Organism 2): BAM, BED, VCF, GFF3, BigWig
• Load genomic evidence per selected organism
39. LESSONS LEARNED
We continuously train and support hundreds of geographically dispersed scientists from diverse research communities in conducting manual annotation efforts to recover coding sequences in agreement with all available biological evidence using Apollo.
What we have learned:
• Collaborative work distills invaluable knowledge
• We must enforce strict rules and formats
• We must evolve with the data
• NGS poses additional challenges
40. TRAINING CURATORS
a little training goes a long way!
Provided with adequate tools, wet lab scientists make exceptional curators who can easily learn to maximize the generation of accurate, biologically supported gene models.
42. APOLLO
annotation editing environment
BECOMING ACQUAINTED WITH APOLLO
• Color by CDS frame, toggle strands, set color scheme and highlights.
• Upload evidence files (GFF3, BAM, BigWig), add combination and sequence search tracks.
• Query the genome using BLAT.
• Navigation and zoom.
• Search for a gene model or a scaffold.
• Get coordinates and “rubber band” selection for zooming.
• Login
• User-created annotations.
• Annotator panel.
• Evidence Tracks
• Stage and cell-type specific transcription data.
http://genomearchitect.org/web_apollo_user_guide
46. Becoming Acquainted with Web Apollo
GENERAL PROCESS OF CURATION
main steps to remember
1. Select or find a region of interest, e.g. scaffold.
2. Select appropriate evidence tracks to review the gene model.
3. Determine whether a feature in an existing evidence track will provide a reasonable gene model to start working.
4. If necessary, adjust the gene model.
5. Check your edited gene model for integrity and accuracy by comparing it with available homologs.
6. Comment and finish.
47. USER NAVIGATION
removable side dock
HIGHLIGHTED IMPROVEMENTS
Annotations | Tracks | Reference Sequence | Organism | Users | Groups | Admin
50. USER NAVIGATION
Annotator panel.
• Choose appropriate evidence from list of “Tracks” on annotator panel.
• Select & drag elements from evidence track into the ‘User-created Annotations’ area.
• Hovering over annotation in progress brings up an information pop-up.
• Creating a new annotation
51. USER NAVIGATION
• Annotation right-click menu
52. USER NAVIGATION
• ‘Zoom to base level’ option reveals the DNA Track.
53. USER NAVIGATION
• Color exons by CDS from the ‘View’ menu.
54. USER NAVIGATION
• Zoom in/out with keyboard: shift + arrow keys up/down
• Toggle reference DNA sequence and translation frames in forward strand. Toggle models in either direction.
57. ANNOTATING SIMPLE CASES
“Simple case”:
- the predicted gene model is correct or nearly correct, and
- this model is supported by evidence that completely or mostly agrees with the prediction.
- evidence that extends beyond the predicted model is assumed to be non-coding sequence.
The following are simple modifications.
58. ADDING EXONS
• A confirmation box will warn you if the receiving transcript is not on the same strand as the feature where the new exon originated.
• Check ‘Start’ and ‘Stop’ signals after each edit.
59. ADDING UTRs
If transcript alignment data are available and extend beyond your original annotation, you may extend or add UTRs.
1. Right click at the exon edge and ‘Zoom to base level’.
2. Place the cursor over the edge of the exon until it becomes a black arrow, then click and drag the edge of the exon to the new coordinate position that includes the UTR.
To add a new spliced UTR to an existing annotation, follow the procedure for adding an exon.
60. MATCHING EXON BOUNDARY TO EVIDENCE
To modify an exon boundary and match data in the evidence tracks: select both the offending exon and the feature with the expected boundary, then right click on the annotation to select ‘Set 3’ end’ or ‘Set 5’ end’ as appropriate.
In some cases all the data may disagree with the annotation; in other cases some data support the annotation and some of the data support one or more alternative transcripts. Try to annotate as many alternative transcripts as are well supported by the data.
61. CHECKING EXON INTEGRITY
1. Zoom in to clearly resolve each exon as a distinct rectangle.
2. Two exons from different tracks sharing the same start and/or end coordinates will display a red bar to indicate matching edges.
3. Selecting the whole annotation or one exon at a time, use this ‘edge-matching’ function and scroll along the length of the annotation, verifying exon boundaries against available data. Use square [ ] brackets to scroll from exon to exon.
4. Check if cDNA / RNAseq reads lack one or more of the annotated exons or include additional exons.
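The idea behind edge matching can be sketched as a coordinate comparison: an exon edge “matches” when some evidence feature has the same start or end. This is only an illustration of the concept with invented coordinates and a hypothetical function name; it does not describe how Apollo implements the feature.

```python
def matching_edges(annotation_exons, evidence_exons):
    """For each annotated exon, report whether its start and end match the evidence.

    Exons are (start, end) pairs. Returns (start, end, start_matches, end_matches).
    """
    evidence_starts = {start for start, _ in evidence_exons}
    evidence_ends = {end for _, end in evidence_exons}
    return [
        (start, end, start in evidence_starts, end in evidence_ends)
        for start, end in annotation_exons
    ]


annotation = [(100, 200), (300, 400)]
evidence = [(100, 200), (310, 400)]  # second exon starts 10 bases later

for start, end, start_ok, end_ok in matching_edges(annotation, evidence):
    print(start, end, "start matches" if start_ok else "start differs",
          "end matches" if end_ok else "end differs")
```

An edge that does not match is a prompt to zoom in and inspect the boundary, not proof that the annotation is wrong.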
62. ORFs AND SPLICE SITES
• Non-canonical splice sites flags.
• Double click: selection of feature and sub-features
• Evidence Tracks Area
• ‘User-created Annotations’ Track
• Edge-matching
Apollo’s editing logic (brain):
- selects longest ORF as CDS
- flags non-canonical splice sites
63. SPLICE SITES
Non-canonical splices are indicated by an orange circle with a white exclamation point inside, placed over the edge of the offending exon.
Canonical splice sites:
forward strand: 5’-…exon]GT / AG[exon…-3’
reverse strand, not reverse-complemented: 3’-…exon]GA / TG[exon…-5’
Zoom to review non-canonical splice site warnings. Although these may not always have to be corrected (e.g. GC donor), they should be flagged with the appropriate comment.
Exon/intron splice site error warning
Curated model
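A splice-site check of this kind can be sketched as follows. The function assumes the intron is given in forward-strand coordinates and reverse-complements it for reverse-strand genes before reading the donor and acceptor dinucleotides; names, labels, and sequences are illustrative assumptions rather than Apollo's own logic.

```python
def reverse_complement(seq: str) -> str:
    return seq.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def classify_intron(genomic: str, start: int, end: int, strand: str) -> str:
    """Classify an intron given 1-based inclusive forward-strand coordinates."""
    intron = genomic[start - 1:end].upper()
    if strand == "-":
        # Read the intron 5' to 3' on the strand that is actually transcribed.
        intron = reverse_complement(intron)
    donor, acceptor = intron[:2], intron[-2:]
    if donor == "GT" and acceptor == "AG":
        return "canonical"
    if donor == "GC" and acceptor == "AG":
        return "non-canonical (GC donor)"
    return "non-canonical"


forward_locus = "ATGGCC" + "GTAAGTTTTCAG" + "AAAGGG"
reverse_locus = reverse_complement(forward_locus)  # same gene placed on the reverse strand

print(classify_intron(forward_locus, 7, 18, "+"))
print(classify_intron(reverse_locus, 7, 18, "-"))
```

On the forward-strand text of a reverse-strand gene, a canonical intron begins with CT and ends with AC, which is why the strand has to be taken into account before flagging a site.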
64. ‘START’ AND ‘STOP’ SITES
Web Apollo calculates the longest possible open reading frame (ORF) that includes canonical ‘Start’ and ‘Stop’ signals within the predicted exons.
If ‘Start’ appears to be incorrect, modify it by selecting an in-frame ‘Start’ codon further up or downstream, depending on evidence (protein database, additional evidence tracks).
It may be present outside the predicted gene model, within a region supported by another evidence track.
In very rare cases, the actual ‘Start’ codon may be non-canonical (non-ATG).
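To make the “longest ORF” idea concrete, here is a minimal sketch that scans a spliced transcript for the longest stretch running from an ATG to the first in-frame Stop. It illustrates the concept only; Apollo's own calculation may differ, for instance in how it treats partial ORFs or non-canonical Starts.

```python
STOPS = {"TAA", "TAG", "TGA"}


def longest_orf(transcript: str):
    """Return (start, end) of the longest ATG-to-Stop ORF, 0-based half-open, or None."""
    seq = transcript.upper()
    best = None
    for start in range(len(seq) - 2):
        if seq[start:start + 3] != "ATG":
            continue
        # Walk codon by codon from this Start until the first in-frame Stop.
        for pos in range(start + 3, len(seq) - 2, 3):
            if seq[pos:pos + 3] in STOPS:
                end = pos + 3
                if best is None or end - start > best[1] - best[0]:
                    best = (start, end)
                break
    return best


transcript = "CATGAAACCCGGGTAAATGTGA"  # made-up spliced transcript
orf = longest_orf(transcript)
print(orf, transcript[orf[0]:orf[1]])
```

Choosing a different in-frame Start, as described above, amounts to overriding the start coordinate that such a scan proposes.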
66. COMPLEX CASES
merge two gene predictions on the same scaffold
Evidence may support joining two or more different gene models. Warning: protein alignments may have incorrect splice sites and lack non-conserved regions!
1. In ‘User-created Annotations’ area shift-click to select an intron from each gene model and right click to select the ‘Merge’ option from the menu.
2. Drag supporting evidence tracks over the candidate models to corroborate overlap, or review edge matching and coverage across models.
3. Check the resulting translation by querying a protein database e.g. UniProt, NCBI nr. Add comments to record that this annotation is the result of a merge.
Red lines around exons: ‘edge-matching’ allows annotators to confirm whether the evidence is in agreement without examining each exon at the base level.
67. COMPLEX CASES
split a gene prediction
One or more splits may be recommended when:
- different segments of the predicted protein align to two or more different gene families
- predicted protein doesn’t align to known proteins over its entire length
Transcript data may support a split, but first verify whether they are alternative transcripts.
68. COMPLEX CASES
correcting frameshifts and single-base errors
DNA Track
‘User-created Annotations’ Track
Always remember: when annotating gene models using Apollo, you are looking at a ‘frozen’ version of the genome assembly and you will not be able to modify the assembly itself.
71. COMPLEX CASES
correcting frameshifts, single-base errors, and selenocysteines
1. Apollo allows annotators to make single base modifications or frameshifts that are reflected in the sequence and structure of any transcripts overlapping the modification. These manipulations do NOT change the underlying genomic sequence.
2. If you determine that you need to make one of these changes, zoom in to the nucleotide level and right click over a single nucleotide on the genomic sequence to access a menu that provides options for creating insertions, deletions or substitutions.
3. The ‘Create Genomic Insertion’ feature will require you to enter the necessary string of nucleotide residues that will be inserted to the right of the cursor’s current location. The ‘Create Genomic Deletion’ option will require you to enter the length of the deletion, starting with the nucleotide where the cursor is positioned. The ‘Create Genomic Substitution’ feature asks for the string of nucleotide residues that will replace the ones on the DNA track.
4. Once you have entered the modifications, Apollo will recalculate the corrected transcript and protein sequences, which will appear when you use the right-click menu ‘Get Sequence’ option. Since the underlying genomic sequence is reflected in all annotations that include the modified region, you should alert the curators of your organism’s database using the ‘Comments’ section to report the CDS edits.
5. In special cases such as selenocysteine-containing proteins (read-throughs), right-click over the offending/premature ‘Stop’ signal and choose the ‘Set readthrough stop codon’ option from the menu.
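The three alteration types in step 3 can be modelled as functions that return an edited copy of the sequence while leaving the reference untouched, mirroring the point that the assembly itself is never modified. This is a conceptual sketch with hypothetical function names and a toy sequence, not Apollo's API.

```python
def apply_insertion(seq: str, position: int, residues: str) -> str:
    """Insert residues to the right of the base at 1-based position."""
    return seq[:position] + residues + seq[position:]


def apply_deletion(seq: str, position: int, length: int) -> str:
    """Delete length bases, starting with the base at 1-based position."""
    return seq[:position - 1] + seq[position - 1 + length:]


def apply_substitution(seq: str, position: int, residues: str) -> str:
    """Replace the bases starting at 1-based position with residues."""
    return seq[:position - 1] + residues + seq[position - 1 + len(residues):]


reference = "ATGGCCAAATAA"  # stands in for the 'frozen' assembly

print(apply_insertion(reference, 3, "T"))
print(apply_deletion(reference, 4, 3))
print(apply_substitution(reference, 4, "C"))
print(reference)  # unchanged: Python strings are immutable
```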
72. USER NAVIGATION
• Annotation right-click menu
73. USER NAVIGATION
Annotations, annotation edits, and History: stored in a centralized database.
74. COMPLETING THE ANNOTATION
Follow the checklist until you are happy with the annotation!
And remember to…
- comment to validate your annotation, even if you made no changes to an existing model. Think of comments as your vote of confidence.
- or add a comment to inform the community of unresolved issues you think this model may have.
Always Remember: Apollo curation is a community effort so please use comments to communicate the reasons for your annotation. Your comments will be visible to everyone.
75. USER NAVIGATION
• Annotation right-click menu
76. USER NAVIGATION
The Annotation Information Editor
DBXRefs are database cross-references: if you have reason to believe that this gene is linked to a gene in a public database (including your own), then add it here.
77. USER NAVIGATION
The Annotation Information Editor
• Add PubMed IDs
• Include GO terms as appropriate from any of the three ontologies
• Write comments stating how you have validated each model.
79. THE CHECKLIST
for accuracy and integrity
MANUAL ANNOTATION CHECKLIST
• Check ‘Start’ and ‘Stop’ sites.
• Check splice sites: most splice sites display these residues …]5’-GT/AG-3’[…
• Check if you can annotate UTRs, for example using RNA-Seq data:
  - Align it against relevant genes/gene family
  - blastp against NCBI’s RefSeq or nr
• Check for gaps in the genome.
• Additional functionality may be necessary:
  - Merging 2 gene predictions on the same scaffold
  - Merging 2 gene predictions from different scaffolds
  - Splitting a gene prediction
  - Correcting frameshifts and other errors in the genome assembly
  - Annotate selenocysteines, correct single-base errors, etc.
• Add:
  - Important project information in the form of comments
  - IDs from public databases e.g. GenBank (via DBXRef), gene symbol(s), common name(s), synonyms, top BLAST hits, orthologs with species names, and everything else you can think of, because you are the expert.
  - Comments about the kinds of changes you made to the gene model of interest, if any.
  - Any appropriate functional assignments, e.g. via BLAST, RNA-Seq data, literature searches, etc.
81. Example
A public Apollo Demo using the Honey Bee genome is available at http://genomearchitect.org/WebApolloDemo
- Curation example using the Hyalella azteca genome (amphipod crustacean).
82. What do we know about this genome?
• Currently publicly available data at NCBI:
  - >37,000 nucleotide seqs → scaffolds, mitochondrial genes
  - 300 amino acid seqs → mitochondrion
  - 53 ESTs
  - 0 conserved domains identified
  - 0 “gene” entries submitted
• Data at i5K Workspace@NAL (annotation hosted at USDA)
  - 10,832 scaffolds: 23,288 transcripts: 12,906 proteins
84. PubMed Search: what’s new?
“Ten populations (3 cultures, 7 from California water bodies) differed by at least 550-fold in sensitivity to pyrethroids.”
“By sequencing the primary pyrethroid target site, the voltage-gated sodium channel (vgsc), we show that point mutations and their spread in natural populations were responsible for differences in pyrethroid sensitivity.”
“The finding that a non-target aquatic species has acquired resistance to pesticides used only on terrestrial pests is troubling evidence of the impact of chronic pesticide transport from land-based applications into aquatic systems.”
85. How many sequences are there, publicly available, for our gene of interest? And what do we know about them?
• Para (voltage-gated sodium channel alpha subunit; Nasonia vitripennis).
• NaCP60E (Sodium channel protein 60 E; D. melanogaster).
  - MF: voltage-gated cation channel activity (IDA, GO:0022843).
  - BP: olfactory behavior (IMP, GO:0042048), sodium ion transmembrane transport (ISS, GO:0035725).
  - CC: voltage-gated sodium channel complex (IEA, GO:0001518).
86. Retrieving sequences for sequence similarity searches.
>vgsc-Segment3-DomainII
RVFKLAKSWPTLNLLISIMGKTVGALGNLTFVLCIIIFIFAVMGMQLFGKNYTEKVTKFKWSQDG
QMPRWNFVDFFHSFMIVFRVLCGEWIESMWDCMYVGDFSCVPFFLATVVIGNLVVSFMHR
88. BLAT search results
• High-scoring segment pairs (hsp) are listed in tabulated format.
• Clicking on one line of results sends you to those coordinates.
89. Creating a new gene model: drag and drop
• Apollo automatically calculates the ORF. In this case, the ORF includes the high-scoring segment pairs (hsp), marked here in blue.
95. Editing: merge the three models
Merge by dropping an exon or gene model onto another.
or…
Merge by selecting two exons (holding down “Shift”) and using the right click menu.
97. Editing: correct boundaries
Modify exon / intron boundary:
- Drag the end of the exon to the nearest canonical splice site.
or
- Use right-click menu.
100. Editing: add an exon - supported by RNAseq
• RNAseq reads show evidence in support of transcribed product, which was not predicted.
• Add exon at coordinates 97946-98012 by dragging up one of the RNAseq reads.
103. Finished model
Corroborate integrity and accuracy of the model:
- Start and Stop
- Exon structure and splice sites …]5’-GT/AG-3’[…
- Check the predicted protein product vs. NCBI nr, UniProt, etc.
104. Information Editor
• DBXRefs: e.g. NP_001128389.1, N. vitripennis, RefSeq
• PubMed identifier: PMID: 24065824
• Gene Ontology IDs: GO:0022843, GO:0042048, GO:0035725, GO:0001518.
• Comments (if applicable).
• Name, Symbol.
• Approve / Delete radio button.
110. Exercises
Live Demonstration using the Apis mellifera genome.
1. Evidence in support of protein coding gene models.
1.1 Consensus Gene Sets:
Official Gene Set v3.2
Official Gene Set v1.0
1.2 Consensus Gene Sets comparison:
OGSv3.2 genes that merge OGSv1.0 and RefSeq genes
OGSv3.2 genes that split OGSv1.0 and RefSeq genes
1.3 Protein Coding Gene Predictions Supported by Biological Evidence:
NCBI Gnomon
Fgenesh++ with RNASeq training data
Fgenesh++ without RNASeq training data
NCBI RefSeq Protein Coding Genes and Low Quality Protein Coding Genes
1.4 Ab initio protein coding gene predictions:
Augustus Set 12, Augustus Set 9, Fgenesh, GeneID, N-SCAN, SGP2
1.5 Transcript Sequence Alignment:
NCBI ESTs, Apis cerana RNA-Seq, Forager Bee Brain Illumina Contigs, Nurse Bee Brain Illumina Contigs, Forager RNA-Seq reads, Nurse RNA-Seq reads, Abdomen 454 Contigs, Brain and Ovary 454 Contigs, Embryo 454 Contigs, Larvae 454 Contigs, Mixed Antennae 454 Contigs, Ovary 454 Contigs, Testes 454 Contigs, Forager RNA-Seq HeatMap, Forager RNA-Seq XY Plot, Nurse RNA-Seq HeatMap, Nurse RNA-Seq XY Plot
111. Exercises
Live Demonstration using the Apis mellifera genome.
1. Evidence in support of protein coding gene models (Continued).
1.6 Protein homolog alignment:
Acep_OGSv1.2, Aech_OGSv3.8, Cflo_OGSv3.3, Dmel_r5.42, Hsal_OGSv3.3, Lhum_OGSv1.2, Nvit_OGSv1.2, Nvit_OGSv2.0, Pbar_OGSv1.2, Sinv_OGSv2.2.3, Znev_OGSv2.1, Metazoa_Swissprot
2. Evidence in support of non protein coding gene models
2.1 Non-protein coding gene predictions:
NCBI RefSeq Noncoding RNA
NCBI RefSeq miRNA
2.2 Pseudogene predictions:
NCBI RefSeq Pseudogene
113. Thank you.
• Berkeley Bioinformatics Open-source Projects (BBOP), Berkeley Lab: Apollo and Gene Ontology teams. Suzanna E. Lewis (PI).
• § Christine G. Elsik (PI). University of Missouri.
• * Ian Holmes (PI). University of California Berkeley.
• Arthropod genomics community: i5K Steering Committee (esp. Sue Brown (Kansas State)), Alexie Papanicolaou (UWS), and the Honey Bee Genome Sequencing Consortium.
• Stephen Ficklin, GenSAS, Washington State University
• Apollo is supported by NIH grants 5R01GM080203 from NIGMS, and 5R01HG004483 from NHGRI. Both projects are also supported by the Director, Office of Science, Office of Basic Energy Sciences, of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231
• For your attention, thank you!
BBOP
Apollo: Nathan Dunn, Colin Diesh §, Deepak Unni §
Gene Ontology: Chris Mungall, Seth Carbon, Heiko Dietze
JBrowse: Eric Yao *
NAL at USDA: Monica Poelchau, Christopher Childers, Gary Moore
HGSC at BCM: fringy Richards, Kim Worley
Apollo: http://GenomeArchitect.org
GO: http://GeneOntology.org
i5K: http://arthropodgenomes.org/wiki/i5K