GRC Workshop held at Churchill College on Sep 21, 2014. Talk by Bronwen Aken discussing the Ensembl approach to annotating the complete human reference assembly.
4. Challenges
1. Find functional elements in a genome
• Data have lots of noise
2. Software / hardware
• Storing and manipulating data
3. Intuitive and comprehensive access to data
• Visualization
6. What is Genebuilding?
• Automatic, evidence-based annotation of genes
• Not ab initio
• Based on sequence alignment
• “Best-in-genome”
• Aim for high specificity
• Prefer to miss a few features rather than heavily over-predict
The automated gene annotation pipeline is designed around decisions made during manual annotation
7. Advantages of re-annotating
• Add new genes to new / fixed genomic regions
• Updated supporting evidence: remove models built on data that have been deleted from archives
• Move alignments to regions with better mapping
8. Gene annotation pipeline – the basics
Identify interesting regions
• Rough alignment of sequences to genome
Exhaustive alignment to produce transcript models
Filter models
• Prioritize data sources
Produce ‘best guess’ gene set
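The "filter models, prioritize data sources" step can be illustrated with a minimal sketch. This is not the Ensembl implementation: the `Model` class, the source labels and the overlap rule are all assumptions made for illustration. Models from a higher-priority source claim a locus, and lower-priority models are kept only where nothing of higher priority overlaps them on the same strand.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    """A hypothetical transcript model; field names are illustrative only."""
    name: str
    source: str   # evidence type, e.g. "same_species_protein" (assumed label)
    start: int
    end: int
    strand: int


def overlaps(a: Model, b: Model) -> bool:
    """True if two models share at least one base on the same strand."""
    return a.strand == b.strand and a.start <= b.end and b.start <= a.end


def layer_models(models, priority):
    """Keep a model unless a model from a higher-priority source overlaps it.

    `priority` lists source labels from most to least trusted.
    """
    rank = {source: i for i, source in enumerate(priority)}
    kept = []
    # Visit the most trusted sources first so they claim their loci.
    for model in sorted(models, key=lambda m: rank[m.source]):
        blocked = any(
            overlaps(model, k) and rank[k.source] < rank[model.source]
            for k in kept
        )
        if not blocked:
            kept.append(model)
    return kept


if __name__ == "__main__":
    priority = ["same_species_protein", "other_species_protein", "est"]
    models = [
        Model("est1", "est", 100, 500, 1),
        Model("prot1", "same_species_protein", 150, 450, 1),
        Model("est2", "est", 1000, 1500, 1),
        Model("other1", "other_species_protein", 120, 400, -1),
    ]
    for m in layer_models(models, priority):
        print(m.name, m.source)
```

Models from the same source that overlap each other are all kept in this sketch, on the assumption that they may represent alternative transcripts.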
9. Repeatmasking
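[Flowchart of the genebuild pipeline; box labels as shown on the slide]
• Inputs: same-species proteins, other-species proteins, cDNAs/ESTs
• Boxes: filtering, protein-coding genebuild, UTR addition, filtering, final gene set
• Filtering: TranscriptConsensus, LayerAnnotation
• Also: small ncRNAs, lincRNAs, pseudogenes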
10. Repeatmasking
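[Flowchart of the genebuild pipeline; box labels as shown on the slide]
• Inputs: same-species proteins, other-species proteins, cDNAs/ESTs
• Boxes: filtering, protein-coding genebuild, UTR addition, filtering, final gene set
• Filtering: RNA-Seq models
• Also: small ncRNAs, lincRNAs, pseudogenes
• Merge with HAVANA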
11. Release cycle
26 September 2014
[Figure: genome features labelled Regulation, Gene, Allele and Conserved sequence. Adapted from the ENCODE project, www.nature.com/nature/focus/encode/]
Genes
• Coding & noncoding
• Protein & mRNA alignments
• GTF & BAM files
Compara
• Conserved DNA sequence
• Multiple genome alignments
• Homologues
• Protein families
Regulatory regions
• DNA methylation
• TFBS
• Open chromatin
Variation
• SNPs, indels, structural variation
• Phenotypes
• QTLs
14. Genome assembly representation
• Coord_system table
• Lists the allowed coordinate systems
• chromosome, scaffold, contig
• With ‘versions’
• GRCh37, GRCh38
• Contigs are shared between assemblies so have no version
• ‘Toplevel’ coordinate system
• Chromosomes + unplaced scaffolds + unlocalized scaffolds + alternate sequences
• Most popular means to access the whole genome
• API options for including/excluding alternate sequences and PAR
20. Seq_region names
• Regions of the genome are given a slice name; it’s like an address
• e.g. chromosome:GRCh37:6:133090509:133119701:1
• Users like to say, ‘chromosome 6’
• INSDC coordinates are versioned, but less human-readable
• chromosome:GRCh37:CM000668.1:133090509:133119701:1
Slice name fields, in order: coord_system : assembly : seq_region.name : start : end : strand
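The colon-separated slice name shown above can be split into its six fields. The sketch below is a standalone illustration, not part of the Ensembl API; the `SliceName` type and `parse_slice_name` function are hypothetical names, and the field order is taken from the two example names on this slide.

```python
from typing import NamedTuple


class SliceName(NamedTuple):
    """Fields of a slice name, in the order they appear in the string."""
    coord_system: str
    assembly: str
    seq_region_name: str
    start: int
    end: int
    strand: int


def parse_slice_name(name: str) -> SliceName:
    """Split 'coord_system:assembly:seq_region:start:end:strand' into fields."""
    parts = name.split(":")
    if len(parts) != 6:
        raise ValueError(f"expected 6 colon-separated fields, got {len(parts)}")
    coord_system, assembly, seq_region, start, end, strand = parts
    return SliceName(coord_system, assembly, seq_region,
                     int(start), int(end), int(strand))


if __name__ == "__main__":
    by_name = parse_slice_name("chromosome:GRCh37:6:133090509:133119701:1")
    by_accession = parse_slice_name(
        "chromosome:GRCh37:CM000668.1:133090509:133119701:1")
    print(by_name)
    print(by_accession)
```

Both example names describe the same interval; only the seq_region name differs, which is the trade-off between a readable name and a versioned INSDC accession.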
21. Alternate sequences
• Assembly_exception table defines ‘bubbles’
• Initially set up to handle Y chromosome PAR
• Adapted to work for MHC haplotypes
• Now also used for GRC patches
• Assumes ‘equivalent’ region will be present in primary assembly
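The idea of a ‘bubble’ can be sketched as a record that pairs an interval on an alternate sequence with its equivalent interval on the primary assembly. The class and field names below are hypothetical and only loosely inspired by the assembly_exception concept; they are not the actual table schema, and all region names and coordinates are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AssemblyException:
    """One 'bubble': an alternate interval and its primary equivalent.

    Field names are assumptions for this sketch, not the real schema.
    """
    alt_region: str
    alt_start: int
    alt_end: int
    exc_type: str      # illustrative label for the kind of exception
    ref_region: str
    ref_start: int
    ref_end: int


def patched_path(exc: AssemblyException, ref_length: int):
    """Return the segments that make up the 'patched' chromosome.

    The primary sequence is used outside the bubble and the alternate
    sequence inside it. Segments are (region, start, end), 1-based inclusive.
    """
    segments = []
    if exc.ref_start > 1:
        segments.append((exc.ref_region, 1, exc.ref_start - 1))
    segments.append((exc.alt_region, exc.alt_start, exc.alt_end))
    if exc.ref_end < ref_length:
        segments.append((exc.ref_region, exc.ref_end + 1, ref_length))
    return segments


if __name__ == "__main__":
    # Invented coordinates: the alternate interval is 100 bp longer than
    # the primary interval it replaces.
    exc = AssemblyException("ALT_1", 1001, 1500, "PATCH_FIX", "17", 1001, 1400)
    for segment in patched_path(exc, ref_length=5000):
        print(segment)
```

The sketch relies on the same assumption stated on the slide: every alternate interval has an equivalent interval on the primary assembly.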
22. Gene annotation on a ‘patched’ genome
[Figure: two Ensembl browser views of the same locus at about 62.2 to 62.5 Mb, labelled ‘Patched chromosome’ (HG183_PATCH) and ‘Primary chromosome’ (Chr. 17). Tracks include genes (GENCODE), contigs, assembly exceptions and H.sap-H.sap lastz alignments. The gene legend distinguishes protein coding, merged Ensembl/Havana, RNA gene and pseudogene.]
• TEX2 gene lies across the patch boundary
• PECAM1 is annotated only on patch HG183
• Gap in primary assembly
26. Gene annotation on patches
[Diagram: patch aligned to primary assembly]
1. Manual annotation
2. Project models to patch
27. Gene annotation on patches
[Diagram: patch aligned to primary assembly]
1. Manual annotation
2. Project models to patch
3. Gap-fill with mini genebuild
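Steps 2 and 3 can be sketched with simple coordinate arithmetic. This is an illustrative simplification, not the Ensembl projection code: it assumes the patch-to-primary alignment is given as ungapped blocks, projects a model only when every exon falls inside a single block, and then reports patch intervals left without projected annotation as candidates for the mini genebuild. All function names and coordinates are invented.

```python
def project_interval(start, end, blocks):
    """Map a primary interval onto the patch if one ungapped block contains it.

    `blocks` holds (primary_start, primary_end, patch_start) tuples.
    Returns (patch_start, patch_end) or None when no block contains it.
    """
    for p_start, p_end, patch_start in blocks:
        if p_start <= start and end <= p_end:
            offset = patch_start - p_start
            return (start + offset, end + offset)
    return None


def project_model(exons, blocks):
    """Project every exon of a model; give up if any exon cannot be mapped."""
    projected = [project_interval(s, e, blocks) for s, e in exons]
    return None if any(p is None for p in projected) else projected


def uncovered(patch_length, covered):
    """Return patch intervals not covered by any projected model span."""
    gaps, cursor = [], 1
    for start, end in sorted(covered):
        if start > cursor:
            gaps.append((cursor, start - 1))
        cursor = max(cursor, end + 1)
    if cursor <= patch_length:
        gaps.append((cursor, patch_length))
    return gaps


if __name__ == "__main__":
    # Invented alignment: patch 1-1000 and 1501-2000 align to the primary;
    # patch 1001-1500 is sequence with no primary equivalent.
    blocks = [(1000, 1999, 1), (2500, 2999, 1501)]
    model_a = project_model([(1100, 1200), (1800, 1900)], blocks)
    model_b = project_model([(1950, 2050)], blocks)  # crosses a block edge
    print(model_a, model_b)
    span_a = (model_a[0][0], model_a[-1][1])
    print(uncovered(2000, [span_a]))
```

A model that crosses a block edge, like `model_b` here, is not projected in this sketch, which mirrors the case on the earlier slide of a gene lying across the patch boundary.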
28. Ongoing challenges
• How strict should we be when aligning proteins and cDNAs to the genome?
1. Genome assembly
• Sequencing error (inversion, artificial duplication)
• Assembly incomplete
• Alignments must allow for truncated matches
2. Population variation
• Linear genome is made from ‘one’ individual, whereas protein databases contain data from many unknown individuals
• Paralogues, gene families, pseudogenes
3. Public databases e.g. UniProt
• Include suspect data and are incomplete for many species
• When there’s a match, or no match, is it biologically real?
• Aligning proteins from other species must allow for mismatches
Trade-off: specificity vs. sensitivity
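The effect of alignment strictness on this trade-off can be shown with a toy calculation. Everything below is an assumption made for illustration: the thresholds, the identity and coverage values, and especially the `is_real` labels, which in practice are unknown. Sensitivity is computed as TP / (TP + FN) and specificity as TN / (TN + FP).

```python
def evaluate(alignments, min_identity, min_coverage):
    """Return (sensitivity, specificity) for one pair of thresholds.

    `alignments` holds (percent_identity, percent_coverage, is_real) tuples,
    where is_real is a made-up label saying whether the match is biological.
    """
    tp = fp = tn = fn = 0
    for identity, coverage, is_real in alignments:
        accepted = identity >= min_identity and coverage >= min_coverage
        if accepted and is_real:
            tp += 1
        elif accepted:
            fp += 1
        elif is_real:
            fn += 1
        else:
            tn += 1
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return sensitivity, specificity


if __name__ == "__main__":
    # Toy data, invented for this example only.
    toy = [
        (99, 98, True), (97, 95, True), (92, 80, True), (85, 60, True),
        (96, 91, False), (88, 70, False), (80, 50, False), (75, 40, False),
    ]
    print("strict :", evaluate(toy, min_identity=97, min_coverage=95))
    print("lenient:", evaluate(toy, min_identity=85, min_coverage=60))
```

With these invented values, the strict thresholds accept no false matches but miss half of the real ones, while the lenient thresholds recover every real match at the cost of accepting false ones. That is the choice the genebuild makes when it aims for high specificity.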
31. Reporting data to users
Visualisation and Data querying:
• When browsing the primary assembly, how do we make it obvious to users when alternate sequences are available?
• How do we show when the alternate genomic sequences are identical or differ from one another?
• How do we show whether the alternate genome sequences result in identical or different transcribed / translated products?
• How do we make a qualitative call about which allele is “better” to use? e.g. ABO
• Data download options
• Concept of a ‘canonical’ transcript per gene (per tissue)
Data analysis:
• Linking between alternate alleles (and paralogues?)
• How do we show when data have been mapped from an old to a new assembly, compared to freshly aligned to a new assembly? When is it right to map instead of align?
• In a non-linear genome model, how will SNPs (rsIDs) work?
• In a non-linear genome model, what coordinate system should be used?