Improvements in the Tomato Reference Genome (SL3.0) and Annotation (ITAG3.0)
1. Improvements in the Tomato Reference Genome (SL3.0) and Annotation (ITAG3.0)
Prashant S Hosmani, Surya Saha, Mirella Flores, Stephane Rombauts, Florian Maumus, Henri van de Geest, Gabino Sanchez-Perez and Lukas Mueller
Boyce Thompson Institute, Ithaca, NY
VIB Department of Plant Systems Biology, Ghent University, Gent, Belgium
URGI, INRA, Université Paris-Saclay, Versailles, France
Wageningen Plant Research, Wageningen University, Netherlands
psh65@cornell.edu
8. Structural annotation pipeline
Repeat masking of the genome
Evidence: RNA and protein
ITAG 2.40 gene models
Post-processing
• Genes with functional domain support
• Assign Solyc-ID to novel genes
9. Repeat identification and masking the genome
• Generated custom repeat library (RepeatModeler)
• Excluded repeats with similarity to known proteins in SwissProt (ProtExcluder)
• Masked 56.39% of the genome (RepeatMasker)
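A masked percentage like the one above can be checked directly from a masked FASTA file. The sketch below is a minimal illustration, assuming the genome was soft-masked (repeats written in lowercase); the function name and the toy sequence are hypothetical and not part of the ITAG pipeline.

```python
def masked_fraction(fasta_lines):
    """Return the fraction of bases written in lowercase (soft-masked)."""
    total = 0
    masked = 0
    for line in fasta_lines:
        line = line.strip()
        # Skip empty lines and FASTA headers.
        if not line or line.startswith(">"):
            continue
        total += len(line)
        masked += sum(1 for base in line if base.islower())
    return masked / total if total else 0.0


# Toy input (hypothetical): 4 lowercase bases out of 20.
toy = [">chr_toy", "ACGTacgtNN", "ACGTACGTAC"]
print(f"{masked_fraction(toy):.2%}")
```

For a real assembly the same function can be fed an open file handle, since it only iterates over lines.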
10. Repeat identification and classification
Extensive identification and classification of repeats using
REPET, which masks 61% of the SL3.0 reference
genome.
Florian Maumus
12. Expression evidence for annotation
Expression data evidence
• 8 billion RNAseq reads
• Tissue and treatment specific RNAseq
• 5’ and 3’ UTR enriched RNAseq
• RENseq for NBS-LRR genes
• PacBio Iso-Seq data
• SwissProt plant proteins
Reads were mapped to SL3.0 and a transcriptome was assembled
Overall mapping rate: ~85%
13. RNAseq data sources
• Jim Giovannoni (BTI/USDA)
• Jocelyn Rose (Cornell)
• Greg Martin (BTI)
• Zhangjun Fei (BTI/USDA)
• Jonathan Jones (The Sainsbury Laboratory)
• Asaph Aharoni (Weizmann Institute of Science)
• Neelima Sinha (University of California, Davis)
14. MAKER pipeline
Ab-initio gene prediction methods
• Augustus (Training using BRAKER1)
• SNAP (MAKER based training)
• GeneMark (with high quality genes)
• EuGene (Stephane Rombauts)
Updating legacy annotation (ITAG2.40)
Post-processing
Added novel genes only when they had functional domain support (Pfam): ~800 genes.
Removed genes with 70% overlap with repeats (674 genes).
Assigned Solyc IDs to novel genes following the ITAG convention: each novel gene receives an ID numbered between existing Solyc IDs.
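The repeat-overlap filter can be sketched as follows. This is only an illustration under stated assumptions: overlap is measured over the whole gene span with 1-based inclusive coordinates, genes at or above the 70% threshold are dropped, and the gene names and coordinates are hypothetical. The slides do not say whether the actual pipeline measured overlap on gene spans or on exons.

```python
def overlap_fraction(gene, repeats):
    """Fraction of a gene span (start, end) covered by repeat intervals."""
    g_start, g_end = gene
    covered = 0
    cursor = g_start  # first gene base that has not been counted yet
    for r_start, r_end in sorted(repeats):
        start = max(r_start, cursor)
        end = min(r_end, g_end)
        if start <= end:
            covered += end - start + 1
            cursor = end + 1  # avoids double counting overlapping repeats
    return covered / (g_end - g_start + 1)


def filter_genes(genes, repeats, threshold=0.70):
    """Keep genes whose repeat overlap is below the threshold."""
    return {
        name: span
        for name, span in genes.items()
        if overlap_fraction(span, repeats) < threshold
    }


# Hypothetical coordinates on a single chromosome.
genes = {"geneA": (100, 199), "geneB": (300, 399)}
repeats = [(90, 120), (110, 150), (190, 250), (300, 380)]
kept = filter_genes(genes, repeats)
print(sorted(kept))
```

Here geneA is 61% covered and is kept, while geneB is 81% covered and is removed.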
15. Improvements in ITAG 3.0 compared with ITAG 2.40
                    ITAG 2.40    ITAG 3.0
# of genes          34,725       34,769
Avg. gene length    1,209 bp     1,529 bp
Exons per gene      4.61         5.10
5’ UTR per gene     0.39         0.63
3’ UTR per gene     0.44         0.62
Novel genes in ITAG3.0: 5,822
16. Gene structure improvement example
[Figure: two panels comparing ITAG3.0 and ITAG2.40 gene models against RNAseq XY plots. Panel titles: "Correct fusion example" and "UTR example".]
17. Quality check - Annotation Edit Distance (AED)
AED = 0: complete support
AED = 1: lack of support
[Figure: plot with an AED axis]
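As commonly defined for MAKER, AED is 1 minus the average of base-level sensitivity and specificity between a gene model and its aligned evidence. The sketch below illustrates that definition on plain coordinate intervals; the function names and the toy intervals are hypothetical, and this is not MAKER's own implementation.

```python
def bases(intervals):
    """Expand 1-based inclusive (start, end) intervals into a set of positions."""
    covered = set()
    for start, end in intervals:
        covered.update(range(start, end + 1))
    return covered


def aed(prediction, evidence):
    """Annotation Edit Distance between predicted and evidence intervals."""
    pred, evid = bases(prediction), bases(evidence)
    if not pred or not evid:
        return 1.0  # no overlap is possible, so no support
    shared = len(pred & evid)
    sensitivity = shared / len(evid)  # share of evidence bases in the model
    specificity = shared / len(pred)  # share of model bases backed by evidence
    return 1.0 - (sensitivity + specificity) / 2.0


print(aed([(1, 100)], [(1, 100)]))     # identical intervals
print(aed([(1, 100)], [(51, 150)]))    # half of each interval is shared
print(aed([(1, 100)], [(201, 300)]))   # no shared bases
```

Using sets of positions keeps the sketch short; for genome-scale data an interval-based computation would be more memory efficient.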
18. Functional annotation
Automated Assignment of Human Readable Descriptions (AHRD)
SwissProt plant protein database
TrEMBL plant protein database
Araport 11 (latest Arabidopsis annotation)
User-curated locus information from solgenomics.net (2000+)
Unknown proteins
In ITAG 3.0, 409 genes have the functional description “Unknown protein”, compared to 7,689 in ITAG 2.40
19. Functional annotation
Automated Assignment of Human Readable Descriptions (AHRD)
AHRD version 3.3.2
Quality score (***)
Solyc08g081780.1.1    Dirigent protein (***)
Solyc01g008960.2.1    Argonaute family protein (***)
Solyc01g013880.1.1    Leucine-rich repeat receptor-like protein kinase family protein (*-*)
Position    Criteria
1           Bit score of the BLAST result is >50 and e-value is <1e-10
2           Alignment of the BLAST result is >60%
3           Human Readable Description score is >0.5
“AHRD’s quality-code consists of a three character string, where each character is either ‘*’ if the respective criteria is met or ‘-’ otherwise.”
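The three criteria above translate into a quality code as in the sketch below. It is an illustration of the rule as stated on the slide, not AHRD's own code; the function name, parameter names, and input values are hypothetical, and the alignment criterion is assumed to be a percentage.

```python
def quality_code(bit_score, e_value, alignment_pct, description_score):
    """Build the three-character quality code from the slide's criteria."""
    checks = [
        bit_score > 50 and e_value < 1e-10,  # position 1
        alignment_pct > 60,                  # position 2
        description_score > 0.5,             # position 3
    ]
    return "".join("*" if ok else "-" for ok in checks)


# Hypothetical BLAST hits.
print(quality_code(120, 1e-30, 85, 0.8))  # prints ***
print(quality_code(120, 1e-30, 40, 0.8))  # prints *-*
```

A code such as `*-*` therefore means a strong hit with a good description score whose alignment covers 60% or less.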
21. Future work
Genome
Improving the genome assembly by sequencing with PacBio technology
Annotation
tRNA, non-coding RNA annotation
Multiple isoforms
Co-expression network based functional annotation
22. Workshop: SGN and RTB Databases
Tuesday, Jan 17 10:30 AM
Posters
Surya Saha: Improved Tomato Genome Reference (SL3.0) using Full-Length BACs, BioNano Optical Maps and SGN Community Resources (P0798)
Prashant Hosmani: ITAG3.0 Annotation for the New Tomato Reference Genome SL3.0 (P0797)
26. Repeat classification
SGN Workshop, SOL 2016
Class                       Type                Length (bp)
LTR retrotransposon         Copia                64,840,935
LTR retrotransposon         Gypsy               260,719,161
LTR retrotransposon         TRIM/LARD               671,571
Non-LTR retrotransposon     LINE                  9,871,924
Putative_retrotransposon    Putative_RT             528,982
DNA                         DNA                  20,712,725
Helitron                    Helitron              1,210,271
TIR                         TIR                  12,144,035
Confused                    Confused             48,373,586
Unclassified                Unclassified         70,850,157
Endogenous virus            Endogenous virus      5,839,457
Hostgene                    Hostgene              5,044,454
Tandem repeats              Tandem repeats        8,901,715
Ns
SUM repeats                                     509,708,973
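The per-type values can be cross-checked against the reported sum, and each type's share of the total repeat content follows directly. The sketch below assumes the values are base pairs and uses the label printed next to each number on the slide; the variable names are hypothetical.

```python
# Values copied from the slide; assumed to be base pairs.
repeat_bp = {
    "Copia": 64_840_935,
    "Gypsy": 260_719_161,
    "TRIM/LARD": 671_571,
    "LINE": 9_871_924,
    "Putative_RT": 528_982,
    "DNA": 20_712_725,
    "Helitron": 1_210_271,
    "TIR": 12_144_035,
    "Confused": 48_373_586,
    "Unclassified": 70_850_157,
    "Endogenous virus": 5_839_457,
    "Hostgene": 5_044_454,
    "Tandem repeats": 8_901_715,
}
reported_sum = 509_708_973

total = sum(repeat_bp.values())
print("matches reported sum:", total == reported_sum)

# Share of each type in the total repeat content, largest first.
for name, bp in sorted(repeat_bp.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:18s} {bp / total:6.2%}")
```

The listed values add up exactly to the reported sum of 509,708,973.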
27. Mapping rates for different RNAseq data
RNAseq data    # of reads (millions)    REPET light    RepeatModeler light
AC_Jim         637                      86.87%         88.03%
epigenome      82                       60.77%         64.35%
UTR seq        87                       85.88%         86.57%
TEA part A     4,295                    84.41%         84.39%
TEA part B     2,449                    84.40%         84.71%
RENseq         15                       32.91%         39.83%
Yang           331                      79.94%         80.28%
Total reads    7,930
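An overall mapping rate for each masking strategy can be estimated as a read-weighted average of the per-dataset rates. The sketch below uses the numbers from the table; the variable and function names are hypothetical. Note that the listed read counts add up to 7,896 million, slightly below the 7,930 given as the total, so the weights here are based on the per-dataset counts.

```python
# (reads in millions, REPET light %, RepeatModeler light %) per dataset
datasets = {
    "AC_Jim": (637, 86.87, 88.03),
    "epigenome": (82, 60.77, 64.35),
    "UTR seq": (87, 85.88, 86.57),
    "TEA part A": (4295, 84.41, 84.39),
    "TEA part B": (2449, 84.40, 84.71),
    "RENseq": (15, 32.91, 39.83),
    "Yang": (331, 79.94, 80.28),
}


def weighted_rate(rows, column):
    """Read-weighted mean of the mapping rate stored in the given column."""
    total_reads = sum(row[0] for row in rows)
    return sum(row[0] * row[column] for row in rows) / total_reads


rows = list(datasets.values())
repet_rate = weighted_rate(rows, 1)
repeatmodeler_rate = weighted_rate(rows, 2)
print(f"REPET light:         {repet_rate:.2f}%")
print(f"RepeatModeler light: {repeatmodeler_rate:.2f}%")
```

Both weighted averages land close to 84%, dominated by the two large TEA datasets.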