2. RefSeq Curators
Shashi Pujar
Eric Cox
Catherine Farrell
Tamara Goldfarb
John Jackson
Vinita Joardar
Kelly McGarvey
Michael Murphy
Nuala O’Leary
Bhanu Rajput
Sanjida Rangwala
Lillian Riddick
David Webb
Terence Murphy, RefSeq Team Lead
RefSeq Developers
Alex Astashyn
Olga Ermolaeva
Vamsi Kodali
Craig Wallin
Adam Frankish, Manual Genome Annotation Coordinator
Fiona Cunningham, Variation Annotation Team Lead
Ensembl HAVANA/LRG curators
Jane Loveland
Joannella Morales
Ruth Bennett
Andrew Berry
Claire Davidson
Laurent Gil
Jose Manuel Gonzalez
Matt Hardy
Mike Kay
Aoife McMahon
Marie-Marthe Suner
Glen Threadgold
This research was supported by the
Intramural Research Program of the
NIH, National Library of Medicine.
NCBI RefSeq
3. NCBI RefSeq vs. Ensembl/GENCODE
NCBI’s RefSeq:
• NM/NR: manually annotated set
• Only includes full-length transcripts
• XM/XR: automatically produced
• Predict full-length from partial data
• Transcripts don’t necessarily match the genome assembly:
• represent a prevalent, 'standard' allele
• Independent of reference assembly changes
• Clinical annotation predominantly done using a RefSeq transcript or a subset of NMs
Ensembl/GENCODE:
• ENS ID: More manually-reviewed transcripts
• Includes partial transcripts
• More transcripts for non-coding genes
• On average more transcripts per gene
• Must match reference genome
• Reference set for gnomAD/ExAC, GTEx, Decipher, 100,000 Genomes Project, COSMIC, ICGC
4. A core set of annotation matches*
[Figure: overlap between RefSeq and GENCODE transcript sets]
• Different UTR(s): 1k
• Different end(s): 31k
• Identical: 5k
• Other NM/NR: 20k
• RefSeq models: 72k
• Other GENCODE basic: 20k
• GENCODE comprehensive: 62k
• GENCODE comprehensive partials: 32k
• CCDS
(97% of HGNC-named protein-coding genes)
* GRCh38 primary assembly, HGNC-named protein-coding loci; RefSeq AR109 vs. Ensembl 94
5. But most have some differences
[Figure labels: RefSeq, Ensembl]
• Often subtle
• RefSeq mismatches require special mapping logic
• Differences complicate data exchange, especially for clinical reporting
• “Can we match for at least one representative transcript for each gene?”
6. Why define a representative transcript?
• Preferred substrate for clinical reporting
• Useful for comparative / evolutionary genomics
• Standardize default across resources
• LRG, VEP, gnomAD, COSMIC, UCSC, UniProt, others all have their own defaults
• Help make a better choice than “I just use the longest/first one”
7. Matched Annotation from the NCBI and EMBL-EBI (MANE)
• Set of 100% identical RefSeq & Ensembl transcripts
• Scope: at least one transcript for all protein-coding genes
• Match GRCh38, identical 5’ and 3’ ends, all splice sites, CDS
• Three tiers:
• MANE Select – one per gene, representative of biology at each locus
• Well-supported, expressed, conserved
• MANE Plus – alternate transcripts to capture key aspects of gene structure
• MANE Extended – additional transcripts that match
• Both RefSeq & Ensembl will have additional unmatched transcripts
• Fairly stable, but will allow updates when necessary
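The match criteria above (identical 5’ and 3’ ends, all splice sites, and CDS on GRCh38) can be sketched as a plain coordinate comparison. This is a minimal illustration, not the NCBI or Ensembl pipeline: the `Transcript` class, its fields, and the example coordinates are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Transcript:
    """Hypothetical minimal transcript model (1-based, inclusive coordinates)."""
    chrom: str
    strand: str                           # "+" or "-"
    exons: Tuple[Tuple[int, int], ...]    # sorted by genomic start
    cds_start: int                        # lowest genomic CDS coordinate
    cds_end: int                          # highest genomic CDS coordinate


def introns(t: Transcript) -> Tuple[Tuple[int, int], ...]:
    """Intron intervals; comparing them compares every splice site."""
    return tuple((left_end + 1, right_start - 1)
                 for (_, left_end), (right_start, _) in zip(t.exons, t.exons[1:]))


def ends(t: Transcript) -> Tuple[int, int]:
    """Return (5' end, 3' end) in genomic coordinates, strand-aware."""
    lo, hi = t.exons[0][0], t.exons[-1][1]
    return (lo, hi) if t.strand == "+" else (hi, lo)


def differences(a: Transcript, b: Transcript) -> List[str]:
    """List the criteria on which two transcripts differ; empty means a full match."""
    if (a.chrom, a.strand) != (b.chrom, b.strand):
        return ["locus"]
    diffs = []
    if introns(a) != introns(b):
        diffs.append("splice sites")
    if (a.cds_start, a.cds_end) != (b.cds_start, b.cds_end):
        diffs.append("CDS")
    if ends(a)[0] != ends(b)[0]:
        diffs.append("5' end")
    if ends(a)[1] != ends(b)[1]:
        diffs.append("3' end")
    return diffs


# Made-up example: same splice sites and CDS, different 5' end.
refseq_tx = Transcript("chr1", "+", ((100, 200), (300, 400)), 150, 350)
ensembl_tx = Transcript("chr1", "+", ((90, 200), (300, 400)), 150, 350)
print(differences(refseq_tx, ensembl_tx))
```

In this sketch a pair would qualify as matched only when `differences` returns an empty list. Sequence-level differences between a RefSeq transcript and the genome are not modeled here.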
8. Methodology
• How to pick a Select transcript
• How to match ends
• Opportunities to improve both RefSeq & Ensembl/GENCODE
9. Choosing a Select transcript
• Ensembl Pipeline
• Length
• Expression
• Conservation (APPRIS)
• Representation in UniProt and RefSeq
• Coverage of pathogenic variants
• RefSeq Select Pipeline
• Conservation (PhyloCSF)
• Expression
• CAGE
• Representation in UniProt and Ensembl
• Length
• Prior manual curation (LRG)
[Pie chart]
• RefSeq:Ensembl:SP, 13644
• RefSeq:Ensembl CDS match, 4569
• other, 1219
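The slide lists the evidence each pipeline considers but not how it is combined. As a rough sketch only, the RefSeq Select criteria could be applied as a lexicographic ranking in the order listed. That ordering, the `Candidate` fields, and the example values are assumptions made for illustration and do not describe either pipeline.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    """Hypothetical per-transcript evidence summary."""
    tx_id: str
    conserved: bool        # e.g. supported by a conservation signal
    expression: float      # arbitrary expression units
    cage_support: bool     # 5' end supported by CAGE
    in_uniprot: bool       # protein represented in UniProt
    length: int            # transcript length in nt
    prior_curation: bool   # previously chosen by manual curation


def rank_key(c: Candidate):
    # Assumed priority: the order in which the criteria appear on the slide.
    return (c.conserved, c.expression, c.cage_support,
            c.in_uniprot, c.length, c.prior_curation)


def pick_select(candidates: List[Candidate]) -> Candidate:
    """Return the top-ranked candidate under the assumed ordering."""
    return max(candidates, key=rank_key)


candidates = [
    Candidate("tx_A", True, 5.0, True, True, 2100, False),
    Candidate("tx_B", True, 9.5, True, False, 3400, False),
    Candidate("tx_C", False, 20.0, False, False, 4000, False),
]
print(pick_select(candidates).tx_id)                     # tx_B
print(max(candidates, key=lambda c: c.length).tx_id)     # tx_C
```

Here the evidence-based choice (`tx_B`) differs from the longest transcript (`tx_C`), which is the situation a defined representative transcript is meant to handle.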
10. Define 5’ ends from FANTOM CAGE data
• Deep sequencing dataset of 5’ ends
• Integrate data to pick 5’-most strong site (not always the absolute peak)
[Figure: KNG1, with tracks labeled Ensembl, RefSeq, CAGE, Transcripts, RNAseq]
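One way to read "5’-most strong site (not always the absolute peak)" is: keep the CAGE peaks whose signal is close enough to the maximum, then take the most upstream of those. The sketch below assumes that reading. The function name, the `min_fraction` threshold, and the example peaks are hypothetical, and the slide does not say how "strong" is defined.

```python
from typing import List, Optional, Tuple


def pick_five_prime_site(peaks: List[Tuple[int, float]],
                         strand: str,
                         min_fraction: float = 0.5) -> Optional[int]:
    """Pick the most upstream peak whose signal is at least
    min_fraction of the strongest peak.

    peaks: (genomic_position, signal) pairs for one gene.
    strand: "+" or "-"; upstream means lower coordinates on "+",
            higher coordinates on "-".
    """
    if not peaks:
        return None
    top = max(signal for _, signal in peaks)
    strong = [pos for pos, signal in peaks if signal >= min_fraction * top]
    return min(strong) if strand == "+" else max(strong)


# Made-up peaks: the absolute peak is at 1100, but 1040 is also strong
# and lies further upstream on the plus strand.
peaks = [(1000, 12.0), (1040, 80.0), (1100, 100.0)]
print(pick_five_prime_site(peaks, "+"))   # 1040
print(pick_five_prime_site(peaks, "-"))   # 1100
```

The weak peak at 1000 is ignored, which keeps a low-level upstream signal from extending the transcript.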
12. Define 3’ ends from polyA sequencing
• Long and short read data to define maximum 3’ UTRs
• Integrating multiple datasets to define sites within clusters (polyA_DB, PolyAsite, +more)
72% of select transcripts matched to polyA data
[Pie chart]
• polyA cluster, no extension, 10968
• polyA cluster, possible extension, 3023
• other extensions, 646
• no polyA, 3576
• no match, 1219
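The 72% figure can be reproduced from the chart counts, assuming "matched to polyA data" means the two "polyA cluster" categories together. That grouping is an interpretation; the slide does not state it.

```python
# Counts as given on the slide.
counts = {
    "polyA cluster, no extension": 10968,
    "polyA cluster, possible extension": 3023,
    "other extensions": 646,
    "no polyA": 3576,
    "no match": 1219,
}

total = sum(counts.values())
# Assumption: "matched to polyA data" = both polyA cluster categories.
matched = (counts["polyA cluster, no extension"]
           + counts["polyA cluster, possible extension"])
fraction = matched / total

print(f"{matched} / {total} = {fraction:.0%}")   # 13991 / 19432 = 72%
```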
15. Deliverables
• Annotation files and tracks in genome browsers
• Synonymous RefSeq & Ensembl IDs
• Reciprocal markup in NCBI and EMBL-EBI resources
16. Timelines
• Dec 2018: alpha dataset available, one Matched Select transcript for 50% of coding genes
• Bulk RefSeq transcript updates starting in next few months
• In browsers Spring 2019
• 2019: select and match transcripts for 90% of coding genes
• Emphasis on clinically-relevant loci
17. We want to hear from you!
• NCBI booth: #315
• Find us at this meeting: Terence Murphy, Adam Frankish, Jane Loveland, Joannella Morales
• E-mail: refseq-support@nlm.nih.gov, gencode-help@ebi.ac.uk