Locus Development is building a clinical genome interpretation service by curating genotypes, phenotypes, and risk models from multiple sources. Their pipeline involves variant calling from sequencing data, selecting clinically relevant variants, annotating variants and samples with attributes, and generating clinical reports. However, reconciling genomic variants with transcript information is challenging due to differences between reference genomes and RefSeq transcripts, with up to 18% of RefSeq transcripts differing from the reference genome.
Building a clinical genome interpretation services company
1. Building a clinical genome interpretation services company
Reece Hart, Ph.D.
reece@locusdev.net
Locus Development Inc.
http://locusdevelopmentinc.com/
Reece Hart — Locus Development 1/28
3. Clinical Genome Interpretation
➢ Patient presents with symptoms.
➢ If genomic interpretation might influence diagnosis or treatment, doctor refers patient to a genetic counselor (GC).
➢ GC takes history; sample is sent to an internal lab or to one of hundreds of labs that provide specific genomic tests.
➢ Sequencing and other lab data are processed into a preliminary interpretation.
➢ Report is returned to GC and/or physician, who verify the interpretation and consult with the patient.
photos:
Baylor College of Medicine, Univ. Utah, learningradiology.com, sciencephotos.com
4. 100s of laboratory diagnostic testing labs
7. The Significance of “Variants of Uncertain Significance”
“VUS – Variant of uncertain significance. A variation in a
genetic sequence whose association with disease risk is
unknown. Also called variant of uncertain significance,
variant of unknown significance, and unclassified variant.”
http://www.cancer.gov/cancertopics/genetics-terms-alphalist
8. The long tail of rare diseases.
“A rare disease typically affects a patient
population estimated at fewer than 200,000 in
the U.S. There are more than 6,000 rare
diseases known today and they affect an
estimated 25 million persons in the U.S.”
NIH Office of Rare Diseases Research
http://rarediseases.info.nih.gov/
9. The Problems to Solve
➢ Develop a reliable database of genotypes
and phenotypes.
➢ Develop methods to interpret all types of
variants, not just common SNVs.
➢ Provide meaningful, reliable interpretations
based on genomic data.
➢ Do it better than everyone else.
19. The big lesson…
Transcripts are much
messier than expected.
20. Problem statement
There is no single source of transcripts that are all of: stable (archived), mapped to the genome, in agreement with the reference genome, and identified by RefSeq accessions.
➢ Issues:
● Poor access / programmability
● No archived mappings
● RefSeq != reference genome due to origin,
ambiguity, error
● Patches are difficult to use
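One way to picture the "no archived mappings" issue is a write-once store in which every alignment is keyed by versioned accessions and the alignment method, so that an old mapping can still be retrieved after the source updates. The sketch below is only an illustration under that assumption; `MappingKey`, `MappingArchive`, and the CIGAR strings are hypothetical names and values, not part of any NCBI, UCSC, or Ensembl interface.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class MappingKey:
    """Hypothetical key: versioned accessions plus the alignment method."""
    transcript_ac: str  # versioned transcript accession, e.g. "NM_0123.4"
    reference_ac: str   # versioned reference accession, e.g. "NC_000006.11"
    method: str         # alignment method, e.g. "splign" or "blat"


class MappingArchive:
    """Write-once archive: a stored mapping can be read but never changed."""

    def __init__(self) -> None:
        self._store: Dict[MappingKey, str] = {}

    def add(self, key: MappingKey, cigar: str) -> None:
        # Re-adding an identical mapping is harmless; a conflicting one is an error.
        if key in self._store and self._store[key] != cigar:
            raise ValueError(f"mapping for {key} is already archived")
        self._store[key] = cigar

    def get(self, key: MappingKey) -> str:
        return self._store[key]


archive = MappingArchive()
key = MappingKey("NM_0123.4", "NC_000006.11", "splign")
archive.add(key, "100=5000N50=")  # illustrative CIGAR, not real data
```

Because the key includes both versions and the method, a new transcript version or a remapping with a different tool adds a new entry rather than overwriting the old one.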
21. When RefSeq != Genome Reference
➢ Mismatch:
● variant published relative to RefSeq: NM_0123.4:c.45C>T
● genomic coordinate: NC_000006.11:g.31030103C>T
➢ Ins/del (A vs. -):
● discovered variant reported relative to RefSeq: NM_0123.4:c.832T>G
● genomic coordinate: NC_000006.11:g.31038124T>G
● downstream coordinates shifted
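The coordinate shift caused by an ins/del can be sketched with gapless alignment blocks. This is a minimal illustration, assuming a forward-strand transcript, 0-based positions, and no introns; the `Block` type, the `tx_to_genome` function, and all coordinates are made up for the example and are not taken from the slide.

```python
from typing import List, NamedTuple, Optional


class Block(NamedTuple):
    """A gapless aligned segment (hypothetical representation)."""
    tx_start: int  # 0-based start in the transcript
    g_start: int   # 0-based start in the genome
    length: int    # number of aligned bases


def tx_to_genome(pos: int, blocks: List[Block]) -> Optional[int]:
    """Map a transcript position to the genome through the alignment blocks."""
    for b in blocks:
        if b.tx_start <= pos < b.tx_start + b.length:
            return b.g_start + (pos - b.tx_start)
    return None  # base exists only in the transcript (an insertion)


# The transcript carries one extra base at position 100 relative to the genome.
blocks = [Block(0, 1000, 100), Block(101, 1100, 50)]

naive = 101 + 1000                  # constant offset, ignores the ins/del
aligned = tx_to_genome(101, blocks)  # follows the alignment
print(naive, aligned)
```

Every position downstream of the inserted base is off by one under the constant-offset assumption, which is the shift the slide refers to.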
22. 17.8% of RefSeq transcripts differ from GRCh37
5.4% have coordinate-changing differences
Garla, V., Kong, Y., Szpakowski, S., & Krauthammer, M. (2011). MU2A--reconciling the genome and transcriptome to determine the effects of base substitutions. Bioinformatics, 27(3), 416-418. doi:10.1093/bioinformatics/btq658
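The distinction between transcripts that merely differ from the genome and those with coordinate-changing differences can be illustrated by classifying transcript-to-genome alignments. The sketch below assumes each alignment is available as an extended CIGAR string (`=` match, `X` mismatch, `I`/`D` ins/del, `N` intron); the function names and toy inputs are hypothetical, and this is not the method used by Garla et al.

```python
import re
from collections import Counter
from typing import Dict, Iterable

_CIGAR = re.compile(r"(?:\d+[=XIDN])+")
_CIGAR_OP = re.compile(r"(\d+)([=XIDN])")


def classify(cigar: str) -> str:
    """Label one transcript-to-genome alignment by its worst difference."""
    if not _CIGAR.fullmatch(cigar):
        raise ValueError(f"unsupported CIGAR string: {cigar!r}")
    ops = {op for _, op in _CIGAR_OP.findall(cigar)}
    if ops & {"I", "D"}:
        return "coordinate-changing"  # ins/del shifts downstream positions
    if "X" in ops:
        return "substitution-only"    # bases differ, coordinates unaffected
    return "identical"


def summarize(cigars: Iterable[str]) -> Dict[str, float]:
    """Fraction of alignments in each class."""
    counts = Counter(classify(c) for c in cigars)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


# Toy input only; these strings are not real RefSeq alignments.
toy = ["100=5000N50=", "40=1X59=", "100=1I49=", "10="]
print(summarize(toy))
```

Introns (`N`) are treated as expected structure rather than as differences, since they do not alter the transcript sequence itself.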
23. Sources of transcript information
➢ NCBI:
● maps current transcripts to current genome only
● maps with splign
● ~18% of transcripts don't agree with the reference genome
● no local database option
➢ UCSC:
● current transcripts only
● maps using blat
➢ Ensembl:
● aligns using in-house gene building process
● cross-linked to refseqs
● incorporates NCBI transcripts ad hoc
● well-maintained; good API; broad data; VEP
26. RefSeq Handling
---------- Forwarded message ----------
Date: Wed, Jan 25, 2012 at 1:59 PM
Subject: [Genome] How does UCSC hg19 gene model add exons to
RefSeqs?
To: genome@soe.ucsc.edu
Hi, when using the human reference hg19 gene model
…
where the hg19 model has an exon that does not exist in
the RefSeq accession (or any historical version of the
RefSeq accession).
How/why does the alignment introduce an intron in this case?
Does it ensure there are plausible flanking splice junctions
before inserting an intron to a RefSeq sequence that lacks
it but it maps to?
27. 338 genes so far
➢ We should encourage LRG and adopt it when ready
(and we'll still have to deal with legacy transcripts)