Building a clinical genome interpretation services company


Reece Hart, Ph.D.
reece@locusdev.net

Locus Development Inc.
http://locusdevelopmentinc.com/

Reece Hart — Locus Development 1/28

Opportunity


Clinical Genome Interpretation

Patient presents with
symptoms

If genomic interpretation might influence
diagnosis or treatment, doctor refers
patient to genetic counselor

GC takes history; sample is
sent to an internal lab or to one of
hundreds of labs that provide
specific genomic tests

Sequencing and other lab
data are processed into
preliminary interpretation

Report is returned to GC
and/or physician, who
verify the interpretation and
consult with the patient
photos:
Baylor College of Medicine, Univ. Utah, learningradiology.com, sciencephotos.com

Hundreds of diagnostic testing laboratories


Common variants are hard to interpret


Some variants are informative


The Significance of
“Variants of Uncertain Significance”

“VUS – Variant of uncertain significance. A variation in a
genetic sequence whose association with disease risk is
unknown. Also called variant of uncertain significance,
variant of unknown significance, and unclassified variant.”
http://www.cancer.gov/cancertopics/genetics-terms-alphalist


The long tail of rare diseases.

“A rare disease typically affects a patient
population estimated at fewer than 200,000 in
the U.S. There are more than 6,000 rare
diseases known today and they affect an
estimated 25 million persons in the U.S.”
NIH Office of Rare Diseases Research
http://rarediseases.info.nih.gov/


The Problems to Solve

➢ Develop a reliable database of genotypes
and phenotypes.
➢ Develop methods to interpret all types of
variants, not just common SNVs.
➢ Provide meaningful, reliable interpretations
based on genomic data.
➢ Do it better than everyone else.


Plan


Company Overview

Genomic Sequence and Variants → Locus → Clinical Interpretation


Curating Genotypes, Phenotypes, and Risk

Genotype-Phenotype Database

Genotypes/Variants: dbSNP, LSDBs, PharmGKB, …
Phenotypes/Conditions: GO, OMIM, ICD-9/10, …

Risk Models


Locus Overview

hospitals/clinics, physicians, insurers

workflow and tracking
sequences → variants/attributes → condition predictions → interpretation

Implementation


Curation Content
➢ Many sources
● automated and manual tools
● databases and literature
➢ Most kinds of variants
● SNV, del, ins, delins, repeat, conv, CNV,
haplotypes
➢ Many kinds of conditions
● inherited, spontaneous, dominant, recessive, x-
linked, preventative, cancer, metabolic,
pharmacogenomic, cardio
➢ Examples:
● Cystic Fibrosis (w/modifiers)
● CMT (~21 subclasses)
● Long and Short QT
● TPMT, warfarin, CYP2D6
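
As a rough illustration of what "most kinds of variants" means for tooling, the sketch below sorts a reference/alternate allele pair into the simplest sequence-level classes from the list above (SNV, del, ins, delins). The function names `trim` and `classify` are hypothetical, and the sketch assumes both alleles are plain nucleotide strings on the same strand. Repeats, conversions, CNVs, and haplotypes need more context than an allele pair and are not handled.

```python
def trim(ref, alt):
    """Strip bases shared by both alleles, suffix first, then prefix."""
    while ref and alt and ref[-1] == alt[-1]:
        ref, alt = ref[:-1], alt[:-1]
    while ref and alt and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
    return ref, alt


def classify(ref, alt):
    """Return a coarse variant class for a ref/alt allele pair."""
    ref, alt = trim(ref.upper(), alt.upper())
    if not ref and not alt:
        return "identity"   # alleles are the same sequence
    if not ref:
        return "ins"        # only new bases remain
    if not alt:
        return "del"        # only removed bases remain
    if len(ref) == 1 and len(alt) == 1:
        return "SNV"        # single-base substitution
    return "delins"         # anything else replaces one run with another


if __name__ == "__main__":
    for ref, alt in [("C", "T"), ("CA", "C"), ("C", "CAG"), ("CAT", "GG")]:
        print(ref, alt, classify(ref, alt))
```

Trimming shared context first means that alleles padded with an anchor base, as in VCF, land in the same class as their minimal form.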

The pipeline

LIMS: req'n and sample info
curation: cond'n, var., risk models

reads (fastq) → variant calling → calls (vcf) → selection → attributes (xml) → interpretation → report (xml)

execution framework
[Figure: overlaid excerpts of the files at each pipeline stage: FASTQ reads, a VCFv4.1 call file against human_g1k_v37, XML for the requisition, sample info, and locus-report, and the Makefile that connects them.]

Makefile rules recoverable from the figure:

variants_and_refagree.vcf filtered_on_callable.vcf: lake.mk reads.fastq
	(set -e; source /locus/opt/lake/bin/lakeSetupEnv; $(MAKE) -f $< $@;) 2>$@.err

calls.vcf: variants_and_refagree.vcf #filtered_on_callable.vcf
	ln -s $< $@

attr.xml: calls.vcf req.xml sample.xml
	generate_attributes_file.py …

report.xml: attr.xml req.xml
	sampleconditionreport $^ $@

report.html: report.xml
	reportrenderer $< -o $@
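
The VCF excerpt in this figure carries per-site annotations in its INFO column, for example `AC=0;AF=0.00;AN=2;DP=137;MQ=213.16;MQ0=0`. A minimal sketch of turning such a string into typed values follows. The function names `parse_info` and `_convert` are hypothetical, and a real pipeline would normally take types from the `##INFO` header lines instead of guessing them as this sketch does.

```python
def _convert(text):
    """Best-effort typing: int, then float, else leave as a string."""
    for cast in (int, float):
        try:
            return cast(text)
        except ValueError:
            pass
    return text


def parse_info(info):
    """Parse a VCF INFO column into a dict.

    Keys without '=' are treated as flags (True); comma-separated
    values become lists.
    """
    fields = {}
    if info in ("", "."):
        return fields
    for item in info.split(";"):
        if not item:
            continue
        key, sep, value = item.partition("=")
        if not sep:
            fields[key] = True
        elif "," in value:
            fields[key] = [_convert(v) for v in value.split(",")]
        else:
            fields[key] = _convert(value)
    return fields


if __name__ == "__main__":
    print(parse_info("AC=0;AF=0.00;AN=2;DP=137;MQ=213.16;MQ0=0"))
```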

The pipeline in action
$ ls reads.fastq.gz req.xml sample.* Makefile
Makefile reads.fastq.gz req.xml sample.info sample.xml

$ time make report.html report.pdf
gzip -cdq <reads.fastq.gz >reads.fastq
lake --recipe reads_to_variants >lake.mk
ln -s variants_and_refagree.vcf calls.vcf

generate_attributes_file.py ...

sampleconditionreport attr.xml req.xml -o report.xml

reportrenderer report.xml -o report.html

wkhtmltopdf report.html report.pdf

real 7m14.804s
user 7m16.490s
sys 2m0.150s
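
The run above works because make resolves a chain of file dependencies. As a sketch of that idea only, the snippet below derives a valid build order with `graphlib` from the Python standard library (3.9+). The target names come from the slides. The prerequisites of `reads.fastq` and `report.pdf` are inferred from the commands in the trace, and `lake.mk` is assumed to have no file prerequisites.

```python
from graphlib import TopologicalSorter

# target -> set of prerequisites
DEPENDENCIES = {
    "reads.fastq": {"reads.fastq.gz"},            # inferred from the gzip step
    "lake.mk": set(),                             # assumption: no file inputs
    "variants_and_refagree.vcf": {"lake.mk", "reads.fastq"},
    "calls.vcf": {"variants_and_refagree.vcf"},
    "attr.xml": {"calls.vcf", "req.xml", "sample.xml"},
    "report.xml": {"attr.xml", "req.xml"},
    "report.html": {"report.xml"},
    "report.pdf": {"report.html"},                # inferred from wkhtmltopdf
}


def build_order(dependencies):
    """Return targets ordered so every prerequisite precedes its dependents."""
    return list(TopologicalSorter(dependencies).static_order())


if __name__ == "__main__":
    for target in build_order(DEPENDENCIES):
        print(target)
```

Unlike make, this sketch does not compare timestamps, so it says nothing about which targets are already up to date.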

Locus Interpretation


The big lesson…

Transcripts are much
messier than expected.


Problem statement
No single source of transcripts is all of:
stable (archived), mapped, in agreement with the reference
genome, and identified by RefSeq accessions.

➢ Issues:
● Poor access / programmability
● No archived mappings
● RefSeq != reference genome due to origin,
ambiguity, error
● Patches are difficult to use


When RefSeq != Genome Reference

variant published relative to RefSeq:
NM_0123.4:c.45C>T
NC_000006.11:g.31030103C>T

discovered variant reported relative to RefSeq:
NC_000006.11:g.31038124T>G
NM_0123.4:c.832T>G

Differences between the RefSeq transcript and the genome:
mismatch (A)
ins/del (-): downstream coordinates shifted
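
To make the coordinate shift concrete, here is a small sketch that maps a transcript offset to a genome offset through an alignment. The function name `transcript_to_genome` and the simplified alignment notation are assumptions for illustration: `M` is an aligned block (matches or mismatches), `I` is bases present only in the transcript, and `D` is bases present only in the genome. Introns and strand are ignored.

```python
def transcript_to_genome(pos, alignment, genome_start=0):
    """Map a 0-based transcript offset to a 0-based genome offset.

    alignment is a list of (op, length) pairs using "M", "I", "D".
    Returns None when the transcript base has no genomic counterpart.
    """
    t = 0               # bases consumed in the transcript
    g = genome_start    # bases consumed in the genome
    for op, length in alignment:
        if op == "M":
            if pos < t + length:
                return g + (pos - t)
            t += length
            g += length
        elif op == "I":
            if pos < t + length:
                return None
            t += length
        elif op == "D":
            g += length
        else:
            raise ValueError(f"unknown alignment op: {op!r}")
    raise IndexError("position lies beyond the alignment")


if __name__ == "__main__":
    identical = [("M", 100)]
    genome_has_extra_base = [("M", 50), ("D", 1), ("M", 49)]
    for pos in (44, 80):
        print(pos,
              transcript_to_genome(pos, identical),
              transcript_to_genome(pos, genome_has_extra_base))
```

A mismatch stays inside an `M` block and leaves every coordinate alone, while a single `I` or `D` moves all positions downstream of it, which is why the two cases on this slide behave so differently.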


17.8% of RefSeq transcripts differ from
GRCh37

5.4% have coordinate-
changing differences

Garla, V., Kong, Y., Szpakowski, S., & Krauthammer, M. (2011).
MU2A--reconciling the genome and transcriptome to determine the effects of base substitutions.
Bioinformatics (Oxford, England), 27(3), 416-8. doi:10.1093/bioinformatics/btq658


Sources of transcript information

➢ NCBI:
● maps current transcripts to the current genome only
● maps with splign
● ~18% of transcripts don't agree with the reference genome
● no local database option
➢ UCSC:
● current transcripts only
● maps using blat
➢ Ensembl:
● aligns using in-house gene building process
● cross-linked to refseqs
● incorporates NCBI transcripts ad hoc
● well-maintained; good API; broad data; VEP


PTEN: insertion/deletion in 5' UTR


NEFL: genome insertion leads to
frameshift/stop


RefSeq Handling
---------- Forwarded message ----------
Date: Wed, Jan 25, 2012 at 1:59 PM
Subject: [Genome] How does UCSC hg19 gene model add exons to
RefSeqs?
To: genome@soe.ucsc.edu

Hi, when using the human reference hg19 gene model
…
where the hg19 model has an exon that does not exist in
the RefSeq accession (or any historical version of the
RefSeq accession).

How/why does the alignment introduce an intron in this case?
Does it ensure there are plausible flanking splice junctions
before inserting an intron to a RefSeq sequence that lacks
it but it maps to?


338 genes so far

➢ We should encourage LRG and adopt it when ready
(and we'll still have to deal with legacy transcripts)

Not pictured: Jon Sorenson

