Building a clinical genome
interpretation services company

Reece Hart, Ph.D.
reece@locusdev.net

Locus Development Inc.
http://locusdevelopmentinc.com/

                             Reece Hart — Locus Development   1/28
Opportunity




Clinical Genome Interpretation


 ➢   Patient presents with symptoms.
 ➢   If genomic interpretation might influence diagnosis or treatment,
     doctor refers patient to genetic counselor (GC).
 ➢   GC takes history; sample is sent to an internal lab or to one of
     hundreds of labs that provide specific genomic tests.
 ➢   Sequencing and other lab data are processed into a preliminary
     interpretation.
 ➢   Report is returned to GC and/or physician, who verify the
     interpretation and consult with patient.

photos:
Baylor College of Medicine, Univ. Utah, learningradiology.com, sciencephotos.com
100s of laboratory diagnostic testing labs




Common variants are hard to interpret




Some variants are informative




The Significance of
“Variants of Uncertain Significance”




  “VUS – Variant of uncertain significance. A variation in a
  genetic sequence whose association with disease risk is
  unknown. Also called variant of uncertain significance,
  variant of unknown significance, and unclassified variant.”
  http://www.cancer.gov/cancertopics/genetics-terms-alphalist


The long tail of rare diseases.



   “A rare disease typically affects a patient
   population estimated at fewer than 200,000 in
   the U.S. There are more than 6,000 rare
   diseases known today and they affect an
   estimated 25 million persons in the U.S.”
   NIH Office of Rare Diseases Research
   http://rarediseases.info.nih.gov/




The Problems to Solve

 ➢   Develop a reliable database of genotypes
     and phenotypes.
 ➢   Develop methods to interpret all types of
     variants, not just common SNVs.
 ➢   Provide meaningful, reliable interpretations
     based on genomic data.
 ➢   Do it better than everyone else.




Plan




Company Overview




   Genomic Sequence and Variants → Locus → Clinical Interpretation


Curating Genotypes, Phenotypes, and Risk



   Genotype-Phenotype Database

 ➢   Genotypes/Variants: dbSNP, LSDBs, PharmGKB, …
 ➢   Phenotypes/Conditions: GO, OMIM, ICD-9/10, …
 ➢   Risk Models




Locus Overview



   hospitals/clinics, physicians, insurers

   workflow and tracking

   sequences → variants/attributes → condition predictions → interpretation

Implementation




Curation Content
 ➢   Many sources
     ●   automated and manual tools
     ●   databases and literature
 ➢   Most kinds of variants
     ●   SNV, del, ins, delins, repeat, conv, CNV,
         haplotypes
 ➢   Many kinds of conditions
     ●   inherited, spontaneous, dominant, recessive, x-linked,
         preventative, cancer, metabolic, pharmacogenomic, cardio
 ➢   Examples:
     ●   Cystic Fibrosis (w/modifiers)
     ●   CMT (~21 subclasses)
     ●   Long and Short QT
     ●   TPMT, warfarin, CYP2D6
The pipeline

   LIMS: req'n and sample info
   curation: cond'n var., risk models

   reads (fastq) → variant calling → calls (vcf) → selection →
   attributes (xml) → interpretation → report (xml)

   execution framework

 [Slide shows overlapping excerpts of the pipeline's files: the Makefile,
 FASTQ reads, VCFv4.1 calls against human_g1k_v37, and the requisition,
 sample, attributes, and report XML. Only the Makefile rules can be
 recovered from the overlap:]

 variants_and_refagree.vcf filtered_on_callable.vcf: lake.mk reads.fastq
     (set -e; source /locus/opt/lake/bin/lakeSetupEnv; $(MAKE) -f $< $@; ) 2>$@.err

 calls.vcf: variants_and_refagree.vcf #filtered_on_callable.vcf
     ln -s $< $@

 attr.xml: calls.vcf req.xml sample.xml
     generate_attributes_file.py … ${ATTR_FILE}

 report.xml: attr.xml req.xml
     sampleconditionreport $^ -o $@

 report.html: report.xml
     reportrenderer $< -o $@

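
The VCF excerpts on this slide carry per-site annotations in the INFO column (for example AC=0;AF=0.00;AN=2;DP=137). As a minimal sketch of how one such data line can be split into its columns, assuming the standard tab-separated VCF 4.1 layout: the function name `parse_vcf_line` is hypothetical, and the example line is assembled from fragments visible on the slide with the QUAL column assumed to be missing ("."), so it is illustrative rather than a verbatim record.

```python
def parse_vcf_line(line):
    """Split one VCF data line into its eight fixed columns plus an INFO dict."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, vid, ref, alt, qual, flt, info = fields[:8]

    info_dict = {}
    for item in info.split(";"):
        if not item or item == ".":
            continue  # "." means the INFO column is empty
        key, sep, value = item.partition("=")
        # Flags such as "DB" have no "=value" part; store them as True.
        info_dict[key] = value if sep else True

    return {
        "chrom": chrom,
        "pos": int(pos),
        "id": vid,
        "ref": ref,
        "alt": alt,
        "qual": qual,
        "filter": flt,
        "info": info_dict,
    }


# Illustrative line; QUAL is assumed to be "." (not legible on the slide).
example = "1\t145414740\t.\tN\t.\t.\tPASS\tAC=0;AF=0.00;AN=2;DP=137;MQ=213.16;MQ0=0"
record = parse_vcf_line(example)
print(record["chrom"], record["pos"], record["filter"], record["info"]["DP"])
```

INFO values are kept as strings here; a real consumer would convert them using the Type declared in the matching ##INFO header line.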
The pipeline in action
  $ ls reads.fastq.gz req.xml sample.* Makefile
  Makefile reads.fastq.gz req.xml sample.info sample.xml

  $ time make report.html report.pdf
  gzip -cdq <reads.fastq.gz >reads.fastq
  lake --recipe reads_to_variants >lake.mk
  ln -s variants_and_refagree.vcf calls.vcf

  generate_attributes_file.py ...

  sampleconditionreport attr.xml req.xml -o report.xml

  reportrenderer report.xml -o report.html

  wkhtmltopdf report.html report.pdf

  real 7m14.804s
  user 7m16.490s
  sys 2m0.150s
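
The order in which make runs these steps follows from each target's prerequisites. A minimal sketch of that ordering, assuming Python 3.9+ for `graphlib`: the `RULES` table is read off the rules and commands shown on the pipeline slides, `lake.mk` is assumed to have no prerequisites because none are shown, and `build_order` is a hypothetical helper, not part of the Locus pipeline.

```python
from graphlib import TopologicalSorter

# target -> prerequisites, as read off the pipeline slides.
RULES = {
    "reads.fastq": {"reads.fastq.gz"},
    "lake.mk": set(),  # assumption: prerequisites are not shown on the slides
    "variants_and_refagree.vcf": {"lake.mk", "reads.fastq"},
    "calls.vcf": {"variants_and_refagree.vcf"},
    "attr.xml": {"calls.vcf", "req.xml", "sample.xml"},
    "report.xml": {"attr.xml", "req.xml"},
    "report.html": {"report.xml"},
    "report.pdf": {"report.html"},
}


def build_order(rules, goal):
    """Return the files needed for `goal`, prerequisites before dependents."""
    needed = {}
    stack = [goal]
    while stack:
        target = stack.pop()
        if target in needed:
            continue
        # Files without a rule (e.g. req.xml) are plain inputs.
        needed[target] = set(rules.get(target, ()))
        stack.extend(needed[target])
    return list(TopologicalSorter(needed).static_order())


order = build_order(RULES, "report.pdf")
print(" -> ".join(order))
```

Several valid orders exist, so the exact sequence printed may differ from make's; only the constraint that every prerequisite precedes its target is guaranteed.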
Locus Interpretation




The big lesson…

 Transcripts are much
messier than expected.




Problem statement
 There is no single source of transcripts that are all of:
 stable (archived), mapped, in agreement with the reference
 genome, and identified by RefSeq accessions.


 ➢   Issues:
     ●   Poor access / programmability
     ●   No archived mappings
     ●   RefSeq != reference genome due to origin,
         ambiguity, error
     ●   Patches are difficult to use




When RefSeq != Genome Reference

 ➢   Variant published relative to RefSeq:
     NM_0123.4:c.45C>T  /  NC_000006.11:g.31030103C>T
 ➢   Discovered variant reported relative to RefSeq:
     NM_0123.4:c.832T>G  /  NC_000006.11:g.31038124T>G
 ➢   Between the two, RefSeq and genome differ by a mismatch (A) and
     an ins/del (-); downstream coordinates are shifted.
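
The coordinate shift on this slide can be illustrated with a small sketch. It walks a gapped transcript-to-genome alignment, written as (op, length) pairs in the style of CIGAR strings, and maps a transcript offset to a genome position. The function name, the op letters, and every coordinate are hypothetical; the sketch assumes a single exon on the plus strand and is not the method used by Locus.

```python
def transcript_to_genome(tx_pos, genome_start, ops):
    """Map a 0-based transcript offset to a genome coordinate.

    ops is a list of (op, length) pairs:
      "M" - aligned bases (match or mismatch), present in both sequences
      "I" - bases present in the transcript but absent from the genome
      "D" - bases present in the genome but absent from the transcript
    Returns None when the transcript base has no genomic counterpart.
    """
    tx = 0            # transcript bases consumed so far
    g = genome_start  # genome coordinate of the next aligned base
    for op, length in ops:
        if op == "M":
            if tx_pos < tx + length:
                return g + (tx_pos - tx)
            tx += length
            g += length
        elif op == "I":
            if tx_pos < tx + length:
                return None
            tx += length
        elif op == "D":
            g += length
        else:
            raise ValueError(f"unknown alignment op: {op!r}")
    raise IndexError("transcript position lies beyond the alignment")


# Hypothetical alignment: 10 aligned bases, 1 genome-only base, 10 aligned bases.
ops = [("M", 10), ("D", 1), ("M", 10)]
for tx_pos in (5, 9, 10, 15):
    naive = 1000 + tx_pos  # ignores the gap
    print(tx_pos, naive, transcript_to_genome(tx_pos, 1000, ops))
```

Positions upstream of the gap agree with the naive offset, while every position downstream of it is off by the gap length, which is why a variant reported against the RefSeq can land on the wrong genomic base when the alignment is ignored.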




17.8% of RefSeq transcripts differ from
GRCh37




                                              5.4% have coordinate-changing differences




                    Garla, V., Kong, Y., Szpakowski, S., & Krauthammer, M. (2011).
                    MU2A--reconciling the genome and transcriptome to determine the effects of base substitutions.
                    Bioinformatics (Oxford, England), 27(3), 416-8. doi:10.1093/bioinformatics/btq658


Sources of transcript information

 ➢   NCBI:
     ●   maps current transcripts to the current genome only
     ●   maps with splign
     ●   ~18% of transcripts don't agree with the ref genome
     ●   no local database option
 ➢   UCSC:
     ●   current transcripts only
     ●   maps using blat
 ➢   Ensembl:
     ●   aligns using in-house gene building process
     ●   cross-linked to refseqs
     ●   incorporates NCBI transcripts ad hoc
     ●   well-maintained; good API; broad data; VEP


PTEN: insertion/deletion in 5' UTR




NEFL: genome insertion leads to
frameshift/stop




RefSeq Handling
 ---------- Forwarded message ----------
 Date: Wed, Jan 25, 2012 at 1:59 PM
 Subject: [Genome] How does UCSC hg19 gene model add exons to
 RefSeqs?
 To: genome@soe.ucsc.edu

 Hi, when using the human reference hg19 gene model
 …
 where the hg19 model has an exon that does not exist in
 the RefSeq accession (or any historical version of the
 RefSeq accession).

 How/why does the alignment introduce an intron in this case?
 Does it ensure there are plausible flanking splice junctions
 before inserting an intron to a RefSeq sequence that lacks
 it but it maps to?




338 genes so far




➢ We should encourage LRG and adopt it when ready
(and we'll still have to deal with legacy transcripts)
Not pictured: Jon Sorenson
