On the frontier of genotype-2-phenotype data integration

Melissa Haendel, PhD
March 22, 2016
AMIA TBI
@monarchinit @ontowonka
haendel@ohsu.edu

Filling the G2P knowledge gap from other organisms
Other organisms: rat, fly, worm, mouse, zebrafish

monarchinitiative.org
[Figure labels: Ulcerated paws; Palmoplantar hyperkeratosis; Thick hand skin]

Challenge: Each database uses its own vocabulary/ontology
Ontologies: MP, HP
Databases: MGI, HPOA

Challenge: Each database uses its own phenotype vocabulary/ontology
Vocabularies/ontologies: ZFA, MP, DPO, WPO, HP, OMIA, VT, FYPO, APO, SNO, MED, …
Databases: WB, PB, FB, OMIA, MGI, RGD, ZFIN, SGD, HPOA, EHR, IMPC, OMIM, …, QTLdb

Can we help machines understand phenotype terms?
Human phenotype: “Palmoplantar hyperkeratosis”
Machine: “I have absolutely no idea what that means”

The Human Phenotype Ontology
[Figure labels: Hyposmia; sensory perception of smell; Abnormality of globe location; Deeply set eyes; Abnormal eye morphology; eyeball of camera-type eye; Motor neuron atrophy; motor neuron (CL)]
Annotation counts shown: 34571 annotations in 22 species; 157534 phenotype annotations; 2150 phenotype annotations

Genotype-phenotype integration
91% of our 2.2 million G2P associations required integrating 2 or more data sources
(this number does not even include orthology (Panther) or any ontologies!)
[Pie chart legend: One source; Two sources; 3 or more. Labeled slices: 9%, 91%]
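
The one/two/three-or-more breakdown above can be reproduced by counting the distinct sources behind each association. This is a minimal sketch, assuming each association carries the set of sources it was assembled from; the association IDs and records are made up for illustration and are not Monarch data.

```python
from collections import Counter

# Hypothetical records: association id -> set of contributing sources.
associations = {
    "assoc-1": {"MGI"},
    "assoc-2": {"MGI", "IMPC"},
    "assoc-3": {"ZFIN", "OMIM", "HPOA"},
    "assoc-4": {"OMIM", "HPOA"},
}

def bucket(n_sources: int) -> str:
    """Map a source count onto the three legend categories."""
    if n_sources == 1:
        return "one source"
    if n_sources == 2:
        return "two sources"
    return "3 or more"

def source_breakdown(assocs):
    """Return the fraction of associations falling in each category."""
    counts = Counter(bucket(len(sources)) for sources in assocs.values())
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}

breakdown = source_breakdown(associations)
# Fraction of associations that needed two or more sources.
integrated = 1 - breakdown.get("one source", 0.0)
```

With real data, `integrated` would correspond to the 91% figure on the slide.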

Diagnosing an undiagnosed disease
www.owlsim.org
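
One way to picture ontology-based phenotype matching is to compare the ancestor sets of two terms. The slides do not describe the scoring that OWLSim uses, so the sketch below substitutes a plain Jaccard index over a toy is-a graph. The term labels echo the slides, but every edge and every intermediate class name is invented for illustration.

```python
# Toy is-a graph (child -> parents). The edges and the intermediate classes
# are hypothetical and are not taken from HP, MP, or any real ontology.
PARENTS = {
    "Palmoplantar hyperkeratosis": ["Thickened skin of autopod"],
    "Thick hand skin": ["Thickened skin of autopod"],
    "Ulcerated paws": ["Abnormal skin of autopod"],
    "Thickened skin of autopod": ["Abnormal skin of autopod"],
    "Abnormal skin of autopod": ["Phenotypic abnormality"],
    "Hyposmia": ["Phenotypic abnormality"],
    "Phenotypic abnormality": [],
}

def ancestors(term):
    """Return the term plus everything reachable through is-a edges."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(PARENTS.get(current, []))
    return seen

def jaccard(term_a, term_b):
    """Shared ancestors divided by all ancestors of either term."""
    a, b = ancestors(term_a), ancestors(term_b)
    return len(a & b) / len(a | b)

def rank(query, candidates):
    """Order candidate terms from most to least similar to the query."""
    return sorted(candidates, key=lambda c: jaccard(query, c), reverse=True)

ranked = rank(
    "Palmoplantar hyperkeratosis",
    ["Hyposmia", "Ulcerated paws", "Thick hand skin"],
)
```

Under these toy edges, a human term and a model-organism term that share a close parent rank above an unrelated term, which is the intuition behind cross-species matching.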

Phenotype Exchange Standard
www.phenopackets.org
[Diagram: Integrated Data Landscape]
Phenopacket flow: Patient registries; Databases, Web tools, Algorithms; Phenopacket Registry; Journals; Diagnostic screening programs; Clinical trials
Primary benefits to stakeholders: Mechanistic discovery; Improved searchability; Tool/algorithm creation; Cohort identification; Patient matchmaking; Diagnosis speed/accuracy
Stakeholders: Patients/Families; Physicians; Organismal biologist

What’s in a Phenopacket?
Ontology-based phenotypic descriptions for:
• Human patients, model organisms, or any organism
• Groupings of human patients or organisms
What does it include?
• Age of patient or organism
• Sex of patient or organism
• Disease (if named)
• Age of onset of disease
• Positive and negative phenotype associations
• Reference to genes, variants, or collections of variants
• Reference to environmental factors
Multiple formats: TSV, JSON, YAML
Validation tools
Uses a standardized publication citation mechanism for data sharing
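
To make the list above concrete, here is a rough sketch of how such a record might look when serialized to JSON, together with a very small validity check. The field names, the placeholder values, and the `validate` helper are all assumptions made for illustration; they are not the Phenopacket schema or its validation tools.

```python
import json

# Illustrative record only: field names and values are hypothetical.
packet = {
    "subject": {"id": "patient-1", "kind": "human patient",
                "age": "34 years", "sex": "female"},
    "disease": None,                  # disease (if named)
    "age_of_onset": "childhood",
    "phenotypes": [
        {"term": "Palmoplantar hyperkeratosis", "observed": True},
        {"term": "Hyposmia", "observed": False},  # negative association
    ],
    "genotype_refs": ["VARIANT-PLACEHOLDER"],
    "environment_refs": [],
    "citation": "CITATION-PLACEHOLDER",
}

REQUIRED_KEYS = {"subject", "phenotypes"}

def validate(record):
    """Return a list of problems; an empty list means the record passed."""
    problems = []
    for key in sorted(REQUIRED_KEYS - set(record)):
        problems.append(f"missing key: {key}")
    for entry in record.get("phenotypes", []):
        if "term" not in entry or "observed" not in entry:
            problems.append(f"incomplete phenotype entry: {entry}")
    return problems

# Serialize to JSON and read it back to confirm nothing is lost.
as_json = json.dumps(packet, indent=2)
round_tripped = json.loads(as_json)
```

The `observed` flag is one simple way to carry both positive and negative phenotype associations in the same list.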

brca-website.cloudapp.net
• 13501 variants from ENIGMA, ClinVar, LOVD, exLOVD, BIC
• Merged by genomic coordinate and alternate allele string
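
Merging by genomic coordinate and alternate allele string amounts to grouping records on a shared key. The sketch below assumes each source record exposes a chromosome, position, and alternate allele; the records, coordinates, and calls are invented, and the actual merge pipeline behind the site is not described in the slides.

```python
from collections import defaultdict

# Hypothetical source records; none of these values are real variant data.
records = [
    {"source": "ClinVar", "chrom": "17", "pos": 1000, "alt": "T", "call": "pathogenic"},
    {"source": "LOVD", "chrom": "17", "pos": 1000, "alt": "T", "call": "uncertain"},
    {"source": "BIC", "chrom": "13", "pos": 2000, "alt": "G", "call": "benign"},
    {"source": "ENIGMA", "chrom": "17", "pos": 1000, "alt": "C", "call": "benign"},
]

def merge_variants(recs):
    """Group records by (chromosome, position, alternate allele)."""
    merged = defaultdict(dict)
    for rec in recs:
        key = (rec["chrom"], rec["pos"], rec["alt"])
        merged[key][rec["source"]] = rec["call"]
    return dict(merged)

def conflicting(merged):
    """Return only the variants on which sources report different calls."""
    return {key: calls for key, calls in merged.items()
            if len(set(calls.values())) > 1}

merged = merge_variants(records)
conflicts = conflicting(merged)
```

Grouping on the alternate allele as well as the position keeps two different substitutions at the same site apart, while `conflicting` surfaces merged variants whose sources disagree.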

Problems with evidence and provenance of G2P associations
PROBLEMS:
• Variants have different pathogenicity calls due to annotation inconsistency AND different experimental evidence
• Evidence and provenance are incomplete, not computable, and frequently conflated
• Annotations are to different aspects of the genotype: allele, variant, gene, transcript, etc.
A computable model would:
• provide context to evaluate credibility/confidence
• support filtering and analysis of data
• record a detailed history for attribution

Building a computable model for ACMG guidelines
http://brcaexchange.org/
Provenance:
- Materials & methods
- Agent(s) of evidence
- Agent(s) of claim
- Time and place
Evidence:
- Data (e.g., images, sequences)
- Evidence codes
- Publications
- Confidence (p-val, z-score)
- Summary figures
- Conclusions from previous studies
- Domain expert’s knowledge
Claim:
- Causal relationships, hypothesized relationships, correlations, etc.
https://github.com/monarch-initiative/SEPIO-ontology
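
The separation of provenance, evidence, and claim can be sketched as three small record types. This is only an illustration of why keeping them apart supports filtering and attribution; the class and field names are assumptions and do not reproduce the SEPIO ontology, and the example values are placeholders.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Provenance:
    agents_of_evidence: List[str]
    agents_of_claim: List[str]
    methods: str = ""
    time_and_place: str = ""

@dataclass
class Evidence:
    evidence_code: str
    publications: List[str] = field(default_factory=list)
    p_value: Optional[float] = None  # confidence, when one was reported

@dataclass
class Claim:
    subject: str    # e.g. a variant
    relation: str   # causal, hypothesized, correlated, ...
    target: str     # e.g. a phenotype or disease
    evidence: List[Evidence] = field(default_factory=list)
    provenance: Optional[Provenance] = None

def well_supported(claims, max_p):
    """Keep claims with at least one evidence item at or below max_p."""
    return [c for c in claims
            if any(e.p_value is not None and e.p_value <= max_p
                   for e in c.evidence)]

# Placeholder claims; identifiers and numbers are made up.
claims = [
    Claim("VARIANT-A", "causes", "PHENOTYPE-X",
          evidence=[Evidence("CODE-1", ["PUB-1"], p_value=0.001)],
          provenance=Provenance(["lab-1"], ["curator-1"])),
    Claim("VARIANT-B", "correlates with", "PHENOTYPE-X",
          evidence=[Evidence("CODE-2")]),
]
kept = well_supported(claims, max_p=0.01)
```

Because who produced the evidence and who made the claim are stored separately, attribution survives even after claims are filtered or merged.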

Summary
• Ontologies can be used to perform deep phenotyping integration across species
• An exchange standard is needed to facilitate distributed phenotype data sharing
• A computable G2P evidence model can aid variant interpretation

Acknowledgements
Lawrence Berkeley: Chris Mungall, Nicole Washington, Suzanna Lewis, Jeremy Nguyen, Seth Carbon
Charité: Peter Robinson, Sebastian Kohler
U of Pittsburgh: Harry Hochheiser, Mike Davis, Joe Zhou
OHSU: Nicole Vasilevsky, Matt Brush, Kent Shefchek, Julie McMurry, Tom Conlin
Genomics England: Damian Smedley, Jules Jacobsen
UCSC: David Haussler, Benedict Paten, Mark Diekhans, Melissa Cline
Garvan: Tudor Groza, Craig McNamara, Edwin Zhang
FUNDING: NIH Office of Director: 1R24OD011883; NIH-UDP: HHSN268201300036C, HHSN268201400093P; NCI/Leidos #15X143; BD2K U54HG007990-S2 (Haussler) & BD2K PA-15-144-U01 (Kesselman)
