5. The Challenge
Variety of users / diversity of scientific questions
Users: scientists, medical doctors, data scientists
Graph database
6. Biological question:
Are human T2D genes enzymes acting on metabolites which in turn are regulated in pig diabetes model?
The actual question (from a data-point-of-view):
Is there a connection between A and R?
=> 3s to look into the Excel sheet
Why graph? Easy scientific question
7.
The actual question (from a data-point-of-view):
Is there a connection between A and R?
=> 3s to look into the graph
[Figure: graph with nodes A, B, C, E, D, F, G, K, Q, R, S, W, Z, U]
Why graph? Easy scientific question
8. Back to the question
Are human T2D genes enzymes acting on metabolites which in turn are regulated in pig diabetes model?
Genomics (human diabetic data): Genes, SNPs, Proteins, Enzymes, Pathways, Metabolites
Metabolomics (pre-diabetic pig): Metabolites
• List of SNPs
• List of loci
• List of Genes of (species 1)
• List of Proteins of (species 1)
• List of Enzymes of (species 1)
• List of Pathways of (species 1)
• List of Metabolites of (species 1)
• List of Metabolites of (species 2)
graph
9. Why graph? -> why not relational
• biomedical data / healthcare data is highly connected
• => variety of data
=> unstructured
=> heterogeneous
=> not connected
=> unFAIR
• easy to model
• extremely flexible / easily adaptable ("re-shaping the graph") vs. static SQL model
• scalable (billions of nodes + relationships on a single machine)
• easy to query (cyclic dependencies)
• GraphDataScience library + graph embeddings
12. DZDconnect: stats
• PROD-Server: 323m nodes, 1.1bn relationships => 480GB
• DEV-Server: 1.1bn nodes, 4.8bn relationships
• Single server (60 CPUs, 256GB memory, only SSDs)
• 4 developers
• Neo4j enterprise (live backup, GDS)
• UI: Flask web server, SemSpect, Neo4j Browser
• Visualization for interactive browsing (SemSpect by derive GmbH)
• Bloom (semi-natural-language queries)
Awards: Graphie Award 2018, Strata Data Award finalist 2019, bytes4diabetes Award 2020
We have
DB role model
13. DZDconnect: data integration + ML
Gene -[CODES]-> RNA -[CODES]-> Protein
Gene -[CODES*]-> Protein
• Python
• Py2Neo, GraphIO
• Docker Pipeline for orchestration (open-source by DZD)
• Based on integrated data => annotate / enrich
• text matching + Natural Language Processing
• „shortcuts“ for queries (reduce #hops)
• inferring knowledge
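The "shortcuts" idea above can be sketched in plain Python: wherever two CODES hops connect a gene to a protein via a transcript, a direct gene-to-protein edge is derived so that later queries need one hop instead of two. This is a minimal in-memory illustration, not the DZDconnect pipeline; the function name `derive_shortcuts` and the identifiers `gene:G1`, `rna:T1` and so on are hypothetical.

```python
def derive_shortcuts(codes_edges):
    """Return direct (source, target) pairs for every two-hop CODES path.

    codes_edges: iterable of (source, target) pairs, e.g. gene -> transcript
    and transcript -> protein. All identifiers here are made up.
    """
    by_source = {}
    for src, dst in codes_edges:
        by_source.setdefault(src, set()).add(dst)

    shortcuts = set()
    for first, middles in by_source.items():
        for middle in middles:
            # Follow the second hop, if the middle node has outgoing edges.
            for last in by_source.get(middle, ()):
                shortcuts.add((first, last))
    return shortcuts


# Toy data: one gene with two transcripts, each coding for one protein.
codes = [
    ("gene:G1", "rna:T1"),
    ("gene:G1", "rna:T2"),
    ("rna:T1", "protein:P1"),
    ("rna:T2", "protein:P2"),
]
shortcut_edges = derive_shortcuts(codes)
print(sorted(shortcut_edges))
```

In a graph database the same effect would typically be achieved by writing the derived relationship back into the graph, trading some storage for shorter query paths.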
16. The Challenge
User with a specific input => specific output
Scientist: multi-omics experiment -> Flask app -> output
17. The Challenge
User: "start somewhere -> freely explore knowledge"
SemSpect: interactive browsing, start from any node
Users: scientist or medical doctor
18. The Challenge
User with data analysis skills / computer scientist
Scientist
Start from any node
Cypher query language
Graph Data Science
19. Use case 1
Handle mapping identifiers of molecular entities
Knowledge Graph
20. Query „friends of a friend“ on a gene level
Example: diabetes-relevant gene 'TCF7L2'
match path=(g:Gene{sid:'TCF7L2'})-[:MAPS|SYNONYM*0..2]-(g1:Gene) return path
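To make the `*0..2` part of the query concrete, here is a small Python sketch of the same idea: starting from one gene, collect everything reachable over at most two MAPS or SYNONYM hops, ignoring direction. It runs on an in-memory edge list rather than on Neo4j, assumes every node is a gene, and uses invented identifiers (`ID_A`, `ID_B`, ...) apart from the start node.

```python
from collections import deque


def within_hops(edges, start, max_hops=2, allowed=("MAPS", "SYNONYM")):
    """Breadth-first search returning {node: hop distance} up to max_hops.

    edges: iterable of (node_a, relationship_type, node_b), treated as
    undirected, mirroring the undirected pattern in the Cypher query.
    """
    adjacency = {}
    for a, rel, b in edges:
        if rel in allowed:
            adjacency.setdefault(a, set()).add(b)
            adjacency.setdefault(b, set()).add(a)

    distance = {start: 0}  # the start node itself is the 0-hop match
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if distance[node] == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbour in adjacency.get(node, ()):
            if neighbour not in distance:
                distance[neighbour] = distance[node] + 1
                queue.append(neighbour)
    return distance


# Toy edges; only the start node name comes from the slide.
toy_edges = [
    ("TCF7L2", "MAPS", "ID_A"),
    ("ID_A", "SYNONYM", "ID_B"),
    ("ID_B", "MAPS", "ID_C"),      # three hops away, so excluded
    ("TCF7L2", "CODES", "ID_X"),   # wrong relationship type, so excluded
]
reachable = within_hops(toy_edges, "TCF7L2")
print(reachable)
```

Unlike the Cypher query, this returns nodes with their hop distance rather than full paths.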
21. Use case 2
Find information that is NOW connected
Knowledge Graph
22. Query for SNPs (mutations) associated to diabetes
Output: relevant protein and its function (ontology terms)
match (tr:Trait)
where tr.name contains 'diabetes mellitus'
with tr as disease
match path=(disease)<-[:ASSOCIATED_WITH_TRAIT]-(asso:Association)<-[:SNP_HAS_ASSOCIATION]-(snp:SNP)-[:SNP_HAS_GENE]-(gene:Gene)-[:MAPS]-(g1:Gene)-[x:CODES]->(transcript:Transcript)-[:CODES]->(prot:Protein)-[:ASSOCIATION]->(term:Term)--(o:Ontology)
return path
23. Use case 3
Using graph algorithms to infer new insights
Natural Language Processing
Ontologies
Knowledge Graph
24. Google's PageRank algorithm - find the most relevant gene
finding ACE2 - the receptor the SARS-CoV-2 virus uses to enter the cell
• 140,000 abstracts from COVID-19-related publications
• Named Entity Recognition of gene names
• PageRank identified 'ACE2' as the most relevant gene
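As a rough illustration of how PageRank surfaces a hub, the sketch below runs a plain power-iteration PageRank over a tiny undirected co-mention graph. The slides use the Neo4j Graph Data Science library on real abstracts; this standalone version, its damping factor of 0.85, and the toy edges (including the placeholder names `GENE_B` to `GENE_E`) are assumptions made purely for illustration and carry no biological meaning.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Power-iteration PageRank on an undirected graph.

    edges: iterable of (node_a, node_b) pairs. Every node appears in at
    least one edge, so no node has degree zero.
    """
    nodes = sorted({n for edge in edges for n in edge})
    neighbours = {n: set() for n in nodes}
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)

    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        new_rank = {}
        for n in nodes:
            # Each neighbour shares its rank equally among its own neighbours.
            incoming = sum(rank[m] / len(neighbours[m]) for m in neighbours[n])
            new_rank[n] = (1 - damping) / len(nodes) + damping * incoming
        rank = new_rank
    return rank


# Invented co-mention edges: ACE2 is deliberately placed as the hub.
toy_mentions = [
    ("ACE2", "GENE_B"),
    ("ACE2", "GENE_C"),
    ("ACE2", "GENE_D"),
    ("ACE2", "GENE_E"),
    ("GENE_B", "GENE_C"),
]
ranks = pagerank(toy_mentions)
top_gene = max(ranks, key=ranks.get)
print(top_gene)
```

In the toy graph the hub wins by construction; on real data the ranking depends on how gene mentions are linked, which the slides do not spell out.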
25. Who’s this ACE2-guy?
source: https://www.benaroyaresearch.org/blog/post/11-things-know-about-mrna-vaccines-covid-19
26. Use case 4
Using node embeddings to subphenotype diabetic patients
31. k-nearest neighbour clustering with k=5
representing the 5 diabetes subtypes
[Figure: patients 01 to 05 linked by graph algorithms and grouped by similarity]
subphenotyping of diabetic patients
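The nearest-neighbour step can be sketched as follows: each patient is represented by an embedding vector, and for every patient the k most similar other patients are found by cosine similarity. This is a simplified stand-in for the Graph Data Science kNN procedure mentioned in the talk; the two-dimensional vectors and the choice of k=2 are invented toy values, not patient data.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def k_nearest(embeddings, k):
    """Map each name to the names of its k most similar other entries."""
    result = {}
    for name, vector in embeddings.items():
        scored = [
            (cosine(vector, other_vector), other_name)
            for other_name, other_vector in embeddings.items()
            if other_name != name
        ]
        scored.sort(reverse=True)  # highest similarity first
        result[name] = [other_name for _, other_name in scored[:k]]
    return result


# Hypothetical 2-d embeddings; real node embeddings have many more dimensions.
toy_embeddings = {
    "patient 01": [1.0, 0.1],
    "patient 02": [0.9, 0.2],
    "patient 03": [0.1, 1.0],
    "patient 04": [0.2, 0.9],
    "patient 05": [0.7, 0.7],
}
neighbours = k_nearest(toy_embeddings, k=2)
print(neighbours["patient 01"])
```

The resulting neighbour lists form a similarity graph; grouping patients into subtypes would be a further step on top of it.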
32. DZDconnect
connect patient data with knowledge graph
[Figure: knowledge graph nodes: Transcript, Gene, Synonyms, Abstract, PubMed, Article, Keyword, MeSH-term, Ontology term]
Hello role-model :-)
33. Take home message
• Knowledge graph
  • as single point of truth
  • connect in-house data
  • scalability
  • infer new insights
• Use cases:
  • simple and advanced (Cypher) queries
  • Graph Data Science library (PageRank, kNN)
  • Node embeddings for complex data
  • NLP
• Visualization of graph
  • different users
  • Flask app, browser, SemSpect, …