We have created a large Neo4j database that integrates the results from text mining, experimental data and biological background knowledge. The utility of this graph is twofold:
- Identify promising compounds to be tested as a starting point for drug development.
- Better understand the results of large scale compound testing in cellular assays using imaging technology.
Currently the database contains 25 million article abstracts, data for 2 million compounds and 60000 genes – overall 29 million nodes and 270 million relationships.
We show some details about how the graph was built and give examples of how combining text mining with experimental results leads to new insights and to better understanding and design of biological experiments.
Connecting the Dots in Early Drug Discovery
1. Novartis Institutes for BioMedical Research (NIBR)
Connecting the dots in early drug discovery
Stephan Reiling
Senior Scientist, Novartis Institutes for BioMedical Research
2. Connecting the dots in early drug discovery
Stephan Reiling
In-Silico Lead Discovery Group
Novartis Institutes for BioMedical Research (NIBR) Cambridge
GraphConnect 2016, San Francisco
3. Novartis Institutes for BioMedical Research (NIBR)
Why (might you be interested in this talk)
• The talk shows how a lot of heterogeneous data can be integrated into one big graph
– Greater than the sum of its parts
• Text mining and pattern detection can lead to valuable insights
– Nobody can read 25 million scientific papers
• Data mining this graph can give novel biological insights
– Connecting the dots
4. Novartis Institutes for BioMedical Research (NIBR)
Why (did we build the graph)
Treatment effects in cellular phenotypic assays
[Figure: compound treatment]
5. Novartis Institutes for BioMedical Research (NIBR)
Why
• What we have (the dots)
– Almost 1 billion data points of compound activity data on protein targets (~99% of which can be summarized as “not active”)
– More and more results of phenotypic assays
• What we lack (the connections)
– A good way to use biological knowledge or background information to make a connection
– A storage for “biological knowledge” that can be “queried”
[Figure: Compound, Gene, Disease (Phenotype)]
6. Novartis Institutes for BioMedical Research (NIBR)
How (did we build the graph)
Text mining for chemicals, diseases, proteins
In continuation of our investigation on novel stearoyl-CoA desaturase (SCD) 1 inhibitors, we have already reported on the structural modification of the benzoylpiperidines that led to a series of novel and highly potent spiropiperidine-based SCD1 inhibitors. In this report, we would like to extend the scope of our previous investigation and disclose details of the synthesis, SAR, ADME, PK, and pharmacological evaluation of the spiropiperidines with high potency for SCD1 inhibition. Our current efforts have culminated in the identification of 5-fluoro-1'-{6-[5-(pyridin-3-ylmethyl)-1,3,4-oxadiazol-2-yl]pyridazin-3-yl}-3,4-dihydrospiro[chromene-2,4'-piperidine] (10e), which demonstrated a very strong potency for liver SCD1 inhibition (ID(50)=0.6 mg/kg). This highly efficacious inhibition is presumed to be the result of a combination of strong enzymatic inhibitory activity (IC(50) (mouse)=2 nM) and good oral bioavailability (F >95%). Pharmacological evaluation of 10e has demonstrated potent, dose-dependent reduction of the plasma desaturation index in C57BL/6J mice on a high carbohydrate diet after a 7-day oral administration (q.d.). In addition, it did not cause any noticeable skin abnormalities up to the highest dose (10 mg/kg).
7. Novartis Institutes for BioMedical Research (NIBR)
How (did we build the graph)
Text mining for chemicals, diseases, proteins
Hit | Type | Recognized text | Smiles
T1 | GeneOrProtein | stearoyl-CoA desaturase |
T2 | Mechanism | inhibitors |
T3 | G | benzoylpiperidines |
T4 | D | spiropiperidine | O=C(NC(Cc1c[nH]c2ccccc12)C(=O)N3CCC4(CC3)CCc5ccccc45)NC6CN7CCC6CC7
T5 | GeneOrProtein | SCD1 |
T6 | Mechanism | inhibitors |
T7 | GeneOrProtein | SCD1 |
T8 | M | 5-fluoro-1'-{6-[5-(pyridin-3-ylmethyl)-1,3,4-oxadiazol-2-yl]pyridazin-3-yl}-3,4-dihydrospiro[chromene-2,4'-piperidine] | FC1=C2CCC3(OC2=CC=C1)CCN(CC3)C=3N=NC(=CC3)C=3OC(=NN3)CC=3C=NC=CC3
T9 | GeneOrProtein | SCD1 |
T10 | G | carbohydrate |
T11 | Disease | skin abnormalities |
8. Novartis Institutes for BioMedical Research (NIBR)
How (did we build the graph)
• ~25,000,000 article abstracts
• 5,600 journals
• 1946 – current
• Tagged with “MeSH terms” (MeSH: Medical Subject Headings)
National Institutes of Health (NIH) PubMed http://www.ncbi.nlm.nih.gov/pubmed
http://www.ncbi.nlm.nih.gov/pubmed/?term=20801551
9. Novartis Institutes for BioMedical Research (NIBR)
How
Structure of the MeSH term hierarchy (partial)
Yellow: Diseases
Blue: Processes and Mechanisms
Green: Anatomy
Red: Chemicals and Drugs
Grey: Organisms
12. Novartis Institutes for BioMedical Research (NIBR)
How
Association rule mining of co-occurrences
Article 1: Compound A, Gene 1, Gene 2
Article 2: Compound A, Compound B, Gene 1
Article 3: Compound A, MeSH term X, Gene 1
Article 4: Compound C, Gene 1
• Identification of entities (compounds, MeSH terms, genes, diseases, …) from PubMed annotations or text mining
• The Apriori algorithm from association rule mining is used to identify frequently co-mentioned entities (aka market basket analysis)
• Associations above a certain association strength (lift) and number of articles in which they are co-mentioned (support) are stored
• The association strength is scaled to 0-1 and stored as the uncertainty of the association (high lift = low uncertainty)
• Articles are stored as well, including the entities that are mentioned in them
• This only captures the fact that something is frequently co-mentioned with something else, not any causality (similar to correlation)
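The support and lift of co-mentioned pairs can be illustrated with a minimal Python sketch over the four example articles. This is not the pipeline used for the graph: it only counts pairs rather than running a full Apriori search, and the min-max scaling in `scale_to_uncertainty`, its thresholds, and all function names are assumptions made for illustration, since the slides do not specify how lift is scaled to 0-1.

```python
from collections import Counter
from itertools import combinations

# The four example articles from the slide, as sets of mentioned entities.
articles = [
    {"Compound A", "Gene 1", "Gene 2"},
    {"Compound A", "Compound B", "Gene 1"},
    {"Compound A", "MeSH term X", "Gene 1"},
    {"Compound C", "Gene 1"},
]


def pair_stats(articles):
    """Return {(a, b): (support, lift)} for every co-mentioned entity pair."""
    n = len(articles)
    single = Counter()  # number of articles mentioning each entity
    pairs = Counter()   # number of articles mentioning both entities
    for entities in articles:
        single.update(entities)
        pairs.update(combinations(sorted(entities), 2))
    stats = {}
    for (a, b), support in pairs.items():
        # lift = P(a and b) / (P(a) * P(b))
        lift = (support / n) / ((single[a] / n) * (single[b] / n))
        stats[(a, b)] = (support, lift)
    return stats


def scale_to_uncertainty(stats, min_support=1, min_lift=1.0):
    """Hypothetical scaling: min-max scale lift so that high lift = low uncertainty."""
    kept = {pair: lift for pair, (support, lift) in stats.items()
            if support >= min_support and lift >= min_lift}
    if not kept:
        return {}
    low, high = min(kept.values()), max(kept.values())
    span = high - low
    return {pair: (1.0 - (lift - low) / span) if span else 0.0
            for pair, lift in kept.items()}


stats = pair_stats(articles)
uncertainty = scale_to_uncertainty(stats)
for pair in sorted(uncertainty, key=uncertainty.get):
    print(pair, stats[pair], round(uncertainty[pair], 3))
```

In this toy data Gene 1 appears in every article, so its pairs get a lift of 1.0 even though Compound A and Gene 1 are co-mentioned three times. Lift rewards pairs that co-occur more often than their individual frequencies would predict.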
13. Novartis Institutes for BioMedical Research (NIBR)
What (can you do with this)
Example: disease – compound – target from text mining
Every relationship in the graph has a property “uncertainty” in the range of 0-1.
This makes it possible to query for the connections with the highest confidence (lowest summed uncertainty).
Color code: Disease, Gene, Compound
From Wikipedia: Tafamidis (INN, or Fx-1006A, trade name Vyndaqel) is a drug for the amelioration of transthyretin-related hereditary amyloidosis (also familial amyloid polyneuropathy, or FAP), a rare but deadly neurodegenerative disease.
From Wikipedia: Canavan disease is caused by a defective ASPA gene which is responsible for the production of the enzyme aspartoacylase. Decreased aspartoacylase activity prevents the normal breakdown of N-acetyl aspartate, wherein the accumulation of N-acetylaspartate, or lack of its further metabolism interferes with growth of the myelin sheath of the nerve fibers of the brain.
MATCH p =
(cpd:Compound) -[:is_associated]-> (g:Gene) -[:is_associated]-> (d:Disease) <-[:is_associated]- (cpd)
RETURN p, reduce(u=0.0, r in relationships(p) | u+r.uncertainty) as unc
ORDER BY unc
14. Novartis Institutes for BioMedical Research (NIBR)
What (can you do with this)
So why not just load Wikipedia?
Disease | Uncertainty
Canavan Disease | 0.1
Pelizaeus-Merzbacher Disease | 0.364
Alexander Disease | 0.432
Diffuse Axonal Injury | 0.432
Brain Diseases, Metabolic | 0.451
MATCH p = (cpd:Compound {name: 'N-acetylaspartate'}) -[r:is_associated]-> (m:Disease)
RETURN m.name as Disease, r.uncertainty as Uncertainty
ORDER BY r.uncertainty LIMIT 5
15. Novartis Institutes for BioMedical Research (NIBR)
What (can you do with this)
Now this is getting more interesting (for us)
N-acetylaspartate association with cellular components
MATCH p = (cpd:Compound {name: 'N-acetylaspartate'})
-[r:is_associated]-> (m:CellularComponent)
RETURN m.name as CellularComponent, r.uncertainty as Uncertainty
ORDER BY r.uncertainty LIMIT 5
CellularComponent | Uncertainty
Axons | 0.582
Myelin Sheath | 0.611
Extracellular Fluid | 0.772
N-acetylaspartate association with biological processes
MATCH p = (cpd:Compound {name: 'N-acetylaspartate'})
-[r:is_associated]-> (m:BiologicalProcess)
RETURN m.name as BiologicalProcess, r.uncertainty as Uncertainty
ORDER BY r.uncertainty LIMIT 5
BiologicalProcess | Uncertainty
Energy Metabolism | 0.476
Dominance, Cerebral | 0.532
Functional Laterality | 0.586
Cerebrovascular Circulation | 0.653
Lipid Metabolism | 0.72
16. Novartis Institutes for BioMedical Research (NIBR)
How (did we build the graph)
Other data sources integrated
(*: NIBR internal data)
See Acknowledgments / References slide
Data sources:
1. MeSH Hierarchy
2. PubMed articles (pubmed_id, title, abstract, Lucene full text searches enabled)
3. PubMed Associations
4. Comparative Toxicogenomics Database (CTD)
5. Compound Target Scores*
6. Public compound annotations
7. Entity relations from sentences
8. Protein-protein interactions data set from CCSB
9. MetaCore gene - gene interactions (binds, activates, regulates expression, …)
10. Similarity relations for all the compounds in the graph* (~2M compounds)
11. Gene ontology
12. Protein annotations
13. Pathways / gene sets
Objects:
• 25,430,635 articles
• 1,951,819 compounds
• 257,000 MeSH and SCR terms
• 59,859 Genes
• 24,769 GO terms
• 10,570 Diseases
Relationships:
91 different relationships
• Compound – is_active – Gene
• X – is_associated – X
• Gene – binding – Gene
• Gene – ubiquitinates – Gene
• Compound – affects_ubiquitination – Gene
• Article – mentions – (compound, gene, mesh)
209,031,615 mentions
50,334,440 is_similar
6,951,257 literature_association
762,002 is_active
30 Million nodes, 480 Million relationships
18. Novartis Institutes for BioMedical Research (NIBR)
How (did we build the graph)
Overall build process
[Figure: build pipeline. Sources: PubMed XML files (titles, abstracts), internal data sources, MeSH hierarchies, ctdbase, PubChem, ChEMBL, ChEBI, CCSB, MetaCore, compound similarities, gene sets, protein annotations, gene ontologies. Components: information extraction, MongoDB, PostgreSQL, CSV file staging, Neo4j]
• Information extraction (entity recognition, relationship detection, association rule mining) is done on a Linux cluster
• Neo4j “endpoint” focused on graph mining
• MongoDB and PostgreSQL are also used for data mining purposes
19. Novartis Institutes for BioMedical Research (NIBR)
What (can you do with this)
Example: Analysis of compound activities
[Figure: compounds A to H, grouped into active compounds and inactive compounds]
20. Novartis Institutes for BioMedical Research (NIBR)
What
Example: Analysis of compound activities
[Figure: active and inactive compounds A to H, connected to nodes 1 to 6]
1. Find genes directly affected by the compounds
21. Novartis Institutes for BioMedical Research (NIBR)
What
Example: Analysis of compound activities
[Figure: active and inactive compounds A to H, connected to nodes 1 to 10]
1. Find genes directly affected by the compounds
2. Find all genes that are indirectly affected with some confidence (below a given uncertainty)
22. Novartis Institutes for BioMedical Research (NIBR)
What
Example: Analysis of compound activities
[Figure: active and inactive compounds A to H, connected to nodes 1 to 10]
1. Find genes directly affected by the compounds
2. Find all genes that are indirectly affected with some confidence (below a given uncertainty)
3. Assign a large distance to nodes that cannot be reached
4. Identify nodes that
• cannot be reached by most of the inactive compounds
• or are “closer” to the actives than the inactives
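Steps 1 to 3 can be sketched in plain Python as a bounded traversal that sums relationship uncertainties along each path. This is only an illustration under stated assumptions: the toy graph, the node names, and the `reachable` function are hypothetical, and the two-hop limit and 0.9 cutoff are chosen for the example rather than taken from the production setup.

```python
def reachable(graph, start, max_hops=2, max_uncertainty=0.9):
    """Return {node: smallest summed uncertainty} for nodes within max_hops of start.

    graph maps a node to a list of (neighbor, uncertainty) pairs.
    Paths whose summed uncertainty reaches max_uncertainty are discarded.
    """
    best = {}
    frontier = [(start, 0.0)]
    for _ in range(max_hops):
        next_frontier = []
        for node, total in frontier:
            for neighbor, uncertainty in graph.get(node, []):
                new_total = total + uncertainty
                if new_total >= max_uncertainty or neighbor == start:
                    continue
                if new_total < best.get(neighbor, float("inf")):
                    best[neighbor] = new_total
                next_frontier.append((neighbor, new_total))
        frontier = next_frontier
    return best


# Hypothetical toy graph: one compound and four genes.
graph = {
    "cpd1": [("geneA", 0.2), ("geneB", 0.7)],
    "geneA": [("geneC", 0.3), ("geneB", 0.1)],
    "geneB": [("geneD", 0.5)],
}

distances = reachable(graph, "cpd1")
# Step 3: nodes that cannot be reached get a large distance (1.0 here).
distance_to_gene_d = distances.get("geneD", 1.0)
print(distances, distance_to_gene_d)
```

Here geneB is reached more confidently through geneA (0.2 + 0.1) than directly (0.7), and geneD falls outside the two-hop limit, so it receives the default distance.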
23. Novartis Institutes for BioMedical Research (NIBR)
What
Example: Analysis of compound activities
Query reachable nodes
MATCH (cpd:Compound)
WHERE any(nvs IN cpd.cpd_id WHERE nvs IN ['cpd1','cpd2',…])
WITH cpd
MATCH p = (cpd) -[r*1..2]-> (m)
WITH cpd, p, m, reduce(u=0.0,
r in relationships(p) | u+r.uncertainty
) as uncertainty
WHERE uncertainty < 0.9
RETURN
cpd.cpd_id as Compound_ID,
m.id as ID,
uncertainty as Distance
ORDER BY uncertainty
Matrix of compound – node “distances”
Sum of relationship uncertainty is used as distance from compound to node
Distance to unreachable node is set to 1.0
Compound_ID | Active | C582554 | C495901 | C495900
1 | 0 | 1.00 | 1.00 | 1.00
2 | 1 | 0.78 | 0.89 | 0.88
3 | 1 | 1.00 | 1.00 | 1.00
4 | 0 | 1.00 | 1.00 | 1.00
5 | 0 | 1.00 | 0.78 | 0.67
6 | 0 | 1.00 | 1.00 | 1.00
7 | 0 | 1.00 | 1.00 | 1.00
8 | 0 | 0.88 | 0.88 | 0.90
9 | 0 | 1.00 | 0.88 | 0.82
10 | 1 | 1.00 | 1.00 | 1.00
11 | 0 | 1.00 | 1.00 | 1.00
12 | 0 | 1.00 | 0.80 | 0.83
13 | 0 | 1.00 | 1.00 | 1.00
14 | 1 | 1.00 | 1.00 | 1.00
15 | 1 | 0.82 | 1.00 | 1.00
16 | 1 | 0.78 | 0.89 | 0.88
17 | 1 | 0.80 | 1.00 | 1.00
18 | 1 | 0.80 | 1.00 | 1.00
19 | 1 | 0.78 | 0.89 | 0.88
20 | 1 | 0.80 | 1.00 | 1.00
Result of recursive partitioning (decision tree)
(and one surrogate split with equivalent performance: 2 nodes of interest)
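How the query output turns into a distance matrix, and how a single split can separate actives from inactives, can be sketched as follows. This is a simplified stand-in and not the recursive partitioning used for the slide: it searches for one threshold on one node (a decision stump), the function names are hypothetical, and the rows are a small subset of the values in the matrix above.

```python
def distance_matrix(rows, compounds, unreachable=1.0):
    """Pivot (compound, node, distance) rows into {compound: {node: distance}}."""
    nodes = sorted({node for _, node, _ in rows})
    matrix = {c: {n: unreachable for n in nodes} for c in compounds}
    for cpd, node, dist in rows:
        # Keep the shortest distance if a node is reached by several paths.
        matrix[cpd][node] = min(matrix[cpd][node], dist)
    return nodes, matrix


def best_stump(matrix, nodes, active):
    """Find (errors, node, threshold) for the rule 'distance <= threshold means active'."""
    best = None
    for node in nodes:
        for threshold in sorted({matrix[c][node] for c in matrix}):
            errors = sum((matrix[c][node] <= threshold) != active[c]
                         for c in matrix)
            if best is None or errors < best[0]:
                best = (errors, node, threshold)
    return best


# Subset of the slide's matrix: only reachable (compound, node) pairs are listed.
rows = [
    (2, "C582554", 0.78), (2, "C495901", 0.89),
    (5, "C495901", 0.78),
    (8, "C582554", 0.88), (8, "C495901", 0.88),
    (15, "C582554", 0.82),
]
compounds = [1, 2, 5, 8, 15]
active = {1: False, 2: True, 5: False, 8: False, 15: True}

nodes, matrix = distance_matrix(rows, compounds)
result = best_stump(matrix, nodes, active)
print(result)
```

On this subset the stump picks node C582554, the node that the active compounds reach at a shorter distance than the inactive ones. A full decision tree would repeat this search on each resulting group.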
24. Novartis Institutes for BioMedical Research (NIBR)
What
Example: Analysis of compound activities
[Figure: network of Compound 1 through Compound 13 and the identified nodes]
Green: relationships derived from in-house data
Grey: relationships found from text mining
Only showing the active compounds and their connections to the identified nodes.
25. Novartis Institutes for BioMedical Research (NIBR)
[Figure: network of Compound 1 through Compound 13]
MATCH p = (g1:Gene) -[r*1..2 {datasource: 'metacore'}]-> (g2:Gene)
WHERE g2.gene_symbol in ['FOXO','MTOR']
and g1.gene_symbol in ['PRKAB1', 'PRKAA1','PRKAA2']
RETURN p, reduce(u=0.0, r in relationships(p) | u+r.uncertainty) as unc
ORDER BY unc LIMIT 20
26. Novartis Institutes for BioMedical Research (NIBR)
MATCH p = (g1:Gene) <-[:mentions]- (a:Article) -[:mentions]-> (g2:Gene)
WHERE g2.gene_symbol in ['FOXO','MTOR']
and g1.gene_symbol in ['PRKAB1', 'PRKAA1','PRKAA2']
RETURN p
MATCH p = (g1:Gene) -[r*1..2 {datasource: 'metacore'}]-> (g2:Gene)
WHERE g2.gene_symbol in ['FOXO','MTOR']
and g1.gene_symbol in ['PRKAB1', 'PRKAA1','PRKAA2']
RETURN p, reduce(u=0.0, r in relationships(p) | u+r.uncertainty) as unc
ORDER BY unc LIMIT 20
28. Novartis Institutes for BioMedical Research (NIBR)
Where (is this going)
• More tweaks to what we have
– Improvements to text mining
– Analysis of verbs (actions) / information extraction
– Monitor change over time (what is new “emerging knowledge”)
• Full text analysis
– Enable analysis and inclusion of internal documents
• Incorporate additional data sources
– Gene Expression data (tissue expression and perturbations)
– Mutations
– Proteomics
• Refining the “uncertainty” measure
– How best to compare uncertainties from different data sources
• Expand user base
• Automated updates
29. Novartis Institutes for BioMedical Research (NIBR)
Acknowledgments / References
• ISLD group
– John Davies
– Miguel Camargo
– Eugen Lounkine
– Elisabet Gregori-Puigjane
– Mark Bray
– Pierre Farmer
– Ansgar Schuffenhauer
• Text mining group
– Therese Vachon
– Pierre Parrisot
– Andrea Splendiani
– Fatima Oezdemir-Zaech
– Frederic Sutter
• CPC
– Sylvain Cottens
– Doug Auld
• DMP
– Jeremy Jenkins
– Ben Cornett
– Florian Nigsch
• NX
– Stephen Litster
Source References
• Protein information:
– Pfam: R.D. Finn et al. The Pfam protein families database: towards a more sustainable future, Nucleic Acids Research (2016) Database Issue 44:D279-D285
http://pfam.xfam.org/
– Uniprot: The UniProt Consortium, UniProt: a hub for protein information, Nucleic Acids Res. 43: D204-D212 (2015)
http://www.uniprot.org/
• Comparative Toxicogenomics database:
– Davis AP et al. The Comparative Toxicogenomics Database's 10th year anniversary: update 2015. Nucleic Acids Res. 2015 Jan;43 (Database issue): D914-20.
Curated chemical–gene data were retrieved from the Comparative Toxicogenomics Database (CTD), MDI Biological Laboratory, Salisbury Cove, Maine, and NC State University, Raleigh, North Carolina. World Wide Web (URL: http://ctdbase.org/). [May 2016].
• MetaCore
– Thomson Reuters LifeSciences
http://thomsonreuters.com/en/products-services/pharma-life-sciences/pharmaceutical-research/metacore.html
• Protein-Protein interaction data set:
– Center for Cancer Systems Biology (CCSB) at the Dana Farber Cancer Institute
http://ccsb.dfci.harvard.edu/
• Gene Ontology
– The Gene Ontology Consortium. Gene Ontology Consortium: going forward. (2015) Nucl Acids Res 43 Database issue D1049–D1056.
http://geneontology.org/
• Pathways
– Reactome pathway database:
A. Fabregat et al., The Reactome pathway Knowledgebase, Nucl. Acids Res. (04 January 2016) 44 (D1): D481-D487
D. Croft et al., The Reactome pathway knowledgebase, Nucl. Acids Res. (1 January 2014) 42 (D1): D472-D477
http://reactome.org/