In this lecture, I provide an overview of how computers can be instrumental in drug discovery efforts. Topics covered include: big data as a result of omics efforts; bioinformatics; cheminformatics; biological space; chemical space; and how computers, particularly machine learning (and data science), can be applied in the context of drug discovery.
A video of this lecture is also provided on the "Data Professor" YouTube channel available at http://bit.ly/dataprofessor
If you are fascinated by data science, it would mean the world to me if you would consider subscribing to this channel: http://bit.ly/dataprofessor
Computational Drug Discovery: Machine Learning for Making Sense of Big Data in Drug Discovery
1. Computational Drug Discovery
Associate Professor Dr. Chanin Nantasenamat
E-mail: chanin.nan@mahidol.edu
YouTube: http://bit.ly/dataprofessor
Machine Learning for Making Sense
of Big Data in Drug Discovery
2. About the Speaker
• Research group website at http://codes.bio
• Codes and Data at http://github.com/chaninn and http://github.com/chaninlab
• YouTube Channel called Data Professor
available at http://bit.ly/dataprofessor
• Data Professor FaceBook Page at
http://facebook.com/dataprofessor
Icon made by Freepik from www.flaticon.com
3. Disease
• The word ‘disease’ is defined by the Cambridge Dictionary as an
illness of people, animals, plants, etc., caused by infection or a
failure of health rather than by an accident
http://static.filmannex.com/users/galleries/294182/19265_fa_rszd.jpg
4. Drugs
• A ‘drug’ is a biological or
chemical entity that can
modulate the course of a
disease state by interacting
with its target protein
• Biological entity
(e.g. antibodies)
• Chemical entity
(e.g. small molecules)
Natthapon Ngamnithiporn. Image from FreePik.
http://www.freepik.com/free-photo/packings-of-pills-and-capsules-of-medicines_1178867.htm
6. Drug Discovery Process
• Costs ~2 billion USD
• Takes about 10-15 years
• Failure rate is > 90%
http://drugdiscovery.nd.edu/
7. Drug Discovery Process
Ashburn and Thor. Nature Rev. Drug Discov. 3 (2004) 673-683
Identify target
protein that is key in
modulating disease
Screen for ‘hit’
molecules that
can inhibit the
target protein
‘Hit-to-lead’
and ‘Lead
optimization’
Evaluate pharmacokinetic
properties
Initiate Clinical trials to evaluate
safety & dosage; efficacy & side effects;
adverse reaction to long-term use
Drug reaches
the market
9. Multi-objective optimization
• A drug must not only target the protein of interest but must
also possess other properties
• Desirable characteristics of a drug:
1. Binds selectively to the target protein
2. Absorbs in the stomach (oral drugs)
3. Permeates gut-wall or cell-wall (can reach target site)
4. Metabolically stable
5. Non-toxic
6. Can be synthesized
• To achieve all these desirable properties, the chemical structure
will need to be optimized (an optimal balance will need to be
achieved against many factors)
10. Creating new compounds
• We can look to nature for inspiration (biologically inspired)
or use existing drugs as starting point
• Medicinal chemists optimize existing compounds by modifying
them in a process known as bioisosteric replacement
(e.g. replacing a hydrogen atom with a halogen atom)
• Cheminformaticians can computationally enumerate a
compound library (compound enumeration) using the
rules of organic chemistry (taking chemical stability and
synthetic feasibility into account)
Icon made by dDara from www.flaticon.com
11. Molecules
• Molecules can be thought of as framework of atoms
(molecular graph) where atoms are vertices and bonds are
edges
- Each vertex is typically one of a small set of atoms (C, N, O, F, P, S, Cl or
Br)
- Each edge that links the vertices can be a single, double or triple bond
• Compound enumeration as performed by the research group of
JL Reymond (Acc Chem Res 2015, 48(3):722-730)
- Molecules of up to 13 atoms ⟶ 977 million possible molecules (10^9)
- Molecules of up to 17 atoms ⟶ 166 billion possible molecules (10^11)
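The molecular-graph view described above can be sketched in a few lines of Python. This is only an illustration, not a cheminformatics toolkit: the `Molecule` class and its method names are hypothetical, hydrogens are left implicit, and the example (the heavy-atom skeleton of acetic acid) was chosen purely for demonstration.

```python
from collections import defaultdict


class Molecule:
    """Hypothetical minimal molecular graph: atoms are vertices, bonds are edges."""

    def __init__(self):
        self.atoms = []                  # element symbol of each vertex
        self.bonds = defaultdict(dict)   # atom index -> {neighbor index: bond order}

    def add_atom(self, element):
        """Add a vertex and return its index."""
        self.atoms.append(element)
        return len(self.atoms) - 1

    def add_bond(self, i, j, order=1):
        """Add an undirected edge; order is 1 (single), 2 (double) or 3 (triple)."""
        if order not in (1, 2, 3):
            raise ValueError("bond order must be 1, 2 or 3")
        self.bonds[i][j] = order
        self.bonds[j][i] = order

    def bond_count(self):
        """Each edge is stored once per endpoint, so halve the total."""
        return sum(len(neighbors) for neighbors in self.bonds.values()) // 2

    def bond_order_sum(self, i):
        """Sum of bond orders around atom i (valence used by heavy-atom bonds)."""
        return sum(self.bonds[i].values())


# Heavy-atom skeleton of acetic acid: C-C(=O)-O
mol = Molecule()
c1 = mol.add_atom("C")
c2 = mol.add_atom("C")
o1 = mol.add_atom("O")
o2 = mol.add_atom("O")
mol.add_bond(c1, c2, order=1)
mol.add_bond(c2, o1, order=2)
mol.add_bond(c2, o2, order=1)

# 4 heavy atoms, 3 bonds, and the carboxyl carbon uses a valence of 4
print(len(mol.atoms), mol.bond_count(), mol.bond_order_sum(c2))
```

Compound enumeration can be thought of as systematically generating every such graph that satisfies valence and stability rules, which is why the counts grow so quickly with the number of atoms.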
12. Chemical space
• Theoretically possible chemical space as
revealed via compound enumeration by the
research group of JL Reymond (Acc Chem Res 2015,
48(3):722-730)
- Molecules of up to 13 atoms ⟶ 977 million
possible molecules (10^9)
- Molecules of up to 17 atoms ⟶ 166 billion
possible molecules (10^11)
• Drug space (<500 Da) is estimated to
comprise molecules of up to 40 atoms (in some
cases, even more) ⟶ roughly 10^60 molecules
14. Bioactivity
• Bioactivity is the activity elicited by the
target protein of interest
• Such target proteins are typically involved
in key pathways that influence the course
of a disease
• Thus, much attention has been devoted to
modulating these target proteins
• Primary literature
• Curated databases
- ChEMBL, BindingDB, MOAD, PubChem
• Open innovation
- Pharmaceutical companies are making data publicly
available for non-commercial diseases
15. What can computers do?
• Computers have defeated humans at chess (IBM Deep Blue)
and Jeopardy! (IBM Watson)
• Google released a self-driving car
• NASA uses computers to simulate space missions
• Computers are being used to design aircraft and cars
• Supermarkets and Shopping Malls are using our
purchase history to analyze and predict our spending
behavior
• Why not use it to discover, design and develop new
drugs?
• Computers (deep learning) can
paint like Van Gogh and Picasso
• Computers can programmatically
code music (Sonic Pi)
• Computers can dream
18. Why do we need computational
models in drug discovery?
• To discern structure-activity
relationship of chemical library
• In vitro data are limited,
expensive, time-consuming,
laborious, etc.
• Computational models can be
quickly built to preliminarily
predict the pharmacokinetics
and bioactivity of query
compounds
Anuwongcharoen et al. PeerJ 4 (2016) e1958
19. Questions that can be answered by
computational models
• What target proteins could my compound(s) bind
to and modulate?
• Would my compound bind unspecifically to other
proteins and thus have off-target activity?
• What type of compounds can bind and modulate
the bioactivity of the target protein of my interest?
• Are there similar compounds to my query
compound that may potentially exert similar
binding behavior?
• How does my compound bind to the protein
structure of its target?
• How can I modify the structure
of my compound to enhance
its pharmacokinetics and
bioactivity?
Hall et al. Prog Biophys Mol Biol 116 (2014) 82-91.
21. Overview of Computational Drug Discovery
[Figure: "Maximizing computational tools for successful drug discovery", a map of the following terms]
• Fields and approaches: Medicinal chemistry, Bioinformatics, Cheminformatics, Computational chemistry, Structure-based, Ligand-based, Fragment-based, Systems-based, Chemogenomics
• ADMET, QSAR, Pharmacophore, Statistical molecular design
• Molecular modeling, Protein structure prediction (homology/comparative, ab initio), Molecular dynamics, Normal mode analysis, Docking/reverse docking, Binding cavity analysis, Pharmacophore
• Protein–ligand interactome, Protein–protein interactome, Drug target gene expression, Intrinsically disordered proteins, Allo-network drugs
• High-throughput synthesis, High-throughput screening, Privileged structures, Bioisostere, Chemoisostere, Scaffold hopping
• Sequence alignment, BLAST, Phylogenetic analysis, Biological space
• Molecular descriptors, Chemical space, Profiling
• Filtering: Lipinski’s rule of 5
• Search: molecular similarity; substructure similarity; shape, volume and charge-based similarity
• Databases
- Small molecules: DrugBank, ChEMBL, PubChem, BindingDB, ZINC
- Proteins: PDB, UniProt, SCOP
- Protein-protein: MINT, STITCH, STRING
- Pathway: KEGG, Reactome
• Proteochemometrics, Computational chemogenomics, Graph/network theory
• Fragment-based docking, Fragment-based QSAR, Ligand growing
Nantasenamat and Prachayasittikul. Expert Opin Drug Discov 10 (2015) 321-329.
22. Bioinformatics
• Bioinformatics is a discipline entailing
the use of computational approaches to
analyze biological data
‣ Analyze and compare genes, proteins
and genomes
‣ Explore structures and functions of
biomolecules (DNA, protein, lipid and
carbohydrate)
‣ Explore network biology and metabolic
pathways
http://www.gettyimages.com/detail/photo/bioinformatics-background-concept-royalty-free-image/475811932?esource=SEO_GIS_CDN_Redirect
[Figure: protein structure with labeled residues I424, L428, F404, R394, E353, A350, D351, L354, P535, W383 and L525]
Suvannang et al. Manuscript under Preparation.
23. Cheminformatics
• Cheminformatics is a discipline at the
interface of chemistry and computers that
enables the analysis of various aspects
relevant to chemical structures
‣ Chemical space for investigating
molecular similarity/diversity
‣ Molecular descriptors (e.g. MW,
LogP, nHBdon, nHBacc) and
quantum chemical descriptors
(HOMO, LUMO, HOMO-LUMO gap)
Ertl and Rohde. J Cheminf 4 (2012) 12.
24. Drugs and their precursors
• Fragments - are one of many substructures found in a compound (drug)
• Privileged substructures - are substructures that are commonly found as
inhibitors/activators (drugs) against several therapeutic targets
• Hits - are a small subset of compounds from large chemical libraries that are
identified from high-throughput screening
• Leads - are compounds that have undergone minor structural optimization from
hits. From there, these leads often undergo further rounds of “lead optimization”
• Drugs - are the few leads that have passed rigorous tests (pre-clinical and
clinical trials) before reaching the market
25. Identifying hits
• So how does one go about
identifying hit compounds?
- High-throughput screening
(Experimental and computational)
- Find compounds similar to
known actives, since the bioactivity of
each compound is not an isolated point
(similar chemical structures tend to have
similar biological activity)
๏ About 30% of compounds similar to
known actives are themselves active
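The similarity search described above is often carried out by comparing binary molecular fingerprints with the Tanimoto coefficient. The following is a minimal sketch under stated assumptions: fingerprints are represented as sets of "on" bit indices, the compound names and bit sets are made up for illustration, and the 0.85 cut-off is a commonly used value adopted here as an assumption rather than a rule.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient: shared 'on' bits divided by all 'on' bits."""
    a, b = set(fp_a), set(fp_b)
    if not a and not b:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(a & b) / len(a | b)


def find_similar(query_fp, library, threshold=0.85):
    """Return (name, similarity) pairs at or above the threshold, best first."""
    scored = [(name, tanimoto(query_fp, fp)) for name, fp in library.items()]
    hits = [(name, score) for name, score in scored if score >= threshold]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


# Hypothetical fingerprints (sets of 'on' bit positions), not real compounds
known_active = {1, 2, 3, 4, 5, 6, 7}
library = {
    "cmpd_A": {1, 2, 3, 4, 5, 6, 7, 8},   # 7 shared of 8 total
    "cmpd_B": {1, 2, 3, 9, 10},           # 3 shared of 9 total
    "cmpd_C": {1, 2, 3, 4, 5, 6, 7},      # identical bit pattern
}

hits = find_similar(known_active, library)
print(hits)
```

In practice the fingerprints would come from a cheminformatics toolkit, and, as the 30% figure above suggests, compounds passing the similarity filter still need experimental confirmation.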
https://southernresearch.org/news/nih-contract-high-throughput-screening-for-zika/
Hernandez-Santoyo et al. Protein-protein and protein-ligand docking. DOI:10.5772/56376
Martin YC, J Med Chem 2002, 45(19):4350-4358
26. Lead generation (Hit-to-Lead)
• Hits identified from high-throughput
screens are transformed into leads by
means of limited structural modification
(so as to optimize their ADMET properties)
• Generated leads are subjected to further
rounds of lead optimization
Fuller N et al. Drug Discov Today 2016, 21(8):1272-1283.
27. Fragment-based Drug Design
Source: http://practicalfragments.blogspot.com/2011/08/first-fragment-based-drug-approved.html
Zelboraf treats melanoma by inhibiting BRAF.
29. Lipinski’s Rule of 5
• Christopher Lipinski (Pfizer) analyzed a large set of > 2,000 orally active
drugs, which led to what is known as Lipinski’s Rule of 5, a set of
rules defining the drug-likeness of small molecules
‣ Molecular weight ≤ 500 Da
‣ Lipophilicity (LogP) ≤ 5
‣ Hydrogen bond donors ≤ 5
‣ Hydrogen bond acceptors ≤ 10
[Figure: four panels (a-d)]
Lipinski et al. Adv Drug Deliv Rev 23 (1997) 3-25
Suvannang et al. (2017) Unpublished results
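As an illustration of how the Rule of 5 can be applied as a filter, the sketch below counts rule violations from precomputed descriptors. The `Descriptors` class, the function names and the two example entries are hypothetical (the numbers are not measurements of real compounds), and each threshold is treated as inclusive, so a violation is counted only when a value exceeds the limit.

```python
from dataclasses import dataclass


@dataclass
class Descriptors:
    """Hypothetical container for precomputed molecular descriptors."""
    mol_weight: float   # molecular weight in Da
    logp: float         # lipophilicity
    hbd: int            # hydrogen bond donors
    hba: int            # hydrogen bond acceptors


def ro5_violations(d):
    """Count how many of the four Rule of 5 limits are exceeded."""
    checks = [
        d.mol_weight > 500,
        d.logp > 5,
        d.hbd > 5,
        d.hba > 10,
    ]
    return sum(checks)


def is_drug_like(d, max_violations=0):
    """Strict filter by default; max_violations is an adjustable assumption."""
    return ro5_violations(d) <= max_violations


# Made-up descriptor values for two hypothetical compounds
small_polar = Descriptors(mol_weight=180.2, logp=1.2, hbd=1, hba=4)
large_greasy = Descriptors(mol_weight=812.0, logp=6.3, hbd=7, hba=14)

print(ro5_violations(small_polar), is_drug_like(small_polar))
print(ro5_violations(large_greasy), is_drug_like(large_greasy))
```

The descriptor values themselves would normally be calculated from the chemical structure by a cheminformatics package; that step is outside the scope of this sketch.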
30. Lead-like Rule of 3
• In drug discovery, lipophilicity and molecular weight tend to
increase as lead optimization progresses so as to improve the
drug’s affinity and selectivity; leads should therefore start out
smaller and less lipophilic than the final drug
‣ Molecular weight < 300 Da
‣ Lipophilicity (LogP) < 3
‣ Hydrogen bond donors < 3
‣ Hydrogen bond acceptors < 3
‣ Rotatable bonds < 3
31. Chemical space
• Chemical space can be generally defined as
the universe of synthetically feasible small
molecules of <500 Da, which is estimated to
be on the order of ~10^60 molecules
• Visualizing it gives us a bird’s-eye
view of the relative diversity/likeness
of chemical libraries
• The Reymond group at the University of Bern,
Switzerland developed a computational
algorithm that enumerates all possible chemical
structures that can be built from up to 17 heavy
atoms; the resulting GDB-17 database contains
166.4 billion molecules
Reymond and Awale. ACS Chem Neurosci 3 (2012) 649-657.
32. Biological space
• Biological space refers to the chemical
space of druggable protein families
‣ ADMET
‣ Aminergic/Lipophilic GPCR space
‣ Kinase space
‣ Protease space
‣ CYP450
‣ Nuclear receptors
Petit-Zeman S. http://www.nature.com/horizon/chemicalspace/background/figs/explore_b1.html
33. Fragment space
• Fragment space can be defined as
the universe or collection of all possible
molecular fragments (substructures)
• Fragments are < 300 Da
• Utilization of the fragment space has
been suggested to allow more diverse
exploration of the possible chemical
space
• Reymond group also extracted 10
million fragments from the GDB-17
https://software.zbh.uni-hamburg.de/assets/softwareserverslide6-a0e42ecb3651120926821932574540d5b2e83ff0209654f9ab14804c7858451a.png
Virshup et al. J Am Chem Soc 135 (2013) 7296-7303
34. Structural classification of natural products (SCONP)
Koch et al. PNAS 102 (2005) 17272-17277
36. Polypharmacology
• There is a paradigm shift from ‘one
drug-one target’ to ‘one drug-
multiple targets’
• Unintended off-target binding may elicit
undesirable side effects and adverse
effects
• Desirable off-target binding gives you
drug repositioning opportunities
• Knowledge of polypharmacology may aid
in the design of multi-targeted drugs
Reddy and Zhang. Expert Rev Clin Pharmacol 6 (2013) 41-47
Kinase targets of Staurosporine
37. Drug repositioning/repurposing
• There is a need to discover new drugs,
especially for the treatment of rare
and neglected diseases
• Drug repositioning/repurposing is a
lucrative approach as it tests existing
FDA-approved drugs against various
other whole-cell and target assays
Wu et al. Mol BioSyst 9 (2013) 1268-1281.
38. What is QSAR? (1)
• QSAR/QSPR is the
acronym for Quantitative
Structure-Activity/Property
Relationship
• QSAR seeks to correlate
structural features of
compounds with their
biological activities
[Figure: scatter plot of predicted activity (pIC50) versus experimental activity (pIC50); both axes span 5.0 to 8.0]
39. What is QSAR? (2)
• Structure governs activity/
property
• Typically in the medicinal
chemistry literature, the effects
of substituent groups on
activity are extensively studied
[Figure: structure with positions labeled 1 to 6]
• QSAR/QSPR studies exploit this knowledge for modeling the
biological or chemical activities/properties
40. What is QSAR? (3)
• QSAR involves three main concepts:
1. Selecting a biological activity or chemical property of interest
2. Generating the physicochemical description
3. Predicting the biological activity or chemical property
Molecular descriptors (from computational chemistry) and biological activity:
Qm      Energy    μ       HOMO      LUMO      HOMO-LUMO gap  IC50
0.2271  -309.834  1.0521  -0.21346  -0.0127   0.20076        0.05
0.2142  -195.31   0.2337  -0.22611  -0.01915  0.20696        1.50
[Figure: compounds of interest ⟶ computational chemistry ⟶ molecular descriptors ⟶ machine learning ⟶ predict biological activity]
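The three QSAR steps listed above (choose an activity, describe the compounds, predict) can be illustrated with a deliberately tiny model. In this sketch a single-descriptor ordinary least-squares line stands in for the machine learning step; the descriptor and pIC50 values are made up, and the function names are hypothetical. Real QSAR models use many descriptors and are judged on held-out data, not on the training fit shown here.

```python
import math


def fit_line(x, y):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(x, slope, intercept):
    """Apply the fitted line to a list of descriptor values."""
    return [slope * xi + intercept for xi in x]


def r_squared(y, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - pi) ** 2 for yi, pi in zip(y, y_pred))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


def rmse(y, y_pred):
    """Root mean squared error between observed and predicted values."""
    return math.sqrt(sum((yi - pi) ** 2 for yi, pi in zip(y, y_pred)) / len(y))


# Step 1: activity of interest (hypothetical pIC50 values)
pic50 = [7.4, 7.0, 6.3, 6.1, 5.5]
# Step 2: one descriptor per compound (hypothetical HOMO-LUMO gaps)
gap = [0.195, 0.200, 0.205, 0.210, 0.215]

# Step 3: build the model, then predict a new compound
slope, intercept = fit_line(gap, pic50)
fitted = predict(gap, slope, intercept)
print(f"R2 = {r_squared(pic50, fitted):.3f}, RMSE = {rmse(pic50, fitted):.3f}")
print(predict([0.202], slope, intercept))
```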
41. Growth of QSAR?
• A search in
SCOPUS
shows the
growing trend
of QSAR
publications
42. A typical QSAR workflow
Data set preparation
• ChEMBL 23: bioactivity data of ERα inhibitors ⟶ initial data set (10,666 bioactivity data for 5,809 compounds)
• Select bioactivity measured by IC50 ⟶ IC50 subset (3,527 compounds)
• Select entries with CONFIDENCE_SCORE=9 and assay_type=B ⟶ selected data set (1,346 compounds)
• Delete entries with < or > signs and those with [...]; delete entries with missing SMILES notation; remove duplicate SMILES ⟶ final data set (1,299 compounds)
QSAR modeling
• Final data set: salt removal, tautomer standardization, transform IC50 to pIC50
• Descriptor calculation: 12 sets of PaDEL fingerprints
• Feature selection: remove collinear descriptors
• Data splitting: 70/30 split ratio, perform 10 data splits
• QSAR model ⟶ predicted pIC50 values
• Evaluate performance: R², Q², Rm², RMSE
• Y-scrambling for evaluating chance correlation
• Mechanistic interpretation of feature importance
Suvannang et al. RSC Adv 2018, 8: 11344-11356
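Several of the curation and splitting steps in this workflow are simple enough to sketch directly. The code below is an illustration under assumptions, not the published pipeline: the record keys (`smiles`, `relation`, `ic50_nm`) are hypothetical and do not mirror ChEMBL column names, duplicates are detected by exact SMILES string match with the first record kept, and IC50 values are assumed to be in nM.

```python
import math
import random


def ic50_nm_to_pic50(ic50_nm):
    """pIC50 = -log10(IC50 in mol/L); with IC50 in nM this is 9 - log10(IC50)."""
    return 9.0 - math.log10(ic50_nm)


def curate(records):
    """Drop censored values, missing SMILES and duplicate SMILES; add pIC50."""
    kept, seen = [], set()
    for rec in records:
        if rec["relation"] != "=":      # entries reported with < or > signs
            continue
        if not rec["smiles"]:           # missing SMILES notation
            continue
        if rec["smiles"] in seen:       # duplicate SMILES (first one is kept)
            continue
        seen.add(rec["smiles"])
        kept.append({**rec, "pic50": ic50_nm_to_pic50(rec["ic50_nm"])})
    return kept


def split_70_30(items, seed):
    """Shuffle reproducibly and split into 70% training and 30% test."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = round(0.7 * len(shuffled))
    return shuffled[:cut], shuffled[cut:]


# Made-up records for illustration only
records = [
    {"smiles": "CCO", "relation": "=", "ic50_nm": 1000.0},
    {"smiles": "CCO", "relation": "=", "ic50_nm": 1200.0},
    {"smiles": "CCN", "relation": ">", "ic50_nm": 10000.0},
    {"smiles": "", "relation": "=", "ic50_nm": 50.0},
    {"smiles": "c1ccccc1", "relation": "=", "ic50_nm": 10.0},
]

curated = curate(records)
# Ten independent splits, one per random seed
splits = [split_70_30(curated, seed) for seed in range(10)]
print(len(curated), [round(r["pic50"], 2) for r in curated])
```

Repeating the split with different seeds, as in the workflow above, gives a spread of performance values rather than a single number that might depend on one lucky partition.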
43. Applications of QSAR/QSPR models
• Regulatory Use: QSAR for modelling environmental
toxicity/chemical hazards by EPA and OECD
• Drug Design: QSAR for modelling biological activities
• Materials Design: QSPR for modelling chemical
properties
47. Summary
• QSAR models allow us to understand how changes to the
chromophore structure lead to GFP color change
• PCM models allow us to understand how changes to
chromophore structure, changes to protein structure and the
chromophore-protein interaction influence GFP color
change
• Insights from the predictive models could be used in further
extending the spectral repertoire of GFP
Nantasenamat C et al. J. Comput. Chem. 35(27): 1951-1966.
48. Proteochemometrics
• Proteochemometrics was developed by Maris Lapins and Jarl Wikberg of
Uppsala University in 2001
• Advantages
• Can explain ligand-target affinity by providing detailed maps down to
the substructures and amino acid level
• Can be used to rationalise why a ligand is active toward one target and
not on the other related target
• Has been shown to be useful for Drug Repositioning
• Could be used for Personalized Medicine
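One common way to let a proteochemometric model "see" both partners is to describe each ligand-target pair by ligand descriptors, protein descriptors and their pairwise products (cross-terms). The sketch below shows only that feature construction; the function name and the descriptor values are hypothetical, and the choice of plain products as cross-terms is an assumption made for illustration.

```python
def pcm_features(ligand_desc, protein_desc):
    """Concatenate ligand, protein and ligand x protein cross-term descriptors."""
    cross_terms = [l * p for l in ligand_desc for p in protein_desc]
    return list(ligand_desc) + list(protein_desc) + cross_terms


# Made-up descriptor vectors for one ligand and one target protein
ligand = [0.5, 2.0]
protein = [1.0, -1.0, 3.0]

features = pcm_features(ligand, protein)
print(len(features), features)
```

Because each row describes a pair rather than a single compound, one model can be trained across several related targets, which is what makes the selectivity and repositioning questions above addressable.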
49. Conclusion (1)
• It is without a doubt that the QSAR paradigm boasts much benefit for the rational design
of robust compounds
• Nevertheless, there are certain shortcomings that may limit the widespread application
of QSAR
• Workflow of QSAR model development
• High dimensionality of the input space
• Representation of the molecular structure
• Interpretability and meaning of the developed QSAR models
• Presence of outliers or activity cliffs
• Validation of QSAR model performance
• Applicability in real-world setting
50. Conclusion (2)
• In spite of certain inherent flaws, the QSAR paradigm is inevitably
one of the most useful forces contributing to the rapid
development of drug discovery and design.
• As with all technologies, QSAR is not perfect; however, its
weaknesses and flaws are continuously being identified and
addressed, helping to shape an improved and more robust
approach with ever smaller predictive error
• To help realize the goal of developing an intuitive approach
toward the development of robust QSAR models, our
laboratory has developed software that affords semi-automated,
if not fully automated, QSAR modeling.
51. Conclusion (3)
• After more than 10 years of QSAR research, we can say that the
demise of QSAR is a myth, provided it is done properly, and that
we have only scratched the surface of its full potential
• QSAR is continuously evolving…starting from 2D-QSAR to
8D-QSAR!
• Proteochemometrics (so to say Multi-Target QSAR) enables
us to take advantage of the explosion of Omics data
53. BioCurator
Nantasenamat et al. Manuscript under preparation.
• We have developed a web application that allows users to upload
ChEMBL bioactivity data for automatic data curation
Protocol
• The web app selects a
subset of IC50/Ki data
• Removes redundant
compounds if bioactivity
values exceed 2 SD
• Removes data with < or >
symbols in the bioactivity
label
• Removes redundant
compounds based on
SMILES notation
54. osFP
Simeon et al. J Cheminf 8 (2016) 72.
Protocol
• The web app accepts the
input peptide sequence
and computes amino acid
composition descriptors
• Applies the constructed
predictive model to predict
the class label of the query
peptide
• Predicted class label is
relayed into the Results
output
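The descriptor step of this protocol, amino acid composition, is the fraction of each of the 20 standard residues in the input sequence. The sketch below shows that calculation only; the function name and the example sequence are hypothetical, and the trained predictive model that consumes these descriptors is not reproduced here.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues, one-letter codes


def amino_acid_composition(sequence):
    """Return the fraction of each standard amino acid in the sequence."""
    seq = sequence.strip().upper()
    if not seq:
        raise ValueError("empty sequence")
    unknown = set(seq) - set(AMINO_ACIDS)
    if unknown:
        raise ValueError(f"non-standard residues: {sorted(unknown)}")
    return {aa: seq.count(aa) / len(seq) for aa in AMINO_ACIDS}


# Arbitrary short example sequence
composition = amino_acid_composition("ACDKK")
print({aa: frac for aa, frac in composition.items() if frac > 0})
```

The result is a fixed-length vector of 20 numbers regardless of peptide length, which is what allows sequences of different sizes to be fed to the same classifier.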
osFP: a web server for predicting the oligomeric states of fluorescent proteins (DOI 10.1186/s13321-016-0185-8)
55. HemoPred
Win et al. Future Med Chem 9 (2017) 275-291.
Protocol
• The web app accepts the
input peptide sequence
and computes amino acid
composition descriptors
• Applies the constructed
predictive model to predict
the class label of the query
peptide
• Predicted class label is
relayed into the Results
output
HemoPred: a web server for predicting the hemolytic activity of peptides
56. CryoProtect
Pratiwi et al. J Chem 2017 (2017) 9861752.
Protocol
• The web app accepts the
input peptide sequence
and computes amino acid
composition descriptors
• Applies the constructed
predictive model to predict
the class label of the query
peptide
• Predicted class label is
relayed into the Results
output
CryoProtect: A Web Server for Classifying Antifreeze Proteins from Nonantifreeze Proteins
Reny Pratiwi, Aijaz Ahmad Malik, Nalini Schaduangrat, Virapong Prachayasittikul, Jarl E. S. Wikberg, Chanin Nantasenamat and Watshara Shoombuatong
Journal of Chemistry, Volume 2017, Article ID 9861752, 15 pages
https://doi.org/10.1155/2017/9861752
57. How to get started in CDD?
• Hardware
• Laptop
• Desktop
• High-performance
computer
• Compute clusters
• Cloud computing
• Software
• Commercial
• Free
• Programming
• C, Java, etc.
• R, Python,
MATLAB, etc.
58. Computational Drug Discovery
based on Open Source
• Data source
◦ Bioactivity data: ChEMBL,
PubChem, BindingDB
◦ Chemical database: ZINC,
ChemSpider, GDB-17
◦ Biological database: PDB, UniProt
• Data curation and pre-processing
◦ BioCurator (developed in-house)
◦ Babel
• Descriptor calculation
◦ Rcpi, PyDPI, CDK, PaDEL
• Multivariate analysis
◦ R: caret
◦ Python: scikit-learn
• Plots
◦ R: ggplot2
◦ Python: Matplotlib, seaborn
• Molecular modeling
◦ Avogadro
◦ PyMOL
◦ Chimera
◦ VMD
• Molecular docking
◦ AutoDock
• Molecular dynamics
◦ GROMACS
◦ NAMD