Olexandr Isayev, Ph.D.
University of North Carolina at Chapel Hill
Twitter @olexandr http://olexandrisayev.com
Outline
o “Big Data” in chemistry world
o Sources
o Challenges
o Our vision for GPU accelerated cheminformatics workflow
o Benchmarks & Case studies
o Descriptor calculations
o Similarity
o Predictive modeling
Data – Knowledge Gap
Drowning in Data but starving for Knowledge
Tremendous opportunities for discovery of new drugs / materials
D.Fourches. Cheminformatics at the crossroads of eras. In Book: Applications of Computational Techniques in Pharmacy and
Medicine, Springer. Available 04/2014.
* Polishchuk, Madzhidov, Varnek. J Comput Aided Mol Des. 2013, 27(8):675-9.
10^60 - 10^100 chemicals
10^33 drug-like chemicals*
10^8 compounds in PubChem
10^6 compounds in ChEMBL with ≥ 1 known bioactivity
Scannell et al. Nature Reviews Drug Discovery, 2012, 11, 191-200
Decline in Pharmaceutical R&D efficiency
The cost of developing a new
drug (~$2B) roughly doubles
every nine years.
Need for novel approaches that:
(i) Fully exploit the potential of modern chemical and biological data streams;
(ii) Reliably forecast compounds’ bioactivity and safety profiles;
(iii) Accelerate the translation from basic research to drug candidates
Quantitative Structure Activity Relationships (QSAR)
[Figure: compounds, their descriptors, and their measured activities]
Thousands of molecular descriptors
are available for organic compounds:
constitutional, topological, structural,
quantum mechanics based, fragmental, steric,
pharmacophoric, geometrical,
thermodynamic, conformational, etc.
- Building of models
using machine learning
methods (NN, SVM, RF)
- Validation of models
according to numerous
statistical procedures, and
their applicability domains.
Descriptor matrix
Rows: samples (compounds). Columns: features (descriptors). Each compound i has an ACTIVITY (i).

      X1    X2    ...   Xm
1     X11   X12   ...   X1m
2     X21   X22   ...   X2m
...   ...   ...   ...   ...
n     Xn1   Xn2   ...   Xnm
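A minimal sketch of the descriptor matrix layout, using plain Python lists. The descriptor names, descriptor values, and activity values are made up for illustration and do not come from the slides.

```python
# Hypothetical descriptor names and values, for illustration only.
descriptor_names = ["MolWt", "logP", "n_rings"]

# One row per compound (sample), one column per descriptor (feature).
X = [
    [180.2, 1.2, 1],
    [151.2, 0.5, 1],
    [206.3, 3.5, 2],
]

# One activity value per compound, aligned with the rows of X.
y = [0.61, 0.38, -0.22]

n, m = len(X), len(X[0])


def column(matrix, j):
    """Return descriptor j for every compound."""
    return [row[j] for row in matrix]


print(n, "compounds x", m, "descriptors")
print(descriptor_names[1], column(X, 1))
```

A modeling method (NN, SVM, RF) would then be fitted to map each row of X to the matching entry of y.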
External predictive power of QSAR models is critical
to enable their application to virtual screening.
It is technically challenging to compute molecular
properties and descriptors for more than 10^9 compounds.
No cheminformatics architecture is able to screen >10^9
compounds.
Virtual Screening to identify potential hits
~10^6 – 10^7 molecules
VIRTUAL SCREENING
- Empirical Rules/Filters
- Similarity Search
- ML or QSAR Models
- Structure-based Models
- Consensus QSAR
~10^2 – 10^3 molecules (Potential Hits)
Candidate molecules
Polypharmacology & Biological profiles
Our vision for next-gen
cheminformatics platforms
Add GPUs: Accelerate Science Applications (GPU + CPU)
Small Changes, Big Speed-up
Application Code
- GPU: Use GPU to Parallelize Compute-Intensive Functions
- CPU: Rest of Sequential CPU Code
Courtesy of NVIDIA
Data Processing
Chemical Library (text file), ~300 bytes/cmpd
- Data parsing from SMILES
- 2D structure generation
- Automatic curation
High throughput descriptor generator
Mol weight, logP, Rule of 5, Daylight fingerprints
30-50M/hr
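As an illustration of one of the descriptors listed above, the sketch below checks Lipinski's Rule of 5 from precomputed values. It assumes the usual thresholds (molecular weight over 500, logP over 5, more than 5 hydrogen-bond donors, more than 10 acceptors) and the common convention of tolerating one violation; the function names and the example values are hypothetical, and this is not the platform's own implementation.

```python
def rule_of_five_violations(mol_wt, logp, h_donors, h_acceptors):
    """Count Rule of 5 violations from precomputed descriptor values."""
    checks = [
        mol_wt > 500,       # molecular weight over 500 Da
        logp > 5,           # logP over 5
        h_donors > 5,       # more than 5 hydrogen-bond donors
        h_acceptors > 10,   # more than 10 hydrogen-bond acceptors
    ]
    return sum(checks)


def passes_rule_of_five(mol_wt, logp, h_donors, h_acceptors, max_violations=1):
    """Assumed convention: a compound passes with at most one violation."""
    count = rule_of_five_violations(mol_wt, logp, h_donors, h_acceptors)
    return count <= max_violations


# Made-up descriptor values for two hypothetical compounds.
small = dict(mol_wt=250.3, logp=0.9, h_donors=2, h_acceptors=4)
large = dict(mol_wt=812.0, logp=6.3, h_donors=7, h_acceptors=12)

print(passes_rule_of_five(**small))
print(passes_rule_of_five(**large))
```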
Similarity
Search
Indexed, fully searchable,
accessible via high level API, e.g.,
(MolWt > 150) & (logP == 3)
Access in chunks or streaming
Interactive
analytics
with IPython
GPU accelerated
similarity search
177M/s on K40
GPUsim
GPUdup
GPUdiv
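The query style shown on this slide, (MolWt > 150) & (logP == 3), combined with chunked access, could look roughly like the sketch below. The record layout, the `iter_chunks` and `query` helpers, and the compound IDs are assumptions made for illustration; the slides do not show the platform's actual API.

```python
from itertools import islice


def iter_chunks(records, chunk_size):
    """Yield lists of at most chunk_size records from any iterable."""
    it = iter(records)
    while True:
        chunk = list(islice(it, chunk_size))
        if not chunk:
            return
        yield chunk


def query(records, predicate, chunk_size=2):
    """Stream records chunk by chunk and yield those matching predicate."""
    for chunk in iter_chunks(records, chunk_size):
        for rec in chunk:
            if predicate(rec):
                yield rec


# Made-up records standing in for an indexed compound store.
library = [
    {"id": "cmpd-1", "MolWt": 120.1, "logP": 3},
    {"id": "cmpd-2", "MolWt": 250.3, "logP": 3},
    {"id": "cmpd-3", "MolWt": 310.4, "logP": 2},
    {"id": "cmpd-4", "MolWt": 199.2, "logP": 3},
]

# Scalar form of (MolWt > 150) & (logP == 3).
hits = list(query(library, lambda r: (r["MolWt"] > 150) and (r["logP"] == 3)))
print([r["id"] for r in hits])
```

Because only one chunk is held in memory at a time, the same loop works whether the source is a list, a file reader, or a stream.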
Rapid screening of extremely large libraries with
multiple molecular probes and QSAR/QSPR models
Predictive
Modeling
(QSAR/QSPR)
GPUrf
GPUdnn
…
CudaTree@GitHub Wrapper
Deep Learning based on Theano
Chemical Datasets
Largest publicly available virtual libraries
GDB-13 (1): 955 M compounds
GDB-13-ABCDE subset: 141 M
GDB-17 (2) subset (random sample): 50 M
1 Blum and Reymond, 2009, J Am Chem Soc, 131, 8732–8733
2 Ruddigkeit et al., 2012, J Chem Inf Model, 52, 2864-2875
GPU - Case Study 1
Fast Computation of Molecular Properties for
Extremely Large Chemical Libraries
Our GPU-accelerated cheminformatics platform is able to compute
key molecular properties for GDB-13 (855M), GDB-13-ABCDE
(141M), and a subset of GDB-17 (50M) compounds.
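A back-of-the-envelope estimate of what these library sizes imply, assuming the 30-50M compounds/hr descriptor-generation rate and the ~300 bytes/cmpd figure quoted earlier in the deck apply uniformly. The slides do not state the actual run times, so the numbers produced here are estimates only.

```python
RATE_LOW, RATE_HIGH = 30e6, 50e6   # compounds per hour (quoted range)
BYTES_PER_COMPOUND = 300           # approximate text size per compound

# Library sizes as given on this slide.
libraries = {
    "GDB-13": 855e6,
    "GDB-13-ABCDE": 141e6,
    "GDB-17 subset": 50e6,
}


def hours_range(n_compounds):
    """Return (best case, worst case) processing time in hours."""
    return n_compounds / RATE_HIGH, n_compounds / RATE_LOW


def storage_gb(n_compounds):
    """Return the approximate input size in gigabytes (10^9 bytes)."""
    return n_compounds * BYTES_PER_COMPOUND / 1e9


for name, n in libraries.items():
    fast, slow = hours_range(n)
    print(f"{name}: {fast:.1f} to {slow:.1f} h, about {storage_gb(n):.1f} GB")
```

Under these assumptions, the 855M-compound set would need roughly 17 to 29 hours of descriptor generation.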
Similarity Search
From J. Bajorath, SSS Cheminformatics, Obernai 2008
Similarity searching using fingerprint representations of molecules is one of the
most widely used approaches for chemical database mining: it assumes that
similar compounds possess similar biological activities.
Tanimoto Coefficient
● Tanimoto similarity needs the number of 1s in a
binary representation of the data (popcount)
● CUDA includes device intrinsics to accomplish this:
__popc() for 32-bit data and __popcll() for 64-bit data.
● We used __popcll() in our implementation
● We break 1024-bit fingerprints into 64-bit chunks
● Bit counts are accumulated over the chunks to give the similarity
Implementation
__popc() and __popcll() instructions
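// Tanimoto similarity of two fingerprints stored as arrays of 64-bit words.
__device__ double similarity(const long long *query,
                             const long long *target, int data_len) {
    int a = 0, b = 0, c = 0;  // bits set in query, in target, and in both
    for (int i = 0; i < data_len; i++) {
        a += __popcll(query[i]);
        b += __popcll(target[i]);
        c += __popcll(query[i] & target[i]);
    }
    int u = a + b - c;  // bits set in either fingerprint
    // Two empty fingerprints would give 0/0; return 0 instead of NaN.
    return u > 0 ? (double) c / u : 0.0;
}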
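A CPU reference for the same computation can be handy for checking kernel results on small inputs. The sketch below mirrors the chunked popcount approach in plain Python; the helper names and the two example fingerprints are made up, and returning 0.0 for two empty fingerprints is a choice made in this sketch.

```python
MASK64 = (1 << 64) - 1


def to_chunks(fingerprint, n_bits=1024):
    """Split an integer fingerprint into 64-bit chunks, lowest bits first."""
    return [(fingerprint >> shift) & MASK64 for shift in range(0, n_bits, 64)]


def popcount(x):
    """Number of 1 bits in a non-negative integer."""
    return bin(x).count("1")


def tanimoto(query, target):
    """Tanimoto coefficient c / (a + b - c) accumulated over chunks."""
    a = b = c = 0
    for q, t in zip(query, target):
        a += popcount(q)
        b += popcount(t)
        c += popcount(q & t)
    union = a + b - c
    return c / union if union else 0.0


# Two made-up fingerprints, defined by the positions of their set bits.
fp1 = sum(1 << i for i in (0, 5, 64, 700, 1023))
fp2 = sum(1 << i for i in (0, 5, 64, 701))

# a = 5, b = 4, c = 3, so the coefficient is 3 / (5 + 4 - 3) = 0.5.
print(tanimoto(to_chunks(fp1), to_chunks(fp2)))
```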
Some GPU / CUDA code
Benchmarks
GPU - Case Study
Virtual Screening of Very Large Chemical
Libraries to Identify Bioactive Compounds
Lacosamide
- Lacosamide (trade name Vimpat) is
an anticonvulsant drug used to
prevent seizures for patients treated
for epilepsy;
- Functionalized amino acid;
- Many active analogues have been
synthesized in Prof. Harold Kohn’s
laboratory* at UNC-CH.
*Wang et al., 2011, ACS Chem Neurosci, 2, 90–106
[Figure: structures of lacosamide Analogs 1-5]
200M compound subset of GDB-13/17
Similarity search using Lacosamide as molecular probe

Compound ID        Tanimoto Ts
Analog 2           0.997
Analog 3           0.995
Analog 1           0.994
Analog 4           0.992
Analog 5           0.978
Gdb13-a10573585    0.977
Gdb13-b28137563    0.977
Gdb13-a36264983    0.976
Gdb13-a36264952    0.976
Gdb13-a10616005    0.976
Gdb13-a3011053     0.976
Gdb13-b21242261    0.976
Gdb17-44140083     0.976
Gdb13-a30878321    0.975
Gdb13-b3485216     0.975
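A ranked hit list like the one above can be produced by scoring every library compound against the probe and keeping the top k. The sketch below does this on the CPU with `heapq.nlargest`; the tiny 8-bit fingerprints and the compound IDs are made up for illustration and are unrelated to the real GDB entries.

```python
import heapq


def popcount(x):
    """Number of 1 bits in a non-negative integer."""
    return bin(x).count("1")


def tanimoto(q, t):
    """Tanimoto coefficient of two integer fingerprints."""
    union = popcount(q | t)
    return popcount(q & t) / union if union else 0.0


def top_hits(probe, library, k):
    """Return the k (score, compound_id) pairs with the highest similarity."""
    scored = ((tanimoto(probe, fp), cid) for cid, fp in library)
    return heapq.nlargest(k, scored)


# Hypothetical probe and library with 8-bit fingerprints.
probe = 0b11110000
library = [
    ("cmpd-A", 0b11110000),
    ("cmpd-B", 0b11100000),
    ("cmpd-C", 0b00001111),
    ("cmpd-D", 0b11110001),
]

for score, cid in top_hits(probe, library, k=3):
    print(f"{cid}  {score:.3f}")
```

`heapq.nlargest` keeps only k candidates in memory, which matters when the scored stream covers hundreds of millions of compounds.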
The GPU-accelerated screening platform was able to retrieve:
- known active analogues of lacosamide,
- several functionalized amino acids present in GDB-13,
- a novel compound (Gdb17-44140083) fully matching the
pharmacophore of lacosamide.
DeepLearning - Case Study
Large scale QSPR prediction of bioactivity
- 103K molecules: build model with Deep Learning
- Model accuracy 97%
- 200 M molecules: rapid screening of potential candidates
Deep Neural Net
- 2 Hidden Layers
- Rectified Linear Unit (ReLU)
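To make the architecture concrete, here is a minimal forward pass through a network with two ReLU hidden layers, written in plain Python rather than Theano. The layer sizes, the weights, and the sigmoid output unit are assumptions chosen for illustration; the slides do not give the actual model parameters.

```python
import math


def relu(v):
    """Rectified Linear Unit applied element-wise."""
    return [max(0.0, x) for x in v]


def dense(v, weights, biases):
    """Fully connected layer; weights[j] holds the inputs of output unit j."""
    return [sum(w * x for w, x in zip(row, v)) + b
            for row, b in zip(weights, biases)]


def predict(x, params):
    """Forward pass: two ReLU hidden layers, then a sigmoid output."""
    (w1, b1), (w2, b2), (w3, b3) = params
    h1 = relu(dense(x, w1, b1))
    h2 = relu(dense(h1, w2, b2))
    z = dense(h2, w3, b3)[0]
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical weights for a 2-input network (not a trained model).
params = [
    ([[1.0, -1.0], [0.5, 0.5]], [0.0, 0.0]),   # hidden layer 1: 2 -> 2
    ([[1.0, 1.0], [-1.0, 1.0]], [0.0, 0.0]),   # hidden layer 2: 2 -> 2
    ([[1.0, -1.0]], [0.0]),                    # output layer:   2 -> 1
]

p = predict([2.0, 1.0], params)
print(f"predicted probability of activity: {p:.3f}")
```

In a real QSPR model the input vector would be a row of the descriptor matrix, and the weights would come from training on the labeled compounds.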
In Summary
• GPU-accelerated cheminformatics platform for high
performance virtual screening of extremely large
chemical libraries.
• Tested for (1) the analysis of the largest publicly available
dataset, GDB-13 (~900M compounds), and (2) the
screening of a ~200M compound library by similarity
search using an anticonvulsant drug as the molecular
probe.
• Our platform aims to virtually screen billions of
compounds using similarity filters and QSAR models.
Acknowledgements
• UNC-CS: Vance Miller, Chun-Wei Liu,
Zimeng Wang and Reed Palmer
• Prof. Alex Tropsha (UNC-CH)
• Prof. Denis Fourches (NCSU)
• NVIDIA & Mark Berger for help & generous
hardware donation
Funding
- NSF ABI program
- Office of Naval Research

 
User Guide: Magellan MX™ Weather Station
User Guide: Magellan MX™ Weather StationUser Guide: Magellan MX™ Weather Station
User Guide: Magellan MX™ Weather Station
 

GPU-accelerated Virtual Screening

  • 1. Olexandr Isayev, Ph.D. University of North Carolina at Chapel Hill Twitter @olexandr http://olexandrisayev.com
  • 2. Outline: “Big Data” in the chemistry world; sources; challenges; our vision for a GPU-accelerated cheminformatics workflow; benchmarks and case studies; descriptor calculations; similarity; predictive modeling.
  • 3. Data-Knowledge Gap. Drowning in data but starving for knowledge. Tremendous opportunities for the discovery of new drugs and materials.
  • 4. [Figure: example chemical structures.] 10^60 to 10^100 chemicals; 10^33 drug-like chemicals*; 10^8 compounds in PubChem; 10^6 compounds in ChEMBL with ≥ 1 known bioactivity. D. Fourches. Cheminformatics at the crossroads of eras. In: Applications of Computational Techniques in Pharmacy and Medicine, Springer. Available 04/2014. * Polishchuk, Madzhidov, Varnek. J Comput Aided Mol Des. 2013, 27(8):675-9.
  • 5. Decline in pharmaceutical R&D efficiency. The cost of developing a new drug (~$2B) roughly doubles every nine years (Scannell et al., Nature Reviews Drug Discovery, 2012, 11, 191-200). There is a need for novel approaches that (i) fully exploit the potential of modern chemical and biological data streams; (ii) reliably forecast compounds’ bioactivity and safety profiles; (iii) accelerate the translation from basic research to drug candidates.
  • 6. Quantitative Structure-Activity Relationships (QSAR). [Figure: compounds are encoded as descriptors, which are related to measured activity values.] Thousands of molecular descriptors are available for organic compounds: constitutional, topological, structural, quantum mechanics based, fragmental, steric, pharmacophoric, geometrical, thermodynamical, conformational, etc. Models are built using machine learning methods (NN, SVM, RF) and validated according to numerous statistical procedures, together with their applicability domains. Descriptor matrix: rows are samples (compounds) 1 to n, columns are features (descriptors) X1 to Xm, and entry Xij is the value of descriptor j for compound i; each compound is paired with an activity value. External predictive power of QSAR models is critical to enable their application to virtual screening. It is technically challenging to compute molecular properties and descriptors for more than 10^9 compounds. No cheminformatics architecture is able to screen more than 10^9 compounds.
  • 7. Virtual screening to identify potential hits. Candidate molecules (~10^6 to 10^7 molecules) are narrowed down to potential hits (~10^2 to 10^3 molecules). Stages shown: empirical rules/filters; similarity search; ML or QSAR models; structure-based models; consensus QSA.
  • 9. Our vision for next-gen cheminformatics platforms
  • 10. Add GPUs: Accelerate Science Applications. [Figure: GPU + CPU.] Courtesy of NVIDIA.
  • 11. Small Changes, Big Speed-up. Application code: use the GPU to parallelize compute-intensive functions; the rest of the sequential code runs on the CPU. Courtesy of NVIDIA.
  • 12. Data Processing. Chemical library (text file), ~300 bytes/cmpd: data parsing from SMILES, 2D structure generation, automatic curation. High-throughput descriptor generator: molecular weight, logP, Rule of 5, Daylight fingerprints; 30-50M/hr.
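As a rough back-of-the-envelope check, the quoted descriptor rate can be turned into a wall-clock estimate for a GDB-13-sized library. This is only a sketch: the 955M compound count is taken from slide 15, the function name `hours_to_process` is made up for illustration, and reading "~300 bytes/cmpd" as the size of one text record in the input library is an assumption.

```python
def hours_to_process(n_compounds: int, rate_per_hour: int) -> float:
    """Hours needed to push n_compounds through a stage running at rate_per_hour."""
    return n_compounds / rate_per_hour

N_GDB13 = 955_000_000          # GDB-13 size as quoted on slide 15
RATE_LOW = 30_000_000          # lower bound of the quoted 30-50M/hr
RATE_HIGH = 50_000_000         # upper bound of the quoted 30-50M/hr
BYTES_PER_COMPOUND = 300       # assumption: size of one input text record

slow_hours = hours_to_process(N_GDB13, RATE_LOW)
fast_hours = hours_to_process(N_GDB13, RATE_HIGH)
input_gigabytes = N_GDB13 * BYTES_PER_COMPOUND / 1e9

print(f"descriptor generation: {fast_hours:.1f} to {slow_hours:.1f} hours")
print(f"input library size: about {input_gigabytes:.1f} GB")
```

Under these assumptions, descriptor generation for the full library takes on the order of a day (about 19 to 32 hours), and the input text is a few hundred gigabytes.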
  • 13. Similarity Search (follows Data Processing). Processed data are indexed, fully searchable, and accessible via a high-level API, e.g., (MolWt > 150) & (logP == 3); access in chunks or streaming; interactive analytics with IPython. GPU-accelerated similarity search: 177M/s on K40. Tools: GPUsim, GPUdup, GPUdiv.
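The slide does not show the API itself, so the following is only a minimal sketch of what a chunked, streaming query such as (MolWt > 150) & (logP == 3) could look like in plain Python. The function names `iter_chunks` and `query`, the record layout, and the toy descriptor values are all hypothetical and are not taken from the platform.

```python
from itertools import islice

def iter_chunks(records, chunk_size):
    """Yield successive lists of at most chunk_size records from any iterable."""
    it = iter(records)
    while True:
        chunk = list(islice(it, chunk_size))
        if not chunk:
            return
        yield chunk

def query(records, predicate, chunk_size=2):
    """Stream over the library chunk by chunk and yield records matching predicate."""
    for chunk in iter_chunks(records, chunk_size):
        for rec in chunk:
            if predicate(rec):
                yield rec

# Toy library with made-up descriptor values (illustration only).
library = [
    {"id": "cmpd-1", "MolWt": 120.0, "logP": 3},
    {"id": "cmpd-2", "MolWt": 180.5, "logP": 3},
    {"id": "cmpd-3", "MolWt": 310.2, "logP": 1},
    {"id": "cmpd-4", "MolWt": 151.0, "logP": 3},
    {"id": "cmpd-5", "MolWt": 149.9, "logP": 3},
]

hits = list(query(library, lambda r: r["MolWt"] > 150 and r["logP"] == 3))
hit_ids = [h["id"] for h in hits]
print(hit_ids)
```

Because `query` is a generator over chunks, only one chunk needs to be in memory at a time, which is the property that matters for libraries too large to load at once. Note that an exact equality test on logP is only meaningful if the stored values are rounded or integer-valued.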
  • 14. Predictive Modeling (QSAR/QSPR) (follows Similarity Search). Rapid screening of extremely large libraries with multiple molecular probes and QSAR/QSPR models. Tools: GPUrf (CudaTree@GitHub wrapper), GPUdnn (deep learning based on Theano), …
  • 15. Chemical Datasets: the largest publicly available virtual libraries. GDB-13 [1]: 955M compounds. GDB-13-ABCDE subset: 141M. GDB-17 [2] subset: 50M. [1] Blum and Reymond, 2009, J Am Chem Soc, 131, 8732–8733. [2] Ruddigkeit et al., 2012, J Chem Inf Model, 52, 2864-2875.
  • 16. GPU Case Study 1: Fast Computation of Molecular Properties for Extremely Large Chemical Libraries. Our GPU-accelerated cheminformatics platform is able to compute key molecular properties for GDB-13 (855M), GDB-13-ABCDE (141M), and a subset of GDB-17 (50M) compounds. [Figure panels: GDB-13, subset of 141M; GDB-17, random sample of 50M.]
  • 18. Similarity Search. Similarity searching using fingerprint representations of molecules is one of the most widely used approaches for chemical database mining: it assumes that similar compounds possess similar biological activities. Similarity metric: Tanimoto coefficient. (From J. Bajorath, SSS Cheminformatics, Obernai 2008.)
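For reference, the formula itself did not survive in the transcript. The standard definition of the Tanimoto coefficient for two binary fingerprints $A$ and $B$, written with the same counts $a$, $b$, $c$ that the CUDA function on slide 20 accumulates, is:

```latex
T(A, B) = \frac{c}{a + b - c},
\qquad
a = |A|, \quad b = |B|, \quad c = |A \cap B|
```

Here $a$ and $b$ are the numbers of bits set in each fingerprint and $c$ is the number of bits set in both, so $a + b - c$ is the number of bits set in either. The value ranges from 0 (no shared bits) to 1 (identical fingerprints).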
  • 19. Implementation: __popc() and __popcll() instructions.
      ● Tanimoto similarity needs the number of 1s in a binary representation of the data (popcount).
      ● CUDA includes device instructions for this: __popc() for 32-bit data and __popcll() for 64-bit data.
      ● We used __popcll() in our implementation.
      ● We break 1024-bit fingerprints into 64-bit chunks.
      ● The bit counts are aggregated over the chunks, and the similarity is computed from the totals.
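  • 20. Some GPU / CUDA code:

        // Tanimoto similarity of two fingerprints stored as data_len 64-bit chunks.
        __device__ double similarity(const long long *query, const long long *target, int data_len)
        {
            int a = 0, b = 0, c = 0, i;
            for (i = 0; i < data_len; i++) {
                a += __popcll(query[i]);              // bits set in the query
                b += __popcll(target[i]);             // bits set in the target
                c += __popcll(query[i] & target[i]);  // bits set in both
            }
            int u = a + b - c;                        // bits set in either
            return u ? (double) c / u : 0.0;          // avoid 0/0 for two empty fingerprints
        }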
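The chunked popcount scheme from slides 19 and 20 can be checked on the CPU with a small reference version. This is a sketch for verification purposes, not the authors' code: fingerprints are represented as Python integers, and the helper names `split_chunks` and `tanimoto_chunked` are made up for illustration.

```python
def popcount(x: int) -> int:
    """Number of 1 bits in a non-negative integer (CPU stand-in for __popcll)."""
    return bin(x).count("1")

def split_chunks(fingerprint: int, n_bits: int = 1024, chunk_bits: int = 64):
    """Split an n_bits-wide fingerprint into n_bits // chunk_bits integer chunks."""
    mask = (1 << chunk_bits) - 1
    return [(fingerprint >> (i * chunk_bits)) & mask
            for i in range(n_bits // chunk_bits)]

def tanimoto_chunked(query, target) -> float:
    """Tanimoto similarity from per-chunk popcounts aggregated over all chunks."""
    a = b = c = 0
    for q, t in zip(query, target):
        a += popcount(q)        # bits set in the query
        b += popcount(t)        # bits set in the target
        c += popcount(q & t)    # bits set in both
    union = a + b - c
    return c / union if union else 0.0

def fingerprint_from_bits(positions):
    """Build a fingerprint integer with the given bit positions set."""
    fp = 0
    for p in positions:
        fp |= 1 << p
    return fp

fp_a = fingerprint_from_bits([0, 1, 2, 3, 100, 1000])   # 6 bits set
fp_b = fingerprint_from_bits([2, 3, 100, 500])          # 4 bits set, 3 shared
score = tanimoto_chunked(split_chunks(fp_a), split_chunks(fp_b))
print(score)
```

With 6 and 4 bits set and 3 in common, the expected value is 3 / (6 + 4 - 3) = 3/7. A 1024-bit fingerprint splits into 16 chunks of 64 bits, which is the `data_len` the device function would receive.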
  • 23. GPU Case Study: Virtual Screening of Very Large Chemical Libraries to Identify Bioactive Compounds. Lacosamide (trade name Vimpat) is an anticonvulsant drug used to prevent seizures in patients treated for epilepsy. It is a functionalized amino acid. Many active analogues have been synthesized in Prof. Harold Kohn’s laboratory at UNC-CH (Wang et al., 2011, ACS Chem Neurosci, 2, 90–106).
  • 24. GPU Case Study: Virtual Screening of Very Large Chemical Libraries to Identify Bioactive Compounds. Similarity search over a 200M-compound subset of GDB-13/17 using lacosamide as the molecular probe. [Figure: structures of Analog 1 to Analog 5.]
  • 25. GPU Case Study: Virtual Screening of Very Large Chemical Libraries to Identify Bioactive Compounds.

        Compound ID        Tanimoto Ts
        Analog 2           0.997
        Analog 3           0.995
        Analog 1           0.994
        Analog 4           0.992
        Analog 5           0.978
        Gdb13-a10573585    0.977
        Gdb13-b28137563    0.977
        Gdb13-a36264983    0.976
        Gdb13-a36264952    0.976
        Gdb13-a10616005    0.976
        Gdb13-a3011053     0.976
        Gdb13-b21242261    0.976
        Gdb17-44140083     0.976
        Gdb13-a30878321    0.975
        Gdb13-b3485216     0.975

    The GPU-accelerated screening platform was able to retrieve: known active analogues of lacosamide; several functionalized amino acids present in GDB-13; a novel compound (Gdb17-44140083) fully matching the pharmacophore of lacosamide.
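A ranked hit list of this kind comes from scoring every library compound against the probe and keeping the best few. The sketch below shows that step on the CPU with toy 8-bit fingerprints; the compound names, fingerprints, and the function name `top_k` are invented for illustration and have no relation to the real GDB entries or to the platform's GPUsim tool.

```python
import heapq

def tanimoto(fp_a: int, fp_b: int) -> float:
    """Tanimoto similarity of two fingerprints held as Python integers."""
    common = bin(fp_a & fp_b).count("1")
    union = bin(fp_a | fp_b).count("1")
    return common / union if union else 0.0

def top_k(probe: int, library: dict, k: int):
    """Return the k (score, compound_id) pairs most similar to the probe, best first."""
    scored = ((tanimoto(probe, fp), cid) for cid, fp in library.items())
    return heapq.nlargest(k, scored)

probe = 0b11110000
library = {
    "toy-a": 0b11110000,   # identical to the probe
    "toy-b": 0b11100000,   # 3 shared bits, union of 4
    "toy-c": 0b00001111,   # no shared bits
    "toy-d": 0b11111100,   # 4 shared bits, union of 6
}

ranked = top_k(probe, library, 3)
for score, cid in ranked:
    print(f"{cid}  {score:.3f}")
```

`heapq.nlargest` keeps only k candidates in memory while streaming over the scores, so the same pattern works when the library is iterated in chunks rather than held in a dictionary.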
  • 26. Deep Learning Case Study: Large-scale QSPR prediction of bioactivity. Build the model with deep learning on 103K molecules: deep neural net, 2 hidden layers, rectified linear units (ReLU); model accuracy 97%. Then rapid screening of potential candidates among 200M molecules.
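To make the architecture concrete, here is a minimal forward pass for a network with two ReLU hidden layers, written in plain Python. It is a sketch only: the layer sizes, the weights, and the sigmoid output for an active/inactive probability are assumptions for illustration, not the authors' Theano model.

```python
import math

def relu(values):
    """Rectified linear unit applied elementwise."""
    return [max(0.0, v) for v in values]

def dense(x, weights, bias):
    """Fully connected layer: one output per weight row, plus bias."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def predict(x, params):
    """Forward pass: two ReLU hidden layers, then a sigmoid output (assumed)."""
    (w1, b1), (w2, b2), (w3, b3) = params
    h1 = relu(dense(x, w1, b1))
    h2 = relu(dense(h1, w2, b2))
    logit = dense(h2, w3, b3)[0]
    return 1.0 / (1.0 + math.exp(-logit))

# Made-up weights for a 2-descriptor input (illustration only).
params = [
    ([[1.0, -1.0], [0.5, 0.5]], [0.0, 0.0]),   # hidden layer 1
    ([[1.0, 1.0], [-1.0, 1.0]], [0.0, 0.0]),   # hidden layer 2
    ([[1.0, -1.0]], [0.0]),                    # output layer
]

probability = predict([2.0, 1.0], params)
print(probability)
```

For the input [2.0, 1.0] the hidden activations are [1.0, 1.5] and then [2.5, 0.5], giving a logit of 2.0. Screening a large library amounts to applying `predict` to each compound's descriptor vector, which is the part a GPU implementation batches.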
  • 27. In Summary. • GPU-accelerated cheminformatics platform for high-performance virtual screening of extremely large chemical libraries. • Tested for (1) the analysis of the largest publicly available dataset, GDB-13 (~900M compounds), and (2) the screening of a ~200M-compound library by similarity search using an anticonvulsant drug as the molecular probe. • Our platform aims to virtually screen billions of compounds using similarity filters and QSAR models.
  • 28. Acknowledgements. • UNC-CS: Vance Miller, Chun-Wei Liu, Zimeng Wang and Reed Palmer • Prof. Alex Tropsha (UNC-CH) • Prof. Denis Fourches (NCSU) • NVIDIA & Mark Berger for help & generous hardware donation. Funding: NSF ABI program; Office of Naval Research.