Robots, Small Molecules & R
1. Robots, Small Molecules & R
Ingredients for Exploring and Predicting Biological Effects
Rajarshi Guha
September 13, 2014
http://blog.rguha.net/
2. Hunting for Leads
Target Identification → Lead Discovery → Lead Optimization → Clinical Development
HTS:
• Assay Optimization
– Sensitivity
– Scaling
• Primary Screening
– Fluorescence
– High Content
• Cherry Picking
– Select subset to follow up
– Diversity
• Confirmation
– Counter screen
– Explore SAR
3. High Throughput Screening
• Test thousands to hundreds of thousands of compounds in one or more assays
• Employs a robotic platform
• Rapidly identify novel modulators of biological systems
– Infectious agents
– Cellular basis of diseases
6. HTS Workflow
• Rapidly screen large compound collections
• Efficiently identify real actives
– Test them in slower, accurate, expensive screens
• Use the data to learn what types of compounds tend to be active
• Use the model to suggest more compounds to screen
[Figure: Number of Molecules at each stage; labels: 300K, HTS, 1000, Cherry Picks, 300]
7. Data Science Problems
• Predictive models for highly imbalanced datasets
• Global versus local models?
• Feature selection – data driven? Domain driven?
• Clustering & enrichment
• Similarity – definition, computation, performance
• Integration – chemical structures, numerical data, text (papers, patents), images
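The slide lists the definition of similarity as an open problem without naming a metric. As a minimal sketch, assuming binary fingerprints held as 0/1 vectors, one widely used definition is the Tanimoto coefficient. The function name `tanimoto` and the two example vectors are illustrative only and do not come from the talk.

```r
# Tanimoto (Jaccard) similarity between two binary fingerprints,
# each given as a 0/1 vector of equal length.
tanimoto <- function(a, b) {
  stopifnot(length(a) == length(b))
  both   <- sum(a == 1 & b == 1)   # bits set in both fingerprints
  either <- sum(a == 1 | b == 1)   # bits set in at least one fingerprint
  # Assumption: two all-zero fingerprints are treated as identical.
  if (either == 0) return(1)
  both / either
}

# Hypothetical fingerprints for illustration
fp1 <- c(1, 0, 1, 1, 0, 0)
fp2 <- c(1, 1, 1, 0, 0, 0)
sim <- tanimoto(fp1, fp2)
print(sim)
```

Both the choice of fingerprint and the choice of coefficient change what "similar" means, which is presumably why the slide raises definition, computation and performance together.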
8. The Roles of R
• Data Access: ROracle, RMySQL, RPostgreSQL, rpubchem, chemblr
• Chemistry: rcdk, ChemmineR, fingerprint
• HTS QC: displayHTS, spdep
• Imaging: EBImage, rflowcyt, ripa, raster
• Visualization: grid, ggplot, Shiny, ggvis, igraph
• Data Analysis: drc, igraph, randomForest, svm, ...
Also see the ChemPhys CRAN Task View
9. HTS Data Types – Single Point
[Figure: Response (0 to 100) versus Concentration]
16. Working with Molecules in R
• A number of OSS libraries are available
• ChemmineR and rcdk are the main packages that allow you to manipulate molecules in R
• Uses rJava to interface with JOELib and CDK respectively
17. rcdk
• Idiomatic R interface to the CDK library
– I/O support for chemical file formats
– Manipulation of atoms, bonds, molecules
– Generate molecular descriptors, fingerprints
library(rcdk)
## parse a SMILES string into a molecule object
mol <- parse.smiles('CCCC')[[1]]
## load a set of molecules from a SMILES file
mols <- load.molecules('http://www.rguha.net/mipe100.smi')
19. Calculating Molecular Features
• Evaluate a matrix of numerical features
mols <- load.molecules("mipe100.smi")
dnames <- get.desc.names('topological')
descs <- eval.desc(mols, dnames)
• End up with a rectangular data.frame
> str(descs)
'data.frame': 99 obs. of 195 variables:
$ nRings7 : num 1 0 1 0 0 0 0 0 0 0 ...
$ nRings8 : num 0 0 0 0 0 0 0 0 0 0 ...
$ nRings9 : num 0 0 0 0 0 0 0 0 0 0 ...
$ tpsaEfficiency : num 0.1856 0.2035 0.0118 0.0602 ...
20. Calculating Fingerprints
• Binary string representation of molecular structure
– Objectively defined, fast to calculate
– Good for searching, clustering, prediction
library(fingerprint)
fps <- lapply(mols, get.fingerprint)
• The fingerprint package is used to represent them as S4 objects
24. Use Case - Bit Spectrum
• Vector summary of the fingerprints for a dataset
• Defined as the fraction of times a bit position is set to 1, for each bit position
Example (one fingerprint per row, ~10K molecules):
0 0 1 ...
0 1 0 ...
1 1 1 ...
1 0 1 ...
Bit spectrum:
0.5 0.5 0.75 ...
[Figure: Normalized Frequency (0.00 to 1.00) versus Bit Position (0 to 750)]
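The definition above can be checked by hand on the four example rows shown on the slide. This is a minimal base-R sketch that assumes the fingerprints are stored as rows of a 0/1 matrix; the talk itself works with fingerprint objects, so the matrix representation here is an assumption made for illustration.

```r
# Fingerprints as rows of a 0/1 matrix (the four example rows from the slide).
fp <- rbind(c(0, 0, 1),
            c(0, 1, 0),
            c(1, 1, 1),
            c(1, 0, 1))

# The bit spectrum is the fraction of fingerprints with each bit set,
# which for a 0/1 matrix is simply the column mean.
spectrum <- colMeans(fp)
print(spectrum)
```

The result is one number per bit position regardless of how many molecules are in the dataset, which is what makes it a compact summary.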
25. Use Case - Bit Spectrum
• Comparison of two datasets is now O(n)
• Simply take the difference of the two bit spectra
e.g.: Compare ~800 solubles with > 30k insolubles
[Figure: Δ Normalized Frequency (-1.0 to 1.0) versus Bit Position (0 to 150)]
## packages needed for bit.spectrum() and ggplot()
library(fingerprint)
library(ggplot2)
## make two subsets and generate bit spectra
sol.idx <- which(sol$label == 'high')
insol.idx <- which(sol$label != 'high')
sol.bs <- bit.spectrum(fps[sol.idx])
insol.bs <- bit.spectrum(fps[insol.idx])
## display a difference plot
bsdiff <- sol.bs - insol.bs
d <- data.frame(x=1:length(sol.bs), y=bsdiff)
ggplot(d, aes(x=x,y=y))+geom_line()+
xlab('Bit Position')+
ylab('Normalized Frequency')+
ylim(c(-1,1))
27. Building Models is the Easy Part
• Given a descriptor data.frame or fingerprint list we’re ready to build models
– caret, caretEnsemble
• Question is whether the model(s) can generalize
• Applicability is a key consideration when predicting bioactivity
– Has economic & safety ramifications in regulatory environments
28. Domain Applicability
• How dissimilar to the training set do you have to be before the prediction is meaningless?
– Distance to training set? Inside/outside convex hull
– Comparison of bit spectra
[Figure: Training Set and Test Set]
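One way to make "distance to training set" concrete is to measure how far each new molecule lies from its nearest training molecule in descriptor space. The sketch below is an illustration under stated assumptions, not the method used in the talk: the descriptor values are made up, Euclidean distance is an arbitrary choice, and the cutoff rule (largest nearest-neighbour distance observed inside the training set) is just one simple heuristic.

```r
# Hypothetical descriptor matrices: rows are molecules, columns are features.
train <- rbind(c(0, 0), c(1, 0), c(0, 1), c(1, 1))
test  <- rbind(c(0.5, 0.5), c(5, 5))

# Euclidean distance from each test molecule to its nearest training molecule.
nearest_train_dist <- function(test, train) {
  apply(test, 1, function(x) {
    min(sqrt(rowSums(sweep(train, 2, x)^2)))
  })
}

d <- nearest_train_dist(test, train)

# Assumed cutoff: the largest nearest-neighbour distance seen within
# the training set itself (self-distances are excluded).
tt <- as.matrix(dist(train))
diag(tt) <- Inf
cutoff <- max(apply(tt, 1, min))

# TRUE marks test molecules that fall outside the assumed domain.
outside <- d > cutoff
print(data.frame(distance = d, outside = outside))
```

In this toy case the first test molecule sits in the middle of the training set, while the second is far from every training molecule and would be flagged.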
29. Global vs Local Models
• Bioassay data is not really big data
• Can big data be too big?
• AID 1996
– 57K measurements of aqueous solubility
• Do we build one model?
• Or multiple local models?
[Figure: PCA of 166 Binary Features]
32. How to Test Combinations
• Many procedures described in the literature
– Fixed dose ratio (aka ray)
– Ray contour
– Checkerboard
– Genetic algorithm
[Figure: grid of dose combinations]
C5,D5 C5
C4,D4 C4
C3,D3 C3
C2,D2 C2
C1,D5 C1,D4 C1,D3 C1,D2 C1,D1 C1
D5 D4 D3 D2 D1 0
33. How to Test Combinations
• Many procedures described in the literature
– Fixed dose ratio (aka ray)
– Ray contour
– Checkerboard
– Genetic algorithm
[Figure: panels labeled by compound]
Vargatef, DCC-2036, PD-166285, GDC-0941
PI-103, GDC-0980, Bardoxolone methyl, AT-7519
SNS-032, NCGC00188382-01, Lestaurtinib, CNF-2024
ISOX, Belinostat, PF-477736, AZD-7762
35. When are Combinations Similar?
• Differences and their aggregates such as RMSD can lead to degeneracy
• Instead we’re interested in the shape of the surface
• How to characterize shape?
– Parametrized fits
– Distribution of responses
[Figure: three density plots of responses (0 to 100), annotated with D and p value]
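The degeneracy of RMSD can be seen on a toy example: two response surfaces that deviate from a reference in different places can have exactly the same RMSD to it. The matrices below are invented for illustration. The second half compares the distributions of responses with a two-sample Kolmogorov-Smirnov test; that choice is an assumption suggested by the "D, p value" annotation in the figure, and the slide does not say which test was actually used.

```r
rmsd <- function(a, b) sqrt(mean((a - b)^2))

# Hypothetical 3x3 response matrices (percent response).
ref <- matrix(50, nrow = 3, ncol = 3)   # flat reference surface
a <- ref; a[1, 1] <- 95                 # bump in one corner
b <- ref; b[3, 3] <- 5                  # dip of equal size in the opposite corner

rmsd_a <- rmsd(ref, a)
rmsd_b <- rmsd(ref, b)
# Same RMSD to the reference, yet a and b are different surfaces.
print(c(rmsd_a = rmsd_a, rmsd_b = rmsd_b))

# Assumed approach: compare the distributions of responses directly.
# Ties in the toy data trigger a warning, which is suppressed here.
ks <- suppressWarnings(ks.test(as.vector(a), as.vector(b)))
print(c(D = unname(ks$statistic), p.value = ks$p.value))
```

Note that a distribution of responses ignores where on the grid each response occurs, so it can still confuse surfaces that contain the same values in different positions.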
36. Similarity via the Syrjala Test
• Syrjala test used to compare population distributions over a spatial grid
– Invariant to grid orientation
– Provides an empirical p-value
• Less degenerate than just considering 1D distributions
[Figure: density of D (0.00 to 0.75)]
Syrjala, S.E., “A Statistical Test for a Difference between the Spatial Distributions of Two Populations”, Ecology, 1996, 77(1), 75-80
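To make the bullets above concrete, here is a rough base-R sketch of a Syrjala-style statistic for two response matrices on the same grid. It follows my reading of the cited paper: normalise each surface, compare cumulative sums, average over the four grid corners as origin, and obtain a p-value by randomly swapping the paired values at each cell. The function names are made up, the renormalisation after each swap is a simplification, and this should not be taken as the code used in the talk or as a validated implementation.

```r
# Cumulative sums over a grid, taking the top-left cell as the origin.
# Assumes the grid has at least two rows and two columns.
cum2d <- function(m) t(apply(apply(m, 2, cumsum), 1, cumsum))

# Squared difference between the two cumulative surfaces,
# averaged over the four grid corners used as origin.
syrjala_stat <- function(a, b) {
  a <- a / sum(a)                      # normalise each surface to total 1
  b <- b / sum(b)
  flip_r <- function(m) m[nrow(m):1, , drop = FALSE]
  flip_c <- function(m) m[, ncol(m):1, drop = FALSE]
  views <- list(identity, flip_r, flip_c, function(m) flip_r(flip_c(m)))
  mean(sapply(views, function(v) sum((cum2d(v(a)) - cum2d(v(b)))^2)))
}

# Permutation p-value: at each grid cell, randomly swap the two values.
syrjala_test <- function(a, b, n_perm = 999) {
  a <- a / sum(a)
  b <- b / sum(b)
  obs <- syrjala_stat(a, b)
  perm <- replicate(n_perm, {
    swap <- matrix(runif(length(a)) < 0.5, nrow = nrow(a))
    syrjala_stat(ifelse(swap, b, a), ifelse(swap, a, b))
  })
  # The observed value is counted among the permutations.
  list(statistic = obs, p.value = (sum(perm >= obs) + 1) / (n_perm + 1))
}

# Hypothetical surfaces: same set of values, peak in opposite corners.
set.seed(1)
a <- matrix(c(9, 1, 1,  1, 1, 1,  1, 1, 1), nrow = 3, byrow = TRUE)
b <- matrix(c(1, 1, 1,  1, 1, 1,  1, 1, 9), nrow = 3, byrow = TRUE)
res <- syrjala_test(a, b, n_perm = 199)
print(res)
```

The two example surfaces have identical 1D distributions of responses, yet the statistic is non-zero because it takes grid position into account.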
38. Working in “Combination Space”
• Each cell line is represented as a vector of response matrices
• “Distance” between two cell lines is a function of the distance between component response matrices
• F can be min, max, mean, …
D(L1, L2) = F({d1, d2, …, dn})
[Figure: cell lines L1 and L2 shown as rows of response matrices, with pairwise distances d1 to d5]
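The relation D(L1, L2) = F({d1, d2, …, dn}) can be sketched as follows. Each cell line is a list of response matrices, one per drug combination, in the same order for both cell lines. The data, the function name `cell_line_distance`, and the component distance (a plain Euclidean distance between matrices) are all assumptions for illustration; the talk argues for a shape-aware component distance instead.

```r
# Hypothetical cell lines: one 2x2 response matrix per drug combination.
line1 <- list(matrix(c(100, 80, 60, 40), 2), matrix(c(100, 90, 90, 70), 2))
line2 <- list(matrix(c(100, 70, 60, 20), 2), matrix(c(100, 90, 90, 70), 2))

# Placeholder component distance between two response matrices.
euclid <- function(a, b) sqrt(sum((a - b)^2))

# D(L1, L2) = F({d1, ..., dn}); `agg` plays the role of F on the slide.
cell_line_distance <- function(l1, l2, d, agg = mean) {
  stopifnot(length(l1) == length(l2))
  agg(mapply(d, l1, l2))
}

print(cell_line_distance(line1, line2, euclid, agg = min))
print(cell_line_distance(line1, line2, euclid, agg = mean))
print(cell_line_distance(line1, line2, euclid, agg = max))
```

The choice of F matters: with min, two cell lines look identical as soon as they agree on a single combination, while max is driven entirely by the combination on which they differ most.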
41. Networks & Integration
• Network models of molecules, and targets are common
– Allows for the incorporation of lots of associated information
– Diseases, pathways, OTE’s
• When linked with clinical data & outcomes, we can generate massive networks
– Adverse events (FDA AERS)
– Analysis by Cloudera considered > 10E6 drug-drug-reaction triples
[Figure credit: Yildirim, M.A. et al]
42. Networks & Integration
• SAR data can be viewed in a network form
– SALI, SARI based networks
– Usually requires pairwise calculations of the metric
• Current studies have focused on small datasets (< 1000 molecules)
• Hadoop + Giraph could let us apply this to HTS-scale datasets
[Figure credit: Peltason, L et al]
http://sali.rguha.net/
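To show why such networks need pairwise calculations, here is a small sketch of SALI (Structure-Activity Landscape Index), using its usual definition: the absolute activity difference of a pair divided by one minus their structural similarity. The activities, the similarity matrix and the edge threshold of 10 are made-up values for illustration.

```r
# Hypothetical activities (e.g. pIC50) for three molecules.
activity <- c(A = 6.0, B = 8.5, C = 6.2)

# Hypothetical pairwise structural similarities (e.g. Tanimoto).
sim <- matrix(c(1.0, 0.9, 0.3,
                0.9, 1.0, 0.4,
                0.3, 0.4, 1.0),
              nrow = 3, byrow = TRUE,
              dimnames = list(names(activity), names(activity)))

# SALI for every pair: |A_i - A_j| / (1 - sim_ij).
sali <- abs(outer(activity, activity, "-")) / (1 - sim)
diag(sali) <- NA   # self-pairs are 0/0 and carry no information

# Keep pairs above an assumed threshold as network edges (upper triangle only).
edges <- which(sali > 10 & upper.tri(sali), arr.ind = TRUE)
print(round(sali, 2))
print(edges)
```

Pairs that are structurally similar but differ strongly in activity get large SALI values. Because every pair has to be evaluated, the cost grows quadratically with the number of molecules, which is the scaling problem the slide points to. Distinct molecules with similarity exactly 1 would give an infinite value and need separate handling.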
43. Networks & Integration
• When we apply a network view we can consider many interesting applications & make use of cloud scale infrastructure
– Network based similarity
– Community detection (aka clustering)
– PageRank style ranking (of targets, compounds, …)
– Generate network metrics, which can be used as input to predictive models (for interactions, effects, …)
[Figure credit: Bauer-Mehren et al]
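As an illustration of PageRank style ranking, the sketch below runs a plain power iteration on a tiny, invented compound-target network in base R. The function name `pagerank`, the damping factor of 0.85 and the toy adjacency matrix are assumptions; a real analysis at the scale discussed here would use a graph library or a distributed framework rather than a dense matrix.

```r
# PageRank by power iteration on an adjacency matrix
# (rows are source nodes, columns are destination nodes).
pagerank <- function(adj, damping = 0.85, tol = 1e-10, max_iter = 1000) {
  n <- nrow(adj)
  out <- rowSums(adj)
  # Row-normalise; nodes without out-links spread their weight uniformly.
  P <- adj / ifelse(out == 0, 1, out)
  P[out == 0, ] <- 1 / n
  r <- rep(1 / n, n)
  for (i in seq_len(max_iter)) {
    r_new <- as.vector((1 - damping) / n + damping * (r %*% P))
    converged <- sum(abs(r_new - r)) < tol
    r <- r_new
    if (converged) break
  }
  setNames(r, rownames(adj))
}

# Hypothetical undirected network: compounds c1, c2 and targets t1, t2,
# with edges c1-t1, c2-t1 and c2-t2.
nodes <- c("c1", "c2", "t1", "t2")
adj <- matrix(0, 4, 4, dimnames = list(nodes, nodes))
adj["c1", "t1"] <- adj["t1", "c1"] <- 1
adj["c2", "t1"] <- adj["t1", "c2"] <- 1
adj["c2", "t2"] <- adj["t2", "c2"] <- 1

pr <- pagerank(adj)
print(round(pr, 3))
```

The resulting scores are an example of a network metric that could be attached to each compound or target and then fed into a predictive model as an additional feature.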
48. Summary
• The HTS workflow presents multiple data science problems involving (unique) data types
• R can play a role at several stages, but model building is straightforward
• Representation is key and guides the types and nature of analyses