3. Motivations
• Library comparison is usually driven by a need to construct or expand a library
  – Often with constraints on resources
• Two classes of features to consider
  – Compound-centric (physchem properties, bioactivity, target preferences)
  – Library-centric (diversity, chemical space coverage)
• Library comparisons generally reduce to
  – Distributions of compound features (univariate)
  – Overlap in some chemical space (multivariate)
4. Comparing Libraries
• Most comparisons employ a reduced (numerical) representation of the structure
  – Fingerprints, BCUTs, physicochemical descriptors
• Perform comparisons in the new space
  – PCA, SOM, MDS, GTM, …
Schamberger et al, DDT, 2011, 16, 636-641; Kireeva et al, Mol. Inf., 2012, 31, 301-312
5. Scaffolds & Networks
• Scaffolds represent a chemically meaningful reduced representation of the structures
• It can be challenging to define what a (good) scaffold is
• A network representation of the collection of structures allows for novel ways to perform library comparisons
  – How fine-grained can such comparisons be?
6. Scaffold Network Representations
• Scaffolds are generated by exhaustive enumeration of the SSSR (smallest set of smallest rings)
• Scaffolds are nodes, connected by directed edges
• Nodes are labeled by a hash key of the scaffold
[Figure: example scaffold networks; labels: 4 compounds, 1912 compounds]
7. Scaffold Network Construction
• A scaffold network is a directed graph
• Edges denote sub/super-structure relationships between scaffolds
• Each node in the network represents a unique scaffold
• Singletons are acyclic molecules
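The construction above can be sketched as a small data structure. This is a minimal illustration only: the MD5 hash of a scaffold string, the sub → super edge direction, and the names `scaffold_key` and `ScaffoldNetwork` are assumptions made for the sketch, not details given in the slides. The scaffold strings are treated as opaque labels, with no canonicalization or ring perception.

```python
import hashlib
from collections import defaultdict


def scaffold_key(scaffold: str) -> str:
    """Hash key used as the node label (assumption: MD5 of the scaffold string)."""
    return hashlib.md5(scaffold.encode("utf-8")).hexdigest()


class ScaffoldNetwork:
    """Directed graph whose nodes are unique scaffolds, keyed by hash."""

    def __init__(self):
        self.scaffolds = {}                # hash key -> scaffold string
        self.children = defaultdict(set)   # substructure key -> superstructure keys

    def add_scaffold(self, scaffold: str) -> str:
        key = scaffold_key(scaffold)
        self.scaffolds[key] = scaffold     # re-adding the same scaffold is a no-op
        return key

    def add_edge(self, sub: str, sup: str) -> None:
        """Add a directed edge sub -> sup (direction is an assumption)."""
        self.children[self.add_scaffold(sub)].add(self.add_scaffold(sup))

    def n_nodes(self) -> int:
        return len(self.scaffolds)

    def n_edges(self) -> int:
        return sum(len(targets) for targets in self.children.values())


# Toy example: one ring system that is a substructure of two larger ones.
toy = ScaffoldNetwork()
toy.add_edge("c1ccccc1", "c1ccc2ccccc2c1")
toy.add_edge("c1ccccc1", "c1ccc(cc1)-c1ccccc1")
```

Keying nodes by hash means that two libraries containing the same scaffold produce the same node label, which is what later makes set-based overlap and merging cheap.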
8. Datasets
• CL1420, 31320 compounds
• CL886, 3552 compounds
• MIPE, 1920 compounds
• Natural Products, 5000 compounds
• LOPAC, 1280 compounds
Mathews and Guha et al, PNAS, 2014, 111, 11365; Singh et al, JCIM, 2009, 49, 1010
Network sizes and descriptions, in the order they appear on the slide:
  – 1079 nodes, 115287 edges, 69 trees
  – 2131 nodes, 1843 edges, 129 trees
  – Approved, investigational drugs, constructed for functional diversity
  – Diverse library, designed for enrichment of bioactivity
  – 15283 nodes, 13622 edges, 729 trees
  – 5563 nodes, 4832 edges, 239 trees
  – 23716 nodes, 21468 edges, 750 trees
9. Scaffold Network Representations
• The overall structure of the complete network can characterize the library
• But distributions of vertex-level network metrics may be informative
• We can also consider approaches to identify “important” scaffolds
10. Metrics for the Complete Network
• Examined vertex-level measures of centrality
  – Closeness, betweenness, …
  – High similarity of MIPE & NP and low similarity of LOPAC & NP is surprising (Ertl et al, JCIM, 2008)
[Figure: density of log10(Betweenness), and Num. Scaffold vs. log Closeness (in-degree), for CL1420, CL886, LOPAC, MIPE, NP]
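As an illustration of a vertex-level centrality, the sketch below computes closeness for every node of a toy directed network with a breadth-first search. The definition used here (number of reachable nodes divided by the sum of shortest-path lengths, following outgoing edges) is one common convention and an assumption of this sketch; the slides do not state which variant or software was used. Passing a reversed adjacency would give an in-edge variant.

```python
from collections import deque


def closeness(adj, source):
    """Closeness of `source`: reachable nodes / sum of shortest-path lengths.

    `adj` maps each node to the list of nodes it points to.
    Returns 0.0 when no other node is reachable.
    """
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj.get(u, ()):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    total = sum(dist.values())
    return (len(dist) - 1) / total if total > 0 else 0.0


# Toy network (hypothetical node labels standing in for scaffold hashes).
toy_adj = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
scores = {node: closeness(toy_adj, node) for node in toy_adj}
```

The per-node scores can then be collected into a distribution per library and compared, as in the density plots on this slide.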
12. Comparing Complete Networks
• Library overlap is characterized by the set of common scaffolds
• Scaffolds can be ranked (e.g., PageRank)
  – Small fragments have low PR
  – Large frameworks have high PR
  – Interesting scaffolds lie in between?
• Similar libraries will have common scaffolds with similar PageRank values
[Diagram: for each library, PageRank vector → subset to common fragments; the two subsets are compared by a normalized dot product]
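The comparison in the diagram can be sketched as follows: compute a PageRank vector per network, restrict both vectors to the common scaffolds, and take the normalized dot product (cosine similarity). The power-iteration PageRank, the damping factor of 0.85, the uniform treatment of dangling nodes, and the function names are assumptions of this sketch, not parameters reported in the slides.

```python
import math


def pagerank(adj, damping=0.85, iterations=100):
    """Power-iteration PageRank on a directed adjacency dict (node -> targets)."""
    nodes = set(adj) | {v for targets in adj.values() for v in targets}
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        new = {node: (1.0 - damping) / n for node in nodes}
        for u in nodes:
            targets = adj.get(u, [])
            if targets:
                share = damping * rank[u] / len(targets)
                for v in targets:
                    new[v] += share
            else:
                # Dangling node: spread its rank uniformly over all nodes.
                share = damping * rank[u] / n
                for v in nodes:
                    new[v] += share
        rank = new
    return rank


def pagerank_similarity(adj1, adj2):
    """Normalized dot product of PageRank values over the common nodes."""
    pr1, pr2 = pagerank(adj1), pagerank(adj2)
    common = sorted(set(pr1) & set(pr2))
    if not common:
        return 0.0
    v1 = [pr1[k] for k in common]
    v2 = [pr2[k] for k in common]
    dot = sum(a * b for a, b in zip(v1, v2))
    norm = math.sqrt(sum(a * a for a in v1)) * math.sqrt(sum(b * b for b in v2))
    return dot / norm


# Toy networks sharing the nodes "h1" and "h2" (hypothetical hash keys).
net_a = {"h1": ["h2", "h3"], "h2": ["h3"], "h3": []}
net_b = {"h1": ["h2"], "h2": ["h4"], "h4": []}
similarity = pagerank_similarity(net_a, net_b)
```

Because only common scaffolds enter the vectors, this score says how similarly the shared scaffolds are ranked, not how large the shared set is; the size of the common set has to be reported alongside it.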
14. Scaffold Recognition
• What is a scaffold?
• Can be addressed through the scaffold network
  – A scaffold is a hub within the scaffold network
• Provides a practical answer to “What are the missing scaffolds in my library?”
• Examples of unique scaffolds in MIPE but not in NP
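With hash-labeled nodes, both questions on this slide reduce to simple operations. The sketch below treats a hub as any node whose total degree meets a threshold, and finds missing scaffolds as a set difference of hash keys. The degree threshold and the function names are illustrative assumptions; the slides do not define how a hub is selected.

```python
def hubs(adj, min_degree):
    """Nodes whose total (in + out) degree is at least `min_degree`."""
    degree = {}
    for u, targets in adj.items():
        degree[u] = degree.get(u, 0) + len(targets)
        for v in targets:
            degree[v] = degree.get(v, 0) + 1
    return {node for node, d in degree.items() if d >= min_degree}


def missing_scaffolds(library_keys, reference_keys):
    """Scaffold keys present in the reference library but absent from ours."""
    return set(reference_keys) - set(library_keys)


# Toy data (hypothetical hash keys).
reference = {"h1": ["h2", "h3", "h4"], "h2": ["h5"], "h3": [], "h4": [], "h5": []}
mine = {"h1": ["h2"], "h2": []}

reference_hubs = hubs(reference, min_degree=3)
gaps = missing_scaffolds(mine.keys(), reference.keys())
```

Intersecting `gaps` with `reference_hubs` would give the hub scaffolds of the reference library that the other library lacks.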
16. Reduced Network Representation
• The complete network can be reduced to a forest of trees
• Order nodes by out-degree
• From each node, traverse the network until a terminal node is reached
• Result is a set of spanning trees
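One way to read the steps above is sketched below: sort nodes by decreasing out-degree, then run a depth-first traversal from each unvisited node, keeping only the edges that reach a node for the first time. The descending order, the depth-first traversal, the tie-breaking by label, and the name `reduce_to_forest` are all assumptions of this sketch; the slides do not give these details, so the actual procedure may differ.

```python
def reduce_to_forest(adj):
    """Reduce a directed network to a list of (root, tree_edges) pairs.

    `adj` maps each node to the nodes it points to. Every node ends up in
    exactly one tree; a node with no tree edges forms a singleton tree.
    """
    nodes = set(adj) | {v for targets in adj.values() for v in targets}
    # Highest out-degree first; ties broken by label for reproducibility.
    order = sorted(nodes, key=lambda n: (-len(adj.get(n, ())), n))
    visited = set()
    forest = []
    for root in order:
        if root in visited:
            continue
        visited.add(root)
        tree_edges = []
        stack = [root]
        while stack:
            u = stack.pop()
            for v in sorted(adj.get(u, ())):
                if v not in visited:
                    visited.add(v)
                    tree_edges.append((u, v))   # first edge to reach v is kept
                    stack.append(v)
        forest.append((root, tree_edges))
    return forest


# Toy network: "d" has two parents, so one of its incoming edges is dropped.
toy_net = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "e": []}
forest = reduce_to_forest(toy_net)
```

In this reading, each tree has one fewer edge than it has nodes, so the reduced forest is much sparser than the complete network.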
18. Network Structure
• A scaffold forest is characterized by
  – Disconnected components
    • structurally related scaffolds, scaffold diversity
  – Singletons
    • scaffolds with no superstructure
  – Branching within connected components
    • scaffold complexity
19. Forest Size vs Library Size
• A large library doesn’t imply a large forest
• Forest size is a function of scaffold diversity
[Figure: forests for CL1420 (31K, combinatorial library) and MIPE (1912, target-diverse library)]
20. Summarizing Forests
• A key feature is the nature of branching in individual trees
• Characterized by ID, an information-theoretic descriptor of branching derived from the distance matrix
Bonchev & Trinajstić, IJQC, 1978, 14, 293-303
[Figure: example trees with ID = 978, ID = 90794, ID = 3456, ID = 979252]
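To make the descriptor concrete, the sketch below computes one commonly cited distance-based information index, the total information on the magnitude of distances: W log2 W minus the sum over distances d of g_d · d · log2 d, where W is the Wiener index and g_d is the number of node pairs at distance d. Whether this is the exact variant of ID used for the values on the slide is an assumption; the slides only say the descriptor is derived from the distance matrix.

```python
import math
from collections import Counter, deque


def distance_counts(adj):
    """Number of unordered node pairs at each shortest-path distance.

    `adj` is an undirected adjacency dict (each edge listed in both directions).
    """
    counts = Counter()
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        for d in dist.values():
            if d > 0:
                counts[d] += 1
    # Each pair was seen once from each endpoint.
    return {d: c // 2 for d, c in counts.items()}


def information_on_distances(adj):
    """W*log2(W) - sum_d g_d * d * log2(d), with W the Wiener index."""
    counts = distance_counts(adj)
    wiener = sum(d * c for d, c in counts.items())
    if wiener == 0:
        return 0.0
    partition = sum(c * d * math.log2(d) for d, c in counts.items())
    return wiener * math.log2(wiener) - partition


# Toy tree: a path of three nodes. Distances are 1, 1, 2, so W = 4.
path3 = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
id_path3 = information_on_distances(path3)
```

For the three-node path this gives 4·log2(4) − 2·log2(2) = 6. The index grows quickly with tree size, which is consistent with plotting it on a log10 scale.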
21. Summarizing Forests
• Distribution of ID distinguishes datasets primarily in the tails
• Aggregating by mean ID still discriminates well
  – Driven by the tails
[Figure: density of log10(ID), and mean log10(ID) per library, for CL1420, CL886, LOPAC, MIPE, NP]
22. Exploring the Forest
• The metric also allows us to drill down
  – Select scaffolds of given branching complexity
  – Identify scaffolds of a given complexity range across different libraries (equivalent to finding holes in scaffold coverage)
[Figure: LOPAC tree (ID = 10214) ≈ MIPE tree (ID = 10197)]
23. Library Comparison via Merging
• … reduces to comparing networks
• We compute a graph union and construct new edges between nodes with the same hash
• How does the network structure of the union differ from the original networks?
• Can be extended to merge more than two networks
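The merging step can be sketched as below: every node is tagged with its source library, the original edges are carried over unchanged, and a bridging edge is added between the two copies of any scaffold whose hash appears in both networks. The tuple node labels, the default library labels, and the name `merge_networks` are assumptions made for illustration.

```python
def all_keys(net):
    """Every node key that appears in an adjacency dict, as source or target."""
    return set(net) | {v for targets in net.values() for v in targets}


def merge_networks(net1, net2, labels=("lib1", "lib2")):
    """Graph union of two hash-keyed networks plus bridging edges.

    Returns (edges, bridges). Nodes in the result are (library_label, hash)
    pairs, so each tree in the merged network can contain two node types.
    """
    edges = []
    for label, net in zip(labels, (net1, net2)):
        for u, targets in net.items():
            for v in targets:
                edges.append(((label, u), (label, v)))
    bridges = [
        ((labels[0], key), (labels[1], key))
        for key in sorted(all_keys(net1) & all_keys(net2))
    ]
    return edges, bridges


# Toy networks sharing the scaffold hash "h2" (hypothetical keys).
first = {"h1": ["h2"]}
second = {"h2": ["h3"]}
merged_edges, bridge_edges = merge_networks(first, second)
```

Keeping the library label on each node is what later allows mixing to be measured per tree. Merging a third network would repeat the same bridging step against each network already in the union.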
24. Source Forests
• Structurally similar networks
• 2659 identical nodes
• Construct union by connecting nodes with identical hash
[Figure: LOPAC and MIPE forests]
25. Merged Network
• Green edges “bridge” the two networks
• Trees can now have two types of nodes
• How can we characterize the
  – Contraction?
  – Degree of mixing?
26. Contraction to Measure Overlap
• Merging very similar libraries should generate a smaller forest compared to the original forests
• But this doesn’t really describe how the individual trees become (more) connected
$C_{norm} = \frac{|F_{12}|}{|F_1| + |F_2|}$, where $F_i = \{G_{1i}, G_{2i}, \ldots, G_{ni}\}$
[Figure: Cnorm for CL886/CL1420, MIPE/CL886, MIPE/LOPAC, MIPE/NP]
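A minimal sketch of the contraction measure: count the trees in each source forest and in the merged forest, then divide the merged count by the sum of the two source counts. Here a tree is counted as a connected component, found with a small union-find; that reading of forest size, and the names `count_trees` and `normalized_contraction`, are assumptions of this sketch.

```python
def count_trees(nodes, edges):
    """Number of connected components, treating edges as undirected."""
    parent = {n: n for n in nodes}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    for u, v in edges:
        parent[find(u)] = find(v)
    return len({find(n) for n in nodes})


def normalized_contraction(n_merged, n_first, n_second):
    """Trees in the merged forest divided by trees in the two source forests."""
    return n_merged / (n_first + n_second)


# Toy forests: one tree each, sharing the scaffold hash "h2".
nodes1 = [("lib1", "h1"), ("lib1", "h2")]
edges1 = [(("lib1", "h1"), ("lib1", "h2"))]
nodes2 = [("lib2", "h2"), ("lib2", "h3")]
edges2 = [(("lib2", "h2"), ("lib2", "h3"))]
bridge = [(("lib1", "h2"), ("lib2", "h2"))]

f1 = count_trees(nodes1, edges1)
f2 = count_trees(nodes2, edges2)
f12 = count_trees(nodes1 + nodes2, edges1 + edges2 + bridge)
c_overlap = normalized_contraction(f12, f1, f2)
c_disjoint = normalized_contraction(
    count_trees(nodes1 + nodes2, edges1 + edges2), f1, f2
)
```

With no shared scaffolds nothing contracts and the value is 1; the more trees are fused by bridging edges, the lower the value.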
27. Assortativity to Measure Overlap
• Quantifies the notion that “like connects to like”
• Undefined for trees that only have one type of vertex (i.e., only from a single library)
• The number of trees that are assortative is a global indicator of library similarity
Newman, Phys. Rev. E, 2003, 026126
[Figure: % of trees that are assortative vs. not assortative for CL886/CL1420, MIPE/CL886, MIPE/LOPAC, MIPE/NP]
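For a tree whose nodes carry a library label, the assortativity coefficient for a categorical attribute can be sketched as below. It uses the standard form r = (Σ e_ii − Σ a_i²) / (1 − Σ a_i²) for an undirected graph, where e_ij is the fraction of edge ends joining type i to type j and a_i is the fraction of edge ends attached to type i. The function name and the choice to return `None` for the undefined case are assumptions of this sketch.

```python
from collections import defaultdict


def assortativity(edges, node_type):
    """Categorical assortativity of an undirected edge list.

    `node_type` maps each node to its type (e.g., source library).
    Returns None when the coefficient is undefined (no edges, or all
    edge ends belong to a single type).
    """
    e = defaultdict(float)
    for u, v in edges:
        tu, tv = node_type[u], node_type[v]
        e[(tu, tv)] += 1.0      # count each undirected edge in both directions
        e[(tv, tu)] += 1.0
    total = sum(e.values())
    if total == 0:
        return None
    types = {t for pair in list(e) for t in pair}
    trace = sum(e.get((t, t), 0.0) for t in types) / total
    a = {t: sum(e.get((t, s), 0.0) for s in types) / total for t in types}
    expected = sum(value * value for value in a.values())
    if expected == 1.0:
        return None             # only one vertex type: undefined
    return (trace - expected) / (1.0 - expected)


types_of = {"a1": "A", "a2": "A", "b1": "B", "b2": "B"}
r_like = assortativity([("a1", "a2"), ("b1", "b2")], types_of)
r_mixed = assortativity([("a1", "b1")], types_of)
r_single = assortativity([("a1", "a2")], types_of)
```

Edges only within a type give r = 1, a single cross-type edge gives r = −1, and a tree drawn from one library returns `None`, matching the undefined case noted above.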
28. Assortativity to Measure Overlap
• We then examine the distribution of assortativity across assortative trees
• Dissimilar libraries have few assortative trees
  – But they have high values of assortativity
• However, high assortativity doesn’t imply high overlap
[Figure: density of assortativity for CL886/CL1420, MIPE/CL886, MIPE/LOPAC, MIPE/NP]
30. Overlap via Tree Complexity
• Similar libraries lead to fewer trees in the merged network, but also denser trees
• Change in density (branching) across the forest can also measure the extent of overlap
[Figure: MIPE, LOPAC, and merged forests]
31. Summarizing via Tree Complexity
• Distributions of ID before and after merging don’t differ very much, visually
• However, a KS test does discriminate them
[Figure: density of log10(ID), individual vs. merged forests; panels: CL886 / CL1420 and MIPE / NP; KS results shown: D = 0.0173, p = 1; D = 0.0582, p = .0008]
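The D values above are two-sample Kolmogorov-Smirnov statistics: the largest absolute gap between the two empirical cumulative distribution functions. A minimal sketch of that statistic follows; it does not compute the p-value, and the function name and toy samples are illustrative assumptions.

```python
from bisect import bisect_right


def ks_statistic(sample1, sample2):
    """Two-sample KS statistic: max |ECDF1(x) - ECDF2(x)| over observed x."""
    s1, s2 = sorted(sample1), sorted(sample2)
    points = sorted(set(s1) | set(s2))
    return max(
        abs(bisect_right(s1, x) / len(s1) - bisect_right(s2, x) / len(s2))
        for x in points
    )


# Toy samples standing in for log10(ID) values before and after merging.
individual = [1.0, 2.0, 3.0, 4.0]
merged = [3.0, 4.0, 5.0, 6.0]
d_value = ks_statistic(individual, merged)
```

Because D is a maximum over the whole range, a shift confined to part of the distribution can produce a significant test result even when the two density curves look almost identical.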
32. Summary
• Scaffold networks are a relatively objective way to characterize & compare libraries
  – Supports fast comparisons between libraries
• The approach supports multiplexing information into a single data structure
  – Physchem properties, bioactivities, …
• “What is a good comparison?” quickly becomes a philosophical question