Saravanan Anandathiyagar
Project Background Paper
March 2002
Supervisor: Simon Colton
A Substructure Server
Abstract
Much of the reason for the high cost of medicines is rooted in the length and
complexity of the development and approval process. At every possible stage of
development, it is possible that a potential drug (leader) will fail to gain approval
on the basis that it produces erratic results or harmful side effects.
Predictive toxicology aims to reduce the money and time spent by identifying, as early as
possible in the drug development process, leaders that are likely to fail.
Numerous machine learning techniques exist to identify such leaders. Here we
present a possible solution based on the "find a maximally specific hypothesis" (Find-S)
algorithm. Given a set of positive and negative example compounds, this algorithm
finds substructures that are statistically true of the majority of the positive compounds,
and statistically not true of the negative compounds.
A discussion of the algorithm and its motivation is presented here.
Contents
Abstract
Contents
1. Introduction
   1.1. Motivation
   1.2. Summary of Report
2. Previous Research
   2.1. Structure-Activity Relationships
   2.2. Attribute-based representations
   2.3. Relational-based representations
   2.4. Inductive logic programming
3. The Find-S Technique
   3.1. Motivation
   3.2. General-to-specific ordering of hypotheses
   3.3. The Find-S algorithm
   3.4. Algorithm evaluation methods
   3.5. Issues with the Find-S technique
   3.6. Existing Prolog implementation
4. Implementation Considerations
   4.1. Representing structures
   4.2. Improvement of current implementation
   4.3. Extensions
5. References
1. Introduction
1.1. Motivation
Each year, drug companies release new and improved drugs, claiming that they produce better
results with fewer side effects. However, the cost of such advances in the drug industry is not small.
Developing a drug from the theoretical stage to its appearance on pharmacy shelves normally takes in
the region of 10 to 15 years, at an average cost of over £500 million [1]. This outlay by the
drug company must be recouped from the consumer for the company to remain in profit, and evidence
of this can be seen, for example, in the regular rise of NHS prescription charges.
Much of the reason for the high cost of medicines is rooted in the length and complexity of the
development and approval process. At every possible stage of development, it is possible that a
potential drug (leader) will fail to gain approval on the basis that it produces erratic results or
harmful side effects. Even after promising lab tests, further experiments on animal specimens often
return ideas to the drawing board. It is estimated that for every one drug that reaches clinical
(human) trial stage, another 1000 have failed earlier testing.
Despite this, it is important to note that medicines still reduce overall medical care costs by reducing
even more expensive hospitalisation, surgery or other treatments. Drugs are the primary way of
controlling the outcomes of chronic illness. Therefore, the development of new drugs is important
for both patient care and for the positive long-term financial implications.
It is clear that reducing the number of drug leaders developed at an early stage will have a significant
effect in limiting development costs. Determining at an early stage that a leader is unsuitable for
further testing saves the investment that may otherwise have been spent on this drug, only for the
same conclusion to be reached. For this reason, the field of predictive toxicology was born. It is an
effort on the part of biotechnology companies to predict in advance whether or not a drug will be
toxic, using various techniques learnt from the fields of statistics, artificial intelligence (AI), and
machine learning.
Negative effects of a drug can range from relatively minor problems such as headaches and stomach
upsets, to potentially life-threatening organ damage. While many accepted drugs do produce some
side effects for some patients, the value of the treatment is always said to outweigh the side effects.
However, there are certain characteristics of chemical compounds that will limit their effectiveness
as a drug. Predictive toxicology aims to find this drug toxicity while still in the planning stages. Ruling
out a leader at this early stage saves it being synthesised and tested, and allows resources to be
focused on more promising areas of research.
Machine learning programs in a variety of different guises have been used to try to discover the
reasons why certain chemicals are toxic and others are not. Essentially, they learn a concept that is
true of the toxic drugs and false of the non-toxic drugs. These derived concepts are usually small
(around five or six atoms) sub-structures of the larger drug molecule, where some of the atoms are
fixed elements and others may vary.
The task at hand is to identify such sub-structures effectively and efficiently using the Find-S
(find a maximally specific hypothesis) machine learning algorithm. An implementation of the
algorithm has been written in Prolog by S. Colton; our work here is based on extending this
implementation and producing a web-based server application.
A molecule is said to be positive if it contains the sub-structure in question. Conversely, it is said to be
negative if it does not. The application will return interesting substructures given positive and negative
molecules, where each returned substructure is true of statistically significantly more positives than negatives.
1.2. Summary of Report
This report is an overview of the research undertaken, with an outline of how implementation of a
Substructure Server may proceed. Section 2 summarises the machine learning techniques used in the
field of predictive toxicology, and introduces the concepts of attribute-based and relationship-based
structure-activity relationships.
Section 3 is a comprehensive overview of the Find-S algorithm, with an emphasis on how it may
perform in a predictive toxicology situation. A fictional example is presented and analysed which
demonstrates the key methodologies of the technique. Evaluation techniques applicable to both the
algorithm itself and to the results it produces are outlined, as well as various considerations that
should be addressed on implementation. S Colton’s existing Prolog implementation of the algorithm
is also discussed.
Section 4 highlights some implementation considerations, suggesting a possible course of action
towards building a substructure server available for public use.
2. Previous Research
As was mentioned above, machine learning algorithms that find relevant sub-structures have been applied
in the field of predictive toxicology. It is important to understand the approaches that have been taken in
previous work, using them as a basis for further study.
The key features of the background study undertaken are summarised in this section.
2.1. Structure-Activity Relationships
A structure-activity relationship (SAR) models the relationship between activities and
physicochemical properties of a set of compounds [2]. The goal of our work is essentially to form
SARs from the given input molecules. These resultant SARs represent the molecules most likely
to contribute to toxicity, as calculated by our algorithm.
A SAR is derived from two components:
• The learning algorithm employed during derivation, and
• The choice of representation to describe the chemical structure of the compounds being
considered.
The learning algorithm used will rule out possible choices of representation, as the latter has to be
rich enough to support the algorithm’s procedure. SARs can store different information about
compounds, and typically such information (attributes) could consist of any of the following
chemical properties [5]:
• Partial atomic charges
• Surface area
• Volume
• H-bond donors/acceptors
• ClogP
• CMR
• pKa, pKb
• Hansch parameters π, σ, F
• Molecular grids
• Polarisability
The exact nature or meaning of each attribute type need not be discussed here. It is however
important to note that there are any number of ways of representing a compound, using any
combination of the attributes given above (and more).
2.2. Attribute-based representations
A large variety of learning techniques are in use that derive SARs of different forms. The majority of
these are based on examining the types of attributes listed above. A short summary of a few of these
techniques is presented here.
2.2.1. Linear and partial least-squares regression
Linear regression was the first learning algorithm employed in predictive toxicology, as
detailed by Hansch et al. [3]. “Training” the system involves providing suitable training
examples, which are simply saved to memory without being interpreted or compared in any
way. It is on this stored information (as explicitly provided by the user) that regression aims
to approximate its target function.
In the context of predictive toxicology, this would involve supplying examples of positive
compounds as training data. When the procedure is run on a new compound, a set of
similar compounds is retrieved from the stored values and used to classify the
new compound. The analysis of the compounds is based on chemical attributes as specified
by the algorithm; Hansch used global chemical properties of the molecule (LogP and π).
Least-squares regression is another learning technique based on the relationship between
chemical attributes. Visually, it entails forming a 'line of best fit' for a set of
training data plotted against two variables x and y, where x and y are two chemical attributes.
For any new compound encountered, a plot is made of the same two attributes; if the point
produced lies within a fixed bound of the line of best fit, then the new compound can be
deemed positive. The system can be extended to include multiple independent variables, and
to give each variable a different weight: a measure of how important each attribute is
compared with the others.
It is important to note that neither of these techniques attempts to interpret the training
data as it is fed in; all the processing to determine suitability criteria for new
compounds happens only once a new compound has been encountered.
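As a rough sketch of this least-squares approach (using invented attribute values rather than real chemical data), a line of best fit over one attribute pair can be formed and used to classify a new compound:

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = m*x + c over one attribute pair."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    m = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    return m, mean_y - m * mean_x

def classify(x, y, m, c, bound):
    """Deem a new compound positive if its point lies within `bound` of the line."""
    return abs(y - (m * x + c)) <= bound
```

Extending `fit_line` to multiple weighted independent variables gives the multiple-regression form described above.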
2.2.2. Decision trees
Decision trees classify the training data by considering each <attribute, value> pair (tuple)
for a given compound [4]. Each node in the tree specifies a test of a particular attribute, and
each branch descending from that node corresponds to a possible value for that attribute. A
compound is classified as positive or negative at the leaf nodes of the tree.
New compounds are classified by comparing their attribute values to ones stored from the
training data. An implementation of this algorithm needs to address the critical issue of which
attribute(s) to perform the test on. This decision could crucially alter the classification
schema, and is a problem inherent in trying to separate objects into discrete sets when their
behaviour or identity is given by a number of attributes. It is possible that any two attribute
values could contradict each other on a particular classification scheme, and it then becomes
necessary to impose some ordering or priority system over the attributes.
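The attribute-test structure described above can be sketched as follows; the attribute names ("logP", "charge") and their values are hypothetical, chosen only to illustrate the mechanics:

```python
# A hand-built tree over hypothetical <attribute, value> tests. Each internal
# node tests one attribute, and each branch corresponds to a possible value;
# leaves carry the final classification.
tree = {
    "attribute": "logP",
    "branches": {
        "high": {"attribute": "charge",
                 "branches": {"positive": "toxic", "negative": "non-toxic"}},
        "low": "non-toxic",
    },
}

def walk_tree(compound, node):
    """Follow the branch matching each tested attribute value until a leaf."""
    while isinstance(node, dict):
        node = node["branches"][compound[node["attribute"]]]
    return node
```

Choosing which attribute each node tests is exactly the ordering problem noted above.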
2.2.3. Neural networks
Artificial Neural Networks (ANNs) provide a general and practical method for learning
functions from examples [4], and have widespread use in AI applications. Predictive
toxicology lends itself to the use of ANNs because of how compound attributes can be
treated as <attribute, value> tuples, in a manner similar to that discussed in section 2.2.2
above. A compound can be represented by a list of such tuples covering the full range of
attributes.
The simplest form of ANN system is based on perceptrons, which take the list of tuples
and calculate a 'score' for the compound. This score is calculated from a combination of the
input tuples, and a weight associated with each attribute. The algorithm can learn from the
training data by considering the attributes of positive compounds, and can then classify
unknown compounds as positive or negative, depending on the score calculated being higher
than a defined threshold.
Practical ANN systems usually implement the more advanced backpropagation algorithm,
which learns the weights for a network of neural nodes on multiple layers. However, the
principle is the same as that used in the perceptron algorithm, with the compound score
being calculated in a non-linear manner taking more variables into account.
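The perceptron scoring just described can be sketched as follows (the weights, attribute values and threshold are invented for illustration):

```python
def perceptron_score(values, weights, bias=0.0):
    """Weighted combination of the compound's attribute values."""
    return bias + sum(w * v for w, v in zip(weights, values))

def is_positive(values, weights, threshold, bias=0.0):
    """Classify a compound as positive when its score exceeds the threshold."""
    return perceptron_score(values, weights, bias) > threshold
```

Training adjusts the weights so that positive training compounds score above the threshold and negatives below it.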
2.3. Relational-based representations
The techniques mentioned above for deriving SARs all share one key concept: they are all based on
attributes of the object (in our case, the chemical compound being examined). These attributes can be
considered to be global properties of these molecules; the molecular grid attribute, for
example, maps points in space, which are global properties of the coordinate system used.
The tuple of attributes used to represent the properties of the molecule is not an ideal
format; it is difficult to map the atoms and bonds of a molecule efficiently onto a linear list.
A more general way to describe objects is to use relations. In a relational description the basic
elements are substructures and their associations [2]. This allows the spatial representation of the
atoms within the molecule to be represented more accurately, directly and efficiently.
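The contrast between the two representation styles can be sketched as follows; the element symbols, attribute names and values here are invented for illustration:

```python
# Attribute-based: a flat vector of global properties (names and values invented).
attribute_repr = {"logP": 2.1, "volume": 140.0, "surface_area": 210.5}

# Relational: atoms and the bonds between them, preserving structure.
relational_repr = {
    "atoms": {1: "c", 2: "c", 3: "n", 4: "o"},
    "bonds": [(1, 2, "single"), (2, 3, "double"), (2, 4, "single")],
}

def neighbours(molecule, atom_id):
    """Atoms directly bonded to atom_id: a query the flat vector cannot express."""
    return sorted({b for a, b, _ in molecule["bonds"] if a == atom_id} |
                  {a for a, b, _ in molecule["bonds"] if b == atom_id})
```

A relational query such as `neighbours` has no counterpart in the flat attribute vector, and it is this kind of query that makes substructure search possible.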
2.4. Inductive logic programming
Fully relational descriptions were first used in SARs with the inductive logic programming (ILP)
learning technique, as shown in [6]. ILP algorithms are designed to learn from training examples
encoded as logical relations. ILP has been shown to significantly outperform the feature (attribute)
based induction methods described above [7].
ILP for SARs can be based on knowledge of atoms and their bond connectives within a molecule.
Using this scheme has a number of benefits:
• Simple, powerful, and can be generally applied to any SAR
• Particularly well suited to forming SARs dependent on the relationship between the atoms
in space (shape)
• Chemists can easily understand and interpret the resultant SARs as they are familiar with
relating chemical properties to groups of atoms.
The formal difference between the descriptive properties of attribute and relational SARs
corresponds to the difference between propositional and first-order logic [2]. ILP involves learning a
set of “if-then” rules for a training set, which can then be applied to unseen examples. Sets of
first-order Horn clauses can be constructed to represent the given data rules, and these can be
interpreted in the logic programming language PROLOG.
ILP differs from the attribute based techniques in two key areas. ILP can learn first-order rules that
contain variables, whereas the earlier algorithms can only accept finite ground terms for attribute
values. Further, ILP sequentially examines the data set, learning one rule at a time to incrementally
grow the final set of rules.
We stated above that relational SARs can be described by first-order predicate logic. The PROGOL
algorithm was developed [8] to allow the bottom-up induction of Horn clauses, and is implemented
in PROLOG. PROGOL uses inverse entailment to generalise a set of positive examples (active compounds)
with respect to some background knowledge: atom and bond structure data, given in the form of
Prolog facts. PROGOL will construct a set of “if-then” rules which explain the positive (and negative)
examples given.
In the case of predictive toxicology, these rules generally specify a sub-molecular structure of around
five or six atoms. These structures are those that have been calculated to contribute to toxicity,
based on their presence in the set of positive training examples, and their non-presence in the set of
negative training examples.
These sub-structures can then be matched with components of unseen compounds in an attempt to
predict toxicity.
3. The Find-S Technique
3.1. Motivation
As mentioned previously, the focus of this research topic is to use the Find-S algorithm as described
below to identify the sub-structures discussed at the end of section 2. Within the scope of
predictive toxicology, it may appear that Find-S and ILP do the same thing; however, this is not
the case. The Find-S technique differs from ILP in the motivation behind the process.
ILP looks for concepts that are true for positive examples, and false for negative examples, and
produces a sub-molecule structure as a result. The Find-S procedure, on the other hand, is given a
template (by the user) to guide its search, and the program looks for all possibilities of the general
shape in the positive inputs.
3.2. General-to-specific ordering of hypotheses
Any given problem has a predefined space of potential hypotheses [4], which we shall denote H.
Consider a target concept T, whose truth value (1 or 0) depends upon the values of three attributes,
a1, a2, and a3. Each attribute a1, a2, or a3 can take a range of discrete values, some combinations of
which will make T true, others will make T false. We denote the value x of an attribute an as v(an) =
x.
We can let each hypothesis consist of a conjunction of constraints on the attributes, i.e. take the list
of attribute values for that particular instance of the problem. This list of attributes (of length three
in this case) can be held in a vector. For each attribute an, the value v(an) will take one of the
following forms:
• ? - indicating that any value is acceptable for this attribute
• ∅ - indicating that no value is acceptable for this attribute
• a single required value for the attribute, e.g. for an attribute ‘day of week’, acceptable values
would be ‘Monday’, ‘Tuesday’ etc.
With this notation, the most general hypothesis for T is
<?, ?, ?>
which states that any assignment to any of the three attributes will result in the hypothesis being
satisfied. Conversely, the most specific hypothesis for T is
<∅, ∅, ∅>
which states that no assignment to any of the variables will ever satisfy the hypothesis.
All hypotheses within H can be represented in this way, with the majority falling somewhere
between the two above extremes of generality. Indeed, hypotheses can be ordered on their generality,
from most general to most specific instances. For example, consider the following two possible
hypotheses for T:
h1 = <x, ?, y>
h2 = <?, ?, y>
Considering the two sets of instances that are classified positive by the two hypotheses, we can say
that any instance classified positive by h1 will also be classified positive by h2, as h2 imposes fewer
constraints. We say that h2 is more general than h1.
Formally, for two hypotheses hj and hk, we can define hj to be more general than or equal to hk
(written hj ≥g hk) if and only if

(∀x ∈ X) [(hk(x) = 1) → (hj(x) = 1)]

Further, we can define hj to be (strictly) more general than hk (written hj >g hk) if and only if

(hj ≥g hk) ∧ ¬(hk ≥g hj)
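The ordering can be sketched directly over hypothesis vectors, using "?" for the any-value constraint and Python's None for the no-value constraint ∅:

```python
def satisfies(hypothesis, instance):
    """True when every constraint in the hypothesis accepts the instance's value."""
    return all(c == "?" or c == v for c, v in zip(hypothesis, instance))

def more_general_or_equal(hj, hk):
    """hj >=g hk: every constraint of hj is at least as permissive as hk's."""
    return all(cj == "?" or ck is None or cj == ck
               for cj, ck in zip(hj, hk))
```

With h1 = <x, ?, y> and h2 = <?, ?, y> as above, `more_general_or_equal(h2, h1)` holds but the converse does not.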
3.3. The Find-S algorithm
The Find-S technique orders hypotheses according to their generality as explained in the previous
section. The algorithm then starts with the most specific hypothesis h possible within H. For each
positive example it encounters in the training set, it generalises h (if needed) so that h correctly
classifies the encountered example as positive. After considering all positive training examples, the
resultant h is output. This is the most specific hypothesis in H consistent with the examined positive
examples.
The algorithm can be more formally defined as follows [4]:
1. Initialise h to the most specific hypothesis in H.
2. For each positive training instance x
For each v(ai) in h
• If v(ai) is satisfied by x
Then do nothing
• Else replace ai in h by the next more general constraint that is
satisfied by x.
3. Output hypothesis h
The procedure is run with a different starting positive each time until all positives have been
analysed. There is a question over how to measure how specific a particular hypothesis is. This is
dependent on the representation scheme, but in first-order logic, for example, a more specific
hypothesis will have more ground terms (fewer variables) in the logic sentence describing it than a
less specific hypothesis.
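A minimal sketch of the attribute-vector form of the algorithm (reserving "?" as the wildcard constraint, so it must not occur as a genuine attribute value):

```python
def find_s(positives):
    """Find-S over conjunctions of attribute constraints.

    Starts from the most specific hypothesis (no value acceptable, here None)
    and minimally generalises it to cover each positive example in turn.
    """
    h = [None] * len(positives[0])          # most specific hypothesis
    for x in positives:
        for i, value in enumerate(x):
            if h[i] is None:                # first positive fixes the value
                h[i] = value
            elif h[i] != value:             # conflict: generalise to wildcard
                h[i] = "?"
    return h
```

The substructure setting of the next section replaces the single value/wildcard step with the least general generalisation over atom triples.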
3.3.1. A simple example
An example to illustrate how the algorithm could be used in predictive toxicology is
presented below. It has been adapted from [9], and is fabricated in that the derived structure
is not a real indicator of toxicity. The example simply illustrates the algorithm's process.
Training Data
Consider the training set of seven drugs, four of which are known positives, and the
remaining three known negatives. Diagrams of these molecules are given below, with
molecules P1, P2, P3 and P4 representing positive examples, and N1, N2 and N3
representing negative ones. The atom labels (α, β, µ, and ν) are used in place of possible real
elements (e.g. N, C, H etc) to enforce the notion that the example is purely fabricated.
[Figure 1: Training set for the Find-S example. Diagrams (not reproduced here) of the positive
molecules P1, P2, P3 and P4, and the negative molecules N1, N2 and N3, drawn with atom
labels α, β, µ and ν.]
At this stage, the chemist (user) must suggest a possible template on which to base the search
for toxicity-inducing substructures. It is thought that a substructure of the form

ATOM - ATOM - ATOM

(with "-" representing a bond) contributes to toxicity. It is now the task of the algorithm to
find sub-molecules matching the structure given above which exist in as many of the positives as
possible, and in as few of the negatives as possible.
The Algorithm Procedure
To solve the problem, we use the Find-S method with the aim of producing solutions of the
form
<A, B, C>
where A, B and C are taken from the set of chemical symbols present in the molecules, i.e.
{α, β, µ, ν}. However, we also need to look for general solutions where an atom in a
particular position is not fixed. We therefore append {?} to the previous set, giving {α, β, µ,
ν, ?}.
We start off with the most specific hypothesis possible. Any final concept learned will have
to be true of at least one positive example. We use this to produce our first set of triples:
<α, β, µ> and <β, µ, ν>
These are the two substructures that exist in P1 and match the template specified.
We now check whether each of these substructures is true in the next molecule (P2). If they
are not, then we generalise the substructure such that it becomes true in P2. This
generalisation is done by introducing as few variables as possible. In doing this, we find the
least general generalisations, which then guarantees that our final answers are as specific as
possible. This expanded set of substructures is then tested on P3, and following the same
procedure, on P4.
A trace of the intermediate results produced is shown here, listing the hypothesis set held after
analysing each molecule in turn:

After P1: <α, β, µ>, <β, µ, ν>
After P2: the above, plus <α, β, ?>, <β, ?, ν>
After P3: the above, plus <?, β, µ>, <?, β, ?>, <α, ?, ?>, <β, ?, ?>, <?, ?, ν>
After P4: no change

Note that no new substructures are produced on analysis of P4: all the substructures produced after
analysis of P3 match exactly components of P4 without the need for generalisation.
Evaluation of Results
So the algorithm has now returned nine possible hypotheses for substructures that determine
toxicity. These can now be scored, based on
• How many positive molecules contain the substructure derived
• How many negatives do not contain the substructure derived
A calculation of scores is given below; each accuracy is the number of correctly classified
molecules (positives containing the substructure, plus negatives not containing it) over the
seven compounds:

Hypothesis       Accuracy
1. <α, β, µ>     43%
2. <β, µ, ν>     57%
3. <α, β, ?>     57%
4. <β, ?, ν>     86%
5. <?, β, µ>     57%
6. <?, β, ?>     57%
7. <α, ?, ?>     43%
8. <β, ?, ?>     57%
9. <?, ?, ν>     57%
It can be seen that the most accurate hypothesis derived is number four: <β, ?, ν>. This is
statistically the most frequent substructure (of the form ATOM - ATOM - ATOM)
that occurs in the positives but not in the negatives. This structure can then be used to
predict the toxicity of unseen compounds; other compounds containing a match for
hypothesis four are statistically likely to be toxic.
For a complete implementation of the algorithm, the procedure should be repeated, but this
time with P2 as the initial positive, and generalising on the others. The same should be
applied for P3 and P4 as initial positives.
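The least general generalisation and the scoring steps used in this example can be sketched as follows; molecules are treated here as sets of template-matching triples, produced by a matching step not shown:

```python
def generalise(hypothesis, triple):
    """Least general generalisation: wildcard only the positions that differ."""
    return tuple(h if h == t else "?" for h, t in zip(hypothesis, triple))

def matches(hypothesis, triple):
    """A triple matches when every fixed atom agrees and wildcards accept anything."""
    return all(h == "?" or h == t for h, t in zip(hypothesis, triple))

def accuracy(hypothesis, positives, negatives):
    """Fraction of molecules correctly classified: positives containing a
    matching triple, plus negatives containing none."""
    correct = sum(any(matches(hypothesis, t) for t in mol) for mol in positives)
    correct += sum(not any(matches(hypothesis, t) for t in mol) for mol in negatives)
    return correct / (len(positives) + len(negatives))
```

Scoring each derived hypothesis this way over the seven training molecules yields the percentages in the table above.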
3.4. Algorithm evaluation methods
On obtaining a ‘result’ from the Find-S algorithm, i.e. a hypothesis (or set of hypotheses)
representing a sub-molecule thought most likely to contribute to toxicity, it is desirable to have
some certainty that the result obtained is indeed accurate. We want the promising results obtained
with the training set to be extended to unseen examples. There is no way to guarantee the accuracy
of a hypothesis; however, there are accepted methods and measures through which a user can
become more confident in the results obtained.
In our example above, the ‘best’ hypothesis had a (predicted) accuracy of 86%, calculated by
considering the number of correctly classified positives and negatives, over the total number of
compounds analysed. However, this figure is based purely on the examples that the hypothesis has
already seen; it is not a strong indicator of accuracy for unseen examples.
3.4.1. Cross validation
One possible way of addressing this situation is to reserve some examples from the training
set, and then subsequently use these reserved examples as tests on the derived hypothesis.
The results of the hypothesis applied to the reserved examples can then be compared to their
actual categorisation, which is known as they were provided as part of the training set. This
cross validation is a standard machine learning technique, and the splitting of initial example
data into a training set and test set can give the user more confidence that the derived
hypothesis will be accurate and of use. Clearly, it can have the opposite effect, with a user
finding out that the derived hypothesis in fact performs poorly on genuinely unseen
examples.
3.4.2. K-fold cross validation
It is often important to measure the performance of the learning algorithm itself, and not
just that of a specific hypothesis. A technique to achieve this is k-fold cross
validation [4]. This involves partitioning the data into k disjoint subsets, each of equal size.
There are then k training and testing rounds, with each subset successively acting as a test
set, and the other k-1 sets as training sets. The average accuracy rate can then be calculated
from each independent test run. This technique is typically used when the number of data
objects is in the region of a few hundred, and the size of each subset is at least thirty. This
ensures that the tests provide reasonable results, as having too few test examples would
result in skewed accuracy figures.
As each round is performed independently, there is no guarantee that the hypothesis
generated on one training round will be the same as the hypothesis generated on another. It
is for this reason that the overall accuracy figures generated are representative of the
algorithm as a whole, not just one particular result.
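The partitioning scheme described above can be sketched as follows:

```python
def k_fold_splits(data, k):
    """Partition `data` into k disjoint folds; yield (training set, test set)
    pairs with each fold acting once as the test set."""
    folds = [data[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test
```

Each yielded pair gives one train/test round; averaging the accuracy over the k rounds measures the algorithm as a whole rather than any single hypothesis.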
3.5. Issues with the Find-S technique
As with all machine learning techniques, Find-S has some factors to encourage its use, and others
that make it less favourable. Some of these considerations are discussed here.
3.5.1. Guarantee of finding most specific hypothesis
As the name of the algorithm suggests, the process is guaranteed to find the most specific
hypothesis consistent with the positive training examples, within the hypothesis space. This
is because of the decisions made to select the least general generalisations when analysing
compounds.
This property can be viewed as being both advantageous and disadvantageous. It is
sometimes useful for users to know as much information about the substructure as possible,
and this may enable them to better understand the chemical reason for the molecule’s
toxicity. However, where multiple hypotheses are consistent with the training data, the algorithm still returns only the most specific, even though the others have the same statistical accuracy.
Further, it is possible that the process derives several maximally specific consistent
hypotheses [4]. To account for this possible case, we need to extend the algorithm to allow
backtracking at choice points for generalisation. This would find target concepts along a
different branch to that first explored.
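For illustration, the core generalisation loop can be sketched in the classic attribute-vector setting of Mitchell [4], a simplification of the molecular setting considered in this project. Here a hypothesis is a tuple of attribute constraints, with '?' denoting that any value is acceptable:

```python
def find_s(positives, n_attributes):
    """Find the maximally specific hypothesis consistent with the positives.

    Each example is a tuple of `n_attributes` attribute values. Negative
    examples are ignored, as in the classic Find-S algorithm [4].
    """
    # Start with the most specific hypothesis: nothing has been seen yet.
    hypothesis = None
    for example in positives:
        if hypothesis is None:
            # The first positive example becomes the initial hypothesis.
            hypothesis = list(example)
            continue
        for i in range(n_attributes):
            # Least general generalisation: relax a constraint to '?'
            # only when the positive example contradicts it.
            if hypothesis[i] != example[i]:
                hypothesis[i] = '?'
    return tuple(hypothesis) if hypothesis else None
```

Note that each constraint is relaxed only when forced by a positive example, which is the source of the maximally-specific guarantee discussed above; a backtracking extension would instead record alternative generalisations at each choice point.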
1.11.2.Overfitting
Overfitting is often thought of as the problem of an algorithm memorising its training answers rather than inducing general concepts and rules from them, and it is inherent in many machine learning techniques. A particular hypothesis is said to overfit the training examples when some other
hypothesis that fits the training examples less well, actually performs better over the whole set
of instances (i.e. including non-training set instances).
Overfitting can occur when the number of training examples is too small to provide a representative sample of the true target function. It can also occur when there are
errors in the example data, known as noise. Noise has a particularly detrimental effect on the
Find-S algorithm, as explained below.
1.11.3.Noisy data
Any non-trivial set of data taken from the real world is subject to a degree of error in its
representation. Mistakes can be made when analysing the data and categorising examples, when translating information from one form to another, and when repeated data is inconsistent with itself. In machine learning terms, such errors in the data are termed noise.
While certain algorithms are fairly robust to noise in data, the Find-S technique is inherently
not so. This is because the algorithm effectively ignores all negative examples in the training
examples. Generalisations are made to include as many positive examples as possible, but no
attempt is made to exclude negatives. This in itself is not a problem; if the data contains no
errors, then the current hypothesis can never require a revision in response to a negative
example [4]. However, the introduction of noise into the data changes this situation. It may
no longer be the case that the negative examples can simply be ignored. Find-S makes no attempt to accommodate these possible inconsistencies in the data.
1.11.4.Parallelisability
The Find-S algorithm lends itself well to a parallel, distributed implementation, which would reduce computation time. A parallel implementation could involve individual processors being allocated different initial positives; recall from above that the algorithm is only complete when hypotheses have been derived using each possible starting positive. The derivation of any
particular hypothesis from an initial positive can be run independently, and hence can be run
in parallel with other derivations.
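A minimal sketch of this scheme, assuming a hypothetical `derive_hypothesis(start, positives)` function that performs one full, independent derivation seeded with a given starting positive:

```python
from concurrent.futures import ThreadPoolExecutor

def derive_all_hypotheses(positives, derive_hypothesis, workers=4):
    """Run one independent derivation per possible starting positive.

    `derive_hypothesis(start, positives)` is a hypothetical function that
    derives a hypothesis seeded with `start`. Because each derivation is
    independent of the others, they can be executed concurrently.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Map each possible starting positive to its own derivation;
        # results come back in the order of the starting positives.
        return list(pool.map(lambda p: derive_hypothesis(p, positives),
                             positives))
```

In a genuinely distributed setting the thread pool would be replaced by separate processes or machines, but the structure of the decomposition is the same.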
1.12.Existing Prolog implementation
S. Colton has implemented an initial version of the Find-S algorithm in PROLOG. This relatively
compact program (approximately 300 lines of code) identifies substructures from a sample data set
as used by King et al. [2]. The program is guided by substructure templates, of which a few have been hard coded. It has recreated some of the results produced by the ILP system PROGOL on the sample data set considered. The program takes parameters specifying the minimum number of ground terms that must appear in a resultant hypothesis (i.e. limiting the number of variables), the minimum number of positive molecules for which a hypothesis must return TRUE, and the maximum number of negative molecules for which it may return TRUE.
An important point for discussion here is the representation of the background and structural data.
The molecules are represented as a series of facts in a PROLOG database. The representation is identical to that suggested in the section on inductive logic programming, and involves storing information about the atoms and the bonds between them. The data stored for even a single
molecule is extensive; however these PROLOG facts can be generated automatically as mentioned in
section 4.1.
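The flavour of this atom-and-bond representation can be sketched in Python; the molecule name, atom types, partial charges and bond types below are invented for illustration, mirroring the form of the PROLOG facts used by King et al. [2]:

```python
# Illustrative sketch of the atom-and-bond representation described in the
# text. Each tuple mirrors a PROLOG fact; all values here are invented.
atoms = [
    # (molecule, atom id, element, atom type, partial charge)
    ("d1", "d1_1", "c", 22, -0.117),
    ("d1", "d1_2", "c", 22, -0.117),
    ("d1", "d1_3", "h", 3, 0.142),
]
bonds = [
    # (molecule, first atom, second atom, bond type: 1=single, 2=double)
    ("d1", "d1_1", "d1_2", 2),
    ("d1", "d1_1", "d1_3", 1),
]

def atoms_bonded_to(atom_id):
    """Return the ids of atoms sharing a bond with `atom_id`."""
    return ([b for m, a, b, t in bonds if a == atom_id] +
            [a for m, a, b, t in bonds if b == atom_id])
```

Even this toy fragment suggests how quickly the fact base grows: a real molecule contributes one atom fact per atom and one bond fact per bond, which is why automatic generation of the facts is important.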
4. Implementation Considerations
The Find-S algorithm has been discussed at length as it represents the core component of a system to
identify substructures. However, the initial remit was to create a substructure server, whereby users would
be able to identify potentially interesting substructures from their positive and negative examples. As
such, other considerations need to be examined, and these are summarised here.
4.1.Representing structures
There exists a conflict between the natural user representation of chemical structures, and those that
are useful to the implemented algorithm. At some stage the users' view of structures must be translated into the computational representation (first order logic), either manually by the user, or by the implemented software as a pre-processing step before the Find-S algorithm. It is clearly more desirable from
the users’ position that this conversion is done in an automated fashion. The feasibility of this is
briefly discussed here.
Chemists are often concerned with modelling compounds, and the industry standard modelling
software is QUANTA [9]. King et al. in [2] used QUANTA editing tools to automatically map a visual
representation of a molecule into first order logic. After some suitable pre-processing, this mapped
representation could be read by their PROGOL program as a series of facts.
Another molecular simulation program, CHARMM [10], stores as data files information about the
molecule being simulated. These data files use standard naming and referencing techniques, as
described by The Protein Data Bank [11]. The structure of these flat text files is conducive to translation into other formats, once suitable schemas have been developed.
4.2.Improvement of current implementation
S Colton’s current implementation of the Find-S algorithm can serve as a basis for further work.
The algorithm could be recoded in a modern object oriented language, which would facilitate
parallelising and packaging the algorithm as a web-based application.
One key improvement that could be made is with the introduction of new search templates. These
templates guide the algorithm, restricting its search to sub-molecules matching the specified
template. Currently only a small number of templates are implemented; it is desirable that more be
available to the user.
4.3.Extensions
As more advanced work in this area, further extensions beyond those suggested above are possible.
Implementing the algorithm in parallel is one such possible extension. This would speed up the
potentially highly complex and time-consuming derivations of hypotheses.
There is also scope for the generated hypotheses to be represented in different formats. While an
answer returned in first order logic may be strictly accurate, it is unlikely to be of much use to a user with little or no knowledge of computational logic techniques. Molecular visualisation programs such as RASMOL and its successor PROTEIN EXPLORER [12] can take as input data in a format similar to that produced by QUANTA or CHARMM. It would be desirable for users to view the resultant hypotheses, with the sub-molecule derived by the algorithm presented visually.
5. References
[1] Ellis, L., Aetna InteliHealth Drug Resource Centre, From Laboratory To Pharmacy: How Drugs Are
Developed, 2002.
http://www.intelihealth.com/IH/ihtIH/WSIHW000/8124/31116/346361.html?d=dmtContent
[2] King, Ross D., Muggleton, Stephen H., Srinivasan, A. & Sternberg, Michael J.E., Structure-activity
relationships derived by machine learning: The use of atoms and their bond connectives to predict mutagenicity by
inductive logic programming (1996), Proceedings of the National Academy of Sciences (USA) 93, 438-442
[3] Hansch, C., Maloney, P. P., Fujita, T. & Muir, R. M., Correlation of Biological Activity of Phenoxyacetic
Acids with Hammett Substituent Constants and Partition Coefficients (1962). Nature (London) 194, 178-180
[4] Mitchell, T. M., Machine Learning, International Edition, 1997, McGraw-Hill
[5] Glen, B., Molecular Modelling and Molecular Informatics, University of Cambridge – Centre for Molecular
Informatics, www-ucc.ch.cam.ac.uk/colloquia/rcg-lectures/A4
[6] Muggleton, S., Inductive Logic Programming (1991), New Generation Computing 8, 295-318
[7] Srinivasan, A., Muggleton, S. H., Sternberg, M. J. E., King, R. D., Theories for mutagenicity: a study in first-
order and feature-based induction (1996), Artificial Intelligence 85(1,2), 277-299
[8] Muggleton, S., Inverse Entailment and Progol (1995), New Generation Computing 13, 245-286
[9] Colton, S. G., Lecture 11 – Overview of Machine Learning, Imperial College London, 2003.
http://www2.doc.ic.ac.uk/~sgc/teaching/341.html
[9] Quanta software, http://www.accelrys.com/quanta/, Accelrys Inc.
[10] Chemistry at HARvard Macromolecular Mechanics (CHARMM),
http://www.ch.embnet.org/MD_tutorial/pages/CHARMM.Part1.html
[11] The Protein Data Bank, http://www.rcsb.org/pdb
[12] Rasmol Home Page, http://www.umass.edu/microbio/rasmol/