09s1: COMP9417 Machine Learning and Data Mining
Introduction to Machine Learning
March 12, 2008

Acknowledgement: Material derived from slides for the book
Machine Learning, Tom Mitchell, McGraw-Hill, 1997
http://www-2.cs.cmu.edu/~tom/mlbook.html

Aims

This lecture will provide the basis for you to be able to describe the
motivation, scope and some application areas of machine learning.
Following it you should be able to:

• describe the general learning problem
• state some of the steps in setting up a learning problem
• list some applications of machine learning
• list some issues in machine learning

COMP9417: March 11, 2009 Introduction to Machine Learning: Slide 1
Overview

[Recommended reading: Mitchell, Chapter 1]
[Recommended exercises: 1.1, 1.2, optionally 1.5]

• Why Machine Learning?
• What is a well-defined learning problem?
• An example: learning to play checkers (draughts)
• What questions should we ask about Machine Learning?

Why Machine Learning

• Considerable progress in algorithms and theory
• Growing flood of online data
• Increasing computational power
• Many successful commercial/scientific applications
Three niches for machine learning

• Data mining: using historical data to improve decisions
  – medical records → medical knowledge
• Software applications we can’t program by hand
  – autonomous robots
  – speech recognition
• Self-customizing programs
  – Web sites that learn user interests

Some definitions

machine learning: the science of algorithmic methods of learning from
experience with the goal of improving performance on selected tasks

data mining: the use of machine learning or statistical algorithms to
search large amounts of data for hidden patterns or relationships that
are interesting and potentially useful
Typical Data Mining Task

Patient103:              time=1    time=2     ...  time=n
Age:                     23        23              23
FirstPregnancy:          no        no              no
Anemia:                  no        no              no
Diabetes:                no        YES             no
PreviousPrematureBirth:  no        no              no
Ultrasound:              ?         abnormal        ?
Elective C-Section:      ?         no              no
Emergency C-Section:     ?         ?               Yes
...

Given:
• 9714 patient records, each describing a pregnancy and birth
• Each patient record contains 215 features

Learn to predict:
• Classes of future patients at high risk for Emergency Cesarean Section

Data Mining Result

One of 18 learned rules:

If   No previous vaginal delivery, and
     Abnormal 2nd Trimester Ultrasound, and
     Malpresentation at admission
Then Probability of Emergency C-Section is 0.6

Over training data: 26/41 = .63
Over test data: 12/20 = .60
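A learned rule of this kind can be executed directly as a predicate over a patient's features. A minimal sketch, with illustrative feature names (the actual 215 attributes are not given in the slides):

```python
def emergency_csection_risk(patient):
    """Apply one learned rule; returns the predicted probability of an
    emergency C-section if the rule fires, otherwise None."""
    if (not patient["previous_vaginal_delivery"]
            and patient["second_trimester_ultrasound"] == "abnormal"
            and patient["malpresentation_at_admission"]):
        return 0.6  # estimated from 26/41 matching training cases
    return None

# A patient matching all three conditions:
p = {"previous_vaginal_delivery": False,
     "second_trimester_ultrasound": "abnormal",
     "malpresentation_at_admission": True}
print(emergency_csection_risk(p))  # 0.6
```

The point of a rule representation is exactly this readability: each of the 18 rules is an independently checkable conjunction of conditions.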
Credit Risk Analysis

Customer103:             (time=t0)  (time=t1)  ...  (time=tn)
Years of credit:         9          9               9
Loan balance:            $2,400     $3,250          $4,500
Income:                  $52k       ?               ?
Own House:               Yes        Yes             Yes
Other delinquent accts:  2          2               3
Max billing cycles late: 3          4               6
Profitable customer?:    ?          ?               No
...

Rules learned from synthesized data:

If   Other-Delinquent-Accounts > 2, and
     Number-Delinquent-Billing-Cycles > 1
Then Profitable-Customer? = No [Deny Credit Card application]

If   Other-Delinquent-Accounts = 0, and
     (Income > $30k) OR (Years-of-Credit > 3)
Then Profitable-Customer? = Yes [Accept Credit Card application]

Other Prediction Problems

Customer purchase behavior:

Customer103:      (time=t0)  (time=t1)  ...  (time=tn)
Sex:              M          M               M
Age:              53         53              53
Income:           $50k       $50k            $50k
Own House:        Yes        Yes             Yes
MS Products:      Word       Word            Word
Computer:         386 PC     Pentium         Pentium
Purchase Excel?:  ?          ?               Yes
...

Customer retention:

Customer103:        (time=t0)  (time=t1)  ...  (time=tn)
Sex:                M          M               M
Age:                53         53              53
Income:             $50k       $50k            $50k
Own House:          Yes        Yes             Yes
Checking:           $5k        $20k            $0
Savings:            $15k       $0              $0
Current-customer?:  yes        yes             No
...
Process optimization:

Product72:            (time=t0)  (time=t1)  ...  (time=tn)
Stage:                mix        cook            cool
Mixing-speed:         60rpm      –               –
Temperature:          –          325             –
Fan-speed:            –          –               medium
Viscosity:            1.3        3.2             1.3
Fat content:          15%        12%             12%
Density:              2.8        1.1             1.2
Spectral peak:        2800       3200            3100
Product underweight?: ??         ??              Yes
...

Tasmanian Apple Thinning

Apple orchards are important in primary production in Tasmania, and
there has been a long history in the process of apple thinning. Apples
are naturally biennial bearing: trees flower heavily one year, producing
a large crop of small fruit (called the "On" year), followed by light
flowering the next year with a small crop of large, poor-quality fruit.

Thinning is most economically done by applying sprays of chemicals that
act similarly to plant hormones and cause the abortion of flowers and
fruitlets at an early stage of development. Early thinning favours the
development of the desirable high density of cells in the fruit.

Orchardists must decide on the concentration of thinning agent at
blossom time. If the concentration is too low, thinning is not effective
and the cost of hand thinning is prohibitive. If the concentration is
too high, there is a risk of losing all the fruit. The decision is
difficult because of the large number of variables to be taken into
account.
These variables include:

• trees - cultivar, rootstock and age
• physiology - previous crop, vigour, number of blossom buds
• pruning - severity of detailed pruning, limb thinning, and penetration
  of light into the canopy
• market - size of fruit required for the market
• spraying - type of spray machinery and volume of water to be used in
  the machinery

The resulting system comprised 60 tasks (some with 50 decision tree
leaves, i.e. rule paths), plus 30 other variables and 40 procedures,
supported by a customized help file of 5,000 words.

BG Gas Drilling - "Stuck Pipe"

Drilling is a hugely expensive process, with a North Sea operation
typically incurring rig costs of around $50,000 per day. Clearly,
anything that helps to reduce the time when a drilling rig is not
productive has the potential to achieve huge savings.
Daily report data came from two databases. One was old and included
incomplete or absent data, particularly IADC (International Association
of Drilling Contractors) codes. The other was compiled more recently and
included a large amount of additional data about well-site geology,
drilling costs, etc. There were sixty recorded occurrences of Stuck Pipe
in 170 BG wells.

It proved possible to mine the data and to determine trends. Much of the
time invested by the project team was spent getting the data in good
order. Results indicate that the length of time the hole has been open,
the properties of the drilling mud, and the frequency with which the mud
is conditioned all play a significant role in the incidence of Stuck
Pipe.

Nissan - Car selection

Starting from the basic choices of 3 alternative engines, 3 types of
suspension, 2 types of transmission, 9 colours and 3 styles of seat
fabric, customers can go far further and create a car to suit their own
personality.
With 670,000 possible combinations, "it is a totally new concept," says
Takao Ohmura, Sales Manager of Tokyo Nissan Computer Systems.

A guidebook explains the options in table form, and we were able to
input these tables into XpertRule. Normally it is difficult to utilise
such a large matrix, but XpertRule was able to automatically generate a
decision tree structure to arrive at the correct model from the
attributes and values in the tables.

It met our three major requirements: (1) model selection and checking
must be completed in three minutes; (2) the ability to run on Nissan
dealers' hardware; and (3) ease of maintaining the system after the
launch of the Cefiro model.

Channel 4 TV scheduling

During the day, Channel 4's strength is the housewife market, whilst in
the evenings Channel 4's strength lies in its varied targeting ability.
In comparison with ITV, Channel 4 audiences contain a greater proportion
of younger, lighter, up-market, male viewers (audience research has also
identified Channel 4's ability to target cluster groups defined by names
such as "Progressive Priscillas" and "Free-thinking Franks").

Advertisers may specify to have commercials placed first in the break,
last in the break or "Top & Tail" in a break, making break sequencing a
challenge if optimal use of airtime is to be achieved. A knowledge-based
system to solve the problem requires observation of a number of
prioritised "rules": top of the list is the need for no overlaps or
gaps, with Top and Tail or First and Last network spots also receiving
high priority. Lower down the list are First and Last Super-macro spots
and non-reporting Super-macros sequenced to play at the same time.
Optimization problems: as the number of possible combinations grows, it
becomes impractical to try all combinations to arrive at a solution in a
reasonable time.

Rules of thumb can be used to narrow down the options but, in most
cases, good rules are not available or are difficult to capture.
Numerical optimization techniques are currently available in most
advanced spreadsheets, but these tend to be incapable of optimizing
problems involving sequencing or scheduling, and they are
"exploitation" rather than "exploration" techniques.

The solution involved the use of genetic algorithm techniques, which
allow the exploration of large search spaces for optimal or near-optimal
solutions.

Problems Too Difficult to Program by Hand

ALVINN [Pomerleau] drives 70 mph on highways!
[Figure: the ALVINN network. A 30x32 sensor input retina feeds 4 hidden
units, which feed 30 output units spanning steering directions from
Sharp Left through Straight Ahead to Sharp Right.]

Stanley - DARPA Grand Challenge Champion 2005

• won 2 million dollars (US), first team to complete 132 mile course
• modified VW Touareg R5 with drive-by-wire, took 6 hours 54 minutes,
  averaging over 19 mph
• seven Pentium M computers, GPS and various sensors
• localization, mapping and collision avoidance
Software that adapts to User

• Brin & Page - PhD students in data mining at Stanford
• PageRank algorithm (1998)
• Google business model - technology targets advertisements to users

Measuring Neural Activity

• Botros, van Dijk & Killian (2007) - Cochlear implant adjustment
• Expert system uses neural response telemetry (ECAP)
• Decision tree learning - Quinlan's C5 and Cubist
Scientific Discovery

The Robot Scientist project (2004) - Robot scientist in the lab

From the Nature paper:

"... concentration of yeast in the wells of the microtitre trays using
the adjacent plate reader and returns the results to the LIMS (although
microtitre trays are still moved in and out of incubators manually). ...
The original bioinformatic information for the AAA model was taken
mainly from the KEGG catalogue of metabolism. The model was then tested
with all possible auxotrophic experiments involving a single replacement
metabolite, and was altered manually to fit the empirical results. To
ensure that the model was not 'over-fitted', we carried out all possible
auxotrophic experiments with pairs of metabolites. The model correctly
predicted at least 98.5% of the experiments (Supplementary Information).
To the best of our knowledge, no bioinformatic model has been as
thoroughly tested with knockout mutants.

Machine learning is the branch of artificial intelligence that seeks to
develop computer systems that improve their performance automatically
with experience. It has much in common with statistics, but differs in
having a greater emphasis on algorithms, data representation and making
acquired knowledge explicit."

Figure 1: The Robot Scientist hypothesis-generation and experimentation
loop.

Nature, vol. 427, 15 January 2004, www.nature.com/nature
Where Is this Headed?

Mature algorithms:

• decision trees, regression, neural nets, Bayesian methods ...
• can be applied to standard database relations or flat files
• established software and services industry

Opportunity for tomorrow: enormous impact

• Learn across full mixed-media data
• Learn across multiple internal databases, plus the web and newsfeeds
• Learn by active experimentation
• Learn more complex functions
• Learn by analogy
• Cumulative, lifelong learning and adaptation
• Programming languages and systems with learning embedded?
Relevant Disciplines

• Artificial intelligence
• Computational complexity theory
• Statistics
• Information theory
• Bayesian methods
• Control theory
• Philosophy
• Psychology and neurobiology
• Physics
• ...

A definition of the learning problem

Learning = improving with experience at some task:

• Improve over task T,
• with respect to performance measure P,
• based on experience E.

E.g., learn to play checkers (draughts):

• T: Play checkers
• P: % of games won in world tournament
• E: opportunity to play against self
Learning to Play Checkers

• T: Play checkers
• P: Percent of games won in world tournament
• What experience?
• What exactly should be learned?
• How shall it be represented?
• What specific algorithm to learn it?

Type of Training Experience

• Direct or indirect?
• Teacher or not?

A problem: is the training experience representative of the performance
goal?
Choose the Target Function

• ChooseMove : Board → Move ??
• V : Board → ℝ ??
• ...

Possible Definition for Target Function V

• if b is a final board state that is won, then V(b) = 100
• if b is a final board state that is lost, then V(b) = −100
• if b is a final board state that is drawn, then V(b) = 0
• if b is not a final state in the game, then V(b) = V(b′), where b′ is
  the best final board state that can be achieved starting from b and
  playing optimally until the end of the game.

This gives correct values, but is not operational.
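The case definition of V can be written down directly; the non-operational part is only the last clause, which forces a search over the whole game tree. A sketch on a tiny hypothetical game tree (the tree, its names and the crude "both players favour the learner" assumption are purely illustrative; a real definition would alternate min and max for the two players):

```python
# Toy game tree: leaves are final outcomes, internal nodes list successors.
toy_tree = {
    "start": ["a", "b"],
    "a": ["won_leaf"],
    "b": ["lost_leaf", "drawn_leaf"],
}
outcomes = {"won_leaf": 100, "lost_leaf": -100, "drawn_leaf": 0}

def V(board):
    if board in outcomes:          # final board state: won/lost/drawn
        return outcomes[board]
    # Non-final state: V(b) = V(b'), b' the best reachable final state.
    # Evaluating this requires recursing to the end of the game, which is
    # exactly why the definition is correct but not operational.
    return max(V(s) for s in toy_tree[board])

print(V("start"))  # 100
```

Even on this three-move toy, evaluating the root visits every leaf; on a real checkers board the same recursion is hopeless, which motivates learning an approximation V̂ instead.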
Choose Representation for Target Function

• collection of rules?
• neural network?
• polynomial function of board features?
• ...

A Representation for Learned Function

V̂(b) = w0 + w1·bp(b) + w2·rp(b) + w3·bk(b) + w4·rk(b) + w5·bt(b) + w6·rt(b)

where:

• bp(b): number of black pieces on board b
• rp(b): number of red pieces on b
• bk(b): number of black kings on b
• rk(b): number of red kings on b
• bt(b): number of red pieces threatened by black (i.e., which can be
  taken on black's next turn)
• rt(b): number of black pieces threatened by red
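The linear representation is simple enough to sketch directly. Here a board is replaced by its precomputed feature values (the weight values are illustrative, not learned; real feature extractors bp, rp, ... would compute these from an actual board):

```python
def v_hat(weights, features):
    """Evaluate w0 + w1*bp + w2*rp + w3*bk + w4*rk + w5*bt + w6*rt."""
    names = ["bp", "rp", "bk", "rk", "bt", "rt"]
    return weights[0] + sum(w * features[n]
                            for w, n in zip(weights[1:], names))

# Illustrative weights and a position with equal material where black
# threatens one red piece:
w = [0.0, 1.0, -1.0, 3.0, -3.0, 0.5, -0.5]
b = {"bp": 12, "rp": 12, "bk": 0, "rk": 0, "bt": 1, "rt": 0}
print(v_hat(w, b))  # 0.5
```

The representation choice fixes the hypothesis space: learning now reduces to finding seven numbers, at the cost of only being able to express board values that are linear in these six features.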
Obtaining Training Examples

• V(b): the true target function
• V̂(b): the learned function
• Vtrain(b): the training value

One rule for estimating training values:

• Vtrain(b) ← V̂(Successor(b))

Choose Weight Tuning Rule

LMS weight update rule. Do repeatedly:

• Select a training example b at random
1. Compute error(b):
   error(b) = Vtrain(b) − V̂(b)
2. For each board feature fi, update weight wi:
   wi ← wi + c · fi · error(b)

c is some small constant, say 0.1, to moderate the rate of learning.
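The LMS rule above can be sketched as a short training loop. The two-feature training examples here are invented purely to show the weights being pulled toward the training values; in the checkers setting the features would come from a board and Vtrain(b) from V̂(Successor(b)):

```python
import random

def v_hat(w, f):
    """Linear hypothesis: w[0] is the bias, w[1:] the feature weights."""
    return w[0] + sum(wi * fi for wi, fi in zip(w[1:], f))

def lms_update(w, f, v_train, c=0.1):
    err = v_train - v_hat(w, f)          # 1. compute error(b)
    w[0] += c * err                      # bias term (its feature is 1)
    for i, fi in enumerate(f, start=1):  # 2. w_i <- w_i + c * f_i * error(b)
        w[i] += c * fi * err
    return w

# Repeatedly pick a random example and update; predictions converge
# toward the training values.
examples = [([1.0, 0.0], 3.0), ([0.0, 1.0], -2.0)]
w = [0.0, 0.0, 0.0]
random.seed(0)
for _ in range(200):
    f, v = random.choice(examples)
    w = lms_update(w, f, v)
print(v_hat(w, [1.0, 0.0]), v_hat(w, [0.0, 1.0]))
```

Each update moves the weights a small step (scaled by c) in the direction that reduces the squared error on the selected example, which is why a small c moderates the rate of learning.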
Design Choices

[Figure: the sequence of design choices, with the alternatives
considered at each step]

Determine Type of Training Experience
  (Games against experts | Games against self | Table of correct
  moves | ...)
Determine Target Function
  (Board → move | Board → value | ...)
Determine Representation of Learned Function
  (Polynomial | Linear function of six features | Artificial neural
  network | ...)
Determine Learning Algorithm
  (Gradient descent | Linear programming | ...)
Completed Design

Some Issues in Machine Learning

• What algorithms can approximate functions well (and when)?
• How does the number of training examples influence accuracy?
• How does the complexity of hypothesis representation impact it?
• How does noisy data influence accuracy?
• What are the theoretical limits of learnability?
• How can prior knowledge of the learner help?
• What clues can we get from biological learning systems?
• How can systems alter their own representations?