2. - The Course
[Course architecture diagram: data sources (DS) feed a staging database (DP), which loads the data warehouse (DW); the warehouse supports OLAP and data mining (DM) tasks such as association, classification, and clustering.]
DS = Data source
DW = Data warehouse
DM = Data Mining
DP = Staging Database
3. Chapter Objectives
Learn basic techniques for data classification
and prediction.
Realize the difference between the following
classifications of data:
– supervised classification
– prediction
– unsupervised classification
4. Chapter Outline
What is classification and prediction of data?
How do we classify data by decision tree induction?
What are neural networks and how can they classify?
What is Bayesian classification?
Are there other classification techniques?
How do we predict continuous values?
5. What is Classification?
The goal of data classification is to organize and
categorize data in distinct classes.
– A model is first created based on the data
distribution.
– The model is then used to classify new data.
– Given the model, a class can be predicted for new
data.
Classification = prediction for discrete and nominal
values
6. What is Prediction?
The goal of prediction is to forecast or deduce the value of an
attribute based on values of other attributes.
– A model is first created based on the data distribution.
– The model is then used to predict future or unknown values
In Data Mining:
– If forecasting a discrete value → Classification
– If forecasting a continuous value → Prediction
7. Supervised and Unsupervised
Supervised Classification = Classification
– We know the class labels and the number of
classes
Unsupervised Classification = Clustering
– We do not know the class labels and may not
know the number of classes
8. Preparing Data Before Classification
Data transformation:
– Discretization of continuous data
– Normalization to [-1..1] or [0..1]
Data Cleaning:
– Smoothing to reduce noise
Relevance Analysis:
– Feature selection to eliminate irrelevant attributes
9. Application
Credit approval
Target marketing
Medical diagnosis
Defective parts identification in manufacturing
Crime zoning
Treatment effectiveness analysis
Etc
10. Classification is a 3-step process
1. Model construction (Learning):
• Each tuple is assumed to belong to a predefined class, as
determined by one of the attributes, called the class label.
• The set of all tuples used for construction of the model is
called training set.
– The model is represented in the following forms:
• Classification rules, (IF-THEN statements),
• Decision tree
• Mathematical formulae
11. 1. Classification Process (Learning)
Training data (class = Credit rating):
Name  | Income | Age      | Credit rating
Samir | Low    | <30      | bad
Ahmed | Medium | [30..40] | good
Salah | High   | <30      | good
Ali   | Medium | >40      | good
Sami  | Low    | [30..40] | good
Emad  | Medium | <30      | bad
A classification method builds a classification model from the training data, e.g. the rule
IF Income = 'High' OR Age > 30 THEN Class = 'Good'
OR a decision tree, OR a mathematical formula.
12. Classification is a 3-step process
2. Model Evaluation (Accuracy):
– Estimate accuracy rate of the model based on a test set.
– The known label of test sample is compared with the
classified result from the model.
– Accuracy rate is the percentage of test set samples that are
correctly classified by the model.
– Test set is independent of training set otherwise over-fitting
will occur
13. 2. Classification Process (Accuracy Evaluation)
Test data (class = Credit rating) and the model's predictions:
Name  | Income | Age      | Credit rating | Model prediction
Naser | Low    | <30      | Bad           | Bad
Lutfi | Medium | <30      | Bad           | good
Adel  | High   | >40      | good          | good
Fahd  | Medium | [30..40] | good          | good
Model accuracy = 75% (three of the four test samples are classified correctly)
14. Classification is a three-step process
3. Model Use (Classification):
– The model is used to classify unseen objects.
• Give a class label to a new tuple
• Predict the value of an actual attribute
16. Classification Methods
Decision Tree Induction
Neural Networks
Bayesian Classification
Association-Based Classification
K-Nearest Neighbour
Case-Based Reasoning
Genetic Algorithms
Rough Set Theory
Fuzzy Sets
Etc.
17. Evaluating Classification Methods
Predictive accuracy
– Ability of the model to correctly predict the class label
Speed
– Time to construct the model
– Time to use the model
Robustness
– Handling noise and missing values
Scalability
– Efficiency in large databases (not memory resident data)
Interpretability:
– The level of understanding and insight provided by the
model
18. Chapter Outline
What is classification and prediction of data?
How do we classify data by decision tree induction?
What are neural networks and how can they
classify?
What is Bayesian classification?
Are there other classification techniques?
How do we predict continuous values?
20. What is a Decision Tree?
A decision tree is a flow-chart-like tree structure.
– Internal node denotes a test on an attribute
– Branch represents an outcome of the test
• All tuples in branch have the same value for the tested
attribute.
Leaf node represents class label or class label
distribution
21. Sample Decision Tree
[Scatter plot of customers by Income (2000–10000) and Age (20–80), showing excellent customers vs. fair customers.]
Decision tree splitting on Income:
– Income < 6K → No
– Income >= 6K → Yes
22. Sample Decision Tree
[Same scatter plot; the classes are now separated using both Income and Age.]
Decision tree:
– Income < 6K → No
– Income >= 6K → split on Age:
  • Age < 50 → No
  • Age >= 50 → Yes
23. Sample Decision Tree
Outlook  | Temp | Humidity | Windy | Play?
sunny    | hot  | high     | FALSE | No
sunny    | hot  | high     | TRUE  | No
overcast | hot  | high     | FALSE | Yes
rainy    | mild | high     | FALSE | Yes
rainy    | cool | normal   | FALSE | Yes
rainy    | cool | normal   | TRUE  | No
overcast | cool | normal   | TRUE  | Yes
sunny    | mild | high     | FALSE | No
sunny    | cool | normal   | FALSE | Yes
rainy    | mild | normal   | FALSE | Yes
sunny    | mild | normal   | TRUE  | Yes
overcast | mild | high     | TRUE  | Yes
overcast | hot  | normal   | FALSE | Yes
rainy    | mild | high     | TRUE  | No
http://www-lmmb.ncifcrf.gov/~toms/paper/primer/latex/index.html
http://directory.google.com/Top/Science/Math/Applications/Information_Theory/Papers/
24. Decision-Tree Classification Methods
The basic top-down decision tree generation
approach usually consists of two phases:
1. Tree construction
• At the start, all the training examples are at the root.
• Examples are partitioned recursively based on selected
attributes.
2. Tree pruning
• Aims at removing tree branches that may reflect noise
in the training data and lead to errors when classifying
test data, thereby improving classification accuracy
25. How to Specify Test Condition?
Depends on attribute types
– Nominal
– Ordinal
– Continuous
Depends on number of ways to split
– 2-way split
– Multi-way split
26. Splitting Based on Nominal Attributes
Multi-way split: Use as many partitions as distinct values.
– Example: CarType → {Family}, {Sports}, {Luxury}
Binary split: Divides values into two subsets.
Need to find optimal partitioning.
– Example: CarType → {Sports, Luxury} vs. {Family},
  OR {Family, Luxury} vs. {Sports}
27. Splitting Based on Ordinal Attributes
Multi-way split: Use as many partitions as distinct values.
– Example: Size → {Small}, {Medium}, {Large}
Binary split: Divides values into two subsets.
Need to find optimal partitioning.
– Example: Size → {Small, Medium} vs. {Large},
  OR {Medium, Large} vs. {Small}
What about the split Size → {Small, Large} vs. {Medium}?
(It does not preserve the order of the ordinal attribute.)
28. Splitting Based on Continuous Attributes
Different ways of handling
– Discretization to form an ordinal categorical
attribute
• Static – discretize once at the beginning
• Dynamic – ranges can be found by equal
interval bucketing, equal frequency bucketing
(percentiles), or clustering.
– Binary Decision: (A < v) or (A ≥ v)
• considers all possible splits and finds the best cut
• can be more compute intensive
30. Tree Induction
Greedy strategy.
– Split the records based on an attribute test that
optimizes a certain criterion.
Issues
– Determine how to split the records
• How to specify the attribute test condition?
• How to determine the best split?
– Determine when to stop splitting
31. How to determine the Best Split
[Before splitting: a node containing both good customers and fair customers.
Candidate splits: Income (<10k vs. >=10k) or Age (young vs. old).
Which test condition separates the classes best?]
32. How to determine the Best Split
Greedy approach:
– Nodes with homogeneous class distribution are
preferred
Need a measure of node impurity:
High degree of impurity: 50% red, 50% green
Low degree of impurity: 75% red, 25% green
Pure: 100% red, 0% green
33. Measures of Node Impurity
Information gain
– Uses Entropy
Gain Ratio
– Uses information gain and SplitInfo
Gini Index
– Used only for binary splits
34. Algorithm for Decision Tree Induction
Basic algorithm (a greedy algorithm)
– Tree is constructed in a top-down recursive divide-and-conquer
manner
– At start, all the training examples are at the root
– Attributes are categorical (if continuous-valued, they are discretized
in advance)
– Examples are partitioned recursively based on selected attributes
– Test attributes are selected on the basis of a heuristic or statistical
measure (e.g., information gain)
Conditions for stopping partitioning
– All samples for a given node belong to the same class
– There are no remaining attributes for further partitioning – majority
voting is employed for classifying the leaf
– There are no samples left
36. Entropy: Used by ID3
Entropy(S) = - p log2 p - q log2 q
Entropy measures the impurity of S
S is a set of examples
p is the proportion of positive examples
q is the proportion of negative examples
37. ID3
Training data (play / don't play):
outlook  | temperature | humidity | windy | play
sunny    | hot  | high   | FALSE | no
sunny    | hot  | high   | TRUE  | no
overcast | hot  | high   | FALSE | yes
rainy    | mild | high   | FALSE | yes
rainy    | cool | normal | FALSE | yes
rainy    | cool | normal | TRUE  | no
overcast | cool | normal | TRUE  | yes
sunny    | mild | high   | FALSE | no
sunny    | cool | normal | FALSE | yes
rainy    | mild | normal | FALSE | yes
sunny    | mild | normal | TRUE  | yes
overcast | mild | high   | TRUE  | yes
overcast | hot  | normal | FALSE | yes
rainy    | mild | high   | TRUE  | no
pyes = 9/14, pno = 5/14
Impurity = - pyes log2 pyes - pno log2 pno
         = - 9/14 log2 9/14 - 5/14 log2 5/14
         = 0.94 bits
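To make the arithmetic concrete, here is a minimal sketch in Python (standard library only; the function name `entropy` and the count-list representation are illustrative, not from the slides) that reproduces the 0.94-bit impurity from the 9/5 class counts.

```python
from math import log2

def entropy(counts):
    """Entropy of a class distribution given as a list of class counts."""
    total = sum(counts)
    probs = [c / total for c in counts if c > 0]  # skip empty classes (0 log 0 = 0)
    return -sum(p * log2(p) for p in probs)

# Weather data: 9 "yes" and 5 "no" examples
print(round(entropy([9, 5]), 2))  # 0.94 bits
```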
38. ID3
The impurity of the full training set is 0.94 bits. For each candidate attribute,
compute the expected information (the weighted average of the child-node
entropies, i.e., the amount of information required to specify the class of an
example given that it reaches that node) and the resulting information gain.
Class counts (play / don't play) per attribute value:
– outlook: sunny 2/3, overcast 4/0, rainy 3/2
– humidity: high 3/4, normal 6/1
– temperature: hot 2/2, mild 4/2, cool 3/1
– windy: false 6/2, true 3/3
Child-node entropies, weights, expected information, and gain:
– outlook: 0.97 (×5/14), 0.0 (×4/14), 0.97 (×5/14) → 0.69 bits; gain: 0.25 bits
– humidity: 0.98 (×7/14), 0.59 (×7/14) → 0.79 bits; gain: 0.15 bits
– temperature: 1.0 (×4/14), 0.92 (×6/14), 0.81 (×4/14) → 0.91 bits; gain: 0.03 bits
– windy: 0.81 (×8/14), 1.0 (×6/14) → 0.89 bits; gain: 0.05 bits
Outlook gives the maximal information gain, so the root node is split on outlook.
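The per-attribute gains above can be reproduced with a short script. This is an illustrative sketch, assuming plain Python and my own helper names (`entropy`, `info_gain`); it is not the slides' code.

```python
from math import log2
from collections import Counter

# (outlook, temperature, humidity, windy, play) tuples from the weather table
data = [
    ("sunny", "hot", "high", False, "no"),      ("sunny", "hot", "high", True, "no"),
    ("overcast", "hot", "high", False, "yes"),  ("rainy", "mild", "high", False, "yes"),
    ("rainy", "cool", "normal", False, "yes"),  ("rainy", "cool", "normal", True, "no"),
    ("overcast", "cool", "normal", True, "yes"),("sunny", "mild", "high", False, "no"),
    ("sunny", "cool", "normal", False, "yes"),  ("rainy", "mild", "normal", False, "yes"),
    ("sunny", "mild", "normal", True, "yes"),   ("overcast", "mild", "high", True, "yes"),
    ("overcast", "hot", "normal", False, "yes"),("rainy", "mild", "high", True, "no"),
]
attrs = {"outlook": 0, "temperature": 1, "humidity": 2, "windy": 3}

def entropy(labels):
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def info_gain(rows, attr_index):
    labels = [r[-1] for r in rows]
    before = entropy(labels)
    # expected information after splitting on the attribute
    after = 0.0
    for v in set(r[attr_index] for r in rows):
        subset = [r[-1] for r in rows if r[attr_index] == v]
        after += len(subset) / len(rows) * entropy(subset)
    return before - after

for name, idx in attrs.items():
    print(f"gain({name}) = {info_gain(data, idx):.2f} bits")
# outlook ≈ 0.25, temperature ≈ 0.03, humidity ≈ 0.15, windy ≈ 0.05
```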
39. ID3
The root split on outlook gives branches sunny, overcast, and rainy. The sunny
branch (entropy 0.97 bits) contains five tuples:
sunny | hot  | high   | FALSE | no
sunny | hot  | high   | TRUE  | no
sunny | mild | high   | FALSE | no
sunny | cool | normal | FALSE | yes
sunny | mild | normal | TRUE  | yes
Expected information and gain for each remaining attribute on this branch:
– humidity: high 0.0 (×3/5), normal 0.0 (×2/5) → 0.0 bits; gain: 0.97 bits
– temperature: hot 0.0 (×2/5), mild 1.0 (×2/5), cool 0.0 (×1/5) → 0.40 bits; gain: 0.57 bits
– windy: false 0.92 (×3/5), true 1.0 (×2/5) → 0.95 bits; gain: 0.02 bits
Humidity gives the maximal information gain, so the sunny branch is split on humidity.
40. ID3
The rainy branch (entropy 0.97 bits) contains five tuples:
rainy | mild | high   | FALSE | yes
rainy | cool | normal | FALSE | yes
rainy | cool | normal | TRUE  | no
rainy | mild | normal | FALSE | yes
rainy | mild | high   | TRUE  | no
Expected information and gain for each remaining attribute on this branch:
– humidity: high 1.0 (×2/5), normal 0.92 (×3/5) → 0.95 bits; gain: 0.02 bits
– temperature: hot ∅, mild 0.92 (×3/5), cool 1.0 (×2/5) → 0.95 bits; gain: 0.02 bits
– windy: false 0.0 (×3/5), true 0.0 (×2/5) → 0.0 bits; gain: 0.97 bits
Windy gives the maximal information gain, so the rainy branch is split on windy.
41. ID3
The overcast branch is already pure (all four overcast tuples are "yes"), so the
final decision tree for the training data is:
outlook
– sunny → humidity
  • high → No
  • normal → Yes
– overcast → Yes
– rainy → windy
  • false → Yes
  • true → No
42. C4.5
Information gain measure is biased towards attributes with a large
number of values
C4.5 (a successor of ID3) uses gain ratio to overcome the problem
(normalization to information gain)
– GainRatio(A) = Gain(A)/SplitInfo(A)
$$\mathrm{SplitInfo}_A(D) = -\sum_{j=1}^{v} \frac{|D_j|}{|D|} \times \log_2\!\left(\frac{|D_j|}{|D|}\right)$$
Ex.
$$\mathrm{SplitInfo}_A(D) = -\frac{5}{14}\log_2\!\left(\frac{5}{14}\right) - \frac{4}{14}\log_2\!\left(\frac{4}{14}\right) - \frac{5}{14}\log_2\!\left(\frac{5}{14}\right) = 0.926$$
– gain_ratio(income) = 0.029/0.926 = 0.031
The attribute with the maximum gain ratio is selected as the
splitting attribute
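A small hedged sketch of the gain-ratio computation (plain Python; `split_info` and `gain_ratio` are illustrative names, and the example partition sizes are my own, not the slides').

```python
from math import log2

def split_info(partition_sizes):
    """SplitInfo of a split that produces partitions of the given sizes."""
    total = sum(partition_sizes)
    return -sum((s / total) * log2(s / total) for s in partition_sizes if s > 0)

def gain_ratio(gain, partition_sizes):
    """GainRatio(A) = Gain(A) / SplitInfo(A)."""
    return gain / split_info(partition_sizes)

# Illustrative: a balanced two-way split of 14 tuples has SplitInfo = 1 bit
print(split_info([7, 7]))             # 1.0
print(gain_ratio(0.029, [7, 7]))      # gain ratio for an attribute with Gain(A) = 0.029
```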
43. CART
If a data set D contains examples from n classes, the gini index,
gini(D), is defined as
$$gini(D) = 1 - \sum_{j=1}^{n} p_j^2$$
where pj is the relative frequency of class j in D
If a data set D is split on A into two subsets D1 and D2, the gini
index gini_A(D) is defined as
$$gini_A(D) = \frac{|D_1|}{|D|}\, gini(D_1) + \frac{|D_2|}{|D|}\, gini(D_2)$$
Reduction in impurity:
$$\Delta gini(A) = gini(D) - gini_A(D)$$
The attribute that provides the smallest gini_split(D) (or the largest
reduction in impurity) is chosen to split the node (need to
enumerate all the possible splitting points for each attribute)
44. CART
Ex. D has 9 tuples in buys_computer = "yes" and 5 in "no"
$$gini(D) = 1 - \left(\frac{9}{14}\right)^2 - \left(\frac{5}{14}\right)^2 = 0.459$$
Suppose the attribute income partitions D into 10 tuples in D1: {low, medium}
and 4 tuples in D2: {high}
$$gini_{income \in \{low,\,medium\}}(D) = \frac{10}{14}\, gini(D_1) + \frac{4}{14}\, gini(D_2)$$
but gini_{medium,high} is 0.30 and thus the best since it is the lowest
All attributes are assumed continuous-valued
May need other tools, e.g., clustering, to get the possible split
values
Can be modified for categorical attributes
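A brief sketch of the Gini computations (plain Python, illustrative names). The overall 9/5 class counts come from the slide; the child-node class counts used for the split are illustrative assumptions.

```python
def gini(counts):
    """Gini index of a node given its class counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def gini_split(partitions):
    """Weighted gini index of a split; `partitions` is one class-count list per child."""
    total = sum(sum(p) for p in partitions)
    return sum(sum(p) / total * gini(p) for p in partitions)

# 9 "yes" / 5 "no" tuples overall
print(round(gini([9, 5]), 3))                      # 0.459
# Hypothetical binary split: one child with 7 yes / 3 no, the other with 2 yes / 2 no
print(round(gini_split([[7, 3], [2, 2]]), 3))
```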
45. Comparing Attribute Selection Measures
The three measures, in general, return good results but
– Information gain:
• biased towards multivalued attributes
– Gain ratio:
• tends to prefer unbalanced splits in which one partition is
much smaller than the others
– Gini index:
• biased to multivalued attributes
• has difficulty when # of classes is large
• tends to favor tests that result in equal-sized partitions
and purity in both partitions
46. Other Attribute Selection Measures
CHAID: a popular decision tree algorithm, measure based on χ2 test for
independence
C-SEP: performs better than info. gain and gini index in certain cases
G-statistics: has a close approximation to χ2 distribution
MDL (Minimal Description Length) principle (i.e., the simplest solution
is preferred):
– The best tree as the one that requires the fewest # of bits to both
(1) encode the tree, and (2) encode the exceptions to the tree
Multivariate splits (partition based on multiple variable combinations)
– CART: finds multivariate splits based on a linear comb. of attrs.
Which attribute selection measure is the best?
– Most give good results, none is significantly superior to the others
47. Underfitting and Overfitting
[Figure: training and test error curves, with the region of rising test error labeled "Overfitting".]
Underfitting: when the model is too simple, both training and
test errors are large
49. Overfitting due to Insufficient Examples
Lack of data points in the lower half of the diagram makes it difficult
to predict correctly the class labels of that region
- Insufficient number of training records in the region causes the
decision tree to predict the test examples using other training
records that are irrelevant to the classification task
50. Two approaches to avoid Overfitting
Prepruning:
– Halt tree construction early—do not split a node if this would result
in the goodness measure falling below a threshold
– Difficult to choose an appropriate threshold
Postpruning:
– Remove branches from a “fully grown” tree—get a sequence of
progressively pruned trees
– Use a set of data different from the training data to decide
which is the “best pruned tree”
51. Scalable Decision Tree Induction Methods
ID3, C4.5, and CART are not efficient when the training set
doesn’t fit the available memory. Instead the following algorithms
are used
– SLIQ
• Builds an index for each attribute and only class list and
the current attribute list reside in memory
– SPRINT
• Constructs an attribute list data structure
– RainForest
• Builds an AVC-list (attribute, value, class label)
– BOAT
• Uses bootstrapping to create several small samples
52. BOAT
BOAT (Bootstrapped Optimistic Algorithm for Tree
Construction)
– Use a statistical technique called bootstrapping to create several
smaller samples (subsets), each fits in memory
– Each subset is used to create a tree, resulting in several trees
– These trees are examined and used to construct a new tree T’
• It turns out that T’ is very close to the tree that would be
generated using the whole data set together
– Adv: requires only two scans of DB, an incremental alg.
53. Why decision tree induction in data mining?
Relatively faster learning speed (than other
classification methods)
Convertible to simple and easy to understand
classification rules
Comparable classification accuracy with other
methods
54. Converting Tree to Rules
Decision tree:
– Outlook = Sunny → Humidity: High → No, Normal → Yes
– Outlook = Overcast → Yes
– Outlook = Rain → Wind: Strong → No, Weak → Yes
R1: IF (Outlook=Sunny) AND (Humidity=High) THEN Play=No
R2: IF (Outlook=Sunny) AND (Humidity=Normal) THEN Play=Yes
R3: IF (Outlook=Overcast) THEN Play=Yes
R4: IF (Outlook=Rain) AND (Wind=Strong) THEN Play=No
R5: IF (Outlook=Rain) AND (Wind=Weak) THEN Play=Yes
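As a hedged illustration of the same tree-to-rules idea with a library, the sketch below assumes scikit-learn and pandas are available; the one-hot encoding and parameter choices are mine, not the slides'. It fits a decision tree to the weather data with the entropy criterion and prints it as nested IF-THEN-style rules.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Weather data from the slides
rows = [
    ("sunny", "hot", "high", "FALSE", "No"),     ("sunny", "hot", "high", "TRUE", "No"),
    ("overcast", "hot", "high", "FALSE", "Yes"), ("rainy", "mild", "high", "FALSE", "Yes"),
    ("rainy", "cool", "normal", "FALSE", "Yes"), ("rainy", "cool", "normal", "TRUE", "No"),
    ("overcast", "cool", "normal", "TRUE", "Yes"),("sunny", "mild", "high", "FALSE", "No"),
    ("sunny", "cool", "normal", "FALSE", "Yes"), ("rainy", "mild", "normal", "FALSE", "Yes"),
    ("sunny", "mild", "normal", "TRUE", "Yes"),  ("overcast", "mild", "high", "TRUE", "Yes"),
    ("overcast", "hot", "normal", "FALSE", "Yes"),("rainy", "mild", "high", "TRUE", "No"),
]
df = pd.DataFrame(rows, columns=["Outlook", "Temp", "Humidity", "Windy", "Play"])

# One-hot encode the nominal attributes so the tree can split on them
X = pd.get_dummies(df[["Outlook", "Temp", "Humidity", "Windy"]])
y = df["Play"]

tree = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, y)
# Print the tree as indented if/else conditions (one rule per root-to-leaf path)
print(export_text(tree, feature_names=list(X.columns)))
```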
57. Basic Statistics
Assume
• D = All students, |D| = 100
• X = ICS students, |X| = 10
• C = SWE students, |C| = 20
• 4 students are in both X and C
P(X) = 10/100        P(X|C) = P(X,C)/P(C) = 4/20
P(C) = 20/100        P(C|X) = P(X,C)/P(X) = 4/10
P(X,C) = 4/100
P(X,C) = P(C|X)·P(X) = P(X|C)·P(C)
58. Bayesian Classifier – Basic Equation
P(X,C) = P(C|X)·P(X) = P(X|C)·P(C)
$$P(C \mid X) = \frac{P(C)\, P(X \mid C)}{P(X)}$$
– P(C|X): class posterior probability
– P(C): class prior probability
– P(X|C): descriptor posterior probability
– P(X): descriptor prior probability
59. Naive Bayesian Classifier
$$P(C \mid X) = \frac{P(C)\, P(X \mid C)}{P(X)}$$
With the independence assumption about the descriptors, for each class $C_i$ ($i = 1, \dots, m$):
$$P(C_i \mid X) = \frac{P(C_i)}{P(X)}\, P(x_1 \mid C_i)\, P(x_2 \mid C_i)\, P(x_3 \mid C_i) \cdots P(x_n \mid C_i)$$
60. Training Data
Outlook  | Temp | Humidity | Windy | Play?
sunny    | hot  | high     | FALSE | No
sunny    | hot  | high     | TRUE  | No
overcast | hot  | high     | FALSE | Yes
rainy    | mild | high     | FALSE | Yes
rainy    | cool | normal   | FALSE | Yes
rainy    | cool | normal   | TRUE  | No
overcast | cool | normal   | TRUE  | Yes
sunny    | mild | high     | FALSE | No
sunny    | cool | normal   | FALSE | Yes
rainy    | mild | normal   | FALSE | Yes
sunny    | mild | normal   | TRUE  | Yes
overcast | mild | high     | TRUE  | Yes
overcast | hot  | normal   | FALSE | Yes
rainy    | mild | high     | TRUE  | No
P(yes) = 9/14
P(no) = 5/14
61. Bayesian Classifier – Probabilities for the weather data
Frequency and likelihood tables (count, then count / class total):
Outlook  | No      | Yes
Sunny    | 3 (3/5) | 2 (2/9)
Overcast | 0 (0/5) | 4 (4/9)
Rainy    | 2 (2/5) | 3 (3/9)
Temp. | No      | Yes
Hot   | 2 (2/5) | 2 (2/9)
Mild  | 2 (2/5) | 4 (4/9)
Cool  | 1 (1/5) | 3 (3/9)
Humidity | No      | Yes
High     | 4 (4/5) | 3 (3/9)
Normal   | 1 (1/5) | 6 (6/9)
Windy | No      | Yes
False | 2 (2/5) | 6 (6/9)
True  | 3 (3/5) | 3 (3/9)
62. Bayesian Classifier – Predicting a new day
New day X: Outlook = sunny, Temp. = cool, Humidity = high, Windy = true. Class?
P(yes|X) = p(sunny|yes) x p(cool|yes) x p(high|yes) x p(true|yes) x p(yes)
         = 2/9 x 3/9 x 3/9 x 3/9 x 9/14 = 0.0053  =>  0.0053/(0.0053+0.0206) = 0.205
P(no|X)  = p(sunny|no) x p(cool|no) x p(high|no) x p(true|no) x p(no)
         = 3/5 x 1/5 x 4/5 x 3/5 x 5/14 = 0.0206  =>  0.0206/(0.0053+0.0206) = 0.795
The class with the higher posterior, "no", is predicted.
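The hand computation above can be reproduced with a short script. A minimal sketch, assuming plain Python (the data encoding and function names are illustrative):

```python
from collections import Counter, defaultdict

# Weather data: (outlook, temp, humidity, windy) -> play
data = [
    (("sunny", "hot", "high", "false"), "no"),     (("sunny", "hot", "high", "true"), "no"),
    (("overcast", "hot", "high", "false"), "yes"), (("rainy", "mild", "high", "false"), "yes"),
    (("rainy", "cool", "normal", "false"), "yes"), (("rainy", "cool", "normal", "true"), "no"),
    (("overcast", "cool", "normal", "true"), "yes"),(("sunny", "mild", "high", "false"), "no"),
    (("sunny", "cool", "normal", "false"), "yes"), (("rainy", "mild", "normal", "false"), "yes"),
    (("sunny", "mild", "normal", "true"), "yes"),  (("overcast", "mild", "high", "true"), "yes"),
    (("overcast", "hot", "normal", "false"), "yes"),(("rainy", "mild", "high", "true"), "no"),
]

# Frequency tables: class priors and per-attribute value counts for each class
class_counts = Counter(label for _, label in data)
value_counts = defaultdict(Counter)  # (attribute index, class) -> Counter of values
for features, label in data:
    for i, v in enumerate(features):
        value_counts[(i, label)][v] += 1

def posterior(features):
    """Unnormalized P(class | features) under the naive independence assumption."""
    scores = {}
    for c, n_c in class_counts.items():
        p = n_c / len(data)                       # prior P(class)
        for i, v in enumerate(features):
            p *= value_counts[(i, c)][v] / n_c    # likelihood p(value | class)
        scores[c] = p
    return scores

scores = posterior(("sunny", "cool", "high", "true"))
total = sum(scores.values())
for c, s in scores.items():
    print(c, round(s, 4), "normalized:", round(s / total, 3))
# yes 0.0053 (0.205), no 0.0206 (0.795)
```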
63. Bayesian Classifier – zero frequency problem
What if a descriptor value doesn't occur with every class value?
P(outlook=overcast|No) = 0
Remedy: add 1 to the count for every descriptor-class combination
(Laplace Estimator)
Outlook  | No  | Yes
Sunny    | 3+1 | 2+1
Overcast | 0+1 | 4+1
Rainy    | 2+1 | 3+1
Temp. | No  | Yes
Hot   | 2+1 | 2+1
Mild  | 2+1 | 4+1
Cool  | 1+1 | 3+1
Humidity | No  | Yes
High     | 4+1 | 3+1
Normal   | 1+1 | 6+1
Windy | No  | Yes
False | 2+1 | 6+1
True  | 3+1 | 3+1
64. Bayesian Classifier – General Equation
$$P(C_k \mid X) = \frac{P(X \mid C_k)\, P(C_k)}{P(X)}$$
Likelihood: P(X | Ck)
Continuous variable:
$$P(x \mid C) = \frac{1}{(2\pi\sigma^2)^{1/2}} \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$
67. Naïve Bayesian Classifier: Comments
Advantages
– Easy to implement
– Good results obtained in most of the cases
Disadvantages
– Assumption: class conditional independence, therefore loss of
accuracy
– Practically, dependencies exist among variables
• E.g., hospitals: patients: Profile: age, family history, etc.
Symptoms: fever, cough etc., Disease: lung cancer,
diabetes, etc.
• Dependencies among these cannot be modeled by Naïve
Bayesian Classifier
How to deal with these dependencies?
– Bayesian Belief Networks
68. Bayesian Belief Networks
Bayesian belief network allows a subset of the variables
conditionally independent
A graphical model of causal relationships
– Represents dependency among the variables
– Gives a specification of joint probability distribution
Nodes: random variables
Links: dependency
Example: X and Y are the parents of Z, and Y is the parent of P
No dependency between Z and P
The graph has no loops or cycles
69. Bayesian Belief Network: An Example
[Network over the variables FamilyHistory, Smoker, LungCancer, Emphysema,
PositiveXRay, and Dyspnea; the parents of LungCancer are FamilyHistory (FH)
and Smoker (S).]
The conditional probability table (CPT) for the variable LungCancer:
    | (FH, S) | (FH, ~S) | (~FH, S) | (~FH, ~S)
LC  | 0.8     | 0.5      | 0.7      | 0.1
~LC | 0.2     | 0.5      | 0.3      | 0.9
The CPT shows the conditional probability for each possible combination of its parents' values.
Derivation of the probability of a particular combination of values of X from the CPT:
$$P(x_1, \dots, x_n) = \prod_{i=1}^{n} P(x_i \mid Parents(Y_i))$$
70. Training Bayesian Networks
Several scenarios:
– Given both the network structure and all variables
observable: learn only the CPTs
– Network structure known, some hidden variables: gradient
descent (greedy hill-climbing) method, analogous to neural
network learning
– Network structure unknown, all variables observable:
search through the model space to reconstruct network
topology
– Unknown structure, all hidden variables: No good
algorithms known for this purpose.
81. Support Vector Machines
The decision boundary and the two margin hyperplanes:
$$\mathbf{w} \cdot \mathbf{x} + b = 0, \qquad \mathbf{w} \cdot \mathbf{x} + b = +1, \qquad \mathbf{w} \cdot \mathbf{x} + b = -1$$
$$f(\mathbf{x}) = \begin{cases} 1 & \text{if } \mathbf{w} \cdot \mathbf{x} + b \ge 1 \\ -1 & \text{if } \mathbf{w} \cdot \mathbf{x} + b \le -1 \end{cases} \qquad \text{Margin} = \frac{2}{\|\mathbf{w}\|^2}$$
82. Finding the Decision Boundary
Let {x1, ..., xn} be our data set and let yi ∈ {1,-1} be the class
label of xi
The decision boundary should classify all points correctly ⇒
The decision boundary can be found by solving the following
constrained optimization problem
This is a constrained optimization problem. Solving it is beyond
our course
83. Support Vector Machines
We want to maximize: $\text{Margin} = \dfrac{2}{\|\mathbf{w}\|^2}$
– Which is equivalent to minimizing: $L(\mathbf{w}) = \dfrac{\|\mathbf{w}\|^2}{2}$
– But subject to the following constraints:
$$f(\mathbf{x}_i) = \begin{cases} 1 & \text{if } \mathbf{w} \cdot \mathbf{x}_i + b \ge 1 \\ -1 & \text{if } \mathbf{w} \cdot \mathbf{x}_i + b \le -1 \end{cases}$$
• This is a constrained optimization problem
– Numerical approaches to solve it (e.g., quadratic programming)
84. Classifying new Tuples
The decision boundary is determined only by the support vectors
Let tj (j=1, ..., s) be the indices of the s support vectors.
For testing with a new data point z
– Compute the decision function at z and
classify z as class 1 if the sum is positive, and class 2
otherwise
86. Support Vector Machines
What if the training set is not linearly separable?
Slack variables ξi can be added to allow misclassification of
difficult or noisy examples; the resulting margin is called a soft margin.
87. Support Vector Machines
What if the problem is not linearly separable?
– Introduce slack variables
• Need to minimize:
$$L(\mathbf{w}) = \frac{\|\mathbf{w}\|^2}{2} + C \sum_{i=1}^{N} \xi_i^k$$
• Subject to:
$$f(\mathbf{x}_i) = \begin{cases} 1 & \text{if } \mathbf{w} \cdot \mathbf{x}_i + b \ge 1 - \xi_i \\ -1 & \text{if } \mathbf{w} \cdot \mathbf{x}_i + b \le -1 + \xi_i \end{cases}$$
89. Non-linear SVMs
Datasets that are linearly separable with some noise work out great.
But what are we going to do if the dataset is just too hard?
[1-D example along the x axis: no single threshold separates the two classes.]
How about mapping the data to a higher-dimensional space, e.g. (x, x²)?
90. Non-linear SVMs: Feature spaces
General idea: the original feature space can always be mapped to
some higher-dimensional feature space where the training set is
separable:
Φ: x → φ(x)
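As a hedged illustration of a non-linear SVM, the sketch below assumes scikit-learn and NumPy; the toy one-dimensional dataset (positive class in the middle of the range, so no single threshold separates it) and the parameter choices are mine. The RBF kernel implicitly performs the mapping to a higher-dimensional feature space.

```python
import numpy as np
from sklearn.svm import SVC

# Toy 1-D dataset that is not linearly separable: class 1 sits in the middle
rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=(200, 1))
y = (np.abs(x[:, 0]) < 1.5).astype(int)   # class 1 inside (-1.5, 1.5), class 0 outside

# A kernel SVM implicitly maps the data into a higher-dimensional feature space
clf = SVC(kernel="rbf", C=1.0, gamma="scale").fit(x, y)

print("training accuracy:", clf.score(x, y))
print("number of support vectors:", len(clf.support_))
print("predictions for x = -2, 0, 2:", clf.predict([[-2.0], [0.0], [2.0]]))
```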
92. What Is Prediction?
(Numerical) prediction is similar to classification
– construct a model
– use model to predict continuous or ordered value for a given
input
Prediction is different from classification
– Classification refers to predicting categorical class labels
– Prediction models continuous-valued functions
Major method for prediction: regression
– model the relationship between one or more predictor
variables and a response variable
95. Regression Analysis
Simple linear regression
Multiple regression
Non-linear regression
Other regression methods:
– generalized linear model,
– Poisson regression,
– log-linear models,
– regression trees
96. Simple Linear Regression
describes the linear relationship between a predictor variable,
plotted on the x-axis, and a response variable, plotted on the
y-axis
[Scatter plot of the response variable Y (y-axis) against the predictor variable X (x-axis).]
102. Least Squares Regression
Model line: $\hat{Y} = \beta_0 + \beta_1 X$
Residual: $\varepsilon = Y - \hat{Y}$
Sum of squares of residuals: $\sum (Y - \hat{Y})^2$
We must find the values of $\beta_0$ and $\beta_1$ that minimise $\sum (Y - \hat{Y})^2$
103. Linear Regression
A model line y = w0 + w1 x acquired by using the method of least squares to
estimate the best-fitting straight line has:
$$w_1 = \frac{\sum_{i=1}^{|D|} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{|D|} (x_i - \bar{x})^2}, \qquad w_0 = \bar{y} - w_1 \bar{x}$$
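A minimal sketch (plain Python, illustrative toy data) of the least-squares estimates w1 and w0 defined above.

```python
# Toy (x, y) data points; illustrative only, not from the slides
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 8.0, 9.8]

x_bar = sum(xs) / len(xs)
y_bar = sum(ys) / len(ys)

# w1 = sum((xi - x_bar)(yi - y_bar)) / sum((xi - x_bar)^2), w0 = y_bar - w1 * x_bar
w1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
      / sum((x - x_bar) ** 2 for x in xs))
w0 = y_bar - w1 * x_bar

print(f"fitted line: y = {w0:.3f} + {w1:.3f} x")
```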
104. Multiple Linear Regression
Multiple linear regression: involves more than one predictor
variable
The linear model with a single predictor variable X can easily
be extended to two or more predictor variables
Y = β0 + β1X1 + β2X2 + ... + βpXp + ε
– Solvable by extension of least square method or using SAS,
S-Plus
105. Nonlinear Regression
Some nonlinear models can be modeled by a polynomial
function
A polynomial regression model can be transformed into linear
regression model. For example,
y = w0 + w1 x + w2 x² + w3 x³
is convertible to linear with new variables x2 = x², x3 = x³:
y = w0 + w1 x + w2 x2 + w3 x3
Other functions, such as power function, can also be
transformed to linear model
Some models are intractable nonlinear
– possible to obtain least square estimates through extensive
calculation on more complex formulae
107. What is an ANN?
An ANN is a data structure that supposedly simulates
the behavior of neurons in a biological brain.
An ANN is composed of layers of interconnected units.
Messages are passed along the connections from
one unit to the other.
Messages can change based on the weight of the
connection and the value in the node
109. ANN
Example: the output Y is 1 if at least two of the three inputs are equal to 1.
110. ANN
$$Y = I(0.3 X_1 + 0.3 X_2 + 0.3 X_3 - 0.4 > 0), \quad \text{where } I(z) = \begin{cases} 1 & \text{if } z \text{ is true} \\ 0 & \text{otherwise} \end{cases}$$
111. Artificial Neural Networks
The model (the perceptron model) is an assembly of inter-connected nodes and
weighted links.
The output node sums up each of its input values according to the weights of
its links and compares the result against some threshold t:
$$Y = I\Big(\sum_i w_i X_i - t\Big) \quad \text{or} \quad Y = \mathrm{sign}\Big(\sum_i w_i X_i - t\Big)$$
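A small sketch of the perceptron model above in plain Python; the fixed weights (0.3 each) and threshold t = 0.4 come from the earlier example, while the function name is illustrative. With these values the unit outputs 1 exactly when at least two of the three inputs are 1.

```python
def perceptron(inputs, weights, t):
    """Output 1 if the weighted sum of the inputs exceeds the threshold t, else 0."""
    s = sum(w * x for w, x in zip(weights, inputs))
    return 1 if s - t > 0 else 0

weights = [0.3, 0.3, 0.3]
for x1 in (0, 1):
    for x2 in (0, 1):
        for x3 in (0, 1):
            print((x1, x2, x3), "->", perceptron((x1, x2, x3), weights, 0.4))
```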
112. Neural Networks
Advantages
– prediction accuracy is generally high.
– robust, works when training examples contain errors.
– output may be discrete, real-valued, or a vector of several
discrete or real-valued attributes.
– fast evaluation of the learned target function.
Criticism
– long training time.
– difficult to understand the learned function (weights).
– not easy to incorporate domain knowledge.
113. Learning Algorithms
Back propagation for classification
Kohonen feature maps for clustering
Recurrent back propagation for classification
Radial basis function for classification
Adaptive resonance theory
Probabilistic neural networks
114. Major Steps for Back Propagation
Network
Constructing a network
– input data representation
– selection of number of layers, number of nodes in
each layer.
Training the network using training data
Pruning the network
Interpret the results
116. How Does a Multi-Layer Neural Network Work?
The inputs to the network correspond to the attributes measured for
each training tuple
Inputs are fed simultaneously into the units making up the input layer
They are then weighted and fed simultaneously to a hidden layer
The number of hidden layers is arbitrary, although usually only one
The weighted outputs of the last hidden layer are input to units making
up the output layer, which emits the network's prediction
The network is feed-forward in that none of the weights cycles back to
an input unit or to an output unit of a previous layer
From a statistical point of view, networks perform nonlinear
regression: Given enough hidden units and enough training samples,
they can closely approximate any function
117. Defining a Network Topology
First decide the network topology: # of units in the input layer,
# of hidden layers (if > 1), # of units in each hidden layer, and #
of units in the output layer
Normalize the input values for each attribute measured in the
training tuples to [0.0–1.0]
Discrete-valued attributes may be encoded with one input unit per domain value
For classification with more than two classes, one output unit per class is used
If the accuracy of the trained network is unacceptable, repeat the training
process with a different network topology or a different set of initial weights
118. Backpropagation
Iteratively process a set of training tuples & compare the network's
prediction with the actual known target value
For each training tuple, the weights are modified to minimize the
mean squared error between the network's prediction and the
actual target value
Modifications are made in the “backwards” direction: from the
output layer, through each hidden layer down to the first hidden
layer, hence “backpropagation”
Steps
– Initialize weights (to small random #s) and biases in the network
– Propagate the inputs forward (by applying activation function)
– Backpropagate the error (by updating weights and biases)
– Terminating condition (when error is very small, etc.)
119. Backpropagation
For a unit j in the output layer, the error is computed from the generated
value Oj and the correct (target) value Tj:
$$Err_j = O_j (1 - O_j)(T_j - O_j)$$
For a unit j in a hidden layer, the error is propagated back from the units k
of the next layer:
$$Err_j = O_j (1 - O_j) \sum_k Err_k\, w_{jk}$$
Weights and biases are then updated using the learning rate l:
$$w_{ij} = w_{ij} + (l)\, Err_j\, O_i, \qquad \theta_j = \theta_j + (l)\, Err_j$$
120. Network Pruning
Fully connected network will be hard to articulate
n input nodes, h hidden nodes and m output nodes
lead to h(m+n) links (weights)
Pruning: Remove some of the links without affecting
classification accuracy of the network.
121. Other Classification Methods
Associative classification: association rule based (condSet → class).
Genetic algorithm: an initial population of encoded rules is changed by
mutation and cross-over; the most accurate rules survive (survival of the fittest).
K-nearest neighbor classifier: learning by analogy.
Case-based reasoning: similarity with other cases.
Rough set theory: approximation to equivalence classes.
Fuzzy sets: based on fuzzy logic (truth values between 0..1).
123. Lazy vs. Eager Learning
Lazy vs. eager learning
– Lazy learning (e.g., instance-based learning): Simply
stores training data (or only minor processing) and waits
until it is given a test tuple
– Eager learning (the methods discussed above): Given a
training set, constructs a classification model
before receiving new (e.g., test) data to classify
Lazy: less time in training but more time in predicting
124. Lazy Learner: Instance-Based Methods
Instance-based learning:
– Store training examples and delay the processing (“lazy
evaluation”) until a new instance must be classified
Typical approaches
– k-nearest neighbor approach
• Instances represented as points in a Euclidean
space.
– Case-based reasoning
• Uses symbolic representations and knowledge-
based inference
125. Nearest Neighbor Classifiers
Basic idea:
– If it walks like a duck, quacks like a duck, then it’s
probably a duck
[To classify a test record: compute its distance to the training records and
choose the k "nearest" ones.]
126. Instance-Based Classifiers
• Store the training records
• Use training records to
predict the class label of
unseen cases
127. Definition of Nearest Neighbor
[Figure: the (a) 1-nearest, (b) 2-nearest, and (c) 3-nearest neighborhoods of a test point x.]
K-nearest neighbors of a record x are data points
that have the k smallest distance to x
128. The k-Nearest Neighbor Algorithm
All instances correspond to points in the n-D space
The nearest neighbors are defined in terms of Euclidean
distance, dist(X1, X2)
The target function could be discrete- or real-valued
For discrete-valued functions, k-NN returns the most common value
among the k training examples nearest to xq
Voronoi diagram: the decision surface induced by 1-NN for a
typical set of training examples
129. Nearest-Neighbor Classifiers
Requires three things
– The set of stored records
– Distance Metric to compute
distance between records
– The value of k, the number of
nearest neighbors to retrieve
To classify an unknown record:
– Compute distance to other training
records
– Identify k nearest neighbors
– Use class labels of nearest
neighbors to determine the class
label of unknown record (e.g., by
taking majority vote)
130. Nearest Neighbor Classification
Compute distance between two points:
– Euclidean distance
$$d(p, q) = \sqrt{\sum_i (p_i - q_i)^2}$$
Determine the class from the nearest neighbor list
– take the majority vote of class labels among the k-nearest neighbors
– Weigh the vote according to distance
• weight factor, w = 1/d²
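A minimal k-NN sketch in plain Python (the helper names and the toy 2-D training data are illustrative): compute Euclidean distances, pick the k nearest training records, and take a majority vote.

```python
import math
from collections import Counter

def euclidean(p, q):
    """Euclidean distance between two points given as coordinate sequences."""
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

def knn_predict(training, query, k=3):
    """Majority vote over the k nearest training records.
    `training` is a list of (point, label) pairs."""
    neighbors = sorted(training, key=lambda rec: euclidean(rec[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Illustrative 2-D data: two clusters with different labels
training = [((1.0, 1.0), "A"), ((1.5, 2.0), "A"), ((2.0, 1.0), "A"),
            ((6.0, 6.0), "B"), ((7.0, 5.5), "B"), ((6.5, 7.0), "B")]
print(knn_predict(training, (2.0, 2.0), k=3))  # "A"
print(knn_predict(training, (6.0, 5.0), k=3))  # "B"
```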
131. Nearest Neighbor Classification…
Scaling issues
– Attributes may have to be scaled to prevent
distance measures from being dominated by one of
the attributes
– Example:
• height of a person may vary from 1.5m to 1.8m
• weight of a person may vary from 90lb to 300lb
• income of a person may vary from $10K to $1M
132. Nearest Neighbor Classification…
Choosing the value of k:
– If k is too small, sensitive to noise points
– If k is too large, neighborhood may include points from other
classes
133. Metrics for Performance Evaluation
Focus on the predictive capability of a model
– Rather than how long it takes to classify or build models,
scalability, etc.
Confusion Matrix:
                 | PREDICTED Class=Yes | PREDICTED Class=No
ACTUAL Class=Yes | a                   | b
ACTUAL Class=No  | c                   | d
a: TP (true positive), b: FN (false negative),
c: FP (false positive), d: TN (true negative)
134. Metrics for Performance Evaluation…
                 | PREDICTED Class=Yes | PREDICTED Class=No
ACTUAL Class=Yes | a (TP)              | b (FN)
ACTUAL Class=No  | c (FP)              | d (TN)
Most widely-used metric:
$$\text{Accuracy} = \frac{a + d}{a + b + c + d} = \frac{TP + TN}{TP + TN + FP + FN}$$
Error Rate = 1 - Accuracy
135. Limitation of Accuracy
Consider a 2-class problem
– Number of Class 0 examples = 9990
– Number of Class 1 examples = 10
If model predicts everything to be class 0, accuracy is
9990/10000 = 99.9 %
– Accuracy is misleading because model does not
detect any class 1 example
137. Predictor Error Measures
Test error (generalization error): the average loss over the test set
– Mean absolute error: $\dfrac{\sum_{i=1}^{d} |y_i - y_i'|}{d}$
– Mean squared error: $\dfrac{\sum_{i=1}^{d} (y_i - y_i')^2}{d}$
– Relative absolute error: $\dfrac{\sum_{i=1}^{d} |y_i - y_i'|}{\sum_{i=1}^{d} |y_i - \bar{y}|}$
– Relative squared error: $\dfrac{\sum_{i=1}^{d} (y_i - y_i')^2}{\sum_{i=1}^{d} (y_i - \bar{y})^2}$
– The mean squared error exaggerates the presence of outliers
Popularly use the (square) root mean-square error; similarly, the root
relative squared error
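A small sketch (plain Python, illustrative true/predicted values) that computes the predictor error measures listed above.

```python
# y are true values, y_pred are the model's predictions (illustrative data)
y      = [3.0, 5.0, 7.5, 10.0]
y_pred = [2.5, 5.5, 7.0, 11.0]

d = len(y)
y_bar = sum(y) / d

mae = sum(abs(a - b) for a, b in zip(y, y_pred)) / d
mse = sum((a - b) ** 2 for a, b in zip(y, y_pred)) / d
rae = sum(abs(a - b) for a, b in zip(y, y_pred)) / sum(abs(a - y_bar) for a in y)
rse = sum((a - b) ** 2 for a, b in zip(y, y_pred)) / sum((a - y_bar) ** 2 for a in y)
rmse = mse ** 0.5

print(f"MAE={mae:.3f} MSE={mse:.3f} RMSE={rmse:.3f} RAE={rae:.3f} RSE={rse:.3f}")
```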
138. Evaluating Accuracy
Holdout method
– Given data is randomly partitioned into two independent sets
• Training set (e.g., 2/3) for model construction
• Test set (e.g., 1/3) for accuracy estimation
– Random sampling: a variation of holdout
• Repeat holdout k times, accuracy = avg. of the
accuracies obtained
Cross-validation (k-fold, where k = 10 is most popular)
– Randomly partition the data into k mutually exclusive
subsets, each approximately equal size
– At i-th iteration, use Di as test set and others as training set
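A hedged sketch of the holdout and 10-fold cross-validation estimates with scikit-learn; the iris dataset and the decision tree learner are illustrative stand-ins, not from the slides.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(random_state=0)

# Holdout: 2/3 of the data for training, 1/3 for accuracy estimation
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=1/3, random_state=0)
print("holdout accuracy:", clf.fit(X_tr, y_tr).score(X_te, y_te))

# 10-fold cross-validation: average accuracy over the 10 test folds
scores = cross_val_score(clf, X, y, cv=10)
print("10-fold CV accuracy:", np.mean(scores))
```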
139. Evaluating Accuracy
Bootstrap
– Works well with small data sets
– Samples the given training tuples uniformly with replacement
Several bootstrap methods exist, and a common one is the .632 bootstrap
– Suppose we are given a data set of d tuples. The data set is sampled
d times, with replacement, resulting in a training set of d samples. The
data tuples that did not make it into the training set end up forming the
test set. About 63.2% of the original data will end up in the bootstrap,
and the remaining 36.8% will form the test set (since (1 − 1/d)^d ≈ e^{-1} = 0.368)
– Repeat the sampling procedure k times; the overall accuracy of the model is:
$$acc(M) = \sum_{i=1}^{k} \left(0.632 \times acc(M_i)_{test\_set} + 0.368 \times acc(M_i)_{train\_set}\right)$$
140. Ensemble Methods
Construct a set of classifiers from the training data
Predict class label of previously unseen records by
aggregating predictions made by multiple classifiers
– Use a combination of models to increase accuracy
– Combine a series of k learned models, M1, M2, …, Mk, with the aim
of creating an improved model M*
Popular ensemble methods
– Bagging
• averaging the prediction over a collection of classifiers
– Boosting
• weighted vote with a collection of classifiers
142. Bagging: Bootstrap Aggregation
Analogy: Diagnosis based on multiple doctors’ majority vote
Training
– Given a set D of d tuples, at each iteration i, a training set Di of d
tuples is sampled with replacement from D (i.e., bootstrap)
– A classifier model Mi is learned for each training set Di
Classification: classify an unknown sample X
– Each classifier Mi returns its class prediction
– The bagged classifier M* counts the votes and assigns the class
with the most votes to X
Prediction: can be applied to the prediction of continuous values
by taking the average value of each prediction for a given test
tuple
143. Bagging: Bootstrap Aggregation
Accuracy
– Often significantly better than a single classifier derived
from D
– For noisy data: not considerably worse, more robust
– Proved improved accuracy in prediction
144. Boosting
Analogy: Consult several doctors, based on a combination of
weighted diagnoses—weight assigned based on the previous
diagnosis accuracy
How does boosting work?
– Weights are assigned to each training tuple
– A series of k classifiers is iteratively learned
– After a classifier Mi is learned, the weights are updated to
allow the subsequent classifier, Mi+1, to pay more attention to
the training tuples that were misclassified by Mi
– The final M* combines the votes of each individual classifier,
where the weight of each classifier's vote is a function of its
accuracy
145. Boosting
The boosting algorithm can be extended for the
prediction of continuous values
Comparing with bagging: boosting tends to achieve
greater accuracy, but it also risks overfitting the
model to misclassified data
146. Boosting: Adaboost
Given a set of d class-labeled tuples, (X1, y1), …, (Xd, yd)
Initially, all the weights of tuples are set the same (1/d)
Generate k classifiers in k rounds. At round i,
– Tuples from D are sampled (with replacement) to form a training set
Di of the same size
– Each tuple’s chance of being selected is based on its weight
– A classification model Mi is derived from Di
– Its error rate is calculated using Di as a test set
– If a tuple is misclassified, its weight is increased, otherwise it is
decreased
Error rate: err(Xj) is the misclassification error of tuple Xj. Classifier
Mi's error rate is the sum of the weights of the misclassified tuples:
$$error(M_i) = \sum_{j=1}^{d} w_j \times err(X_j)$$
The weight of classifier Mi's vote is
$$\log \frac{1 - error(M_i)}{error(M_i)}$$
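A brief sketch (plain Python, illustrative tuple weights) of the two AdaBoost quantities above: the weighted error of a classifier and the weight of its vote.

```python
import math

def weighted_error(weights, misclassified):
    """error(Mi) = sum of the weights of the misclassified tuples.
    `misclassified` holds 1 where tuple Xj was misclassified, else 0."""
    return sum(w * e for w, e in zip(weights, misclassified))

def classifier_vote_weight(error_rate):
    """Weight of classifier Mi's vote: log((1 - error(Mi)) / error(Mi))."""
    return math.log((1 - error_rate) / error_rate)

# Illustrative: 10 tuples with equal weight 1/10, two of them misclassified
weights = [0.1] * 10
miscls = [1, 0, 0, 1, 0, 0, 0, 0, 0, 0]
err = weighted_error(weights, miscls)                      # 0.2
print("error(Mi) =", err)
print("vote weight =", round(classifier_vote_weight(err), 3))  # log(0.8/0.2) ≈ 1.386
```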