The document discusses data mining and decision trees. It defines data mining as the extraction of interesting patterns from large amounts of data. Decision trees are described as a way to generate classification rules from data through a tree structure: each inner node represents an attribute and each leaf represents a classification. An example decision tree classifies whether to play golf based on weather attributes, and the accuracy of a decision tree classifier is evaluated by how many test cases it classifies correctly. Advantages of decision trees are that they can handle both numeric and categorical data and clearly show which attributes are important; weaknesses include computational expense, and some can only handle binary target classes. The ID3 algorithm is introduced as a method for building decision trees, using information gain to select the splitting attribute at each node. Later sections cover clustering (the k-means algorithm and its variants) and association rule mining (support, confidence, and the Apriori algorithm).
2. What Is Data Mining?
n Data mining (knowledge discovery from data)
¨ Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) patterns or knowledge from huge amounts of data
n Is everything “data mining”?
¨ (Deductive) query processing.
¨ Expert systems or small ML/statistical programs
n Build computer programs that sift through databases automatically, seeking regularities or patterns
3. Data Mining — What’s in a Name?
Information Harvesting
Knowledge Mining
Data Mining
Knowledge Discovery in Databases
Data Dredging
Data Archaeology
Data Pattern Processing
Database Mining
Knowledge Extraction
Siftware
The process of discovering meaningful new correlations, patterns, and
trends by sifting through large amounts of stored data, using pattern
recognition technologies and statistical and mathematical techniques
4. Definition
n Several Definitions
¨ Non-trivial extraction of implicit, previously unknown
and potentially useful information from data
¨ Exploration & analysis, by automatic or
semi-automatic means, of
large quantities of data
in order to discover
meaningful patterns
From [Fayyad et al.] Advances in Knowledge Discovery and Data Mining, 1996
5. What is (not) Data Mining?
l What is not Data Mining?
– Look up a phone number in a phone directory.
– Query a Web search engine for information about “Pune”.
l What is Data Mining?
– Certain names are more common in certain Indian states (Joshi, Patil, Kulkarni… in the Pune area).
– Group together similar documents returned by a search engine according to their context (e.g. Google Scholar, Amazon.com).
6. Origins of Data Mining
n Draws ideas from machine learning/AI, pattern recognition, statistics, and database systems
n Traditional techniques may be unsuitable due to
¨ Enormity of data
¨ High dimensionality of data
¨ Heterogeneous, distributed nature of data
[Figure: data mining at the overlap of statistics, machine learning/AI, pattern recognition, and database systems.]
7. Data Mining Tasks
n Prediction Methods
¨ Use some variables to predict unknown or
future values of other variables.
n Description Methods
¨ Find human-interpretable patterns that
describe the data.
8. Data Mining Tasks...
n Classification [Predictive] predicting an item class
n Clustering [Descriptive] finding clusters in data
n Association Rule Discovery [Descriptive] finding frequently occurring events
n Deviation/Anomaly Detection [Predictive] finding
changes
9. Classification: Definition
n Given a collection of records (training set )
¨ Each record contains a set of attributes, one of the
attributes is the class.
n Find a model for class attribute as a function
of the values of other attributes.
n Goal: previously unseen records should be
assigned a class as accurately as possible.
¨A test set is used to determine the accuracy of the
model. Usually, the given data set is divided into
training and test sets, with training set used to build
the model and test set used to validate it.
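As a quick illustration of this split-and-validate workflow (a minimal sketch, not part of the original slides), scikit-learn's iris data set stands in here for an arbitrary record collection:

```python
# Hypothetical sketch of the train/validate workflow described above.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)  # stand-in data set; y is the class attribute

# Divide the given data set into disjoint training and test sets.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=42)

model = DecisionTreeClassifier().fit(X_train, y_train)  # build on training set

# Accuracy: fraction of previously unseen test records classified correctly.
print("test accuracy:", model.score(X_test, y_test))
```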
10. Classification Example
Training Set (Refund and Marital Status are categorical, Taxable Income is continuous, Cheat is the class):

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Test Set (class unknown):

Refund  Marital Status  Taxable Income  Cheat
No      Single          75K             ?
Yes     Married         50K             ?
No      Married         150K            ?
Yes     Divorced        90K             ?
No      Single          40K             ?
No      Married         80K             ?

A classifier model is learned from the training set and then applied to the test set.
11. Classification: Application 1
n Direct Marketing
¨ Goal: Reduce cost of mailing by targeting a set of
consumers likely to buy a new cell-phone product.
¨ Approach:
n Use the data for a similar product introduced before.
n We know which customers decided to buy and which decided
otherwise. This {buy, don’t buy} decision forms the class
attribute.
n Collect various demographic, lifestyle, and company-
interaction related information about all such customers.
¨ Type of business, where they stay, how much they earn, etc.
n Use this information as input attributes to learn a classifier
model. From [Berry & Linoff] Data Mining Techniques, 1997
12. Classification: Application 2
n Fraud Detection
¨ Goal: Predict fraudulent cases in credit card
transactions.
¨ Approach:
n Use credit card transactions and the information on the account holder as attributes.
¨ When does a customer buy, what does he buy, how often he
pays on time, etc
n Label past transactions as fraud or fair transactions. This
forms the class attribute.
n Learn a model for the class of the transactions.
n Use this model to detect fraud by observing credit card
transactions on an account.
13. Classification: Application 3
n Customer Attrition/Churn:
¨ Goal: To predict whether a customer is likely
to be lost to a competitor.
¨ Approach:
n Use detailed record of transactions with each of the
past and present customers, to find attributes.
¨ How often the customer calls, where he calls, what time-
of-the day he calls most, his financial status, marital
status, etc.
n Label the customers as loyal or disloyal.
n Find a model for loyalty.
From [Berry & Linoff] Data Mining Techniques, 1997
15. Introduction
n A classification scheme which generates a tree and a set of rules from a given data set.
n The set of records available for developing
classification methods is divided into two disjoint
subsets – a training set and a test set.
n The attributes of the records are categorised into two types:
¨ Attributes whose domain is numerical are called
numerical attributes.
¨ Attributes whose domain is not numerical are called
the categorical attributes.
16. Introduction
n A decision tree is a tree with the following properties:
¨ An inner node represents an attribute.
¨ An edge represents a test on the attribute of the parent node.
¨ A leaf represents one of the classes.
n Construction of a decision tree
¨ Based on the training data
¨ Top-Down strategy
17. Decision Tree
Example
n The data set has five attributes.
n There is a special attribute: the attribute class is the class label.
n The attributes temp (temperature) and humidity are numerical attributes.
n Other attributes are categorical, that is, they cannot be ordered.
n Based on the training data set, we want to find a set of rules to know
what values of outlook, temperature, humidity and wind, determine
whether or not to play golf.
18. Decision Tree
Example
n We have five leaf nodes.
n In a decision tree, each leaf node represents a rule.
n We have the following rules corresponding to the tree given in
Figure.
n RULE 1 If it is sunny and the humidity is not above 75%, then play.
n RULE 2 If it is sunny and the humidity is above 75%, then do not play.
n RULE 3 If it is overcast, then play.
n RULE 4 If it is rainy and not windy, then play.
n RULE 5 If it is rainy and windy, then don't play.
19. Classification
n The classification of an unknown input vector is done by
traversing the tree from the root node to a leaf node.
n A record enters the tree at the root node.
n At the root, a test is applied to determine which child
node the record will encounter next.
n This process is repeated until the record arrives at a leaf
node.
n All the records that end up at a given leaf of the tree are
classified in the same way.
n There is a unique path from the root to each leaf.
n The path is a rule which is used to classify the records.
20. n In our tree, we can carry out the classification for
an unknown record as follows.
n Let us assume, for the record, that we know the
values of the first four attributes (but we do not
know the value of class attribute) as
n outlook= rain; temp = 70; humidity = 65; and
windy= true.
21. n We start from the root node to check the value of the
attribute associated at the root node.
n This attribute is the splitting attribute at this node.
n For a decision tree, at every node there is an attribute
associated with the node called the splitting attribute.
n In our example, outlook is the splitting attribute at root.
n Since for the given record, outlook = rain, we move to
the right-most child node of the root.
n At this node, the splitting attribute is windy and we find that, for the record we want to classify, windy = true.
n Hence, we move to the left child node and conclude that the class label is "no play".
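The five rules of the golf tree can be transcribed directly into code. A minimal sketch (the function name and signature are my own; temp is accepted but unused because temperature is not a splitting attribute in this tree):

```python
def classify(outlook: str, temp: float, humidity: float, windy: bool) -> str:
    """Classify a record by traversing the golf decision tree (Rules 1-5)."""
    if outlook == "sunny":
        return "play" if humidity <= 75 else "no play"   # Rules 1 and 2
    if outlook == "overcast":
        return "play"                                    # Rule 3
    if outlook == "rain":
        return "no play" if windy else "play"            # Rules 4 and 5
    raise ValueError("unknown outlook: " + outlook)

# The record from the example: outlook=rain, temp=70, humidity=65, windy=true.
print(classify("rain", 70, 65, True))   # -> no play
```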
22. n The accuracy of the classifier is determined by the percentage
of the test data set that is correctly classified.
n We can see that for Rule 1 there are two records of the test
data set satisfying outlook= sunny and humidity < 75, and
only one of these is correctly classified as play.
n Thus, the accuracy of this rule is 0.5 (or 50%). Similarly, the
accuracy of Rule 2 is also 0.5 (or 50%). The accuracy of Rule
3 is 0.66.
RULE 1: If it is sunny and the humidity is not above 75%, then play.
23. Concept of Categorical Attributes
n Consider the following training
data set.
n There are three attributes,
namely, age, pincode and
class.
n The attribute class is used for
class label.
The attribute age is a numeric attribute, whereas pincode is a categorical
one.
Though the domain of pincode is numeric, no ordering can be defined
among pincode values.
You cannot derive any useful information if one pin-code is greater than
another pincode.
24. n Figure gives a decision tree for the training data.
n The splitting attribute at the root is pincode and the splitting criterion here is pincode = 500 046.
n Similarly, for the left child node, the splitting criterion is age < 48 (the splitting attribute is age).
n Although the right child node has the same splitting attribute (age), the splitting criterion is different.
n At the root level, we have 9 records, and the associated splitting criterion is pincode = 500 046. As a result, we split the records into two subsets: records 1, 2, 4, 8, and 9 go to the left child node, and the remaining records go to the right node. The process is repeated at every node.
25. Advantages and Shortcomings of
Decision Tree Classifications
n A decision tree construction process is concerned with
identifying the splitting attributes and splitting criterion at every
level of the tree.
n Major strengths are:
¨ Decision trees are able to generate understandable rules.
¨ They are able to handle both numerical and categorical attributes.
¨ They provide a clear indication of which fields are most important for prediction or classification.
n Weaknesses are:
¨ The process of growing a decision tree is computationally expensive. At each node, each candidate splitting field must be examined before its best split can be found.
¨ Some decision tree algorithms can only deal with binary-valued target classes.
26. Iterative Dichotomizer (ID3)
n Quinlan (1986)
n Each node corresponds to a splitting attribute
n Each arc is a possible value of that attribute.
n At each node the splitting attribute is selected to be the most
informative among the attributes not yet considered in the
path from the root.
n Entropy is used to measure how informative a node is.
n The algorithm uses the criterion of information gain to
determine the goodness of a split.
¨ The attribute with the greatest information gain is
taken as the splitting attribute, and the data set is split
for all distinct values of the attribute.
27. Training Dataset
This follows an example from Quinlan’s ID3.
The class label attribute, buys_computer, has two distinct values; thus there are two distinct classes (m = 2).
Class C1 corresponds to yes and class C2 corresponds to no. There are 9 samples of class yes and 5 samples of class no.

age     income  student  credit_rating  buys_computer
<=30    high    no       fair           no
<=30    high    no       excellent      no
31…40   high    no       fair           yes
>40     medium  no       fair           yes
>40     low     yes      fair           yes
>40     low     yes      excellent      no
31…40   low     yes      excellent      yes
<=30    medium  no       fair           no
<=30    low     yes      fair           yes
>40     medium  yes      fair           yes
<=30    medium  yes      excellent      yes
31…40   medium  no       excellent      yes
31…40   high    yes      fair           yes
>40     medium  no       excellent      no
28. Extracting Classification Rules
from Trees
n Represent the knowledge in the
form of IF-THEN rules
n One rule is created for each
path from the root to a leaf
n Each attribute-value pair along
a path forms a conjunction
n The leaf node holds the class
prediction
n Rules are easier for humans to What are the rules?
understand
29. Solution (Rules)
IF age = “<=30” AND student = “no” THEN buys_computer = “no”
IF age = “<=30” AND student = “yes” THEN buys_computer = “yes”
IF age = “31…40” THEN buys_computer = “yes”
IF age = “>40” AND credit_rating = “excellent” THEN buys_computer = “no”
IF age = “>40” AND credit_rating = “fair” THEN buys_computer = “yes”
30. Algorithm for Decision Tree Induction
n Basic algorithm (a greedy algorithm)
¨ Tree is constructed in a top-down recursive divide-and-conquer
manner
¨ At start, all the training examples are at the root
¨ Attributes are categorical (if continuous-valued, they are
discretized in advance)
¨ Examples are partitioned recursively based on selected attributes
¨ Test attributes are selected on the basis of a heuristic or
statistical measure (e.g., information gain)
n Conditions for stopping partitioning
¨ All samples for a given node belong to the same class
¨ There are no remaining attributes for further partitioning –
majority voting is employed for classifying the leaf
¨ There are no samples left
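A compact sketch of this greedy, top-down procedure, assuming purely categorical attributes and records represented as Python dictionaries (a representation chosen here for illustration, not taken from the slides):

```python
import math
from collections import Counter

def entropy(labels):
    """Information content of a list of class labels, in bits."""
    t = len(labels)
    return -sum((c / t) * math.log2(c / t) for c in Counter(labels).values())

def gain(rows, attr, target):
    """Information gained by partitioning rows on attr."""
    base = entropy([r[target] for r in rows])
    rem = 0.0
    for v in set(r[attr] for r in rows):
        subset = [r[target] for r in rows if r[attr] == v]
        rem += len(subset) / len(rows) * entropy(subset)
    return base - rem

def id3(rows, attrs, target):
    labels = [r[target] for r in rows]
    if len(set(labels)) == 1:      # stop: all samples in the same class
        return labels[0]
    if not attrs:                  # stop: no attributes left -> majority vote
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: gain(rows, a, target))
    return {best: {v: id3([r for r in rows if r[best] == v],
                          [a for a in attrs if a != best], target)
                   for v in set(r[best] for r in rows)}}

# Tiny hypothetical training set (not the slides' data):
rows = [
    {"outlook": "sunny",    "windy": "false", "class": "play"},
    {"outlook": "sunny",    "windy": "true",  "class": "no play"},
    {"outlook": "overcast", "windy": "false", "class": "play"},
    {"outlook": "rain",     "windy": "false", "class": "play"},
    {"outlook": "rain",     "windy": "true",  "class": "no play"},
]
print(id3(rows, ["outlook", "windy"], "class"))
# -> {'windy': {'false': 'play', 'true': 'no play'}}
```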
31. Attribute Selection Measure: Information Gain (ID3/C4.5)
n Select the attribute with the highest information gain.
n Let S contain si tuples of class Ci for i = {1, …, m}.
n The information required to classify an arbitrary tuple (information is encoded in bits) is

I(s1, s2, ..., sm) = − Σ_{i=1}^{m} (si/s) log2(si/s)

n The entropy of attribute A with values {a1, a2, …, av} is

E(A) = Σ_{j=1}^{v} ((s1j + ... + smj)/s) · I(s1j, ..., smj)

n The information gained by branching on attribute A is

Gain(A) = I(s1, s2, ..., sm) − E(A)
32. Entropy
n Entropy measures the homogeneity (purity) of a set of examples.
n It gives the information content of the set in terms of the class labels of the
examples.
n Consider that you have a set of examples, S with two classes, P and N. Let the
set have p instances for the class P and n instances for the class N.
n So the total number of instances we have is t = p + n. The view [p, n] can be
seen as a class distribution of S.
The entropy for S is defined as
n Entropy(S) = - (p/t).log2(p/t) - (n/t).log2(n/t)
n Example: Let a set of examples consists of 9 instances for class positive, and 5
instances for class negative.
n Answer: p = 9 and n = 5.
n So Entropy(S) = - (9/14).log2(9/14) - (5/14).log2(5/14)
n = -(0.64286)(-0.6375) - (0.35714)(-1.48557)
n = (0.40982) + (0.53056)
n = 0.940
33. Entropy
The entropy of a completely pure set is 0, and is 1 for a set with equal occurrences of both classes (taking 0·log2(0) = 0).
e.g. Entropy[14,0] = −(14/14)·log2(14/14) − (0/14)·log2(0/14)
= −(1)·log2(1) − 0·log2(0)
= −(1)(0) − 0
= 0
e.g. Entropy[7,7] = −(7/14)·log2(7/14) − (7/14)·log2(7/14)
= −(0.5)·log2(0.5) − (0.5)·log2(0.5)
= −(0.5)(−1) − (0.5)(−1)
= 0.5 + 0.5
= 1
34. Attribute Selection by Information Gain Computation
n Class P: buys_computer = “yes”
n Class N: buys_computer = “no”
n I(p, n) = I(9, 5) = 0.940
n Compute the entropy for age:

age     pi  ni  I(pi, ni)
<=30    2   3   0.971
31…40   4   0   0
>40     3   2   0.971

E(age) = (5/14)·I(2,3) + (4/14)·I(4,0) + (5/14)·I(3,2) = 0.694

(5/14)·I(2,3) means “age <= 30” has 5 out of 14 samples, with 2 yes’s and 3 no’s. Hence

Gain(age) = I(p, n) − E(age) = 0.246

Similarly,
Gain(income) = 0.029
Gain(student) = 0.151
Gain(credit_rating) = 0.048

Since age has the highest information gain among the attributes, it is selected as the test attribute.
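These gain figures can be reproduced mechanically. A small sketch over the buys_computer table above (the helper names are mine):

```python
import math
from collections import Counter

# (age, income, student, credit_rating, buys_computer): the 14 training samples
rows = [
    ("<=30", "high", "no", "fair", "no"),
    ("<=30", "high", "no", "excellent", "no"),
    ("31…40", "high", "no", "fair", "yes"),
    (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"),
    (">40", "low", "yes", "excellent", "no"),
    ("31…40", "low", "yes", "excellent", "yes"),
    ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),
    (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"),
    ("31…40", "medium", "no", "excellent", "yes"),
    ("31…40", "high", "yes", "fair", "yes"),
    (">40", "medium", "no", "excellent", "no"),
]
ATTRS = {"age": 0, "income": 1, "student": 2, "credit_rating": 3}

def info(labels):
    """I(s1,...,sm): expected information of a class distribution."""
    t = len(labels)
    return -sum((c / t) * math.log2(c / t) for c in Counter(labels).values())

def gain(attr):
    col = ATTRS[attr]
    base = info([r[-1] for r in rows])        # I(9,5) = 0.940
    e = 0.0                                   # E(attr)
    for v in set(r[col] for r in rows):
        subset = [r[-1] for r in rows if r[col] == v]
        e += len(subset) / len(rows) * info(subset)
    return base - e

for a in ATTRS:
    print(a, round(gain(a), 3))
# age 0.246, income 0.029, student 0.151, credit_rating 0.048
```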
35. Exercise 1
n The following table consists of training data from an employee
database.
n Let status be the class attribute. Use the ID3 algorithm to
construct a decision tree from the given data.
37. Other Attribute Selection
Measures
n Gini index (CART, IBM IntelligentMiner)
¨ All attributes are assumed continuous-valued
¨ Assume there exist several possible split values for
each attribute
¨ May need other tools, such as clustering, to get the
possible split values
¨ Can be modified for categorical attributes
38. Gini Index (IBM IntelligentMiner)
n If a data set T contains examples from n classes, the gini index, gini(T), is defined as

gini(T) = 1 − Σ_{j=1}^{n} pj²

where pj is the relative frequency of class j in T.
n If a data set T is split into two subsets T1 and T2 with sizes N1 and N2 respectively, the gini index of the split data is defined as

gini_split(T) = (N1/N)·gini(T1) + (N2/N)·gini(T2)

n The attribute that provides the smallest gini_split(T) is chosen to split the node (one needs to enumerate all possible splitting points for each attribute).
40. Solution 2
n SPLIT: Age <= 50

            | High | Low | Total
S1 (left)   |   8  |  11 |  19
S2 (right)  |  11  |  10 |  21

n For S1: P(high) = 8/19 = 0.42 and P(low) = 11/19 = 0.58
n For S2: P(high) = 11/21 = 0.52 and P(low) = 10/21 = 0.48
n Gini(S1) = 1 − [0.42x0.42 + 0.58x0.58] = 1 − [0.18 + 0.34] = 1 − 0.52 = 0.48
n Gini(S2) = 1 − [0.52x0.52 + 0.48x0.48] = 1 − [0.27 + 0.23] = 1 − 0.5 = 0.5
n Gini-Split(Age<=50) = 19/40 x 0.48 + 21/40 x 0.5 = 0.23 + 0.26 = 0.49

n SPLIT: Salary <= 65K

            | High | Low | Total
S1 (top)    |  18  |   5 |  23
S2 (bottom) |   1  |  16 |  17

n For S1: P(high) = 18/23 = 0.78 and P(low) = 5/23 = 0.22
n For S2: P(high) = 1/17 = 0.06 and P(low) = 16/17 = 0.94
n Gini(S1) = 1 − [0.78x0.78 + 0.22x0.22] = 1 − [0.61 + 0.05] = 1 − 0.66 = 0.34
n Gini(S2) = 1 − [0.06x0.06 + 0.94x0.94] = 1 − [0.004 + 0.884] = 1 − 0.89 = 0.11
n Gini-Split(Salary<=65K) = 23/40 x 0.34 + 17/40 x 0.11 = 0.20 + 0.05 = 0.25
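A small sketch that re-checks the arithmetic of Solution 2 without intermediate rounding (the slides round each probability first, so their final figures differ by a hundredth or two):

```python
def gini(counts):
    """gini(T) = 1 - sum of squared class frequencies."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def gini_split(*parts):
    """Weighted gini over the subsets produced by a split."""
    n = sum(sum(p) for p in parts)
    return sum(sum(p) / n * gini(p) for p in parts)

# Age <= 50: S1 = (8 high, 11 low), S2 = (11, 10).
print(round(gini_split((8, 11), (11, 10)), 3))   # 0.493 (slides: 0.49)
# Salary <= 65K: S1 = (18, 5), S2 = (1, 16).
print(round(gini_split((18, 5), (1, 16)), 3))    # 0.243 (slides: 0.25)
```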
41. Exercise 3
n In previous exercise, which is a better split
of the data among the two split points?
Why?
42. Solution 3
n Intuitively Salary <= 65K is a better split point since it produces
relatively ``pure'' partitions as opposed to Age <= 50, which results
in more mixed partitions (i.e., just look at the distribution of Highs
and Lows in S1 and S2).
n More formally, let us consider the properties of the Gini index.
If a partition is totally pure, i.e., has all elements from the same
class, then gini(S) = 1-[1x1+0x0] = 1-1 = 0 (for two classes).
On the other hand if the classes are totally mixed, i.e., both classes
have equal probability then
gini(S) = 1 - [0.5x0.5 + 0.5x0.5] = 1-[0.25+0.25] = 0.5.
In other words the closer the gini value is to 0, the better the partition
is. Since Salary has lower gini it is a better split.
44. Clustering: Definition
n Given a set of data points, each having a set of
attributes, and a similarity measure among them,
find clusters such that
¨ Data points in one cluster are more similar to one
another.
¨ Data points in separate clusters are less similar to
one another.
n Similarity Measures:
¨ Euclidean Distance if attributes are continuous.
¨ Other Problem-specific Measures.
45. Clustering: Illustration
[Figure: Euclidean-distance-based clustering in 3-D space; intracluster distances are minimized, intercluster distances are maximized.]
46. Clustering: Application 1
n Market Segmentation:
¨ Goal: subdivide a market into distinct subsets of
customers where any subset may conceivably be
selected as a market target to be reached with a
distinct marketing mix.
¨ Approach:
n Collect different attributes of customers based on their
geographical and lifestyle related information.
n Find clusters of similar customers.
n Measure the clustering quality by observing buying patterns
of customers in same cluster vs. those from different
clusters.
47. Clustering: Application 2
n Document Clustering:
¨ Goal: To find groups of documents that are similar to
each other based on the important terms appearing
in them.
¨ Approach: To identify frequently occurring terms in
each document. Form a similarity measure based on
the frequencies of different terms. Use it to cluster.
¨ Gain: Information Retrieval can utilize the clusters to
relate a new document or search term to clustered
documents.
49. Clustering
n Clustering is the process of grouping data
into clusters so that objects within a cluster
have similarity in comparison to one
another, but are very dissimilar to objects
in other clusters.
n The similarities are assessed based on the
attributes values describing these objects.
50. The K-Means Clustering Method
n Given k, the k-means algorithm is
implemented in four steps:
¨ Partition objects into k nonempty subsets
¨ Compute seed points as the centroids of the
clusters of the current partition (the centroid is the
center, i.e., mean point, of the cluster)
¨ Assign each object to the cluster with the nearest
seed point
¨ Go back to Step 2; stop when no object changes its cluster assignment (see the sketch below)
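A runnable sketch of these steps on one-dimensional data (the function is illustrative; the points and starting centroids come from the exercise on slide 59 below):

```python
def kmeans_1d(points, centroids):
    """Plain k-means on 1-D data, starting from the given centroids."""
    while True:
        # Assign each object to the cluster with the nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            i = min(range(len(centroids)), key=lambda j: abs(p - centroids[j]))
            clusters[i].append(p)
        # Recompute each centroid as the mean of its cluster
        # (assumes no cluster ever becomes empty, which holds here).
        new = [sum(c) / len(c) for c in clusters]
        if new == centroids:          # no change: the means have converged
            return centroids, clusters
        centroids = new

# The 1-D exercise from slide 59: eight points, starting centroids 1 and 2.
print(kmeans_1d([1, 2, 3, 4, 6, 7, 8, 9], [1.0, 2.0]))
# -> ([2.5, 7.5], [[1, 2, 3, 4], [6, 7, 8, 9]])
```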
51. The K-Means Clustering Method
n Example (figure): with K = 2, arbitrarily choose K objects as the initial cluster centers; assign each object to the most similar center; update the cluster means; reassign the objects; update the means again, and repeat until no object changes cluster.
52. K-Means Clustering
n K-means is a partition based clustering
algorithm.
n K-means’ goal: Partition database D into K parts,
where there is little similarity across groups, but
great similarity within a group. More specifically,
K-means aims to minimize the mean square
error of each point in a cluster, with respect to its
cluster centroid.
July 7, 2009 Data Mining: R. Akerkar 52
53. K-Means Example
n Consider the following one-dimensional database with attribute A1:
A1: 2, 4, 10, 12, 3, 20, 30, 11, 25
n Let us use the k-means algorithm to partition this database into k = 2 clusters. We begin by choosing two random starting points, which will serve as the centroids of the two clusters:
µC1 = 2
µC2 = 4
54. n To form clusters, we assign each point in the database to the nearest centroid.
n For instance, 10 is closer to c2 than to c1.
n If a point is the same distance from two centroids, such as point 3 in our example, we make an arbitrary assignment.

Cluster  A1
C1       2
C2       4
C2       10
C2       12
C1       3
C2       20
C2       30
C2       11
C2       25
55. n Once all points have been assigned, we recompute the means of the clusters:

µC1 = (2 + 3) / 2 = 2.5
µC2 = (4 + 10 + 12 + 20 + 30 + 11 + 25) / 7 = 112 / 7 = 16
56. n We then reassign each point to the two clusters based on the new means.
n Remark: point 4 now belongs to cluster C1.
n The steps are repeated until the means converge to their optimal values. In each iteration, the means are re-computed and all points are reassigned.

Cluster  A1
C1       2
C1       4
C2       10
C2       12
C1       3
C2       20
C2       30
C2       11
C2       25
57. n In this example, only one more iteration is
needed before the means converge. We
compute the new means:
µC1 = (2 + 3 + 4) / 3 = 3
µC2 = (10 + 12 + 20 + 30 + 11 + 25) / 6 = 108 / 6 = 18
Now if we reassign the points there is no change in the clusters.
Hence the means have converged to their optimal values and
the algorithm terminates.
59. Exercise
n Apply the K-means algorithm for the
following 1-dimensional points (for k=2): 1;
2; 3; 4; 6; 7; 8; 9.
n Use 1 and 2 as the starting centroids.
60. Solution
Iteration #1
centroid 1: cluster {1} → new mean = 1
centroid 2: cluster {2, 3, 4, 6, 7, 8, 9} → new mean = 5.57
Iteration #2
centroid 1: cluster {1, 2, 3} → new mean = 2
centroid 5.57: cluster {4, 6, 7, 8, 9} → new mean = 6.8
Iteration #3
centroid 2: cluster {1, 2, 3, 4} → new mean = 2.5
centroid 6.8: cluster {6, 7, 8, 9} → new mean = 7.5
Iteration #4
centroid 2.5: cluster {1, 2, 3, 4} → new mean = 2.5
centroid 7.5: cluster {6, 7, 8, 9} → new mean = 7.5
The means haven’t changed, so stop iterating.
The final clusters are {1, 2, 3, 4} and {6, 7, 8, 9}.
61. K-Means for a 2-Dimensional Database
n Let us consider {x1, x2, x3, x4, x5} with the following coordinates as a two-dimensional sample for clustering:
n x1 = (0, 2), x2 = (0, 0), x3 = (1.5,0), x4 = (5,0), x5 = (5, 2)
n Suppose that required number of clusters is 2.
n Initially, clusters are formed from random distribution of
samples:
n C1 = {x1, x2, x4} and C2 = {x3, x5}.
62. Centroid Calculation
n Suppose that the given set of N samples in an n-dimensional space has somehow been partitioned into K clusters {C1, C2, …, CK}.
n Each Ck has nk samples, and each sample is in exactly one cluster; therefore Σ nk = N, where k = 1, …, K.
n The mean vector Mk of cluster Ck is defined as the centroid of the cluster:

Mk = (1/nk) Σ_{i=1}^{nk} xik

where xik is the i-th sample belonging to cluster Ck.
n In our example, the centroids of the two initial clusters are
M1 = {(0 + 0 + 5)/3, (2 + 0 + 0)/3} = {1.66, 0.66}
M2 = {(1.5 + 5)/2, (0 + 2)/2} = {3.25, 1.00}
63. The Square-error of the cluster
n The square-error for cluster Ck is the sum of squared Euclidean distances between each sample in Ck and its centroid:

ek² = Σ_{i=1}^{nk} (xik − Mk)²

n This error is called the within-cluster variation.
n The within-cluster variations, after the initial random distribution of samples, are
e1² = [(0 − 1.66)² + (2 − 0.66)²] + [(0 − 1.66)² + (0 − 0.66)²] + [(5 − 1.66)² + (0 − 0.66)²] = 19.36
e2² = [(1.5 − 3.25)² + (0 − 1)²] + [(5 − 3.25)² + (2 − 1)²] = 8.12
64. Total Square-error
n The square-error for the entire clustering space containing K clusters is the sum of the within-cluster variations:

E² = Σ_{k=1}^{K} ek²

n The total square error is
E² = e1² + e2² = 19.36 + 8.12 = 27.48
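For the two-dimensional sample above, the within-cluster variations can be re-computed directly. A sketch using exact centroids (the slides use centroids rounded to two decimals, so their figures differ slightly):

```python
points = {"x1": (0, 2), "x2": (0, 0), "x3": (1.5, 0), "x4": (5, 0), "x5": (5, 2)}

def centroid(names):
    """Mean vector Mk of the named samples."""
    xs = [points[n] for n in names]
    return tuple(sum(coord) / len(xs) for coord in zip(*xs))

def within_cluster_variation(names):
    """e_k^2: sum of squared Euclidean distances to the cluster centroid."""
    mx, my = centroid(names)
    return sum((x - mx) ** 2 + (y - my) ** 2
               for x, y in (points[n] for n in names))

e1 = within_cluster_variation(["x1", "x2", "x4"])   # initial cluster C1
e2 = within_cluster_variation(["x3", "x5"])         # initial cluster C2
print(round(e1, 2), round(e2, 2), round(e1 + e2, 2))  # 19.33 8.12 27.46
```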
65. n When we reassign all samples based on the minimum distance from the centroids M1 and M2, the new distribution of samples among the clusters is:
d(M1, x1) = (1.66² + 1.34²)^1/2 = 2.14 and d(M2, x1) = 3.40 ⇒ x1 ∈ C1
d(M1, x2) = 1.79 and d(M2, x2) = 3.40 ⇒ x2 ∈ C1
d(M1, x3) = 0.83 and d(M2, x3) = 2.01 ⇒ x3 ∈ C1
d(M1, x4) = 3.41 and d(M2, x4) = 2.01 ⇒ x4 ∈ C2
d(M1, x5) = 3.60 and d(M2, x5) = 2.01 ⇒ x5 ∈ C2
The calculations above use the Euclidean distance formula

d(xi, xj) = [Σ_{k=1}^{m} (xik − xjk)²]^{1/2}
66. n New Clusters C1 = {x1, x2, x3} and C2 = {x4, x5} have new
centroids
n M1 = {0.5, 0.67}
n M2 = {5.0, 1.0}
n The corresponding within-cluster variations and the total
square error are,
n e12 = 4.17
n e22 = 2.00
n E2 = 6.17
67. The cluster membership
stabilizes…
n After the first iteration, the total square error is significantly reduced (from 27.48 to 6.17).
n In this example, if we analyse the distances between the new centroids and the samples, we find that in the second iteration the samples would be assigned to the same clusters.
n Thus there is no further reassignment and the algorithm halts.
68. Variations of the K-Means Method
n A few variants of the k-means which differ in
¨ Selection of the initial k means
¨ Strategies to calculate cluster means
n Handling categorical data: k-modes (Huang’98)
¨ Replacing means of clusters with modes
¨ Using new dissimilarity measures to deal with categorical objects
¨ Using a frequency-based method to update modes of clusters
¨ A mixture of categorical and numerical data: k-prototype method
69. What is the problem of k-Means
Method?
n The k-means algorithm is sensitive to outliers !
¨ Since an object with an extremely large value may substantially
distort the distribution of the data.
n K-Medoids: Instead of taking the mean value of the object in a cluster
as a reference point, medoids can be used, which is the most
centrally located object in a cluster.
70. Exercise 2
n Let the set X consist of the following sample points in 2 dimensional
space:
n X = {(1, 2), (1.5, 2.2), (3, 2.3), (2.5,-1), (0, 1.6), (-1,1.5)}
n Let c1 = (1.5, 2.5) and c2 = (3, 1) be initial estimates of centroids for X.
n What are the revised values of c1 and c2 after 1 iteration of k-means
clustering (k = 2)?
71. Solution 2
n For each data point, calculate the distance to
each centroid:
x y d(xi,c1) d(xi,c2)
x1 1 2 0.707107 2.236068
x2 1.5 2.2 0.3 1.920937
x3 3 2.3 1.513275 1.3
x4 2.5 -1 3.640055 2.061553
x5 0 1.6 1.749286 3.059412
x6 -1 1.5 2.692582 4.031129
72. n It follows that x1, x2, x5 and x6 are closer to c1 and the
other points are closer to c2. Hence replace c1 with the
average of x1, x2, x5 and x6 and replace c2 with the
average of x3 and x4. This gives:
n c1’ = (0.375, 1.825)
n c2’ = (2.75, 0.65)
74. Market-basket problem.
n We are given a set of items and a large collection of
transactions, which are subsets (baskets) of these items.
n Task: To find relationships between the presence of various items within these baskets.
n Example: To analyze customers' buying habits by finding
associations between the different items that customers
place in their shopping baskets.
75. Associations discovery
n Finding frequent patterns, associations, correlations, or
causal structures among sets of items or objects in
transaction databases, relational databases, and other
information repositories
¨ Associations discovery uncovers affinities amongst collection of
items
¨ Affinities are represented by association rules
¨ Associations discovery is an unsupervised approach to data
mining.
76. Association Rule : Application 2
n Supermarket shelf management.
¨ Goal: To identify items that are bought together by
sufficiently many customers.
¨ Approach: Process the point-of-sale data collected
with barcode scanners to find dependencies among
items.
¨A classic rule --
n If a customer buys diaper and milk, then he is very likely to
buy beer.
n So, don’t be surprised if you find six-packs stacked next to
diapers!
77. Association Rule : Application 3
n Inventory Management:
¨ Goal: A consumer appliance repair company
wants to anticipate the nature of repairs on its
consumer products and keep the service
vehicles equipped with right parts to reduce
on number of visits to consumer households.
¨ Approach: Process the data on tools and
parts required in previous repairs at different
consumer locations and discover the co-
occurrence patterns.
78. What is a rule?
n The rule in a rule induction system comes in
the form “If this and this and this then this”
n For a rule to be useful two pieces of
information are needed
1. Accuracy (The lower the accuracy the closer the rule comes to
random guessing)
2. Coverage (how often you can use a useful rule)
n A Rule consists of two parts
1. The antecedent or the LHS
2. The consequent or the RHS.
79. An example

Rule                                                           Accuracy  Coverage
If breakfast cereal purchased, then milk will be purchased        85%      20%
If bread purchased, then swiss cheese will be purchased           15%       6%
If 42 years old and purchased dry roasted peanuts,
then beer will be purchased                                       95%       0.01%
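A sketch of how accuracy and coverage are computed from transactions; the baskets and the rule here are hypothetical:

```python
# Hypothetical market baskets and the rule "IF bread THEN milk".
baskets = [
    {"bread", "milk"}, {"bread", "butter"}, {"milk", "cereal"},
    {"bread", "milk", "cheese"}, {"beer"},
]
lhs, rhs = {"bread"}, {"milk"}

applicable = [b for b in baskets if lhs <= b]   # baskets matching the LHS
correct = [b for b in applicable if rhs <= b]   # ...where the RHS also holds

coverage = len(applicable) / len(baskets)   # how often the rule can be used
accuracy = len(correct) / len(applicable)   # how far it is from random guessing
print(coverage, accuracy)                   # 0.6 0.666...
```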
83. What to do with a rule?
n Target the antecedent
n Target the consequent
n Target based on accuracy
n Target based on coverage
n Target based on “interestingness”
n Antecedent can be one or more conditions all of which must be true in
order for the consequent to be true at the given accuracy.
n Generally the consequent is just a simple condition (eg purchasing one
grocery store item) rather than multiple items.
84. • All rules that have a certain value for the antecedent are gathered
and presented to the user.
• For example, the grocery store may request all rules that have nails or bolts or screws in the antecedent and try to conclude whether discontinuing sales of these lower priced items will have any effect on the higher margin items like hammers.
• All rules that have a certain value for the consequent are
gathered. Can be used to understand what affects the
consequent.
• For instance, it might be useful to know what rules have “coffee” in their RHS. A store owner might want to put coffee close to other items in order to increase sales of both items, or a manufacturer may determine in which magazine to place its next coupons.
85. • Sometimes accuracy is most important. Highly accurate rules (holding 80 or 90% of the time) imply strong relationships even if the coverage is very low.
• For example, a rule may apply only one time out of 1000, but if acting on it that one time is very profitable, it can still be worthwhile. This is how most successful data mining applications work in the financial markets: looking for that limited amount of time in which a very confident prediction can be made.
• Sometimes users want to know the rules that are most widely applicable. By looking at rules ranked by coverage, they can get a high-level view of what is happening in the database most of the time.
• Rules are interesting when they have high coverage and high accuracy but deviate from the norm. Eventually a tradeoff between coverage and accuracy can be made using a measure of interestingness.
86. Evaluating and using rules
n Look at simple statistics.
n Using conjunctions and disjunctions
n Defining “interestingness”
n Other Heuristics
87. Using conjunctions and
disjunctions
n This dramatically increases or decreases the
coverage. For example
¨ If diet soda or regular soda or beer then potato chips,
covers a lot more shopping baskets than just one of
the constraints by themselves.
88. Defining “interestingness”
n Interestingness must have 4 basic behaviors
1. Interestingness = 0 when the rule's accuracy equals the background rate (the a priori probability of the RHS); such rules are discarded.
2. Interestingness increases as accuracy increases if
coverage fixed
3. Interestingness increases or decreases with
coverage if accuracy stays fixed.
4. Interestingness decreases with coverage for a fixed
number of correct responses.
89. Other Heuristics
n Look at the actual number of records covered
and not as a probability or a percentage.
n Compare a given pattern to random chance.
This will be an “out of the ordinary measure”.
n Keep it simple
90. Example
Here t supports items C, DM, and CO. The item DM is supported by 4
out of 6 transactions in T. Thus, the support of DM is 66.6%.
92. Association Rules
n Algorithms that obtain association rules
from data usually divide the task into two
parts:
¨ findthe frequent itemsets and
¨ form the rules from them.
93. Association Rules
n The problem of mining association rules can be divided into two subproblems: finding the frequent itemsets, and forming the association rules from them.
95. a priori algorithm
n Proposed by Agrawal and Srikant in 1994.
n It is also called the level-wise algorithm.
¨ It is the most accepted algorithm for finding all the
frequent sets.
¨ It makes use of the downward closure property.
¨ The algorithm is a bottom-up search, progressing
upward level-wise in the lattice.
n The interesting fact –
¨ before reading the database at every level, it prunes
many of the sets, which are unlikely to be frequent
sets.
100. Exercise 3
Suppose that L3 is the list
{{a,b,c}, {a,b,d}, {a,c,d}, {b,c,d}, {b,c,w}, {b,c,x}, {p,q,r}, {p,q,s}, {p,q,t}, {p,r,s}, {q,r,s}}
Which itemsets are placed in C4 by the join
step of the Apriori algorithm? Which are
then removed by the prune step?
101. Solution3
n At the join step of Apriori Algorithm, each
member (set) is compared with every other
member.
n If all the elements of the two members are
identical except the right most ones, the union of
the two sets is placed into C4.
n For the members of L3 given the following sets
of four elements are placed into C4:
{a,b,c,d}, {b,c,d,w}, {b,c,d,x}, {b,c,w,x}, {p,q,r,s}, {p,q,r,t}
and {p,q,s,t}.
102. Solution3 (continued)
n At the prune step of the algorithm, each member of C4 is
checked to see whether all its subsets of 3 elements are
members of L3.
n The result in this case is as follows:
103. Solution3 (continued)
n Therefore,
{b,c,d,w}, {b,c,d,x}, {b,c,w,x}, {p,q,r,t} and {p,q,s,t}
are removed by the prune step
n Leaving C4 as,
{{a,b,c,d}, {p,q,r,s}}
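The join and prune steps of Exercise 3 can be checked mechanically. A sketch (itemsets kept as sorted tuples so the "all but the right-most element" comparison is well defined):

```python
from itertools import combinations

L3 = [("a","b","c"), ("a","b","d"), ("a","c","d"), ("b","c","d"), ("b","c","w"),
      ("b","c","x"), ("p","q","r"), ("p","q","s"), ("p","q","t"), ("p","r","s"),
      ("q","r","s")]

# Join step: union two members that are identical except for the right-most item.
C4 = sorted({tuple(sorted(set(a) | set(b)))
             for a, b in combinations(L3, 2) if a[:-1] == b[:-1]})

# Prune step: keep a candidate only if every 3-element subset is in L3.
L3set = set(L3)
C4_pruned = [c for c in C4 if all(s in L3set for s in combinations(c, 3))]

print(C4)         # the 7 candidates produced by the join step
print(C4_pruned)  # [('a','b','c','d'), ('p','q','r','s')]
```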
104. Exercise 4
n Given a dataset with four attributes w, x, y
and z, each with three values, how many
rules can be generated with one term on
the right-hand side?
105. Solution 4
n Let us assume that the attribute w has 3 values w1, w2,
and w3, and similarly for x, y, and z.
n If we select arbitrarily attribute w to be on the right-hand
side of each rule, there are 3 possible types of rule:
¨ IF…THEN w=w1
¨ IF…THEN w=w2
¨ IF…THEN w=w3
n Now choose one of these rules, say the first, and
calculate how many possible left hand sides there are
for such rules.
106. Solution 4 (continued)
n The number of “attribute=value” terms on
the LHS can be 1, 2, or 3.
n Case I: One term on LHS
¨ There are 3 possible attributes: x, y, and z. Each has 3 possible values, so there are 3x3=9 possible LHS, e.g. IF x=x1.
107. Solution 4 (continued)
n Case II: 2 terms on LHS
¨ There are 3 ways in which combination of 2
attributes may appear on the LHS: x and y, y
and z, and x and z.
¨ Each attribute has 3 values, so for each pair
there are 3x3=9 possible LHS, e.g. IF x=x1
AND y=y1
¨ There are 3 possible pairs of attributes, so the total number of possible LHS is 3x9=27.
108. Solution 4 (continued)
n Case III: 3 terms on LHS
¨ All 3 attributes x, y and z must be on LHS.
¨ Each has 3 values, so 3x3x3=27 possible LHS, e.g.
IF x=x1 AND y=y1 AND z=z1.
¨ Thus for each of the 3 possible “w=value” terms on the RHS, the
total number of LHS with 1,2 or 3 terms is 9+27+27=63.
¨ So there are 3x63 = 189 possible rules with attribute w on the
RHS.
¨ The attribute on the RHS could be any of four possibilities (not
just w). Therefore total possible number of rules is 4x189=756.
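The counting argument of Solution 4 can be verified by brute force; a short sketch:

```python
from itertools import combinations

attrs = {"w": 3, "x": 3, "y": 3, "z": 3}   # four attributes, three values each

total = 0
for rhs in attrs:                           # choose the attribute on the RHS
    others = [a for a in attrs if a != rhs]
    for r in (1, 2, 3):                     # 1, 2 or 3 terms on the LHS
        for combo in combinations(others, r):
            choices = attrs[rhs]            # 3 possible RHS values
            for a in combo:                 # each LHS term picks one of 3 values
                choices *= attrs[a]
            total += choices
print(total)   # 756, matching 4 x 189
```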
109. References
n R. Akerkar and P. Lingras. Building an Intelligent Web: Theory &
Practice, Jones & Bartlett, 2008 (In India: Narosa Publishing House,
2009)
n U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy.
Advances in Knowledge Discovery and Data Mining. AAAI/MIT Press,
1996
n U. Fayyad, G. Grinstein, and A. Wierse, Information Visualization in Data
Mining and Knowledge Discovery, Morgan Kaufmann, 2001
n J. Han and M. Kamber. Data Mining: Concepts and Techniques. Morgan
Kaufmann, 2001
n D. J. Hand, H. Mannila, and P. Smyth, Principles of Data Mining, MIT
Press, 2001