Data mining chapter04and5-best

Classification vs. Prediction
Classification:
– predicts categorical (discrete and unordered) class
labels
– classifies data (constructs a model) based on the
training set and the values (class labels) in a
classifying attribute and uses it in classifying new
data
Prediction:
– models continuous-valued functions to predict
unknown or missing values
Chapter 4 - Classification
Classification vs. Prediction (cont.)
Prediction is similar to classification
– First, construct a model
– Second, use model to predict unknown value
Major method for prediction is regression
– Linear and multiple regression
– Non-linear regression
Prediction is different from classification
– Classification predicts categorical class labels
– Prediction models continuous-valued functions
Recall- Data Mining Models and Tasks
Classification
 Classification process involves two steps
1. Model construction:
• Describes a set of predetermined classes using a training data set
• The training data is a set of tuples in which each tuple/sample is
assumed to belong to a predefined class, as determined by the
class label attribute
• The model is represented as classification rules, decision trees, or
mathematical formulae
2. Model usage:
• Uses the model to classify future or unknown objects
• Or to explain some scenario, with an accuracy established by testing
Classification—A Two-Step Process –cont...
 Model construction: describing a set of predetermined
classes
– Each tuple/sample is assumed to belong to a predefined class,
as determined by the class label attribute
– The set of tuples used for model construction: training set
– The model is represented as classification rules, decision trees,
or mathematical formulae
 Model usage: for classifying future or unknown objects
– Estimate the accuracy of the model:
   • The known label of each test sample is compared with the
   class predicted by the model
   • The accuracy rate is the percentage of test set samples that are
   correctly classified by the model
   • The test set must be independent of the training set; otherwise the
   accuracy estimate is over-optimistic and over-fitting goes undetected

Classification Process (1): Model Construction
Training Data:
NAME  RANK            YEARS  TENURED
Mike  Assistant Prof  3      no
Mary  Assistant Prof  7      yes
Bill  Professor       2      yes
Jim   Associate Prof  7      yes
Dave  Assistant Prof  6      no
Anne  Associate Prof  3      no
Training Data -> Classification Algorithms -> Classifier (Model):
IF rank = ‘professor’ OR years > 6 THEN tenured = ‘yes’
Classification Process (2): Use the Model in Prediction
Testing Data (input to the Classifier):
NAME     RANK            YEARS  TENURED
Tom      Assistant Prof  2      no
Merlisa  Associate Prof  7      no
George   Professor       5      yes
Joseph   Assistant Prof  7      yes
Unseen Data: (Jeff, Professor, 4) -> Tenured?
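As a minimal sketch of the two steps above, the rule produced by the classification algorithm can be written as a function and checked against the testing data. The function and variable names are illustrative and do not come from the slides.

```python
# Rule from the slide:
# IF rank = 'professor' OR years > 6 THEN tenured = 'yes'
def predict_tenured(rank, years):
    return "yes" if rank.lower() == "professor" or years > 6 else "no"

# Testing data from the slide: (name, rank, years, known label)
test_data = [
    ("Tom", "Assistant Prof", 2, "no"),
    ("Merlisa", "Associate Prof", 7, "no"),
    ("George", "Professor", 5, "yes"),
    ("Joseph", "Assistant Prof", 7, "yes"),
]

# Accuracy rate: fraction of test samples whose predicted label
# matches the known label
correct = sum(
    predict_tenured(rank, years) == label
    for _, rank, years, label in test_data
)
accuracy = correct / len(test_data)
print(f"accuracy = {accuracy:.2f}")

# Unseen data: (Jeff, Professor, 4)
print("Jeff tenured?", predict_tenured("Professor", 4))
```

On this test set the rule misclassifies Merlisa (7 years, not tenured), so 3 of 4 samples are correct, and it classifies Jeff as tenured.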
Supervised vs. Unsupervised Learning
 Supervised learning (classification)
– Supervision: The training data (observations, measurements,
etc.) are accompanied by labels indicating the class of the
observations
– New data is classified based on the training set
 Unsupervised learning (clustering)
– The class labels of the training data are unknown
– Given a set of measurements, observations, etc., the aim is
to establish the existence of classes or clusters in the data
– Classification usually follows clustering
Issues regarding classification and prediction: Data Preparation
Data cleaning
– Preprocess data in order to reduce noise and
handle missing values
Relevance analysis (feature selection)
– Remove the irrelevant or redundant attributes
Data transformation
– Generalize and/or normalize data
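The slides do not prescribe particular techniques for these steps. As one possible sketch, assuming mean imputation for missing values and min-max scaling for normalization (both are assumptions, and the data is hypothetical):

```python
def fill_missing_with_mean(values):
    """Data cleaning: replace None entries with the mean of the known values."""
    known = [v for v in values if v is not None]
    mean = sum(known) / len(known)
    return [mean if v is None else v for v in values]

def min_max_normalize(values):
    """Data transformation: rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant column: avoid division by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical income column (in thousands) with one missing value
income = [125, 100, None, 120, 95]
cleaned = fill_missing_with_mean(income)
normalized = min_max_normalize(cleaned)
print(cleaned)
print(normalized)
```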
Issues regarding classification and prediction: Evaluating Classification Methods
 Predictive accuracy
– Measures how accurately the classifier predicts the class label of an object
 Speed
– This refers to the computational costs involved in generating and using the
given classifier or predictor
– time to construct the model
– time to use the model
 Scalability
– This refers to the ability to construct the classifier or predictor efficiently
given large amounts of data
 Robustness
– This is the ability of the classifier or predictor to make correct predictions
given noisy data or data with missing values
 Interpretability:
– This refers to the level of understanding and insight that is provided by the
classifier or predictor.
Classification: Technical Definition
 Given a collection of records (training set )
– Each record contains a set of attributes; one of the
attributes is the class.
 Find a model for class attribute as a function
of the values of other attributes.
 Goal: previously unseen records should be
assigned a class as accurately as possible.
– A test set is used to determine the accuracy of the
model. Usually, the given data set is divided into
training and test sets, with training set used to build
the model and test set used to validate it.
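A minimal sketch of dividing a data set into training and test sets. The 70/30 split, the fixed seed, and the function name are assumptions chosen for illustration.

```python
import random

def train_test_split(records, test_fraction=0.3, seed=0):
    """Shuffle a copy of the records and split it into (training set, test set)."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

records = list(range(1, 11))  # stand-ins for 10 labelled records
train, test = train_test_split(records, test_fraction=0.3)
print(len(train), len(test))
```

Because the two sets are disjoint, accuracy measured on the test set is an estimate for previously unseen records.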

Thus Classification
 Classification is a data mining (machine learning) technique
used to predict group membership for data instances.
 Given a collection of records (training set), each record
contains a set of attributes, one of the attributes is the class.
– Find a model for class attribute as a function of the values of other
attributes.
 Goal: previously unseen records should be assigned a class
as accurately as possible. A test set is used to determine the
accuracy of the model.
– Usually, the given data set is divided into training and test sets, with
training set used to build the model and test set used to validate it.
 For example, one may use classification to predict whether the weather
on a particular day will be “sunny”, “rainy” or “cloudy”.
Illustrating Classification Task
Training Set:
Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
4    Yes      Medium   120K     No
5    No       Large    95K      Yes
6    No       Medium   60K      No
7    Yes      Large    220K     No
8    No       Small    85K      Yes
9    No       Medium   75K      No
10   No       Small    90K      Yes
Induction: Training Set -> Learning algorithm -> Learn Model -> Model
Test Set:
Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?
Deduction: Model + Test Set -> Apply Model
Classification methods
• Goal: Predict class Ci = f(x1, x2, ..., xn)
• There are various classification methods. Popular
classification techniques include the following.
– Decision tree classifier: divides the decision space into
piecewise constant regions.
– Rule-based and association-based classifiers
– K-Nearest Neighbour: classifies based on a similarity
measurement
– Neural networks: partition by non-linear boundaries
– Bayesian network: a probabilistic model
– Support vector machine: can handle problems that are not
linearly separable
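To illustrate one of the listed methods, here is a minimal K-Nearest Neighbour sketch. It assumes Euclidean distance as the similarity measurement and a simple majority vote; the training points are hypothetical.

```python
import math
from collections import Counter

def knn_classify(query, training, k=3):
    """Predict the class of `query` by majority vote among its k nearest training points."""
    by_distance = sorted(training, key=lambda item: math.dist(query, item[0]))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Hypothetical training set: ((x1, x2), class label)
training = [
    ((1.0, 1.0), "A"), ((1.5, 2.0), "A"), ((2.0, 1.5), "A"),
    ((6.0, 6.0), "B"), ((6.5, 7.0), "B"), ((7.0, 6.5), "B"),
]
print(knn_classify((2.0, 2.0), training))
print(knn_classify((6.0, 7.0), training))
```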
Simple classification using decision tree
Decision tree classifier
 Decision tree performs classification by constructing a tree
based on training instances with leaves having class labels.
– The tree is traversed for each test instance to find a leaf,
and the class of the leaf is the predicted class. This is a
directed knowledge discovery in the sense that there is a
specific field whose value we want to predict.
 Widely used learning method. It has been applied to:
– classifying medical patients by disease,
– classifying equipment malfunctions by cause,
– classifying loan applicants by likelihood of repayment,
– classifying accidents by severity.
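The traversal described above can be sketched as follows. The tree here is hand-built to match the tenure rule from the earlier slide (rank = 'professor' OR years > 6); it is not the output of a tree-induction algorithm, and the dictionary layout is an assumption made for illustration.

```python
# Internal nodes hold a test; leaves hold a class label.
tree = {
    "test": lambda r: r["rank"] == "Professor",
    "yes": {"label": "yes"},
    "no": {
        "test": lambda r: r["years"] > 6,
        "yes": {"label": "yes"},
        "no": {"label": "no"},
    },
}

def classify(node, record):
    """Walk from the root to a leaf and return the leaf's class label."""
    while "label" not in node:
        branch = "yes" if node["test"](record) else "no"
        node = node[branch]
    return node["label"]

print(classify(tree, {"rank": "Professor", "years": 4}))
print(classify(tree, {"rank": "Assistant Prof", "years": 2}))
```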
Pros and Cons of decision trees
Pros
• Reasonable training time
• Fast application
• Easy to interpret
• Easy to implement
• Can handle a large number of features
Cons
• Cannot handle complicated relationships between features
• Simple decision boundaries
• Problems with lots of missing data
Why decision tree induction in data mining?
• Relatively fast learning speed compared with other classification
methods
• Convertible to simple, easy-to-understand if-then-else
classification rules
• Classification accuracy comparable with other methods
• Does not require any prior knowledge of the data
distribution; works well on noisy data.
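A minimal sketch of how a tree converts to if-then rules: each root-to-leaf path becomes one rule. The tuple-based tree format and the example tree are hypothetical.

```python
# Internal nodes are (condition, subtree_if_true, subtree_if_false);
# leaves are plain class-label strings.
tree = ("rank = 'professor'",
        "yes",
        ("years > 6", "yes", "no"))

def extract_rules(node, conditions=()):
    """Return one IF-THEN rule per root-to-leaf path."""
    if isinstance(node, str):  # leaf: emit a rule
        premise = " AND ".join(conditions) or "TRUE"
        return [f"IF {premise} THEN tenured = '{node}'"]
    condition, if_true, if_false = node
    return (extract_rules(if_true, conditions + (condition,))
            + extract_rules(if_false, conditions + (f"NOT ({condition})",)))

rules = extract_rules(tree)
for rule in rules:
    print(rule)
```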

Chapter 5 - Cluster Analysis
What is Cluster Analysis?
 Cluster: a collection of data objects
– Similar to one another within the same cluster
– Dissimilar to the objects in other clusters
 Cluster analysis
– Grouping a set of data objects into clusters
 Clustering is unsupervised classification: no predefined classes
 Typical applications
– As a stand-alone tool to get insight into data distribution
– As a preprocessing step for other algorithms
Clustering cont…
Given a set of points, with a notion of distance between points,
group the points into some number of clusters, so that members of
a cluster are in some sense as close to each other as possible.
While data points in the same cluster are similar, those in
separate clusters are dissimilar to one another.
[Figure: scatter plot of points (x) forming several groups]
• Clustering is a data mining (machine learning) technique that
finds similarities between data according to the characteristics
found in the data & groups similar data objects into one cluster
Cont…
 Thus Cluster Analysis
– Finding groups of objects such that the objects in a group will be
similar (or related) to one another and different from (or unrelated
to) the objects in other groups
[Figure: intra-cluster distances are minimized; inter-cluster distances are maximized]
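The two kinds of distance in this definition can be compared numerically. The sketch below assumes Euclidean distance and averages over point pairs, which is only one of several possible measures; the clusters are hypothetical.

```python
import math
from itertools import combinations, product

def mean_intra_distance(cluster):
    """Average distance between all pairs of points inside one cluster."""
    pairs = list(combinations(cluster, 2))
    return sum(math.dist(p, q) for p, q in pairs) / len(pairs)

def mean_inter_distance(cluster_a, cluster_b):
    """Average distance between points drawn from two different clusters."""
    pairs = list(product(cluster_a, cluster_b))
    return sum(math.dist(p, q) for p, q in pairs) / len(pairs)

# Hypothetical 2-D clusters
c1 = [(1, 1), (1, 2), (2, 1)]
c2 = [(8, 8), (8, 9), (9, 8)]
intra = (mean_intra_distance(c1) + mean_intra_distance(c2)) / 2
inter = mean_inter_distance(c1, c2)
print(f"intra = {intra:.2f}, inter = {inter:.2f}")
```

A good clustering of this data yields a small intra-cluster value relative to the inter-cluster value.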
Quality: What Is Good Clustering?
• A good clustering method will produce high quality clusters
with
– high intra-class similarity
– low inter-class similarity
• The quality of a clustering result depends on both the
similarity measure used by the method and its
implementation
– Key requirement of clustering: Need a good measure of similarity
between instances.
• The quality of a clustering method is also measured by its
ability to discover some or all of the hidden patterns in the
given datasets

Requirements of Clustering in Data Mining
 Scalability
– Highly scalable algorithms are needed for clustering large databases such as data warehouses (DW)
 Ability to deal with different types of attributes
– Clustering may be performed also on binary, categorical and ordinal data
 Discovery of clusters with arbitrary shape
– Most algorithms tend to find spherical clusters
 Minimal requirements for domain knowledge to determine input parameters
– Clustering results are quite sensitive to the input parameters
– Parameters are often difficult to determine
 Able to deal with noise and outliers
 Insensitive to order of input records
 High dimensionality
– A data warehouse (DW) can contain many dimensions
 Incorporation of user-specified constraints
 Interpretability and usability
Example: Clustering Application
• Text/Document Clustering:
– Goal: To find groups of documents that are similar
to each other based on the important terms
appearing in them.
– Approach:
   • Identify frequently occurring terms in each document.
   • Form a similarity measure based on the frequencies
   of different terms and use it to cluster documents.
– Gain: Information retrieval can use the clusters
to relate a new document or search term to the clustered
documents.
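A sketch of a frequency-based similarity measure between documents. The slides do not name a specific measure; cosine similarity over raw term counts is assumed here, and the documents are hypothetical.

```python
import math
from collections import Counter

def term_frequencies(text):
    """Count how often each lower-cased word occurs in a document."""
    return Counter(text.lower().split())

def cosine_similarity(tf_a, tf_b):
    """Cosine of the angle between two term-frequency vectors."""
    dot = sum(tf_a[t] * tf_b[t] for t in tf_a.keys() & tf_b.keys())
    norm_a = math.sqrt(sum(c * c for c in tf_a.values()))
    norm_b = math.sqrt(sum(c * c for c in tf_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical documents
docs = {
    "d1": "decision tree classification rules",
    "d2": "classification rules from a decision tree",
    "d3": "stock price fluctuations",
}
tf = {name: term_frequencies(text) for name, text in docs.items()}
print(cosine_similarity(tf["d1"], tf["d2"]))
print(cosine_similarity(tf["d1"], tf["d3"]))
```

Documents d1 and d2 share most of their terms and score high, while d1 and d3 share none and score zero, so a clustering method using this measure would group d1 with d2.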
Cont…
Cluster analysis can be applied for
– Understanding: group related documents for browsing,
group genes and proteins that have similar functionality,
or group stocks with similar price fluctuations
– Summarization: reduce the size of large data sets
What is not Cluster Analysis?
 Supervised classification
– Have class label information
 Simple segmentation
– Dividing students into different registration groups
alphabetically, by last name
 Results of a query
– Groupings are a result of an external specification
Types of Clusters
 Major types : Well-separated clusters and Center-based
clusters
 Well-Separated Clusters:
– A cluster is a set of points such that any point in a cluster is
closer (or more similar) to every other point in the cluster than
to any point not in the cluster.
[Figure: 3 well-separated clusters]
Types of Clusters: Center-Based
 Center-based
– A cluster is a set of objects such that an object in a cluster is
closer (more similar) to the “center” of a cluster, than to the
center of any other cluster
– The center of a cluster is often a centroid, the average of all
the points in the cluster, or a medoid, the most “representative”
point of a cluster
[Figure: 4 center-based clusters]
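The difference between a centroid and a medoid can be sketched as follows, assuming Euclidean distance and a hypothetical cluster that contains one outlying point.

```python
import math

def centroid(points):
    """Average of all points in the cluster (need not be an actual data point)."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def medoid(points):
    """The data point with the smallest total distance to all other points."""
    return min(points, key=lambda p: sum(math.dist(p, q) for q in points))

# Hypothetical cluster with one outlying point
cluster = [(1, 1), (2, 1), (1, 2), (2, 2), (10, 10)]
print(centroid(cluster))
print(medoid(cluster))
```

The centroid is pulled toward the outlier and is not itself a member of the cluster, whereas the medoid is always one of the original points.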

Types of data in cluster analysis
• Variables have different data types
• These differences require an appropriate distance computation for
cluster analysis
• Some of the data types are:
– Interval-scaled variables
– Binary variables
– Nominal and ordinal variables
– Mixed types
Interval-scaled variables
• These are variables whose values are continuous measurements,
such as height, weight, age
• Because the measurement unit affects the distances between objects,
we need preprocessing that removes the effect of the unit of measurement
• This is called standardization
Binary Variables
 A binary variable is a variable which has only two possible values (1 or 0,
yes or no, etc)
– For example smoker, educated, Ethiopian, IsFemale etc
Nominal Variables
• A nominal variable generalizes the binary variable: it can take more
than two states, e.g., red, yellow, blue, green
Ordinal Variables
 An ordinal variable can be discrete or continuous
 order is important, e.g., rank
Variables of Mixed Types
 A database may contain different types of variables
– symmetric binary, asymmetric binary,
nominal, ordinal, interval.
 One may use
– a weighted formula to combine their effects,
– or preprocess the data so that it fits the technique's
requirements
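One possible form of such a weighted formula is sketched below. It is an assumption in the style of a Gower dissimilarity: interval attributes contribute a range-scaled absolute difference, nominal and binary attributes contribute a 0/1 mismatch, and the result is a weighted average. Ordinal attributes are left out for brevity, and all names and data are hypothetical.

```python
def mixed_dissimilarity(a, b, kinds, ranges, weights=None):
    """Weighted average of per-attribute dissimilarities, each scaled to [0, 1].

    kinds[i] is "interval" (numeric) or "nominal" (binary/categorical);
    ranges[i] is max - min of attribute i for interval attributes, else None.
    """
    weights = weights or [1.0] * len(kinds)
    total = 0.0
    for x, y, kind, rng, w in zip(a, b, kinds, ranges, weights):
        if kind == "interval":
            d = abs(x - y) / rng if rng else 0.0
        else:  # nominal or binary: simple mismatch
            d = 0.0 if x == y else 1.0
        total += w * d
    return total / sum(weights)

# Hypothetical objects: (age, smoker, favourite colour)
kinds = ["interval", "nominal", "nominal"]
ranges = [50, None, None]  # assumed age range of 50 years in the data set
p = (30, "yes", "red")
q = (40, "yes", "blue")
print(mixed_dissimilarity(p, q, kinds, ranges))
```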
Major Clustering Approaches
 Partitioning clustering approach:
– Construct various partitions and then evaluate them by some criterion, e.g.,
minimizing the sum of squared errors
– Typical methods:
   • distance-based: K-means clustering
   • model-based: expectation maximization (EM) clustering
 Hierarchical clustering approach:
– Create a hierarchical decomposition of the set of data (or objects) using some
criterion
– Typical methods:
   • agglomerative vs. divisive
   • single-link vs. complete-link
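As an illustration of the partitioning approach, here is a minimal K-means sketch that also reports the sum of squared errors. The initialisation (first k points), the iteration cap, and the data are assumptions made for the example; practical implementations usually use better initialisation.

```python
import math

def k_means(points, k, iterations=100):
    """Plain k-means: alternate between assigning points and recomputing centroids."""
    centroids = list(points[:k])  # naive initialisation: first k points
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        new_centroids = [
            tuple(sum(c) / len(cluster) for c in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:  # converged
            break
        centroids = new_centroids
    # Sum of squared errors: squared distance of each point to its centroid
    sse = sum(math.dist(p, centroids[i]) ** 2
              for i, cluster in enumerate(clusters) for p in cluster)
    return centroids, clusters, sse

# Hypothetical 2-D data with two obvious groups
points = [(1, 1), (9, 9), (1, 2), (2, 1), (8, 9), (9, 8)]
centroids, clusters, sse = k_means(points, k=2)
print(centroids)
print(round(sse, 3))
```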
