1. Presented and Contributed by:
Ahmet Selman Bozkır
Hacettepe University Ph.D. Student
November 29, 2011 1
2. What is data mining?
Motivation: Why data mining?
Classification of data mining systems
Architecture: Typical Data Mining System
Data mining functionality
3. Data mining (knowledge discovery from data)
Extraction of interesting (non-trivial, implicit, previously unknown,
and potentially useful) patterns or knowledge from huge amounts of
data
Data mining: a misnomer?
Alternative names
Knowledge discovery (mining) in databases (KDD), knowledge
extraction, data/pattern analysis, data archeology, data
dredging, information harvesting, business intelligence, etc.
Watch out: Is everything “data mining”?
(Deductive) query processing.
Expert systems or small ML/statistical programs
4. Data explosion problem
Automated data collection tools and mature database technology lead
to tremendous amounts of data accumulated and/or to be analyzed in
databases, data warehouses, and other information repositories
We are drowning in data, but starving for knowledge!
Solution: Data warehousing and data mining
Data warehousing and on-line analytical processing
Mining interesting knowledge (rules, regularities, patterns, constraints)
from data in large databases
5. Data analysis and decision support
Market analysis and management
▪ Target marketing, customer relationship management
(CRM), market basket analysis, cross selling, market segmentation
Risk analysis and management
▪ Forecasting, customer retention, improved underwriting, quality
control, competitive analysis
Fraud detection and detection of unusual patterns (outliers)
Other Applications
Text mining (newsgroups, email, documents) and Web mining
Bioinformatics and bio-data analysis
6. Target marketing
Find clusters of “model” customers who share the same characteristics:
interest, income level, spending habits, etc.
Determine customer purchasing patterns over time
Cross-market analysis: find associations/correlations between product
sales, and predict based on such associations
Customer profiling: what types of customers buy what products
(clustering or classification)
Customer requirement analysis
Identify the best products for different groups of customers
Predict what factors will attract new customers
7. Finance planning and asset evaluation
cash flow analysis and prediction; cross-sectional and time series
analysis (financial ratio, trend analysis, etc.)
Resource planning
summarize and compare the resources and spending
Competition
monitor competitors and market directions
group customers into classes and apply a class-based pricing procedure
set pricing strategy in a highly competitive market
8. Approaches: clustering and model construction for fraud detection, outlier analysis
Applications: Health care, retail, credit card service, telecomm.
Auto insurance: ring of collisions
Money laundering: suspicious monetary transactions
Medical insurance
▪ Professional patients, ring of doctors, and ring of references
▪ Unnecessary or correlated screening tests
Telecommunications: phone-call fraud
▪ Phone call model: destination of the call, duration, time of day or week.
Analyze patterns that deviate from an expected norm
Retail industry
▪ Analysts estimate that 38% of retail shrink is due to dishonest employees
Anti-terrorism
9. Data mining: the core of the knowledge discovery process
Databases → data cleaning and data integration → data warehouse →
selection of task-relevant data → data mining → pattern evaluation
10. Increasing potential to support business decisions (bottom to top):
Making decisions (End User)
Data presentation, visualization techniques (Business Analyst)
Data mining, information discovery (Data Analyst)
Data exploration: statistical analysis, querying and reporting
Data warehouses / data marts, OLAP, MDA (DBA)
Data sources: paper, files, information providers, database systems, OLTP
11. Learning the application domain
relevant prior knowledge and goals of application
Creating a target data set: data selection
Data cleaning and preprocessing: (may take 70% of effort!)
Data reduction and transformation
Find useful features, dimensionality/variable reduction, invariant
representation.
Choosing functions of data mining
summarization, classification, regression, association, clustering.
Choosing the mining algorithm(s)
Data mining: search for patterns of interest
Pattern evaluation and knowledge presentation
visualization, transformation, removing redundant patterns, etc.
Use of discovered knowledge
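The steps above can be sketched end to end. This is a toy illustration in plain Python; the records and the stage boundaries are made up for illustration and come from no particular library:

```python
# Toy sketch of a KDD pipeline over a list of records (illustrative data).
raw = [
    {"age": 25, "income": 3000, "buys": "yes"},
    {"age": None, "income": 4500, "buys": "no"},   # record with a missing value
    {"age": 40, "income": 5200, "buys": "yes"},
]

# 1. Data cleaning: drop records with missing values
clean = [r for r in raw if all(v is not None for v in r.values())]

# 2. Data selection / reduction: keep only task-relevant features
target = [r["buys"] for r in clean]
features = [{"income": r["income"]} for r in clean]

# 3. "Mining": a trivial summarization pattern
avg_income = sum(f["income"] for f in features) / len(features)

# 4. Pattern evaluation / presentation
print(f"records kept: {len(clean)}, average income: {avg_income:.0f}")
```

In practice the cleaning and selection stages dominate the effort, as the slide notes.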
12. Architecture of a typical data mining system:
Graphical user interface
Pattern evaluation
Data mining engine
Knowledge base
Database or data warehouse server
Data cleaning, data integration, and filtering
Databases and data warehouse
13. General functionality
Descriptive data mining
Predictive data mining
Different views, different classifications
Kinds of databases to be mined
Kinds of knowledge to be discovered
Kinds of techniques utilized
Kinds of applications adapted
14. Concept description: Characterization and discrimination
Generalize, summarize, and contrast data characteristics, e.g., dry vs.
wet regions
Association (correlation and causality)
Diaper → Beer [support = 0.5%, confidence = 75%]
Classification and Prediction
Construct models (functions) that describe and distinguish classes or
concepts for future prediction
▪ E.g., classify countries based on climate, or classify cars based on gas
mileage
Presentation: decision-tree, classification rule, neural network
Predict some unknown or missing numerical values
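The bracketed figures on an association rule are its support and confidence. A minimal sketch of how both are computed, using a made-up transaction list:

```python
# Compute support and confidence for the rule Diaper -> Beer
# over a toy transaction database (illustrative data).
transactions = [
    {"diaper", "beer", "milk"},
    {"diaper", "beer"},
    {"diaper", "bread"},
    {"milk", "bread"},
]

both = sum(1 for t in transactions if {"diaper", "beer"} <= t)
diaper = sum(1 for t in transactions if "diaper" in t)

support = both / len(transactions)      # P(diaper and beer)
confidence = both / diaper              # P(beer | diaper)
print(f"support = {support:.0%}, confidence = {confidence:.1%}")
```

Support measures how often the itemset appears at all; confidence measures how often the consequent follows given the antecedent.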
November 29, 2011 14
15. Cluster analysis
Class label is unknown: Group data to form new
classes, e.g., cluster houses to find distribution patterns
Maximizing intra-class similarity & minimizing interclass
similarity
Outlier analysis
Outlier: a data object that does not comply with the general
behavior of the data
Noise or exception? No: useful in fraud detection and
rare-event analysis
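The intra-/inter-class criterion can be illustrated in a few lines: points are assigned to the nearer of two assumed centers, then the within-cluster spread (to be minimized) is compared with the between-center separation. The points and centers are invented for illustration:

```python
# Group 1-D points around two fixed centers and measure
# intra-cluster spread vs. inter-cluster separation (illustrative data).
points = [1.0, 1.2, 0.8, 9.0, 9.5, 8.7]
centers = [1.0, 9.0]  # assumed cluster centers for this sketch

clusters = {0: [], 1: []}
for p in points:
    nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
    clusters[nearest].append(p)

# total distance of points to their own center (intra-class)
intra = sum(abs(p - centers[i]) for i, ps in clusters.items() for p in ps)
# distance between the two centers (inter-class)
inter = abs(centers[0] - centers[1])
print(f"intra-cluster distance {intra:.1f}, inter-cluster distance {inter:.1f}")
```

A good clustering keeps the first number small relative to the second.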
16. Data mining: discovering interesting patterns from large amounts of data
A natural evolution of database technology, in great demand, with wide
applications
A KDD process includes data cleaning, data integration, data
selection, transformation, data mining, pattern evaluation, and knowledge
presentation
Mining can be performed in a variety of information repositories
Data mining functionalities: characterization, discrimination, association,
classification, clustering, outlier and trend analysis, etc.
Data mining systems and architectures
Major issues in data mining
19. • A decision tree (DT) is a hierarchical classification
and prediction model
• It is organized as a rooted tree with two types of
nodes: non-terminal (decision) nodes and terminal (leaf) nodes
• It is a supervised data mining model used for
classification or prediction
21. Chance and Terminal Nodes
•Each internal node of a DT is a decision point, where some
condition is tested
•The result of this condition determines which branch of the
tree is taken next
•Thus they are called decision nodes, chance nodes, or
non-terminal nodes
•Chance nodes partition the available data at that point to
maximize differences in the dependent variable
22. Terminal nodes
•The leaf nodes of a DT are called terminal nodes
•They indicate the class into which a data instance will
be classified
•They have just one incoming edge
•They have no child nodes (outgoing edges)
•No conditions are tested at terminal nodes
•Tree traversal from the root to a leaf produces the
production rule for that class
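A minimal sketch of these ideas: decision nodes test a condition, terminal nodes carry a class label, and a root-to-leaf traversal yields the production rule. The tree, attribute names, and threshold below are invented for illustration:

```python
class Node:
    """Decision node if `test` is set; terminal node if `label` is set."""
    def __init__(self, test=None, branches=None, label=None):
        self.test = test          # (attribute, threshold) for decision nodes
        self.branches = branches  # {"yes": Node, "no": Node}
        self.label = label        # class label for terminal nodes

# A tiny hand-built tree: is income > 4000?
tree = Node(test=("income", 4000), branches={
    "yes": Node(label="buys"),
    "no": Node(label="does not buy"),
})

def classify(node, record, rule=()):
    """Traverse root-to-leaf, collecting the production rule along the way."""
    if node.label is not None:  # terminal node: no condition tested
        return node.label, " AND ".join(rule) or "TRUE"
    attr, thr = node.test
    branch = "yes" if record[attr] > thr else "no"
    cond = f"{attr} > {thr}" if branch == "yes" else f"{attr} <= {thr}"
    return classify(node.branches[branch], record, rule + (cond,))

label, rule = classify(tree, {"income": 5200})
print(f"IF {rule} THEN {label}")
```

Each root-to-leaf path reads off directly as an IF-THEN production rule, which is why DTs are easy to interpret.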
24. Advantages of DT
• Easy to understand and interpret
• Works for categorical and continuous data
• High performance classification (generally)
• DT can grow to any depth
• On-the-fly prediction
• Pruning a DT is very easy
• Handles missing or null values
25. Advantages contd.
• Can be used to identify outliers
• Production rules can be obtained directly from the built DT
• They are relatively fast compared with other classification models
• DT can be used even when domain experts are absent
• Provide a clear indication of which fields are important for
prediction and classification
26. Disadvantages
•Class-overlap problem (due to the curse of
dimensionality)
•Complex production rules
•A DT can be sub-optimal (for this reason ensemble
methods have been developed)
• Some decision tree algorithms can deal only with binary-valued attributes
28. •Training set: used to derive the classifier
(generally 70–80%)
•Test set: used to measure accuracy
(generally 20–30%)
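A 70/30 split can be sketched with the standard library alone; the data below is a stand-in for a set of labeled instances:

```python
# Hedged sketch of a 70/30 train-test split using only the stdlib.
import random

data = list(range(100))          # stand-in for 100 labeled instances
random.seed(42)                  # fixed seed for a reproducible shuffle
random.shuffle(data)             # shuffle so the split is random

cut = int(len(data) * 0.7)       # 70% boundary
train, test = data[:cut], data[cut:]
print(len(train), len(test))     # 70 30
```

Shuffling before cutting matters: without it, any ordering in the data (e.g. by class) would bias both sets.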
29. Construction phase: the initial decision tree is
constructed in this phase
Q: How to split nodes?
A: Different algorithms take different approaches
Pruning phase: in this stage lower branches
are removed to improve performance
Q: Why?
A: To avoid overfitting/overtraining
30. ID3 (available everywhere)
C4.5 / C5.0 (Weka, SPSS Clementine)
CART (SPSS Clementine)
CHAID (SPSS Clementine, etc.)
Microsoft Decision Trees (MS Analysis Services)
Random Forests (Statistica)
31. ID3 induction algorithm
•ID3 (Iterative Dichotomiser 3)
•Introduced in 1986 by Quinlan
•Designed for classification only
•Works on categorical attributes only
•Uses the entropy measure as its splitting criterion
•Missing value handling is absent
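Entropy-based splitting can be shown briefly: the attribute chosen is the one whose split yields the largest information gain. The labels below are made up, and the split shown is a perfect one:

```python
# Entropy as a splitting criterion: choose the split with the largest
# information gain. Labels below are illustrative.
from math import log2
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

# Class labels before the split, and in the two child nodes after it:
parent = ["yes", "yes", "no", "no"]
left, right = ["yes", "yes"], ["no", "no"]   # a perfect split

# Information gain = parent entropy - weighted child entropies
gain = entropy(parent) - (
    len(left) / len(parent) * entropy(left)
    + len(right) / len(parent) * entropy(right)
)
print(f"information gain = {gain:.2f}")
```

A 50/50 parent has entropy 1 bit; pure children have entropy 0, so this perfect split gains the full bit. (C4.5 normalizes this gain into a gain ratio.)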
32. C4.5 induction algorithm
•Invented by Quinlan in 1993
•Is an extension of ID3 algorithm
•Designed for classification only
•Numerical attributes can be input
•Uses the entropy-based gain ratio as its splitting criterion
•Uses multi-way splits
•Missing value handling is provided
•Tree pruning is also provided
33. Classification and Regression Trees
•Invented by Breiman et al. in 1984
•Uses binary recursive partitioning method
•Designed for both classification and regression
•Works on both categorical & numerical attributes
•Uses the Gini measure as its splitting criterion
•Uses two-way splits
•Missing value handling is provided
•Tree pruning is also provided
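The Gini measure used by CART is simple to state: for class proportions p_i at a node, impurity is 1 − Σ p_i². A short illustration with made-up labels:

```python
# Gini impurity, CART's splitting criterion: 1 - sum(p_i^2).
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels: 0 = pure, higher = more mixed."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

print(gini(["yes", "yes", "no", "no"]))  # 0.5 (maximally mixed, two classes)
print(gini(["yes", "yes", "yes"]))       # 0.0 (pure node)
```

CART evaluates candidate binary splits by the weighted Gini impurity of the children and picks the split that reduces impurity most.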
34. Chi-squared Automatic Interaction Detection
•Invented by Kass in 1980
•Designed for both classification and regression
•Works on both categorical & numerical attributes
•Uses Pearson's chi-squared (χ²) test as its splitting criterion
•Uses multi-way splits
•Missing value handling is provided
•Avoids tree pruning
35. Microsoft Decision Trees
•Introduced by Microsoft in 1999
•Designed for both classification and regression
•Works on both categorical & numerical attributes
•Offers entropy, Bayesian K2, and Bayesian Dirichlet
Equivalent with Uniform prior as choices of
splitting criteria
•Uses multi-way splits and supports binary splitting
•Missing value handling is provided
•Avoids tree pruning
36. Overfitting: An induced tree may overfit the training data
Too many branches, some may reflect anomalies due to noise or outliers
Poor accuracy for unseen samples
37. Two approaches to avoid overfitting
Prepruning: Halt tree construction early—do not split a node if this
would result in the goodness measure falling below a threshold
▪ Difficult to choose an appropriate threshold
Postpruning: Remove branches from a “fully grown” tree—get a
sequence of progressively pruned trees
▪ Use a set of data different from the training data to decide which is
the “best pruned tree”