1. Presented and Contributed by:
Ahmet Selman Bozkır
Hacettepe University Ph.D. Student
November 29, 2011 1
2. What is data mining?
Motivation: Why data mining?
Classification of data mining systems
Architecture: Typical Data Mining System
Data mining functionality
3. Data mining (knowledge discovery from data)
Extraction of interesting (non-trivial, implicit, previously unknown,
and potentially useful) patterns or knowledge from huge amounts of
data
Data mining: a misnomer?
Alternative names
Knowledge discovery (mining) in databases (KDD), knowledge
extraction, data/pattern analysis, data archeology, data
dredging, information harvesting, business intelligence, etc.
Watch out: Is everything “data mining”?
(Deductive) query processing.
Expert systems or small ML/statistical programs
4. Data explosion problem
Automated data collection tools and mature database technology lead
to tremendous amounts of data accumulated and/or to be analyzed in
databases, data warehouses, and other information repositories
We are drowning in data, but starving for knowledge!
Solution: Data warehousing and data mining
Data warehousing and on-line analytical processing
Mining interesting knowledge (rules, regularities, patterns, constraints)
from data in large databases
5. Data analysis and decision support
Market analysis and management
▪ Target marketing, customer relationship management
(CRM), market basket analysis, cross selling, market segmentation
Risk analysis and management
▪ Forecasting, customer retention, improved underwriting, quality
control, competitive analysis
Fraud detection and detection of unusual patterns (outliers)
Other Applications
Text mining (newsgroups, email, documents) and Web mining
Bioinformatics and bio-data analysis
6. Target marketing
Find clusters of “model” customers who share the same characteristics:
interest, income level, spending habits, etc.
Determine customer purchasing patterns over time
Cross-market analysis: find associations/correlations between product
sales, and predict based on such associations
Customer profiling: what types of customers buy what products
(clustering or classification)
Customer requirement analysis
Identify the best products for different groups of customers
Predict what factors will attract new customers
7. Finance planning and asset evaluation
cash flow analysis and prediction; cross-sectional and time series
analysis (financial ratio, trend analysis, etc.)
Resource planning
summarize and compare the resources and spending
Competition
monitor competitors and market directions
group customers into classes and apply a class-based pricing procedure
set pricing strategy in a highly competitive market
8. Approaches: clustering and model construction for fraud detection, outlier analysis
Applications: Health care, retail, credit card service, telecomm.
Auto insurance: ring of collisions
Money laundering: suspicious monetary transactions
Medical insurance
▪ Professional patients, ring of doctors, and ring of references
▪ Unnecessary or correlated screening tests
Telecommunications: phone-call fraud
▪ Phone call model: destination of the call, duration, time of day or week.
Analyze patterns that deviate from an expected norm
Retail industry
▪ Analysts estimate that 38% of retail shrink is due to dishonest employees
Anti-terrorism
9. Data mining: the core of the knowledge discovery process
Databases → data cleaning and data integration → data warehouse →
selection of task-relevant data → data mining → pattern evaluation
10. Increasing potential to support business decisions (bottom to top):
Making decisions (End User)
Data presentation, visualization techniques (Business Analyst)
Data mining, information discovery (Data Analyst)
Data exploration: statistical analysis, querying and reporting
Data warehouses / data marts, OLAP, MDA (DBA)
Data sources: paper, files, information providers, database systems, OLTP
11. Learning the application domain
relevant prior knowledge and goals of application
Creating a target data set: data selection
Data cleaning and preprocessing: (may take 70% of effort!)
Data reduction and transformation
Find useful features, dimensionality/variable reduction, invariant
representation.
Choosing functions of data mining
summarization, classification, regression, association, clustering.
Choosing the mining algorithm(s)
Data mining: search for patterns of interest
Pattern evaluation and knowledge presentation
visualization, transformation, removing redundant patterns, etc.
Use of discovered knowledge
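The steps above can be sketched end to end. This is a toy illustration in plain Python; the records and the stage boundaries are made up for illustration and come from no particular library:

```python
# Toy sketch of a KDD pipeline over a list of records (illustrative data).
raw = [
    {"age": 25, "income": 3000, "buys": "yes"},
    {"age": None, "income": 4500, "buys": "no"},   # record with a missing value
    {"age": 40, "income": 5200, "buys": "yes"},
]

# 1. Data cleaning: drop records with missing values
clean = [r for r in raw if all(v is not None for v in r.values())]

# 2. Data selection / reduction: keep only task-relevant features
target = [r["buys"] for r in clean]
features = [{"income": r["income"]} for r in clean]

# 3. "Mining": a trivial summarization pattern
avg_income = sum(f["income"] for f in features) / len(features)

# 4. Pattern evaluation / presentation
print(f"records kept: {len(clean)}, average income: {avg_income:.0f}")
```

In practice the cleaning and selection stages dominate the effort, as the slide notes.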
12. Architecture of a typical data mining system:
Graphical user interface
Pattern evaluation
Data mining engine
Knowledge base
Database or data warehouse server
Data cleaning, data integration, and filtering
Databases and data warehouse
13. General functionality
Descriptive data mining
Predictive data mining
Different views, different classifications
Kinds of databases to be mined
Kinds of knowledge to be discovered
Kinds of techniques utilized
Kinds of applications adapted
14. Concept description: Characterization and discrimination
Generalize, summarize, and contrast data characteristics, e.g., dry vs.
wet regions
Association (correlation and causality)
Diaper → Beer [support = 0.5%, confidence = 75%]
Classification and Prediction
Construct models (functions) that describe and distinguish classes or
concepts for future prediction
▪ E.g., classify countries based on climate, or classify cars based on gas
mileage
Presentation: decision-tree, classification rule, neural network
Predict some unknown or missing numerical values
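The bracketed figures on an association rule are its support and confidence. A minimal sketch of how both are computed, using a made-up transaction list:

```python
# Compute support and confidence for the rule Diaper -> Beer
# over a toy transaction database (illustrative data).
transactions = [
    {"diaper", "beer", "milk"},
    {"diaper", "beer"},
    {"diaper", "bread"},
    {"milk", "bread"},
]

both = sum(1 for t in transactions if {"diaper", "beer"} <= t)
diaper = sum(1 for t in transactions if "diaper" in t)

support = both / len(transactions)      # P(diaper and beer)
confidence = both / diaper              # P(beer | diaper)
print(f"support = {support:.0%}, confidence = {confidence:.1%}")
```

Support measures how often the itemset appears at all; confidence measures how often the consequent follows given the antecedent.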
November 29, 2011 14
15. Cluster analysis
Class label is unknown: Group data to form new
classes, e.g., cluster houses to find distribution patterns
Maximizing intra-class similarity & minimizing interclass
similarity
Outlier analysis
Outlier: a data object that does not comply with the general
behavior of the data
Noise or exception? No: useful in fraud detection and
rare-event analysis
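The intra-/inter-class criterion can be illustrated in a few lines: points are assigned to the nearer of two assumed centers, then the within-cluster spread (to be minimized) is compared with the between-center separation. The points and centers are invented for illustration:

```python
# Group 1-D points around two fixed centers and measure
# intra-cluster spread vs. inter-cluster separation (illustrative data).
points = [1.0, 1.2, 0.8, 9.0, 9.5, 8.7]
centers = [1.0, 9.0]  # assumed cluster centers for this sketch

clusters = {0: [], 1: []}
for p in points:
    nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
    clusters[nearest].append(p)

# total distance of points to their own center (intra-class)
intra = sum(abs(p - centers[i]) for i, ps in clusters.items() for p in ps)
# distance between the two centers (inter-class)
inter = abs(centers[0] - centers[1])
print(f"intra-cluster distance {intra:.1f}, inter-cluster distance {inter:.1f}")
```

A good clustering keeps the first number small relative to the second.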
16. Data mining: discovering interesting patterns from large amounts of data
A natural evolution of database technology, in great demand, with wide
applications
A KDD process includes data cleaning, data integration, data
selection, transformation, data mining, pattern evaluation, and knowledge
presentation
Mining can be performed in a variety of information repositories
Data mining functionalities: characterization, discrimination, association,
classification, clustering, outlier and trend analysis, etc.
Data mining systems and architectures
Major issues in data mining
19. • A decision tree (DT) is a hierarchical classification
and prediction model
• It is organized as a rooted tree with two types of
nodes: non-terminal (decision) nodes and terminal (leaf) nodes
• It is a supervised data mining model used for
classification or prediction
21. Chance and Terminal Nodes
•Each internal node of a DT is a decision point, where some
condition is tested
•The result of this condition determines which branch of the
tree is taken next
•Thus they are called decision nodes, chance nodes, or
non-terminal nodes
•Chance nodes partition the available data at that point to
maximize differences in the dependent variable
22. Terminal nodes
•The leaf nodes of a DT are called terminal nodes
•They indicate the class into which a data instance will
be classified
•They have just one incoming edge
•They have no child nodes (outgoing edges)
•No conditions are tested at terminal nodes
•Tree traversal from the root to a leaf produces the
production rule for that class
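A minimal sketch of these ideas: decision nodes test a condition, terminal nodes carry a class label, and a root-to-leaf traversal yields the production rule. The tree, attribute names, and threshold below are invented for illustration:

```python
class Node:
    """Decision node if `test` is set; terminal node if `label` is set."""
    def __init__(self, test=None, branches=None, label=None):
        self.test = test          # (attribute, threshold) for decision nodes
        self.branches = branches  # {"yes": Node, "no": Node}
        self.label = label        # class label for terminal nodes

# A tiny hand-built tree: is income > 4000?
tree = Node(test=("income", 4000), branches={
    "yes": Node(label="buys"),
    "no": Node(label="does not buy"),
})

def classify(node, record, rule=()):
    """Traverse root-to-leaf, collecting the production rule along the way."""
    if node.label is not None:  # terminal node: no condition tested
        return node.label, " AND ".join(rule) or "TRUE"
    attr, thr = node.test
    branch = "yes" if record[attr] > thr else "no"
    cond = f"{attr} > {thr}" if branch == "yes" else f"{attr} <= {thr}"
    return classify(node.branches[branch], record, rule + (cond,))

label, rule = classify(tree, {"income": 5200})
print(f"IF {rule} THEN {label}")
```

Each root-to-leaf path reads off directly as an IF-THEN production rule, which is why DTs are easy to interpret.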
24. Advantages of DT
• Easy to understand and interpret
• Works for categorical and continuous data
• High performance classification (generally)
• DT can grow to any depth
• On-the-fly prediction
• Pruning a DT is very easy
• Handles missing or null values
25. Advantages contd.
• Can be used to identify outliers
• Production rules can be obtained directly from the built DT
• They are relatively fast compared with other classification models
• DT can be used even when domain experts are absent
• Provide a clear indication of which fields are important for
prediction and classification
26. Disadvantages
•Class-overlap problem (due to the curse of
dimensionality)
•Complex production rules
•A DT can be sub-optimal (for this reason ensemble
methods have been developed)
• Some decision tree algorithms can deal only with binary-valued attributes
28. •Training set: used to derive the classifier
(generally 70–80%)
•Test set: used to measure accuracy
(generally 20–30%)
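A 70/30 split can be sketched with the standard library alone; the data below is a stand-in for a set of labeled instances:

```python
# Hedged sketch of a 70/30 train-test split using only the stdlib.
import random

data = list(range(100))          # stand-in for 100 labeled instances
random.seed(42)                  # fixed seed for a reproducible shuffle
random.shuffle(data)             # shuffle so the split is random

cut = int(len(data) * 0.7)       # 70% boundary
train, test = data[:cut], data[cut:]
print(len(train), len(test))     # 70 30
```

Shuffling before cutting matters: without it, any ordering in the data (e.g. by class) would bias both sets.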
29. Construction phase: the initial decision tree is
constructed in this phase
Q: How to split nodes?
A: Different algorithms take different approaches
Pruning phase: in this stage lower branches
are removed to improve performance
Q: Why?
A: To avoid overfitting/overtraining
30. ID3 (available everywhere)
C4.5 / C5.0 (Weka, SPSS Clementine)
CART (SPSS Clementine)
CHAID (SPSS Clementine, etc.)
Microsoft Decision Trees (MS Analysis Services)
Random Forests (Statistica)
31. ID3 induction algorithm
•ID3 (Iterative Dichotomiser 3)
•Introduced in 1986 by Quinlan
•Designed for classification only
•Works on categorical attributes only
•Uses the entropy measure as its splitting criterion
•Missing value handling is absent
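Entropy-based splitting can be shown briefly: the attribute chosen is the one whose split yields the largest information gain. The labels below are made up, and the split shown is a perfect one:

```python
# Entropy as a splitting criterion: choose the split with the largest
# information gain. Labels below are illustrative.
from math import log2
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

# Class labels before the split, and in the two child nodes after it:
parent = ["yes", "yes", "no", "no"]
left, right = ["yes", "yes"], ["no", "no"]   # a perfect split

# Information gain = parent entropy - weighted child entropies
gain = entropy(parent) - (
    len(left) / len(parent) * entropy(left)
    + len(right) / len(parent) * entropy(right)
)
print(f"information gain = {gain:.2f}")
```

A 50/50 parent has entropy 1 bit; pure children have entropy 0, so this perfect split gains the full bit. (C4.5 normalizes this gain into a gain ratio.)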
32. C4.5 induction algorithm
•Invented by Quinlan in 1993
•Is an extension of ID3 algorithm
•Designed for classification only
•Numerical attributes can be input
•Uses the entropy-based gain ratio as its splitting criterion
•Uses multi-way splits
•Missing value handling is provided
•Tree pruning is also provided
33. Classification and Regression Trees
•Invented by Breiman et al. in 1984
•Uses binary recursive partitioning method
•Designed for both classification and regression
•Works on both categorical & numerical attributes
•Uses the Gini measure as its splitting criterion
•Uses two-way splits
•Missing value handling is provided
•Tree pruning is also provided
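The Gini measure used by CART is simple to state: for class proportions p_i at a node, impurity is 1 − Σ p_i². A short illustration with made-up labels:

```python
# Gini impurity, CART's splitting criterion: 1 - sum(p_i^2).
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels: 0 = pure, higher = more mixed."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

print(gini(["yes", "yes", "no", "no"]))  # 0.5 (maximally mixed, two classes)
print(gini(["yes", "yes", "yes"]))       # 0.0 (pure node)
```

CART evaluates candidate binary splits by the weighted Gini impurity of the children and picks the split that reduces impurity most.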
34. Chi-squared Automatic Interaction Detection
•Invented by Kass in 1980
•Designed for both classification and regression
•Works on both categorical & numerical attributes
•Uses Pearson's chi-squared (χ²) test as its splitting criterion
•Uses multi-way splits
•Missing value handling is provided
•Avoids tree pruning
35. Microsoft Decision Trees
•Introduced by Microsoft in 1999
•Designed for both classification and regression
•Works on both categorical & numerical attributes
•Offers entropy, Bayesian K2, and Bayesian Dirichlet
Equivalent with Uniform prior as choices of
splitting criteria
•Uses multi-way splits and supports binary splitting
•Missing value handling is provided
•Avoids tree pruning
36. Overfitting: An induced tree may overfit the training data
Too many branches, some may reflect anomalies due to noise or outliers
Poor accuracy for unseen samples
37. Two approaches to avoid overfitting
Prepruning: Halt tree construction early—do not split a node if this
would result in the goodness measure falling below a threshold
▪ Difficult to choose an appropriate threshold
Postpruning: Remove branches from a “fully grown” tree—get a
sequence of progressively pruned trees
▪ Use a set of data different from the training data to decide which is
the “best pruned tree”