What is Data Mining? The Evolution and Process

Agenda
• What Data Mining IS and IS NOT
• Steps in the Data Mining Process
– CRISP-DM
– Explanation of Models
– Examples of Data Mining
Applications
• Questions

The Evolution of Data Analysis

Data Collection (1960s)
– Business question: "What was my total revenue in the last five years?"
– Enabling technologies: computers, tapes, disks
– Product providers: IBM, CDC
– Characteristics: retrospective, static data delivery

Data Access (1980s)
– Business question: "What were unit sales in New England last March?"
– Enabling technologies: relational databases (RDBMS), Structured Query Language (SQL), ODBC
– Product providers: Oracle, Sybase, Informix, IBM, Microsoft
– Characteristics: retrospective, dynamic data delivery at record level

Data Warehousing & Decision Support (1990s)
– Business question: "What were unit sales in New England last March? Drill down to Boston."
– Enabling technologies: on-line analytic processing (OLAP), multidimensional databases, data warehouses
– Product providers: SPSS, Comshare, Arbor, Cognos, Microstrategy, NCR
– Characteristics: retrospective, dynamic data delivery at multiple levels

Data Mining (Emerging Today)
– Business question: "What’s likely to happen to Boston unit sales next month? Why?"
– Enabling technologies: advanced algorithms, multiprocessor computers, massive databases
– Product providers: SPSS/Clementine, Lockheed, IBM, SGI, SAS, NCR, Oracle, numerous startups
– Characteristics: prospective, proactive information delivery

Results of Data Mining
Include:
• Forecasting what may happen in
the future
• Classifying people or things into
groups by recognizing patterns
• Clustering people or things into
groups based on their attributes
• Associating what events are likely
to occur together
• Sequencing what events are likely
to lead to later events

Data mining is not
•Brute-force crunching of bulk
data
•“Blind” application of algorithms
•Going to find relationships
where none exist
•Presenting data in different
ways
•A database-intensive task
•A difficult-to-understand
technology requiring an
advanced degree in computer
science

Data Mining Is
•A hot buzzword for a class of
techniques that find patterns in data
•A user-centric, interactive process
which leverages analysis
technologies and computing power
•A group of techniques that find
relationships that have not
previously been discovered
•Not reliant on an existing database
•A relatively easy task that requires
knowledge of the business problem/
subject matter expertise

Data Mining versus OLAP
•OLAP - On-line Analytical Processing
– Provides you with a very good view of what is happening, but cannot
predict what will happen in the future or why it is happening

Data Mining Versus Statistical Analysis
•Data Mining
  – Originally developed to act as expert systems to solve problems
  – Less interested in the mechanics of the technique
  – If it makes sense then let’s use it
  – Does not require assumptions to be made about data
  – Can find patterns in very large amounts of data
  – Requires understanding of data and business problem
•Data Analysis
  – Tests for statistical correctness of models
    • Are statistical assumptions of models correct?
      – E.g., is the R-Square good?
  – Hypothesis testing
    • Is the relationship significant?
      – Use a t-test to validate significance
  – Tends to rely on sampling
  – Techniques are not optimised for large amounts of data
  – Requires strong statistical skills
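
To make the Data Analysis checks concrete, here is a minimal Python sketch of the two named on this slide: R-Square and a t-test on the relationship. The function name `simple_regression` and the sample numbers are illustrative assumptions, not taken from the slides. Only the t statistic is computed; converting it to a p-value would need a t-distribution table or a statistics library.

```python
import math


def simple_regression(xs, ys):
    """Fit y = intercept + slope * x by least squares and report fit statistics."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    # Sum of squared residuals and total sum of squares.
    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    sst = sum((y - mean_y) ** 2 for y in ys)
    r_squared = 1 - sse / sst

    # Standard error of the slope uses n - 2 degrees of freedom.
    se_slope = math.sqrt(sse / (n - 2) / sxx)
    t_stat = slope / se_slope
    return {"slope": slope, "intercept": intercept,
            "r_squared": r_squared, "t_stat": t_stat}


# Hypothetical sample data, for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
result = simple_regression(xs, ys)
print(result)
```

A large absolute t statistic, compared against the critical value for n - 2 degrees of freedom, is what "the relationship is significant" means in this setting.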

Examples of What People are Doing with Data Mining:
•Fraud/Non-Compliance Anomaly detection
  – Isolate the factors that lead to fraud, waste and abuse
  – Target auditing and investigative efforts more effectively
•Credit/Risk Scoring
•Intrusion detection
•Parts failure prediction
•Recruiting/Attracting customers
•Maximizing profitability (cross selling, identifying profitable customers)
•Service Delivery and Customer Retention
  – Build profiles of customers likely to use which services
•Web Mining

How Can We Do Data
Mining?
By Utilizing the CRISP-DM Methodology
– a standard process
– existing data
– software
technologies
– situational expertise

Why Should There be a Standard Process?
The data mining process must be reliable and repeatable by
people with little data mining background.
•Framework for recording experience
  – Allows projects to be replicated
•Aid to project planning and management
•“Comfort factor” for new adopters
  – Demonstrates maturity of Data Mining
  – Reduces dependency on “stars”

Process
Standardization
CRISP-DM:
• CRoss Industry Standard Process for Data Mining
• Initiative launched Sept. 1996
• SPSS/ISL, NCR, Daimler-Benz, OHRA
• Funding from European Commission
• Over 200 members of the CRISP-DM SIG worldwide
– DM Vendors - SPSS, NCR, IBM, SAS, SGI, Data Distilleries,
Syllogic, Magnify, ..
– System Suppliers / consultants - Cap Gemini, ICL Retail, Deloitte
& Touche, …
– End Users - BT, ABB, Lloyds Bank, AirTouch, Experian, ...

CRISP-DM
•Non-proprietary
•Application/Industry
neutral
•Tool neutral
•Focus on business issues
– As well as technical
analysis
•Framework for guidance
•Experience base
– Templates for
Analysis

Why CRISP-DM?
•The data mining process must be reliable and repeatable by
people with few data mining skills

•CRISP-DM provides a uniform framework for
–guidelines
–experience documentation

•CRISP-DM is flexible to account for differences
–Different business/agency problems
–Different data

Phases and Tasks
(each task is listed with its outputs)

Business Understanding
• Determine Business Objectives: Background; Business Objectives; Business Success Criteria
• Situation Assessment: Inventory of Resources; Requirements, Assumptions, and Constraints; Risks and Contingencies; Terminology; Costs and Benefits
• Determine Data Mining Goal: Data Mining Goals; Data Mining Success Criteria
• Produce Project Plan: Project Plan; Initial Assessment of Tools and Techniques

Data Understanding
• Collect Initial Data: Initial Data Collection Report
• Describe Data: Data Description Report
• Explore Data: Data Exploration Report
• Verify Data Quality: Data Quality Report

Data Preparation
• Data Set: Data Set Description
• Select Data: Rationale for Inclusion / Exclusion
• Clean Data: Data Cleaning Report
• Construct Data: Derived Attributes; Generated Records
• Integrate Data: Merged Data
• Format Data: Reformatted Data

Modeling
• Select Modeling Technique: Modeling Technique; Modeling Assumptions
• Generate Test Design: Test Design
• Build Model: Parameter Settings; Models; Model Description
• Assess Model: Model Assessment; Revised Parameter Settings

Evaluation
• Evaluate Results: Assessment of Data Mining Results w.r.t. Business Success Criteria; Approved Models
• Review Process: Review of Process
• Determine Next Steps: List of Possible Actions; Decision

Deployment
• Plan Deployment: Deployment Plan
• Plan Monitoring and Maintenance: Monitoring and Maintenance Plan
• Produce Final Report: Final Report; Final Presentation
• Review Project: Experience Documentation

Phases in the DM Process:
CRISP-DM

Phases in the DM Process (1 & 2)
•Business Understanding:
  – Statement of Business Objective
  – Statement of Data Mining objective
  – Statement of Success Criteria
•Data Understanding
  – Explore the data and verify the quality
  – Find outliers
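
One common way to find outliers during Data Understanding is the interquartile-range rule. The sketch below is only an illustration under stated assumptions: the function name `iqr_outliers`, the multiplier of 1.5, and the sample values are not from the slides, and other outlier definitions may suit a given business problem better.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, _median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical measurements with one suspicious value.
readings = [12, 14, 15, 15, 16, 17, 18, 19, 20, 95]
print(iqr_outliers(readings))
```

Whether a flagged value is an error or a genuinely interesting case is a judgment that depends on the business understanding from the previous phase.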

Phases in the DM
Process (3)
• Data preparation:
– Usually takes over 90% of our time
• Collection
• Assessment
• Consolidation and Cleaning
– table links, aggregation level,
missing values, etc
• Data selection
– active role in ignoring non-
contributory data?
– outliers?
– Use of samples
– visualization tools
• Transformations - create new
variables
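
As a small illustration of the cleaning and transformation steps listed above, the following Python sketch fills missing values and creates a new variable. The field names (`income`, `debt`, `debt_ratio`), the choice of median imputation, and the sample records are assumptions made for this example only.

```python
import statistics


def prepare(records):
    """Fill missing income with the median and derive a debt-to-income ratio."""
    known = [r["income"] for r in records if r["income"] is not None]
    median_income = statistics.median(known)

    prepared = []
    for r in records:
        missing = r["income"] is None
        income = median_income if missing else r["income"]
        prepared.append({
            **r,
            "income": income,
            # Keep a flag so the imputation stays visible to later phases.
            "income_was_missing": missing,
            # Transformation: a new variable built from two existing ones.
            "debt_ratio": r["debt"] / income,
        })
    return prepared


# Hypothetical customer records; None marks a missing value.
customers = [
    {"id": 1, "income": 30000, "debt": 6000},
    {"id": 2, "income": None, "debt": 10000},
    {"id": 3, "income": 50000, "debt": 5000},
    {"id": 4, "income": 40000, "debt": 0},
]
for row in prepare(customers):
    print(row)
```

The sketch assumes every income is non-zero; real data would need a guard before dividing.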

Phases in the DM Process
(4)
• Model building
– Selection of the modeling
techniques is based upon
the data mining objective
– Modeling is an iterative
process - different for
supervised and
unsupervised learning
• May model for either
description or prediction

Types of Models
•Prediction Models for Predicting and Classifying
  – Regression algorithms (predict numeric outcome): neural networks, rule
    induction, CART (OLS regression, GLM)
  – Classification algorithms (predict symbolic outcome): CHAID, C5.0
    (discriminant analysis, logistic regression)
•Descriptive Models for Grouping and Finding Associations
  – Clustering/Grouping algorithms: K-means, Kohonen
  – Association algorithms: apriori, GRI
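
Of the descriptive techniques named above, K-means is the simplest to show in a few lines. The following is a minimal sketch, assuming Euclidean distance and caller-supplied starting centroids; the function name `kmeans` and the sample points are illustrative, and production tools choose starting centroids more carefully.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, centroids, iterations=10):
    """Alternate assignment and centroid update until nothing changes."""
    centroids = list(centroids)
    labels = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(len(centroids)),
                key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        updated = []
        for c, old in enumerate(centroids):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                updated.append(tuple(sum(dim) / len(members)
                                     for dim in zip(*members)))
            else:
                updated.append(old)  # keep an empty cluster where it was
        if updated == centroids:
            break
        centroids = updated
    return labels, centroids


# Hypothetical two-dimensional cases forming two obvious groups.
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1),
          (8.0, 8.0), (8.2, 7.9), (7.9, 8.3)]
labels, centroids = kmeans(points, [points[0], points[3]])
print(labels)
print(centroids)
```

No target variable is supplied, which is what makes this a descriptive (unsupervised) model rather than a predictive one.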

Neural Network
[Figure: network diagram with an input layer, a hidden layer, and an output]

Neural Networks
• Description
– Difficult interpretation
– Tends to ‘overfit’ the data
– Extensive amount of training time
– A lot of data preparation
– Works with all data types
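
The layer structure named on the Neural Network slide (input layer, hidden layer, output) can be sketched as a single forward pass. This is a minimal illustration, assuming sigmoid activations and hand-picked weights; the function names and all numbers are hypothetical, and no training is shown.

```python
import math


def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def forward(inputs, hidden_weights, hidden_biases, output_weights, output_bias):
    """Propagate inputs through one hidden layer to a single output."""
    hidden = [
        sigmoid(sum(w * x for w, x in zip(weights, inputs)) + bias)
        for weights, bias in zip(hidden_weights, hidden_biases)
    ]
    return sigmoid(sum(w * h for w, h in zip(output_weights, hidden)) + output_bias)


# Hypothetical network: 2 inputs, 2 hidden units, 1 output.
score = forward(
    inputs=[0.5, -1.0],
    hidden_weights=[[0.4, -0.6], [0.3, 0.8]],
    hidden_biases=[0.1, -0.2],
    output_weights=[1.2, -0.7],
    output_bias=0.05,
)
print(score)
```

The weights carry no direct business meaning, which is one reason the slides describe interpretation as difficult.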

Rule Induction
•Description
  – Produces decision trees:
    • income < $40K
      – job > 5 yrs then good risk
      – job < 5 yrs then bad risk
    • income > $40K
      – high debt then bad risk
      – low debt then good risk
  – Or Rule Sets:
    • Rule #1 for good risk:
      – if income > $40K
      – if low debt
    • Rule #2 for good risk:
      – if income < $40K
      – if job > 5 years

[Figure: decision tree for Credit ranking (1=default); each node shows category, % and n]
Root: Bad 52.01% (168), Good 47.99% (155), Total 100.00% (323)
Split on Paid Weekly/Monthly (P-value=0.0000, Chi-square=179.6665, df=1)
  Weekly pay: Bad 86.67% (143), Good 13.33% (22), Total 51.08% (165)
    Split on Age Categorical
      Young (< 25); Middle (25-35): Bad 90.51% (143), Good 9.49% (15), Total 48.92% (158)
      Old (> 35): Bad 0.00% (0), Good 100.00% (7), Total 2.17% (7)
  Monthly salary: Bad 15.82% (25), Good 84.18% (133), Total 48.92% (158)
    Split on Age Categorical
      Young (< 25): Bad 48.98% (24), Good 51.02% (25), Total 15.17% (49)
        Split on Social Class
          Management; Clerical: Bad 0.00% (0), Good 100.00% (8), Total 2.48% (8)
          Professional: Bad 58.54% (24), Good 41.46% (17), Total 12.69% (41)
      Middle (25-35); Old (> 35): Bad 0.92% (1), Good 99.08% (108), Total 33.75% (109)
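
The tree figure on this slide reports Chi-square=179.6665 (df=1) for the weekly/monthly pay split. The sketch below recomputes a split statistic from the node counts shown in the figure (143 bad and 22 good for weekly pay, 25 bad and 133 good for monthly salary). It assumes the likelihood-ratio form of chi-square, which appears to be the form that matches the reported value; the Pearson form gives roughly 162 on the same counts. The function name is illustrative.

```python
import math


def likelihood_ratio_chi_square(table):
    """G-squared statistic: 2 * sum(observed * ln(observed / expected))."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    total = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            if observed > 0:  # empty cells contribute nothing
                expected = row_totals[i] * col_totals[j] / n
                total += observed * math.log(observed / expected)
    return 2 * total


# Rows: weekly pay, monthly salary. Columns: bad, good (counts from the figure).
split = [[143, 22], [25, 133]]
g2 = likelihood_ratio_chi_square(split)
print(round(g2, 2))  # should land close to the 179.6665 on the slide
```

A tree-growing algorithm of this kind would compute such a statistic for every candidate split and keep the most significant one, which is how the pay-frequency field ends up at the top of the tree.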

Rule Induction
Description
• Intuitive output
• Handles all forms of numeric data, as well
as non-numeric (symbolic) data

C5 Algorithm: a special case of rule
induction
• Target variable must be symbolic

Apriori
Description
• Seeks association rules in
dataset
• ‘Market basket’ analysis
• Sequence discovery
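
Association rules in market basket analysis are usually ranked by support and confidence. The sketch below illustrates only those two measures on item pairs; it is a simplified stand-in and not the full Apriori algorithm, which also grows larger itemsets level by level and prunes candidates. The function name, thresholds, and baskets are assumptions for illustration.

```python
from itertools import combinations


def pair_rules(transactions, min_support=0.5, min_confidence=0.7):
    """Return (lhs, rhs, support, confidence) for item pairs passing both thresholds."""
    baskets = [set(t) for t in transactions]
    n = len(baskets)

    def support(items):
        # Fraction of baskets that contain every item in `items`.
        return sum(1 for b in baskets if items <= b) / n

    rules = []
    for a, b in combinations(sorted(set().union(*baskets)), 2):
        pair_support = support({a, b})
        if pair_support < min_support:
            continue  # infrequent pairs cannot yield useful rules
        for lhs, rhs in ((a, b), (b, a)):
            confidence = pair_support / support({lhs})
            if confidence >= min_confidence:
                rules.append((lhs, rhs, pair_support, confidence))
    return rules


# Hypothetical shopping baskets.
baskets = [
    ["bread", "milk"],
    ["bread", "butter"],
    ["bread", "milk", "butter"],
    ["milk"],
]
for lhs, rhs, sup, conf in pair_rules(baskets):
    print(f"{lhs} -> {rhs}: support={sup:.2f}, confidence={conf:.2f}")
```

Here the only surviving rule says that baskets containing butter also contain bread; the reverse rule fails the confidence threshold because bread often appears without butter.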

Kohonen Network
Description
• unsupervised
• seeks to
describe
dataset in
terms of
natural
clusters of
cases

Phases in the DM
Process (5)
• Model Evaluation
– Evaluation of model: how well it
performed on test data
– Methods and criteria depend on
model type:
• e.g., coincidence matrix with
classification models, mean
error rate with regression
models
– Interpretation of model:
important or not, easy or hard
depends on algorithm
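
The two evaluation criteria mentioned above can be sketched in a few lines. This assumes a coincidence matrix means counts of (actual, predicted) pairs, and reads the regression "mean error" as mean absolute error, which is one possible interpretation rather than the slides' stated definition. The function names and test data are illustrative.

```python
from collections import Counter


def coincidence_matrix(actual, predicted):
    """Count how often each (actual, predicted) pair occurs on the test data."""
    return Counter(zip(actual, predicted))


def accuracy(actual, predicted):
    """Share of test cases where the prediction matches the actual class."""
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)


def mean_absolute_error(actual, predicted):
    """Average size of the prediction error for a numeric target."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical held-out test results for a classification model.
actual_class = ["good", "good", "bad", "bad", "good"]
predicted_class = ["good", "bad", "bad", "bad", "good"]
print(coincidence_matrix(actual_class, predicted_class))
print(accuracy(actual_class, predicted_class))

# Hypothetical held-out test results for a regression model.
print(mean_absolute_error([10, 20, 30], [12, 18, 33]))
```

The matrix shows which kind of mistake the model makes, which often matters more to the business than the overall accuracy figure.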

Phases in the DM
Process (6)
•Deployment
– Determine how the results need to be
utilized
– Who needs to use them?
– How often do they need to be used?
•Deploy Data Mining results by:
– Scoring a database
– Utilizing results as business rules
– interactive scoring on-line
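
Scoring a database with results expressed as business rules might look like the following sketch. It encodes the two good-risk rules from the Rule Induction slide and applies them to an in-memory SQLite table. The table name, column names, sample rows, and the treatment of cases the rules do not cover (scored as bad risk) are assumptions for illustration.

```python
import sqlite3


def score_risk(income, debt_level, years_in_job):
    """Apply the two good-risk rules; anything else is scored as bad risk."""
    if income > 40000 and debt_level == "low":   # Rule #1
        return "good"
    if income < 40000 and years_in_job > 5:      # Rule #2
        return "good"
    return "bad"


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers ("
    "id INTEGER PRIMARY KEY, income REAL, debt_level TEXT, "
    "years_in_job REAL, risk TEXT)"
)
# Hypothetical customers to be scored.
conn.executemany(
    "INSERT INTO customers (id, income, debt_level, years_in_job) VALUES (?, ?, ?, ?)",
    [(1, 55000, "low", 2), (2, 55000, "high", 10),
     (3, 30000, "high", 8), (4, 30000, "low", 1)],
)

# Register the rule set as a SQL function and score every row in place.
conn.create_function("score_risk", 3, score_risk)
conn.execute("UPDATE customers SET risk = score_risk(income, debt_level, years_in_job)")

scores = dict(conn.execute("SELECT id, risk FROM customers ORDER BY id"))
print(scores)
```

The same `score_risk` function could be called one record at a time, which corresponds to the interactive on-line scoring option.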

Specific Data Mining
Applications:

What data mining has
done for...
The US Internal Revenue Service
needed to improve customer
service and...

Scheduled its workforce
to provide faster, more accurate
answers to questions.

What data mining has done
for...
The US Drug Enforcement
Administration needed to be more
effective in its drug “busts”
and...

analyzed suspects’ cell phone
usage to focus investigations.

What data mining has done
for...
HSBC needed to cross-sell more
effectively by identifying profiles
that would be interested in higher
yielding investments and...

Reduced direct mail costs by 30%
while garnering 95% of the
campaign’s revenue.

Final Comments
• Data Mining can be utilized in any
organization that needs to find
patterns or relationships in their
data.
• By using the CRISP-DM
methodology, analysts can have a
reasonable level of assurance that
their Data Mining efforts will
render useful, repeatable, and
valid results.
