2. Agenda
• What Data Mining IS and IS NOT
• Steps in the Data Mining Process
– CRISP-DM
– Explanation of Models
– Examples of Data Mining
Applications
• Questions
3. The Evolution of Data Analysis
• Data Collection (1960s)
– Business question: "What was my total revenue in the last five years?"
– Enabling technologies: computers, tapes, disks
– Product providers: IBM, CDC
– Characteristics: retrospective, static data delivery
• Data Access (1980s)
– Business question: "What were unit sales in New England last March?"
– Enabling technologies: relational databases (RDBMS), Structured Query Language (SQL), ODBC
– Product providers: Oracle, Sybase, Informix, IBM, Microsoft
– Characteristics: retrospective, dynamic data delivery at record level
• Data Warehousing & Decision Support (1990s)
– Business question: "What were unit sales in New England last March? Drill down to Boston."
– Enabling technologies: on-line analytic processing (OLAP), multidimensional databases, data warehouses
– Product providers: SPSS, Comshare, Arbor, Cognos, Microstrategy, NCR
– Characteristics: retrospective, dynamic data delivery at multiple levels
• Data Mining (Emerging Today)
– Business question: "What’s likely to happen to Boston unit sales next month? Why?"
– Enabling technologies: advanced algorithms, multiprocessor computers, massive databases
– Product providers: SPSS/Clementine, Lockheed, IBM, SGI, SAS, NCR, Oracle, numerous startups
– Characteristics: prospective, proactive information delivery
4. Results of Data Mining
Include:
• Forecasting what may happen in
the future
• Classifying people or things into
groups by recognizing patterns
• Clustering people or things into
groups based on their attributes
• Associating what events are likely
to occur together
• Sequencing what events are likely
to lead to later events
5. Data mining is not
•Brute-force crunching of bulk data
•“Blind” application of algorithms
•Going to find relationships where none exist
•Presenting data in different ways
•A database-intensive task
•A difficult-to-understand technology requiring an advanced degree in computer science
6. Data Mining Is
•A hot buzzword for a class of
techniques that find patterns in data
•A user-centric, interactive process
which leverages analysis
technologies and computing power
•A group of techniques that find
relationships that have not
previously been discovered
•Not reliant on an existing database
•A relatively easy task that requires
knowledge of the business problem/
subject matter expertise
7. Data Mining versus
OLAP
•OLAP - On-line Analytical Processing
– Provides you with a very good view of what is happening, but cannot predict what will happen in the future or why it is happening
8. Data Mining Versus Statistical
Analysis
•Data Mining
– Originally developed to act as expert systems to solve problems
– Less interested in the mechanics of the technique
– If it makes sense then let’s use it
– Does not require assumptions to be made about data
– Can find patterns in very large amounts of data
– Requires understanding of data and business problem
•Data Analysis
– Tests for statistical correctness of models
  • Are statistical assumptions of models correct?
  • E.g., is the R-Square good?
– Hypothesis testing
  • Is the relationship significant?
  • Use a t-test to validate significance
– Tends to rely on sampling
– Techniques are not optimised for large amounts of data
– Requires strong statistical skills
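To make the "Is the R-Square good?" check concrete, here is a minimal sketch of fitting a straight line by ordinary least squares and computing R-Square by hand. The data points and the function names `fit_line` and `r_squared` are invented for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def r_squared(xs, ys, intercept, slope):
    """Share of the variance in y that the fitted line explains."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


# Toy data, invented for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
intercept, slope = fit_line(xs, ys)
r2 = r_squared(xs, ys, intercept, slope)
print(f"slope={slope:.2f} intercept={intercept:.2f} R^2={r2:.2f}")
```

On this toy data the line explains 60% of the variance; whether that counts as "good" depends on the business problem.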
9. Examples of What People
are Doing with Data Mining:
•Fraud/Non-Compliance Anomaly detection
– Isolate the factors that lead to fraud, waste and abuse
– Target auditing and investigative efforts more effectively
•Credit/Risk Scoring
•Intrusion detection
•Parts failure prediction
•Recruiting/Attracting customers
•Maximizing profitability (cross selling, identifying profitable customers)
•Service Delivery and Customer Retention
– Build profiles of customers likely to use which services
•Web Mining
10. How Can We Do Data
Mining?
By Utilizing the CRISP-DM Methodology
– a standard process
– existing data
– software technologies
– situational expertise
11. Why Should There be a
Standard Process?
The data mining process must be reliable and repeatable by people with little data mining background.
•Framework for recording experience
– Allows projects to be replicated
•Aid to project planning and management
•“Comfort factor” for new adopters
– Demonstrates maturity of Data Mining
– Reduces dependency on “stars”
12. Process
Standardization
CRISP-DM:
• CRoss Industry Standard Process for Data Mining
• Initiative launched Sept. 1996
• SPSS/ISL, NCR, Daimler-Benz, OHRA
• Funding from European Commission
• Over 200 members of the CRISP-DM SIG worldwide
– DM Vendors - SPSS, NCR, IBM, SAS, SGI, Data Distilleries,
Syllogic, Magnify, ..
– System Suppliers / consultants - Cap Gemini, ICL Retail, Deloitte
& Touche, …
– End Users - BT, ABB, Lloyds Bank, AirTouch, Experian, ...
15. Why CRISP-DM?
•The data mining process must be reliable and repeatable by people with few data mining skills
•CRISP-DM provides a uniform framework for
–guidelines
–experience documentation
•CRISP-DM is flexible to account for differences
–Different business/agency problems
–Different data
16. Phases and Tasks
• Business Understanding
– Determine Business Objectives: Background; Business Objectives; Business Success Criteria
– Situation Assessment: Inventory of Resources; Requirements, Assumptions, and Constraints; Risks and Contingencies; Terminology; Costs and Benefits
– Determine Data Mining Goal: Data Mining Goals; Data Mining Success Criteria
– Produce Project Plan: Project Plan; Initial Assessment of Tools and Techniques
• Data Understanding
– Collect Initial Data: Initial Data Collection Report
– Describe Data: Data Description Report
– Explore Data: Data Exploration Report
– Verify Data Quality: Data Quality Report
• Data Preparation
– Data Set: Data Set Description
– Select Data: Rationale for Inclusion / Exclusion
– Clean Data: Data Cleaning Report
– Construct Data: Derived Attributes; Generated Records
– Integrate Data: Merged Data
– Format Data: Reformatted Data
• Modeling
– Select Modeling Technique: Modeling Technique; Modeling Assumptions
– Generate Test Design: Test Design
– Build Model: Parameter Settings; Models; Model Description
– Assess Model: Model Assessment; Revised Parameter Settings
• Evaluation
– Evaluate Results: Assessment of Data Mining Results w.r.t. Business Success Criteria; Approved Models
– Review Process: Review of Process
– Determine Next Steps: List of Possible Actions; Decision
• Deployment
– Plan Deployment: Deployment Plan
– Plan Monitoring and Maintenance: Monitoring and Maintenance Plan
– Produce Final Report: Final Report; Final Presentation
– Review Project: Experience Documentation
18. Phases in the DM
Process (1 & 2)
•Business Understanding:
– Statement of Business Objective
– Statement of Data Mining objective
– Statement of Success Criteria
•Data Understanding
– Explore the data and verify the quality
– Find outliers
19. Phases in the DM
Process (3)
• Data preparation:
– Usually takes over 90% of our time
• Collection
• Assessment
• Consolidation and Cleaning
– table links, aggregation level, missing values, etc.
• Data selection
– active role in ignoring non-contributory data?
– outliers?
– Use of samples
– visualization tools
• Transformations - create new
variables
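As a rough illustration of two of the preparation steps above (handling missing values and creating new variables), here is a minimal sketch in plain Python. The records, the field names, and the choice of median imputation are assumptions made for the example, not something the slides prescribe.

```python
from statistics import median

# Toy customer records, invented for illustration; None marks a missing value.
records = [
    {"id": 1, "income": 52000, "debt": 13000},
    {"id": 2, "income": None, "debt": 4000},
    {"id": 3, "income": 31000, "debt": 9300},
    {"id": 4, "income": 40000, "debt": None},
]


def impute_median(rows, field):
    """Cleaning: replace missing values in `field` with the median of the known ones."""
    known = [r[field] for r in rows if r[field] is not None]
    fill = median(known)
    return [{**r, field: fill if r[field] is None else r[field]} for r in rows]


def add_debt_ratio(rows):
    """Transformation: derive a new variable from two existing ones."""
    return [{**r, "debt_ratio": r["debt"] / r["income"]} for r in rows]


cleaned = impute_median(impute_median(records, "income"), "debt")
prepared = add_debt_ratio(cleaned)
for row in prepared:
    print(row)
```

Median imputation is only one option; dropping incomplete records or modeling the missing values are equally valid choices depending on the data.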
20. Phases in the DM Process
(4)
• Model building
– Selection of the modeling
techniques is based upon
the data mining objective
– Modeling is an iterative
process - different for
supervised and
unsupervised learning
• May model for either
description or prediction
21. Types of Models
•Prediction Models for Predicting and Classifying
– Regression algorithms (predict numeric outcome): neural networks, rule induction, CART (OLS regression, GLM)
– Classification algorithms (predict symbolic outcome): CHAID, C5.0 (discriminant analysis, logistic regression)
•Descriptive Models for Grouping and Finding Associations
– Clustering/Grouping algorithms: K-means, Kohonen
– Association algorithms: apriori, GRI
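To give a feel for what association algorithms measure, here is a minimal sketch that computes support and confidence for rules between single items. It is not the apriori or GRI algorithm itself, only the two measures such algorithms rely on; the baskets, thresholds, and function names are invented for illustration.

```python
from itertools import combinations

# Toy market baskets, invented for illustration.
baskets = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk"},
    {"bread", "butter", "jam"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Estimated P(consequent | antecedent) from the transactions."""
    both = support(antecedent | consequent, transactions)
    return both / support(antecedent, transactions)


def pair_rules(transactions, min_support=0.4, min_confidence=0.7):
    """List rules lhs -> rhs between single items that clear both thresholds."""
    items = sorted(set().union(*transactions))
    rules = []
    for a, b in combinations(items, 2):
        for lhs, rhs in ((a, b), (b, a)):
            s = support({lhs, rhs}, transactions)
            if s < min_support:
                continue
            c = confidence({lhs}, {rhs}, transactions)
            if c >= min_confidence:
                rules.append((lhs, rhs, s, c))
    return rules


rules = pair_rules(baskets)
for lhs, rhs, s, c in rules:
    print(f"{lhs} -> {rhs}: support={s:.2f}, confidence={c:.2f}")
```

With these toy baskets only the bread/butter pair survives both thresholds, in both directions.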
23. Neural Networks
• Description
– Difficult interpretation
– Tends to ‘overfit’ the data
– Extensive amount of training time
– A lot of data preparation
– Works with all data types
24. Rule Induction
•Description
– Produces decision trees:
  • income < $40K
    – job > 5 yrs then good risk
    – job < 5 yrs then bad risk
  • income > $40K
    – high debt then bad risk
    – low debt then good risk
– Or Rule Sets:
  • Rule #1 for good risk:
    – if income > $40K
    – if low debt
  • Rule #2 for good risk:
    – if income < $40K
    – if job > 5 years
[Figure: decision tree for Credit ranking (1=default). Each node lists Bad % (n), Good % (n), and Total (% of sample) n.]
• Root: Bad 52.01 (168), Good 47.99 (155), Total (100.00) 323
• Split on Paid Weekly/Monthly (P-value=0.0000, Chi-square=179.6665, df=1)
  – Weekly pay: Bad 86.67 (143), Good 13.33 (22), Total (51.08) 165
  – Monthly salary: Bad 15.82 (25), Good 84.18 (133), Total (48.92) 158
• Weekly pay, split on Age Categorical (P-value=0.0000, Chi-square=30.1113, df=1)
  – Young (< 25); Middle (25-35): Bad 90.51 (143), Good 9.49 (15), Total (48.92) 158
  – Old (> 35): Bad 0.00 (0), Good 100.00 (7), Total (2.17) 7
• Monthly salary, split on Age Categorical (P-value=0.0000, Chi-square=58.7255, df=1)
  – Young (< 25): Bad 48.98 (24), Good 51.02 (25), Total (15.17) 49
  – Middle (25-35); Old (> 35): Bad 0.92 (1), Good 99.08 (108), Total (33.75) 109
• Monthly salary, Young (< 25), split on Social Class (P-value=0.0016, Chi-square=12.0388, df=1)
  – Management; Clerical: Bad 0.00 (0), Good 100.00 (8), Total (2.48) 8
  – Professional: Bad 58.54 (24), Good 41.46 (17), Total (12.69) 41
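The income/job/debt tree on this slide can be written directly as code, which is one reason rule induction output is considered easy to interpret. The sketch below is illustrative only: the function name `credit_risk` is hypothetical, and because the slide does not say what happens at exactly $40K or exactly 5 years, the boundary handling here is an assumption.

```python
def credit_risk(income, years_in_job, high_debt):
    """Classify an applicant as 'good' or 'bad' risk using the slide's example tree.

    Assumption: income of exactly 40,000 follows the higher-income branch,
    and exactly 5 years in the job counts as 'bad' on the lower-income branch.
    """
    if income < 40_000:
        # Lower-income branch: job tenure decides.
        return "good" if years_in_job > 5 else "bad"
    # Higher-income branch: debt level decides.
    return "bad" if high_debt else "good"


# Hypothetical applicants, one per leaf of the tree.
print(credit_risk(35_000, 7, high_debt=False))  # low income, long tenure
print(credit_risk(35_000, 2, high_debt=False))  # low income, short tenure
print(credit_risk(60_000, 3, high_debt=True))   # high income, high debt
print(credit_risk(60_000, 3, high_debt=False))  # high income, low debt
```

The two "good risk" paths through the function correspond to Rule #1 and Rule #2 in the rule-set form.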
25. Rule Induction
Description
• Intuitive output
• Handles all forms of numeric data, as well
as non-numeric (symbolic) data
C5 Algorithm: a special case of rule induction
• Target variable must be symbolic
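The decision tree figure on slide 24 prints a chi-square value for each split, for example 179.6665 for the Paid Weekly/Monthly split. The sketch below recomputes that statistic from the node counts in the figure (weekly pay: 143 bad, 22 good; monthly salary: 25 bad, 133 good). It assumes, as an inference not stated in the slides, that the tool reports a likelihood-ratio chi-square rather than the Pearson chi-square; the function names are hypothetical.

```python
from math import log


def expected_counts(table):
    """Expected cell counts if rows and columns were independent."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


def likelihood_ratio_chi2(table):
    """G^2 = 2 * sum(O * ln(O / E)); cells with O = 0 contribute nothing."""
    expected = expected_counts(table)
    return 2 * sum(
        o * log(o / e)
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
        if o > 0
    )


def pearson_chi2(table):
    """X^2 = sum((O - E)^2 / E)."""
    expected = expected_counts(table)
    return sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )


# Rows: weekly pay, monthly salary. Columns: bad, good. Counts read from the figure.
paid_split = [[143, 22], [25, 133]]
g2 = likelihood_ratio_chi2(paid_split)
x2 = pearson_chi2(paid_split)
print(f"likelihood-ratio chi-square: {g2:.2f}")
print(f"Pearson chi-square:          {x2:.2f}")
```

The likelihood-ratio value comes out at roughly 179.66, close to the 179.6665 printed in the figure, while the Pearson value is about 162.30. That is what suggests the likelihood-ratio reading; either way, a larger value means the split separates good and bad risks more sharply.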
28. Phases in the DM
Process (5)
• Model Evaluation
– Evaluation of model: how well it
performed on test data
– Methods and criteria depend on
model type:
• e.g., coincidence matrix with
classification models, mean
error rate with regression
models
– Interpretation of model: whether it is important, and whether it is easy or hard, depends on the algorithm
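A minimal sketch of the two kinds of evaluation mentioned above: a coincidence matrix (also called a confusion matrix) for a classification model, and an average error for a regression model. The test labels are invented, and using mean absolute error for the "mean error rate" is an assumption; other error measures would fit the slide equally well.

```python
from collections import Counter


def coincidence_matrix(actual, predicted):
    """Count each (actual, predicted) label pair on held-out test data."""
    return Counter(zip(actual, predicted))


def accuracy(matrix):
    """Share of test records whose predicted label equals the actual label."""
    correct = sum(n for (a, p), n in matrix.items() if a == p)
    return correct / sum(matrix.values())


def mean_absolute_error(actual, predicted):
    """Average size of the prediction error for a numeric outcome."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Toy test-set results, invented for illustration.
actual = ["good", "good", "bad", "bad", "good", "bad"]
predicted = ["good", "bad", "bad", "bad", "good", "good"]
matrix = coincidence_matrix(actual, predicted)

for (a, p), n in sorted(matrix.items()):
    print(f"actual={a:<4} predicted={p:<4} n={n}")
print(f"accuracy={accuracy(matrix):.2f}")
print(f"MAE={mean_absolute_error([10, 20, 30], [12, 18, 33]):.2f}")
```

The off-diagonal cells show which kind of mistake the model makes, which often matters more to the business than the overall accuracy.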
29. Phases in the DM
Process (6)
•Deployment
– Determine how the results need to be
utilized
– Who needs to use them?
– How often do they need to be used?
•Deploy Data Mining results by:
– Scoring a database
– Utilizing results as business rules
– Interactive scoring on-line
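"Scoring a database" can be as simple as applying the model to every record and writing the result back. The sketch below uses an in-memory SQLite database as a stand-in for a production system and reuses the credit-risk example tree from the Rule Induction slide as the model. The table name, column names, and the `score` function are hypothetical.

```python
import sqlite3


def score(income, years_in_job, high_debt):
    """Hypothetical scoring rule based on the credit-risk example tree."""
    if income < 40_000:
        return "good" if years_in_job > 5 else "bad"
    return "bad" if high_debt else "good"


conn = sqlite3.connect(":memory:")  # in-memory stand-in for a real database
conn.execute(
    "CREATE TABLE applicants (id INTEGER PRIMARY KEY, income REAL, "
    "years_in_job REAL, high_debt INTEGER, risk TEXT)"
)
conn.executemany(
    "INSERT INTO applicants (id, income, years_in_job, high_debt) VALUES (?, ?, ?, ?)",
    [(1, 35000, 7, 0), (2, 35000, 2, 0), (3, 60000, 3, 1), (4, 60000, 3, 0)],
)

# Batch scoring: read each row, apply the model, write the score back.
rows = conn.execute(
    "SELECT id, income, years_in_job, high_debt FROM applicants"
).fetchall()
conn.executemany(
    "UPDATE applicants SET risk = ? WHERE id = ?",
    [(score(income, years, bool(debt)), row_id) for row_id, income, years, debt in rows],
)
conn.commit()

scored = dict(conn.execute("SELECT id, risk FROM applicants ORDER BY id"))
print(scored)
```

The same `score` function could sit behind an on-line form for interactive scoring, or be handed to the business as a written rule; the model is the same, only the delivery differs.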
31. What data mining has
done for...
The US Internal Revenue Service
needed to improve customer
service and...
Scheduled its workforce
to provide faster, more accurate
answers to questions.
32. What data mining has done
for...
The US Drug Enforcement
Administration needed to be more
effective in its drug “busts”
and
analyzed suspects’ cell phone
usage to focus investigations.
33. What data mining has done
for...
HSBC needed to cross-sell more
effectively by identifying profiles
that would be interested in higher
yielding investments and...
Reduced direct mail costs by 30%
while garnering 95% of the
campaign’s revenue.
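The trade-off on this slide (mail fewer people, keep most of the revenue) is typically read off a gains calculation: rank customers by model score, mail only the top fraction, and see how much revenue that fraction brings in. The sketch below shows the arithmetic on toy data. The scores and revenues are invented and are not HSBC figures; the function name `revenue_captured` is hypothetical.

```python
def revenue_captured(customers, mail_fraction):
    """Share of total revenue obtained by mailing only the top-scored customers.

    `customers` is a list of (model_score, revenue_if_mailed) pairs.
    """
    ranked = sorted(customers, key=lambda c: c[0], reverse=True)
    n_mailed = round(len(ranked) * mail_fraction)
    total = sum(revenue for _, revenue in ranked)
    captured = sum(revenue for _, revenue in ranked[:n_mailed])
    return captured / total


# Toy data, invented for illustration; these are not HSBC figures.
customers = [
    (0.95, 400), (0.90, 300), (0.80, 250), (0.70, 0), (0.60, 200),
    (0.50, 150), (0.40, 100), (0.30, 0), (0.20, 50), (0.10, 0),
]

# Mailing 70% of the list cuts mail volume (and roughly mail cost) by 30%.
share = revenue_captured(customers, mail_fraction=0.7)
print(f"revenue captured: {share:.1%}")
```

The better the model ranks responders ahead of non-responders, the smaller the fraction that needs to be mailed for a given share of revenue.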
34. Final Comments
• Data Mining can be utilized in any
organization that needs to find
patterns or relationships in their
data.
• By using the CRISP-DM
methodology, analysts can have a
reasonable level of assurance that
their Data Mining efforts will
render useful, repeatable, and
valid results.
The US Internal Revenue Service is using data mining to improve customer service. By analyzing incoming requests for help and information, the IRS hopes to schedule its workforce to provide faster, more accurate answers to questions.
The US DFAS needs to search through 2.5 million financial transactions that may indicate inaccurate charges. Instead of relying on tips to point out fraud, the DFAS is mining the data to identify suspicious transactions. Using Clementine, the agency examined credit card transactions and was able to identify purchases that did not match past patterns. Using this information, DFAS could focus investigations, finding fraud more cost-effectively.
Retail banking is a highly competitive business. In addition to competition from other banks, banks also face intense competition from financial services companies of all kinds, from stockbrokers to mortgage companies. With so many organizations working the same customer base, the value of customer retention is greater than ever before. As a result, HSBC Bank USA looks to entice existing customers to "roll over" maturing products, or to cross-sell them new ones. Using SPSS products, HSBC found that it could reduce direct mail costs by 30% while still bringing in 95% of the campaign’s revenue. Because HSBC is sending out fewer mail pieces, customers are also likely to be more loyal, since they don’t receive junk mail from the bank.