2. Prithwis
Mukerjee 2
Why Suddenly Data Mining ?
Fluid, dynamic business environment
Markets are, or seem to be, saturated
Customers are aggressive and disloyal
Speed is essential
“The quick and the dead”
Availability of data
Vast amounts of data are generated, stored electronically
and are waiting to be processed !
Availability of tools and techniques
Mathematical tools are available to “process” this data
This processing is significantly different from MIS / EDP style of
data processing
Software tools are available to implement some of
these mathematical models
3. Prithwis
Mukerjee 3
The Business Environment
Customer Behaviour
Customers have access to more channels
Newer retail formats
Online stores
Customers have access to more suppliers
Increased commoditisation
Customer loyalty is not assured any more
Market Saturation
Multiple suppliers operating in each market
Niche market based on demographics and preferences
Competition has intensified
Need for speed
Product life cycles are getting shorter
Time to market
4. Prithwis
Mukerjee 4
New way of looking at customer
Customer Relationship Management
Intimacy, collaboration, one-to-one partnerships are
necessary
Need to ask ...
What classes of customers do we have ? Are there
subclasses in terms of behaviour ?
How can we sell more to existing customers ? What
exactly are they buying now ?
Is there a pattern in the way our customers behave ?
Who are my good customers ?
To whom should we sell more ?
Who are my bad customers ?
Who are likely to default or defraud ?
5. Prithwis
Mukerjee 5
Availability of vast amounts of data
ERP and OLTP systems
With their centralised RDBMSs are huge pools of
firmwide data that can overwhelm even the most
dedicated manager
Datawarehouse
Technology has resulted in equally huge pools of
historical data
Storage Capacity
Inexpensive
Ultra high capacity
6. Prithwis
Mukerjee 6
Availability of vast amounts of data
Cards
Credit & Debit Cards
Loyalty Card
Result in capture of huge pools of data
Transactional Data Capture
Point of sales systems, Bar code readers
Capture vast amount of transaction data at increasing
levels of granularity
What was sold ? Product, SKU
When was it sold ? Date, time
How was it sold ? Discount, Promotions
Beyond simple sales
Telephone calls, frequent flyer data
7. Prithwis
Mukerjee 7
Trends leading to Data Flood
More data is
generated:
Web, text, images …
Business transactions,
calls, ...
Scientific data:
astronomy, biology, etc
More data is
captured:
Storage technology
faster and cheaper
DBMS can handle
bigger DB
9. Prithwis
Mukerjee 9
Coming to the point ....
Data mining
Is the process of extracting unknown, valid and
actionable information from large databases and
then using this information to make crucial business
decisions.
Database -> Data Mining Tools -> Data Presentation and Visualisation Tools -> Decisions
10. Prithwis
Mukerjee 10
Knowledge Discovery Definition
Knowledge Discovery in Data is the
non-trivial process of identifying
valid
novel
potentially useful
and ultimately understandable patterns in data.
from Advances in Knowledge Discovery and
Data Mining, Fayyad, Piatetsky-Shapiro, Smyth,
and Uthurusamy, (Chapter 1), AAAI/MIT Press
1996
11. Prithwis
Mukerjee 11
Old Wine in new bottles ?
Are these the same as data mining ?
SQL queries against large databases
Multidimensional database analysis
Online analytical processing
Sophisticated graphic visualisation
“Classical” statistical analysis
ANOVA ? Regression ? Correlation ?
What is missing ?
Discovery of information without a previously
articulated or formulated hypothesis
13. Prithwis
Mukerjee 13
Statistics, Machine Learning, Data
Mining
Statistics:
more theory-based
more focused on testing hypotheses
Machine learning
more heuristic
focused on improving performance of a learning agent
also looks at real-time learning and robotics – areas not part of
data mining
Data Mining and Knowledge Discovery
integrates theory and heuristics
focus on the entire process of knowledge discovery, including
data cleaning, learning, and integration and visualization of
results
Distinctions are fuzzy
15. Prithwis
Mukerjee 15
Some Definitions
Instance (also Item or Record):
an example, described by a number of attributes,
e.g. a day can be described by temperature,
humidity and cloud status
Attribute or Field
measuring aspects of the Instance, e.g. temperature
Class (Label)
grouping of instances, e.g. days good for playing
16. Prithwis
Mukerjee 16
Major Data Mining Tasks
Classification
predicting an item class
Clustering
finding clusters in data
Associations
e.g. A & B & C occur
frequently
Visualization
to facilitate human
discovery
Summarization
describing a group
Deviation Detection
finding changes
Estimation
predicting a continuous
value
Link Analysis
finding relationships
And
So on ...
17. Prithwis
Mukerjee 17
Classification
Learn a method for predicting the instance class
from pre-labeled (classified) instances
Many approaches:
Statistics,
Decision Trees,
Neural Networks,
...
20. Prithwis
Mukerjee 20
Visualization & Data Mining
Visualizing the data to
facilitate human
discovery
Presenting the
discovered results in a
visually "nice" way
21. Prithwis
Mukerjee 21
Summarization
Describe features of the
selected group
Use natural language
and graphics
Usually in Combination
with Deviation detection
or other methods
Average length of stay in this study area rose 45.7 percent,
from 4.3 days to 6.2 days, because ...
22. Prithwis
Mukerjee 22
Data Mining Central Quest
Find true patterns
and avoid overfitting
(finding seemingly significant
but really random patterns due
to searching too many possibilities)
24. Prithwis
Mukerjee 24
Classification
Learn a method for predicting the instance class
from pre-labeled (classified) instances
Many approaches:
Regression,
Decision Trees,
Bayesian,
Neural Networks,
...
Given a set of points from classes
what is the class of new point ?
25. Prithwis
Mukerjee 25
Classification: Linear Regression
Linear Regression
w0 + w1 x + w2 y >= 0
Regression computes
wi from data to
minimize squared error
to ‘fit’ the data
Not flexible enough
26. Prithwis
Mukerjee 26
Regression for Classification
Any regression technique can be used for
classification
Training: perform a regression for each class, setting the
output to 1 for training instances that belong to class, and 0 for
those that don’t
Prediction: predict class corresponding to model with largest
output value (membership value)
For linear regression this is known as multi-
response linear regression
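The training and prediction steps above can be sketched as follows. This is a minimal illustration, not a production method: the toy points, the class names "a" and "b", and the helper names (`solve`, `fit_least_squares`, `train`, `predict`) are all assumptions made for the example, and the least-squares fit uses plain normal equations, which is only one of several ways to minimise squared error.

```python
def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [a - f * c for a, c in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]


def fit_least_squares(X, y):
    """Weights [w0, w1, ...] minimising squared error: (X^T X) w = X^T y."""
    rows = [[1.0] + list(x) for x in X]  # leading 1.0 gives the intercept w0
    d = len(rows[0])
    XtX = [[sum(r[i] * r[j] for r in rows) for j in range(d)] for i in range(d)]
    Xty = [sum(r[i] * t for r, t in zip(rows, y)) for i in range(d)]
    return solve(XtX, Xty)


def train(X, labels):
    """One regression per class: target 1 for members, 0 for everything else."""
    return {
        c: fit_least_squares(X, [1.0 if l == c else 0.0 for l in labels])
        for c in sorted(set(labels))
    }


def predict(models, x):
    """Pick the class whose model gives the largest output (membership value)."""
    row = [1.0] + list(x)
    return max(models, key=lambda c: sum(w * v for w, v in zip(models[c], row)))


# Hypothetical toy data: two well-separated groups of 2-D points.
X = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
labels = ["a", "a", "a", "b", "b", "b"]
models = train(X, labels)
near_a = predict(models, (0.5, 0.5))
near_b = predict(models, (5.5, 5.5))
```

With only two classes the two outputs always sum to 1, so the rule reduces to a single linear boundary of the form w0 + w1 x + w2 y >= 0.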
28. Prithwis
Mukerjee 28
DECISION TREE
An internal node is a test on an attribute.
A branch represents an outcome of the test,
e.g., Color=red.
A leaf node represents a class label or class
label distribution.
At each node, one attribute is chosen to split
training examples into distinct classes as much
as possible
A new instance is classified by following a
matching path to a leaf node.
29. Prithwis
Mukerjee 29
Weather Data: Play or not Play?
Outlook   Temperature  Humidity  Windy  Play?
sunny     hot          high      false  No
sunny     hot          high      true   No
overcast  hot          high      false  Yes
rain      mild         high      false  Yes
rain      cool         normal    false  Yes
rain      cool         normal    true   No
overcast  cool         normal    true   Yes
sunny     mild         high      false  No
sunny     cool         normal    false  Yes
rain      mild         normal    false  Yes
sunny     mild         normal    true   Yes
overcast  mild         high      true   Yes
overcast  hot          normal    false  Yes
rain      mild         high      true   No
Note:
Outlook is the
Forecast,
no relation to
Microsoft
email program
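One way to choose the attribute to split on at the root of a decision tree for this weather data is sketched below. The slides only say that the chosen attribute should separate the classes as much as possible; using information gain (entropy reduction, as in ID3-style trees) is an assumption made here for illustration, and the function names are hypothetical.

```python
from collections import Counter
from math import log2

ATTRIBUTES = ["Outlook", "Temperature", "Humidity", "Windy"]

# The 14 weather instances; the last field is the class label (Play?).
DATA = [
    ("sunny", "hot", "high", "false", "No"),
    ("sunny", "hot", "high", "true", "No"),
    ("overcast", "hot", "high", "false", "Yes"),
    ("rain", "mild", "high", "false", "Yes"),
    ("rain", "cool", "normal", "false", "Yes"),
    ("rain", "cool", "normal", "true", "No"),
    ("overcast", "cool", "normal", "true", "Yes"),
    ("sunny", "mild", "high", "false", "No"),
    ("sunny", "cool", "normal", "false", "Yes"),
    ("rain", "mild", "normal", "false", "Yes"),
    ("sunny", "mild", "normal", "true", "Yes"),
    ("overcast", "mild", "high", "true", "Yes"),
    ("overcast", "hot", "normal", "false", "Yes"),
    ("rain", "mild", "high", "true", "No"),
]


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, index):
    """Entropy of the class before the split minus weighted entropy after it."""
    base = entropy([r[-1] for r in rows])
    remainder = 0.0
    for value in set(r[index] for r in rows):
        subset = [r[-1] for r in rows if r[index] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder


gains = {name: information_gain(DATA, i) for i, name in enumerate(ATTRIBUTES)}
best = max(gains, key=gains.get)  # attribute to test at the root node
```

Under this criterion Outlook gives the largest gain, because every "overcast" day is a "Yes" and that branch becomes a pure leaf immediately.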
33. Prithwis
Mukerjee 33
Direct Marketing Paradigm
Find most likely prospects to contact
Not everybody needs to be contacted
Number of targets is usually much smaller than
number of prospects
Typical Applications
retailers, catalogues, direct mail (and e-mail)
customer acquisition, cross-sell, attrition prediction
...
34. Prithwis
Mukerjee 34
Direct Marketing Evaluation
Accuracy on the entire dataset is not the right
measure
Approach
develop a target model
score all prospects and rank them by decreasing
score
select top P% of prospects for action
How do we decide what is the best subset of
prospects ?
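The score, rank and select steps above can be sketched as follows. The prospect records, scores and response flags are invented for illustration, and "lift" (response rate in the selected group divided by the overall response rate) is one commonly used way to compare subsets; the slides do not prescribe it at this point.

```python
def top_percent(prospects, percent):
    """Rank prospects by decreasing score and keep the top `percent` of them."""
    ranked = sorted(prospects, key=lambda p: p[1], reverse=True)
    k = max(1, len(ranked) * percent // 100)
    return ranked[:k]


def lift(prospects, percent):
    """Response rate among the selected prospects relative to the overall rate."""
    selected = top_percent(prospects, percent)
    rate_selected = sum(p[2] for p in selected) / len(selected)
    rate_overall = sum(p[2] for p in prospects) / len(prospects)
    return rate_selected / rate_overall


# Hypothetical (id, model score, responded?) records for ten prospects.
prospects = [
    ("p1", 0.95, 1), ("p2", 0.90, 1), ("p3", 0.80, 0), ("p4", 0.70, 1),
    ("p5", 0.60, 0), ("p6", 0.50, 0), ("p7", 0.40, 0), ("p8", 0.30, 1),
    ("p9", 0.20, 0), ("p10", 0.10, 0),
]

lift_top_20 = lift(prospects, 20)    # both selected prospects responded
lift_all = lift(prospects, 100)      # contacting everyone gives no lift
```

A lift well above 1 for a small P means the model concentrates responders at the top of the ranking, which is what matters when only a fraction of prospects can be contacted.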
37. Prithwis
Mukerjee 37
Data Mining Applications
Science: Chemistry,
Physics, Medicine
Biochemical analysis
Remote sensors on a
satellite
Telescopes – star
galaxy classification
Medical Image
analysis
Bioscience
Sequence-based
analysis
Protein structure and
function prediction
Protein family
classification
Microarray gene
expression
38. Prithwis
Mukerjee 38
Microarrays: Classifying Leukemia
Leukemia: Acute Lymphoblastic (ALL) vs Acute
Myeloid (AML), Golub et al, Science, v.286,
1999
72 examples (38 train, 34 test), about 7,000 genes
ALL AML
Visually similar, but genetically very different
Best Model: 97% accuracy,
1 error (sample suspected mislabelled)
39. Prithwis
Mukerjee 39
Microarray Potential Applications
New and better molecular diagnostics
Jan 11, 2005: FDA approved Roche Diagnostic AmpliChip,
based on Affymetrix technology
New molecular targets for therapy
few new drugs, large pipeline, …
Improved treatment outcome
Partially depends on genetic signature
Fundamental Biological Discovery
finding and refining biological pathways
Personalized medicine ?!
40. Prithwis
Mukerjee 40
Data Mining Applications
Pharmaceutical
companies,
Insurance and Health
care, Medicine
Drug development
Identify successful
medical therapies
Claims analysis,
fraudulent behavior
Medical diagnostic
tools
Predict office visits
Financial Industry,
Banks, Businesses, E-
commerce
Stock and investment
analysis
Identify loyal customers
vs. risky customer
Predict customer
spending
Risk management
Sales forecasting
42. Prithwis
Mukerjee 42
Application: Direct Marketing and CRM
Most major direct marketing companies are
using modeling and data mining
Most financial companies are using customer
modeling
Modeling is easier than changing customer
behaviour
Example
Verizon Wireless reduced customer attrition rate from
2% to 1.5%, saving many millions of $
43. Prithwis
Mukerjee 43
Application: Security and Fraud
Detection
Credit Card Fraud Detection
over 20 Million credit cards protected by Neural
networks (Fair, Isaac)
Securities Fraud Detection
NASDAQ KDD system
Phone fraud detection
AT&T, Bell Atlantic, British Telecom/MCI
44. Prithwis
Mukerjee 44
Fraud Detection and Management (1)
Applications
widely used in health care, retail, credit card services,
telecommunications (phone card fraud), etc.
Approach
use historical data to build models of fraudulent
behavior and use data mining to help identify similar
instances
Examples
auto insurance: detect a group of people who stage
accidents to collect on insurance
money laundering: detect suspicious money transactions (US
Treasury's Financial Crimes Enforcement Network)
medical insurance: detect professional patients, rings of
doctors and rings of references
45. Prithwis
Mukerjee 45
Fraud Detection and Management (2)
Detecting inappropriate medical treatment
Australian Health Insurance Commission identified
that in many cases blanket screening tests were
requested (saving Australian $1m/yr).
Detecting telephone fraud
Telephone call model: destination of the call, duration,
time of day or week. Analyze patterns that deviate
from an expected norm.
British Telecom identified discrete groups of callers
with frequent intra-group calls, especially mobile
phones, and broke a multimillion dollar fraud.
Retail
Analysts estimate that 38% of retail shrink is due to
dishonest employees.
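The telephone-fraud idea above, flagging calls that deviate from an expected norm, can be sketched with a simple z-score on call duration. This is only an assumption-laden illustration: the durations are invented, the threshold of 3 standard deviations is arbitrary, and a real call model would also use destination and time of day as the slide describes.

```python
from statistics import mean, stdev


def unusual_calls(history, new_calls, threshold=3.0):
    """Return new call durations lying more than `threshold` standard
    deviations from the mean of the caller's historical durations."""
    mu = mean(history)
    sigma = stdev(history)
    return [c for c in new_calls if abs(c - mu) / sigma > threshold]


# Hypothetical call durations in minutes for one subscriber.
history = [3, 4, 5, 4, 3, 5, 4, 4, 6, 2]
flagged = unusual_calls(history, [4, 5, 45])
```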
46. Prithwis
Mukerjee 46
Application: e-Commerce
Amazon.com recommendations
if you bought (viewed) X, you are likely to buy Y
Netflix
If you liked "Monty Python and the Holy Grail",
you get a recommendation for "This is Spinal Tap"
Comparison shopping
Froogle, mySimon, Yahoo Shopping, …
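The "if you bought X, you are likely to buy Y" rule can be illustrated by counting co-occurrences in past baskets. This is a toy sketch, not how any named retailer works: the baskets are invented, and the terms support (share of baskets containing the items) and confidence (share of X-buyers who also bought Y) are the usual association-rule measures, assumed here for the example.

```python
def support(itemset, baskets):
    """Fraction of baskets that contain every item in `itemset`."""
    return sum(1 for b in baskets if itemset <= b) / len(baskets)


def confidence(antecedent, consequent, baskets):
    """Among baskets containing `antecedent`, the fraction that also
    contain `consequent`."""
    with_antecedent = [b for b in baskets if antecedent <= b]
    with_both = [b for b in with_antecedent if consequent <= b]
    return len(with_both) / len(with_antecedent)


# Hypothetical purchase histories, one set of items per customer.
baskets = [{"X", "Y"}, {"X", "Y", "Z"}, {"X"}, {"Y", "Z"}, {"X", "Y"}]

support_xy = support({"X", "Y"}, baskets)          # 3 of 5 baskets
confidence_x_to_y = confidence({"X"}, {"Y"}, baskets)  # 3 of 4 X-buyers
```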
47. Prithwis
Mukerjee 47
Example : Processing Loan Applications
Given: questionnaire with financial and
personal information
Problem: should money be lent?
Borderline cases referred to loan officers
But: 50% of accepted borderline cases
defaulted!
Solution:
reject all borderline cases?
Borderline cases are most active customers!
48. Prithwis
Mukerjee 48
Enter Machine Learning
Given:
1000 training examples of borderline cases
20 attributes:
age, years with current employer, years at current
address, years with the bank, years at current job,
other credit cards
Learned rules predicted 2/3 of borderline
cases correctly!
Rules could be used to explain decisions to
customers
49. Prithwis
Mukerjee 49
Case study 2: Screening images
Given:
radar satellite images of coastal waters
Problem:
detecting oil slicks in those images
Oil slicks = dark regions with changing size and
shape
Look-alike dark regions can be caused by
weather conditions (e.g. high wind)
Expensive process requiring highly trained
personnel
50. Prithwis
Mukerjee 50
Enter Machine Learning
Dark regions extracted from normalized image
Attributes:
size of region, shape, area, intensity, sharpness and
jaggedness of boundaries, proximity of other
regions, info about background
Constraints:
Scarcity of training examples (oil slicks are rare!)
Unbalanced data: most dark regions aren’t oil slicks
Regions from same image form a batch
Requirement is adjustable false-alarm rate
51. Prithwis
Mukerjee 51
Data Mining Applications ..
Prediction & Description
Would this customer buy this product ?
Is this customer likely to leave ?
Relationship Marketing
What kind of products have been bought by this
customer ?
What kind of marketing strategy has this customer
responded to ?
Outlier identification and Fraud detection
Locating unusual cases and behaviours
Customer Profiling & Segmentation
Is the bottomline that we are all looking at ...
52. Prithwis
Mukerjee 52
Data Mining Challenges
Computationally expensive to investigate all
possibilities
Dealing with noise/missing information and
errors in data
Choosing appropriate attributes/input
representation
Finding the minimal attribute space
Finding adequate evaluation function(s)
Extracting meaningful information
Not overfitting
53. Prithwis
Mukerjee 53
Are All “Discovered” Patterns Interesting?
Interestingness measures:
A pattern is interesting if
it is easily understood by humans,
valid on new or test data with some degree of certainty,
potentially useful,
novel, or validates some hypothesis that a user seeks to confirm
Objective vs. subjective measures:
Objective: based on statistics and structures of
patterns
support and confidence
Subjective: based on user’s belief in the data
unexpectedness, novelty, actionability, etc.
Completeness - Find all the interesting patterns
Can a data mining system find all the interesting patterns?
Association vs. classification vs. clustering
55. Prithwis
Mukerjee 55
Data Mining, Privacy, and Security
TIA: Terrorism (formerly Total) Information
Awareness Program
TIA program closed by Congress in 2003 because of
privacy concerns
However, in 2006 we learn that NSA is
analyzing US domestic call info to find potential
terrorists
Invasion of Privacy or Needed Intelligence?
56. Prithwis
Mukerjee 56
Criticism of Analytic Approaches to
Threat Detection:
Data Mining will
be ineffective - generate millions of false positives
and invade privacy
First, can data mining be effective?
57. Prithwis
Mukerjee 57
Can Data Mining and Statistics be Effective
for Threat Detection?
Criticism: Databases have 5% errors, so
analyzing 100 million suspects will generate 5
million false positives
Reality: Analytical models correlate many items
of information to reduce false positives.
Example: Identify one biased coin from 1,000.
After one throw of each coin, we cannot tell which one is biased
After 30 throws, one biased coin will stand out with
high probability.
Can identify 19 biased coins out of 100 million with
sufficient number of throws
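The coin example can be checked with a little binomial arithmetic. The slide does not say how biased the coin is or what counts as "standing out", so the head probability of 0.9 and the cut-off of 25 heads out of 30 are assumptions chosen purely for illustration.

```python
from math import comb


def binom_tail(n, k, p):
    """P(at least k heads in n throws) for a coin with head probability p."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


N_COINS = 1000     # from the slide: one biased coin among 1,000
THROWS = 30        # from the slide
BIASED_P = 0.9     # assumption: head probability of the biased coin
THRESHOLD = 25     # assumption: flag a coin showing at least 25 heads

# Chance that a single fair coin is wrongly flagged after 30 throws.
fair_false_alarm = binom_tail(THROWS, THRESHOLD, 0.5)
# Expected number of fair coins (out of 999) that are wrongly flagged.
expected_false_positives = (N_COINS - 1) * fair_false_alarm
# Chance that the biased coin is correctly flagged.
detection_probability = binom_tail(THROWS, THRESHOLD, BIASED_P)
# After a single throw, half of the fair coins look just like the biased one.
false_positives_after_one_throw = (N_COINS - 1) * binom_tail(1, 1, 0.5)
```

Under these assumptions the expected number of false positives drops from about 500 after one throw to well under one after 30 throws, while the biased coin is still flagged most of the time. Combining many observations, not a single noisy one, is what keeps false positives low.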
59. Prithwis
Mukerjee 59
Analytic technology can be effective
Data Mining is just one additional tool to help
analysts
Combining multiple models and link analysis
can reduce false positives
Today there are millions of false positives with
manual analysis
Analytic technology has the potential to reduce
the current high rate of false positives
60. Prithwis
Mukerjee 60
Data Mining with Privacy
Data Mining looks for patterns, not people!
Technical solutions can limit privacy invasion
Replacing sensitive personal data with an anonymous ID
Give randomized outputs
Multi-party computation – distributed data
…
Bayardo & Srikant, Technological Solutions for
Protecting Privacy, IEEE Computer, Sep 2003
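Replacing a sensitive identifier with an opaque ID before mining might look like the sketch below. It is an assumption-based illustration, not the method from the cited paper: it uses a keyed hash (HMAC-SHA256 from the Python standard library), the key and customer numbers are made up, and keyed hashing alone is pseudonymisation rather than a guarantee of anonymity.

```python
import hashlib
import hmac


def pseudonymise(customer_id: str, secret_key: bytes) -> str:
    """Map a real identifier to a stable, opaque ID using a keyed hash.
    The same input always gives the same ID, so patterns can still be mined."""
    digest = hmac.new(secret_key, customer_id.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


# Hypothetical key and customer numbers; the key must be kept out of the
# analysts' hands or the mapping can be recomputed.
KEY = b"example-key-kept-outside-the-warehouse"
id_a = pseudonymise("4111-0000-1234", KEY)
id_a_again = pseudonymise("4111-0000-1234", KEY)
id_b = pseudonymise("4111-0000-5678", KEY)
```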
61. Prithwis
Mukerjee 61
Summary
Data Mining and Knowledge Discovery are
needed to deal with the flood of data
Knowledge Discovery is a process !
Avoid overfitting (finding random patterns by
searching too many possibilities)