KDD: A Definition
• KDD is the automatic extraction of non-obvious,
hidden knowledge from large volumes of data.
10^6 to 10^12 bytes:
we never see the whole data set, so we
put it in the memory of computers
What is the knowledge?
How to represent and use it?
Then run Data Mining algorithms
Why do we need KDD?
Data Overload in: Science, Marketing, Finance, Healthcare, Retail
Some Data Overload Examples:
• Wal-Mart records 20 million transactions per day
• Health care transactions: multi-gigabyte databases
• Mobil Oil: geological data of over 100 terabytes
Data is the most important tool to gain a competitive edge by
providing improved, customized services.
Knowledge Discovery Process
[Figure: KDD pipeline. Labels: Raw Data; Integration; Data Warehouse; Target Data; Transformed Data; Patterns and Rules; Interpretation & Evaluation; Knowledge; Understanding]
Knowledge Discovery in Database
• Knowledge discovery in databases (KDD) is the non-trivial
process of identifying valid, potentially useful and ultimately
understandable patterns in data
[Figure: KDD process flow. Labels: Operational Databases; Clean, Collect, Summarize; Data Warehouse; Data Preparation; Training Data; Data Mining; Model, Patterns; Verification, Evaluation]
Knowledge Discovery Process
Goals
Data Selection, Acquisition & Integration
Data Cleaning
Data Reduction & Projection
Matching the Goals
Exploratory Data Analysis
Data Mining
Interpretation and Testing
Consolidation & Use
Knowledge Discovery Process
STEP – 1: IDENTIFYING THE GOAL
• First step is developing an understanding of
the application domain and the relevant
prior knowledge and identifying the goal of
the KDD process from the customer’s
viewpoint.
Knowledge Discovery Process
STEP – 2: CREATING A TARGET DATA SET
• Selecting a data set, or focusing on a subset
of variables or data samples, on which
discovery is to be performed.
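A minimal sketch of Step 2, assuming the raw data is a list of records: draw a sample of rows, then keep only the variables of interest. The record fields, the function name `make_target_set`, and the fixed random seed are hypothetical choices made for illustration.

```python
import random

# Hypothetical customer records; the field names are illustrative only.
RECORDS = [
    {"cust_id": 1, "age": 34, "region": "North", "total_sales": 120.0},
    {"cust_id": 2, "age": 29, "region": "South", "total_sales": 80.5},
    {"cust_id": 3, "age": 41, "region": "North", "total_sales": 210.0},
    {"cust_id": 4, "age": 52, "region": "South", "total_sales": 55.0},
]


def make_target_set(records, variables, sample_size, seed=0):
    """Sample data records, then keep only the chosen subset of variables."""
    rng = random.Random(seed)  # fixed seed so the sample is repeatable
    sample = rng.sample(records, min(sample_size, len(records)))
    return [{name: record[name] for name in variables} for record in sample]


target = make_target_set(RECORDS, ["age", "total_sales"], sample_size=2)
```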
Knowledge Discovery Process
STEP – 3: DATA CLEANING AND PREPROCESSING
• Basic operations include removing noise if
appropriate, collecting the necessary
information to model or account for noise,
deciding on strategies for handling missing
data fields, and accounting for time-
sequence information and known changes.
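A small sketch of two cleaning operations from Step 3, under stated assumptions: missing fields are represented as `None` and filled with the median of the observed values, and noise is handled by clipping values to a plausible range. Both strategies, and the sample `ages` column, are illustrative choices rather than the method the slides prescribe.

```python
from statistics import median


def impute_missing(values):
    """Replace missing entries (None) with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]


def clip_outliers(values, lower, upper):
    """Limit noisy values to the closed range [lower, upper]."""
    return [min(max(v, lower), upper) for v in values]


# Hypothetical column with two missing fields and one implausible value.
ages = [34, None, 29, 41, None, 250]
imputed = impute_missing(ages)
cleaned = clip_outliers(imputed, lower=0, upper=100)
```

The median is used here because a single extreme value (250) would distort a mean-based fill.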
Knowledge Discovery Process
STEP – 4: DATA REDUCTION AND PROJECTION
• Finding useful features to represent the data
depending on the goal of the task.
• With dimensionality reduction or
transformation methods, the effective
number of variables under consideration can
be reduced, or invariant representations for
the data can be found.
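One simple way to reduce the number of variables is to drop features that barely vary. The sketch below assumes numeric rows and a hand-picked variance threshold; the feature names and data are hypothetical, and this is plain feature selection, not a projection method such as PCA.

```python
from statistics import pvariance


def select_features(rows, names, min_variance):
    """Keep only the columns whose population variance reaches min_variance."""
    columns = list(zip(*rows))
    keep = [i for i, col in enumerate(columns) if pvariance(col) >= min_variance]
    reduced = [tuple(row[i] for i in keep) for row in rows]
    return [names[i] for i in keep], reduced


# Hypothetical data: the second feature is constant and carries no information.
NAMES = ["visits", "store_id", "basket_size"]
ROWS = [
    (1.0, 5.0, 10.0),
    (2.0, 5.0, 12.0),
    (3.0, 5.0, 9.0),
    (4.0, 5.0, 11.0),
]

kept_names, reduced_rows = select_features(ROWS, NAMES, min_variance=0.1)
```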
Knowledge Discovery Process
STEP – 5: MATCHING THE GOALS
• Matching the goals of the KDD process to a
particular data-mining method such as
summarization, classification, regression,
clustering, etc.
Knowledge Discovery Process
STEP – 6: EXPLORATORY ANALYSIS AND
MODEL & HYPOTHESIS SELECTION
• Choosing the data mining algorithms and
selecting methods to be used for searching
for data patterns.
• This process includes deciding which models
and parameters might be appropriate and
matching a particular data-mining method
with the overall criteria of the KDD process.
Knowledge Discovery Process
STEP – 7: DATA MINING
• Searching for patterns of interest in a
particular representational form or a set of
such representations, including classification
rules or trees, regression, and clustering.
• The user can significantly aid the
data-mining method by correctly performing the
preceding steps.
Knowledge Discovery Process
STEP – 8: INTERPRETATION & TESTING
• Interpreting mined patterns, possibly
returning to any of steps 1 through 7 for
further iteration.
• This step can also involve visualization of the
extracted patterns and models or
visualization of the data given the extracted
models.
Knowledge Discovery Process
STEP – 9: KNOWLEDGE PRESENTATION
• Using the knowledge directly, incorporating
the knowledge into another system for
further action, or simply documenting it and
reporting it to interested parties.
• This process also includes checking for and
resolving potential conflicts with previously
believed (or extracted) knowledge.
Data Warehousing
• A platform for online analytical processing (OLAP)
• Warehouses collect transactional data from several
transactional databases and organize them in a fashion
amenable to analysis
• Smaller, subject-specific warehouses are called “data marts”
• A critical component of the decision support system (DSS) of
enterprises
• Some typical DW queries:
– Which item sells best in each region that has retail outlets?
– Which advertising strategy is best for Dubai Markets?
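The first query above ("which item sells best in each region") amounts to a group-by followed by a maximum. A rough sketch in plain Python, assuming sales facts arrive as (region, item, units) tuples; the sample data and the function name `best_item_per_region` are hypothetical.

```python
from collections import defaultdict

# Hypothetical sales facts: (region, item, units sold).
SALES = [
    ("North", "Milk", 120),
    ("North", "Diapers", 90),
    ("North", "Milk", 40),
    ("South", "Diapers", 75),
    ("South", "Milk", 60),
    ("South", "Diapers", 30),
]


def best_item_per_region(rows):
    """Total the units per (region, item), then pick the top item per region."""
    totals = defaultdict(lambda: defaultdict(int))
    for region, item, units in rows:
        totals[region][item] += units
    return {region: max(items, key=items.get) for region, items in totals.items()}


best = best_item_per_region(SALES)
```

A warehouse would normally answer this with an aggregate query over pre-organized data; the dictionary here only stands in for that aggregation.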
Data Warehousing
[Figure: data warehousing flow. Labels: OLTP; Inventory; Data Cleaning; Data Warehouse (OLAP)]
Data Cleaning
• Performs logical transformation of transactional data to suit the data
warehouse
• Model of operations → model of enterprise
• Usually a semi-automatic process
Operational tables:
– Orders: Order_id, Price, Cust_id
– Inventory: Prod_id, Price, Price_change
– Sales: Cust_id, Cust_profit, Total_sales
Data Warehouse: Customers, Products, Orders, Inventory, Price, Time
Primary Tasks of Data Mining
• Classification: finding the description of several predefined
classes and classifying a data item into one of them.
• Regression: mapping a data item to a real-valued
prediction variable.
• Clustering: identifying a finite set of categories or
clusters to describe the data.
• Summarization: finding a compact description
for a subset of data.
• Dependency Modeling: finding a model which describes
significant dependencies between variables.
• Deviation and change detection: discovering the
most significant changes in the data.
Data Mining Algorithm Components
• Model representation
– descriptions of discovered patterns
– overly limited representation: unable to capture data patterns
– overly powerful representation: potential for overfitting
– examples: decision trees, rules, linear/non-linear regression & classification,
nearest neighbor and case-based reasoning methods, graphical
dependency models
• Model evaluation criteria
– how well a pattern (model) meets goals (fit function)
– e.g., accuracy, novelty, etc.
Data Mining Algorithm Components
• Search method
– parameter search: optimization of parameters for a given model
representation
– model search: considers a family of models
Different methods suit different problems. Proper problem formulation
is crucial.
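The two kinds of search can be sketched with a toy one-dimensional classifier. Accuracy serves as the fit function, parameter search tries candidate thresholds for one model form, and model search repeats that over a small family of forms. The model family, thresholds, and labelled examples are all hypothetical.

```python
def accuracy(predict, examples):
    """Fit function: the fraction of examples the model labels correctly."""
    return sum(predict(x) == y for x, y in examples) / len(examples)


# A hypothetical family of two model forms, each with one threshold parameter.
MODELS = {
    "at_least": lambda t: (lambda x: x >= t),
    "at_most": lambda t: (lambda x: x <= t),
}


def parameter_search(make_model, thresholds, examples):
    """Pick the threshold that maximizes accuracy for one model form."""
    best_t = max(thresholds, key=lambda t: accuracy(make_model(t), examples))
    return best_t, accuracy(make_model(best_t), examples)


def model_search(models, thresholds, examples):
    """Run a parameter search per model form and keep the best-scoring form."""
    results = {
        name: parameter_search(make, thresholds, examples)
        for name, make in models.items()
    }
    best_name = max(results, key=lambda name: results[name][1])
    return best_name, results[best_name]


EXAMPLES = [(1, False), (2, False), (3, False), (4, True), (5, True), (6, True)]
THRESHOLDS = [2, 3, 4, 5]

best_form, (best_threshold, best_score) = model_search(MODELS, THRESHOLDS, EXAMPLES)
```

Scoring on the same examples used for the search is only acceptable in a toy like this; held-out data would be needed to detect overfitting.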
Data Mining Techniques
• Descriptive: Clustering, Association, Sequential Analysis
• Predictive: Classification, Regression
– Classification methods: Decision Tree, Rule Induction, Neural Networks,
Nearest Neighbor Classification
Association Rule: Application
• Supermarket Shelf Management
• Goal: to identify items which are bought together (by sufficiently many
customers)
• Approach: process point-of-sale data (collected with barcode scanners)
to find dependencies among items.
• Consider discovered rule:
{Diapers, Milk … } --> {Baby food}
• Example:
– If a customer buys Diapers and Milk, then they are very likely to buy
Baby food.
– so stack baby food next to diapers?
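How strong a rule like this is can be measured with support (how often the items occur together) and confidence (how often the consequent appears when the antecedent does). A minimal sketch, assuming each basket is a set of item names; the baskets below are made up for illustration.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimated P(consequent | antecedent); antecedent must occur at least once."""
    both = support(transactions, set(antecedent) | set(consequent))
    return both / support(transactions, antecedent)


# Hypothetical point-of-sale baskets.
BASKETS = [
    {"Diapers", "Milk", "Baby food"},
    {"Diapers", "Milk", "Baby food", "Bread"},
    {"Diapers", "Milk"},
    {"Milk", "Bread"},
    {"Diapers", "Baby food"},
]

rule_support = support(BASKETS, {"Diapers", "Milk", "Baby food"})
rule_confidence = confidence(BASKETS, {"Diapers", "Milk"}, {"Baby food"})
```

"Bought together by sufficiently many customers" corresponds to a minimum support threshold; "very likely" corresponds to a minimum confidence threshold.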
Sequential Pattern Discovery: Application
• Sequences in which customers purchase goods/services
• Understanding long term customer behavior -- timely
promotions.
• In point-of-sale transaction sequences
– Computer bookstore:
(Intro to Visual C++) (Java & J2EE) --> (Perl for Dummies, PHP in 24 Hrs)
– Athletic Apparel Store:
(Shoes) (Racket, Racket ball) --> (Sports Jacket)
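A sequential pattern holds for a customer when its itemsets appear, in order, inside that customer's sequence of transactions. The sketch below counts the share of customers who match; the customer histories are hypothetical, and the function names are illustrative.

```python
def contains_sequence(customer_seq, pattern):
    """True if each itemset in pattern appears, in order, in customer_seq."""
    pos = 0
    for itemset in pattern:
        # Advance to the next transaction that contains this itemset.
        while pos < len(customer_seq) and not set(itemset) <= set(customer_seq[pos]):
            pos += 1
        if pos == len(customer_seq):
            return False
        pos += 1  # later itemsets must come from later transactions
    return True


def sequence_support(customers, pattern):
    """Fraction of customers whose purchase history contains the pattern."""
    return sum(contains_sequence(c, pattern) for c in customers) / len(customers)


# Hypothetical purchase histories: one list of transactions per customer.
CUSTOMERS = [
    [{"Shoes"}, {"Racket", "Racket ball"}, {"Sports Jacket"}],
    [{"Shoes"}, {"Sports Jacket"}],
    [{"Racket", "Racket ball"}, {"Shoes"}, {"Sports Jacket"}],
    [{"Shoes", "Socks"}, {"Racket", "Racket ball", "Towel"}, {"Water Bottle"}, {"Sports Jacket"}],
]
PATTERN = [{"Shoes"}, {"Racket", "Racket ball"}, {"Sports Jacket"}]

pattern_support = sequence_support(CUSTOMERS, PATTERN)
```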
Clustering (K-Means and Hierarchical): Application
[Figure: four scatter plots on 0 to 10 axes illustrating K-means with K=2]
K-means steps (K=2):
1. Arbitrarily choose K objects as initial cluster centers
2. Assign each of the objects to the most similar center
3. Update the cluster means
4. Reassign the objects and update the cluster means again, repeating until
the assignments stop changing
Hierarchical clustering: Clusters are formed at different levels by
merging clusters at a lower level
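The K-means loop described on this slide can be sketched in a few lines. This version assumes 2-D points, Euclidean distance as the similarity measure, and the first K points as the arbitrary initial centers so that the run is repeatable; the sample points are hypothetical.

```python
import math


def kmeans(points, k, max_iter=100):
    """Plain K-means: assign points to the nearest center, then update the means."""
    centers = [points[i] for i in range(k)]  # arbitrary choice: the first k objects
    assignment = None
    for _ in range(max_iter):
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points
        ]
        if new_assignment == assignment:
            break  # no object changed cluster, so the means are stable
        assignment = new_assignment
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old center if a cluster ends up empty
                centers[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return centers, assignment


# Hypothetical points forming two visible groups.
POINTS = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centers, assignment = kmeans(POINTS, k=2)
```

Different initial centers can lead to different final clusters, which is why implementations often run several restarts.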
Decision Tree Identification: Application
Outlook    Temp      Play?
Sunny      Warm      Yes
Overcast   Chilly    No
Sunny      Chilly    Yes
Cloudy     Pleasant  Yes
Overcast   Pleasant  Yes
Overcast   Chilly    No
Cloudy     Chilly    No
Cloudy     Warm      Yes
[Figure: first split on Outlook. Sunny → Yes; Cloudy → Yes/No; Overcast → Yes/No]
Decision Tree Identification Example
Decision Tree Identification: Application
[Figure: full tree, split on Outlook and then on Temp]
• Sunny → Yes
• Cloudy → Warm: Yes; Chilly: No; Pleasant: Yes
• Overcast → Chilly: No; Pleasant: Yes
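The tree on these slides can be rebuilt from the table with a short recursive split. This sketch assumes the attributes are tested in a fixed order (Outlook, then Temp) as on the slide, rather than chosen by a criterion such as information gain, and it stops splitting as soon as a branch has a single class.

```python
from collections import Counter

# Rows from the table: (Outlook, Temp, Play?).
ROWS = [
    ("Sunny", "Warm", "Yes"),
    ("Overcast", "Chilly", "No"),
    ("Sunny", "Chilly", "Yes"),
    ("Cloudy", "Pleasant", "Yes"),
    ("Overcast", "Pleasant", "Yes"),
    ("Overcast", "Chilly", "No"),
    ("Cloudy", "Chilly", "No"),
    ("Cloudy", "Warm", "Yes"),
]
ATTRS = ["Outlook", "Temp"]  # assumed fixed split order


def build_tree(rows, attr_index=0):
    """Return a class label for pure nodes, else (attribute, {value: subtree})."""
    labels = [row[-1] for row in rows]
    if len(set(labels)) == 1 or attr_index >= len(ATTRS):
        return Counter(labels).most_common(1)[0][0]
    branches = {}
    for value in sorted({row[attr_index] for row in rows}):
        subset = [row for row in rows if row[attr_index] == value]
        branches[value] = build_tree(subset, attr_index + 1)
    return (ATTRS[attr_index], branches)


def predict(tree, example):
    """Follow the branches that match the example until a label is reached."""
    while isinstance(tree, tuple):
        attr, branches = tree
        tree = branches[example[attr]]
    return tree


TREE = build_tree(ROWS)
```

An attribute value never seen in training (for example, Overcast with Warm) raises a `KeyError` in this sketch; a fuller version would fall back to the majority class.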
Major Application Areas for Data
Mining (Classification)
• Advertising
• Bioinformatics
• Customer Relationship Management (CRM)
• Database Marketing
• Fraud Detection
• E-commerce
• Health Care
• Investment/Securities
• Manufacturing, Process Control
• Sports and Entertainment
• Telecommunications
• Web
Major Application Areas for Data
Mining: Marketing
• Direct Marketing:
Most major direct marketing companies are using
modeling and data mining.
• Customer segmentation:
All industries can take advantage of DM to discover
discrete segments in their customer bases by considering
additional variables beyond traditional analysis.
• CRM:
Find other people in similar life stages and determine
which customers are following similar behavior patterns
– Up-sell
– Cross-sell
– Keeping the customers for a longer period of time
E.g., Verizon Wireless reduced its churn rate from 2% to 1.5%
Major Application Areas for Data
Mining: Fraud Detection
• Credit Card Fraud Detection
• Money laundering
– FAIS (US Treasury)
• Securities Fraud
– NASDAQ Sonar system
• Phone fraud
– AT&T, Bell Atlantic, British Telecom/MCI
• Bio-terrorism detection at Salt Lake
Olympics 2002
Major Application Areas for Data
Mining: Retail
• Sales forecasting:
Examining time-based patterns helps retailers make
stocking decisions.
• Database Retailing:
Retailers can develop profiles of customers with
certain behaviors, for example, those who purchase
designer-label clothing or those who attend sales.
• Merchandise planning and allocation:
When retailers add new stores, they can improve
merchandise planning and allocation by examining
patterns in stores with similar demographic
characteristics.
Major Application Areas for Data
Mining: Banking
• Credit Card marketing
By identifying customer segments, card
issuers and acquirers can improve
profitability with more effective acquisition
and retention programs.
• Cardholder pricing and profitability
Card issuers can take advantage of data
mining technology to price their products so
as to maximize profit and minimize loss of
customers.
Major Application Areas for Data
Mining: Telecommunication
• Call detail record analysis:
Telecommunication companies accumulate
detailed call records. By identifying customer
segments with similar use patterns, the
companies can develop attractive pricing and
feature promotions.
• Customer loyalty:
Some customers repeatedly switch providers, or
“churn”, to take advantage of attractive incentives
by competing companies. The companies can use
DM to identify the characteristics of customers
who are likely to remain loyal once they switch,
thus enabling the companies to target their
spending on customers who will produce the most
profit.
Major Application Areas for Data
Mining: Manufacturing
• Manufacturing:
Through choice boards, manufacturers are
beginning to customize products for
customers; therefore they must be able to
predict which features should be bundled to
meet customer demand.
• Warranties:
Manufacturers need to predict the number of
customers who will submit warranty claims
and the average cost of those claims.
Issues and Challenges
• Large data
– Number of variables (features), number of cases (examples)
– Multi gigabyte, terabyte databases
– Efficient algorithms, parallel processing
• High dimensionality
– Large number of features: exponential increase in search space
– Potential for spurious patterns
– Dimensionality reduction
• Overfitting
– Models noise in training data, rather than just the general patterns
• Changing data, missing and noisy data
• Use of domain knowledge
– Utilizing knowledge on complex data relationships, known facts
• Understandability of patterns
Success Stories
• Network intrusion detection using a combination of sequential
rule discovery and classification tree on 4 GB DARPA data
– Won over (manual) knowledge engineering approach
– http://www.cs.columbia.edu/~sal/JAM/PROJECT/ provides
good detailed description of the entire process
• Major US bank: customer attrition prediction
– First segment customers based on financial behavior: found 3
segments
– Build attrition models for each of the 3 segments
– 40-50% of attritions were predicted, a factor of 18 increase
• Targeted credit marketing: major US banks
– Find customer segments based on 13 months credit balances
– Build another response model based on surveys
– Increased response 4 times, to 2%
Amitava Manna
(11DCP007)
Amritanshu Mehra
(11DCP008)
Animesh Ranjan
(11DCP009)

Más contenido relacionado

Similar a finalestkddfinalpresentation-111207021040-phpapp01.pptx

Additional themes of data mining for Msc CS
Additional themes of data mining for Msc CSAdditional themes of data mining for Msc CS
Additional themes of data mining for Msc CS
Thanveen
 
Data mining Basics and complete description onword
Data mining Basics and complete description onwordData mining Basics and complete description onword
Data mining Basics and complete description onword
Sulman Ahmed
 

Similar a finalestkddfinalpresentation-111207021040-phpapp01.pptx (20)

An Introduction to Advanced analytics and data mining
An Introduction to Advanced analytics and data miningAn Introduction to Advanced analytics and data mining
An Introduction to Advanced analytics and data mining
 
Understanding the Lifecycle of a Data Analysis Project
Understanding the Lifecycle of a Data Analysis ProjectUnderstanding the Lifecycle of a Data Analysis Project
Understanding the Lifecycle of a Data Analysis Project
 
Data Mining and Data Warehouse
Data Mining and Data WarehouseData Mining and Data Warehouse
Data Mining and Data Warehouse
 
Business Intelligence Data Warehouse System
Business Intelligence Data Warehouse SystemBusiness Intelligence Data Warehouse System
Business Intelligence Data Warehouse System
 
Additional themes of data mining for Msc CS
Additional themes of data mining for Msc CSAdditional themes of data mining for Msc CS
Additional themes of data mining for Msc CS
 
Dma unit 1
Dma unit   1Dma unit   1
Dma unit 1
 
Lecture2 (1).ppt
Lecture2 (1).pptLecture2 (1).ppt
Lecture2 (1).ppt
 
Data mining
Data miningData mining
Data mining
 
dwdm unit 1.ppt
dwdm unit 1.pptdwdm unit 1.ppt
dwdm unit 1.ppt
 
Data Mining- Unit-I PPT (1).ppt
Data Mining- Unit-I PPT (1).pptData Mining- Unit-I PPT (1).ppt
Data Mining- Unit-I PPT (1).ppt
 
Ch~2.pdf
Ch~2.pdfCh~2.pdf
Ch~2.pdf
 
Data mining
Data miningData mining
Data mining
 
Data warehousing
Data warehousingData warehousing
Data warehousing
 
Introduction to Big Data Analytics
Introduction to Big Data AnalyticsIntroduction to Big Data Analytics
Introduction to Big Data Analytics
 
Data ware housing- Introduction to data ware housing
Data ware housing- Introduction to data ware housingData ware housing- Introduction to data ware housing
Data ware housing- Introduction to data ware housing
 
Data mining Basics and complete description onword
Data mining Basics and complete description onwordData mining Basics and complete description onword
Data mining Basics and complete description onword
 
What is OLAP -Data Warehouse Concepts - IT Online Training @ Newyorksys
What is OLAP -Data Warehouse Concepts - IT Online Training @ NewyorksysWhat is OLAP -Data Warehouse Concepts - IT Online Training @ Newyorksys
What is OLAP -Data Warehouse Concepts - IT Online Training @ Newyorksys
 
Data mining
Data miningData mining
Data mining
 
Data mining
Data miningData mining
Data mining
 
presentationofism-complete-1-100227093028-phpapp01.pptx
presentationofism-complete-1-100227093028-phpapp01.pptxpresentationofism-complete-1-100227093028-phpapp01.pptx
presentationofism-complete-1-100227093028-phpapp01.pptx
 

Último

Salient Features of India constitution especially power and functions
Salient Features of India constitution especially power and functionsSalient Features of India constitution especially power and functions
Salient Features of India constitution especially power and functions
KarakKing
 
Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
ZurliaSoop
 
The basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptxThe basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptx
heathfieldcps1
 
Vishram Singh - Textbook of Anatomy Upper Limb and Thorax.. Volume 1 (1).pdf
Vishram Singh - Textbook of Anatomy  Upper Limb and Thorax.. Volume 1 (1).pdfVishram Singh - Textbook of Anatomy  Upper Limb and Thorax.. Volume 1 (1).pdf
Vishram Singh - Textbook of Anatomy Upper Limb and Thorax.. Volume 1 (1).pdf
ssuserdda66b
 

Último (20)

Salient Features of India constitution especially power and functions
Salient Features of India constitution especially power and functionsSalient Features of India constitution especially power and functions
Salient Features of India constitution especially power and functions
 
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
 
Dyslexia AI Workshop for Slideshare.pptx
Dyslexia AI Workshop for Slideshare.pptxDyslexia AI Workshop for Slideshare.pptx
Dyslexia AI Workshop for Slideshare.pptx
 
Understanding Accommodations and Modifications
Understanding  Accommodations and ModificationsUnderstanding  Accommodations and Modifications
Understanding Accommodations and Modifications
 
ICT Role in 21st Century Education & its Challenges.pptx
ICT Role in 21st Century Education & its Challenges.pptxICT Role in 21st Century Education & its Challenges.pptx
ICT Role in 21st Century Education & its Challenges.pptx
 
Holdier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdfHoldier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdf
 
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
 
General Principles of Intellectual Property: Concepts of Intellectual Proper...
General Principles of Intellectual Property: Concepts of Intellectual  Proper...General Principles of Intellectual Property: Concepts of Intellectual  Proper...
General Principles of Intellectual Property: Concepts of Intellectual Proper...
 
Single or Multiple melodic lines structure
Single or Multiple melodic lines structureSingle or Multiple melodic lines structure
Single or Multiple melodic lines structure
 
SKILL OF INTRODUCING THE LESSON MICRO SKILLS.pptx
SKILL OF INTRODUCING THE LESSON MICRO SKILLS.pptxSKILL OF INTRODUCING THE LESSON MICRO SKILLS.pptx
SKILL OF INTRODUCING THE LESSON MICRO SKILLS.pptx
 
Food safety_Challenges food safety laboratories_.pdf
Food safety_Challenges food safety laboratories_.pdfFood safety_Challenges food safety laboratories_.pdf
Food safety_Challenges food safety laboratories_.pdf
 
Accessible Digital Futures project (20/03/2024)
Accessible Digital Futures project (20/03/2024)Accessible Digital Futures project (20/03/2024)
Accessible Digital Futures project (20/03/2024)
 
Unit-V; Pricing (Pharma Marketing Management).pptx
Unit-V; Pricing (Pharma Marketing Management).pptxUnit-V; Pricing (Pharma Marketing Management).pptx
Unit-V; Pricing (Pharma Marketing Management).pptx
 
Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
 
The basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptxThe basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptx
 
Towards a code of practice for AI in AT.pptx
Towards a code of practice for AI in AT.pptxTowards a code of practice for AI in AT.pptx
Towards a code of practice for AI in AT.pptx
 
Sociology 101 Demonstration of Learning Exhibit
Sociology 101 Demonstration of Learning ExhibitSociology 101 Demonstration of Learning Exhibit
Sociology 101 Demonstration of Learning Exhibit
 
Fostering Friendships - Enhancing Social Bonds in the Classroom
Fostering Friendships - Enhancing Social Bonds  in the ClassroomFostering Friendships - Enhancing Social Bonds  in the Classroom
Fostering Friendships - Enhancing Social Bonds in the Classroom
 
Application orientated numerical on hev.ppt
Application orientated numerical on hev.pptApplication orientated numerical on hev.ppt
Application orientated numerical on hev.ppt
 
Vishram Singh - Textbook of Anatomy Upper Limb and Thorax.. Volume 1 (1).pdf
Vishram Singh - Textbook of Anatomy  Upper Limb and Thorax.. Volume 1 (1).pdfVishram Singh - Textbook of Anatomy  Upper Limb and Thorax.. Volume 1 (1).pdf
Vishram Singh - Textbook of Anatomy Upper Limb and Thorax.. Volume 1 (1).pdf
 

finalestkddfinalpresentation-111207021040-phpapp01.pptx

  • 1.
  • 2. KDD: A Definition • KDD is the automatic extraction of non-obvious, hidden knowledge from large volumes of data. 106-1012 bytes: we never see the whole data set, so will put it in the memory of computers What is the knowledge? How to represent and use it? Then run Data Mining algorithms
  • 3. Wal-Mart records 20 millions per day Why do we need KDD ? Data Overload Science Marketing Finance Healthcare Retail Health care transactions: multi-gigabyte databases Mobil Oil: geological data of over 100 terabytes Some Data Overload Examples: Data is the most Important tool to gain a competitive edge by providing improved, customized services.
  • 5. Knowledge Discovery in Database • Knowledge discovery in databases (KDD) is the non-trivial process of identifying valid, potentially useful and ultimately understandable patterns in data Clean, Collect, Summarize Data Warehouse Data Preparation Training Data Data Mining Model Patterns Verification, Evaluation Operational Databases
  • 6. Knowledge Discovery Process Goals Data Selection, Acquisition & Integration Data Cleaning Data Reduction & Projection Matching the Goals Exploratory Data Analysis Data Mining Interpretation and Testing Consolidation & Use
  • 7. Knowledge Discovery Process STEP – 1: IDENTIFYING THE GOAL • First step is developing an understanding of the application domain and the relevant prior knowledge and identifying the goal of the KDD process from the customer’s viewpoint. • Goals • Data Selection, Acquisition & Integration • Data Cleaning •Data reduction and Projection •Matching the goals • Exploratory Data Analysis • Data Mining •Interpretation and Testing • Consolidation & Use
  • 8. Knowledge Discovery Process STEP – 2: CREATING A TARGET DATA SET • Selecting a data set, or focusing on a subset of variables or data samples, on which discovery is to be performed. • Goals • Data Selection, Acquisition & Integration • Data Cleaning •Data reduction and Projection •Matching the goals • Exploratory Data Analysis • Data Mining •Interpretation and Testing • Consolidation & Use
  • 9. Knowledge Discovery Process STEP – 3: DATA CLEANING AND PREPROCESSING • Basic operations include removing noise if appropriate, collecting the necessary information to model or account for noise, deciding on strategies for handling missing data fields, and accounting for time- sequence information and known changes. • Goals • Data Selection, Acquisition & Integration • Data Cleaning •Data reduction and Projection •Matching the goals • Exploratory Data Analysis • Data Mining •Interpretation and Testing • Consolidation & Use
  • 10. Knowledge Discovery Process • Finding useful features to represent the data depending on the goal of the task. • With dimensionality reduction or transformation methods, the effective number of variables under consideration can be reduced, or invariant representations for the data can be found. • Goals • Data Selection, Acquisition & Integration • Data Cleaning •Data reduction and Projection •Matching the goals • Exploratory Data Analysis • Data Mining •Interpretation and Testing • Consolidation & Use STEP – 4: DATAREDUCTION AND PROJECTION
  • 11. Knowledge Discovery Process STEP – 5: MATCHING THE GOALS • Matching the goals of the KDD process to a particular data-mining method such as summarization, classification, regression, clustering, etc. • Goals • Data Selection, Acquisition & Integration • Data Cleaning •Data reduction and Projection •Matching the goals • Exploratory Data Analysis • Data Mining •Interpretation and Testing • Consolidation & Use
  • 12. Knowledge Discovery Process • Choosing the data mining algorithms and selecting methods to be used for searching for data patterns. • This process includes deciding which models and parameters might be appropriate and matching a particular data-mining method with the overall criteria of the KDD process. STEP – 6: EXPLORATORY ANALYSIS AND MODEL & HYPOTHESIS SELECTION • Goals • Data Selection, Acquisition & Integration • Data Cleaning •Data reduction and Projection •Matching the goals • Exploratory Data Analysis • Data Mining •Interpretation and Testing • Consolidation & Use
  • 13. Knowledge Discovery Process • Searching for patterns of interest in a particular representational form or a set of such representations, including classification rules or trees, regression, and clustering. • The user can significantly aid the data- mining method by correctly performing the preceding steps. STEP – 7: DATA MINING • Goals • Data Selection, Acquisition & Integration • Data Cleaning •Data reduction and Projection •Matching the goals • Exploratory Data Analysis • Data Mining •Interpretation and Testing • Consolidation & Use
  • 14. Knowledge Discovery Process • Interpreting mined patterns, possibly returning to any of steps 1 through 7 for further iteration. • This step can also involve visualization of the extracted patterns and models or visualization of the data given the extracted models. STEP – 8: INTERPRETATION & TESTING • Goals • Data Selection, Acquisition & Integration • Data Cleaning •Data reduction and Projection •Matching the goals • Exploratory Data Analysis • Data Mining •Interpretation and Testing • Consolidation & Use
  • 15. Knowledge Discovery Process • Using the knowledge directly, incorporating the knowledge into another system for further action, or simply documenting it and reporting it to interested parties. • This process also includes checking for and resolving potential conflicts with previously believed (or extracted) knowledge. STEP – 9: KNOWLEDGE PRESENTATION • Goals • Data Selection, Acquisition & Integration • Data Cleaning •Data reduction and Projection •Matching the goals • Exploratory Data Analysis • Data Mining • Testing and Verification • Interpretation • Consolidation & Use
  • 16. Data Warehousing • A platform for online analytical processing (OLAP) • Warehouses collect transactional data from several transactional databases and organize them in a fashion amenable to analysis • Also called “data marts” • A critical component of the decision support system (DSS) of enterprises • Some typical DW queries: – Which item sells best in each region that has retail outlets? – Which advertising strategy is best for Dubai Markets?
  • 18. Data Cleaning • Performs logical transformation of transactional data to suit the data warehouse • Model of operations  model of enterprise • Usually a semi-automatic process Orders Order_id Price Cust_id Inventory Prod_id Price Price_change Sales Cust_id Cust_profit Total_sales Data Warehouse Customers Products Orders Inventory Price Time
  • 19. Primary Tasks of Data Mining Deviation and change detection ? Summarization Clustering Regression finding the description of several predefined classes and classify a data item into one of them. Classification maps a data item to a real-valued prediction variable. identifying a finite set of categories or clusters to describe the data. finding a compact description for a subset of data finding a model which describes significant dependencies between variables. Dependency Modeling discovering the most significant changes in the data
  • 20. Data Mining Algorithm Components • Model representation – descriptions of discovered patterns – overly limited representation -- unable to capture data patterns too powerful -- potential for over fit. (decision trees, rules, linear/non-linear regression & classification, nearest neighbor and case-based reasoning methods, graphical dependency models) • Model evaluation criteria – how well a pattern (model) meets goals (fit function) – e.g., accuracy, novelty, etc.
  • 21. Data Mining Algorithm Components • Search method – parameter search: optimization of parameters for a given model representation – model search: considers a family of models Different methods suit different problems. Proper problem formulation crucial.
  • 22. Data Mining Techniques Data Mining Techniques Descriptive Predictive Clustering Association Classification Regression SequentialAnalysis Decision Tree Rule Induction Neural Networks Nearest Neighbor Classification
  • 23. Association Rule: Application • Supermarket Shelf Management • Goal: to identify items which are bought together (by sufficiently many customers) • Approach: process point-of-sale data (collected with barcode scanners) to find dependencies among items. • Consider discovered rule: {Diapers, Milk … } --> {Baby food} • Example: – If a customer buys Diapers and Milk, then he is very likely to buy Baby foods. – so stack baby foods next to diapers?
  • 24. Sequential Pattern Discovery: Application • Sequences in which customers purchase goods/services • Understanding long-term customer behavior enables timely promotions. • In point-of-sale transaction sequences – Computer bookstore: (Intro to Visual C++) (Java & J2EE) --> (Perl for Dummies, PHP in 24 Hrs) – Athletic Apparel Store: (Shoes) (Racket, Racquetball) --> (Sports Jacket)
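Counting how many customers follow a given purchase sequence can be sketched as an ordered subset test. The customer histories below are made up, and the helper names `contains_sequence` and `sequence_support` are hypothetical; real sequential-pattern miners also generate candidate patterns, which this sketch does not attempt.

```python
def contains_sequence(history, pattern):
    """True if the itemsets in `pattern` occur in `history` in order.

    Each step must be a subset of a visit strictly later than the visit
    matched by the previous step; unrelated visits in between are allowed.
    """
    visits = iter(history)  # shared iterator, so matching only moves forward
    return all(any(step <= visit for visit in visits) for step in pattern)

def sequence_support(histories, pattern):
    """Fraction of customer histories that contain the pattern."""
    return sum(contains_sequence(h, pattern) for h in histories) / len(histories)

# Hypothetical histories: one set of purchased items per store visit.
histories = [
    [{"shoes"}, {"racket", "racquetball"}, {"sports jacket"}],
    [{"shoes", "socks"}, {"water bottle"}, {"racket", "racquetball"}, {"sports jacket"}],
    [{"racket", "racquetball"}, {"shoes"}],   # same items, wrong order
    [{"shoes"}, {"sports jacket"}],           # middle step missing
]
pattern = [{"shoes"}, {"racket", "racquetball"}, {"sports jacket"}]
print(sequence_support(histories, pattern))
```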
  • 25. Hierarchical Clustering (K-Means): Application [Figure: scatter plots on 0–10 axes illustrating K-means with K=2] • Arbitrarily choose K objects as initial cluster centers • Assign each object to the most similar center • Update the cluster means • Reassign objects and update the cluster means again, repeating as needed • Hierarchical clustering: clusters are formed at different levels by merging clusters at a lower level
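The K-means steps on this slide (choose centers, assign, update means, reassign) can be sketched in plain Python. This covers only the K-means part, not hierarchical merging. Two details are assumptions made for the sketch: the first K points serve as the "arbitrary" initial centers so the run is reproducible, and similarity is squared Euclidean distance. The sample points are invented.

```python
def kmeans(points, k, max_iter=100):
    """Plain K-means on 2-D points; returns (centers, labels)."""
    centers = list(points[:k])  # assumption: first k points as initial centers
    labels = None
    for _ in range(max_iter):
        # Assign each point to its nearest center (squared Euclidean distance).
        new_labels = [
            min(range(k),
                key=lambda j: (x - centers[j][0]) ** 2 + (y - centers[j][1]) ** 2)
            for x, y in points
        ]
        if new_labels == labels:  # assignments are stable, so stop
            break
        labels = new_labels
        # Update each center to the mean of the points assigned to it.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous center
                centers[j] = (
                    sum(p[0] for p in members) / len(members),
                    sum(p[1] for p in members) / len(members),
                )
    return centers, labels

# Hypothetical points on the slide's 0-10 scale, forming two loose groups.
points = [(1, 1), (9, 9), (2, 1), (1, 2), (8, 9), (9, 8)]
centers, labels = kmeans(points, k=2)
print(labels, centers)
```

Different initial centers can lead to different final clusters, which is why practical implementations usually restart from several initializations.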
  • 26. Decision Tree Identification: Application • Training data (Outlook, Temp, Play?): (Sunny, Warm, Yes); (Overcast, Chilly, No); (Sunny, Chilly, Yes); (Cloudy, Pleasant, Yes); (Overcast, Pleasant, Yes); (Overcast, Chilly, No); (Cloudy, Chilly, No); (Cloudy, Warm, Yes) • First split on Outlook: Sunny → Yes; Cloudy → Yes/No; Overcast → Yes/No [Figure: Decision Tree Identification Example]
  • 27. Decision Tree Identification: Application • Root (Yes/No) splits on Outlook: Sunny → Yes; Cloudy (Yes/No) splits on Temp: Warm → Yes, Chilly → No, Pleasant → Yes; Overcast (Yes/No) splits on Temp: Chilly → No, Pleasant → Yes
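The weather example from the two decision tree slides can be checked mechanically. The sketch below encodes the eight training rows and a tree that splits on Outlook first and then on Temp, and confirms that the tree reproduces every training label. The nested-dictionary encoding and the `classify` helper are assumptions made for illustration; no tree-induction algorithm is shown.

```python
# Training rows from the slide: (Outlook, Temp, Play?)
rows = [
    ("Sunny", "Warm", "Yes"),
    ("Overcast", "Chilly", "No"),
    ("Sunny", "Chilly", "Yes"),
    ("Cloudy", "Pleasant", "Yes"),
    ("Overcast", "Pleasant", "Yes"),
    ("Overcast", "Chilly", "No"),
    ("Cloudy", "Chilly", "No"),
    ("Cloudy", "Warm", "Yes"),
]

# Split on Outlook first; where labels are still mixed, split on Temp.
tree = {
    "Sunny": "Yes",
    "Cloudy": {"Warm": "Yes", "Chilly": "No", "Pleasant": "Yes"},
    "Overcast": {"Chilly": "No", "Pleasant": "Yes"},
}

def classify(tree, outlook, temp):
    """Follow the Outlook branch, then the Temp branch if one exists."""
    branch = tree[outlook]
    if isinstance(branch, dict):
        return branch[temp]  # KeyError for unseen combinations, e.g. Overcast/Warm
    return branch

predictions = [classify(tree, outlook, temp) for outlook, temp, _ in rows]
print(predictions == [label for _, _, label in rows])
```

The tree has no branch for (Overcast, Warm) because that combination never appears in the training data, a small reminder that a tree only covers the cases it was grown from.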
  • 28. Major Application Areas for Data Mining (Classification) • Advertising • Bioinformatics • Customer Relationship Management (CRM) • Database Marketing • Fraud Detection • E-commerce • Health Care • Investment/Securities • Manufacturing, Process Control • Sports and Entertainment • Telecommunications • Web
  • 29. Major Application Areas for Data Mining: Marketing • Direct Marketing: Most major direct marketing companies are using modeling and data mining. • Customer segmentation: All industries can take advantage of DM to discover discrete segments in their customer bases by considering additional variables beyond traditional analysis. • CRM: Find other people in similar life stages and determine which customers are following similar behavior patterns – Up-sell – Cross-sell – Keeping the customers for a longer period of time • For example, Verizon Wireless reduced its churn rate from 2% to 1.5%
  • 30. Major Application Areas for Data Mining: Fraud Detection • Credit Card Fraud Detection • Money laundering – FAIS (US Treasury) • Securities Fraud – NASDAQ Sonar system • Phone fraud – AT&T, Bell Atlantic, British Telecom/MCI • Bio-terrorism detection at Salt Lake Olympics 2002
  • 31. Major Application Areas for Data Mining: Retail • Sales forecasting: Examining time-based patterns helps retailers make stocking decisions. • Database Retailing: Retailers can develop profiles of customers with certain behaviors, for example, those who purchase designer-label clothing or those who attend sales. • Merchandise planning and allocation: When retailers add new stores, they can improve merchandise planning and allocation by examining patterns in stores with similar demographic characteristics.
  • 32. Major Application Areas for Data Mining: Banking • Credit card marketing: By identifying customer segments, card issuers and acquirers can improve profitability with more effective acquisition and retention programs. • Cardholder pricing and profitability: Card issuers can take advantage of data mining technology to price their products so as to maximize profit and minimize loss of customers.
  • 33. Major Application Areas for Data Mining: Telecommunication • Call detail record analysis: Telecommunication companies accumulate detailed call records. By identifying customer segments with similar use patterns, the companies can develop attractive pricing and feature promotions. • Customer loyalty: Some customers repeatedly switch providers, or “churn”, to take advantage of attractive incentives by competing companies. The companies can use DM to identify the characteristics of customers who are likely to remain loyal once they switch, thus enabling the companies to target their spending on customers who will produce the most profit.
  • 34. Major Application Areas for Data Mining: Manufacturing • Manufacturing: Through choice boards, manufacturers are beginning to customize products for customers; therefore they must be able to predict which features should be bundled to meet customer demand. • Warranties: Manufacturers need to predict the number of customers who will submit warranty claims and the average cost of those claims.
  • 35. Issues and Challenges • Large data – Number of variables (features), number of cases (examples) – Multi-gigabyte, terabyte databases – Efficient algorithms, parallel processing • High dimensionality – Large number of features: exponential increase in search space – Potential for spurious patterns – Dimensionality reduction • Overfitting – Models noise in training data, rather than just the general patterns • Changing data, missing and noisy data • Use of domain knowledge – Utilizing knowledge on complex data relationships, known facts • Understandability of patterns
  • 36. Success Stories • Network intrusion detection using a combination of sequential rule discovery and classification tree on 4 GB DARPA data – Won over (manual) knowledge engineering approach – http://www.cs.columbia.edu/~sal/JAM/PROJECT/ provides good detailed description of the entire process • Major US bank: customer attrition prediction – First segment customers based on financial behavior: found 3 segments – Build attrition models for each of the 3 segments – 40-50% of attritions were predicted == factor of 18 increase • Targeted credit marketing: major US banks – Find customer segments based on 13 months credit balances – Build another response model based on surveys – Increased response 4 times -- 2%