Data Mining
www.StudsPlanet.com
Agenda
• What is Data Mining?
• Data Mining Tasks
• Challenges in Data Mining

What is Data Mining?
• Data mining is an integral part of knowledge discovery in databases (KDD), the overall process of converting raw data into useful information. This process consists of a series of transformation steps, from data preprocessing to postprocessing of data mining results.

Process of Knowledge Discovery in Databases (KDD)
[Figure: Input Data --> Data Preprocessing --> Data Mining --> Postprocessing --> Information]
• Data Preprocessing: normalization, data subsetting
• Postprocessing: filtering patterns, visualization, pattern interpretation

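The two preprocessing steps named on the KDD slide can be sketched as follows. This is only a minimal illustration: the slide does not say which normalization is meant, so min-max scaling is an assumption, and the helper names and sample values are hypothetical.

```python
def min_max_normalize(values):
    """Rescale a list of numbers to the range [0, 1] (assumed normalization)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical, so there is no spread to rescale.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def subset(records, predicate):
    """Data subsetting: keep only the records that satisfy the predicate."""
    return [r for r in records if predicate(r)]


# Hypothetical input data: yearly incomes of four customers.
incomes = [20000, 35000, 50000, 80000]

normalized = min_max_normalize(incomes)
high_earners = subset(incomes, lambda income: income >= 50000)

print(normalized)
print(high_earners)
```

The output of steps like these is what the data mining step consumes; postprocessing then filters and interprets the patterns it finds.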
Data Mining Tasks
• Data mining tasks are generally divided into two categories:
1. Predictive tasks
2. Descriptive tasks

Predictive Tasks
• Objective: Predict the value of a specific attribute (the target or dependent variable) based on the values of other attributes (the explanatory variables).
Example: Judging whether a patient has a specific disease based on the results of his or her medical tests.

Descriptive Tasks
• Objective: Derive patterns (correlations, trends, trajectories) that summarize the underlying relationships in the data.
Example: Identifying web pages that are accessed together (a human-interpretable pattern).

Data Mining Tasks [contd.]
• Classification [Predictive]
• Clustering [Descriptive]
• Association Rule Discovery [Descriptive]
• Sequential Pattern Discovery [Descriptive]
• Regression [Predictive]
• Deviation Detection [Predictive]

Classification: Definition
• Given a collection of records:
  • Each record contains a set of attributes, one of which is the class.
• Find a model for the class attribute as a function of the values of the other attributes.
• Goal: Previously unseen records should be assigned a class as accurately as possible.
  • A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.

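The train/test procedure described above can be sketched as follows. This is an illustration under stated assumptions, not the method from the slides: the 1-nearest-neighbour classifier, the 30% test fraction, the function names and the toy records are all hypothetical choices.

```python
import random


def train_test_split(records, test_fraction=0.3, seed=0):
    """Shuffle the records and split them into (training set, test set)."""
    rng = random.Random(seed)
    shuffled = records[:]
    rng.shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def predict_1nn(train, features):
    """Assign the class of the closest training record (1-nearest neighbour)."""
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    nearest = min(train, key=lambda record: squared_distance(record[0], features))
    return nearest[1]


def accuracy(train, test):
    """Fraction of test records whose predicted class matches the true class."""
    correct = sum(1 for features, label in test
                  if predict_1nn(train, features) == label)
    return correct / len(test)


# Hypothetical records: (attributes, class). The class is the attribute to predict.
records = [
    ((1.0, 1.1), "no"), ((0.9, 1.0), "no"), ((1.2, 0.8), "no"),
    ((1.1, 1.2), "no"), ((0.8, 0.9), "no"),
    ((5.0, 5.2), "yes"), ((5.1, 4.9), "yes"), ((4.8, 5.0), "yes"),
    ((5.2, 5.1), "yes"), ((4.9, 4.8), "yes"),
]

train, test = train_test_split(records)
acc = accuracy(train, test)
print(len(train), len(test), acc)
```

The model is built only from `train`; `test` is held back so that the reported accuracy reflects records the model has not seen.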
Classification: Example
• Direct Marketing
  • Goal: Reduce the cost of mailing by targeting the set of consumers likely to buy a new product.
  • Approach:
    • Use the data for a similar product introduced before.
    • We know which customers decided to buy and which decided otherwise. This {buy, don't buy} decision forms the class attribute.
    • Collect various demographic, lifestyle, and company-interaction related information about all such customers, such as type of business, where they stay, and how much they earn.
    • Use this information as input attributes to learn a classifier model.
(from Berry & Linoff, 1997)

Clustering: Definition
• Given a set of data points, each having a set of attributes, and a similarity measure among them, find clusters such that:
  • Data points in the same cluster are more similar to one another.
  • Data points in separate clusters are less similar to one another.

Clustering: Example
• Document Clustering:
  • Goal: Find groups of documents that are similar to each other based on the important terms appearing in them.
  • Approach: Identify frequently occurring terms in each document. Form a similarity measure based on the frequencies of different terms. Use it to cluster.
  • Gain: Information retrieval can use the clusters to relate a new document or search term to the clustered documents.

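A similarity measure based on term frequencies, as described in the approach above, might look like the sketch below. Cosine similarity is an assumption here (the slide does not name a specific measure), and the documents and function names are hypothetical.

```python
import math
from collections import Counter


def term_frequencies(text):
    """Count how often each lower-cased, whitespace-separated term occurs."""
    return Counter(text.lower().split())


def cosine_similarity(tf_a, tf_b):
    """Cosine of the angle between two term-frequency vectors (0 = no shared terms)."""
    shared = set(tf_a) & set(tf_b)
    dot = sum(tf_a[t] * tf_b[t] for t in shared)
    norm_a = math.sqrt(sum(c * c for c in tf_a.values()))
    norm_b = math.sqrt(sum(c * c for c in tf_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical documents.
docs = {
    "finance_1": "stocks fell as markets reacted to interest rates",
    "finance_2": "interest rates rose and stocks fell again",
    "sports_1": "the team won the match in extra time",
}
tf = {name: term_frequencies(text) for name, text in docs.items()}

sim_same_topic = cosine_similarity(tf["finance_1"], tf["finance_2"])
sim_other_topic = cosine_similarity(tf["finance_1"], tf["sports_1"])
print(sim_same_topic, sim_other_topic)
```

A clustering algorithm would then group documents whose pairwise similarity is high. In practice, very common words are usually filtered out first so that they do not dominate the measure.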
Illustrating Document Clustering
• Clustering points: 3204 articles of the Los Angeles Times.
• Similarity measure: How many words are common in these documents (after some word filtering).

Category        Total Articles   Correctly Placed
Financial       555              364
Foreign         341              260
National        273              36
Metro           943              746
Sports          738              573
Entertainment   354              278

(Introduction to Data Mining, 2007)

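The figures in the table can be turned into per-category and overall placement rates with a few lines of arithmetic. The sketch below only restates the numbers from the table; the variable names are illustrative.

```python
# (total articles, correctly placed) per category, copied from the table.
results = {
    "Financial": (555, 364),
    "Foreign": (341, 260),
    "National": (273, 36),
    "Metro": (943, 746),
    "Sports": (738, 573),
    "Entertainment": (354, 278),
}

total_articles = sum(total for total, _ in results.values())
total_correct = sum(correct for _, correct in results.values())
overall_rate = total_correct / total_articles

# Fraction of each category's articles that ended up in the right cluster.
per_category = {name: correct / total for name, (total, correct) in results.items()}

for name, rate in per_category.items():
    print(f"{name:<14}{rate:.1%}")
print(f"Overall       {overall_rate:.1%}")
```

The category totals add up to the 3204 articles stated on the slide, and the National category stands out with by far the lowest share of correctly placed articles.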
Association Rule Discovery: Definition
• Given a set of records, each of which contains some number of items from a given collection, produce dependency rules that predict the occurrence of an item based on the occurrences of other items.
• Apriori principle: If an itemset is frequent, then all of its subsets are also frequent.

TID   Items
1     Bread, Coke, Milk
2     Beer, Bread
3     Beer, Coke, Diaper, Milk
4     Beer, Bread, Diaper, Milk
5     Coke, Diaper, Milk

Rules discovered:
{Milk} --> {Coke}
{Diaper, Milk} --> {Beer}

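The two rules can be checked against the five transactions by computing support and confidence, the usual measures for association rules. The slide itself does not define these measures, so the sketch below is an illustration with hypothetical function names.

```python
# The five transactions from the table, one set of items per TID.
transactions = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]


def support_count(itemset):
    """Number of transactions that contain every item of the itemset."""
    return sum(1 for t in transactions if itemset <= t)


def support(itemset):
    """Fraction of transactions that contain the itemset."""
    return support_count(itemset) / len(transactions)


def confidence(lhs, rhs):
    """Of the transactions containing lhs, the fraction that also contain rhs."""
    return support_count(lhs | rhs) / support_count(lhs)


rule_1 = (support({"Milk", "Coke"}), confidence({"Milk"}, {"Coke"}))
rule_2 = (support({"Diaper", "Milk", "Beer"}), confidence({"Diaper", "Milk"}, {"Beer"}))
print(rule_1, rule_2)

# Apriori principle: a subset is at least as frequent as any of its supersets.
print(support({"Milk"}) >= support({"Milk", "Coke"}))
```

Milk and Coke occur together in three of the five transactions, and three of the four transactions containing Milk also contain Coke. The Apriori principle is what lets an algorithm skip every superset of an itemset that is already known to be infrequent.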
Other Mining Tasks in a Nutshell
• Sequential Pattern Discovery
  • In point-of-sale transaction sequences, e.g. a computer bookstore:
    (Intro_To_Visual_C) (C++_Primer) --> (Perl_for_dummies, Tcl_Tk)
• Regression: Predict a continuous-valued attribute, for example with neural networks.
• Deviation Detection: Detect deviations from normal behavior, e.g. credit card fraud.

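Deviation detection can be illustrated with a very simple rule: flag a value when it lies far from the mean in units of standard deviation. This z-score rule, the threshold of 2.0 and the sample amounts are assumptions made for illustration only; real fraud detection uses far richer models.

```python
import statistics


def flag_deviations(amounts, threshold=2.0):
    """Return the amounts whose z-score exceeds the threshold (assumed rule)."""
    mean = statistics.mean(amounts)
    stdev = statistics.pstdev(amounts)  # population standard deviation
    if stdev == 0:
        # All amounts are identical, so nothing deviates.
        return []
    return [a for a in amounts if abs(a - mean) / stdev > threshold]


# Hypothetical card transactions: nine ordinary purchases and one unusual one.
amounts = [25, 30, 22, 28, 27, 31, 24, 26, 29, 950]

suspicious = flag_deviations(amounts)
print(suspicious)
```

A single extreme value inflates both the mean and the standard deviation, so robust statistics such as the median are often preferred for this kind of check.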
Challenges of Data Mining
• Scalability
• Dimensionality
• Complex and Heterogeneous Data
• Data Quality
• Data Ownership and Distribution
• Privacy Preservation
• Streaming Data

References
• Tan, P., Steinbach, M., & Kumar, V. Introduction to Data Mining. Addison-Wesley, 2006.
