Introduction to Data Mining, Ch. 2: Data Preprocessing
Heon Gyu Lee ( [email_address] ), http://dblab.chungbuk.ac.kr/~hglee
DB/Bioinfo. Lab., http://dblab.chungbuk.ac.kr, Chungbuk National University
Why Data Preprocessing?
What is Data? [Figure labels: Attributes, Objects]
Types of Attributes
Discrete and Continuous Attributes
Data Quality
Noise [Figure panels: Two Sine Waves; Two Sine Waves + Noise]
Outliers
Missing Values
Duplicate Data
Major Tasks in Data Preprocessing
Forms of Data Preprocessing
Data Cleaning
Data Cleaning : How to Handle Missing Data?
Data Cleaning : How to Handle Noisy Data?
Data Cleaning : Binning Methods
Data Cleaning : Regression [Figure: axes x and y, line y = x + 1, points X1, Y1, Y1']
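The regression figure can be read as smoothing: a noisy value Y1 is replaced by the value Y1' that the fitted line predicts at X1. The following is a minimal sketch of that idea, assuming an ordinary least-squares fit; the sample points and the names `fit_line` and `smooth` are made up for illustration and do not come from the slides.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a*x + b (assumed fitting method)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b


def smooth(xs, ys):
    """Replace each observed y with the value predicted by the fitted line."""
    a, b = fit_line(xs, ys)
    return [a * x + b for x in xs]


# Hypothetical noisy samples scattered around y = x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.2, 1.8, 3.1, 3.9, 5.0]

a, b = fit_line(xs, ys)
smoothed = smooth(xs, ys)
print(f"fitted line: y = {a:.2f}x + {b:.2f}")
```

For these made-up points the fitted slope and intercept land close to the line y = x + 1 shown in the figure.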
Data Cleaning : Cluster Analysis
Data Integration
Data Integration : Handling Redundancy in Data Integration
Data Integration : Correlation Analysis (Numerical Data)
Data Integration : Correlation Analysis (Categorical Data)
Chi-Square Calculation: An Example (expected counts in parentheses)
                           Play chess   Not play chess   Sum (row)
Like science fiction       250 (90)     200 (360)        450
Not like science fiction   50 (210)     1000 (840)       1050
Sum (col.)                 300          1200             1500
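The table's arithmetic can be checked with a short sketch. It assumes the usual Pearson chi-square statistic, with each expected count computed as (row sum × column sum) / total; the function name `chi_square` is illustrative only.

```python
def chi_square(observed):
    """Return (expected counts, chi-square statistic) for a contingency table."""
    row_sums = [sum(row) for row in observed]
    col_sums = [sum(col) for col in zip(*observed)]
    total = sum(row_sums)
    # Expected count under independence: row sum * column sum / total.
    expected = [[r * c / total for c in col_sums] for r in row_sums]
    stat = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )
    return expected, stat


# Observed counts from the slide: rows are like / not like science fiction,
# columns are play chess / not play chess.
observed = [[250, 200], [50, 1000]]
expected, stat = chi_square(observed)
print(expected)        # matches the parenthesised counts in the table
print(round(stat, 2))
```

The expected counts reproduce the values in parentheses (90, 360, 210, 840), and the statistic comes out at roughly 507.94.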
Data Transformation
Data Transformation : Normalization
Decimal scaling: ν' = ν / 10^j, where j is the smallest integer such that max(|ν'|) < 1
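The condition on j corresponds to normalization by decimal scaling, where each value is divided by 10^j. Below is a minimal sketch under that assumption; it searches only non-negative j, and the function name and sample values are illustrative.

```python
def decimal_scale(values):
    """Divide every value by 10**j, with j the smallest non-negative
    integer such that the largest absolute scaled value is below 1."""
    max_abs = max(abs(v) for v in values)
    j = 0
    while max_abs / 10 ** j >= 1:
        j += 1
    return j, [v / 10 ** j for v in values]


# Illustrative values: the largest magnitude is 986, so j = 3.
j, scaled = decimal_scale([-986, 917])
print(j, scaled)
```

Restricting j to non-negative integers is a simplification: data already inside (-1, 1) is returned unchanged.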
Data Reduction Strategies
Data Reduction : Aggregation
Data Reduction : Aggregation [Figure: Variation of Precipitation in Australia; panels: Standard Deviation of Average Monthly Precipitation; Standard Deviation of Average Yearly Precipitation]
Data Reduction : Sampling
Data Reduction : Types of Sampling
Data Reduction : Dimensionality Reduction
Dimensionality Reduction : PCA [Figure: axes x1 and x2, vector e]
Dimensionality Reduction : PCA (cont.) [Figure: axes x1 and x2, vector e]
Data Reduction : Feature Subset Selection
Data Reduction : Feature Subset Selection (cont.)
Data Reduction : Feature Creation
Data Reduction : Mapping Data to a New Space [Figure panels: Two Sine Waves; Two Sine Waves + Noise; Frequency]
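The "Frequency" panel suggests that the new space is the frequency domain. The sketch below assumes a discrete Fourier transform is the mapping. The signal length, the two frequencies (4 and 10 cycles per window), the noise level and the name `dft_magnitudes` are all made up for illustration. It shows that two sine waves buried in noise stand out as two clear peaks after the transform.

```python
import cmath
import math
import random


def dft_magnitudes(signal):
    """Magnitudes of a naive discrete Fourier transform, bins 0 .. n/2 - 1."""
    n = len(signal)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * t / n)
                for t, x in enumerate(signal)))
        for k in range(n // 2)
    ]


n = 64
rng = random.Random(0)  # fixed seed so the run is repeatable
# Two sine waves (4 and 10 cycles per window) plus uniform noise.
signal = [
    math.sin(2 * math.pi * 4 * t / n)
    + math.sin(2 * math.pi * 10 * t / n)
    + rng.uniform(-0.2, 0.2)
    for t in range(n)
]

mags = dft_magnitudes(signal)
# The two strongest frequency bins recover the two underlying sine waves.
peaks = sorted(sorted(range(len(mags)), key=lambda k: mags[k], reverse=True)[:2])
print(peaks)
```

The naive O(n²) transform keeps the example dependency-free; a fast Fourier transform from a numerical library would be the usual choice for real data.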
Data Reduction : Discretization Using Class Labels [Figure panels: 3 categories for both x and y; 5 categories for both x and y]
Data Reduction : Discretization Without Using Class Labels [Figure panels: Data; Equal interval width; Equal frequency; K-means]
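Two of the unsupervised schemes named in the figure, equal interval width and equal frequency, can be sketched in a few lines. The sample data and function names are illustrative assumptions, and K-means discretization is left out to keep the example short.

```python
def equal_width_bins(values, k):
    """Assign each value to one of k intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    if width == 0:
        return [0] * len(values)  # all values identical: a single bin
    # The maximum value would land in bin k, so clamp it into bin k - 1.
    return [min(int((v - lo) / width), k - 1) for v in values]


def equal_frequency_bins(values, k):
    """Assign each value to one of k bins holding roughly equal counts."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    labels = [0] * len(values)
    for rank, i in enumerate(order):
        labels[i] = rank * k // len(values)
    return labels


# Illustrative data, already sorted for readability.
data = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
width_labels = equal_width_bins(data, 3)
freq_labels = equal_frequency_bins(data, 3)
print(width_labels)
print(freq_labels)
```

With three bins, equal width splits this data 3/3/6 because the values cluster at the high end, while equal frequency always yields 4/4/4.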
Data Reduction : Attribute Transformation
Question & Answer
