1. Introduction
2. Data Quality: Why Preprocess the Data?
3. Data Preprocessing Tasks
4. Data Cleaning
5. Data Integration
6. Data Reduction
7. Data Transformation and Data Discretization
8. Conclusion
• Data preprocessing is the process that comes before applying data mining
techniques.
• Low-quality data will lead to low-quality mining results.
• So we need to apply data preprocessing techniques such as:
- Data quality
- Data cleaning
- Data integration
- Data reduction
- Data transformation
- Data discretization
• Data have quality if they satisfy the requirements of the intended use.
• There are many factors comprising data quality, including:
– Accuracy
– Completeness
– Consistency
– Timeliness
– Believability
– Interpretability
• Data cleaning routines attempt to fill in missing values, smooth out
noise while identifying outliers, and correct inconsistencies in the data.
• Basic methods of data cleaning:
– Missing values
– Noisy data
– Data cleaning as a process
• Missing values can be handled in several ways:
– Ignore the tuple
– Fill in the missing value manually
[time-consuming and may be infeasible for large data sets]
– Fill it in automatically with a global constant
[e.g., “Unknown” or ∞]
– Use the most probable value to fill in the missing value [regression,
inference-based tools using a Bayesian formalism, or decision tree
induction]
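The missing-value strategies above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed method: the `incomes` list and the function names are hypothetical, and filling with the mean of the observed values is an assumed, simple stand-in for the "most probable value", which the slide obtains through regression, Bayesian inference, or decision tree induction.

```python
from statistics import mean

# Hypothetical attribute values; None marks a missing value.
incomes = [42000, None, 58000, 61000, None, 39000]

def drop_missing(values):
    """Ignore the tuple: keep only the entries that have a value."""
    return [v for v in values if v is not None]

def fill_with_constant(values, constant="Unknown"):
    """Replace every missing value with one global constant."""
    return [constant if v is None else v for v in values]

def fill_with_mean(values):
    """Replace every missing value with the mean of the observed values.

    Assumption: the mean is used here only as a simple estimate of the
    most probable value; it is not the model-based approach on the slide.
    """
    fill = mean(drop_missing(values))
    return [fill if v is None else v for v in values]

kept = drop_missing(incomes)
filled_constant = fill_with_constant(incomes)
filled_mean = fill_with_mean(incomes)
```

Dropping tuples shrinks the data set, while a global constant keeps every tuple but can mislead a mining algorithm into treating "Unknown" as a meaningful value, which is why an estimated value is often preferred.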
• Noise is a random error or variance in a measured variable.
• Binning:
Binning methods smooth a sorted data value by consulting its
“neighborhood”, that is, the values around it.
The sorted values are distributed into a number of “buckets”, or
“bins”.
• Smoothing by bin means:
Each value in a bin is replaced by the mean value of the bin [e.g., each
of the values 4, 8, 15 in a bin is replaced by 9].
• Smoothing by bin medians:
Each value in a bin is replaced by the bin median.
• Smoothing by bin boundaries:
The minimum and maximum values in a given bin are identified as
the bin boundaries; each bin value is then replaced by the closest
boundary value.
• Binning is also used as a discretization technique.
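A short sketch of the three smoothing variants may help. It assumes equal-frequency bins of three values each; the `prices` list extends the slide's bin 4, 8, 15 with further illustrative values, and all function names are hypothetical. Sending ties to the lower boundary is also an assumption, since the slide does not say how ties are resolved.

```python
from statistics import mean, median

def equal_frequency_bins(values, bin_size):
    """Sort the values and split them into bins of bin_size values each."""
    ordered = sorted(values)
    return [ordered[i:i + bin_size] for i in range(0, len(ordered), bin_size)]

def smooth_by_means(bins):
    """Replace each value in a bin by the mean of that bin."""
    return [[mean(b)] * len(b) for b in bins]

def smooth_by_medians(bins):
    """Replace each value in a bin by the median of that bin."""
    return [[median(b)] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace each value by the closer of the bin's minimum and maximum."""
    smoothed = []
    for b in bins:
        low, high = b[0], b[-1]  # bins are already sorted
        # Assumption: a value equally close to both boundaries goes to the lower one.
        smoothed.append([low if v - low <= high - v else high for v in b])
    return smoothed

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
bins = equal_frequency_bins(prices, 3)
by_means = smooth_by_means(bins)
by_medians = smooth_by_medians(bins)
by_boundaries = smooth_by_boundaries(bins)
```

Larger bins smooth more aggressively, because each value is replaced by a summary of more neighbors.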
• Regression:
Data smoothing can also be done by regression, a technique that
conforms data values to a function.
– Linear regression involves finding the “best” line to fit two
attributes, so that one attribute can be used to predict the other.
– Multiple linear regression is an extension of linear regression in which
more than two attributes are involved.
• Outlier analysis:
Outliers may be detected by clustering, where similar values are
organized into groups or clusters; values that fall outside the clusters
may be considered outliers.
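As a rough illustration of smoothing by linear regression, the sketch below fits a least-squares line to two attributes and replaces each observed value with the value predicted by the line. The data points and the names `fit_line`, `xs`, and `ys` are made up for the example.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical noisy measurements of one attribute (ys) against another (xs).
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.0]

slope, intercept = fit_line(xs, ys)

# Smoothing: replace each observed y by the value on the fitted line.
smoothed = [slope * x + intercept for x in xs]

# Residuals show how far each observation lies from the line; an unusually
# large residual can flag a candidate outlier.
residuals = [y - s for y, s in zip(ys, smoothed)]
```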
• The first step in data cleaning as a process is discrepancy detection
[finding inconsistent data].
• The data should be examined with regard to:
– Unique rule [each value of the given attribute must be different from
all other values of that attribute]
– Consecutive rule [no missing values between the lowest and highest
values of the attribute]
– Null rule [specifies the use of blanks, question marks, special
characters, or other strings that indicate a null value]
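The three rules can be expressed as small checks. The sketch below is only one possible reading: the `invoice_ids` data, the function names, and the set of null markers are assumptions chosen for illustration, and the consecutive rule is checked for integer-valued attributes only.

```python
def violates_unique_rule(values):
    """True if any value of the attribute occurs more than once."""
    return len(set(values)) != len(values)

def missing_in_sequence(values):
    """Integers absent between the lowest and highest value (consecutive rule)."""
    present = set(values)
    return [v for v in range(min(values), max(values) + 1) if v not in present]

# Assumed null rule for this sketch: these strings stand for "no value".
NULL_MARKERS = {"", "?", "N/A"}

def is_null(value):
    """True if the value is None or matches one of the agreed null markers."""
    return value is None or str(value).strip() in NULL_MARKERS

# Hypothetical attribute with one gap (103) and one duplicate (105).
invoice_ids = [101, 102, 104, 105, 105]

has_duplicates = violates_unique_rule(invoice_ids)
gaps = missing_in_sequence(invoice_ids)
```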
• Use commercial tools:
– Data scrubbing: use simple domain knowledge (e.g., postal codes,
spell-checking) to detect errors and make corrections
– Data auditing: analyze the data to discover rules and relationships
and to detect violators (e.g., correlation and clustering to find outliers)
• Data migration and integration:
– Data migration tools: allow transformations to be specified
– ETL (Extraction/Transformation/Loading) tools: allow users to
specify transformations through a graphical user interface
• Data integration is the merging of data from multiple data stores.
• Careful integration helps avoid and reduce redundancies and
inconsistencies in the resulting data set.
• Schema integration: [integrate metadata from different sources]
• Entity identification problem: [identify the same real-world entities
across multiple data sources]
• Redundancy analysis: [an attribute may be redundant if it can be derived
from other attributes; such redundancy can be detected by correlation
analysis]
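For numeric attributes, redundancy analysis by correlation can be sketched with the Pearson correlation coefficient. The attributes below (`height_cm`, `height_in`, `weight_kg`) and the 0.95 threshold are hypothetical choices for the example; a suitable threshold depends on the data.

```python
from math import sqrt

def pearson(a, b):
    """Pearson correlation coefficient of two equally long numeric lists."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)

# Two sources store the same height in different units, so after
# integration one of the two attributes is redundant.
height_cm = [150.0, 160.0, 170.0, 180.0]
height_in = [h / 2.54 for h in height_cm]
weight_kg = [70.0, 52.0, 88.0, 61.0]

REDUNDANCY_THRESHOLD = 0.95  # assumed cut-off for this sketch

r_heights = pearson(height_cm, height_in)
r_height_weight = pearson(height_cm, weight_kg)
heights_redundant = abs(r_heights) >= REDUNDANCY_THRESHOLD
weight_redundant = abs(r_height_weight) >= REDUNDANCY_THRESHOLD
```

A high correlation only suggests that one attribute may be derivable from the other; it does not by itself prove that the attribute can be dropped safely.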
• Data reduction is applied to obtain a reduced representation of the
data set.
• Data reduction strategies include:
– Dimensionality reduction:
Remove unimportant attributes.
Its methods include wavelet transforms and principal components
analysis (PCA), which transform or project the original data onto a
smaller space.
– Numerosity reduction:
Replace the original data volume by alternative, smaller forms of data
representation.
– Data compression:
Transformations are applied to obtain a reduced or
“compressed” representation of the original data.
• If the original data can be reconstructed from the compressed data
without any information loss, the data reduction is called “lossless”.
• If we can reconstruct only an approximation of the original data,
the data reduction is called “lossy”.
• Dimensionality reduction and numerosity reduction
techniques can also be considered forms of data
compression.
[Figure: Data compression. Original data is reduced to compressed data; a lossless method restores the original data, while a lossy method restores only an approximation of it.]
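The difference between lossless and lossy reduction can be shown with a small sketch. It uses Python's standard `zlib` module for the lossless case and simple rounding as an assumed stand-in for a lossy method; the `readings` data are made up for the example.

```python
import zlib

# Hypothetical sensor readings, repeated so that compression has patterns to exploit.
readings = [20.13, 20.17, 20.11, 20.19, 20.14] * 200
raw = ",".join(f"{r:.2f}" for r in readings).encode("ascii")

# Lossless: the decompressed bytes are identical to the original bytes.
compressed = zlib.compress(raw)
restored = zlib.decompress(compressed)

# Lossy (assumed method for this sketch): keep one decimal place only.
# The original readings can no longer be recovered exactly.
approx = [round(r, 1) for r in readings]
max_error = max(abs(r - a) for r, a in zip(readings, approx))
```

The lossless round trip returns exactly the original bytes, while the rounded values carry a small, bounded error in exchange for a shorter representation.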
• Data transformation routines convert the data into appropriate
forms for mining.
• Strategies for data transformation include:
– Smoothing: remove noise from the data.
– Attribute/feature construction: new attributes are constructed
from the given ones to help the mining process.
– Aggregation: summarization and data cube construction, e.g., daily
sales are aggregated to compute monthly or annual totals.
– Normalization: attribute values are scaled to fall within a smaller,
specified range, e.g., min-max normalization to -1.0 to 1.0 or 0.0 to 1.0.
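Min-max normalization maps a value v of an attribute with observed minimum min and maximum max to v' = (v - min) / (max - min) * (new_max - new_min) + new_min. A minimal sketch follows, with hypothetical income values and a hypothetical function name:

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values so they fall within [new_min, new_max]."""
    old_min, old_max = min(values), max(values)
    if old_max == old_min:
        # A constant attribute has no range to rescale.
        raise ValueError("cannot normalize a constant attribute")
    scale = (new_max - new_min) / (old_max - old_min)
    return [(v - old_min) * scale + new_min for v in values]

# Hypothetical incomes with minimum 12000 and maximum 98000.
incomes = [12000, 73600, 98000]

unit_range = min_max_normalize(incomes)                # target range 0.0 to 1.0
signed_range = min_max_normalize(incomes, -1.0, 1.0)   # target range -1.0 to 1.0
```

Because the mapping depends on the observed minimum and maximum, a later value outside that range would fall outside the target interval.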
• Data discretization transforms numeric data by mapping values to interval
or concept labels.
• Discretization and concept hierarchy generation can also be useful,
where raw values for attributes are replaced by ranges or higher
conceptual levels.
• For example, raw values of a numeric attribute (e.g., age) are replaced by
interval labels (e.g., 0-10, 11-20, etc.) or higher-level concepts (e.g.,
youth, adult, senior).
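The age example can be sketched as two small mapping functions. The interval layout follows the slide (0-10, 11-20, and so on); the cut-offs between youth, adult, and senior are not given in the slides, so the values 25 and 65 below are assumptions, as are the function names.

```python
def interval_label(age, width=10):
    """Map an age to an interval label such as '0-10', '11-20', '21-30'."""
    if age < 0:
        raise ValueError("age must be non-negative")
    if age <= width:
        return f"0-{width}"
    lower = ((age - 1) // width) * width + 1
    return f"{lower}-{lower + width - 1}"

def concept_label(age, adult_from=25, senior_from=65):
    """Map an age to a higher-level concept.

    Assumption: the cut-offs 25 and 65 are illustrative only.
    """
    if age < adult_from:
        return "youth"
    if age < senior_from:
        return "adult"
    return "senior"

ages = [7, 11, 20, 35, 70]
intervals = [interval_label(a) for a in ages]
concepts = [concept_label(a) for a in ages]
```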
• Three types of attributes:
– Nominal: values from an unordered set, e.g., color, profession
– Ordinal: values from an ordered set, e.g., military or academic rank
– Numeric: quantitative values, e.g., integers or real numbers

• Discretization:
Divide the range of a continuous attribute into intervals.
– Interval labels can then be used to replace actual data values
– Reduce data size by discretization
– Supervised vs. unsupervised
– Split (top-down) vs. merge (bottom-up)
– Discretization can be performed recursively on an attribute
– Prepare for further analysis, e.g., classification
Although numerous methods of data preprocessing have been
developed, data preprocessing remains an active area of research
due to the huge amount of inconsistent or dirty data and the
complexity of the problem.
Data preprocess