
Data Preprocessing- Data Warehouse & Data Mining


Data Preprocessing- Data Warehouse & Data Mining
Data Quality
Reasons for inaccurate data
Reasons for incomplete data
Major Tasks in Data Preprocessing
Forms of Data Preprocessing
Data Cleaning
Incomplete (Missing) Data


Data Preprocessing- Data Warehouse & Data Mining

  1. TRINITY INSTITUTE OF PROFESSIONAL STUDIES, Sector – 9, Dwarka Institutional Area, New Delhi-75. Affiliated Institution of G.G.S.IP.U, Delhi. BCA Data Warehouse & Data Mining (20302): Data Preprocessing
  2. Data Preprocessing
     • Data Preprocessing: An Overview
       – Data Quality
       – Major Tasks in Data Preprocessing
     • Data Cleaning
     • Data Integration
     • Data Reduction
     • Data Transformation and Data Discretization
     • Summary
  3. Data Quality: Why Preprocess the Data?
     • Measures for data quality: a multidimensional view
       ◦ Accuracy: correct or wrong, accurate or not
       ◦ Completeness: not recorded, unavailable, …
       ◦ Consistency: some modified but some not, dangling, …
       ◦ Timeliness: updated in a timely manner?
       ◦ Believability: how trustworthy is the data?
       ◦ Interpretability: how easily can the data be understood?
  4. Reasons for inaccurate data
     • Data collection instruments may be faulty
     • Human or computer errors occurring at data entry
     • Users may purposely submit incorrect data for mandatory fields when they don't want to share personal information
     • Technology limitations, such as buffer size
     • Incorrect data may also result from inconsistencies in naming conventions or inconsistent formats
     • Duplicate tuples also require cleaning
  5. Reasons for incomplete data
     • Attributes of interest may not be available
     • Other data may not be included because it was not considered important at the time of entry
     • Relevant data may not be recorded due to misunderstanding or equipment malfunctions
     • Inconsistent data may be deleted
     • Data history or modifications may be overlooked
     • All of these lead to missing data
  6. Major Tasks in Data Preprocessing
     • Data cleaning
       ◦ Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
     • Data integration
       ◦ Integration of multiple databases, data cubes, or files
     • Data reduction
       ◦ Dimensionality reduction
       ◦ Numerosity reduction
       ◦ Data compression
     • Data transformation and data discretization
       ◦ Normalization
       ◦ Concept hierarchy generation
  7. Forms of Data Preprocessing
  8. Why Is Data Preprocessing Important?
     • No quality data, no quality mining results!
       ◦ Quality decisions must be based on quality data; e.g., duplicate or missing data may cause incorrect or even misleading statistics
     • A data warehouse needs consistent integration of quality data
     • Data extraction, cleaning, and transformation comprise the majority of the work of building a data warehouse
  9. Data Cleaning
     • Data in the real world is dirty: lots of potentially incorrect data, e.g., from faulty instruments, human or computer error, or transmission error
       ◦ Incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
         – e.g., Occupation = “ ” (missing data)
       ◦ Noisy: containing noise, errors, or outliers
         – e.g., Salary = “−10” (an error)
       ◦ Inconsistent: containing discrepancies in codes or names, e.g.,
         – Age = “42”, Birthday = “03/07/2010”
         – Was rating “1, 2, 3”, now rating “A, B, C”
         – Discrepancy between duplicate records
       ◦ Intentional (e.g., disguised missing data)
         – Jan. 1 as everyone's birthday?
  10. Incomplete (Missing) Data
     • Data is not always available
       ◦ e.g., many tuples have no recorded value for several attributes, such as customer income in sales data
     • Missing data may be due to
       ◦ equipment malfunction
       ◦ data that was inconsistent with other recorded data and was therefore deleted
       ◦ data not entered due to misunderstanding
       ◦ certain data not being considered important at the time of entry
       ◦ the history or changes of the data not being registered
     • Missing data may need to be inferred
  11. How to Handle Missing Data?
     • Ignore the tuple: usually done when the class label is missing (in classification); not effective when the percentage of missing values per attribute varies considerably
     • Fill in the missing value manually: tedious and often infeasible
     • Fill it in automatically with (illustrated in the sketch below)
       ◦ a global constant, e.g., “unknown” (effectively a new class?!)
       ◦ the attribute mean
       ◦ the attribute mean for all samples belonging to the same class (smarter)
       ◦ the most probable value: inference-based, such as a Bayesian formula or a decision tree
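The slides give no code; as a minimal illustration only, the Python/pandas sketch below shows three of the automatic fill strategies listed above (global constant, attribute mean, class-conditional mean). The column names "class" and "income" and the sample values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical customer data with missing income values
df = pd.DataFrame({
    "class":  ["A", "A", "B", "B", "B"],
    "income": [50.0, np.nan, 30.0, np.nan, 40.0],
})

# 1) Global constant: replace missing values with a sentinel (here -1)
filled_constant = df["income"].fillna(-1)

# 2) Attribute mean: replace missing values with the overall column mean
filled_mean = df["income"].fillna(df["income"].mean())

# 3) Class-conditional mean: use the mean of samples in the same class
filled_class_mean = df["income"].fillna(
    df.groupby("class")["income"].transform("mean")
)

print(filled_constant.tolist())    # [50.0, -1.0, 30.0, -1.0, 40.0]
print(filled_mean.tolist())        # [50.0, 40.0, 30.0, 40.0, 40.0]
print(filled_class_mean.tolist())  # [50.0, 50.0, 30.0, 35.0, 40.0]
```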
  12. Noisy Data
     • Noise: random error or variance in a measured variable
     • Incorrect attribute values may be due to
       ◦ faulty data collection instruments
       ◦ data entry problems
       ◦ data transmission problems
       ◦ technology limitations
       ◦ inconsistency in naming conventions
     • Other data problems that require data cleaning
       ◦ duplicate records
       ◦ incomplete data
       ◦ inconsistent data
  13. How to Handle Noisy Data?
     • Binning
       ◦ First sort the data and partition it into (equal-frequency) bins
       ◦ Then smooth by bin means, bin medians, bin boundaries, etc.
     • Regression
       ◦ Smooth by fitting the data to regression functions (a rough sketch follows this slide)
     • Outlier analysis by clustering
       ◦ Detect and remove outliers
     • Combined computer and human inspection
       ◦ Detect suspicious values and have a human check them (e.g., to deal with possible outliers)
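As a rough illustration of the regression option above (not taken from the slides), this sketch fits a simple straight line with NumPy and replaces each noisy measurement with its fitted value; the x and y arrays are made-up data.

```python
import numpy as np

# Made-up noisy measurements: y roughly follows 2*x + 1 plus noise
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([3.2, 4.8, 7.3, 8.9, 11.4, 12.7])

# Fit a degree-1 polynomial (a straight line) to the data
slope, intercept = np.polyfit(x, y, deg=1)

# Smooth: replace each observed value with the value predicted by the line
y_smoothed = slope * x + intercept
print(np.round(y_smoothed, 2))
```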
  14. Binning method to smooth data
     • The binning method smooths a sorted data value by consulting its neighborhood, i.e., the values around it.
     • The sorted values are distributed into a number of “buckets” or “bins”.
     • Since binning methods consult the neighborhood of values, they perform local smoothing.
  15. Binning Methods for Data Smoothing (reproduced in the sketch below)
     • Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
     • Partition into equal-frequency (equi-depth) bins:
       ◦ Bin 1: 4, 8, 9, 15
       ◦ Bin 2: 21, 21, 24, 25
       ◦ Bin 3: 26, 28, 29, 34
     • Smoothing by bin means:
       ◦ Bin 1: 9, 9, 9, 9
       ◦ Bin 2: 23, 23, 23, 23
       ◦ Bin 3: 29, 29, 29, 29
     • Smoothing by bin boundaries:
       ◦ Bin 1: 4, 4, 15, 15
       ◦ Bin 2: 21, 21, 25, 25
       ◦ Bin 3: 26, 26, 34, 34
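A minimal Python sketch (not part of the slides) that reproduces the equal-frequency binning above and both smoothing variants on the same price list; bin means are rounded to whole dollars as in the slide.

```python
# Equal-frequency binning and smoothing for the price example above
prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]  # already sorted
n_bins = 3
size = len(prices) // n_bins
bins = [prices[i * size:(i + 1) * size] for i in range(n_bins)]

# Smoothing by bin means: every value in a bin becomes the (rounded) bin mean
by_means = [[round(sum(b) / len(b))] * len(b) for b in bins]

# Smoothing by bin boundaries: each value moves to the closer bin boundary (min or max)
by_boundaries = [
    [min(b) if v - min(b) <= max(b) - v else max(b) for v in b]
    for b in bins
]

print(bins)           # [[4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]]
print(by_means)       # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print(by_boundaries)  # [[4, 4, 15, 15], [21, 21, 25, 25], [26, 26, 34, 34]]
```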
  16. Data Cleaning as a Process
     I. Data discrepancy detection
     • Discrepancies can be caused by the following factors:
       ◦ Poorly designed data entry forms that have many optional fields
       ◦ Human error in data entry
       ◦ Deliberate errors
       ◦ Data decay (e.g., outdated addresses)
       ◦ Inconsistent data representations
       ◦ Instrument and device errors
       ◦ System errors
  17. Data Cleaning as a Process (continued)
     Data discrepancies can be detected by:
     • Using metadata (e.g., domain, range, dependency, distribution)
     • Checking for field overloading
     • Checking uniqueness rules, consecutive rules, and null rules (see the sketch after this slide)
     • Using commercial tools
       ◦ Data scrubbing tools: use simple domain knowledge (e.g., postal codes, spell-checking) to detect errors and make corrections
       ◦ Data auditing tools: analyze the data to discover rules and relationships and to detect violators (e.g., correlation and clustering to find outliers)
     II. Data migration and integration
     • Data migration tools: allow transformations to be specified
     • ETL (Extraction/Transformation/Loading) tools: allow users to specify transformations through a graphical user interface
     • Potter's Wheel is a publicly available data cleaning tool that integrates discrepancy detection and transformation.
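The slides name uniqueness and null rules without code; as a hedged illustration, the sketch below checks one of each with pandas on a made-up table. The column names customer_id and email are hypothetical.

```python
import pandas as pd

# Made-up records for illustration
df = pd.DataFrame({
    "customer_id": [101, 102, 102, 104],
    "email": ["a@x.com", None, "b@x.com", ""],
})

# Uniqueness rule: customer_id must not repeat
duplicate_ids = df[df.duplicated(subset="customer_id", keep=False)]
print("Uniqueness-rule violations:")
print(duplicate_ids)

# Null rule: email must not be missing or blank
null_emails = df[df["email"].isna() | (df["email"].str.strip() == "")]
print("Null-rule violations:")
print(null_emails)
```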
  18. THANK YOU
