Presented by:
Kunal Jain (071309)
Under the guidance of
Mr. Praveen Kumar Tripathi
Dept of CSE & IT (JUIT)
 Introduction
 Steps in Data Cleansing
 Conclusion
 References
“A company’s most important asset is information. A
 corporation’s ability to compete, adapt, and grow in a
 business climate of rapid change is dependent in large
 measure on how well the company uses information
 to make decisions. Sharing information that isn’t
 clean and consolidated to the fullest extent can
 substantially reduce the effectiveness of a system of
 significant investment and considerable pay-off
 potential.”
Data cleansing, or data scrubbing, is the act of
detecting and correcting (or removing) corrupt or
inaccurate records from a record set, table, or
database. Used mainly in databases, the term
refers to identifying incomplete, incorrect,
inaccurate, or irrelevant parts of the data and
then replacing, modifying, or deleting this dirty data.
• Data cleansing can occur within a single set of records, or
between multiple sets of data that need to be merged or
that will work together.

• Typos and spelling errors are corrected, mislabeled data
is properly labeled and filed, and incomplete or missing
entries are completed.

• In more complex operations, data cleansing can be
performed by computer programs. These programs check
the data against a variety of rules and procedures
decided upon by the user.

• The goal of data cleansing is not just to clean up the data
in a database but also to bring consistency to different sets
of data that have been merged from separate databases.
Dummy Values,
Absence of Data,
Multipurpose Fields,
Cryptic Data,
Contradicting Data,
Inappropriate Use of Address Lines,
Violation of Business Rules,
Reused Primary Keys,
Non-Unique Identifiers, and
Data Integration Problems
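
A few of the problems listed above can be flagged with simple rule checks. The sketch below is only illustrative: the placeholder values in DUMMY_VALUES, the field names, the sample rows, and the helper find_problems are all assumptions made for the example, not part of the presentation.

```python
from collections import Counter

# Assumed placeholder values that often stand in for real data.
DUMMY_VALUES = {"N/A", "UNKNOWN", "999-99-9999", "00000"}

def find_problems(records, key_field="id"):
    """Return (row_index, field, problem) tuples for a list of dict records."""
    problems = []
    key_counts = Counter(r.get(key_field) for r in records)
    for i, record in enumerate(records):
        for field, value in record.items():
            text = str(value).strip() if value is not None else ""
            if text == "":
                problems.append((i, field, "absence of data"))
            elif text.upper() in DUMMY_VALUES:
                problems.append((i, field, "dummy value"))
        # The same key appearing on more than one row is a non-unique identifier.
        if key_counts[record.get(key_field)] > 1:
            problems.append((i, key_field, "non-unique identifier"))
    return problems

# Made-up sample rows, for illustration only.
rows = [
    {"id": "C1", "ssn": "999-99-9999", "city": "Chicago"},
    {"id": "C1", "ssn": "123-45-6789", "city": ""},
]
issues = find_problems(rows)
```

Problems such as cryptic data or violated business rules usually need domain-specific rules and are not covered by this sketch.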
Parsing
Correcting
Standardizing
Matching
Consolidating
Parsing locates and identifies individual data
elements in the source files and then isolates
these data elements in the target files.
Input Data from Source File
Beth Christine Parker, SLS MGR
Regional Port Authority
Federal Building
12800 Lake Calumet
Hedgewisch, IL

Parsed Data in Target File
First Name:      Beth
Middle Name:     Christine
Last Name:       Parker
Title:           SLS MGR
Firm:            Regional Port Authority
Location:        Federal Building
Number:          12800
Street:          Lake Calumet
City:            Hedgewisch
State:           IL
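
The parsing step for this example record could be sketched as follows. This is a minimal illustration, assuming the source always has exactly five lines in the order shown and a three-part name; the function name parse_record and the dictionary keys are hypothetical.

```python
def parse_record(lines):
    """Split a five-line free-text contact block into named fields.

    Assumes the fixed layout of the example: name and title, firm,
    location, street address, then city and state.
    """
    name_part, title = [p.strip() for p in lines[0].split(",", 1)]
    first, middle, last = name_part.split()
    number, street = lines[3].split(" ", 1)
    city, state = [p.strip() for p in lines[4].split(",", 1)]
    return {
        "first_name": first,
        "middle_name": middle,
        "last_name": last,
        "title": title,
        "firm": lines[1].strip(),
        "location": lines[2].strip(),
        "number": number,
        "street": street.strip(),
        "city": city,
        "state": state,
    }

source = [
    "Beth Christine Parker, SLS MGR",
    "Regional Port Authority",
    "Federal Building",
    "12800 Lake Calumet",
    "Hedgewisch, IL",
]
parsed = parse_record(source)
```

Real source files rarely follow one layout, so a production parser would need pattern rules for each variation rather than fixed positions.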
Correcting fixes parsed individual data components
using sophisticated data algorithms and
secondary data sources.
Field          Parsed Data                Corrected Data
First Name:    Beth                       Beth
Middle Name:   Christine                  Christine
Last Name:     Parker                     Parker
Title:         SLS MGR                    SLS MGR
Firm:          Regional Port Authority    Regional Port Authority
Location:      Federal Building           Federal Building
Number:        12800                      12800
Street:        Lake Calumet               South Butler Drive
City:          Hedgewisch                 Chicago
State:         IL                         IL
Zip:                                      60633
Zip+Four:                                 2398
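
One way to picture correction against a secondary data source is a lookup keyed on the parsed address. In the sketch below the dictionary ADDRESS_REFERENCE is a toy stand-in for such a source (for example a postal reference file); its single entry simply mirrors the example values, and the names used are assumptions.

```python
# Toy stand-in for a secondary data source such as a postal reference file.
# The single entry mirrors the example above; a real source would be far larger.
ADDRESS_REFERENCE = {
    ("12800", "lake calumet", "hedgewisch", "IL"): {
        "street": "South Butler Drive",
        "city": "Chicago",
        "zip": "60633",
        "zip4": "2398",
    },
}

def correct_address(record):
    """Return a copy of the record with address fields fixed when a reference entry exists."""
    key = (
        record["number"],
        record["street"].lower(),
        record["city"].lower(),
        record["state"],
    )
    corrected = dict(record)
    corrected.update(ADDRESS_REFERENCE.get(key, {}))
    return corrected

parsed = {"number": "12800", "street": "Lake Calumet", "city": "Hedgewisch", "state": "IL"}
corrected = correct_address(parsed)
unknown = correct_address({"number": "1", "street": "Main", "city": "Springfield", "state": "IL"})
```

Records with no reference entry pass through unchanged, which keeps the step safe to run on data the reference source does not cover.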
Standardizing applies conversion routines to
transform data into its preferred (and
consistent) format using both standard and
custom business rules.
Field                      Corrected Data             Standardized Data
Pre-name:                                             Ms.
First Name:                Beth                       Beth
1st Name Match Standards:                             Elizabeth, Bethany, Bethel
Middle Name:               Christine                  Christine
Last Name:                 Parker                     Parker
Title:                     SLS MGR                    Sales Mgr.
Firm:                      Regional Port Authority    Regional Port Authority
Location:                  Federal Building           Federal Building
Number:                    12800                      12800
Street:                    South Butler Drive         S. Butler Dr.
City:                      Chicago                    Chicago
State:                     IL                         IL
Zip:                       60633                      60633
Zip+Four:                  2398                       2398
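
Conversion routines of this kind are often just rule tables applied field by field. The following sketch assumes two small, hypothetical rule tables (TITLE_RULES and STREET_WORDS) that cover only the values in the example.

```python
# Hypothetical business rules; a real rule set would be much larger.
TITLE_RULES = {"SLS MGR": "Sales Mgr."}
STREET_WORDS = {"South": "S.", "Drive": "Dr."}

def standardize(record):
    """Return a copy of the record with title and street in the preferred format."""
    out = dict(record)
    out["title"] = TITLE_RULES.get(record["title"], record["title"])
    out["street"] = " ".join(
        STREET_WORDS.get(word, word) for word in record["street"].split()
    )
    return out

corrected = {"title": "SLS MGR", "street": "South Butler Drive", "city": "Chicago"}
standardized = standardize(corrected)
```

Values with no matching rule are left as they are, so the routine can be rerun on already standardized data without changing it.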
Matching searches for and matches records within and
across the parsed, corrected and standardized
data, based on predefined business rules, to
eliminate duplicates.
Business Name   Street   Branch Type   Customer #/Tax ID   City    Vendor Code   Pattern   Pattern I.D.
Exact           Exact    Exact         Exact               Exact   Exact         AAAAAA    P110
Exact           VClose   Exact         VClose              Exact   Blanks        ABAAA-    P115
Exact           VClose   Exact         Blanks              Exact   Exact         ABA-AA    P120
Exact           VClose   Close         Close               Exact   Exact         ABCCAA    S300
VClose          VClose   Exact         Close               Exact   Exact         BBACAA    S310
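
A match pattern like those in the table can be built by grading each field pair and concatenating the grades. The sketch below is an assumption-heavy illustration: reading the pattern letters as A = Exact, B = VClose, C = Close and "-" = Blanks is inferred from the table, the similarity thresholds (0.9 and 0.75) and the extra "X" grade for a non-match are invented for the example, and the sample field values are made up. Similarity is measured with Python's difflib.SequenceMatcher, which is a substitute for whatever comparison the original tool used.

```python
from difflib import SequenceMatcher

# Pattern IDs as listed in the table above.
PATTERN_IDS = {
    "AAAAAA": "P110", "ABAAA-": "P115", "ABA-AA": "P120",
    "ABCCAA": "S300", "BBACAA": "S310",
}
FIELDS = ["business_name", "street", "branch_type",
          "customer_id", "city", "vendor_code"]

def grade(a, b):
    """Grade one field pair: A exact, B very close, C close, - blank, X no match."""
    a, b = a.strip().lower(), b.strip().lower()
    if not a or not b:
        return "-"
    if a == b:
        return "A"
    ratio = SequenceMatcher(None, a, b).ratio()
    if ratio >= 0.9:   # assumed threshold for "VClose"
        return "B"
    if ratio >= 0.75:  # assumed threshold for "Close"
        return "C"
    return "X"

def match_pattern(rec1, rec2):
    """Return the pattern string and its pattern ID (None if the pattern is not listed)."""
    pattern = "".join(grade(rec1.get(f, ""), rec2.get(f, "")) for f in FIELDS)
    return pattern, PATTERN_IDS.get(pattern)

# Made-up sample records, for illustration only.
rec1 = {"business_name": "Regional Port Authority", "street": "S. Butler Dr.",
        "branch_type": "HQ", "customer_id": "4471", "city": "Chicago",
        "vendor_code": "V09"}
rec2 = dict(rec1, street="S. Butler Dr", vendor_code="")
pattern, pattern_id = match_pattern(rec1, rec2)
```

A pair whose pattern is not in the table returns None as its ID, which a business rule could treat as "not a duplicate" or route to manual review.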
Field                      Corrected Data (Data Source #1)   Corrected Data (Data Source #2)
Pre-name:                  Ms.                               Ms.
First Name:                Beth                              Elizabeth
1st Name Match Standards:  Elizabeth, Bethany, Bethel        Beth, Bethany, Bethel
Middle Name:               Christine                         Christine
Last Name:                 Parker                            Parker-Lewis
Title:                     Sales Mgr.
Firm:                      Regional Port Authority           Regional Port Authority
Location:                  Federal Building                  Federal Building
Number:                    12800                             12800
Street:                    S. Butler Dr.                     S. Butler Dr., Suite 2
City:                      Chicago                           Chicago
State:                     IL                                IL
Zip:                       60633                             60633
Zip+Four:                  2398                              2398
Phone:                                                       708-555-1234
Fax:                                                         708-555-5678
Consolidating analyzes and identifies relationships
between matched records and merges
them into ONE representation.
Consolidated Data (merged from Corrected Data, Data Sources #1 and #2)
Name:            Ms. Beth (Elizabeth) Christine Parker-Lewis
Title:           Sales Mgr.
Firm:            Regional Port Authority
Location:        Federal Building
Address:         12800 S. Butler Dr., Suite 2
                 Chicago, IL 60633-2398
Phone:           708-555-1234
Fax:             708-555-5678
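
Merging two matched records needs a rule for which value survives in each field. The sketch below uses "the longer non-empty value wins", which is an assumed rule chosen for simplicity and not one stated in the presentation; the function name consolidate and the field names are likewise hypothetical.

```python
def consolidate(rec1, rec2):
    """Merge two matched records, keeping the longer value for each field.

    Ties keep the value from rec1. Fields present in only one record
    are carried over as they are.
    """
    merged = {}
    fields = list(rec1) + [f for f in rec2 if f not in rec1]
    for field in fields:
        v1, v2 = rec1.get(field, ""), rec2.get(field, "")
        merged[field] = v1 if len(v1) >= len(v2) else v2
    return merged

source1 = {"first_name": "Beth", "last_name": "Parker",
           "title": "Sales Mgr.", "street": "S. Butler Dr."}
source2 = {"first_name": "Elizabeth", "last_name": "Parker-Lewis",
           "title": "", "street": "S. Butler Dr., Suite 2",
           "phone": "708-555-1234"}
merged = consolidate(source1, source2)
```

Note that this rule keeps only "Elizabeth" as the first name, whereas the consolidated record above retains both forms, "Beth (Elizabeth)"; preserving alternates would need an extra rule for name fields.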
1. Use metadata to document rules.
2. Determine data cleansing schedule.
3. Build quality into new and existing systems.
Hence we conclude that DATA CLEANSING is
not only an effective tool for removing
unwanted, “dirty” data, but also the means to
make the data in our databases and systems
concise, selective and appropriate, in order to
serve our clients better and cater to their
demands as well.
Web:
 en.wikipedia.org/wiki/Data_cleansing
 www2.gbif.org/DataCleaning.pdf
 www.webopedia.com/TERM/D/data_cleansing.html
Books:
 Data Mining by Ian H. Witten and Eibe Frank
 Exploratory Data Mining and Data Quality by Dasu and Johnson (Wiley, 2004)