This document discusses various approaches for finding, loading, and cleaning data. It provides examples of public data sources like government websites and catalogs. It also discusses different file formats for storing data and databases for processing it. The document outlines common data issues like missing values, invalid data types and incorrect structure that require cleaning. It provides examples of how to fix such issues through techniques like standardizing values, filtering rows and columns, and validating data.
3. Client data isn't easy to get
THERE'S CLIENT DATA, AND THERE'S PUBLIC DATA
Public data isn't relevant
4. “We have internal information. Getting information from outside is our challenge. There’s no way of doing that.”
– Senior Editor, Leading Media Company
5. INDIA’S RELIGIONS
If you search on google.co.in for "how do I convert to", these are the suggestions Google shows.
Popularity influences the order,
so there's a good chance that the religions near the top are searched for more often.
6. AUSTRALIA’S RELIGIONS
But be careful of how you interpret it.
In Australia, PDF is not a religion. Unless you're a data scientist.
8. USE MULTIPLE APPROACHES TO FIND YOUR DATA
1) Public data catalogues
https://github.com/caesar0301/awesome-public-datasets
https://github.com/rasbt/pattern_classification/blob/master/resources/dataset_collections.md
2) Govt data websites
https://data.gov.in/
https://data.gov/
https://data.gov.uk/
https://data.gov.sg/
http://publicdata.eu/
3) or search on Google
https://www.google.com/
4) or ask people
Humans™
12. EXERCISE
LET'S LOAD FROM A SITE: THE GOOGLE SEARCH DATA YOU SAW EARLIER
LET'S LOAD A BIG DATASET: A FEW COLUMNS FROM A LEAKED OK CUPID SURVEY
LET'S LOAD AN UNSTRUCTURED TABLE: A TABLE FROM THE MEDICAL CERTIFICATION OF CAUSE OF DEATH 2013 PDF
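For the big-dataset exercise, one possible approach is to stream the file and keep only the columns you need, so the whole file never sits in memory. This is a minimal sketch using Python's standard `csv` module. The function name `load_columns`, the column names and the sample rows are hypothetical stand-ins, not the actual survey data.

```python
import csv
import io

def load_columns(lines, wanted):
    """Stream a CSV and yield only the wanted columns of each row."""
    reader = csv.DictReader(lines)
    for row in reader:
        yield {name: row[name] for name in wanted}

# Hypothetical stand-in for a large file. A real file would be passed as
# open("survey.csv", newline="", encoding="utf-8") (file name assumed).
sample = io.StringIO(
    "age,city,q1,q2\n"
    "31,Pune,yes,no\n"
    "27,Delhi,no,no\n"
)

rows = list(load_columns(sample, ["age", "q1"]))
print(rows)
```

Every value arrives as text, so type conversion is still a separate cleaning step.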
14. CHECK FOR ALL THESE DATA CLEANSING ACTIVITIES
Fix rows & columns
Fix missing values
Standardise values
Fix invalid values
Filter data
When we receive a dataset, we find a pattern of things that go wrong. These can be fixed in specific ways.
Here's a workflow / checklist of things to look out for and fix.
After this, check if the data is complete, and sufficient to solve the problem.
15. FIX ROWS AND COLUMNS
Fix rows (examples)
Delete incorrect rows: header rows, footer rows
Delete summary rows: total, subtotal rows
Delete extra rows: column number indicators (1), (2), ...; blank rows
Fix columns (examples)
Add column names if missing: files with missing header row
Rename columns consistently: abbreviations, encoded columns
Delete unnecessary columns: unidentified columns, irrelevant columns
Split columns for more data: split http://host:port/path into [Host, Port, Path]
Merge columns for identifiers: merge Firstname, Lastname into Name; merge State, District into FullDistrict
Align misaligned columns: dataset may have shifted columns
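A few of the row and column fixes above can be sketched in plain Python. This is an illustration under assumptions, not a general-purpose cleaner: the helper names (`fix_rows`, `split_url`, `merge_name`), the sample rows, and the rule that a summary row starts with "Total" or "Subtotal" are all hypothetical.

```python
import re
from urllib.parse import urlsplit

INDICATOR = re.compile(r"\(\d+\)")  # matches cells such as (1), (2), ...

def fix_rows(rows):
    """Drop blank rows, column-number indicator rows and summary rows."""
    cleaned = []
    for row in rows:
        cells = [cell.strip() for cell in row]
        if not any(cells):
            continue  # blank row
        if all(INDICATOR.fullmatch(cell) for cell in cells):
            continue  # (1), (2), ... indicator row
        if cells[0].lower() in {"total", "subtotal"}:
            continue  # summary row (assumed to be labelled in column 1)
        cleaned.append(cells)
    return cleaned

def split_url(url):
    """Split http://host:port/path into (host, port, path)."""
    parts = urlsplit(url)
    return parts.hostname, parts.port, parts.path

def merge_name(first, last):
    """Merge Firstname and Lastname into a single Name."""
    return f"{first} {last}".strip()

raw = [
    ["(1)", "(2)", "(3)"],
    ["Asha", "Rao", "http://example.com:8080/home"],
    ["", "", ""],
    ["Total", "", ""],
]

table = []
for first, last, url in fix_rows(raw):
    host, port, path = split_url(url)
    table.append([merge_name(first, last), host, port, path])
print(table)
```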
16. FIX MISSING VALUES
Fix missing values (examples)
Set values as missing values: treat blanks, "NA", "XX", "999", etc. as missing
Fill missing values with:
  a constant (e.g. zero)
  a column (e.g. created date defaults to updated date)
  a function (e.g. average of rows/columns)
  external data
Remove missing values:
  delete row
  delete column
Fill partial missing values: missing time zone, century, etc.
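The missing-value steps above could look like this in plain Python. It is a sketch under assumptions: the marker set, the helper names and the sample values are hypothetical, and `None` stands in for "missing".

```python
# Markers to treat as missing; the right set depends on the dataset.
MISSING_MARKERS = {"", "NA", "XX", "999"}

def to_missing(value):
    """Return None for blanks and known markers, else the stripped text."""
    text = value.strip()
    return None if text in MISSING_MARKERS else text

def fill_with_constant(values, constant):
    """Fill missing values with a constant (e.g. zero)."""
    return [constant if v is None else v for v in values]

def fill_with_mean(values):
    """Fill missing values with the average of the values present."""
    present = [v for v in values if v is not None]
    mean = sum(present) / len(present)
    return [mean if v is None else v for v in values]

def fill_from_column(target, fallback):
    """Fill missing values from another column, row by row."""
    return [f if t is None else t for t, f in zip(target, fallback)]

raw_scores = ["10", "NA", "20", "999", " "]
scores = [None if to_missing(s) is None else float(s) for s in raw_scores]

zero_filled = fill_with_constant(scores, 0.0)
mean_filled = fill_with_mean(scores)

created = [None, "2016-10-01"]
updated = ["2016-10-05", "2016-10-09"]
created_filled = fill_from_column(created, updated)
print(scores, zero_filled, mean_filled, created_filled)
```

A marker like "999" is only safe to blank out when it cannot be a real value in that column, so the marker set is best chosen per column.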
17. STANDARDISE VALUES
Standardise numbers (examples)
Remove outliers: remove extreme high and low values
Standardise units: lbs to kg, m/s for speed
Scale values if required: fit to percentage scale
Standardise precision: 2.1 to 2.10
Standardise text (examples)
Remove extra characters: common prefix/suffix, leading/trailing/multiple spaces
Standardise case: uppercase, lowercase, Title Case, Sentence case, etc.
Standardise format: 23/10/16 to 2016/10/23; "Modi, Narendra" to "Narendra Modi"
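Several of these standardisation steps are one-liners in Python. The sketch below uses only the standard library. The helper names are hypothetical, and the date helper assumes the input is day/month/two-digit year, which you would need to confirm for your own data.

```python
import re
from datetime import datetime

LB_TO_KG = 0.45359237  # one international pound in kilograms

def lbs_to_kg(pounds):
    """Standardise units: pounds to kilograms."""
    return pounds * LB_TO_KG

def fixed_precision(value, places=2):
    """Standardise precision: 2.1 becomes "2.10"."""
    return f"{value:.{places}f}"

def clean_text(text):
    """Collapse repeated spaces and trim leading/trailing spaces."""
    return re.sub(r"\s+", " ", text).strip()

def standardise_date(text):
    """Assumes day/month/two-digit year, e.g. 23/10/16."""
    return datetime.strptime(text, "%d/%m/%y").strftime("%Y/%m/%d")

def standardise_name(text):
    """Turn "Last, First" into "First Last"; otherwise just tidy spaces."""
    if "," in text:
        last, first = [part.strip() for part in text.split(",", 1)]
        return f"{first} {last}"
    return clean_text(text)

print(fixed_precision(2.1))
print(clean_text("  new   delhi ").title())
print(standardise_date("23/10/16"))
print(standardise_name("Modi, Narendra"))
```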
18. FIX INVALID VALUES
Fix invalid values (examples)
Encode Unicode properly: CP1252 instead of UTF-8
Convert incorrect data types:
  String to number: "12,300"
  String to date: "2013-Aug"
  Number to string: PIN Code 110001 to "110001"
Correct values not in list: non-existent country, PIN code
Correct wrong structure: phone number with over 10 digits
Correct values beyond range: temperature less than -273.15 °C (0 K)
Validate internal rules:
  Gross sales > Net sales
  Date of delivery > Date of ordering
  If Title is "Mr" then Gender is "M"
In these cases, treat the value as "missing". Remove it, or fix it with a formula.
The formula may involve the value, row, column, entire dataset, or external data.
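The type conversions and a few of the checks above can be sketched as follows. The helper names, the record fields and the sample record are hypothetical. The encoding helper assumes one specific failure: UTF-8 bytes that were decoded as CP1252.

```python
from datetime import datetime

ABSOLUTE_ZERO_C = -273.15

def parse_number(text):
    """String to number: "12,300" becomes 12300."""
    return int(text.replace(",", ""))

def parse_year_month(text):
    """String to date: "2013-Aug" becomes the first day of that month.
    %b expects English month abbreviations under the default locale."""
    return datetime.strptime(text, "%Y-%b").date()

def pin_to_string(pin):
    """Number to string: keep PIN codes as six-character text."""
    return f"{pin:06d}"

def repair_mojibake(text):
    """Undo UTF-8 bytes that were wrongly decoded as CP1252."""
    return text.encode("cp1252").decode("utf-8")

def invalid_fields(record):
    """Return the names of fields that break a range, structure or internal rule."""
    bad = []
    if record["temperature_c"] < ABSOLUTE_ZERO_C:
        bad.append("temperature_c")  # value beyond range
    if len(record["phone"]) > 10:
        bad.append("phone")  # wrong structure
    if record["title"] == "Mr" and record["gender"] != "M":
        bad.append("gender")  # internal rule broken
    return bad

def blank_invalid(record):
    """Treat invalid values as missing (None) in a copy of the record."""
    cleaned = dict(record)
    for field in invalid_fields(record):
        cleaned[field] = None
    return cleaned

record = {"temperature_c": -300.0, "phone": "98765432101",
          "title": "Mr", "gender": "M"}
cleaned = blank_invalid(record)
print(parse_number("12,300"), parse_year_month("2013-Aug"), cleaned)
```

Blanking the value is the conservative choice. Fixing it with a formula comes afterwards, using the same techniques as for any other missing value.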
19. FILTER DATA
Filter data (examples)
Deduplicate data:
  remove identical rows
  remove rows where some columns are identical
Filter rows:
  filter by segments
  filter by date period
Filter columns: pick columns relevant to analysis
Aggregate data: group by required keys, aggregate the rest
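The filtering steps above map onto a few small functions. This is a minimal sketch with made-up sales rows. The helper names `dedupe` and `aggregate` are hypothetical, and the aggregate shown is a simple sum.

```python
from collections import defaultdict

def dedupe(rows, keys=None):
    """Keep the first row for each combination of the key columns.
    With keys=None, rows must be identical in every column to count as duplicates."""
    seen = set()
    unique = []
    for row in rows:
        signature = tuple(row[k] for k in (keys or sorted(row)))
        if signature not in seen:
            seen.add(signature)
            unique.append(row)
    return unique

def aggregate(rows, key, value):
    """Group by one key column and sum one value column."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[key]] += row[value]
    return dict(totals)

sales = [
    {"state": "TN", "date": "2016-10-01", "amount": 100},
    {"state": "TN", "date": "2016-10-01", "amount": 100},  # identical row
    {"state": "KA", "date": "2016-10-02", "amount": 50},
    {"state": "TN", "date": "2016-11-05", "amount": 70},
]

unique_sales = dedupe(sales)
# Filter rows by date period (ISO dates compare correctly as text).
october = [r for r in unique_sales if "2016-10-01" <= r["date"] <= "2016-10-31"]
# Filter columns: keep only what the analysis needs.
slim = [{k: r[k] for k in ("state", "amount")} for r in october]
by_state = aggregate(slim, "state", "amount")
print(by_state)
```

Passing `keys=["state"]` to `dedupe` covers the second case on the slide, where rows count as duplicates when only some columns are identical.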
25. … which, with some effort, can be converted into a structured format
… and at this point, we need to start checking for errors.
26. At this point, we start checking what’s gone wrong
Each row here is one constituency. The number of candidates that have contested in each constituency in every year is shown as a table. You can see that some patterns emerge here.
27. Not every spelling error is easily identifiable by the first letter
Parties are mis-spelt: MADMK, MAMAK, MDMK
Party names change: AIADMK, ADMK, ADK
Parties restructure: INC(I), INC
Constituency names mis-spelt: BHADRACHALAM, BHADRACHELAM, BHADRAHCALAM
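Spelling variants like these can be surfaced with approximate string matching. The sketch below uses `difflib.get_close_matches` from Python's standard library to suggest a canonical name. The helper name `suggest_canonical`, the 0.8 cutoff, the choice of BHADRACHALAM as the canonical spelling and the unrelated name CHENNAI are assumptions made for illustration.

```python
import difflib

def suggest_canonical(name, known_names, cutoff=0.8):
    """Return the closest known name, or None if nothing is similar enough."""
    matches = difflib.get_close_matches(name, known_names, n=1, cutoff=cutoff)
    return matches[0] if matches else None

# Assumed list of known-good spellings.
known_constituencies = ["BHADRACHALAM"]

suggestions = {
    name: suggest_canonical(name, known_constituencies)
    for name in ["BHADRACHELAM", "BHADRAHCALAM", "CHENNAI"]
}
print(suggestions)
```

Treat the output as candidates for review, not as automatic corrections. Short party codes that look alike may be different parties, and name changes or restructures such as INC(I) and INC cannot be settled by string similarity alone.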