1
DATA WRANGLING
FIND LOAD CLEAN
2
DATA WRANGLING
FIND LOAD CLEAN
WHERE CAN I GET DATA FROM?
Client data isn't easy to get
THERE'S CLIENT DATA, AND THERE'S PUBLIC DATA
3
Public data isn't relevant
“We have internal information. Getting information from outside is our challenge. There’s no way of doing that.”
– Senior Editor, Leading Media Company
INDIA’S RELIGIONS
5
If you search on google.co.in for "how do I convert to", here are the suggestions Google shows
The popularity influences the order.
So there's a good chance that the religions on top are more often searched for.
AUSTRALIA’S RELIGIONS
6
But be careful of how you interpret it.
In Australia, PDF is not a religion. Unless you're a data scientist.
7
USE MULTIPLE APPROACHES TO FIND YOUR DATA
8
1. Public data catalogues
https://github.com/caesar0301/awesome-public-datasets
https://github.com/rasbt/pattern_classification/blob/master/resources/dataset_collections.md
2. Govt data websites
https://data.gov.in/
https://data.gov/
https://data.gov.uk/
https://data.gov.sg/
http://publicdata.eu/
3. Or search on Google
https://www.google.com/
4. Or ask people
Humans™
9
EXERCISE
LET'S FIND SOME DATASETS
(YOU PICK WHAT YOU WANT TO FIND. WE WILL SEARCH FOR IT)
10
DATA WRANGLING
FIND LOAD CLEAN
HOW DO I STORE & PROCESS DATA?
WE LOAD DATA INTO OUR PROGRAMS OR OTHERS'
11
Files
• Delimited text: CSV, TSV, PSV
• Formatted text: TXT, PRN
• Marked up text: HTML, XML, JSON, JSON Lines, YAML, SQL
• Spreadsheets: XLS*, ODS, MDB, ACCDB, DBF
• Specialised formats: HDF5, SQLite, DTA (Stata), C4.5, CDF
• Graph formats: GEXF, GDF, GML, GraphML, GraphViz DOT
• Unstructured: TXT, PDF, Images, Audio, Video, ...
Databases
• In-memory databases: DataFrames
• Relational databases: Oracle, MySQL, PostgreSQL, SQL Server, DB2, Sybase, Informix, ...
• Document databases: MongoDB, CouchDB, Elasticsearch, Firebase
• Distributed databases: HDFS, Spark
• Cloud data stores: BigQuery, DynamoDB, Redshift, Azure SQL Database, DocumentDB, ...
• APIs: Twitter, Facebook, Google, Wikipedia, YouTube, ...
Use CSV when sharing tabular data.
Use JSON for hierarchical data.
Use in-memory, else relational databases.
Don't analyse big data. Shrink it.
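The CSV and JSON advice above can be illustrated with a minimal sketch. The records and field names are made up for illustration, and only the Python standard library is used so the example stays self-contained.

```python
import csv
import io
import json

# Hypothetical tabular data: every record has the same flat fields.
rows = [
    {"state": "State A", "year": 2011, "seats": 140},
    {"state": "State B", "year": 2012, "seats": 40},
]

# CSV suits flat tables: one header line, then one line per record.
buffer = io.StringIO()
writer = csv.DictWriter(
    buffer, fieldnames=["state", "year", "seats"], lineterminator="\n"
)
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()

# CSV carries no types, so every value comes back as a string.
loaded_rows = list(csv.DictReader(io.StringIO(csv_text)))

# Hypothetical hierarchical data: nested lists do not fit a flat table.
tree = {
    "state": "State B",
    "districts": [
        {"name": "District 1", "seats": 23},
        {"name": "District 2", "seats": 17},
    ],
}

# JSON keeps the nesting and the basic types (numbers, strings, lists).
json_text = json.dumps(tree, indent=2)
loaded_tree = json.loads(json_text)
```

Note that the CSV round trip turns 40 into "40", which is one reason type conversion shows up later in the cleaning checklist.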
12
EXERCISE
LET'S LOAD FROM A SITE
THE GOOGLE SEARCH DATA YOU SAW EARLIER
LET'S LOAD A BIG DATASET
A FEW COLUMNS FROM A LEAKED OK CUPID SURVEY
LET'S LOAD AN UNSTRUCTURED TABLE
A TABLE FROM THE MEDICAL CERTIFICATION OF CAUSE OF DEATH 2013 PDF
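One way to "shrink" a big dataset while loading it is to stream the file and keep only the columns needed. The sketch below assumes a CSV input; the column names and the helper `pick_columns` are hypothetical, and a small in-memory string stands in for the large file.

```python
import csv
import io


def pick_columns(lines, wanted):
    """Yield one small dict per row, keeping only the wanted columns."""
    for row in csv.DictReader(lines):
        yield {name: row[name] for name in wanted}


# Stand-in for a large file; in practice pass an open file object instead,
# so that only one row is held in memory at a time.
big_file = io.StringIO(
    "user_id,age,city,q1,q2,q3\n"
    "1,31,Pune,yes,no,yes\n"
    "2,27,Delhi,no,no,yes\n"
    "3,45,Pune,yes,yes,no\n"
)

small = list(pick_columns(big_file, ["age", "city"]))
```

Because `pick_columns` is a generator, rows can also be filtered or aggregated as they stream past instead of being collected into a list.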
13
DATA WRANGLING
FIND LOAD CLEAN
HOW DO I FIX THE DATA ISSUES?
CHECK FOR ALL THESE DATA CLEANSING ACTIVITIES
14
• Fix rows & columns
• Fix missing values
• Standardise values
• Fix invalid values
• Filter data
When we receive a dataset, we find a pattern of things that go wrong. These
can be fixed in specific ways.
Here's a workflow / checklist of things to look out for and fix.
After this, check if the data is complete, and sufficient to solve the problem.
FIX ROWS AND COLUMNS
15
Fix rows
• Delete incorrect rows: header rows, footer rows
• Delete summary rows: total, subtotal rows
• Delete extra rows: column number indicators (1), (2), ...; blank rows
Fix columns
• Add column names if missing: files with missing header row
• Rename columns consistently: abbreviations, encoded columns
• Delete unnecessary columns: unidentified columns, irrelevant columns
• Split columns for more data: split http://host:port/path into [Host, Port, Path]
• Merge columns for identifiers: merge Firstname, Lastname into Name; merge State, District into FullDistrict
• Align misaligned columns: dataset may have shifted columns
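A few of the row and column fixes above can be sketched as follows. The sample table and the helper `is_data_row` are invented for illustration; real files will need their own rules for spotting junk rows.

```python
from urllib.parse import urlsplit

# Hypothetical raw table as it might arrive from a spreadsheet export.
raw = [
    ["Firstname", "Lastname", "URL"],
    ["(1)", "(2)", "(3)"],                          # column number indicators
    ["Asha", "Rao", "http://example.com:8080/a/b"],
    ["", "", ""],                                   # blank row
    ["Ravi", "Iyer", "http://example.org:80/home"],
    ["Total", "", "2"],                             # summary row
]

header, *body = raw


def is_data_row(row):
    """Keep rows that are not blank, not column indicators, not totals."""
    if not any(cell.strip() for cell in row):
        return False
    if all(cell.startswith("(") and cell.endswith(")") for cell in row):
        return False
    return row[0].strip().lower() not in {"total", "subtotal"}


clean = []
for first, last, url in filter(is_data_row, body):
    parts = urlsplit(url)
    clean.append({
        "Name": f"{first} {last}",   # merge columns into an identifier
        "Host": parts.hostname,      # split one column into three
        "Port": parts.port,
        "Path": parts.path,
    })
```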
FIX MISSING VALUES
16
Fix missing values
• Set values as missing values: treat blanks, "NA", "XX", "999", etc. as missing
• Fill missing values with: a constant (e.g. zero); another column (e.g. created date defaults to updated date); a function (e.g. average of rows/columns); external data
• Remove missing values: delete row, delete column
• Fill partial missing values: missing time zone, century, etc.
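The missing-value steps above can be sketched in plain Python. The marker set follows the slide; the sample values and the helper `to_number` are assumptions made for the example.

```python
from statistics import mean

# Markers that should be treated as "missing" rather than as real values.
MISSING_MARKERS = {"", "NA", "XX", "999"}


def to_number(text):
    """Return a float, or None when the cell holds a missing-value marker."""
    text = text.strip()
    return None if text in MISSING_MARKERS else float(text)


ages_raw = ["34", "NA", "29", "999", " ", "41"]
ages = [to_number(a) for a in ages_raw]

# Option 1: fill with a function of the column (mean of the known values).
known = [a for a in ages if a is not None]
fill = mean(known)
filled = [fill if a is None else a for a in ages]

# Option 2: remove the missing values instead of filling them.
dropped = known

# Option 3: fill from another column (created date defaults to updated date).
records = [
    {"created": None, "updated": "2016-10-23"},
    {"created": "2016-01-05", "updated": "2016-02-01"},
]
for record in records:
    if record["created"] is None:
        record["created"] = record["updated"]
```

Which option is appropriate depends on the analysis: filling with a mean hides the gap, while dropping rows shrinks the sample.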
STANDARDISE VALUES
17
Standardise numbers
• Remove outliers: remove extreme high and low values
• Standardise units: lbs to kg, m/s for speed
• Scale values if required: fit to percentage scale
• Standardise precision: 2.1 to 2.10
Standardise text
• Remove extra characters: common prefix/suffix, leading/trailing/multiple spaces
• Standardise case: uppercase, lowercase, Title Case, Sentence case, etc.
• Standardise format: 23/10/16 to 2016/10/23; "Modi, Narendra" to "Narendra Modi"
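Several of the standardisation steps above fit in a few small functions. This is a sketch under stated assumptions: the function names are hypothetical, and the date helper assumes day-first input with a two-digit year.

```python
import re
from datetime import datetime


def clean_text(text):
    """Trim the ends and collapse runs of whitespace into one space."""
    return re.sub(r"\s+", " ", text).strip()


def flip_name(name):
    """Turn 'Last, First' into 'First Last'; leave other names alone."""
    if "," in name:
        last, first = (part.strip() for part in name.split(",", 1))
        return f"{first} {last}"
    return name


def to_year_first(date_text):
    """Convert DD/MM/YY to YYYY/MM/DD (assumes day-first input)."""
    return datetime.strptime(date_text, "%d/%m/%y").strftime("%Y/%m/%d")


def lbs_to_kg(pounds):
    """Convert pounds to kilograms, rounded to two decimals."""
    return round(pounds * 0.45359237, 2)


name = flip_name(clean_text("  Modi,   Narendra "))
date = to_year_first("23/10/16")
weight = lbs_to_kg(150)
precise = f"{2.1:.2f}"            # standardise precision to two decimals
title = "narendra MODI".title()   # standardise case
```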
FIX INVALID VALUES
18
Fix invalid values
• Encode unicode properly: e.g. text read as CP1252 instead of UTF-8
• Convert incorrect data types: string to number ("12,300"); string to date ("2013-Aug"); number to string (PIN code 110001 to "110001")
• Correct values not in list: non-existent country, PIN code
• Correct wrong structure: phone number with over 10 digits
• Correct values beyond range: temperature less than -273.15 °C (0 K)
• Validate internal rules: gross sales > net sales; date of delivery > date of ordering; if Title is "Mr" then Gender is "M"
In these cases, treat the value as "missing". Remove it, or fix it with a formula.
The formula may involve the value, row, column, entire dataset, or external data.
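The checks above can be sketched as small converters and a rule validator. The function names and the sample row are hypothetical. The month parsing relies on `%b`, so it assumes an English locale, and the number parser assumes commas are thousands separators.

```python
from datetime import date, datetime


def parse_number(text):
    """'12,300' -> 12300 (assumes commas are thousands separators)."""
    return int(text.replace(",", ""))


def parse_year_month(text):
    """'2013-Aug' -> the first day of that month."""
    return datetime.strptime(text, "%Y-%b").date()


def fix_mojibake(text):
    """Repair UTF-8 text that was wrongly decoded as CP1252."""
    return text.encode("cp1252").decode("utf-8")


def check_row(row):
    """Return the fields that break a rule, so they can be set to missing."""
    problems = []
    if row["temp_c"] < -273.15:               # below absolute zero
        problems.append("temp_c")
    if row["gross_sales"] < row["net_sales"]:  # gross should not be below net
        problems.append("gross_sales")
    if row["delivered"] < row["ordered"]:      # delivery before ordering
        problems.append("delivered")
    return problems


pin = str(110001)  # PIN codes are identifiers, so keep them as strings

# "caf\u00c3\u00a9" is what 'café' looks like after a wrong CP1252 decode.
repaired = fix_mojibake("caf\u00c3\u00a9")

bad_row = {
    "temp_c": -300.0,
    "gross_sales": 120,
    "net_sales": 100,
    "ordered": date(2016, 10, 23),
    "delivered": date(2016, 10, 20),
}
problems = check_row(bad_row)
```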
FILTER DATA
19
Filter data
• Deduplicate data: remove identical rows; remove rows where some columns are identical
• Filter rows: filter by segments; filter by date period
• Filter columns: pick columns relevant to analysis
• Aggregate data: group by required keys, aggregate the rest
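Deduplication, filtering and aggregation can be sketched as below. The rows, column names and the helper `dedupe` are made up for illustration.

```python
from collections import defaultdict

# Hypothetical rows; the second one is an exact duplicate of the first.
rows = [
    {"state": "A", "year": 2011, "party": "X", "votes": 100},
    {"state": "A", "year": 2011, "party": "X", "votes": 100},
    {"state": "A", "year": 2011, "party": "Y", "votes": 80},
    {"state": "B", "year": 2006, "party": "X", "votes": 50},
    {"state": "B", "year": 2011, "party": "X", "votes": 70},
]


def dedupe(rows, keys):
    """Keep the first row seen for each combination of the key columns."""
    seen = set()
    unique = []
    for row in rows:
        marker = tuple(row[k] for k in keys)
        if marker not in seen:
            seen.add(marker)
            unique.append(row)
    return unique


# Passing all columns removes identical rows; passing a subset removes
# rows where only some columns are identical.
unique = dedupe(rows, ["state", "year", "party", "votes"])

# Filter rows by date period.
recent = [r for r in unique if r["year"] >= 2011]

# Aggregate: group by (state, year) and sum the votes.
totals = defaultdict(int)
for r in recent:
    totals[(r["state"], r["year"])] += r["votes"]
```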
20
EXERCISE
ASSEMBLY ELECTION DATA
SOMETHING WE DID A FEW YEARS AGO, AND IS WELL DOCUMENTED
The ECI website has this data.
21
… and, most of the data is in PDFs
22
The PDF files have a reasonably clear structure
23
… that translates into text that can be parsed
24
… which, with some effort, can be converted into a structured format
… and at this point, we need to start checking for errors.
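The step from extracted PDF text to structured records can be sketched with regular expressions. The line layout below is entirely hypothetical (the actual ECI layout is not shown here), as are the pattern names and sample values. Lines that match no pattern are collected so they can be checked by hand.

```python
import re

# Assumed layout: a constituency heading, then one line per candidate.
SEAT = re.compile(r"^CONSTITUENCY:\s*(\d+)\s+(.+)$")
CANDIDATE = re.compile(r"^(\d+)\s+(.+?)\s+([A-Z()]+)\s+(\d+)$")

# Hypothetical text as it might come out of a PDF-to-text tool.
text = """
CONSTITUENCY: 1 BHADRACHALAM
1 CANDIDATE ONE INC 45210
2 CANDIDATE TWO TDP 39877
Page 1 of 200
CONSTITUENCY: 2 KHAMMAM
1 CANDIDATE THREE INC 51002
"""


def parse(text):
    """Return (records, unparsed_lines) from the extracted text."""
    records, unparsed, seat = [], [], None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        heading = SEAT.match(line)
        if heading:
            seat = heading.group(2)
            continue
        row = CANDIDATE.match(line)
        if row and seat is not None:
            records.append({
                "constituency": seat,
                "name": row.group(2),
                "party": row.group(3),
                "votes": int(row.group(4)),
            })
        else:
            unparsed.append(line)   # keep for manual error checking
    return records, unparsed


records, unparsed = parse(text)
```

Keeping the unparsed lines is a deliberate choice: they are the first place to look when checking what has gone wrong.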
25
At this point, we start checking what’s gone wrong
Each row here is one constituency. The number of candidates that have contested in each constituency in every year is shown as a table. You can see that some patterns emerge here.
26
Not every spelling error is easily identifiable by the first letter
Parties are mis-spelt: MADMK, MAMAK, MDMK
Party names change: AIADMK, ADMK, ADK
Parties restructure: INC(I), INC
Constituency names mis-spelt: BHADRACHALAM, BHADRACHELAM, BHADRAHCALAM
27
Fortunately, large scale data itself can provide a solution
28
… with modern tools that support machine learning
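One simple way to let the data suggest its own corrections is to trust frequent spellings and map rare ones to the closest frequent spelling. The sketch below is not the machine learning tooling the slides refer to; it substitutes plain string similarity from Python's `difflib`. The counts, thresholds and the helper `build_corrections` are assumptions for illustration.

```python
from collections import Counter
from difflib import get_close_matches

# Hypothetical observations: correct spellings occur often, typos rarely.
observed = (
    ["BHADRACHALAM"] * 40
    + ["BHADRACHELAM"] * 2
    + ["BHADRAHCALAM"]
    + ["KHAMMAM"] * 35
)
counts = Counter(observed)


def build_corrections(counts, min_count=10, cutoff=0.8):
    """Map each rare name to the most similar frequent name, if any."""
    trusted = [name for name, n in counts.items() if n >= min_count]
    fixes = {}
    for name, n in counts.items():
        if n >= min_count:
            continue
        match = get_close_matches(name, trusted, n=1, cutoff=cutoff)
        if match:
            fixes[name] = match[0]
    return fixes


fixes = build_corrections(counts)
cleaned = [fixes.get(name, name) for name in observed]
```

Short codes such as party abbreviations need more care, since distinct parties can differ by a single letter; suggested corrections of that kind are best reviewed by a person before they are applied.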
29
30
DATA WRANGLING
FIND LOAD CLEAN
