Garrett Grolemund
PhD Student / Rice University
Department of Statistics
Data cleaning
1. Intro to data cleaning
2. What you can’t fix
3. What you can fix
4. Intro to reshape
Your turn
Do you think men or women leave a larger
tip when dining out? What data would
you collect to test this belief? What would
prompt you to change your belief?
Data Analysis
Data
Residuals
Model
Compare
Visualize
Transform
10 - 20%
of an analysis
Data Cleaning
Data
Residuals
Model
Compare
Visualize
Transform
Data
cleaning
“Happy families are all alike;
every unhappy family is
unhappy in its own way.”
—Leo Tolstoy
“Clean datasets are all alike;
every messy dataset is
messy in its own way.”
—Hadley Wickham
Clean data is:
Complete
Correct
(factual and internally consistent)
Concise
Compatible
(required variables: observations in rows, one column per
variable)
What you
can’t fix:
Complete
Correct
Correct
Can’t restore incorrect values without
original data but can remove clearly
incorrect values
Options:
Remove entire row
Mark incorrect value as missing (NA)
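The two options can be sketched on a small example. This is only an illustration: the data frame tips_toy and the rule "a negative tip is clearly incorrect" are assumptions made for the sketch, not part of the tipping data.

```r
# Hypothetical toy data: the negative tip in row 2 is clearly incorrect.
tips_toy <- data.frame(
  customer_ID = 1:4,
  total_bill  = c(16.99, 10.34, 21.01, 23.68),
  tip         = c(1.01, -1.66, 3.50, 3.31)
)

# Flag the values we consider clearly incorrect (assumed rule: tip < 0).
bad <- tips_toy$tip < 0

# Option 1: remove the entire row.
tips_dropped <- tips_toy[!bad, ]

# Option 2: keep the row but mark the incorrect value as missing (NA).
tips_marked <- tips_toy
tips_marked$tip[bad] <- NA
```

Option 2 keeps the other values in the row (here total_bill) available for analysis, at the cost of having to handle NA later.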
When two rows present the same
information with different values, at least
one row is wrong.
Whenever there is inconsistency, you are
going to have to make some tradeoff to
ensure concision.
Detecting inconsistency is not always
easy.
Inconsistency = incorrect
General strategy
To find incorrect values you need to be
creative, combining graphics and data
processing.
Tipping data
One waiter recorded information
about each tip he received over a
period of a few months
244 records
Do men or women tip more?
Your turn
Subset the tipping data to include only
rows without NA’s. Judge whether you
think all of the data points are correct.
How will you make your decision?
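library(ggplot2)  # provides qplot()

tips <- read.csv("tipping.csv",
  stringsAsFactors = FALSE)
summary(tips)
# keep only rows where both smoking indicators are recorded
tips <- subset(tips, !is.na(smoker) &
  !is.na(non_smoker))
# look for implausible values in each variable and in their relationship
qplot(tip, data = tips, binwidth = .5)
qplot(total_bill, data = tips, binwidth = 2)
qplot(total_bill, tip, data = tips)
# each row should count as exactly one of male or female
nrow(tips)
sum(tips$male)
sum(tips$female)
# rows whose male and female indicators differ
subset(tips, male != female)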
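One way to turn the internal consistency check into a single rule is sketched below. It assumes that male and female are 0/1 indicator columns, which the slides suggest but do not state; sex_toy is a made-up data frame.

```r
# Hypothetical toy data; male and female are assumed to be 0/1 indicators.
sex_toy <- data.frame(
  customer_ID = 1:5,
  male        = c(1, 0, 1, 0, 1),
  female      = c(0, 1, 1, 0, 0)
)

# A consistent row sets exactly one of the two indicators,
# so any row where the indicators do not sum to 1 is suspect.
inconsistent <- subset(sex_toy, male + female != 1)
inconsistent$customer_ID
```

Customers 3 and 4 are flagged: one is recorded as both male and female, the other as neither.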
What you
can fix:
Concise
(each fact represented once)
Repeating facts:
1. wastes memory
2. creates opportunities for inconsistency
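A rough sketch of how repeated facts might be handled in base R, using a made-up data frame bills_toy: exact repeats can be dropped safely, while IDs that still appear more than once point to conflicting values that need a decision.

```r
# Hypothetical toy data: customer 2 is recorded twice with the same bill,
# customer 3 is recorded twice with different bills.
bills_toy <- data.frame(
  customer_ID = c(1, 2, 2, 3, 3),
  total_bill  = c(16.99, 10.34, 10.34, 21.01, 12.01)
)

# Drop exact repeats so each fact is represented once.
bills_unique <- unique(bills_toy)

# IDs that still occur more than once carry conflicting values.
conflicts <- unique(bills_unique$customer_ID[duplicated(bills_unique$customer_ID)])
```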
Compatible
(Data is compatible with your analysis
in both form and fact)
1. Do you have the relevant variables for
your analysis?
This often requires some type of calculation.
For example,
proportion = successes / attempts
Avg score per game per team = ?
join(), transform(), summarise() and ddply() address
this need (join() and ddply() come from plyr)
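Both calculations can be sketched on made-up data. The data frame games_toy and its column names are assumptions for illustration, and base R's aggregate() stands in here for a plyr call so that the sketch runs without extra packages.

```r
# Hypothetical toy data: two teams, two games each.
games_toy <- data.frame(
  team      = c("A", "A", "B", "B"),
  game      = c(1, 2, 1, 2),
  successes = c(8, 5, 6, 9),
  attempts  = c(10, 10, 12, 12),
  score     = c(80, 70, 65, 95)
)

# Row-wise calculation: add a new variable computed from existing ones.
games_toy <- transform(games_toy, proportion = successes / attempts)

# Group-wise calculation: average score per game for each team.
avg_score <- aggregate(score ~ team, data = games_toy, FUN = mean)
```

With plyr loaded, the group-wise step could presumably be written as ddply(games_toy, "team", summarise, avg = mean(score)).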
Compatible
(Data is compatible with your analysis
in both form and fact)
2. Is the data in the right form for your
analysis and visualization tools? (reshape)
Rectangular
Observations
in rows
Variables
in columns
(1 column per variable)
Your turn
What are the variables in tipping.csv?
How are they arranged in rows and
columns? Can you form the variables into
two groups?
Reshape
install.packages("reshape")
library(reshape)
library(stringr)
head(tips)
Molten data
We can use melt to put each
variable into its own column.
“Protect” the good columns.
“Melt” the offending columns.
Then subset.
Two types of variables
1. ID variables - identify the object that
measurements will take place on (we
know these before the experiment)
2. Measured variables - the features of
the object that will be measured (we have
to do an experiment to observe these)
object
ID Variables
Bruce Wayne
Batman
SSN:
555-89-3000
Measured Var.
Height (6’1’’)
IQ (180)
Age (71)
ID Variables
Gotham City +
male +
Top 1% tax
bracket
Identifier variable       | Measured variable
Index of random variable  | Random variable
Dimension                 | Measure
Experimental design       | Measurement
predictors (Xi)           | response (Y)
Molten data
Molten data collapses all the
measured variables into two
columns: 1) the variable being
measured and 2) the value.
Sometimes called “long” form.
To protect a column from being
melted, label it as an id variable.
reshape::melt(data, id)
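A minimal sketch of what melting does, on a made-up three-row data frame (wide_toy is hypothetical; the reshape package is assumed to be installed):

```r
library(reshape)

# Hypothetical wide data: one ID column and two measured indicator columns.
wide_toy <- data.frame(
  customer_ID = 1:3,
  male        = c(1, 0, 1),
  female      = c(0, 1, 0)
)

# Protect customer_ID as an id variable; every other column is melted
# into a pair of columns named "variable" and "value".
molten <- melt(wide_toy, id.vars = "customer_ID")

names(molten)  # id column, then variable and value
nrow(molten)   # one row per customer per melted column
```

Each original row becomes one row per melted column, which is why the molten form is called "long".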
tips1 <- melt(tips, id =
c("customer_ID", "total_bill", "tip",
"smoker", "non_smoker"))
# assign an appropriate variable name
names(tips1)[6] <- "sex"
# subset out unwanted rows
tips1 <- subset(tips1, value == 1)
tips1 <- tips1[ , c(1,2,6,4,5,3)]
Your turn
Use melt to fix the smoking variable. One
column should be enough to record
whether a person smokes or not.
Rectangular data are
much easier to work with!
qplot(total_bill, tip, data = tips1,
color = sex)
# vs.
qplot(total_bill, tip, data = tips,
colour = ?)
qplot(total_bill, tip, data = tips1, color = sex) +
geom_smooth(method = lm)
Clean data is:
Complete
Correct
(factual and internally consistent)
Concise
Compatible
(required variables: observations in rows, one column per
variable)
Resource
Wickham, H. (2007). Reshaping data with
the reshape package. Journal of
Statistical Software, 21(12).
http://www.jstatsoft.org/v21/i12
Summary
Clean data is:
Rectangular
(observations in rows, one column per variable)
Consistent
Concise
Complete
Correct
Data
Residuals
Model
Compare
Visualize
Transform
ggplot2
plyr
reshape
Data
Residuals
Model
Compare
Visualize
Transform
most statistics
classes
This work is licensed under the Creative
Commons Attribution-Noncommercial 3.0 United
States License. To view a copy of this license,
visit http://creativecommons.org/licenses/by-nc/
3.0/us/ or send a letter to Creative Commons,
171 Second Street, Suite 300, San Francisco,
California, 94105, USA.

18 cleaning