This document discusses data cleaning techniques. It introduces the idea that clean data is complete, correct, concise, and compatible. It covers incorrect values that cannot be restored without the original data, which can only be removed or marked as missing, and problems that can be fixed, such as making data more concise by removing duplicate information. It demonstrates converting data from wide to long format with melt() to make it more compatible with analysis and visualization. Overall, it provides an overview of assessing and improving data quality through cleaning.
2. 1. Intro to data cleaning
2. What you can’t fix
3. What you can fix
4. Intro to reshape
3. Your turn
Do you think men or women leave a larger
tip when dining out? What data would
you collect to test this belief? What would
prompt you to change your belief?
18. Correct
Can’t restore incorrect values without
original data but can remove clearly
incorrect values
Options:
Remove entire row
Mark incorrect value as missing (NA)
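The two options above might look like this in R. This is a minimal sketch on a made-up data frame: the column names and the rule that a negative tip is clearly incorrect are assumptions for illustration, not part of the tipping data.

```r
# Hypothetical toy data: a negative tip is treated as clearly incorrect
bills <- data.frame(
  customer_ID = 1:4,
  total_bill  = c(16.99, 10.34, 21.01, 23.68),
  tip         = c(1.01, -1.66, 3.50, 3.31)
)

# Logical flag for the clearly incorrect values
bad <- bills$tip < 0

# Option 1: remove the entire row
bills_dropped <- bills[!bad, ]

# Option 2: keep the row but mark the incorrect value as missing (NA)
bills_marked <- bills
bills_marked$tip[bad] <- NA
```

Option 2 keeps the other values in the row available for analysis, at the cost of having to handle NA later.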
19. When two rows present the same
information with different values, at least
one row is wrong.
Whenever there is inconsistency, you are
going to have to make some tradeoff to
ensure concision.
Detecting inconsistency is not always
easy.
Inconsistency = incorrect
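One simple kind of inconsistency can be found mechanically: the same identifier appearing with different values. The sketch below uses base R on a hypothetical data frame (the names records and customer_ID are assumptions). Real inconsistencies are often subtler than this.

```r
# Hypothetical toy data: customer 2 appears twice with different bills
records <- data.frame(
  customer_ID = c(1, 2, 2, 3),
  total_bill  = c(16.99, 10.34, 12.00, 21.01)
)

# Drop exact duplicate rows first; they are redundant, not inconsistent
records_unique <- unique(records)

# Count how many distinct rows remain for each ID
n_per_id <- table(records_unique$customer_ID)

# IDs with more than one distinct row carry conflicting information
inconsistent_ids <- names(n_per_id)[n_per_id > 1]

# Show the conflicting rows so a person can decide which to keep
records_unique[records_unique$customer_ID %in% inconsistent_ids, ]
```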
20. General strategy
To find incorrect values you need to be
creative, combining graphics and data
processing.
21. Tipping data
One waiter recorded information
about each tip he received over a
period of a few months
244 records
Do men or women tip more?
22. Your turn
Subset the tipping data to include only
rows without NA’s. Judge whether you
think all of the data points are correct.
How will you make your decision?
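The subsetting step can be sketched in base R. The data frame tips_toy below is a hypothetical stand-in, since the real tipping data is not reproduced here. Judging correctness is left to the reader.

```r
# Hypothetical stand-in for the tipping data
tips_toy <- data.frame(
  total_bill = c(16.99, NA, 21.01, 23.68),
  tip        = c(1.01, 1.66, NA, 3.31)
)

# Keep only rows with no NA in any column
tips_complete <- tips_toy[complete.cases(tips_toy), ]

# na.omit() selects the same rows
tips_complete2 <- na.omit(tips_toy)

# A quick look at the ranges can help flag implausible values
summary(tips_complete)
```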
27. Compatible
(Data is compatible with your analysis
in both form and fact)
1. Do you have the relevant variables for
your analysis?
28. This often requires some type of calculation.
For example,
proportion = successes / attempts
Avg score per game per team = ?
join(), summarise(), and ddply() from plyr,
along with base R's transform(), address this need
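A minimal sketch of both calculations on made-up data. It uses base R's transform() and aggregate() so it runs without extra packages. aggregate() is a stand-in here for the grouped summary that ddply() would do in plyr, and the games data frame and its columns are hypothetical.

```r
# Hypothetical per-game data
games <- data.frame(
  team      = c("A", "A", "B", "B"),
  successes = c(3, 5, 2, 6),
  attempts  = c(10, 10, 8, 12)
)

# Add a derived variable: proportion = successes / attempts
games <- transform(games, proportion = successes / attempts)

# Grouped summary: average successes per game for each team
avg_by_team <- aggregate(successes ~ team, data = games, FUN = mean)
```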
29. Compatible
(Data is compatible with your analysis
in both form and fact)
2. Is the data in the right form for your
analysis and visualization tools? (reshape)
36. Molten data
We can use melt to put each
variable into its own column.
“Protect” the good columns.
“Melt” the offending columns.
Then subset.
37. Two types of variables
1. ID variables - identify the object that
measurements will take place on (we
know these before the experiment)
2. Measured variables - the features of
the object that will be measured (we have
to do an experiment to observe these)
40. Identifier variable | Measured variable
Index of random variable | Random variable
Dimension | Measure
Experimental design | Measurement
Predictors (Xi) | Response (Y)
41. Molten data
Molten data collapses all the
measured variables into two
columns: 1) the variable being
measured and 2) the value.
Sometimes called “long” form.
To protect a column from being
melted, label it as an id variable.
reshape::melt(data, id)
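To see the molten form on something small, here is a sketch using a hypothetical three-row data frame. The data frame wide and its male/female indicator columns are assumptions for illustration, and the code assumes the reshape package is installed.

```r
library(reshape)  # provides melt()

# Hypothetical wide data: sex is spread across two indicator columns
wide <- data.frame(
  customer_ID = 1:3,
  tip         = c(1.01, 1.66, 3.50),
  male        = c(0, 1, 1),
  female      = c(1, 0, 0)
)

# Protect customer_ID and tip; every other column is melted
long <- melt(wide, id = c("customer_ID", "tip"))

# The melted columns are now one "variable" column and one "value" column
names(long)[3] <- "sex"

# Keep only the rows where the indicator was set, then drop the indicator
long <- subset(long, value == 1, select = -value)
```

After melting there are six rows (three customers times two melted columns). The subset keeps the one row per customer that records the actual sex.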
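42. library(reshape)  # provides melt()
# protect the good columns as id variables; the rest are melted
tips1 <- melt(tips, id =
c("customer_ID", "total_bill", "tip",
"smoker", "non_smoker"))
# assign an appropriate variable name
names(tips1)[6] <- "sex"
# subset out unwanted rows
tips1 <- subset(tips1, value == 1)
# reorder the columns and drop the value column
tips1 <- tips1[ , c(1,2,6,4,5,3)]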
43. Your turn
Use melt to fix the smoking variable. One
column should be enough to record
whether a person smokes or not.
44. Rectangular data are
much easier to work with!
library(ggplot2)  # provides qplot()
qplot(total_bill, tip, data = tips1,
color = sex)
# vs. the original wide data, where no single column holds sex
qplot(total_bill, tip, data = tips,
colour = ?)
56. This work is licensed under the Creative
Commons Attribution-Noncommercial 3.0 United
States License. To view a copy of this license,
visit http://creativecommons.org/licenses/by-nc/3.0/us/
or send a letter to Creative Commons,
171 Second Street, Suite 300, San Francisco,
California, 94105, USA.