2. Get your data's history
• Know the source of the data
• Know how it's used
• Know what all the fields mean
• Know what other stories have
been done with it
3. What is dirty data?
• Missing records
• Incorrect information
• Duplicate information
• No standardization
4. Take your data's temperature
• How many records should you have?
• Double-check totals or counts. Check them against
studies or summary reports.
• Check for duplicates. Make sure they are
real duplicates. Is it possible that there are
hidden duplicates?
• Consistency-check all fields. Are all
city/county names spelled the same? Are
all codes found within documentation?
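The checks above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed method: the sample rows, the field names (`id`, `name`, `city`) and `EXPECTED_COUNT` are all made up, and in practice the rows would come from your own file (for example via `csv.DictReader`).

```python
from collections import Counter

# Hypothetical sample rows standing in for a real dataset.
records = [
    {"id": "1", "name": "Ann Lee", "city": "St. Louis"},
    {"id": "2", "name": "Bob Ray", "city": "Saint Louis"},
    {"id": "3", "name": "ANN LEE ", "city": "st. louis"},
    {"id": "3", "name": "ANN LEE ", "city": "st. louis"},
]
EXPECTED_COUNT = 3  # assumed figure taken from a summary report


def normalize(value):
    """Lowercase and collapse whitespace, for comparison only."""
    return " ".join(value.lower().split())


def exact_duplicates(rows):
    """Return each row that appears more than once, field for field."""
    counts = Counter(tuple(sorted(r.items())) for r in rows)
    return [dict(key) for key, n in counts.items() if n > 1]


def hidden_duplicates(rows, field):
    """Group differing raw spellings that normalize to the same value."""
    groups = {}
    for r in rows:
        groups.setdefault(normalize(r[field]), set()).add(r[field])
    return {k: sorted(v) for k, v in groups.items() if len(v) > 1}


count_matches = len(records) == EXPECTED_COUNT        # record count vs. report
dupes = exact_duplicates(records)                     # true duplicates
name_variants = hidden_duplicates(records, "name")    # possible hidden duplicates
city_spellings = Counter(r["city"] for r in records)  # every spelling and its count
```

Normalizing only catches case and spacing differences. Variants such as "St. Louis" and "Saint Louis" still show up as separate entries in `city_spellings` and need a human decision.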
5. Internal consistency checks
• Is there more money going to sub-contractors than went to
the prime contractor?
• Are there more teachers than students?
• How about other important fields?
• Check the range of fields. (For example, check for DOBs
that would make people too old or too young.)
• Check for missing data or blank fields. Are they real values,
or did something happen with an import or append query?
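A rough sketch of these internal checks, using only the Python standard library. The rows, field names, reference date and age bounds are assumptions chosen for illustration; pick limits that make sense for your own data.

```python
from datetime import date

# Hypothetical rows for illustration only.
contracts = [
    {"contract": "A-1", "prime_amount": 100000.0, "sub_amounts": [20000.0, 30000.0]},
    {"contract": "B-2", "prime_amount": 50000.0, "sub_amounts": [40000.0, 25000.0]},
]
people = [
    {"name": "Ann Lee", "dob": "1980-05-17"},
    {"name": "Bob Ray", "dob": "1880-01-01"},
    {"name": "Cy Dunn", "dob": ""},
]
AS_OF = date(2024, 1, 1)    # assumed reference date for computing ages
MIN_AGE, MAX_AGE = 16, 100  # assumed plausible range


def subs_exceed_prime(rows):
    """Flag contracts where subcontractors total more than the prime."""
    return [r["contract"] for r in rows if sum(r["sub_amounts"]) > r["prime_amount"]]


def age_on(dob, as_of):
    """Whole years between dob and as_of."""
    return as_of.year - dob.year - ((as_of.month, as_of.day) < (dob.month, dob.day))


def check_dobs(rows, as_of):
    """Separate blank DOBs from DOBs that give an implausible age."""
    blank, out_of_range = [], []
    for r in rows:
        raw = r["dob"].strip()
        if not raw:
            blank.append(r["name"])
            continue
        age = age_on(date.fromisoformat(raw), as_of)
        if not MIN_AGE <= age <= MAX_AGE:
            out_of_range.append((r["name"], age))
    return blank, out_of_range


flagged_contracts = subs_exceed_prime(contracts)
blank_dobs, odd_ages = check_dobs(people, AS_OF)
```

A flagged row is a question to ask, not an error to fix automatically: a blank DOB may be a real unknown or the result of a bad import.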
6. External checks
• Compare to reports
• Data reported to other agencies
• On-the-ground reporting
• Verification from sources
7. Steps for cleaning data
• Assess the problem
• Identify your goal
• Find the right tool for the job
• Set aside time (double what you think you'll need)
• Make a backup copy
• Never alter the original data. Make new
columns so you can compare and show
your work.
• Create an audit trail.
• Spot check as you go.
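One way to follow the "new column plus audit trail" advice, sketched in plain Python. The `county` field, the `county_clean` column name and the `COUNTY_FIXES` lookup are hypothetical; the point is only that the raw value stays untouched and every change is logged.

```python
import copy

# Hypothetical original rows; these are never modified.
original = [
    {"id": "1", "county": " cook "},
    {"id": "2", "county": "COOK COUNTY"},
    {"id": "3", "county": "Lake"},
]

# Hypothetical lookup of normalized spellings to the standard form.
COUNTY_FIXES = {"cook county": "Cook", "cook": "Cook", "lake": "Lake"}


def clean_county(rows, fixes):
    """Work on a copy, add a county_clean column, and log every change."""
    working = copy.deepcopy(rows)  # the backup stays as it was
    audit = []
    for row in working:
        raw = row["county"]
        key = " ".join(raw.lower().split())
        cleaned = fixes.get(key, raw)  # unknown values pass through unchanged
        row["county_clean"] = cleaned  # new column next to the raw one
        if cleaned != raw:
            audit.append({"id": row["id"], "field": "county", "old": raw, "new": cleaned})
    return working, audit


cleaned_rows, audit_log = clean_county(original, COUNTY_FIXES)

# Spot check as you go: print a few before/after pairs.
for entry in audit_log[:5]:
    print(entry["id"], repr(entry["old"]), "->", repr(entry["new"]))
```

Keeping `county` and `county_clean` side by side lets you, or an editor, compare them later and see exactly what was changed.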
8. Tips for success
• Keep a data notebook
• Duplicate your work
• Bounce your results off folks who really know
the data
• Set up some standards for your
work/newsroom
9. Choose the right tool
• You don't need to be fancy, just get the job done
• Work with what you're comfortable with
• Don't forget the power of Excel
• Text editors can be lifesavers
• Many tools exist: OpenRefine, programming, etc.
• Get training as needed
19. Inoperable data: Pain management
• Explain caveats
• Choose your wording carefully
• Know when to leave out records
• Be transparent
• Know what questions can and can't be
answered with this dataset
• Know when to get more information
20. Continue learning about dirty data: Sat. 3:40 p.m.
Conference Room 11
BYOD (Bring your own data): Sat. 4:50 p.m.,
Conference Room 11
Get your hands dirty