Diagnosing
Dirty Data
Jaimi Dowdell, IRE/NICAR
Jennifer LaFleur, ProPublica
Get your data's history
• Know the source of the data
• Know how it's used
• Know what all the fields mean
• Know what other stories have
been done with it
What is dirty data?
• Missing records
• Incorrect information
• Duplicate information
• No standardization
Take your data's
temperature
• How many records should you have?
• Double-check totals or counts. Check for
studies/summary reports.
• Check for duplicates. Make sure they are
real duplicates. Is it possible that there are
hidden duplicates?
• Consistency-check all fields. Are all
city/county names spelled the same? Are
all codes found within documentation?
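
The duplicate and spelling checks above can be sketched in a few lines of Python. This is a minimal illustration, not a method from the presenters: the field names (`name`, `city`), the sample rows, and the normalization rule (ignore case and extra whitespace) are all assumptions chosen for the example.

```python
from collections import Counter

def normalize(value):
    """Collapse case and whitespace so near-identical spellings compare equal."""
    return " ".join(value.split()).casefold()

def spelling_variants(values):
    """Group raw values by normalized form; keep groups with more than one raw spelling."""
    groups = {}
    for v in values:
        groups.setdefault(normalize(v), Counter())[v] += 1
    return {key: dict(raw) for key, raw in groups.items() if len(raw) > 1}

def duplicate_keys(rows, key_fields):
    """Count rows that share the same normalized key; return keys seen more than once."""
    counts = Counter(tuple(normalize(r[f]) for f in key_fields) for r in rows)
    return {key: n for key, n in counts.items() if n > 1}

# Hypothetical sample data
rows = [
    {"name": "Ann Lee", "city": "St. Louis"},
    {"name": "ann lee ", "city": "St. Louis"},
    {"name": "Bob Ray", "city": "st. louis"},
]
variants = spelling_variants(r["city"] for r in rows)
hidden = duplicate_keys(rows, ["name", "city"])
print(variants)
print(hidden)
```

A match here is only a candidate. Two rows with the same normalized name and city may still be different people, so each flagged pair needs a human look before anything is merged or dropped.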
Internal consistency
checks
• Is there more money going to sub-contractors than went to
the prime contractor?
• Are there more teachers than students?
• How about other important fields?
• Check the range of fields. (For example, check for DOBs
that would make people too old or too young.)
• Check for missing data or blank fields. Are they real values,
or did something happen with an import or append query?
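
As a rough sketch, the internal checks above might look like this in Python. The age limits (0 to 110), the reference date, and the field names are assumptions for illustration; the right thresholds depend on the dataset and its documentation.

```python
from datetime import date

def age_on(dob, as_of):
    """Whole years between a date of birth and a reference date."""
    years = as_of.year - dob.year
    if (as_of.month, as_of.day) < (dob.month, dob.day):
        years -= 1
    return years

def dob_out_of_range(dob, as_of, min_age=0, max_age=110):
    """True when the implied age is negative or implausibly high (limits are assumed)."""
    age = age_on(dob, as_of)
    return age < min_age or age > max_age

def blank_fields(record):
    """Names of fields that are None, empty, or whitespace only."""
    return [k for k, v in record.items() if v is None or str(v).strip() == ""]

def part_exceeds_whole(record, part, whole):
    """True when a value that should be the smaller one is larger (e.g. teachers vs. students)."""
    return record[part] > record[whole]

as_of = date(2013, 6, 22)  # hypothetical reference date
print(dob_out_of_range(date(1850, 1, 1), as_of))
print(blank_fields({"name": "Ann", "city": "  ", "zip": None}))
print(part_exceeds_whole({"teachers": 40, "students": 12}, "teachers", "students"))
```

A flag is a prompt to ask questions, not proof of an error: a blank may be a real value, and an unusual ratio may turn out to be accurate.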
External Checks
• Compare to reports
• Data reported to other agencies
• On the ground reporting
• Verification from sources
Steps for cleaning data
• Assess the problem
• Identify your goal
• Find the right tool for the job
• Set aside time (double what you think)
• Make a backup copy
• Make a backup copy
• Never alter the original data. Make new
columns so you can compare and show
your work.
• Create an audit trail.
• Spot check as you go.
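
One way to follow the "never alter the original" and "audit trail" steps in code is to write cleaned values to a new column and log every change. The sketch below is only an illustration under assumed names: the `county` field, the `_clean` suffix, and the cleaning rule are hypothetical.

```python
import csv
import io

def clean_with_audit(rows, field, cleaner):
    """Return copies of rows with a new '<field>_clean' column, plus a log of changed values."""
    new_field = f"{field}_clean"
    cleaned, audit = [], []
    for i, row in enumerate(rows, start=1):
        new_row = dict(row)  # copy, so the original rows are left untouched
        new_row[new_field] = cleaner(row[field])
        if new_row[new_field] != row[field]:
            audit.append({"row": i, "field": field,
                          "before": row[field], "after": new_row[new_field]})
        cleaned.append(new_row)
    return cleaned, audit

# Hypothetical sample file held in memory
raw = io.StringIO("id,county\n1, Boone \n2,BOONE\n3,Cole\n")
original = list(csv.DictReader(raw))

cleaned, audit = clean_with_audit(original, "county", lambda s: s.strip().title())
for entry in audit:
    print(entry)
```

Keeping the original column next to the cleaned one makes spot checks easy, and the audit list can be saved alongside the data notebook.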
Tips for success
• Keep a data notebook
• Duplicate your work
• Duplicate your work
• Bounce your results off folks who really know
the data
• Set up some standards for your
work/newsroom
Choose the right
tool
• You don't need to be fancy, just get the job done
• Work with what you're comfortable with
• Don't forget the power of Excel
• Text editors can be lifesavers
• Many tools exist - OpenRefine, programming, etc.
• Get training as needed
Focus is important
So get plenty
of food and rest
Get a data
buddy
Common ailments
Dates that aren't dates
Names, names, names...
Location matters
Leading and trailing spaces
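
Two of these ailments, dates stored as text and stray spaces, can be detected with a short Python sketch. The list of formats is an assumption and should be replaced with whatever the data's documentation says; note that a value like 03/04/2013 is read here as month first, which may be wrong for some sources.

```python
from datetime import datetime

# Assumed formats; adjust to match the dataset's documentation
CANDIDATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%m/%d/%y")

def parse_date(text):
    """Return a date if the text matches one of the assumed formats, else None."""
    stripped = text.strip()
    for fmt in CANDIDATE_FORMATS:
        try:
            return datetime.strptime(stripped, fmt).date()
        except ValueError:
            continue
    return None

def has_padding(text):
    """True when a value carries leading or trailing whitespace."""
    return text != text.strip()

samples = [" 06/22/2013 ", "2013-02-30", "6/22/13", "N/A"]
for s in samples:
    print(repr(s), parse_date(s), has_padding(s))
```

Values that come back as `None` are the ones to inspect by hand: they may be impossible dates, placeholders such as "N/A", or a format not yet on the list.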
"Pretty" reports
Inoperable data: Pain management
• Explain caveats
• Choose your wording carefully
• Know when to leave out records
• Be transparent
• Know what questions can and can't be
answered with this dataset
• Know when to get more information
Continue learning about dirty data: Sat. 3:40 p.m.
Conference Room 11
BYOD (Bring your own data): Sat. 4:50 p.m.,
Conference Room 11
Get your hands dirty
Jennifer.lafleur@propublica.org (@j_la28)
jaimi@ire.org (@jaimidowdell)
Questions?
