Big, Ugly Datasets for Thumb-Fingered Journalists
@nclarkjudd, thumb-fingered journalist
We’re swimming in data Open Graph Social Media Data Mining Government Data
It’s not getting easier to use … With exceptions, like TimeFlow
This is where we come in
There’s an increasing need for journalists at all levels to be equipped to acquire and analyze big, ugly datasets.
Without the resources of a New York Times or Washington Post, how do you do that?
What are you doing with data?
  • Exploring: Looking for patterns, following hunches, finding context and background; looking to be surprised
  • Deducing: Proving a hypothesis, pulling specific records; looking for something in particular
Know the right questions to ask
When you’re picking a dataset to use, understand its:
  • Provenance
  • Sampling method
  • Quality
  • Completeness
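Completeness is the easiest of these four to check mechanically. The following is a minimal sketch, assuming the dataset arrives as CSV text; the sample columns and values are invented for illustration and the function name `missing_by_column` is hypothetical.

```python
import csv
import io

# Hypothetical extract; the columns and values are invented for illustration.
SAMPLE = """agency,amount,date
Parks Dept,1200,2011-01-05
Sanitation,,2011-01-06
,300,
"""

def missing_by_column(csv_text):
    """Count empty cells per column and return (row_count, counts)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    counts = {name: 0 for name in reader.fieldnames}
    total = 0
    for row in reader:
        total += 1
        for name in counts:
            # Treat whitespace-only cells as missing too.
            if not (row[name] or "").strip():
                counts[name] += 1
    return total, counts

row_count, missing = missing_by_column(SAMPLE)
print(row_count, missing)
```

A column with many blanks is worth raising with whoever supplied the data before you build a story on it.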
Data Workflow
  • Understand your needs
  • Acquire your data (Download, FOIL, Sources)
  • Clean your data
  • Load it into a Relational Database Management System (RDBMS)
  • Analyze what you’ve got
  • Output relevant segments for visualization
Cleaning Your Data
Use a script or a robust text editor like vi.
It’s difficult. It takes a while. It gets done.
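As one illustration of the script route, here is a small sketch using only Python's standard library. It assumes the raw data is CSV text with stray whitespace, blank lines, and inconsistent headers; the sample data and the function name `clean_rows` are hypothetical, and real datasets will need rules of their own.

```python
import csv
import io

def clean_rows(raw_text):
    """Return a list of dicts with tidy headers and trimmed values."""
    reader = csv.reader(io.StringIO(raw_text))
    # Drop rows that are empty or contain only whitespace.
    rows = [row for row in reader if any(cell.strip() for cell in row)]
    if not rows:
        return []
    # Normalize headers: trim, lowercase, spaces to underscores.
    headers = [h.strip().lower().replace(" ", "_") for h in rows[0]]
    cleaned = []
    for row in rows[1:]:
        values = [cell.strip() for cell in row]
        # Pad short rows so every record has the same columns.
        values += [""] * (len(headers) - len(values))
        cleaned.append(dict(zip(headers, values)))
    return cleaned

# Hypothetical messy extract, invented for illustration.
RAW = """Agency Name , Amount Paid
 Parks Dept ,  1200

Sanitation, 950
"""

records = clean_rows(RAW)
print(records)
```

Keeping the cleaning steps in a script rather than editing by hand means you can rerun them when the source publishes an updated file.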
Load your data
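The slides do not name a particular RDBMS. As a sketch under that caveat, the example below uses SQLite through Python's built-in `sqlite3` module; the table `payments`, its columns, and the sample records are all invented for illustration.

```python
import sqlite3

# Hypothetical cleaned records; table and column names are invented.
RECORDS = [
    ("Parks Dept", 1200.0),
    ("Sanitation", 950.0),
    ("Parks Dept", 300.0),
]

def load_payments(conn, records):
    """Create the table if needed and insert records; return the row count."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS payments "
        "(agency TEXT NOT NULL, amount REAL NOT NULL)"
    )
    # Parameterized inserts avoid quoting problems in messy text fields.
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT INTO payments (agency, amount) VALUES (?, ?)", records
        )
    return conn.execute("SELECT COUNT(*) FROM payments").fetchone()[0]

conn = sqlite3.connect(":memory:")  # swap in a file path to keep the database
loaded = load_payments(conn, RECORDS)
# Confirm that every source record arrived before moving on to analysis.
assert loaded == len(RECORDS)
print(loaded, "rows loaded")
```

Comparing the loaded row count with the source count is a cheap first check that nothing was dropped on the way in.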
Fail and Iterate
Again: It probably won’t work the first time.
It’s difficult. It takes a while. It gets done.
Analyze
  • Check your script. Did I write my query correctly?
  • Write queries multiple ways. Do the numbers add up the same when the RDBMS makes sums and when I do them?
  • Use checksums: Can I compare results from a segment of this data with previously published and vetted results? Are they the same?
  • Consult experts. Ask: Does this mean what I think it means? Do these results make sense?
  • Output smaller segments of your data to another tool, such as Socrata or ManyEyes, to generate graphs, tables, and visualizations.
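The "write queries multiple ways" check can be sketched as follows. This is an illustration only, assuming a SQLite database and an invented `payments` table; it computes the same total three ways and confirms they agree.

```python
import sqlite3

# Hypothetical data; table and column names are invented for illustration.
ROWS = [("Parks Dept", 1200), ("Sanitation", 950), ("Parks Dept", 300)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE payments (agency TEXT, amount INTEGER)")
conn.executemany("INSERT INTO payments VALUES (?, ?)", ROWS)

# Way 1: let the database compute the grand total.
db_total = conn.execute("SELECT SUM(amount) FROM payments").fetchone()[0]

# Way 2: sum per-agency subtotals from a GROUP BY query.
by_agency = dict(
    conn.execute("SELECT agency, SUM(amount) FROM payments GROUP BY agency")
)
grouped_total = sum(by_agency.values())

# Way 3: add up the raw values outside the database.
manual_total = sum(amount for _, amount in ROWS)

totals_agree = db_total == grouped_total == manual_total
print(db_total, grouped_total, manual_total, totals_agree)
```

The same comparison works for the checksum step: replace one of the totals with a previously published, vetted figure for the same segment and see whether your query reproduces it.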
Share
Photo: Britta Bohllinger / Flickr
IRE.org
HacksHackers.com

Assignment
You are an investigative team that does freelance work around the country and are working up a pitch for your next project.
  • Pick a subject matter you want to investigate.
  • Identify a dataset or datasets that will help you formulate your story. For this exercise, only pick one available on the Web already, e.g. through Data.gov.
Plan:
  • What do you need to clean these data?
  • The schema you’ll make to house the dataset(s)
  • What are you doing with this data? Are you using it for exploratory or deductive reasoning?
  • What will your queries look like?
  • Will you join multiple databases together? If so, how are you sure the results will be relevant?
  • How will you express the results of your inquiry?
  • What questions won’t the data answer that you want to address in your project?
  • Who will you turn to as you start looking for these answers?
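One way to approach the join question is to count the records that fail to match before trusting the joined results. The sketch below assumes two invented SQLite tables, `payments` and `agencies`, linked by a hypothetical `agency_code` column; your own datasets will have different keys.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical tables standing in for two datasets that share an agency code.
conn.executescript("""
CREATE TABLE payments (agency_code TEXT, amount INTEGER);
CREATE TABLE agencies (agency_code TEXT PRIMARY KEY, name TEXT);
INSERT INTO payments VALUES ('PRK', 1200), ('SAN', 950), ('XXX', 300);
INSERT INTO agencies VALUES ('PRK', 'Parks Dept'), ('SAN', 'Sanitation');
""")

# A LEFT JOIN keeps every payment, so unmatched rows surface with NULLs
# instead of silently disappearing, as they would with an inner join.
unmatched = conn.execute("""
    SELECT p.agency_code, p.amount
    FROM payments AS p
    LEFT JOIN agencies AS a ON a.agency_code = p.agency_code
    WHERE a.agency_code IS NULL
""").fetchall()

total_rows = conn.execute("SELECT COUNT(*) FROM payments").fetchone()[0]
print(f"{len(unmatched)} of {total_rows} payments have no matching agency:", unmatched)
```

A high share of unmatched rows suggests the two datasets do not use the key the same way, which is a question to take to the people who maintain them.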