Reporterslab.org

Presentation for computational journalism students
February 2012
STRUCTURED DATA
... and most reporters’ inability to deal with it
New York Times reporters used Word searches and
annotations to analyze Wikileaks documents in 2010
and 2011.
PANDA project trying to help gather data inside newsrooms
Barriers to structured data analysis in the newsroom
• Expensive
• Too hard to collect
• It takes practice
• It takes patience
• Once collected, data has a short shelf life: its
  value inside the newsroom effectively ends
  once a story is published
Web-scraping software: ephemeral or too expensive for a task not viewed as mission-critical.
Solutions
• User-friendly tool for scraping websites for
  structured data
• Packages of algorithms from fraud and other
  forensic fields for use with public records
  datasets online
• Packages of queries and statistical tests for
  money, dates, geographical identifiers, names
  and codes, presented in standard English
• Tools for fuzzy matching of datasets, including
  scoring, best-match likelihood and interactive
  machine learning for different datasets
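
The fuzzy-matching item can be illustrated with a minimal sketch. It assumes plain string similarity from Python's standard `difflib` module as the scoring method; the slides do not name a technique. The helper names (`normalize`, `score`, `best_match`), the 0.8 threshold and the sample records are all hypothetical.

```python
from difflib import SequenceMatcher


def normalize(name):
    """Lowercase, replace punctuation with spaces and sort tokens so word order does not matter."""
    cleaned = "".join(ch if ch.isalnum() or ch.isspace() else " " for ch in name.lower())
    return " ".join(sorted(cleaned.split()))


def score(a, b):
    """Similarity between 0.0 and 1.0, computed on the normalized strings."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def best_match(name, candidates, threshold=0.8):
    """Return (candidate, score) for the closest candidate, or (None, score) if below threshold."""
    top_score, top = max((score(name, c), c) for c in candidates)
    return (top, top_score) if top_score >= threshold else (None, top_score)


# Hypothetical records: names from a donor list matched against a contractor list.
donors = ["Smith, John A.", "ACME Holdings LLC"]
contractors = ["John A Smith", "Acme Holdings, L.L.C.", "Jane Doe"]

for donor in donors:
    match, s = best_match(donor, contractors)
    print(f"{donor!r} -> {match!r} (score {s:.2f})")
```

Returning the score alongside the match lets a reporter review borderline pairs by hand rather than trusting the cutoff.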
TOO MUCH MATERIAL
With too little information
Too many sources with too little news

• Twitter, Facebook, LinkedIn and other social media
• RSS feeds from other news organizations and blogs
• Press releases from government agencies or beat
  subjects

Lack of archiving is just as troubling as the lack of
structure. Reporters can’t hold the powerful
accountable without information from the past.
Solutions
• Archive users’ feeds locally or in the cloud
• Mash up social media and RSS feeds into an app
  that reveals more insight into the sources
• Formalize each reporter’s definition of “news”
  through machine learning
• Alerts for important changes in source material,
  for example a changed time for a press conference
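
The archiving and alert ideas can be combined in one small sketch, assuming an RSS 2.0 feed and a local SQLite file as the archive. Only the Python standard library is used. The table layout, the function names and the sample feed are hypothetical and not taken from any newsroom tool.

```python
import hashlib
import sqlite3
import xml.etree.ElementTree as ET


def open_archive(path=":memory:"):
    """Create (or open) a local archive that keeps every version of every item."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS items ("
        "guid TEXT, digest TEXT, title TEXT, body TEXT, "
        "seen TEXT DEFAULT CURRENT_TIMESTAMP, "
        "PRIMARY KEY (guid, digest))"
    )
    return db


def archive_feed(db, rss_xml):
    """Store new or changed items from an RSS 2.0 string and return alert messages."""
    alerts = []
    for item in ET.fromstring(rss_xml).iter("item"):
        guid = item.findtext("guid") or item.findtext("link") or ""
        title = item.findtext("title") or ""
        body = item.findtext("description") or ""
        digest = hashlib.sha256((title + "\n" + body).encode("utf-8")).hexdigest()
        known = {row[0] for row in db.execute(
            "SELECT digest FROM items WHERE guid = ?", (guid,))}
        if digest in known:
            continue  # this exact version is already archived
        db.execute(
            "INSERT INTO items (guid, digest, title, body) VALUES (?, ?, ?, ?)",
            (guid, digest, title, body),
        )
        alerts.append(("CHANGED: " if known else "NEW: ") + title)
    db.commit()
    return alerts


# Hypothetical feed: the same press conference, before and after a time change.
FEED_V1 = """<rss version="2.0"><channel><item>
<guid>mayor-presser-1</guid><title>Mayor press conference</title>
<description>Starts at 10:00 a.m. at City Hall.</description>
</item></channel></rss>"""
FEED_V2 = FEED_V1.replace("10:00 a.m.", "2:30 p.m.")

db = open_archive()
first = archive_feed(db, FEED_V1)   # item is new
repeat = archive_feed(db, FEED_V1)  # nothing changed
changed = archive_feed(db, FEED_V2) # same guid, different text
print(first, repeat, changed)
```

Because old versions are never overwritten, the archive also preserves what a source said before it was edited.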
UNUSABLE RECORDS
The buried treasure
Solutions
• Visual extractor of data from scanned forms.
• Separate scanned boxes of documents into
  their pieces for further analysis
• Use speech recognition tools on government
  audio and video
• OCR video to find the speaker at a hearing
ANTIQUATED METHODS
For unstructured data
Our way
• Hand-enter individual items into spreadsheets
• Transcribe interviews, hearings and other audio and video
  content for searching
• Read each document

A newer way
• Leverage web scraping and paid crowdsourcing for data
  entry (MT)
• Use speech recognition for the first pass on searchable
  audio and video
• Use clustering, information extraction and other
  methods for overview of documents
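
As a rough illustration of clustering for an overview of documents, the sketch below groups texts by TF-IDF cosine similarity with a greedy single pass. This is a deliberately simple stand-in chosen for brevity, not a method named in the slides. The function names, the 0.2 threshold and the sample headlines are hypothetical.

```python
import math
import re
from collections import Counter


def tfidf_vectors(docs):
    """Turn each document into a {word: weight} dict using smoothed TF-IDF."""
    tokenized = [re.findall(r"[a-z]+", d.lower()) for d in docs]
    doc_freq = Counter(word for words in tokenized for word in set(words))
    n = len(docs)
    vectors = []
    for words in tokenized:
        counts = Counter(words)
        vectors.append({
            w: c * (math.log((1 + n) / (1 + doc_freq[w])) + 1.0)
            for w, c in counts.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse {word: weight} vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(sum(w * w for w in b.values()))
    return dot / norm if norm else 0.0


def cluster(docs, threshold=0.2):
    """Greedy single pass: join the first cluster whose seed document is similar enough."""
    vectors = tfidf_vectors(docs)
    clusters = []  # each cluster is a list of document indexes; the first is the seed
    for i, vec in enumerate(vectors):
        for members in clusters:
            if cosine(vec, vectors[members[0]]) >= threshold:
                members.append(i)
                break
        else:
            clusters.append([i])
    return clusters


# Hypothetical one-line documents standing in for a pile of records.
docs = [
    "Council approves road repair contract for downtown bridge",
    "School board votes on teacher salary increase",
    "Bridge repair contract awarded after council vote",
    "Teacher union welcomes salary increase from school board",
]
groups = cluster(docs)
print(groups)
```

The result depends on the threshold and on document order, so groupings like these are a starting point for reading, not a finding.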
Reporterslab.org working to tame audio and video
Associated Press project to bring order to unstructured data
WordSeer for historical text
Jigsaw
REPORTERSLAB.ORG
Creating sample data and documents for researchers based on real stories

Computational journalism projects
