Slicing and Dicing a Newspaper Corpus
for Historical Ecology Research
Marieke van Erp

Jesse de Does

Katrien Depuydt

Rob Lenders

Thomas van Goethem
Image source: https://kidsbiology.com/wp-content/uploads/2016/10/Martes-americana1141684271.jpg
SERPENS in a Nutshell
• Historical ecologists are starting to use
newspaper corpora for their research

• The abundance of data is both a blessing and a
curse 

• SERPENS aims to make the computer do the
‘boring’ work of filtering relevant articles from
irrelevant ones 

• Historical ecology researchers can then spend
more time on the ‘hard’ analyses

• Partners: 

• Funded by:
Why pest and nuisance species?
• Ambivalent relationship:

• Food, fur, totem

• Diseases, agricultural damage

• Relationships change over time 

• Exotic species, reintroductions, plagues

• Understanding the past helps us to
understand current ecological conditions

• Useful to policy makers, conservation
biologists, etc.
Muskrat Image source: http://www.virtualmuseum.ca/sgc-cms/expositions-exhibitions/faune_urbaine-urban_wildlife/medias/sheets/47.jpg
Why newspapers?
• Which species were considered “pest and
nuisance species”?

• Why were they considered as such?

• How did humans respond? 

Also more tangible information:

• Extermination methods, number of
incidents/sightings, statistics, fur prices
First hurdle: OCR
• The older the source, the harder it is to read 

• OCR errors may result in relevant
documents being missed and irrelevant
documents being retrieved

• We don’t try to ‘fix’ bad OCR but rank
documents by OCR quality through lexicon
overlap
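
As an illustration of ranking by lexicon overlap, here is a minimal sketch. It assumes the quality score is simply the fraction of a document's word tokens that appear in a reference word list; the function names, the tokenisation, and the tiny Dutch lexicon are hypothetical and not taken from the SERPENS code.

```python
import re


def lexicon_overlap(text, lexicon):
    """Return the fraction of alphabetic tokens in `text` that occur in `lexicon`.

    Assumption: tokens are maximal runs of letters, compared in lower case.
    """
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token in lexicon)
    return hits / len(tokens)


def rank_by_ocr_quality(documents, lexicon):
    """Sort documents from highest to lowest lexicon overlap."""
    return sorted(documents, key=lambda doc: lexicon_overlap(doc, lexicon), reverse=True)


# Hypothetical toy lexicon and documents, for illustration only.
lexicon = {"de", "wolf", "is", "gezien", "bij", "het", "dorp"}
clean = "De wolf is gezien bij het dorp"
noisy = "Dc w0lf ls gezicn bij hct dorp"

ranked = rank_by_ocr_quality([noisy, clean], lexicon)
for doc in ranked:
    print(round(lexicon_overlap(doc, lexicon), 2), doc)
```

A score like this does not repair anything; it only lets badly recognised documents sink to the bottom of a result list or be set aside in a "Bad OCR" class.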
Ambiguity
• Wolf: animal 

• Wolf: last name

• Wolf in sheep’s clothing

• …

• Context of the document needed to find the
right meaning
Experimental Setup
SERPENS Categories
• Natural history

• Nuisance, material damage

• Nuisance, immaterial damage

• Pest control

• Hunt for economic reasons

• Prevention 

• Accidents

• Figurative

• Other beast

• No beast

• Bad OCR
Training a new topic classifier
• Manually classified 9,940 documents

• Replace occurrences of animal names from
queries with “—ANIMAL—”

• 10-fold cross-validation

• Various experiments to measure the impact of
settings and dataset size

• Code available at: https://github.com/CLARIAH/serpens/
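
The two preprocessing and evaluation steps above can be sketched as follows. This is not the code from the repository: `mask_animals`, `k_fold_indices`, the placeholder default, and the example animal names are all assumptions made for illustration. Masking the query term keeps a classifier from learning the animal name itself rather than its context; the fold generator shows how 9,940 documents split into ten test sets of 994.

```python
import random
import re


def mask_animals(text, animal_names, placeholder="__ANIMAL__"):
    """Replace whole-word occurrences of any animal name with a placeholder.

    Assumption: case-insensitive whole-word matching; the placeholder string
    is a stand-in for the token shown on the slide.
    """
    # Longest names first so that e.g. "wolven" wins over "wolf".
    names = sorted(animal_names, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(name) for name in names) + r")\b",
        re.IGNORECASE,
    )
    return pattern.sub(placeholder, text)


def k_fold_indices(n_items, k=10, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx


masked = mask_animals("De wolf en de Wolven bij Wolfgang", ["wolf", "wolven"])
print(masked)

splits = list(k_fold_indices(9940, k=10))
print(len(splits), len(splits[0][0]), len(splits[0][1]))
```

Each document lands in exactly one test fold, so every annotated example is used for evaluation once and for training nine times.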
Results for different algorithms
Zooming in (snippets)
Results per class, linear SVM (snippets)
Learning curves
• Total dataset consists of nearly 10,000
annotated examples 

• Learning curves are a measure of
performance vs training set size 

• Results converge rapidly: for the two-class
problem, ~1,000 examples already achieve
90% accuracy
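
To make the idea of a learning curve concrete, the sketch below trains on growing prefixes of a training set and scores each model on a fixed held-out set. Everything in it is an assumption for illustration: the corpus is synthetic, and the classifier is a tiny Naive Bayes stand-in rather than the linear SVM named in the slides, so its numbers say nothing about the SERPENS results.

```python
import math
import random
from collections import Counter, defaultdict


def train_nb(examples):
    """Fit a minimal multinomial Naive Bayes model on (tokens, label) pairs."""
    class_counts = Counter()
    token_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in examples:
        class_counts[label] += 1
        token_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, token_counts, vocab


def predict_nb(model, tokens):
    """Return the most probable label, using add-one smoothing."""
    class_counts, token_counts, vocab = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in class_counts.items():
        denom = sum(token_counts[label].values()) + len(vocab)
        score = math.log(count / total)
        for token in tokens:
            score += math.log((token_counts[label][token] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def learning_curve(train, test, sizes):
    """Return (training size, held-out accuracy) for each requested size."""
    points = []
    for size in sizes:
        model = train_nb(train[:size])
        correct = sum(predict_nb(model, toks) == label for toks, label in test)
        points.append((size, correct / len(test)))
    return points


def make_toy_corpus(n_docs, seed=0):
    """Generate a synthetic two-class corpus (hypothetical vocabulary)."""
    rng = random.Random(seed)
    relevant = ["trap", "damage", "fur", "hunt"]
    other = ["council", "market", "ship", "theatre"]
    shared = ["the", "in", "of", "animal"]
    docs = []
    for _ in range(n_docs):
        label = rng.choice(["relevant", "irrelevant"])
        topical = relevant if label == "relevant" else other
        docs.append((rng.choices(topical, k=3) + rng.choices(shared, k=3), label))
    return docs


corpus = make_toy_corpus(300)
train, test = corpus[:200], corpus[200:]
curve = learning_curve(train, test, [10, 50, 100, 200])
for size, accuracy in curve:
    print(size, round(accuracy, 3))
```

Plotting accuracy against training size shows where additional annotation stops paying off, which is the basis for the ~1,000-example observation above.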
Preliminary analysis
• Public perception of Mustelidae
(European polecat)

• Combination of distant and close
reading approaches

• Newspaper archives not
equally well digitised over time

• Trends in news may affect
reporting on animals
Lessons Learnt & Future Work
• Domain use cases often need specific
solutions 

• Document classification already very useful
to historical ecologists (probably also to
other domain experts)

• 1,000 annotated examples sufficient for
two-class classification 

• Extend to more species 

• Improve classification sub-categories 

• Add sentiment/opinions
Image source: https://upload.wikimedia.org/wikipedia/commons/1/1c/American_mink.jpg Mink
Shameless plug: 3rd Workshop on Humanities in the Semantic Web
image source: https://www.thesun.co.uk/wp-content/uploads/2017/07/nintchdbpict0001286085811.jpg
Questions?
