Presented at EKAW 2018
Historical newspapers are a novel source of information for historical ecologists studying the interactions between humans and animals through time and space. Newspaper archives are particularly interesting to analyse because of their breadth and depth. However, the size and occasional noisiness of such archives also bring difficulties, as manual analysis is impossible. In this paper, we present experiments and results on automatic query expansion and categorisation for the perception of animal species between 1800 and 1940. For query expansion and to support the manual annotation process, we used lexicons. For the categorisation, we trained a Support Vector Machine model. Our results indicate that we can distinguish newspaper articles that are about animal species from those that are not with an F1 of 0.92, and subcategorise the different types of newspaper articles on animals with an F1 of up to 0.84.
Slicing and Dicing a Newspaper Corpus for Historical Ecology Research
1. Slicing and Dicing a Newspaper Corpus for Historical Ecology Research
Marieke van Erp
Jesse de Does
Katrien Depuydt
Rob Lenders
Thomas van Goethem
Image source: https://kidsbiology.com/wp-content/uploads/2016/10/Martes-americana1141684271.jpg
2. SERPENS in a Nutshell
• Historical ecologists are starting to use
newspaper corpora for their research
• The abundance of data is both a blessing and a
curse
• SERPENS aims to make the computer do the
‘boring’ work of filtering relevant articles from
irrelevant ones
• Historical ecology researchers can then spend
more time on the ‘hard’ analyses
3. Why pest and nuisance species?
• Ambivalent relationship:
• Food, fur, totem
• Diseases, agricultural damages
• Relationships change over time
• Exotic species, reintroductions, plagues
• Understanding the past helps us to
understand current ecological conditions
• Useful to policy makers, conservation
biologists, etc.
Muskrat Image source: http://www.virtualmuseum.ca/sgc-cms/expositions-exhibitions/faune_urbaine-urban_wildlife/medias/sheets/47.jpg
4. Why newspapers?
• Which species were considered “pest and
nuisance species”?
• Why were they considered as such?
• How did humans respond?
Also more tangible information:
• Extermination methods, number of
incidents/sightings, statistics, fur prices
5. First hurdle: OCR
• The older the source, the harder it is to read
• OCR errors may result in relevant
documents being missed and irrelevant
documents being retrieved
• We don’t try to ‘fix’ bad OCR but rank
documents by OCR quality through lexicon
overlap
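The lexicon-overlap idea above can be sketched as follows. This is a minimal illustration, not the SERPENS implementation: the function names, the tokenisation rule, and the toy lexicon and documents are all assumptions made for the example. Each document is scored by the fraction of its word tokens found in a lexicon, and documents are sorted by that score.

```python
import re


def lexicon_overlap(text, lexicon):
    """Return the fraction of word tokens in `text` that occur in `lexicon`.

    Tokens are runs of letters (an assumption of this sketch); digits and
    punctuation split tokens, so OCR noise such as "w0lf" yields non-words.
    """
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token in lexicon)
    return hits / len(tokens)


def rank_by_ocr_quality(docs, lexicon):
    """Return (doc_id, score) pairs, highest lexicon overlap first."""
    scored = [(doc_id, lexicon_overlap(text, lexicon))
              for doc_id, text in docs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy data: a tiny lexicon and two versions of one sentence.
lexicon = {"the", "wolf", "was", "seen", "in", "forest"}
docs = {
    "noisy": "Tbe w0lf was secn in tbe forest",
    "clean": "The wolf was seen in the forest",
}
ranking = rank_by_ocr_quality(docs, lexicon)
```

In this toy run the clean document ranks first. A threshold on the score could then route low-scoring documents to a "Bad OCR" bucket instead of discarding them.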
6. Ambiguity
• Wolf: animal
• Wolf: last name
• Wolf in sheep’s clothing: figure of speech
• …
• Context of the document needed to find the
right meaning
8. SERPENS Categories
• Natural history
• Nuisance, material damage
• Nuisance, immaterial damage
• Pest control
• Hunt for economic reasons
• Prevention
• Accidents
• Figurative
• Other beast
• No beast
• Bad OCR
9. Training a new topic classifier
• Manually classified 9,940 documents
• Replace occurrences of animal names from
queries with “—ANIMAL—”
• 10-fold cross-validation
• Various experiments to measure the impact of
settings and dataset size
• Code available at: https://github.com/CLARIAH/serpens/
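The name-masking step can be sketched as below. This is an illustration under stated assumptions rather than the code in the repository: the function name, the whole-word, case-insensitive matching rule, and the example names are hypothetical; only the placeholder token is taken from the slide.

```python
import re


def mask_animal_names(text, animal_names, placeholder="—ANIMAL—"):
    """Replace whole-word occurrences of query animal names with a placeholder.

    Longer names are tried first so that a multi-word name such as
    "European polecat" is masked as one unit rather than leaving "European"
    behind. Matching is case-insensitive; inflected forms (e.g. "wolves")
    are not handled in this sketch.
    """
    names = sorted(animal_names, key=len, reverse=True)
    if not names:
        return text
    alternatives = "|".join(re.escape(name) for name in names)
    pattern = re.compile(r"\b(?:" + alternatives + r")\b", re.IGNORECASE)
    return pattern.sub(placeholder, text)


# Hypothetical query terms and sentence.
names = ["wolf", "polecat", "European polecat"]
masked = mask_animal_names("A Wolf and a European polecat were seen.", names)
# masked == "A —ANIMAL— and a —ANIMAL— were seen."
```

The masked text would then be passed to feature extraction and the classifier, which are not shown here.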
13. Learning curves
• Total dataset consists of nearly 10,000
annotated examples
• Learning curves are a measure of
performance vs training set size
• Results converge rapidly: for the two-class
problem, ~1,000 examples already achieve
90% accuracy
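A learning curve of this kind can be computed with a small helper like the one below. It is a sketch only: a toy word-count classifier stands in for the Support Vector Machine, and the helper names, the held-out fraction, and the synthetic examples are assumptions of the example, not the experimental setup behind the slides.

```python
import random
from collections import Counter, defaultdict


def learning_curve(examples, train_sizes, fit, score, test_fraction=0.2, seed=0):
    """Return (train_size, score) pairs for growing training subsets.

    A fixed test set is held out once; each training subset is a prefix of
    the remaining shuffled pool, so larger subsets contain the smaller ones.
    """
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    test, pool = shuffled[:n_test], shuffled[n_test:]
    curve = []
    for size in train_sizes:
        size = min(size, len(pool))
        model = fit(pool[:size])
        curve.append((size, score(model, test)))
    return curve


def fit_word_counts(train):
    """Toy stand-in classifier: count how often each word occurs per label."""
    counts = defaultdict(Counter)
    for text, label in train:
        counts[label].update(text.lower().split())
    return counts


def accuracy(model, test):
    """Predict the label whose word counts best cover the text; return accuracy."""
    def predict(text):
        words = text.lower().split()
        return max(model, key=lambda lab: sum(model[lab][w] for w in words),
                   default=None)
    return sum(1 for text, label in test if predict(text) == label) / len(test)


# Synthetic two-class data, invented purely for illustration.
examples = (
    [(f"a polecat was trapped in barn {i}", "animal") for i in range(20)]
    + [(f"mr wolf opened shop number {i}", "not animal") for i in range(20)]
)
curve = learning_curve(examples, [2, 8, 32], fit_word_counts, accuracy)
```

Plotting the second element of each pair against the first gives the curve. With real data, `fit` and `score` would wrap the actual classifier and evaluation metric.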
14. Preliminary analysis
• Public perception of Mustelidae
(European polecat)
• Combination of distant and close
reading approaches
• Newspaper archives not
equally well digitised over time
• Trends in news may affect
reporting on animals
15. Lessons Learnt & Future Work
• Domain use cases often need specific
solutions
• Document classification already very useful
to historical ecologists (probably also to
other domain experts)
• 1,000 annotated examples sufficient for
two-class classification
• Extend to more species
• Improve classification sub-categories
• Add sentiment/opinions
Image source: https://upload.wikimedia.org/wikipedia/commons/1/1c/American_mink.jpg Mink