Digital History and Big Data: text mining historical documents on trade in the British empire

Beatrice Alex
balex@inf.ed.ac.uk
Digital scholarship: day of ideas 2, Edinburgh, 02/05/2013

Overview
What is text mining?
Text Mining in digital history
Trading Consequences
“Big data”
Visualisation
Challenge of noisy data
Collaborating with historians

Text Mining
Describes a set of linguistic, statistical and
machine learning techniques that model and
structure the information content of textual
resources.
Turns unstructured text into structured data (e.g.
relational database or linked data).
Is very useful for analysing large text collections
automatically.

Text Mining
TM methods often rely on a set of linguistic pre-
processing steps such as tokenisation, sentence
detection, part-of-speech tagging, lemmatisation,
syntactic parsing (chunking).
Currently our focus is on named entity
recognition, entity grounding and relation
extraction.
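
As a rough illustration of the first two pre-processing steps (sentence detection and tokenisation), here is a minimal rule-based sketch. It is not the pipeline used in the project. The function names `split_sentences` and `tokenise` and the regular expressions are assumptions made for this example, and real historical text would need far more robust rules.

```python
import re

def split_sentences(text):
    """Naive sentence detection: break after . ! or ? when followed by
    whitespace and a capital letter. Abbreviations will cause wrong splits."""
    parts = re.split(r"(?<=[.!?])\s+(?=[A-Z])", text)
    return [p.strip() for p in parts if p.strip()]

def tokenise(sentence):
    """Naive tokenisation: numbers with thousands separators, words (with
    internal hyphens or apostrophes), or single punctuation marks."""
    pattern = r"\d+(?:,\d{3})*(?:\.\d+)?|\w+(?:[-']\w+)*|[^\w\s]"
    return re.findall(pattern, sentence)

# Illustrative input only; this sentence is invented for the example.
sample = "Trade grew quickly. Exports reached 6,127 piculs in 1871."
sentences = split_sentences(sample)
tokens = tokenise(sentences[1])
```

Later steps such as part-of-speech tagging, lemmatisation and chunking would consume the token lists produced here.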

TM in Digital History
Goal: By analysing large amounts of digitised
data, help historians to discover novel patterns
and explore hypotheses.
Methods: linguistic text analysis, named entity
recognition, geo-grounding and relation extraction
to transform the text into structured data.
A sea change from the methods used in ‘traditional’
history.

“Traditional” Historical Research
Cinchona plantations in George King’s A Manual of
Cinchona Cultivation in India (1880).
Global Fats Supply 1894-98

Digging into Data II project (till Dec. 2013)
Edinburgh Team: Prof. Ewan Klein, Dr. Beatrice Alex,
Dr. Claire Grover, Clare Llewellyn, Richard Tobin,
James Reid, Nicola Osborne, Ian Fieldhouse

Trading Consequences

What does archival text say about the economic
and environmental consequences of global
commodity trading during the nineteenth century?
Scope: global, but with focus on Canadian natural
resources.
Example questions:
‣ What were the routes and volumes of international trade in
resource commodities in the nineteenth century?
‣ What were the local environmental consequences of this
demand for these resources?

Document Collections
Big data for historians:

Mined Information
Example sentence:
Extracted entities:
commodity: cassia bark
date: 1871
location: Padang
location: America
quantity + unit: 6,127 piculs

Mined Information
Example sentence:
Normalised and grounded entities:
commodity: cassia bark
date: 1871 (year=1871)
location: Padang (lat=-0.94924;long=100.35427;country=ID)
location: America (lat=39.76;long=-98.50;country=n/a)
quantity + unit: 6,127 piculs
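
The normalisation and grounding step above can be sketched as a lookup against a gazetteer plus some simple parsing. This is only a toy illustration: the `GAZETTEER` dictionary, `ground_location` and `normalise_quantity` are hypothetical names, and the two gazetteer entries simply reuse the coordinates shown on the slide. A real system would query a full gazetteer and disambiguate between candidate places.

```python
import re

# Toy gazetteer (assumption): only the two places from the slide,
# with the coordinates given there.
GAZETTEER = {
    "padang": {"lat": -0.94924, "long": 100.35427, "country": "ID"},
    "america": {"lat": 39.76, "long": -98.50, "country": None},
}

def ground_location(name):
    """Return coordinates for a place name, or None if it is unknown."""
    return GAZETTEER.get(name.strip().lower())

def normalise_quantity(text):
    """Split a string such as '6,127 piculs' into (number, unit)."""
    match = re.fullmatch(r"\s*(\d[\d,]*(?:\.\d+)?)\s+(\w+)\s*", text)
    if match is None:
        return None
    return float(match.group(1).replace(",", "")), match.group(2)

padang = ground_location("Padang")
quantity = normalise_quantity("6,127 piculs")
```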

Mined Information
Example sentence:
Extracted entity attributes and relations:
origin location: Padang
destination location: America
commodity–date relation: cassia bark – 1871
commodity–location relation: cassia bark – Padang
commodity–location relation: cassia bark – America
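
To show what "structured data" could look like for the relations above, the sketch below loads them into an in-memory SQLite table and queries it. The table layout (`relation` with `commodity`, `rel_type`, `value`, `role` columns) and the helper names are assumptions for illustration, not the project's actual database schema.

```python
import sqlite3

def build_db():
    """Create an in-memory database holding the relations from the slide."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE relation ("
        "commodity TEXT, rel_type TEXT, value TEXT, role TEXT)"
    )
    rows = [
        ("cassia bark", "date", "1871", None),
        ("cassia bark", "location", "Padang", "origin"),
        ("cassia bark", "location", "America", "destination"),
    ]
    conn.executemany("INSERT INTO relation VALUES (?, ?, ?, ?)", rows)
    conn.commit()
    return conn

def locations_for(conn, commodity):
    """List (place, role) pairs linked to a commodity, sorted by place."""
    cursor = conn.execute(
        "SELECT value, role FROM relation "
        "WHERE commodity = ? AND rel_type = 'location' ORDER BY value",
        (commodity,),
    )
    return cursor.fetchall()

conn = build_db()
places = locations_for(conn, "cassia bark")
```

Once relations are stored this way, questions such as "where was a commodity shipped from, and when?" become simple queries, which is what makes search and visualisation over the mined data possible.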

Commodity Ontology

Improved Search & Visualisations

Noisy Data
Optical character recognition (OCR) output contains many errors,
and the structure of the page layout is often lost.
Factors affecting OCR quality:
Sophistication of the OCR engine and scanning equipment.
Quality of the original print and paper.
Use of historical language.
Information in page margins (header, page numbers, etc.).
Information in tables.
Language of the text.

Fixing Noisy Data
Text normalisation and correction:
End-of-line soft hyphen removal
Dehyphenate all token-splitting hyphens using a dictionary-based
approach.
“False f”-to-s conversion
Convert all false f characters to s using a corpus.
Example: the number of words unrecognised by a spell checker
fell from 61 to 21 (given as 67%); on average, a 12%
reduction in word error rate in a random sample (Alex et
al., 2012).
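
The two corrections above can be sketched as follows. This is a simplified illustration, not the method of Alex et al. (2012): it uses a tiny hand-made word list (`LEXICON`, an assumption) for both steps, whereas the slide describes a dictionary for dehyphenation and a corpus for the f-to-s conversion. The function names are also hypothetical.

```python
import re
from itertools import product

# Tiny stand-in word list (assumption); a real system would use a large
# dictionary or corpus-derived vocabulary.
LEXICON = {"commodity", "ship", "said", "first", "some", "of", "for", "the"}

def dehyphenate(text, lexicon=LEXICON):
    """Join 'commo-<newline>dity' only when the joined form is a known word."""
    def join_if_known(match):
        joined = match.group(1) + match.group(2)
        return joined if joined.lower() in lexicon else match.group(0)
    return re.sub(r"(\w+)-\s*\n\s*(\w+)", join_if_known, text)

def fix_false_f(word, lexicon=LEXICON):
    """Replace 'f' with 's' where that turns an unknown word into a known one.
    Returns the corrected word in lower case, or the input unchanged."""
    lower = word.lower()
    if "f" not in lower or lower in lexicon:
        return word
    positions = [i for i, ch in enumerate(lower) if ch == "f"]
    # Try every combination of keeping or replacing each 'f'.
    for choice in product("sf", repeat=len(positions)):
        chars = list(lower)
        for pos, ch in zip(positions, choice):
            chars[pos] = ch
        candidate = "".join(chars)
        if candidate != lower and candidate in lexicon:
            return candidate
    return word

joined = dehyphenate("commo-\ndity trade")
fixed = [fix_false_f(w) for w in ["fhip", "firft", "of", "fome"]]
```

Checking candidates against a word list keeps genuine words such as "of" and "for" intact, and leaves real hyphenated compounds alone when the joined form is unknown.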

Fixing Noisy Data

Extract from document
10.2307/60238580 in FCOC.
How Noisy Is Too Noisy?
qBiu si }S3A:req s,uauuaqsu aq} }Bq} uirepo.ifT'papua}X3
sSuiav }qSuq Jiaq} qiiM jib ui snnS bbs aqx 'a"3(s aq} tnojj
ssfitns q}TM Sni5[ooi si jb}s }S.ii; aqx'papnaoSB q}Bq naABSjj
qS;H °1 ssbui s.uauuaqsu aqx

The Users (Historians)
Involvement of historians:
Everything is based on the use cases and built on users’
hypotheses/research questions.
They are responsible for identification of relevant collections
and are involved in the ontology development.
They provide feedback for us to improve technology iteratively:
Partners at York use the prototype for their research and
track errors; a workshop at CHESS 2013 with a group of
independent historians.
Clarity on the text mining accuracy is IMPORTANT.

Summary
Text mining historic documents in Trading
Consequences.
Processing “big data”.
Power of visualising structured data.
Fixing noisy data.
Importance of two-way collaboration between
technology experts and users in digital history.

Thank you
Questions? Fire away or contact me at:
balex@inf.ed.ac.uk
