SlideShare una empresa de Scribd logo
1 de 26
Beatrice Alex
balex@inf.ed.ac.uk
Digital scholarship: day of ideas 2, Edinburgh, 02/05/2013
Digital history and big data:
Text mining historical documents on trade in
the British Empire
Overview
What is text mining?
Text Mining in digital history
Trading Consequences
“Big data”
Visualisation
Challenge of noisy data
Collaborating with historians
Digital scholarship: day of ideas 2, Edinburgh, 02/05/2013
Text Mining
Describes a set of linguistic, statistical and
machine learning techniques that model and
structure the information content of textual
resources.
Turns unstructured text into structured data (e.g.
relational database or linked data).
Is very useful for analysing large text collections
automatically.
Digital scholarship: day of ideas 2, Edinburgh, 02/05/2013
Text Mining
TM methods often rely on a set of linguistic pre-
processing steps such as tokenisation, sentence
detection, part-of-speech tagging, lemmatisation,
syntactic parsing (chunking).
Currently our focus is on named entity
recognition, entity grounding and relation
extraction.
Digital scholarship: day of ideas 2, Edinburgh, 02/05/2013
TM in Digital History
Goal: By analysing large amounts of digitised
data, help historians to discover novel patterns
and explore hypothesis.
Methods: linguistic text analysis, named entity
recognition, geo-grounding and relation extraction
to transform the text into structured data.
Sea-change to methods used in ‘traditional’
history.
Digital scholarship: day of ideas 2, Edinburgh, 02/05/2013
“Traditional” Historical
Research
Cinchona plantations in George King’s A Manual of
Cinchona Cultivation in India (1880).
Global Fats Supply 1894-98
Digital scholarship: day of ideas 2, Edinburgh, 02/05/2013
Trading Consequences
Digging into Data II project (till Dec. 2013)
Edinburgh Team: Prof. Ewan Klein, Dr. Beatrice Alex,
Dr. Claire Grover, Clare Llewellyn, Richard Tobin,
James Reid, Nicola Osborne, Ian Fieldhouse
Digital scholarship: day of ideas 2, Edinburgh, 02/05/2013
TRADING CONSEQUEnCES
Digital scholarship: day of ideas 2, Edinburgh, 02/05/2013
Trading Consequences
What does archival text say about the economic
and environmental consequences of global
commodity trading during the nineteenth century?
Scope: global, but with focus on Canadian natural
resources.
Example questions:
‣ What were the routes and volumes of international trade in
resource commodities in the nineteenth century?
‣ What were the local environmental consequences of this
demand for these resources?
Digital scholarship: day of ideas 2, Edinburgh, 02/05/2013
Document Collections
Big data for historians:
Digital scholarship: day of ideas 2, Edinburgh, 02/05/2013
Mined Information
Example sentence:
Digital scholarship: day of ideas 2, Edinburgh, 02/05/2013
Mined Information
Example sentence:
Extracted entities:
commodity: cassia bark
date: 1871
location: Padang
location: America
quantity + unit: 6,127 piculs
Digital scholarship: day of ideas 2, Edinburgh, 02/05/2013
Mined Information
Example sentence:
Normalised and grounded entities:
commodity: cassia bark
date: 1871 (year=1871)
location: Padang (lat=-0.94924;long=100.35427;country=ID)
location: America (lat=39.76;long=-98.50;country=n/a)
quantity + unit: 6,127 piculs
Digital scholarship: day of ideas 2, Edinburgh, 02/05/2013
Mined Information
Example sentence:
Extracted entity attributes and relations:
origin location: Padang
destination location: America
commodity–date relation: cassia bark – 1871
commodity–location relation: cassia bark – Padang
commodity–location relation: cassia bark – America
Digital scholarship: day of ideas 2, Edinburgh, 02/05/2013
Commodity Ontology
Digital scholarship: day of ideas 2, Edinburgh, 02/05/2013
Improved Search &
Visualisations
Digital scholarship: day of ideas 2, Edinburgh, 02/05/2013
Improved Search &
Visualisations
Digital scholarship: day of ideas 2, Edinburgh, 02/05/2013
Improved Search &
Visualisations
Digital scholarship: day of ideas 2, Edinburgh, 02/05/2013
Noisy Data
Optical character recognition contains many errors
and often the structure of the page layout is lost.
Sophistication of the OCR engine and scanning equipment.
Quality of the original print and paper.
Use of historical language.
Information in page margins (header, page numbers, etc.).
Information in tables.
Language of the text.
Digital scholarship: day of ideas 2, Edinburgh, 02/05/2013
Fixing Noisy Data
Text normalisation and correction:
End-of-line soft hyphen removal
Dehyphen all token-splitting hyphens using a dictionary-based
approach.
“False f”-to-s conversion
Convert all false f characters to s using a corpus.
Example: reduced number of words unrecognised by
spell checker from 61 to 21 -> 67%, on average 12%
reduction in word error rate in a random sample (Alex et
al, 2012).
Digital scholarship: day of ideas 2, Edinburgh, 02/05/2013
Fixing Noisy Data
Digital scholarship: day of ideas 2, Edinburgh, 02/05/2013
Fixing Noisy Data
Digital scholarship: day of ideas 2, Edinburgh, 02/05/2013
Extract from document
10.2307/60238580 in FCOC.
How Noisy Is Too Noisy?
qBiu si }S3A:req s,uauuaqsu aq} }Bq} uirepo.ifT'papua}X3
sSuiav }qSuq Jiaq} qiiM jib ui snnS bbs aqx 'a"3(s aq} tnojj
ssfitns q}TM Sni5[ooi si jb}s }S.ii; aqx'papnaoSB q}Bq naABSjj
qS;H °1 ssbui s.uauuaqsu aqx
Digital scholarship: day of ideas 2, Edinburgh, 02/05/2013
The Users (Historians)
Involvement of historians:
Everything is based on the use cases and build on users’
hypotheses/research questions.
They are responsible for identification of relevant collections
and are involved in the ontology development.
They provide feedback for us to improve technology iteratively:
Partners at York use of the prototype for their research and
track errors; Workshop at CHESS 2013 with a group of
independent historians
Clarity on the text mining accuracy is IMPORTANT.
Digital scholarship: day of ideas 2, Edinburgh, 02/05/2013
Summary
Text mining historic documents in Trading
Consequences.
Processing “big data”.
Power of visualising structured data.
Fixing noisy data.
Importance of two-way collaboration between
technology experts and users in digital history.
Digital scholarship: day of ideas 2, Edinburgh, 02/05/2013
Thank you
Questions? Fire away or contact me at:
balex@inf.ed.ac.uk
Digital scholarship: day of ideas 2, Edinburgh, 02/05/2013

Más contenido relacionado

Similar a Digital History and Big Data: text mining historical documents on trade in the British empire

20130711 records2 graphs_madrid
20130711 records2 graphs_madrid20130711 records2 graphs_madrid
20130711 records2 graphs_madrid
Stefan Gradmann
 
20130711 linked datascholarship_madrid
20130711 linked datascholarship_madrid20130711 linked datascholarship_madrid
20130711 linked datascholarship_madrid
Stefan Gradmann
 
poster-Jing-09302014
poster-Jing-09302014poster-Jing-09302014
poster-Jing-09302014
Jing Xie
 

Similar a Digital History and Big Data: text mining historical documents on trade in the British empire (20)

An Open and Shut Case? Shared Standards for Stratigraphic Data and Heritage L...
An Open and Shut Case? Shared Standards for Stratigraphic Data and Heritage L...An Open and Shut Case? Shared Standards for Stratigraphic Data and Heritage L...
An Open and Shut Case? Shared Standards for Stratigraphic Data and Heritage L...
 
20130711 records2 graphs_madrid
20130711 records2 graphs_madrid20130711 records2 graphs_madrid
20130711 records2 graphs_madrid
 
Dark Data In the Long Tail of Science:   Examples in Biology
Dark Data In the Long Tail of Science:  Examples in BiologyDark Data In the Long Tail of Science:  Examples in Biology
Dark Data In the Long Tail of Science:   Examples in Biology
 
Essays on the Smart Grid
Essays on the Smart GridEssays on the Smart Grid
Essays on the Smart Grid
 
Oregon Digital: Collaborative Hydra Development
Oregon Digital: Collaborative Hydra DevelopmentOregon Digital: Collaborative Hydra Development
Oregon Digital: Collaborative Hydra Development
 
Neil Grindley
Neil GrindleyNeil Grindley
Neil Grindley
 
The Data Lifecycle - EUDAT Summer School (Yann Le Franc)
The Data Lifecycle - EUDAT Summer School (Yann Le Franc)The Data Lifecycle - EUDAT Summer School (Yann Le Franc)
The Data Lifecycle - EUDAT Summer School (Yann Le Franc)
 
20130711 linked datascholarship_madrid
20130711 linked datascholarship_madrid20130711 linked datascholarship_madrid
20130711 linked datascholarship_madrid
 
Ways to Extract Variable Insights when Data is Scarse
Ways to Extract Variable Insights when Data is ScarseWays to Extract Variable Insights when Data is Scarse
Ways to Extract Variable Insights when Data is Scarse
 
Knowledge Graphs for Scholarly Communication
Knowledge Graphs for Scholarly CommunicationKnowledge Graphs for Scholarly Communication
Knowledge Graphs for Scholarly Communication
 
Doing Digital Research @ British Library
Doing Digital Research @ British LibraryDoing Digital Research @ British Library
Doing Digital Research @ British Library
 
Internet Prospective Study
Internet Prospective StudyInternet Prospective Study
Internet Prospective Study
 
poster-Jing-09302014
poster-Jing-09302014poster-Jing-09302014
poster-Jing-09302014
 
Modern association rule mining methods
Modern association rule mining methodsModern association rule mining methods
Modern association rule mining methods
 
Data Mining for scholarship
Data Mining for scholarshipData Mining for scholarship
Data Mining for scholarship
 
Developments in catalogues and data sharing
Developments in catalogues and data sharingDevelopments in catalogues and data sharing
Developments in catalogues and data sharing
 
Realising the Potential of Algal Biomass Production through Semantic Web an...
Realising the Potential of Algal Biomass Production   through Semantic Web an...Realising the Potential of Algal Biomass Production   through Semantic Web an...
Realising the Potential of Algal Biomass Production through Semantic Web an...
 
Wehc - Linked Data for Economic-Social historians
Wehc - Linked Data for Economic-Social historiansWehc - Linked Data for Economic-Social historians
Wehc - Linked Data for Economic-Social historians
 
PhD Open Day Intro to Digital Scholarship (13 Jan 2021)
PhD Open Day Intro to Digital Scholarship (13 Jan 2021)PhD Open Day Intro to Digital Scholarship (13 Jan 2021)
PhD Open Day Intro to Digital Scholarship (13 Jan 2021)
 
Swib2014csarasua
Swib2014csarasuaSwib2014csarasua
Swib2014csarasua
 

Último

EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
Earley Information Science
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
Joaquim Jorge
 

Último (20)

Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Evaluating the top large language models.pdf
Evaluating the top large language models.pdfEvaluating the top large language models.pdf
Evaluating the top large language models.pdf
 

Digital History and Big Data: text mining historical documents on trade in the British empire

  • 1. Beatrice Alex balex@inf.ed.ac.uk Digital scholarship: day of ideas 2, Edinburgh, 02/05/2013 Digital history and big data: Text mining historical documents on trade in the British Empire
  • 2. Overview What is text mining? Text Mining in digital history Trading Consequences “Big data” Visualisation Challenge of noisy data Collaborating with historians Digital scholarship: day of ideas 2, Edinburgh, 02/05/2013
  • 3. Text Mining Describes a set of linguistic, statistical and machine learning techniques that model and structure the information content of textual resources. Turns unstructured text into structured data (e.g. relational database or linked data). Is very useful for analysing large text collections automatically. Digital scholarship: day of ideas 2, Edinburgh, 02/05/2013
  • 4. Text Mining TM methods often rely on a set of linguistic pre- processing steps such as tokenisation, sentence detection, part-of-speech tagging, lemmatisation, syntactic parsing (chunking). Currently our focus is on named entity recognition, entity grounding and relation extraction. Digital scholarship: day of ideas 2, Edinburgh, 02/05/2013
  • 5. TM in Digital History Goal: By analysing large amounts of digitised data, help historians to discover novel patterns and explore hypothesis. Methods: linguistic text analysis, named entity recognition, geo-grounding and relation extraction to transform the text into structured data. Sea-change to methods used in ‘traditional’ history. Digital scholarship: day of ideas 2, Edinburgh, 02/05/2013
  • 6. “Traditional” Historical Research Cinchona plantations in George King’s A Manual of Cinchona Cultivation in India (1880). Global Fats Supply 1894-98 Digital scholarship: day of ideas 2, Edinburgh, 02/05/2013
  • 7. Trading Consequences Digging into Data II project (till Dec. 2013) Edinburgh Team: Prof. Ewan Klein, Dr. Beatrice Alex, Dr. Claire Grover, Clare Llewellyn, Richard Tobin, James Reid, Nicola Osborne, Ian Fieldhouse Digital scholarship: day of ideas 2, Edinburgh, 02/05/2013
  • 8. TRADING CONSEQUEnCES Digital scholarship: day of ideas 2, Edinburgh, 02/05/2013
  • 9. Trading Consequences What does archival text say about the economic and environmental consequences of global commodity trading during the nineteenth century? Scope: global, but with focus on Canadian natural resources. Example questions: ‣ What were the routes and volumes of international trade in resource commodities in the nineteenth century? ‣ What were the local environmental consequences of this demand for these resources? Digital scholarship: day of ideas 2, Edinburgh, 02/05/2013
  • 10. Document Collections Big data for historians: Digital scholarship: day of ideas 2, Edinburgh, 02/05/2013
  • 11. Mined Information Example sentence: Digital scholarship: day of ideas 2, Edinburgh, 02/05/2013
  • 12. Mined Information Example sentence: Extracted entities: commodity: cassia bark date: 1871 location: Padang location: America quantity + unit: 6,127 piculs Digital scholarship: day of ideas 2, Edinburgh, 02/05/2013
  • 13. Mined Information Example sentence: Normalised and grounded entities: commodity: cassia bark date: 1871 (year=1871) location: Padang (lat=-0.94924;long=100.35427;country=ID) location: America (lat=39.76;long=-98.50;country=n/a) quantity + unit: 6,127 piculs Digital scholarship: day of ideas 2, Edinburgh, 02/05/2013
  • 14. Mined Information Example sentence: Extracted entity attributes and relations: origin location: Padang destination location: America commodity–date relation: cassia bark – 1871 commodity–location relation: cassia bark – Padang commodity–location relation: cassia bark – America Digital scholarship: day of ideas 2, Edinburgh, 02/05/2013
  • 15. Commodity Ontology Digital scholarship: day of ideas 2, Edinburgh, 02/05/2013
  • 16. Improved Search & Visualisations Digital scholarship: day of ideas 2, Edinburgh, 02/05/2013
  • 17. Improved Search & Visualisations Digital scholarship: day of ideas 2, Edinburgh, 02/05/2013
  • 18. Improved Search & Visualisations Digital scholarship: day of ideas 2, Edinburgh, 02/05/2013
  • 19. Noisy Data Optical character recognition contains many errors and often the structure of the page layout is lost. Sophistication of the OCR engine and scanning equipment. Quality of the original print and paper. Use of historical language. Information in page margins (header, page numbers, etc.). Information in tables. Language of the text. Digital scholarship: day of ideas 2, Edinburgh, 02/05/2013
  • 20. Fixing Noisy Data Text normalisation and correction: End-of-line soft hyphen removal Dehyphen all token-splitting hyphens using a dictionary-based approach. “False f”-to-s conversion Convert all false f characters to s using a corpus. Example: reduced number of words unrecognised by spell checker from 61 to 21 -> 67%, on average 12% reduction in word error rate in a random sample (Alex et al, 2012). Digital scholarship: day of ideas 2, Edinburgh, 02/05/2013
  • 21. Fixing Noisy Data Digital scholarship: day of ideas 2, Edinburgh, 02/05/2013
  • 22. Fixing Noisy Data Digital scholarship: day of ideas 2, Edinburgh, 02/05/2013
  • 23. Extract from document 10.2307/60238580 in FCOC. How Noisy Is Too Noisy? qBiu si }S3A:req s,uauuaqsu aq} }Bq} uirepo.ifT'papua}X3 sSuiav }qSuq Jiaq} qiiM jib ui snnS bbs aqx 'a"3(s aq} tnojj ssfitns q}TM Sni5[ooi si jb}s }S.ii; aqx'papnaoSB q}Bq naABSjj qS;H °1 ssbui s.uauuaqsu aqx Digital scholarship: day of ideas 2, Edinburgh, 02/05/2013
  • 24. The Users (Historians) Involvement of historians: Everything is based on the use cases and build on users’ hypotheses/research questions. They are responsible for identification of relevant collections and are involved in the ontology development. They provide feedback for us to improve technology iteratively: Partners at York use of the prototype for their research and track errors; Workshop at CHESS 2013 with a group of independent historians Clarity on the text mining accuracy is IMPORTANT. Digital scholarship: day of ideas 2, Edinburgh, 02/05/2013
  • 25. Summary Text mining historic documents in Trading Consequences. Processing “big data”. Power of visualising structured data. Fixing noisy data. Importance of two-way collaboration between technology experts and users in digital history. Digital scholarship: day of ideas 2, Edinburgh, 02/05/2013
  • 26. Thank you Questions? Fire away or contact me at: balex@inf.ed.ac.uk Digital scholarship: day of ideas 2, Edinburgh, 02/05/2013