Large-scale refinement of digital historical 
newspapers with named entity recognition 
IFLA Newspaper Pre-Conference 
14 August 2014, Geneva 
Clemens Neudecker, SBB, @cneudecker
Overview 
• Background 
• NER Introduction 
• Approach 
• Challenges 
• Scalability 
• First results 
• Outlook 
This project is partially funded under the ICT Policy Support Programme (ICT PSP) as part of the 
Competitiveness and Innovation Framework Programme by the European Community 
http://ec.europa.eu/ict_psp
Background 
• Europeana Newspapers: EU Best Practice Network
• 10 million newspaper pages with full text from 12 libraries
• 36 million newspaper pages with metadata for Europeana
Named entity recognition (I) 
1. Detect names of persons, places, organisations
Named entity recognition (II) 
2. Disambiguate entities 
Named entity recognition (III) 
3. Link to online resources 
Approach (I) 
• Tackle content in Dutch, German and French (about 50% of the 10 million pages)
Approach (II) 
• Use an open-source machine learning tool developed by Stanford University and adapted for Europeana Newspapers by KBNL
https://github.com/KBNLresearch/europeananp-ner
Approach (III) 
• Create (and release) training material by manually annotating named entities on OCR'd newspaper pages
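To make the idea of manually annotated training material concrete, here is a minimal sketch of how such annotations could be read. It assumes a tab-separated layout with one token and one label per line and a blank line between sentences, which is a common convention for NER training files; the slides do not specify the actual file format. The function name, the labels (PER, LOC, O) and the sample text are all hypothetical.

```python
from typing import List, Tuple

Sentence = List[Tuple[str, str]]  # (token, label) pairs


def parse_annotations(text: str) -> List[Sentence]:
    """Parse 'token<TAB>label' lines into sentences.

    Assumed layout (not taken from the slides): one token per line,
    a blank line marks the end of a sentence.
    """
    sentences: List[Sentence] = []
    current: Sentence = []
    for line in text.splitlines():
        if not line.strip():
            # Blank line: close the sentence collected so far, if any.
            if current:
                sentences.append(current)
                current = []
            continue
        token, label = line.split("\t")
        current.append((token, label))
    if current:
        sentences.append(current)
    return sentences


# Hypothetical sample: two short sentences, the first with two entities.
sample = "Jan\tPER\nwoont\tO\nin\tO\nAmsterdam\tLOC\n\nHet\tO\nregent\tO\n"
sentences = parse_annotations(sample)
```

Here `sentences` holds two sentences; the first contains the pairs ("Jan", "PER") and ("Amsterdam", "LOC").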
Challenges 
• OCR quality 
• Multiple (mixed) languages
• Historical spelling 
Scalability 
• Stanford NER software is multi-threaded (e.g. 4 CPU cores: 4x throughput)
• Optimise the NER classifier by filtering noise and sentences without any marked NEs
• Robust, proven Java technology
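The filtering step in the second bullet can be sketched as follows. This is a minimal illustration, not the project's implementation: it assumes that "filtering" means dropping sentences from the training data, that sentences are lists of (token, label) pairs, and that the label "O" marks tokens outside any entity. All names and sample data are hypothetical.

```python
from typing import List, Tuple

Sentence = List[Tuple[str, str]]  # (token, label) pairs; assumed representation

OUTSIDE = "O"  # assumed label for tokens that are not part of a named entity


def has_entity(sentence: Sentence) -> bool:
    """Return True if at least one token carries an entity label."""
    return any(label != OUTSIDE for _, label in sentence)


def filter_training_sentences(sentences: List[Sentence]) -> List[Sentence]:
    """Keep only sentences that contain at least one annotated entity."""
    return [s for s in sentences if has_entity(s)]


# Hypothetical sample: only the first sentence contains named entities.
training = [
    [("Anna", "PER"), ("spoke", "O"), ("in", "O"), ("Geneva", "LOC")],
    [("The", "O"), ("weather", "O"), ("was", "O"), ("fine", "O")],
]
kept = filter_training_sentences(training)
```

Only the first sentence survives in `kept`; the second is dropped because every token is labelled "O".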
First results (Dutch)
            Persons   Locations   Organizations
Precision   0.940     0.950       0.942
Recall      0.588     0.760       0.559
F-measure   0.689     0.838       0.671
First results (French)
            Persons   Locations
Precision   0.529     0.548
Recall      0.834     0.216
F-measure   0.622     0.310
* Scores for organisations are omitted because too few are present in the source material.
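The F-measure is conventionally the harmonic mean of precision and recall. The sketch below assumes the balanced variant (F1); the slides do not state which variant or averaging scheme was used, so other cells in the tables need not reproduce exactly under this formula, for instance if scores were averaged over several evaluation runs. The function name is hypothetical.

```python
def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure (F1): harmonic mean of precision and recall."""
    if precision + recall == 0:
        # Avoid division by zero when nothing was found and nothing was correct.
        return 0.0
    return 2 * precision * recall / (precision + recall)


# French "Locations" column from the results table.
french_locations = round(f_measure(0.548, 0.216), 3)
```

For the French "Locations" column this yields 0.310 after rounding to three decimals, which agrees with the reported value.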
Outlook 
• Q3: Release of training data for named entity recognition in Dutch, German and French
• Q3: First results for German (Austrian, Italian/South Tyrol), final results for Dutch and French
• Q4: Release of open-source software for disambiguating NER results and linking them to DBpedia
www.europeana-newspapers.eu/ 
www.theeuropeanlibrary.org/tel4/newspapers 
https://github.com/KBNLresearch/europeananp-ner 
Thank you for your attention! 
IFLA Newspaper Pre-Conference 
14 August 2014, Geneva 
Clemens Neudecker, SBB, @cneudecker
