Personalised access to cultural heritage spaces

Recommendations for the automatic enrichment of DL content
using open source software

Authors: Eneko Agirre and Arantxa Otegi

www.paths-project.eu
PATHS is funded by the European Commission FP7 programme under Digital Libraries and Digital Preservation

1 Introduction
2 Producing intra-collections links
2.1 Similarity links
2.2 Typed-similarity links
3 Producing background links
4 Ontology extension
References

1 Introduction
PATHS uses text processing software to enable the following functionality in the prototypes
(see deliverables D2.1 and D2.2, http://paths-project.eu/eng/Resources):
● intra-collection links
● background links
● ontology extension

The aim of PATHS is to investigate how these functionalities can better support exploration
by users. Most of the software is in-house and relatively complex. Releasing the software
as open source was not part of the Description of Work (DoW), and the exploitation plan
points towards software-as-a-service, where the processing is done on servers set up by the
partners. For instance, a closely related Best Practice Network (LoCloud,
http://www.locloud.eu/) is to produce servers for the automatic content enrichment of
digital library items.
Alternatively, projects like OpeNER (http://www.opener-project.org/) are devoted to
releasing open source tools similar to those used for text processing in PATHS.
Releasing all PATHS software as out-of-the-box open source packages would be a
major undertaking, on a par with those European projects.
The goal of this document is to present an overall set of recommendations for the automatic
enrichment of Digital Library content using open source software. We think this will be
useful for third parties who would like to offer similar services. Note that this is not a
step-by-step guide for reimplementation, but an overview of the software required and the
programming effort involved.
This document is structured according to the enrichment tasks described in PATHS
deliverables D2.1 and D2.2.

2 Producing intra-collections links

PATHS produced both generic similarity links and more specific typed-similarity links.

2.1 Similarity links
The target items need to be processed by the following Natural Language Processing
(NLP) tools: part-of-speech (POS) tagging and lemmatization. Several open source products
exist, including the two used in PATHS:
● CoreNLP (http://nlp.stanford.edu/software/corenlp.shtml)
● Freeling (http://nlp.lsi.upc.edu/freeling/)
Regarding multilinguality, PATHS worked on Spanish and English texts. Freeling covers both,
and CoreNLP works out-of-the-box for English. Recent projects like OpeNER also offer a
suite of open source NLP tools covering, in addition to English and Spanish, four other
European languages.
In addition, we used in-house scripts to process the internal representation of the items,
extract the textual pieces, and produce the enriched representations.
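As a rough illustration of what such a script might look like, the sketch below pulls the
title and description text out of a Dublin Core style XML record. The record layout, the
extract_text helper and the sample data are assumptions made for this example; the PATHS
internal representation is not described in this document.

```python
import xml.etree.ElementTree as ET

# Dublin Core element namespace; the record layout itself is hypothetical.
DC = "{http://purl.org/dc/elements/1.1/}"

SAMPLE_RECORD = """
<record xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>Roman coin</dc:title>
  <dc:description>Silver denarius found near York.</dc:description>
  <dc:subject>Numismatics</dc:subject>
</record>
""".strip()


def extract_text(record_xml, fields=("title", "description")):
    """Return the text of the requested Dublin Core fields, keyed by field name."""
    root = ET.fromstring(record_xml)
    pieces = {}
    for field in fields:
        # iter() finds matching elements at any depth; empty elements are skipped.
        values = [
            el.text.strip()
            for el in root.iter(DC + field)
            if el.text and el.text.strip()
        ]
        pieces[field] = " ".join(values)
    return pieces


text = extract_text(SAMPLE_RECORD)
print(text)
```

The extracted strings would then be passed to the POS tagger and lemmatizer, and the
results written back into whatever enriched representation the collection uses.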
The actual production of the similarity links requires additional software, which in this case
was developed in-house. No out-of-the-box alternatives exist. Interested parties
would need to replicate the software described in (Aletras et al. 2012).
As an alternative, the PATHS server demonstrating similarity links described in (Agirre et al.
2013b) uses the functionality provided by a search engine (Solr, http://lucene.apache.org/solr/)
to retrieve similar items. Note that this alternative does not require additional NLP tools, as it
relies on the search engine's own stop words and stemming algorithm.
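To give a feel for the kind of term-based similarity a search engine can provide, the
following minimal sketch ranks items by TF-IDF cosine similarity. It is neither Solr's
implementation nor the method of (Aletras et al. 2012); the stop-word list, the weighting
formula and the sample items are assumptions chosen for illustration, and no stemming is
applied.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list; a real engine ships a much longer one.
STOP_WORDS = {"a", "an", "and", "in", "of", "the", "with"}


def tokenize(text):
    """Lowercase, keep alphabetic tokens, drop stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]


def tfidf_vectors(docs):
    """Map each document id to a sparse TF-IDF vector (dict of term -> weight)."""
    tokens = {doc_id: tokenize(text) for doc_id, text in docs.items()}
    df = Counter(term for toks in tokens.values() for term in set(toks))
    n = len(docs)
    vectors = {}
    for doc_id, toks in tokens.items():
        tf = Counter(toks)
        # Smoothed idf keeps weights positive even for terms found in every document.
        vectors[doc_id] = {
            t: c * (math.log((1 + n) / (1 + df[t])) + 1) for t, c in tf.items()
        }
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(
        sum(w * w for w in v.values())
    )
    return dot / norm if norm else 0.0


def most_similar(doc_id, vectors, k=3):
    """Rank the other items by cosine similarity to doc_id."""
    scores = [
        (other, cosine(vectors[doc_id], vec))
        for other, vec in vectors.items()
        if other != doc_id
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:k]


items = {
    "a": "Bronze statue of a Roman emperor",
    "b": "Marble statue of a Roman senator",
    "c": "Watercolour painting of a harbour",
}
vectors = tfidf_vectors(items)
ranking = most_similar("a", vectors)
print(ranking)
```

In this toy collection item "b" ranks first for item "a" because the two share the terms
"statue" and "roman", while item "c" shares no terms with it.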

2.2 Typed-similarity links
The target items need to be processed by the following NLP tools: POS tagging and
lemmatization (see previous section).
In addition, we used in-house scripts to process the internal representation of the items,
extract the textual pieces, and produce the enriched representations.
The actual production of the typed-similarity links requires additional software: open
source machine learning software (Weka, http://www.cs.waikato.ac.nz/ml/weka/) and
in-house scripts that extract features from items, train the machine learning models on the
publicly available typed-similarity datasets produced by PATHS (http://ixa2.si.ehu.es/sts/),
and apply the trained models to the target items. No out-of-the-box alternatives
exist. Interested parties would need to replicate the software as described in (Agirre et al.
2013a).
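The extract-features, train, apply workflow can be sketched as follows. This is a
dependency-free toy in plain Python rather than Weka, and it does not reproduce the
features or models of (Agirre et al. 2013a): the field names, the token-overlap feature,
the single-predictor regression and the training scores are all invented for illustration.

```python
def jaccard(a, b):
    """Token-set overlap between two strings, in [0, 1]."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0


def features(item1, item2, fields):
    """One overlap feature per metadata field (field names are hypothetical)."""
    return [jaccard(item1.get(f, ""), item2.get(f, "")) for f in fields]


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / var_x
    return slope, mean_y - slope * mean_x


FIELDS = ["creator", "subject"]

# Toy training pairs with made-up gold scores for an "author" similarity type.
train = [
    ({"creator": "j turner", "subject": "sea"},
     {"creator": "j turner", "subject": "storm"}, 5.0),
    ({"creator": "j turner", "subject": "sea"},
     {"creator": "w turner", "subject": "sea"}, 2.0),
    ({"creator": "j constable", "subject": "mill"},
     {"creator": "a cuyp", "subject": "mill"}, 0.0),
]

# Use the creator-field overlap as the single predictor of author similarity.
xs = [features(a, b, FIELDS)[0] for a, b, _ in train]
ys = [score for _, _, score in train]
slope, intercept = fit_line(xs, ys)


def predict_author_similarity(item1, item2):
    """Apply the fitted line to a new pair of items."""
    return slope * features(item1, item2, FIELDS)[0] + intercept


same = predict_author_similarity({"creator": "a cuyp"}, {"creator": "a cuyp"})
different = predict_author_similarity({"creator": "a cuyp"}, {"creator": "j turner"})
print(same, different)
```

A realistic setup would train one model per similarity type on the published datasets and
combine many features per pair; the sketch only shows how the three steps fit together.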
3 Producing background links

To produce background links we used Wikipedia Miner, an open source package
available at http://wikipedia-miner.cms.waikato.ac.nz. We used Wikipedia Miner
out-of-the-box. In addition, we used in-house scripts to process the internal representation
of the items, extract the textual pieces, and produce the enriched representations.
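For readers unfamiliar with the task, the toy sketch below shows what linking item text to
Wikipedia articles means in its simplest form: a greedy longest-match lookup of phrases in
an anchor dictionary. It does not call or reproduce Wikipedia Miner, which is considerably
more sophisticated; the dictionary entries and the link_mentions helper are hypothetical.

```python
import re

# Hypothetical anchor dictionary: surface form -> Wikipedia article title.
# A real system derives this from Wikipedia link anchors and disambiguates by context.
ANCHORS = {
    "hadrian's wall": "Hadrian's Wall",
    "york": "York",
    "roman empire": "Roman Empire",
}


def link_mentions(text, anchors=ANCHORS, max_len=3):
    """Greedy longest-match lookup of word n-grams in the anchor dictionary."""
    words = re.findall(r"[\w']+", text.lower())
    links, i = [], 0
    while i < len(words):
        for n in range(min(max_len, len(words) - i), 0, -1):
            phrase = " ".join(words[i:i + n])
            if phrase in anchors:
                links.append((phrase, anchors[phrase]))
                i += n  # skip past the matched phrase
                break
        else:
            i += 1  # no anchor starts at this word
    return links


links = link_mentions(
    "A fort on Hadrian's Wall, built by the Roman Empire north of York."
)
print(links)
```

Each (mention, article) pair corresponds to one background link that could be stored in
the enriched representation of the item.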

4 Ontology extension

The target items need to be processed by the following NLP tools: POS tagging and
lemmatization (see Section 2.1).
In addition, we used in-house scripts to process the internal representation of the items,
extract the textual pieces, and produce the enriched representations.
The actual production of the vocabulary requires additional software. We first extract the
background links (see previous section), and then find the most relevant Wikipedia articles
for each item. Relevance is computed globally, by analysing statistics over the whole
collection. Those articles are used to categorize the items according to a Wikipedia-based
category system, which is trimmed down to cover only the categories that are relevant to
the collection at hand. No out-of-the-box alternatives exist. Interested parties would need
to replicate the software as described in (Fernando et al. 2012).
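The pipeline just described can be sketched in a few lines. This is only an illustration
of the data flow, not the method of (Fernando et al. 2012): collection-wide link frequency
stands in for relevance, a minimum item count stands in for the trimming criterion, and
all item identifiers, articles and categories are invented.

```python
from collections import Counter, defaultdict

# Hypothetical inputs: background links per item, and categories per Wikipedia article.
item_links = {
    "item1": ["Roman Empire", "York", "Coin"],
    "item2": ["Roman Empire", "Hadrian's Wall"],
    "item3": ["Watercolor painting", "York"],
}
article_categories = {
    "Roman Empire": ["Ancient Rome"],
    "Hadrian's Wall": ["Ancient Rome", "Fortifications"],
    "York": ["Cities in England"],
    "Coin": ["Currency"],
    "Watercolor painting": ["Painting"],
}


def top_articles(item_links, per_item=2):
    """Keep, per item, the articles linked most often across the whole collection."""
    freq = Counter(a for links in item_links.values() for a in set(links))
    return {
        item: sorted(set(links), key=lambda a: (-freq[a], a))[:per_item]
        for item, links in item_links.items()
    }


def categorise(item_articles, article_categories, min_items=2):
    """Assign items to categories, then drop categories with fewer than min_items items."""
    members = defaultdict(set)
    for item, articles in item_articles.items():
        for article in articles:
            for category in article_categories.get(article, []):
                members[category].add(item)
    return {c: sorted(items) for c, items in members.items() if len(items) >= min_items}


selected = top_articles(item_links)
vocabulary = categorise(selected, article_categories)
print(vocabulary)
```

With these toy inputs only the categories shared by at least two items survive the
trimming step, which mirrors the idea of keeping the category system focused on the
collection at hand.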

References
Agirre E., Aletras N., Gonzalez-Agirre A., Rigau G., Stevenson M. (2013a). UBC_UOS-TYPED:
Regression for typed-similarity. In The Second Joint Conference on Lexical and
Computational Semantics (*SEM 2013).
Agirre E., Barrena A., Fernandez K., Miranda E., Otegi A., Soroa A. (2013b). PATHSenrich:
a Web Service Prototype for Automatic Cultural Heritage Item Enrichment. In Research and
Advanced Technology for Digital Libraries, International Conference on Theory and Practice
of Digital Libraries, TPDL 2013, Valletta, Malta, September 22-26, 2013. Lecture Notes in
Computer Science, vol. 8092, pages 462-465.
Aletras N., Stevenson M., Clough P. (2012). Computing Similarity between Items in a Digital
Library of Cultural Heritage. In ACM JOCCH.
Fernando S., Hall M., Agirre E., Soroa A., Clough P., Stevenson M. (2012). Comparing
taxonomies for organising collections of documents. In Proceedings of the 24th International
Conference on Computational Linguistics (Coling 2012), pages 879-894, Mumbai, India,
December 2012.

Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEarley Information Science
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 

Último (20)

Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 

Recommendations for the automatic enrichment of digital library content using open source software, PATHS report

Personalised access to cultural heritage spaces

Recommendations for the automatic enrichment of DL content using open source software

Authors: Eneko Agirre and Arantxa Otegi

www.paths-project.eu

PATHS is funded by the European Commission FP7 programme under Digital Libraries and Digital Preservation

1 Introduction
2 Producing intra-collection links
2.1 Similarity links
2.2 Typed-similarity links
3 Producing background links
4 Ontology extension
References

1 Introduction
PATHS uses text processing software to enable the following functionality in the prototypes (see deliverables D2.1 and D2.2, http://paths-project.eu/eng/Resources):
● intra-collection links
● background links
● ontology extension

The aim of PATHS is to investigate how these functionalities can better support exploration by users. Most of the software is in-house and relatively complex. Releasing the software as open source was not in the Description of Work (DOW), and the exploitation plan goes in the software-as-a-service direction, where the processing is done on servers prepared by the partners. For instance, in a closely related Best Practice Network (LoCloud, http://www.locloud.eu/), servers for automatic content enrichment of digital library items are to be produced.

Alternatively, projects like OpeNER (http://www.opener-project.org/) are devoted to the release of open source tools similar to the ones used for text processing in PATHS. Releasing all PATHS software as open source out-of-the-box packages would be a major undertaking, on a par with those European projects.

The goal of this document is to present an overall set of recommendations for the automatic enrichment of Digital Library content using open source software. We think this would be useful for third parties who would like to offer similar services. Note that this is not a step-by-step guide for reimplementation, but an overall view of the required software and the programming effort involved.

This document is structured according to the enrichment tasks described in PATHS deliverables D2.1 and D2.2.

2 Producing intra-collection links
PATHS produced both generic similarity links and more specific typed-similarity links.

2.1 Similarity links
The target items would need to be processed with the following Natural Language Processing (NLP) tools: POS tagging and lemmatization. Several open source products exist, including the two used in PATHS: CoreNLP (http://nlp.stanford.edu/software/corenlp.shtml) and Freeling (http://nlp.lsi.upc.edu/freeling/). Regarding multilinguality, PATHS worked on Spanish and English texts. Freeling covers both, and CoreNLP works out-of-the-box for English. Recent projects like OpeNER also offer a suite of open source NLP tools covering, in addition to English and Spanish, four other European languages.

In addition, we used in-house scripts to process the internal representation of the items, extract the textual pieces, and produce the enriched representations.

The actual production of the similarity links requires additional software, which in this case was developed in-house. No out-of-the-box alternatives exist. Interested parties would need to replicate the software described in (Aletras et al. 2012).

As an alternative, the PATHS server demonstrating similarity links, described in (Agirre et al. 2013b), uses the functionality provided by a search engine (Solr, http://lucene.apache.org/solr/) to retrieve similar items. Note that this alternative does not require additional NLP tools, as the search engine applies its own stop-word list and stemming algorithm.

2.2 Typed-similarity links
The target items would need to be processed with the following NLP tools: POS tagging and lemmatization (see Section 2.1). In addition, we used in-house scripts to process the internal representation of the items, extract the textual pieces, and produce the enriched representations.

The actual production of the typed-similarity links requires additional software: open source machine learning software (Weka, http://www.cs.waikato.ac.nz/ml/weka/) and in-house scripts that extract features from items, train the machine learning models on the publicly available typed-similarity datasets produced by PATHS (http://ixa2.si.ehu.es/sts/), and apply the trained models to the target items. No out-of-the-box alternatives exist. Interested parties would need to replicate the software as described in (Agirre et al. 2013a).
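
To give a rough idea of the feature extraction step mentioned above, the following is a minimal Python sketch, not the PATHS implementation. It assumes that each item is a dictionary with hypothetical fields `title`, `description` and `subject`, and it computes one token-overlap (Jaccard) score per field for a pair of items. The field names, the tiny stop-word list and the choice of Jaccard overlap are illustrative assumptions; the features actually used are those described in (Agirre et al. 2013a), and the sketch works on surface word forms rather than lemmas.

```python
import re

# Hypothetical field names; real item records in a collection will differ.
FIELDS = ("title", "description", "subject")

# Tiny illustrative stop-word list (an assumption, not the list used in PATHS).
STOP_WORDS = {"a", "an", "and", "in", "of", "on", "the", "to", "with"}


def tokens(text):
    """Lower-case the text, split it into word tokens and drop stop words."""
    words = re.findall(r"\w+", text.lower())
    return {w for w in words if w not in STOP_WORDS}


def jaccard(a, b):
    """Jaccard overlap of two token sets; 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def field_features(item_a, item_b, fields=FIELDS):
    """Return one similarity score per field for a pair of items."""
    return [
        jaccard(tokens(item_a.get(f, "")), tokens(item_b.get(f, "")))
        for f in fields
    ]


if __name__ == "__main__":
    a = {"title": "Portrait of a lady",
         "description": "Oil painting on canvas",
         "subject": "portraits"}
    b = {"title": "Portrait of a gentleman",
         "description": "Oil on canvas",
         "subject": "portraits"}
    # One score per field, in the order given by FIELDS.
    print(field_features(a, b))
```

A feature vector of this kind, computed for every item pair in a training set, could then be passed to a regression model (in Weka or any other machine learning toolkit) to predict a similarity score for each type.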

3 Producing background links
In order to produce background links we used Wikipedia Miner, an open source package available at http://wikipedia-miner.cms.waikato.ac.nz. We used Wikipedia Miner out-of-the-box. In addition, we used in-house scripts to process the internal representation of the items, extract the textual pieces, and produce the enriched representations.

4 Ontology extension
The target items would need to be processed with the following NLP tools: POS tagging and lemmatization (see Section 2.1). In addition, we used in-house scripts to process the internal representation of the items, extract the textual pieces, and produce the enriched representations.

The actual production of the vocabulary requires additional software. We first extract the background links (see Section 3), and then find the most relevant Wikipedia articles per item. This is done globally, analysing the statistics of the whole collection. Those articles are used to categorize the items according to a Wikipedia-based category system, which is trimmed down to cover only the categories relevant to the collection at hand. No out-of-the-box alternatives exist. Interested parties would need to replicate the software as described in (Fernando et al. 2012).

References
Agirre E., Aletras N., Gonzalez-Agirre A., Rigau G., Stevenson M. (2013a). UBC_UOS-TYPED: Regression for typed-similarity. In The Second Joint Conference on Lexical and Computational Semantics (*SEM 2013).

Agirre E., Barrena A., Fernandez K., Miranda E., Otegi A., Soroa A. (2013b). PATHSenrich: a Web Service Prototype for Automatic Cultural Heritage Item Enrichment. In Research and Advanced Technology for Digital Libraries, International Conference on Theory and Practice of Digital Libraries, TPDL 2013, Valletta, Malta, September 22-26, 2013. Lecture Notes in Computer Science, vol. 8092, pp. 462-465.

Aletras N., Stevenson M., Clough P. (2012). Computing Similarity between Items in a Digital Library of Cultural Heritage. In ACM JOCCH.

Fernando S., Hall M., Agirre E., Soroa A., Clough P., Stevenson M. (2012). Comparing taxonomies for organising collections of documents. In Proceedings of the 24th International Conference on Computational Linguistics (Coling 2012), pp. 879-894, Mumbai, India, December 2012.