Presentation at EUROLAN'15 on a methodology for generating Linguistic Linked Open Data, applied to a particular case: the Apertium family of bilingual dictionaries.
Methodology for Linguistic Linked Open Data generation. The Apertium RDF case
“Methodology for Linguistic Linked Open Data
generation. The Apertium RDF case”
Jorge Gracia, Daniel Vila-Suero
jgracia, dvila@fi.upm.es
The Summer School on Linguistic Linked Open Data
12th EUROLAN School.
20th July 2015
2. 20/07/2015 2Jorge Gracia, Daniel Vila-Suero
Outline
Introduction
Methodology
Analysis of data sources
Modelling
URI/IRI design
RDF Generation
Publication
Traversing the Apertium RDF graph
Conclusions
Introduction
Current multilingual lexica and electronic dictionaries
• Proprietary formats
• Non-standard APIs
• Disconnected from other resources
Introduction
GOAL: to expose linguistic data contained in language
resources as Linked Data on the Web
Introduction
Different methods and guidelines available:
• LOD2
• Datalift
• W3C Linked Data cookbook
• W3C Best Practices for Linked Data
But… multilingualism and linguistic linked data
are not explicitly treated
Introduction
Guidelines for Multilingual Linked Data
Guidelines for LD generation of language resources
(at W3C BPMLOD community group)
• Bilingual dictionaries
• Multilingual dictionaries (BabelNet)
• WordNets
• Terminologies in TBX
D. Vila-Suero, A. Gómez-Pérez, E. Montiel-Ponsoda, J. Gracia, and
G. Aguado-de Cea, "Publishing Linked Data: the multilingual
dimension". Springer Berlin Heidelberg, Aug. 2014, pp. 101-118.
Introduction
Reference cards for Linguistic Linked Data
• How to publish Linguistic Linked Data
• Language Resource Licensing - ODRL Reference Card
• Inclusion in the LLOD Cloud
• Data ID
• Discovering Language Resources with Ling
• NIF corpus
• How to represent crosslingual links
• Documenting a language resource in Datahub
http://www.lider-project.eu/guidelines
Introduction
Motivating example:
the Apertium bilingual dictionaries
...although the methodology is general enough to
be applied to many other scenarios
Introduction
Apertium [http://www.apertium.org] is an open source
platform for Machine Translation. Its bilingual
dictionaries are available in XML.
Introduction
Afrikaans <-> Dutch
Breton --> French
Catalan <-> Italian
Welsh <-> English
Danish <-- Norwegian
English <-> Catalan
English <-> Spanish
English <-> Galician
Esperanto <-- Catalan
Esperanto <-> English
Esperanto <-- Spanish
Esperanto <-- French
Spanish <-> Aragonese
Spanish <-> Asturian
Spanish <-> Catalan
Spanish <-> Galician
Spanish <-> Italian
Spanish <-> Portuguese
Spanish <-> Romanian
Basque --> English
Basque --> Spanish
French <-> Catalan
French <-> Spanish
Serbo-Croatian <-> English
Serbo-Croatian <-> Macedonian
Serbo-Croatian <-> Slovenian
Indonesian <-> Malaysian
Icelandic <-> Swedish
Icelandic --> English
Kazakh <-> Tatar
Macedonian <-> Bulgarian
Macedonian --> English
Norwegian Nynorsk <-> Norwegian Bokmål
Occitan <-> Catalan
Occitan <-> Spanish
Portuguese <-> Catalan
Portuguese <-> Galician
Northern Sami --> Norwegian Bokmål
Swedish <-> Danish
……
More than 40 language pairs
22 of them (more stable) available in LMF
Main activities:
1. Analysis of data sources
2. Modelling
3. URI/IRI design
4. RDF Generation
5. Publication
Each activity is composed of several tasks
Analysis of data sources
The goal is to:
• Specify and analyse the data sources in order to
plan and manage the subsequent activities
• Main aspects to specify are:
– Format
– Identifiers structure
– Access methods: file, webservice, etc.
– Data models: Standards, terminologies, etc.
– Language representation: how languages are tagged,
represented, etc.
– License and provenance: existing license of data sources
Analysis of data sources EXAMPLE
Documentation of data source:
– Type of data: Bilingual dictionary (English and
Spanish)
– Data model: LMF (Lexical Markup Framework)
– Format: XML files
– License: GPL 3.0
– Provenance: Apertium EN-ES
– ….
Analysis of data sources EXAMPLE
<Lexicon>
  <feat att="language" val="en"/>
  ...
  <LexicalEntry id="bench-n-en">
    <feat att="partOfSpeech" val="n"/>
    <Lemma>
      <feat att="writtenForm" val="bench"/>
    </Lemma>
    <Sense id="bench_banco-n-l"/>
  </LexicalEntry>
  …
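As a rough illustration of the analysis step, the excerpt above can be inspected with Python's standard XML parser. This is only a sketch: the snippet is a well-formed copy of the excerpt (a closing </Lexicon> is added so it parses), and the helper names feat and read_lexicon are hypothetical, not part of any Apertium tooling.

```python
import xml.etree.ElementTree as ET

# Well-formed copy of the excerpt (closing </Lexicon> added so that it parses).
LMF_SNIPPET = """
<Lexicon>
  <feat att="language" val="en"/>
  <LexicalEntry id="bench-n-en">
    <feat att="partOfSpeech" val="n"/>
    <Lemma>
      <feat att="writtenForm" val="bench"/>
    </Lemma>
    <Sense id="bench_banco-n-l"/>
  </LexicalEntry>
</Lexicon>
"""


def feat(element, name):
    """Return the val of the direct <feat att=name> child, or None if absent."""
    for child in element.findall("feat"):
        if child.get("att") == name:
            return child.get("val")
    return None


def read_lexicon(xml_text):
    """Return (language, entries) for one <Lexicon> element."""
    lexicon = ET.fromstring(xml_text)
    entries = []
    for entry in lexicon.findall("LexicalEntry"):
        lemma = entry.find("Lemma")
        entries.append({
            "id": entry.get("id"),
            "pos": feat(entry, "partOfSpeech"),
            "written_form": feat(lemma, "writtenForm") if lemma is not None else None,
            "senses": [sense.get("id") for sense in entry.findall("Sense")],
        })
    return feat(lexicon, "language"), entries


language, entries = read_lexicon(LMF_SNIPPET)
print(language, entries)
```

Listing identifiers, language tags and part-of-speech values this way helps decide which source keys can be reused later when designing URIs.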
Modelling
Modelling tasks
1. Analysis and selection of domain vocabularies
2. Selection of vocabularies for representing
licensing, provenance and other metadata
3. Mapping of data sources and vocabularies
Modelling
Analysis of vocabularies
Use http://lov.okfn.org/
• NIF: NLP Interchange Format
• LexInfo
• Dublin Core
• DCAT
• PROV: W3C Provenance Ontology
• ODRL: Open Digital Rights Language
URI/IRI design
The goal is to:
• Define URI/IRI patterns and namespaces to be
used
• Ensure that LD best practices are followed
URI/IRI design
Some good practices…
1. Define namespace(s) (that you own or have control over).
2. Define how to create the ID of resources (reuse original data source keys if possible).
3. Define the structure of the URI space to organize the resources in different addresses and avoid collisions.
Useful guidance at:
• ISA: "Study on persistent URIs" (Archer et al.)
• Linked Data Patterns book (online): URI patterns
URI/IRI design
Following ISA recommendations:
http://{domain}/{type}/{concept}/{reference}
where:
{type}: one of a fixed set of resource types, for example
'id' or 'item' for real-world objects; 'doc' for documents
that describe those objects; 'def' for concepts; 'set' for
datasets
Archer, P., Goedertier, S., & Loutas, N. (2012). “Study on persistent
URIs”. Technical report
URI/IRI design EXAMPLE
Following ISA recommendations:
# Apertium English lexicon:
http://linguistic.linkeddata.es/id/apertium/lexiconEN
# Apertium Spanish lexicon:
http://linguistic.linkeddata.es/id/apertium/lexiconES
# Apertium English-Spanish translation set:
http://linguistic.linkeddata.es/id/apertium/tranSetEN-ES
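The pattern http://{domain}/{type}/{concept}/{reference} can be turned into a small URI-minting helper. The sketch below is an assumption about how one might do it, not the authors' tooling: the function names mint_uri, lexicon_uri and transset_uri are hypothetical, and the set of allowed types is taken from the values listed for {type}.

```python
from urllib.parse import quote

BASE = "http://linguistic.linkeddata.es"  # domain used in the examples above
ALLOWED_TYPES = {"id", "item", "doc", "def", "set"}  # values listed for {type}


def mint_uri(resource_type, concept, reference, domain=BASE):
    """Build http://{domain}/{type}/{concept}/{reference}, percent-encoding each part."""
    if resource_type not in ALLOWED_TYPES:
        raise ValueError(f"unknown resource type: {resource_type!r}")
    parts = [quote(part, safe="") for part in (resource_type, concept, reference)]
    return domain.rstrip("/") + "/" + "/".join(parts)


def lexicon_uri(lang):
    """URI of a monolingual lexicon, e.g. lexiconEN (hypothetical helper)."""
    return mint_uri("id", "apertium", "lexicon" + lang.upper())


def transset_uri(source_lang, target_lang):
    """URI of a translation set, e.g. tranSetEN-ES (hypothetical helper)."""
    return mint_uri("id", "apertium", f"tranSet{source_lang.upper()}-{target_lang.upper()}")


print(lexicon_uri("en"))
print(lexicon_uri("es"))
print(transset_uri("en", "es"))
```

Keeping all URI construction in one function makes it easier to avoid collisions and to change the pattern in a single place.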
RDF Generation
1. Selection, extension or development of
technologies for RDF generation
– Open Refine
– D2RQ
– XMLS
– …
2. Mapping of data sources to RDF
3. Transformation of data sources to RDF
RDF Generation EXAMPLE
Goal:
apertium:lexiconEN a lemon:Lexicon ;
dc:source <http://hdl.handle.net/10230/17110> .
...
apertium:lexiconEN lemon:entry apertium:lexiconEN/bench-n-en .
apertium:lexiconEN/bench-n-en a lemon:LexicalEntry ;
lemon:lexicalForm apertium:lexiconEN/bench-n-en-form ;
lexinfo:partOfSpeech lexinfo:noun .
apertium:lexiconEN/bench-n-en-form a lemon:Form ;
lemon:writtenRep "bench"@en .
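One possible way to produce triples like the ones above is a small script that serialises each lexical entry as N-Triples with full IRIs. This is a sketch under stated assumptions, not the tool chain used for Apertium RDF: the lemon namespace IRI and the POS_MAP table (Apertium tag to LexInfo term) are assumptions, and entry_to_ntriples is a hypothetical name.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
LEMON = "http://lemon-model.net/lemon#"  # assumed namespace IRI for the lemon prefix
LEXINFO = "http://www.lexinfo.net/ontology/2.0/lexinfo#"
APERTIUM = "http://linguistic.linkeddata.es/id/apertium/"

# Illustrative mapping from source part-of-speech tags to LexInfo terms (assumption).
POS_MAP = {"n": "noun", "adj": "adjective", "adv": "adverb"}


def iri(value):
    """Wrap an IRI in angle brackets, as N-Triples requires."""
    return f"<{value}>"


def literal(text, lang):
    """Serialise a language-tagged literal, escaping backslashes and quotes."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"@{lang}'


def entry_to_ntriples(lexicon_name, lang, entry_id, pos_tag, written_form):
    """Return the N-Triples lines describing one lexical entry and its form."""
    lexicon = APERTIUM + lexicon_name
    entry = f"{lexicon}/{entry_id}"
    form = entry + "-form"
    triples = [
        (iri(lexicon), iri(RDF_TYPE), iri(LEMON + "Lexicon")),
        (iri(lexicon), iri(LEMON + "entry"), iri(entry)),
        (iri(entry), iri(RDF_TYPE), iri(LEMON + "LexicalEntry")),
        (iri(entry), iri(LEMON + "lexicalForm"), iri(form)),
        (iri(entry), iri(LEXINFO + "partOfSpeech"), iri(LEXINFO + POS_MAP[pos_tag])),
        (iri(form), iri(RDF_TYPE), iri(LEMON + "Form")),
        (iri(form), iri(LEMON + "writtenRep"), literal(written_form, lang)),
    ]
    return [" ".join(triple) + " ." for triple in triples]


lines = entry_to_ntriples("lexiconEN", "en", "bench-n-en", "n", "bench")
print("\n".join(lines))
```

Writing full IRIs sidesteps prefix handling; a dedicated RDF library or a mapping tool such as those listed above would normally take care of serialisation details.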
Publication
The goal is to:
• Make the RDF dataset available following Linked
Data best practices
• Facilitate dataset discovery and consumption
Publication
Metadata definition using the previously selected vocabularies (DCAT: Data Catalog Vocabulary, DC, VOID, …)
1. Register dataset in Datahub
2. Extend generated DCAT file and link to Datahub’s one
3. Publish both data and metadata files
Publication
1. Add "rights" metadata in the dataset description (e.g., VoID, DCAT)
2. Use standard predicates to declare "rights" statements (e.g., Dublin Core terms: dc:rights, dct:license)
3. Standard license available?
3a. Yes: use the URI of the standard license, e.g., CC0
3b. No: use a rights declaration language, e.g., ODRL (Open Digital Rights Language)
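The decision above (is a standard license available or not?) could be encoded as a small helper that emits the rights triple for a dataset description. This is a minimal sketch with several assumptions: the dataset and policy URIs are placeholders, license_triple is a hypothetical name, and pointing dct:license at an ODRL policy URI in the "no standard license" case is one possible convention rather than something the slides prescribe.

```python
DCT_LICENSE = "http://purl.org/dc/terms/license"
CC0 = "http://creativecommons.org/publicdomain/zero/1.0/"


def license_triple(dataset_uri, standard_license_uri=None, odrl_policy_uri=None):
    """Return one N-Triples line declaring the rights of a dataset.

    If a standard license exists, its URI is used directly; otherwise the
    dataset points at a custom policy written in a rights language such as ODRL.
    """
    if standard_license_uri:
        target = standard_license_uri  # standard license available
    elif odrl_policy_uri:
        target = odrl_policy_uri  # no standard license: custom policy (assumed convention)
    else:
        raise ValueError("provide a standard license URI or an ODRL policy URI")
    return f"<{dataset_uri}> <{DCT_LICENSE}> <{target}> ."


# Placeholder URIs, for illustration only.
DATASET = "http://example.org/dataset/my-dictionary"
POLICY = "http://example.org/policy/my-dictionary-terms"

print(license_triple(DATASET, standard_license_uri=CC0))
print(license_triple(DATASET, odrl_policy_uri=POLICY))
```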
Publication
Dataset and vocabulary publication on the Web
[Diagram components: HTTP, LD frontend, SPARQL query language, SPARQL endpoint, SPARQL store]
Configuration file:
- Location of the RDF data
- Define access methods
- and even the presentation of the data
Publication EXAMPLE
• SPARQL endpoint
http://linguistic.linkeddata.es/apertium/sparql-editor/
• Web interface
http://linguistic.linkeddata.es/apertium/
• Datahub
http://datahub.io/dataset?q=apertium+rdf&organization=oeg-upm
Publication EXAMPLE
http://datahub.io/dataset/apertium-rdf-en-es
Publication: SOME TIPS
• Loading the RDF data into a SPARQL endpoint is not enough for publishing LD:
– Why? We provide a queryable repository, but the URIs are not dereferenceable
• We need a mechanism to make our URIs dereferenceable:
– Through a common web server (as files)
– Linked Data front-ends:
• Pubby
• More sophisticated: LD APIs (Puelia, Elda)
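A quick way to check whether a URI is dereferenceable is to request it over HTTP while asking for an RDF media type. The sketch below only prepares such a request with the standard library; build_request and dereference are hypothetical names, and whether a given server actually answers text/turtle is an assumption to verify against the deployment.

```python
import urllib.request


def build_request(uri, media_type="text/turtle"):
    """Prepare (without sending) a GET request that asks for RDF rather than HTML."""
    return urllib.request.Request(uri, headers={"Accept": media_type})


def dereference(uri, media_type="text/turtle", timeout=10):
    """Send the request and return (final URL after redirects, Content-Type, body)."""
    with urllib.request.urlopen(build_request(uri, media_type), timeout=timeout) as response:
        return response.geturl(), response.headers.get("Content-Type"), response.read()


request = build_request("http://linguistic.linkeddata.es/id/apertium/lexiconEN")
print(request.get_method(), request.full_url, request.get_header("Accept"))
```

Calling dereference needs network access. With a Linked Data front-end in place, the final URL typically differs from the requested one because the resource URI redirects to a document that describes it.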
Apertium RDF
Direct translations for “bank”@en
Translated written repr. Part of Speech
"banc"@ca http://www.lexinfo.net/ontology/2.0/lexinfo#noun
"riba"@ca http://www.lexinfo.net/ontology/2.0/lexinfo#noun
"banco"@es http://www.lexinfo.net/ontology/2.0/lexinfo#noun
"orilla"@es http://www.lexinfo.net/ontology/2.0/lexinfo#noun
"ribera"@es http://www.lexinfo.net/ontology/2.0/lexinfo#noun
"beira"@gl http://www.lexinfo.net/ontology/2.0/lexinfo#noun
"banco"@gl http://www.lexinfo.net/ontology/2.0/lexinfo#noun
"ourela"@gl http://www.lexinfo.net/ontology/2.0/lexinfo#noun
"orela"@gl http://www.lexinfo.net/ontology/2.0/lexinfo#noun
"banku"@eu http://www.lexinfo.net/ontology/2.0/lexinfo#noun
"erribera"@eu http://www.lexinfo.net/ontology/2.0/lexinfo#noun
"ertz"@eu http://www.lexinfo.net/ontology/2.0/lexinfo#noun
"amuntegar"@ca http://www.lexinfo.net/ontology/2.0/lexinfo#verb
"agolpar"@es http://www.lexinfo.net/ontology/2.0/lexinfo#verb
"amontonar"@es http://www.lexinfo.net/ontology/2.0/lexinfo#verb
"apelotonar"@es http://www.lexinfo.net/ontology/2.0/lexinfo#verb
"hacinar"@es http://www.lexinfo.net/ontology/2.0/lexinfo#verb
.... ...
Apertium RDF
[Diagram: each Apertium LMF bilingual dictionary (EN-ES, EN-CA) becomes, in Apertium RDF, two monolingual lexicons (Lexicon EN and Lexicon ES; Lexicon EN and Lexicon CA) plus a translation set (Translation Set EN-ES; Translation Set EN-CA)]
Apertium RDF
How to measure confidence
[Diagram: translation graph linking "banco", "orilla" and "ribera" (LexiconES), "bank" and "bench" (LexiconEN), and "banc" and "riba" (LexiconCA)]
Apertium RDF
One time inverse consultation (OTIC)
Given a lexical entry s:
1. Get the direct translations of s in the pivot language (P_s)
2. ∀ p ∈ P_s, get its translations in the target language (T_p)
3. For every t ∈ T_p,
(a) get its set of translations in the pivot language (P_t)
(b) compute the score for t:
\[ \mathrm{score}(t) = \frac{2 \cdot |P_s \cap P_t|}{|P_s| + |P_t|} \]
Tanaka, K., & Umemura, K. (1994). “Construction of a bilingual dictionary intermediated by a third language”. In COLING, pp. 297–303.
[Diagram: the same translation graph, with LexiconEN as the pivot between LexiconES and LexiconCA]
s = “banco”@es
P_banco = {“bank”@en, “bench”@en}
T_bank = {“banc”@ca, “riba”@ca}
T_bench = {“banc”@ca}
P_banc = {“bank”@en, “bench”@en}
P_riba = {“bank”@en}
score(“banc”@ca) = 1.0
score(“riba”@ca) = 2·1 / (2 + 1) ≈ 0.67
Apertium RDF
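The OTIC steps can be sketched in a few lines of Python over the translation sets of this example, scoring each candidate t with 2·|P_s ∩ P_t| / (|P_s| + |P_t|). The pair lists below simply restate the sets given above; the function names index and otic are hypothetical, and a real implementation would fetch the translations from the RDF graph instead of hard-coded lists.

```python
from collections import defaultdict

# Direct translations from the example, as (left, right) pairs.
ES_EN = [("banco", "bank"), ("banco", "bench")]
EN_CA = [("bank", "banc"), ("bank", "riba"), ("bench", "banc")]


def index(pairs):
    """Build lookup tables in both directions from (left, right) pairs."""
    forward, backward = defaultdict(set), defaultdict(set)
    for left, right in pairs:
        forward[left].add(right)
        backward[right].add(left)
    return forward, backward


def otic(source, source_to_pivot, pivot_to_target, target_to_pivot):
    """Score every target-language candidate reachable from source via the pivot."""
    p_s = source_to_pivot.get(source, set())  # step 1: pivot translations of s
    candidates = set()
    for p in p_s:  # step 2: target translations of each pivot word
        candidates |= pivot_to_target.get(p, set())
    scores = {}
    for t in candidates:  # step 3: inverse consultation and scoring
        p_t = target_to_pivot.get(t, set())
        scores[t] = 2 * len(p_s & p_t) / (len(p_s) + len(p_t))
    return scores


es_to_en, _ = index(ES_EN)
en_to_ca, ca_to_en = index(EN_CA)
scores = otic("banco", es_to_en, en_to_ca, ca_to_en)
for candidate, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{candidate}: {score:.2f}")
```

A candidate reached through every pivot translation of the source word gets the maximum score of 1; candidates sharing fewer pivot words score lower.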
Apertium RDF
Linking Apertium to external resources
Around 270,000 links between Apertium RDF and BabelNet
Apertium RDF
Translations for “bank”@en
Translated written repr. | BabelSynset | BabelNet gloss
"banco"@es | http://babelnet.org/rdf/s00008371n | “A building in which the business of banking transacted”
"banco"@es | http://babelnet.org/rdf/s00008366n | “An arrangement of similar objects in a row or in tiers”
"banco"@es | http://babelnet.org/rdf/s15346085n | “An ocean bank, sometimes referred to as a fishing bank or simply bank, ...”
… | … | …
"orilla"@es | http://babelnet.org/rdf/s00008363n | “Sloping land (especially the slope beside a body of water)”
"ribera"@es | http://babelnet.org/rdf/s00008363n | “Sloping land (especially the slope beside a body of water)”
Conclusions
• Methodology, guidelines, and reference cards
for LLOD generation
• Exemplified with the Apertium RDF case
– Apertium data on the Web following SW standards
– Common entry point for all the dictionaries
– Direct and indirect translations can be easily
obtained via SPARQL
– Linked with BabelNet
Further readings
J. Gracia, "Multilingual dictionaries and the web of data,"
Kernerman Dictionaries News, no. 23, pp. 1-4, Jun. 2015.
J. Gracia, D. Vila-Suero, J. McCrae, T. Flati, C. Baron, and
M. Dojchinovski, "Language resources and linked data: A practical
perspective," in Proc. of Knowledge Engineering and Knowledge
Management (EKAW'14), ser. Lecture Notes in Computer Science,
Springer International Publishing, Nov. 2014, vol. 8982, pp. 3-17.
J. Gracia, E. Montiel-Ponsoda, D. Vila-Suero, and G. Aguado-de Cea,
"Enabling language resources to expose translations as linked data on
the web," in Proc. of 9th Language Resources and Evaluation
Conference (LREC'14), Reykjavik (Iceland). European Language
Resources Association (ELRA), May 2014, pp. 409-413
J. Gracia, E. Montiel-Ponsoda, P. Cimiano, A. Gómez-Pérez,
P. Buitelaar, and J. McCrae, "Challenges for the multilingual web of
data," Journal of Web Semantics, vol. 11, pp. 63-71, Mar. 2012.