Linked Data for Abbreviations and Segmentation

ULI meeting – 2013/05/28 – Page 1 http://lod2.eu
Creating Knowledge out of Interlinked Data
LOD2 Presentation . 02.09.2010 . Page http://lod2.eu
AKSW, Universität Leipzig
Sebastian Hellmann
Linked Data for Abbreviations and Segmentation
http://nlp2rdf.org
http://lod2.eu
http://slideshare.net/kurzum

Sebastian Hellmann – researcher working on LOD2 EU Project
AKSW – Agile Knowledge Engineering and Semantic Web research group in Leipzig - http://aksw.org
InfAI – Institute for Applied Informatics - http://infai.org
Contents:
• Introduction to Linked Data
• Linked data close-up: DBpedia data set
• Exploitation of free and open data for CLDR
• Collaboration points
Introduction

http://lod-cloud.net

Linked Open Data
- All datasets provide open access to individual records via HTTP
- Many are free (no payment required, as in royalty-free)
- Some are openly licensed, e.g. CC-0 or CC-BY-SA
=> Open access also applies to published HTML on the WWW, but here the data
itself is published unrendered via RDF
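
To illustrate the difference between rendered HTML and unrendered RDF, here is a minimal sketch of an HTTP request that asks a Linked Data server for RDF through content negotiation. The helper name `rdf_request` and the choice of `text/turtle` as the media type are assumptions made for this example; the request is only built, not sent.

```python
from urllib.request import Request


def rdf_request(resource_uri, media_type="text/turtle"):
    """Build (but do not send) an HTTP GET that asks for RDF instead of HTML."""
    # The Accept header tells the server which representation the client wants.
    return Request(resource_uri, headers={"Accept": media_type})


req = rdf_request("http://dbpedia.org/resource/Berlin")
print(req.get_method(), req.full_url, req.get_header("Accept"))
```

Passing `req` to `urllib.request.urlopen` would perform the actual lookup; a browser sending `Accept: text/html` to the same URI would be served the rendered page instead.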

http://dbpedia.org

• DBpedia is a crowd-sourced community effort to extract structured
information from Wikipedia and make this information available on the
Web.
• allows sophisticated queries against Wikipedia content
• allows links from the different data sets on the Web to Wikipedia data
• data is extracted continuously: http://live.dbpedia.org
• Wikidata will be integrated within the next four months via a Google Summer of Code project
http://dbpedia.org
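
As an illustration of querying Wikipedia content through DBpedia, the following sketch builds a SPARQL request URL that asks for the labels of Berlin. The query itself, the helper name `sparql_url`, and the `format` parameter (a convention of the server software behind the public endpoint, not part of the SPARQL standard) are assumptions for this example; nothing is sent over the network.

```python
from urllib.parse import urlencode

ENDPOINT = "http://dbpedia.org/sparql"  # DBpedia's public SPARQL endpoint

# Example query (an assumption for illustration): all labels of Berlin.
QUERY = """
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?label WHERE {
  <http://dbpedia.org/resource/Berlin> rdfs:label ?label .
}
"""


def sparql_url(query, endpoint=ENDPOINT):
    """Return a GET URL that carries the query as a URL-encoded parameter."""
    params = {"query": query, "format": "application/sparql-results+json"}
    return endpoint + "?" + urlencode(params)


url = sparql_url(QUERY)
print(url)
```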

http://dbpedia.org/resource/Berlin
[Annotated screenshot of the resource page, highlighting: first paragraph in more than 20 languages; facts from Wikipedia infoboxes; several hierarchical classifications; links; multilingual labels]

Trend 1: I18n

• DBpedia Extraction Framework can be extended to easily extract any data
from Wikipedia: https://github.com/dbpedia/extraction-framework
• We are using it to extract corpora for NLP
• e.g. URI, surrounding text, surface form
• Probabilities:
• P(URI|sf): probability that the surface form “apple” refers to wikipedia:Apple_Inc.
• P(sf|URI): probability that wikipedia:Apple_Inc. appears as “apple” in text
Trend 2: DBpedia 4 NLP
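
A minimal sketch of how the two conditional probabilities could be estimated from extracted (URI, surface form) pairs by relative frequency. The occurrence list is made up for illustration (including the second URI `wikipedia:Apple`), and the function names are hypothetical; this is not the extraction framework's actual code.

```python
from collections import Counter

# Hypothetical (URI, surface form) occurrences, as they might be extracted
# from link anchors in Wikipedia text.
occurrences = [
    ("wikipedia:Apple_Inc.", "apple"),
    ("wikipedia:Apple_Inc.", "apple"),
    ("wikipedia:Apple_Inc.", "Apple Inc."),
    ("wikipedia:Apple", "apple"),
    ("wikipedia:Apple", "apple"),
]

pair_counts = Counter(occurrences)
uri_counts = Counter(uri for uri, _ in occurrences)
sf_counts = Counter(sf for _, sf in occurrences)


def p_sf_given_uri(sf, uri):
    """P(sf|URI): how often this URI is written with this surface form."""
    return pair_counts[(uri, sf)] / uri_counts[uri]


def p_uri_given_sf(uri, sf):
    """P(URI|sf): how often this surface form refers to this URI."""
    return pair_counts[(uri, sf)] / sf_counts[sf]


print(p_sf_given_uri("apple", "wikipedia:Apple_Inc."))
print(p_uri_given_sf("wikipedia:Apple_Inc.", "apple"))
```

The two values differ because they are normalized by different totals: the first by all mentions of the URI, the second by all occurrences of the surface form.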

• DBpedia is a data dissemination project:
• As download for reuse
• As Linked Data for interlinking
• Corpora will be published via the NLP Interchange RDF Format (NIF) -
http://nlp2rdf.org
Trend 2: DBpedia 4 NLP

DBpedia Live Abbreviation Example
Up-to-date gazetteer
- The AfD party was founded earlier this year.
- lexical information and statistics could be included
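
To show why an up-to-date abbreviation gazetteer matters for segmentation, here is a minimal sketch of a sentence splitter that refuses to break after a known abbreviation. The abbreviation set is illustrative only (in practice it would be fed from extracted data), and the function name `split_sentences` is hypothetical; real segmentation rules are considerably more involved.

```python
import re

# Illustrative entries only; a real gazetteer would be extracted from data.
ABBREVIATIONS = {"dr.", "e.g.", "etc."}


def split_sentences(text, abbreviations=ABBREVIATIONS):
    """Split on a period followed by whitespace, unless the period ends an abbreviation."""
    sentences, start = [], 0
    for match in re.finditer(r"\.\s+", text):
        # Last whitespace-delimited token up to and including this period.
        token = text[start:match.start() + 1].split()[-1].lower()
        if token in abbreviations:
            continue  # the period belongs to an abbreviation, not a sentence end
        sentences.append(text[start:match.start() + 1])
        start = match.end()
    if start < len(text):
        sentences.append(text[start:])
    return sentences


print(split_sentences("Dr. Smith arrived. He was late."))
```

Without the gazetteer entry for "Dr.", the same input would be split into three segments instead of two.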

Linguistic LOD Cloud

• DBpedia
• Main version and I18n chapters
• http://dbpedia.org/Datasets/NLP
• Wiktionary 2 RDF: http://dbpedia.org/Wiktionary
• Wortschatz from Uni Leipzig (planned as Linked Data)
• http://corpora.informatik.uni-leipzig.de/download.html
• JRC Names: http://langtech.jrc.it/JRC-Names.html
• JRC-Names is a highly multilingual named entity resource for person and
organisation names
• Lexvo.org:
• provides URIs for ISO 639-3
• http://lexvo.org/id/iso639-3/spa
Example data sets from LLOD

http://linguistics.okfn.org/resources/llod/
=> CLDR will make an excellent addition to LLOD
Linguistic LOD

• CLDR as Linked Data
• empowers third parties to link to your authoritative data
• links are reusable
• LIDER EU project (presumably starting in October) will provide some
support for linked data adopters
• ULI members can join the industry and advisory board
• Workshop “DBpedia & NLP” in Oct, 2013
• http://nlp-dbpedia2013.blogs.aksw.org/
• Creation of free and open benchmarks in RDF
• We could promote CLDR and collect contributions
Collaboration points I

• Personally, I can:
• Join ULI mailing list
• Look out for appropriate data
• Look for opportunities (e.g. synergies with other projects)
• Provide some counseling (e.g. pointers, technology Q&A)
=> this will be done as preparation for the LIDER EU project, CLDR
• Academic collaboration:
• Excellent PhD student topic: Create corpora, interlink and fuse data and
benchmark effectiveness for segmentation
• Provide knowledge transfer (e.g. tutorials, visits)
Collaboration points II

Open Community – All feedback is welcome!
http://slideshare.net/kurzum
Websites:
http://dbpedia.org
http://nlp2rdf.org
http://lod2.eu
Thanks for your attention

Wiktionary Example

LOD2 EU Project produces LOD2 Stack.
Three requirements to unlock Natural Language Processing (NLP) for the project:
1. NLP tool output is required to be in RDF
2. Scalability (fewer triples, focus on usefulness)
3. Common vocabulary to integrate and use NLP tools
The NLP Interchange Format (NIF) is an RDF/OWL-based format that aims to
achieve interoperability between Natural Language Processing (NLP) tools,
language resources and annotations.
• Version 1.0 published in November 2011
• Version 2.0 is scheduled for completion within 2013
NLP Interchange Format 2.0
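
To give a rough idea of what NLP output in RDF can look like, the following sketch serializes one annotated text span as Turtle. The class and property names (`nif:Context`, `nif:RFC5147String`, `nif:isString`, `nif:referenceContext`, `nif:beginIndex`, `nif:endIndex`, `nif:anchorOf`) and the namespace are taken from the NIF 2.0 core ontology as later published, not from these slides; the document URI, the sample text and the helper name `nif_turtle` are assumptions for illustration.

```python
import json

NIF = "http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core#"
XSD = "http://www.w3.org/2001/XMLSchema#"


def nif_turtle(doc_uri, text, begin, end):
    """Describe text[begin:end] of a document as NIF-style Turtle."""
    context = f"<{doc_uri}#char=0,{len(text)}>"
    span = f"<{doc_uri}#char={begin},{end}>"
    # Simplification: JSON string escaping stands in for Turtle literal escaping.
    lit = json.dumps
    return "\n".join([
        f"@prefix nif: <{NIF}> .",
        f"@prefix xsd: <{XSD}> .",
        "",
        f"{context} a nif:Context ;",
        f"    nif:isString {lit(text)} .",
        "",
        f"{span} a nif:RFC5147String ;",
        f"    nif:referenceContext {context} ;",
        f'    nif:beginIndex "{begin}"^^xsd:nonNegativeInteger ;',
        f'    nif:endIndex "{end}"^^xsd:nonNegativeInteger ;',
        f"    nif:anchorOf {lit(text[begin:end])} .",
    ])


doc = nif_turtle("http://example.org/doc", "Linked Data on the Semantic Web", 19, 31)
print(doc)
```

Each annotated span costs only a handful of triples here, which is in the spirit of the scalability requirement.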

NIF Architecture

Addressing Primary Data

Addressing Primary Data
NIF 1.0: http://www.w3.org/DesignIssues/LinkedData.html#offset_717_729
NIF 2.0 uses RFC 5147:
http://www.w3.org/DesignIssues/LinkedData.html#char=717,729
User extensions possible:
http://www.w3.org/DesignIssues/LinkedData.html#your_own_scheme
(but you have to link to documentation on how it was created)
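
The two URI schemes can be sketched as small helpers that build a fragment URI from character offsets and resolve an RFC 5147 `char=` range back to the text it addresses. The function names and the sample text are hypothetical; the sketch assumes begin/end offsets that count positions between characters starting at 0, which matches Python slicing, and it handles only the plain `char=begin,end` form.

```python
import re


def nif1_uri(doc_uri, begin, end):
    """NIF 1.0 offset scheme, e.g. ...#offset_717_729."""
    return f"{doc_uri}#offset_{begin}_{end}"


def nif2_uri(doc_uri, begin, end):
    """NIF 2.0 scheme based on RFC 5147, e.g. ...#char=717,729."""
    return f"{doc_uri}#char={begin},{end}"


def resolve_char_fragment(uri, text):
    """Return the substring of `text` addressed by a #char=begin,end fragment."""
    fragment = uri.split("#", 1)[1] if "#" in uri else ""
    match = re.fullmatch(r"char=(\d+),(\d+)", fragment)
    if match is None:
        raise ValueError(f"not a char=begin,end fragment: {fragment!r}")
    begin, end = map(int, match.groups())
    return text[begin:end]


page = "http://www.w3.org/DesignIssues/LinkedData.html"
print(nif1_uri(page, 717, 729))
print(nif2_uri(page, 717, 729))

sample = "Linked Data on the Semantic Web"  # stand-in text, not the real page
print(resolve_char_fragment("http://example.org/doc#char=19,31", sample))
```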

As a Web Service
curl \
  --data-urlencode prefix="http://prefix.given.by/theClient#" \
  --data-urlencode input="[...]" \
  http://nlp2rdf.lod2.eu/demo/NIFStanfordCore
(optional parameter: --data-urlencode source="http://www.w3.org/DesignIssues/LinkedData.html")
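
For clients that do not shell out to curl, the same call can be sketched with Python's standard library. The parameter names `prefix`, `input` and `source` come from the slide; the helper name `nif_request` and the sample sentence are assumptions, and the request is only constructed here because the availability of the demo service is not guaranteed.

```python
from urllib.parse import urlencode
from urllib.request import Request

SERVICE = "http://nlp2rdf.lod2.eu/demo/NIFStanfordCore"


def nif_request(text, prefix, source=None):
    """Build (but do not send) the form-encoded POST for the NIF web service."""
    params = {"prefix": prefix, "input": text}
    if source is not None:
        params["source"] = source  # optional, as on the slide
    body = urlencode(params).encode("utf-8")
    # Supplying `data` makes urllib issue a POST, like curl's --data-urlencode.
    return Request(SERVICE, data=body)


req = nif_request("Berlin is a city.", "http://prefix.given.by/theClient#")
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
```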

• Tibeto-Burman languages: http://purl.org/olia/tibet.owl#VNst
• Russian TreeTagger :
http://purl.org/olia/russ.owl#partizip_prt_sg_neut_passiv_gen_langform
• German STTS: http://purl.org/olia/stts.owl#VAPP
• English Penn: http://purl.org/olia/penn.owl#VBG
→ all map to http://purl.org/olia/olia.owl#NonFiniteVerb
Ontologies of Linguistic Annotation (OLiA) contain mappings for over 50 tagsets (free and open, CC-BY)
Vocabulary Module: OLiA
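
The benefit of the common reference category can be sketched with a flat lookup table built from the four tag URIs listed on the slide. In OLiA itself the mapping is expressed through ontology links rather than a dictionary, so this is a deliberate simplification; the function name and the token list are hypothetical.

```python
OLIA = "http://purl.org/olia/olia.owl#"

# Tagset-specific tag URI -> OLiA reference category (the four slide examples).
TAG_TO_OLIA = {
    "http://purl.org/olia/tibet.owl#VNst": OLIA + "NonFiniteVerb",
    "http://purl.org/olia/russ.owl#partizip_prt_sg_neut_passiv_gen_langform": OLIA + "NonFiniteVerb",
    "http://purl.org/olia/stts.owl#VAPP": OLIA + "NonFiniteVerb",
    "http://purl.org/olia/penn.owl#VBG": OLIA + "NonFiniteVerb",
}


def is_non_finite_verb(tag_uri):
    """True if the tagset-specific tag maps to olia:NonFiniteVerb."""
    return TAG_TO_OLIA.get(tag_uri) == OLIA + "NonFiniteVerb"


# Hypothetical tokens tagged by tools that use different tagsets.
tokens = [
    ("gewesen", "http://purl.org/olia/stts.owl#VAPP"),
    ("running", "http://purl.org/olia/penn.owl#VBG"),
    ("house", "http://purl.org/olia/penn.owl#NN"),
]
print([word for word, tag in tokens if is_non_finite_verb(tag)])
```

A single tagset-independent check then works across German and English output alike.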

• NIF 2.0 tries to be compatible with (Vocabulary Module):
• ITS 2.0
• FISE used in Apache Stanbol (IKS-EU Project)
• LAF/GrAF XML – ISO standard, recently published
• Fragment Identifiers by IETF and W3C
• Lemon ontology from Monnet EU Project
• NERD ontology from EURECOM and LinkedTV EU Project
• XPointer/XPath URI scheme
• Open Annotation
NIF 2.0 - plans

NIF 2.0 :
• NIF is free and open (CC-0 or CC-BY)
• All ontologies will be hosted persistently by Universität Leipzig
• Sign up on the mailing list at http://nlp2rdf.org
• Provide Use Cases, Requirements, Implementations at:
• http://wiki.nlp2rdf.org/wiki/Use_cases#Use_cases
• http://wiki.nlp2rdf.org/wiki/Requirements#Requirements
How you can contribute:

LOD2 Stack
• The project is currently at half-time
• Most of the tools are free and open source
• Commercial rollout planned
• Many webinars available
• You can integrate your tool via Debian package
http://lod2.eu
http://stack.lod2.eu/
How you can contribute:
