1. Creating Knowledge out of Interlinked Data
http://lod2.eu
ISWC – 2013/10/23 – Page 1
Integrating NLP using Linked Data
Sebastian Hellmann, Jens Lehmann, Sören Auer and Martin Brümmer
http://slideshare.net/kurzum
http://nlp2rdf.org
AKSW, Universität Leipzig
3. Introduction
Core problems in integrating NLP:
1. Too much heterogeneity
2. Almost no open standards available
3. Lack of open collaboration
4. Difficult and large domain
4. Problem analysis
Hardly any reusability in NLP
• Free software (as in free beer), but no open licenses
• Few standards and few mappings
• Integration is hard-wired (you have to write software)
– for each tool, for each framework
Main benefits of using RDF, OWL and Linked Data are:
• lower entry barrier (as a client / user)
• easy data integration (linking, mapping)
• reusability of tools and conceptualisations (ontologies)
• off-the-shelf solutions for common tasks
7. NLP2RDF project
NLP2RDF (http://nlp2rdf.org)
- community project bootstrapped by LOD2
- develops NLP Interchange Format (NIF)
- umbrella project to combine (and consolidate) existing work
8. NIF Overview
The NLP Interchange Format (NIF) is an RDF/OWL-based format that aims to
achieve interoperability between Natural Language Processing (NLP) tools,
language resources and annotations.
→ to create an ecosystem of interoperable web services
9. NIF Overview
• Reuse of existing standards such as RDF, OWL2, the PROV Ontology, LAF (ISO 24612), Unicode and RFC 5147
• Standardize access parameters, annotations (e.g. tokenization), validation and log messages
• Reuse of existing ontologies:
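To make the role of RFC 5147 in the list above more concrete, here is a minimal Python sketch of how a NIF-style annotation could address a substring by character offsets. The namespace and the names `RFC5147String`, `referenceContext`, `beginIndex`, `endIndex` and `anchorOf` follow the NIF 2.0 core ontology as far as we know it. The document URI, the sample sentence and the helpers `char_uri`, `turtle_literal` and `nif_string_turtle` are made up for illustration, and counting offsets in Python code points is an assumption.

```python
# Sketch only: emits Turtle for one offset-addressed string via plain string formatting.
NIF = "http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core#"
XSD = "http://www.w3.org/2001/XMLSchema#"


def char_uri(doc_uri: str, begin: int, end: int) -> str:
    """Return an offset-based fragment URI in the RFC 5147 style (char=begin,end)."""
    return f"{doc_uri}#char={begin},{end}"


def turtle_literal(value: str) -> str:
    """Escape a Python string as a short Turtle string literal."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
    return f'"{escaped}"'


def nif_string_turtle(doc_uri: str, text: str, begin: int, end: int) -> str:
    """Describe text[begin:end] as a string anchored in its surrounding context."""
    if not 0 <= begin <= end <= len(text):
        raise ValueError("offsets must lie within the text")
    context = char_uri(doc_uri, 0, len(text))   # the whole document text
    subject = char_uri(doc_uri, begin, end)     # the annotated substring
    index_type = f"^^<{XSD}nonNegativeInteger>"
    return "\n".join([
        f"<{subject}>",
        f"    a <{NIF}RFC5147String> ;",
        f"    <{NIF}referenceContext> <{context}> ;",
        f'    <{NIF}beginIndex> "{begin}"{index_type} ;',
        f'    <{NIF}endIndex> "{end}"{index_type} ;',
        f"    <{NIF}anchorOf> {turtle_literal(text[begin:end])} .",
    ])


text = "Leipzig is a city in Germany."  # hypothetical input document
print(nif_string_turtle("http://example.org/doc1", text, 0, 7))
```

Because the subject URI is derived only from the document URI and the offsets, two independent tools annotating the same span should arrive at the same URI, which is what makes merging their output a plain RDF union.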
10. Example NIF Workflow
A NIF workflow, however, obviously cannot provide better performance (F-measure, speed) than a properly configured UIMA or GATE pipeline with the same components.
11. Use Cases
• Internationalization Tag Set 2.0
• Part of Speech Tagging
• Wikifier API access via RDFaCE (Entity Linking)
12. UC1 - Internationalization Tag Set 2.0
• NIF will be the recommended RDF conversion of the W3C Internationalization Tag Set 2.0 (ITS 2.0) - http://www.w3.org/TR/its20/
• NIF turns out to have a unique selling proposition regarding NLP and RDF
• There was no suitable alternative RDF vocabulary available for this conversion.
17. UC3 – Wikifier API access via RDFaCE
https://github.com/dbpedia-spotlight/dbpedia-spotlight/wiki
18. UC3 - Wikifier API access via RDFaCE
http://rdface.aksw.org/
20. Evaluation
Please see the paper!
1) Quantitative Analysis with Google Wikilinks Corpus as NIF RDF
• Crawl of 3 million web sites, 40 million Wikipedia links
• ~ 477 million triples in NIF
2) Questionnaire and Developers Study for NIF 1.0
• NIF 1.0 was released in September 2009
• Over 30 known implementations (22 not from authors)
• 14 developers participated in the study
• Minimal NIF implementation requires less than 500 LoC
3) Qualitative Comparison with other Frameworks and Formats
21. State of NIF 2.0
Corpora as Linked Data
• Wikilinks corpus - http://wiki-link.nlp2rdf.org
• KORE 50 - http://www.yovisto.com/labs/ner-benchmarks/
• DBpedia Spotlight dataset
Tools
• entityclassifier.eu – http://entityclassifier.eu
• Spotlight - http://spotlight.dbpedia.org
• OpenNLP
• Stanford CoreNLP - https://github.com/NLP2RDF/software
• Validator - https://github.com/NLP2RDF/software
22. State of NIF 2.0
• Rollout is in progress
• Distributed implementation at different speeds and levels of quality
• Software lifecycle:
– Implementation
– Testing/Validation
– Integration in the main software
– Deployment as a web service
• Hosted web services are often not up to date, while the code base is
23. How to join - http://nlp2rdf.org
24. For ontology creators
NLP2RDF provides infrastructure for your NLP ontologies
• Redundant, persistent hosting
• Maven packages
• Code and documentation generation
• Continuous Integration (planned)
• Indexing
• Validation of instance data
Please write to me or the mailing list
nlp2rdf@lists.informatik.uni-leipzig.de
25. Take home message
• Early industrial uptake
– OpenLink, Vistatech.ie, Zemanta, Tenforce, Unister
– ITS 2.0 W3C standard was driven by localization industry
• NIF is open and free (CC0 planned)
• NIF is designed to be a cost-saver
– Not primarily aimed at increasing features or performance (F-Measure)
26. Thanks for your attention
Open Community – All feedback is welcome!
http://slideshare.net/kurzum
Websites:
http://nlp2rdf.org
http://lod2.eu
30. Unicode Normal Form C
• Recommendation for RDF Literals
• http://unicode.org/reports/tr15/#Norm_Forms
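Why the normal form matters for offset-based annotation can be shown with a short Python sketch. It uses the standard library function `unicodedata.normalize`; the helper names `to_nfc` and `span_of` and the sample string are only illustrative. The same visible text has different lengths, and therefore different offsets, depending on whether it is composed or decomposed.

```python
import unicodedata
from typing import Tuple


def to_nfc(text: str) -> str:
    """Normalize text to Unicode Normal Form C before counting offsets."""
    return unicodedata.normalize("NFC", text)


def span_of(text: str, needle: str) -> Tuple[int, int]:
    """Return (begin, end) code point offsets of the first occurrence of needle."""
    begin = text.index(needle)
    return begin, begin + len(needle)


decomposed = "Bru\u0308mmer"     # 'u' followed by COMBINING DIAERESIS
composed = to_nfc(decomposed)    # single precomposed character U+00FC

print(len(decomposed), len(composed))                         # 8 7
print(span_of(decomposed, "mmer"), span_of(composed, "mmer")) # (4, 8) (3, 7)
```

If producer and consumer do not agree on the normal form, offsets computed by one side point at the wrong characters on the other side.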
31. Tokenization
Christian Chiarcos, Julia Ritz, Manfred Stede: By all these lovely tokens... Merging conflicting tokenizations.
Language Resources and Evaluation 46(1): 53-74 (2012)
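When two tools tokenize the same text differently, character offsets make the conflict explicit. One simple way to reconcile them, which is not necessarily the approach of the cited paper, is to split every token at all boundaries that occur in either tokenization. The following Python sketch does that; the function name `merge_tokenizations` and the sample spans are hypothetical.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (begin, end) character offsets, end exclusive


def merge_tokenizations(a: List[Span], b: List[Span]) -> List[Span]:
    """Return the finest segmentation that respects every boundary in a and b."""
    spans = a + b
    boundaries = sorted({offset for span in spans for offset in span})
    merged = set()
    for begin, end in spans:
        # All known boundaries that fall inside (or on the edge of) this token.
        cuts = [offset for offset in boundaries if begin <= offset <= end]
        merged.update(zip(cuts, cuts[1:]))
    return sorted(merged)


text = "Don't stop"
tool_a = [(0, 5), (6, 10)]          # Don't | stop
tool_b = [(0, 2), (2, 5), (6, 10)]  # Do | n't | stop

for begin, end in merge_tokenizations(tool_a, tool_b):
    print(begin, end, repr(text[begin:end]))
```

The result keeps the whitespace gap untouched and only refines tokens, so annotations from both tools can still be attached to unions of the merged spans.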
32. Validation over specification
• SPARQL queries produce (find) errors
• http://persistence.uni-leipzig.org/nlp2rdf/ontologies/testcase/lib/nif-2.0-suite.t
• RLOG – An RDF Logging Ontology
• ./validate.jar -i nif-erroneous-model.ttl -t file
• Demo → character count
• Demo → all errors
ALL DEMOS ARE AVAILABLE AT:
http://nlp2rdf.org/leipzig-24-9-2013
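The idea of validation by query, where each check finds offending resources and reports them as log entries, can be sketched in plain Python. This is a stand-in, not the validator itself: the real test suite is expressed as SPARQL queries and reports via RLOG, whereas the dictionary keys, the function name `check_character_count` and the sample data below are hypothetical. Whether the character count demo tests exactly this condition is also an assumption.

```python
from typing import Dict, List


def check_character_count(strings: List[Dict]) -> List[Dict]:
    """Return one log-style entry for every string whose offsets and text disagree."""
    entries = []
    for s in strings:
        expected = s["endIndex"] - s["beginIndex"]
        actual = len(s["anchorOf"])
        if expected != actual:
            entries.append({
                "level": "ERROR",
                "resource": s["uri"],
                "message": (
                    f"endIndex - beginIndex is {expected}, "
                    f"but anchorOf has {actual} characters"
                ),
            })
    return entries


# Hypothetical model: the second string claims 8 characters but carries only 7.
model = [
    {"uri": "http://example.org/doc1#char=0,7",
     "beginIndex": 0, "endIndex": 7, "anchorOf": "Leipzig"},
    {"uri": "http://example.org/doc1#char=21,29",
     "beginIndex": 21, "endIndex": 29, "anchorOf": "Germany"},
]

for entry in check_character_count(model):
    print(entry["level"], entry["resource"], entry["message"])
```

Keeping each check as a self-contained query that returns offending resources means new constraints can be added to the suite without changing the validator code.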