The document discusses the Natural Language Processing Interchange Format (NIF), which aims to achieve interoperability between NLP tools and language resources through representing them using RDF and OWL. NIF defines URI schemes for identifying text elements, an ontology for common NLP terms, and supports various use cases including integrating tools via workflows. It is maintained by the AKSW group and supported by several standards bodies and implementations seeking to advance linked data in NLP.
2. www.sti-innsbruck.at
Outline
• What is NIF?
• Design requirements
• URI schemes
• NIF ontologies
• Use cases
• Relationship with ELRA
• Roadmap for NIF 2.0
• Conclusions
What is NIF?
• Natural Language Processing Interchange Format
• NIF is an RDF/OWL-based format that aims to achieve interoperability
between Natural Language Processing (NLP) tools, language
resources and annotations.
• Building blocks
– URI scheme for identifying elements in texts
– Ontology for describing common NLP terms
• Created and maintained by the AKSW group at the University of Leipzig,
within the LOD2 EU project.
• Community project: http://persistence.uni-leipzig.org/nlp2rdf/
URI schemes
• Text needs to be referenceable by URIs
• With URI references, text can be used as a resource in RDF statements
• NIF distinguishes:
– Documents
– Text of the document
– Substrings of the text.
• A URI scheme is an algorithm to create IDs for the text and its substrings
• URI elements
– Document URI
– Separator
– Character indices
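The three URI elements listed above can be combined mechanically. The following is a minimal sketch, not the NIF reference implementation: the function names are hypothetical, and the default separator `#char=` is an assumption taken from the example URIs elsewhere in these slides.

```python
def make_string_uri(document_uri: str, begin: int, end: int,
                    separator: str = "#char=") -> str:
    """Concatenate document URI, separator and character indices.

    The indices are assumed to be zero-based offsets into the document text,
    with `end` exclusive (hypothetical convention for this sketch).
    """
    if begin < 0 or end < begin:
        raise ValueError("indices must satisfy 0 <= begin <= end")
    return f"{document_uri}{separator}{begin},{end}"


def make_text_uri(document_uri: str, text: str,
                  separator: str = "#char=") -> str:
    """Identify the whole text of the document: offsets 0 to len(text)."""
    return make_string_uri(document_uri, 0, len(text), separator)


doc = "http://www.w3.org/DesignIssues/LinkedData.html"
print(make_string_uri(doc, 1206, 1218))
# prints http://www.w3.org/DesignIssues/LinkedData.html#char=1206,1218
```

Keeping the separator as a parameter reflects that a URI scheme is only one of several possible algorithms; a different scheme would swap the separator or the index encoding while the document URI stays the same.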
RFC 5147
• The canonical URI scheme for NIF is based on RFC 5147
• RFC 5147 standardizes fragment identifiers for the text/plain media type
• Example
– Document: http://www.w3.org/DesignIssues/LinkedData.html
– Text of the document: http://www.w3.org/DesignIssues/LinkedData.html#char=0,26610
– Substring of the text: http://www.w3.org/DesignIssues/LinkedData.html#char=1206,1218
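To illustrate how such a fragment identifier maps back to the text, here is a simplified sketch that handles only the `char=<begin>,<end>` form with both positions present. RFC 5147 defines more than this (for example line-based fragments and integrity checks), which the sketch deliberately ignores. The function names and the sample text are hypothetical. RFC 5147 character positions count the gaps between characters starting at 0, so a range corresponds directly to a Python slice.

```python
import re
from urllib.parse import urldefrag

# Only the "char=<begin>,<end>" form is supported in this sketch.
_CHAR_RANGE = re.compile(r"char=(\d+),(\d+)")


def parse_char_fragment(uri: str) -> tuple[str, int, int]:
    """Split a URI into (document URI, begin, end)."""
    document_uri, fragment = urldefrag(uri)
    match = _CHAR_RANGE.fullmatch(fragment)
    if match is None:
        raise ValueError(f"unsupported fragment: {fragment!r}")
    begin, end = int(match.group(1)), int(match.group(2))
    if end < begin:
        raise ValueError("end position precedes begin position")
    return document_uri, begin, end


def resolve(uri: str, text: str) -> str:
    """Return the substring of `text` that the URI identifies."""
    _, begin, end = parse_char_fragment(uri)
    if end > len(text):
        raise ValueError("range exceeds the length of the text")
    return text[begin:end]


sample = "Linked Data is about using the Web"  # hypothetical document text
print(resolve("http://example.org/doc.txt#char=0,11", sample))
# prints Linked Data
```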
NIF Core Ontology
• Additional classes and properties (unstable/testing)
– More URI schemes
– Text structure (words, sentences, paragraphs…)
– Part of Speech (POS)
– Annotations with Stanbol
– Confidence
ITS Use Case
• The Internationalization Tag Set 2.0 is a W3C working draft that is
becoming a Recommendation.
• ITS standardizes HTML and XML attributes which can be used to
annotate nodes with processing information for language service
providers (i18n, l10n)
• The ITS 2.0 RDF ontology was developed using NIF, including a round-trip
conversion algorithm from ITS to NIF.
• NIF is expected to receive wide adoption by translation & language
service providers
• The ITS 2.0 RDF ontology provides properties that can serve as
best practices for NLP annotations.
OLiA Use Case
• The Ontologies of Linguistic Annotation (OLiA) provide stable identifiers for
morpho-syntactic annotation tag sets, so that NLP tools can use these
identifiers for better interoperability.
• OLiA provides Annotation Models and a Reference Model, comprising
more than 110 OWL ontologies for over 34 tag sets in 69 languages
• Features
– Documentation
– Flexible Granularity
– Language Independence
• NIF provides two properties
– nif:oliaIndividual (links a nif:String to an OLiA Annotation Model)
– nif:oliaCategory (links to the Reference Model)
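As a rough illustration of the two properties, the sketch below emits N-Triples lines that link one string URI to an OLiA Annotation Model entry and to a Reference Model category. Everything concrete here is an assumption: the `nif:` namespace URI, the helper name, and the OLiA URIs in the example are illustrative only, and the property name `oliaIndividual` is copied from the slide, so the published vocabulary may differ.

```python
# Assumed namespace for the nif: prefix (not stated in the slides).
NIF = "http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core#"


def olia_triples(string_uri: str, annotation_model_uri: str,
                 reference_category_uri: str) -> list[str]:
    """Return two N-Triples lines annotating one string with OLiA links."""
    return [
        # Link to the tag in a tag-set-specific Annotation Model.
        f"<{string_uri}> <{NIF}oliaIndividual> <{annotation_model_uri}> .",
        # Link to the tag-set-independent Reference Model category.
        f"<{string_uri}> <{NIF}oliaCategory> <{reference_category_uri}> .",
    ]


for line in olia_triples(
    "http://www.w3.org/DesignIssues/LinkedData.html#char=1206,1218",
    "http://purl.org/olia/penn.owl#NNP",          # illustrative URI
    "http://purl.org/olia/olia.owl#ProperNoun",   # illustrative URI
):
    print(line)
```

A consumer that only understands the Reference Model can then ignore the tag-set-specific link and still compare annotations produced by tools that use different tag sets.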
RDFaCE Use Case
• RDFa Content Editor is a rich text editor that supports WYSIWYM
authoring including various views of the semantically enriched textual
content.
• It combines the results of different NLP APIs for automatic content
annotation
– Problem: heterogeneous API access, URI generation and output data structures
– Solution: server-side proxy with hard-coded input and connection for each API.
• NIF simplified the integration, adding an interoperability layer
What is ELRA?
• European Language Resources Association
• http://www.elra.info
• Effort to make Language Resources (LR) available for language
engineering and to evaluate language engineering technologies.
• LR marketplace
• Related organizations
– ELDA (ELRA’s operational body)
– LREC conferences
Relationship with NIF
• Different objectives
• Written LRs (esp. corpora) can be annotated with NIF for
further interoperability and integration with NLP tools
• ADVANTAGE: Large test data collection to evaluate NLP tools
• DISADVANTAGE: Cost of LR (though there are free ones)
Roadmap for NIF 2.0
• Release of NIF 1.0
– DONE (Nov 2009)
• Release of NIF 2.0 Draft
– CURRENT effort on solving pending issues
– Adoption in ITS 2.0 W3C (soon-to-be) Recommendation
– NIF-Core ontology is becoming stable
– RLOG - an RDF Logging Ontology
– NIF Validator software available
• Release of NIF 2.0 Core
• Release of NIF 2.0 Extensions
– ITS ontology, PROV ontology, Lemon Ontology, NERD, UIMA, MARL opinion
ontology…
Conclusions
• NIF allows NLP tools to be integrated using Linked Data
• Ongoing effort
• Many adopters and supporters
– LOD2 EU project
– Several W3C working groups
– Named Entity Recognition and Disambiguation (NERD)
– Ontologies of Linguistic Annotation (OLiA)
– …
• 27 different implementations and use cases
– Some available at http://persistence.uni-leipzig.org/nlp2rdf/