The document discusses Apache Stanbol, an open source software stack that can extract semantic data from documents. It describes how Stanbol uses semantic analysis and natural language processing to automatically tag and enrich content with entities, citations and related information. The document also explains how Stanbol can be integrated with Drupal to enable features like automatic tagging, dynamic annotation and autocompletion through a REST API that returns JSON-LD.
Drupal and Apache Stanbol. What if you could reliably do autotagging?
1. Gabriel Dragomir
Drupal and Apache Stanbol
What if you could reliably do autotagging?
Wednesday, January 23, 13
2. Semantic content is the key!
Most organizations need to organize/analyze/relate
huge amounts of textual, unstructured, scattered data
E.g. universities check theses for plagiarism
SNSPA: we adapted the WebFerret plagiarism checker for
Romanian
http://homepages.stca.herts.ac.uk/~pdgroup/
3. Semantic content is the key!
WebFerret - identifies potential sources from the
Internet and from an institutional repository
CONS:
Desktop based, no REST web services
Cannot detect plagiarism by translation
4. Semantic content is the key!
Here comes Apache Stanbol
A new approach:
semantic analysis of documents
extract citations in proximity
search the web for documents with a similar citation
structure
5. From IKS to Apache Stanbol
IKS - Interactive Knowledge Stack for small to medium
CMS providers - EU funding
An open source software stack written in Java
Goal: extract and process semantic data from
documents
Project undergoing incubation at the Apache Software Foundation
http://stanbol.apache.org
6. Service oriented architecture
Stanbol is designed to offer service oriented integration
RESTful web service API returning RDF or JSON/
JSON-LD
Each component exposes an endpoint independently
OSGi (Open Services Gateway initiative) compliant via
Apache Felix and Apache Sling
Remote component management
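As a rough illustration of the points above, the sketch below composes one URL per component and picks an Accept header for the desired serialization. The base URL, the component paths and the MIME type table are assumptions for illustration; check them against the Stanbol instance actually in use.

```python
from urllib.parse import urljoin

# Assumed base URL of a locally running Stanbol instance (hypothetical).
STANBOL_BASE = "http://localhost:8080/"

# Standard MIME types for some of the serializations named on the slide.
# A given Stanbol version may expect different or additional types.
RDF_MIME_TYPES = {
    "json-ld": "application/ld+json",
    "rdf-xml": "application/rdf+xml",
    "turtle": "text/turtle",
}


def endpoint_url(component: str, base: str = STANBOL_BASE) -> str:
    """Return the URL of one component, e.g. 'enhancer' (path is an assumption)."""
    return urljoin(base, component)


def accept_header(fmt: str) -> dict:
    """Return the HTTP header that asks for the given serialization."""
    try:
        return {"Accept": RDF_MIME_TYPES[fmt]}
    except KeyError:
        raise ValueError(f"unknown format: {fmt!r}") from None
```

Because each component has its own endpoint, a client only needs the base URL and the component name; the response format is chosen per request through the Accept header.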
7. Implementation
OSGi layer: Apache Felix and Apache Sling
Build environment: Apache Maven
RDF framework: Apache Clerezza
Triple store, reasoning engine: Apache Jena
Indexing and semantic search: Apache Solr
Content analysis/metadata extraction: Apache Tika
Natural language processing: Apache OpenNLP
9. Components
Semantic layer:
Enhancer, EntityHub, ContentHub
Enhancement engines: internal, 3rd party
User interfaces
Knowledge integration
Storage integration
10. Content enhancement
Examples:
retrieve additional metadata for a piece of content
identify the language of a text
extract entities (persons, places, organizations)
create annotations to external sources
use 3rd party services for named entity recognition
11. Drupal meets Stanbol
Drupal supports RDFa allowing semantic annotations
Taxonomy system allows for complex annotation
Fieldable taxonomy terms allow for storage of complex
semantic data
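To make "semantic annotations" concrete, here is a minimal sketch of what RDFa markup for one entity could look like. It is generated in Python only for illustration; Drupal produces its RDFa in its own theme layer, and the helper name `rdfa_span` is hypothetical. Full IRIs are used in `typeof` and `property`, which RDFa 1.1 permits.

```python
from html import escape


def rdfa_span(subject: str, rdf_type: str, prop: str, text: str) -> str:
    """Wrap text in nested spans carrying RDFa attributes."""
    return (
        f'<span about="{escape(subject)}" typeof="{escape(rdf_type)}">'
        f'<span property="{escape(prop)}">{escape(text)}</span>'
        "</span>"
    )


markup = rdfa_span(
    "http://www.barackobama.com",
    "http://xmlns.com/foaf/0.1/Person",
    "http://xmlns.com/foaf/0.1/name",
    "Barack Obama",
)
```

The visible text stays "Barack Obama"; the attributes tell a parser which resource it names, its type, and which property the text is a value of.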
12. User scenarios
Assisted semantic tagging: autotagging
Content enrichment with semantically related
information (documents, factual data, images etc.)
Tag as you type: dynamic annotation of text in editors
Autocomplete indexes - FAST with Apache Solr
13. Autotagging with Stanbol
Given a piece of content extract mentions of places,
persons, organizations or other entities
Named entity recognition (NER)
OpenCalais and Zemanta provide similar functionality,
but with limited free requests and limited language support
Stanbol does it for free
Multilingual: may be trained for any language
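Once entity mentions have been recognized, autotagging largely reduces to keeping the confident ones. The sketch below shows that step on hand-written data. The record keys (`label`, `type`, `confidence`) and the threshold are hypothetical and simplified; they are not Stanbol's actual response vocabulary.

```python
from typing import Dict, Iterable, List


def suggest_tags(annotations: Iterable[dict],
                 min_confidence: float = 0.5) -> List[str]:
    """Return unique entity labels above the threshold, most confident first."""
    best: Dict[str, float] = {}
    for ann in annotations:
        label, conf = ann["label"], ann["confidence"]
        # Keep only the highest confidence seen for each label.
        if conf >= min_confidence and conf > best.get(label, -1.0):
            best[label] = conf
    return sorted(best, key=lambda label: (-best[label], label))


# Hand-written sample annotations, not real enhancer output.
sample = [
    {"label": "Paris", "type": "Place", "confidence": 0.92},
    {"label": "Barack Obama", "type": "Person", "confidence": 0.97},
    {"label": "Paris", "type": "Person", "confidence": 0.31},
    {"label": "Stanbol", "type": "Organization", "confidence": 0.40},
]
tags = suggest_tags(sample)
```

Lowering `min_confidence` trades precision for recall, which is why assisted tagging usually leaves the final choice to the editor.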
14. How it works
REST service: Apache Stanbol Enhancer
Returns JSON-LD, RDF/XML, RDF/JSON etc
curl -X POST -H "Accept: text/turtle" -H "Content-type: text/plain" \
  --data "The Stanbol enhancer can detect famous cities such as Paris and people such as Barack Obama." \
  http://dev.iks-project.eu:8081/enhancer
JSON-LD - JavaScript Object Notation for Linked Data
a human readable and simple linked data transport
format
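The curl call above can also be made from code. The following is a minimal sketch using only the Python standard library; the function names are hypothetical, and the URL is the demo endpoint from the slide, which may no longer respond. Building the request is kept separate from sending it so the headers can be inspected without network access.

```python
import urllib.request

# Endpoint as given on the slide (public demo server, availability not guaranteed).
ENHANCER_URL = "http://dev.iks-project.eu:8081/enhancer"


def build_enhancer_request(text: str, accept: str = "text/turtle",
                           url: str = ENHANCER_URL) -> urllib.request.Request:
    """Build (but do not send) the POST request that the curl call makes."""
    return urllib.request.Request(
        url,
        data=text.encode("utf-8"),
        headers={"Accept": accept, "Content-Type": "text/plain"},
        method="POST",
    )


def enhance(text: str, accept: str = "text/turtle",
            timeout: float = 10.0) -> str:
    """Send the text to the enhancer and return the raw response body."""
    request = build_enhancer_request(text, accept)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

Changing the `accept` argument is how a client would ask for a different serialization; which MIME types a given server accepts for JSON-LD or the other RDF formats should be checked against that server.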
15. How it works
JSON-LD is included in Drupal 8 core
Creates a description of the data as a “context” data
structure
Context: links object properties to concepts in an
ontology
Allows for values to be coerced to a certain set or
language
16. How it works
{
  "@context": {
    "name": "http://xmlns.com/foaf/0.1/name",
    "homepage": {
      "@id": "http://xmlns.com/foaf/0.1/workplaceHomepage",
      "@type": "@id"
    },
    "person": "http://xmlns.com/foaf/0.1/Person"
  },
  "@id": "http://www.barackobama.com",
  "@type": "person",
  "name": "Barack Obama",
  "homepage": "http://www.whitehouse.gov/"
}
17. How it works
FOAF: “Friend of a friend” - RDF ontology
describing people, their relations and activities
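To show what the context does, the sketch below resolves the short keys of the example document to the FOAF IRIs they stand for. It is a deliberately simplified illustration, not the expansion algorithm from the JSON-LD specification: it handles only flat documents, string term definitions and the `"@type": "@id"` coercion used in the example.

```python
def expand(doc: dict) -> dict:
    """Replace top-level terms with the IRIs defined in the document's @context."""
    context = doc.get("@context", {})

    def term_iri(term: str) -> str:
        definition = context.get(term, term)
        return definition["@id"] if isinstance(definition, dict) else definition

    expanded = {}
    for key, value in doc.items():
        if key == "@context":
            continue
        if key == "@type":
            expanded[key] = term_iri(value)
        elif key.startswith("@"):
            expanded[key] = value
        else:
            definition = context.get(key)
            # "@type": "@id" coerces the value to an IRI reference,
            # not a plain string literal.
            if isinstance(definition, dict) and definition.get("@type") == "@id":
                value = {"@id": value}
            expanded[term_iri(key)] = value
    return expanded


doc = {
    "@context": {
        "name": "http://xmlns.com/foaf/0.1/name",
        "homepage": {
            "@id": "http://xmlns.com/foaf/0.1/workplaceHomepage",
            "@type": "@id",
        },
        "person": "http://xmlns.com/foaf/0.1/Person",
    },
    "@id": "http://www.barackobama.com",
    "@type": "person",
    "name": "Barack Obama",
    "homepage": "http://www.whitehouse.gov/",
}
result = expand(doc)
```

After expansion, `name` and `homepage` no longer depend on the local key names, so two documents that use different keys for the same FOAF property become directly comparable.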
19. How it works
Source: blog.iks-project.eu
20. How it works
On the Drupal side we only have to parse the response
Map JSON-LD properties to entity fields
Use Drupal’s native RDFa capability to render semantic
markup
Use your imagination and build semantic content
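The mapping step can be sketched as a simple lookup from property IRIs to field names. The example is written in Python purely for illustration (a Drupal module would do this in PHP), the field machine names are hypothetical, and the input record is hand-written rather than real enhancer output.

```python
# Hypothetical mapping from property IRIs to Drupal field machine names.
FIELD_MAP = {
    "http://xmlns.com/foaf/0.1/name": "field_full_name",
    "http://xmlns.com/foaf/0.1/workplaceHomepage": "field_homepage",
}


def to_field_values(record: dict, field_map: dict = FIELD_MAP) -> dict:
    """Pick the mapped properties out of an IRI-keyed record."""
    values = {}
    for iri, field in field_map.items():
        if iri not in record:
            continue
        value = record[iri]
        # IRI references arrive as {"@id": ...}; store the bare IRI.
        values[field] = value["@id"] if isinstance(value, dict) else value
    return values


# Hand-written, IRI-keyed record for one entity.
record = {
    "@id": "http://www.barackobama.com",
    "@type": "http://xmlns.com/foaf/0.1/Person",
    "http://xmlns.com/foaf/0.1/name": "Barack Obama",
    "http://xmlns.com/foaf/0.1/workplaceHomepage": {"@id": "http://www.whitehouse.gov/"},
}
fields = to_field_values(record)
```

Keeping the mapping in a table rather than in code means new properties can be supported by adding one entry per field.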
21. Quick demo
Semantic CMS - Evo42 communications, early adopter
integration of Drupal with Stanbol
Rene Kapusta - https://github.com/evo42/Semantic-CMS
Drupal contributor, Aloha Editor core developer