3. Sectors
Input: tweet, newspaper article, wire copy, facebook status update, search result, email, text message, concept, text file, video, map, stockphoto, in-house database, calendar entry, spreadsheet, archive, etc.
Processes: analyse, select, focus, revise, read up on, write, create, research, assess, evaluate, arrange, sort, structure, summarise, shorten, translate, catch up on, combine, abstract, integrate, visualise, generate, annotate, reference, etc.
Software: text processor, presentation, spreadsheet, email, browser, groupware, sector-specific application, CMS, ECMS, CRM, enterprise software, graphics/layouting software, IP telephony, etc.
Output: newspaper article, multimedia website, tv report, exhibition catalogue, mobile application, mashup (e.g., map), text piece, concept, timeline, study, presentation, fact collection, description of an exhibit, analysis, etc.
[Diagram labels: Input, Processes, Software, Output; Information]
6. Digital Curation Technologies
[Diagram labels: language and knowledge technologies, curation technologies, sector-specific technologies, platform technologies, sector-specific solutions]
• Make curation processes in four SMEs (and sectors) more
efficient through language and knowledge technologies.
• Technology transfer project to arrive at proofs of concept.
• Curation services for real companies and real use cases.
• The human expert (the curator) is always at the centre and in the loop.
• Platform for digital curation technologies: innovation boost.
Curation Technologies for Multilingual Europe
7. Curation Technologies for Multilingual Europe
Curation Dashboard
Structure visualisation
Multilingual multimedia sources
Crossmedia recommendations
Multilingual summarisation
Event timelining
Semantification of content
Multilingual sentiment analysis
Semantic storytelling
Ontology-based knowledge structures
Automatic hyperlinking of document collections
Curation Processes: processing, exploration and re-aggregation of domain- and task-specific document collections.
8. Key Characteristics
• Technology transfer and integration project
• Broad set of tools and technologies
• Focus on building proofs of concept
• Our technologies don’t have to be perfect
• Human expert, i.e., the curator, always in the loop
• Important for all SME partners: domain-adaptability.
• Work packages (WPs): Semantic Analysis, Semantic Generation,
Multilingual Technologies, Integration into Curation Tech
9. Platform for Digital Curation Technologies
[Architecture diagram, labels: clients using the API (four shown); broker REST API; curation service 1 and curation service 2, each with a language or knowledge technology; external service 1 and external service 2; pipelined curation workflow]
• Each curation process is an e-service available through a REST API.
• Services can be combined to form pipelines or workflows.
• Domain adaptability: every curation process has a training API to create
and use domain-specific models.
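The idea of combining e-services into a pipeline can be sketched as follows. This is a minimal, offline Python illustration and not the project's client: the base URL and the `analysis` query parameter are copied from the NER endpoint shown on the slides, while `service_url`, `pipeline`, and the two stub services are hypothetical names introduced here. No network request is made.

```python
from functools import reduce
from typing import Callable, Iterable
from urllib.parse import urlencode

# Base URL as it appears on the NER slide; everything else is illustrative.
BASE_URL = "http://api.digitale-kuratierung.de/api"


def service_url(path: str, **params: str) -> str:
    """Build the URL of a curation e-service (hypothetical helper)."""
    query = urlencode(params)
    return f"{BASE_URL}/{path}?{query}" if query else f"{BASE_URL}/{path}"


Service = Callable[[dict], dict]


def pipeline(services: Iterable[Service]) -> Service:
    """Chain services so that each one enriches the previous output."""
    steps = list(services)
    return lambda doc: reduce(lambda d, step: step(d), steps, doc)


# Stub services standing in for real REST calls (assumptions, offline only).
def fake_ner(doc: dict) -> dict:
    # Toy rule: treat capitalised words as entities.
    return {**doc, "entities": [w for w in doc["text"].split() if w.istitle()]}


def fake_translate(doc: dict) -> dict:
    # Only marks the document; a real service would return translated text.
    return {**doc, "translated": True}


if __name__ == "__main__":
    print(service_url("e-nlp/namedEntityRecognition", analysis="ner"))
    workflow = pipeline([fake_ner, fake_translate])
    print(workflow({"text": "Vikings settled in Iceland"}))
```

Each step takes and returns a plain dictionary, so replacing a stub with a function that actually calls a REST endpoint would not change the pipeline helper.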
10. Current Results
• Implemented the following baseline services:
– NER – e-entityrecognition e-service
– Geolocation – e-entityrecognition and visualisation
– Temporal Analyser – e-entityrecognition and visualisation
– Classification – e-classification e-service
– Clustering – e-clustering e-service
– Machine Translation – e-translation e-service
• Curation Dashboard (first prototype)
• Semantic Storytelling (work in progress)
11. NER, Entity Linking, Geolocation
...
In the Viking colony of Iceland,
an extraordinary vernacular
literature blossomed in the 12th
through 14th centuries
...
...
The ships were scuttled there
in the 11th century, to block a
navigation channel and thus
protect Roskilde, then
Copenhagen from seaborne
assault
...
...
Viking Age inscriptions have
also been discovered on the
Manx runestones on the
Isle of Man.
...
Stages: plain text, NIF enrichment, visualisation
http://api.digitale-kuratierung.de/api/e-nlp/namedEntityRecognition?analysis=ner
http://dev.digitale-kuratierung.de/admini/pages/geolocalization.php
• Currently based on OpenNLP (with NIF integration).
• Mode 1: model-based (for domains where annotated data is available).
• Mode 2: dictionary-based (for domains where only a list of names is available).
• Entity linking through SPARQL queries to DBpedia.
• For locations, GPS coordinates are retrieved; the document-level mean and standard deviation (over all locations) are calculated to position each document on a map.
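The document-level positioning described in the last bullet can be sketched as below. This is an assumption-laden illustration: the function name is hypothetical, the slides do not say whether the population or the sample standard deviation is used (the population variant is chosen here), and latitude and longitude are averaged independently, which ignores the curvature of the Earth.

```python
from statistics import mean, pstdev


def document_position(coords):
    """Return ((mean_lat, mean_lon), (std_lat, std_lon)) for one document.

    coords is a list of (latitude, longitude) pairs, one per location
    entity found in the document.
    """
    if not coords:
        raise ValueError("document has no located entities")
    lats = [lat for lat, _ in coords]
    lons = [lon for _, lon in coords]
    # Population standard deviation: an assumption, see the note above.
    return (mean(lats), mean(lons)), (pstdev(lats), pstdev(lons))


if __name__ == "__main__":
    # Approximate coordinates for Roskilde and Copenhagen (illustrative).
    centre, spread = document_position([(55.64, 12.08), (55.68, 12.57)])
    print(centre, spread)
```

The mean gives the point at which the document is placed on the map; the standard deviation indicates how widely its locations are scattered.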
12. NER Training
• http://api.digitale-kuratierung.de/api/e-nlp/trainModel?analysis=dict
(in the suboptimal case that only a list of terms and their URIs in an ontology is available)
• http://api.digitale-kuratierung.de/api/e-nlp/trainModel?analysis=ner
(if annotated training data is available)
• Result: an NER model that is directly usable on new input.
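To illustrate what a dictionary-based mode can do with nothing more than a list of terms and their URIs, here is a minimal Python sketch. It is not the project's implementation: the function names and the exact-match strategy are assumptions, and the two DBpedia URIs are sample dictionary entries chosen for the example.

```python
import re


def build_matcher(dictionary):
    """Compile one regex that matches any dictionary term, longest first."""
    terms = sorted(dictionary, key=len, reverse=True)
    pattern = "|".join(re.escape(term) for term in terms)
    return re.compile(rf"\b(?:{pattern})\b")


def tag(text, dictionary):
    """Return (begin, end, surface form, URI) for every dictionary hit."""
    if not dictionary:
        return []
    matcher = build_matcher(dictionary)
    return [
        (m.start(), m.end(), m.group(), dictionary[m.group()])
        for m in matcher.finditer(text)
    ]


if __name__ == "__main__":
    # Sample entries (illustrative, not taken from the project's data).
    names = {
        "Isle of Man": "http://dbpedia.org/resource/Isle_of_Man",
        "Roskilde": "http://dbpedia.org/resource/Roskilde",
    }
    for hit in tag("Runestones on the Isle of Man and ships near Roskilde.", names):
        print(hit)
```

The begin and end offsets are character positions, which is the kind of information a NIF annotation needs; matching is case-sensitive and exact, so spelling variants would require additional dictionary entries.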
13. Temporal Analysis
...
The ships were scuttled there
in the 11th century, to block a
navigation channel and thus
protect Roskilde, then
Copenhagen from seaborne
assault
...
...
Viking Age inscriptions have
also been discovered on the
Manx runestones on the
Isle of Man.
...
...
In the Viking colony of Iceland,
an extraordinary vernacular
literature blossomed in the 12th
through 14th centuries
...
[Timeline visualisation, axis from 900 to 1600]
Stages: plain text, NIF enrichment, visualisation
http://api.digitale-kuratierung.de/api/e-nlp/namedEntityRecognition?analysis=temp
http://dev.digitale-kuratierung.de/admini/pages/timelining.php
• Sort and rank the documents of a collection on a chronological scale.
• We developed a rule-based system because of our focus languages (EN, DE), the need for domain adaptability, and normalisation requirements.
• Analysis of temporal expressions in a document (or, later, in paragraphs or even sentences).
• Compute the mean value of the dates and times found, which allows positioning on a timeline.
• Future plans: adaptability through user-specific rules.
• Related work: SUTime, HeidelTime, Tango, TARSQI; many papers at LREC 2016.
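A rule-based temporal analysis followed by a mean-value computation could look roughly like the Python sketch below. The project's actual rules are not shown on the slides, so the three patterns, the function names, and the choice to normalise a century to its middle year are all assumptions made for illustration.

```python
import re
from statistics import mean

ORDINAL = r"(\d{1,2})(?:st|nd|rd|th)"
# Rule 1: century ranges such as "12th through 14th centuries".
RANGE = re.compile(rf"\b{ORDINAL}\s+(?:through|to)\s+{ORDINAL}\s+centuries\b")
# Rule 2: single centuries such as "11th century".
SINGLE = re.compile(rf"\b{ORDINAL}\s+century\b")
# Rule 3: explicit four-digit years from 1000 to 2099.
YEAR = re.compile(r"\b(1\d{3}|20\d{2})\b")


def century_midpoint(n: int) -> int:
    """Normalise the n-th century to its middle year, e.g. 11 -> 1050."""
    return (n - 1) * 100 + 50


def temporal_values(text: str) -> list[int]:
    """Apply the three rules and return one normalised year per hit."""
    years = []
    for first, last in RANGE.findall(text):
        years += [century_midpoint(c) for c in range(int(first), int(last) + 1)]
    years += [century_midpoint(int(n)) for n in SINGLE.findall(text)]
    years += [int(y) for y in YEAR.findall(text)]
    return years


def mean_year(text: str):
    """Mean of all normalised years, or None if nothing was found."""
    values = temporal_values(text)
    return mean(values) if values else None


if __name__ == "__main__":
    print(mean_year("The ships were scuttled there in the 11th century"))
    print(mean_year("literature blossomed in the 12th through 14th centuries"))
```

Sorting documents by this mean value yields a chronological ranking of the kind shown on the timeline; user-specific rules would amount to adding further patterns.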
14. Classification
• MALLET (maximum entropy algorithm)
• Algorithm for text classification, easy integration.
• Goal: text classification, i.e., assigning a topic (class) to a document (or parts of a document) so that domain- or topic-specific NLP techniques can be applied.
• Future plans: improve the classification schema by means of new training data and additional algorithms.
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix nif: <http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
<http://dkt.dfki.de/documents/#char=0,1257>
    a nif:RFC5147String , nif:String , nif:Context ;
    nif:beginIndex "0"^^xsd:nonNegativeInteger ;
    nif:endIndex "1257"^^xsd:nonNegativeInteger ;
    nif:documentClassificationLabel "Frühjahrsoffensive_1918"^^xsd:string ;
    nif:isString """Ceylon-Teestube B. Walther München Maximilian-Strasse 44 Gegenüber dem Königl. Hoftheater
Telephon 428 München, den 26.XI.13. Von hier nach Dresden ab München 8.25 9.00 10.20 an Dresden 7.28 10.47 9.48 Sie
müssen unbedingt Donnerstag hier bleiben. So können Sie doch nicht vorbeifahren. Donnerstag Abend eine interessante
Uraufführung in den Kammerspielen "unseligen Gedenkens " Ich werde Billets dafür besorgen. […]"""^^xsd:string .
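The NIF record above ties the classification label to a context whose URI fragment and `nif:endIndex` both carry the length of the text. A small Python sketch of how such a record could be assembled is given below; the helper names are hypothetical, only the properties visible on the slide are used, and the service's real serialisation code is not shown in the source.

```python
PREFIXES = (
    "@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .\n"
    "@prefix nif: <http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core#> .\n"
)


def escape_literal(text: str) -> str:
    """Escape a string for use inside a double-quoted Turtle literal."""
    return (
        text.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def nif_context(base_uri: str, text: str, label: str) -> str:
    """Serialise one classified document as a NIF context in Turtle."""
    end = len(text)  # Python counts Unicode code points
    return (
        f"{PREFIXES}"
        f"<{base_uri}#char=0,{end}>\n"
        f"    a nif:RFC5147String , nif:String , nif:Context ;\n"
        f'    nif:beginIndex "0"^^xsd:nonNegativeInteger ;\n'
        f'    nif:endIndex "{end}"^^xsd:nonNegativeInteger ;\n'
        f'    nif:documentClassificationLabel "{escape_literal(label)}"^^xsd:string ;\n'
        f'    nif:isString "{escape_literal(text)}"^^xsd:string .\n'
    )


if __name__ == "__main__":
    print(nif_context("http://dkt.dfki.de/documents/", 'Ein "Test"', "Topic"))
```

Escaping the text keeps embedded quotation marks, such as those in the letter shown on the slide, from terminating the literal early.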
15. Clustering
• WEKA (Expectation Maximisation algorithm)
• Easy integration, availability, additional algorithms.
• Goal: identification of distinct features of document collections.
• Example use case: a user has to prepare a museum exhibit on “Birds”. Knowing which documents group together helps to distribute them across exhibition rooms.
• Future plans: allow users to easily recognise groups of documents in new domains and collections; faceted search.
ARFF Input
@RELATION iris
@ATTRIBUTE sepallength NUMERIC
@ATTRIBUTE sepalwidth NUMERIC
@ATTRIBUTE petallength NUMERIC
@ATTRIBUTE petalwidth NUMERIC
@DATA
5.1,3.5,1.4,0.2
4.9,3.0,1.4,0.2
4.7,3.2,1.3,0.2
4.6,3.1,1.5,0.2
5.0,3.6,1.4,0.2
5.4,3.9,1.7,0.4
4.6,3.4,1.4,0.3
5.0,3.4,1.5,0.2
4.4,2.9,1.4,0.2
4.9,3.1,1.5,0.1
JSON Output
{
  "results": {
    "numberClusters": -1,
    "clusters": {
      "cluster1": {
        "clusterId": 1,
        "entitites": {
          "entity1": {
            "meanValue": 3.3099999999999996,
            "label": "sepalwidth"
          },
          "entity2": {
            "meanValue": 1.45,
            "label": "petallength"
          },
          "entity3": {
            "meanValue": 0.22000000000000003,
            "label": "petalwidth"
          }
        }
      }
    }
  }
}
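A client could turn the JSON output above into a simple per-cluster summary as sketched here. The function name is hypothetical; the key names, including the spelling `entitites`, are copied from the slide and assumed to be what the service returns.

```python
import json


def cluster_means(payload: str) -> dict[int, dict[str, float]]:
    """Map each clusterId to {attribute label: mean value}."""
    clusters = json.loads(payload)["results"]["clusters"]
    return {
        cluster["clusterId"]: {
            entity["label"]: entity["meanValue"]
            # "entitites" is spelled as in the output shown on the slide.
            for entity in cluster["entitites"].values()
        }
        for cluster in clusters.values()
    }


if __name__ == "__main__":
    sample = """
    {"results": {"numberClusters": -1, "clusters": {"cluster1": {
        "clusterId": 1,
        "entitites": {
            "entity1": {"meanValue": 1.45, "label": "petallength"}
        }}}}}
    """
    print(cluster_means(sample))
```

The resulting dictionary is convenient for labelling clusters in a dashboard or for filtering documents by cluster.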
16. Machine Translation
Workflow: Language & Translation Models (trained on DGT, News, Europarl, TED), Named Entity Recognition, Entity Linking, Temporal Expressions, Metadata Processing, Post-Edit, Retraining
Example
DE (source): Herr Modi befindet sich auf einer fünftägigen Reise nach Japan, um die wirtschaftlichen Beziehungen mit der drittgrößten Wirtschaftsnation der Welt zu festigen.
EN (MT output): Mr Modi is located on a five-day trip to Japan to strengthen the economic ties with the third largest economy in the world.
• Robust, adaptable and customised MT models offered as e-services (Moses-based SMT)
• Scenarios: museums, showrooms; news, media; publishers; cultural institutions, archives
• Integration in curation workflows with other DKT services (NER, Temporal Analyser)
• Plug in multiple knowledge sources (Linked Data)
17. Semantic Storytelling
• Important objective for all partner use cases: automatic hyperlinking of task-specific, self-contained collections.
• Input: coherent, self-contained document collection
• Output: processed collection with added analysis information, easily accessible as a hypertext, for efficient browsing
• Semantic storytelling operates on the hypertext graph that we construct on top of the original collection.
• The graph enables multiple different paths through the collection.
• Semantic storytelling is the identification, ranking and recommendation of meaningful hypertext paths.
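One possible reading of "identification, ranking and recommendation of hypertext paths" is sketched below in Python. The slides describe this component as work in progress and do not say how the graph is built or how paths are scored, so everything here is an assumption: documents are linked when they share entities, the edge weight is the number of shared entities, and a path is scored by the sum of its edge weights.

```python
from itertools import combinations


def build_graph(docs):
    """Link two documents when they share entities; weight = overlap size."""
    graph = {name: {} for name in docs}
    for a, b in combinations(docs, 2):
        shared = len(docs[a] & docs[b])
        if shared:
            graph[a][b] = graph[b][a] = shared
    return graph


def simple_paths(graph, start, max_len):
    """Yield every cycle-free path from start with 2..max_len documents."""
    stack = [[start]]
    while stack:
        path = stack.pop()
        if len(path) > 1:
            yield path
        if len(path) < max_len:
            for nxt in graph[path[-1]]:
                if nxt not in path:
                    stack.append(path + [nxt])


def rank_paths(graph, start, max_len=3):
    """Return the paths from start, highest total edge weight first."""
    def score(path):
        return sum(graph[a][b] for a, b in zip(path, path[1:]))
    return sorted(simple_paths(graph, start, max_len), key=score, reverse=True)


if __name__ == "__main__":
    docs = {
        "a": {"Iceland", "Vikings"},
        "b": {"Vikings", "Roskilde"},
        "c": {"Roskilde", "Copenhagen"},
    }
    print(rank_paths(build_graph(docs), "a"))
```

A summed score favours longer paths; normalising by path length, or scoring by shared temporal expressions instead of entities, would be equally plausible alternatives.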
20. Conclusions
• Curation technologies are smart technologies to support
knowledge workers handling content and knowledge.
• The multilingual Digital Single Market will create a
massive need for multilingual Curation Technologies due
to an ever-increasing need for multilingual content.
• DKT is mostly centred around German and English.
• We cater for a small set of curation processes.
• To be extended in a larger follow-up project: an extended set of curation processes, more complex approaches, and many more languages.