ArCo is a very ambitious ontology project. Starting from the official central catalogue of Italian Cultural Heritage (maintained by the Ministry) as its main source, its goal is to release an open knowledge graph encoding knowledge about the entities described in catalogue records. This means going beyond the mere representation of their metadata. Although there is still a long way to go, ArCo has reached its first 'stable' version (https://w3id.org/arco). The experience of developing this project has taught us important lessons, both about knowledge engineering in general and about its application to Cultural Heritage. In this talk I will tell ArCo's story and the lessons learned, focusing on methodological, social and ontological perspectives.
ArCo: the Knowledge Graph of Italian Cultural Heritage
1. ArCo: the Knowledge Graph of Italian Cultural Heritage
Valentina Presutti
University of Bologna
STLab, ISTC, National Research Council, Italy
ArCo team also includes Valentina Carriero (ISTC-CNR), Andrea Giovanni Nuzzolese (ISTC-CNR) and Aldo Gangemi (UniBo)
https://w3id.org/arco
These slides can be reused as they are according to the CC BY 4.0 license.
Please cite the author and link to the original.
3. ArCo’s ambitious goal is to build the knowledge graph of Italian Cultural Heritage
Valentina Anita Carriero, Aldo Gangemi, Maria Letizia Mancinelli, Ludovica Marinucci, Andrea Giovanni Nuzzolese, Valentina Presutti and Chiara Veninata: ArCo: the Italian Cultural Heritage Knowledge Graph. In Proceedings of ISWC 2019 (to appear)
Preprint at: https://arxiv.org/abs/1905.02840
4. ArCo KG in numbers
ONTOLOGY NETWORK
• 7 modules
• 5058 axioms
• 1049 predicates
DATA
• 169,151,644 triples
• 28,838 owl:sameAs links to 20,479 distinct entities in other datasets
5. How to use ArCo
https://w3id.org/arco CC BY-SA 4.0 license
6. USER GUIDES for supporting users in understanding the content of each release, with Graffoo
diagrams and narrative explanations of every ontology module
https://essepuntato.it/graffoo/
8. A SPARQL endpoint storing ArCo KG, which is also
downloadable as a compressed dump
9. Examples of Competency Questions (CQs) that ArCo KG can answer, with their corresponding SPARQL queries.
These help users quickly understand what is in ArCo ontologies and data, and how to use them.
12. Why ArCo?
• Regulations (open data)
• Fostering reuse by third parties
• Improving PA organisational data management
• Modeling Cultural Heritage knowledge vs metadata
• Cataloguing
• Providing data to scholars and researchers
• Connecting to other relevant knowledge bases
13. The general catalogue of Italian Cultural Heritage
…among ArCo’s main data and conceptual sources
14. General Catalogue of Italian Cultural Heritage: the official institutional database of Italian CH, maintained and published by ICCD (Institute of the General Catalogue and Documentation)
SIGECweb: the collaborative platform to which formally authorised institutions can submit their catalogue records, which undergo a validation phase
ICCD cataloguing standards
• about 800,000 (out of 2,735,343) publicly available catalogue records
• 30 types of cultural properties
• ~15M catalogue record numbers released
21. Ontology Design Patterns
An ontology design pattern is a reusable successful solution to a recurrent modeling problem
Aldo Gangemi, Valentina Presutti: Ontology Design Patterns. Handbook on
22. Ontology Design Patterns
Ontology patterns derive from foundational theories
Agile methodology for ontology design
Pattern representation language within ontologies
23. A language for ontology engineers
Trajectory
AgentRole
RecurrentEvent
PartOf
Sequence
Observation
TimeIndexedParticipation TimeInterval
26. ODPs from DUL + DnS Ultra Lite
DOLCE+D&S and its main ontology design patterns: Valentina Presutti and Aldo Gangemi. Ontology Engineering with ontology design patterns. Pages 81-103. IOS Press (2016)
27. Experimenting with ODPs
Usability: with vs. without
45 participants
Eva Blomqvist, Aldo Gangemi, Valentina Presutti: Experiments on pattern-based ontology design. K-
28. Experimenting with ODPs and XD
Usability: without vs. with
[Chart: distribution of responses, from "Strongly disagree" to "Strongly agree" plus "Not applicable", to the statement "The XD methodology helped me to organize my work while modelling."]
[Chart: distribution of responses, on the same scale, to the statement "I already organized my work in a way similar to XD in the previous exercises..."]
                        ODP    ODP + XD
Terminology coverage    79%    83%
Task coverage           69%    81%
Disjoint axioms         37%    52%
35 participants
Eva Blomqvist, Valentina Presutti, Enrico Daga, Aldo Gangemi: Experimenting with eXtreme Design. EKAW 2010: 120-
29. Experiments on ontology learning: with vs. without
• ODP-based ontology learning improves results
• Ontologies are better in terms of cohesion, consistency,
functional quality, etc.
• Experiments with OntoCase applied to Text2Onto ontology
learning
Eva Blomqvist: OntoCase-Automatic Ontology Enrichment Based on Ontology Design Patterns. International Semantic Web
30. Paulheim, H. and Gangemi, A. Serving DBpedia with DOLCE – More than Just Adding a Cherry on Top. Proceedings of ISWC2015, the Fourteenth International Semantic Web Conference, LNCS, Springer, 2015
32. eXtreme Design (XD)
Building ArCo knowledge graph with ontology design patterns
Eva Blomqvist, Karl Hammar, Valentina Presutti: Engineering Ontologies with Patterns - The eXtreme Design Methodology.
Ontology Engineering with Ontology Design Patterns. Pages 23-50. IOS Press (2016)
Eva Blomqvist, Valentina Presutti, Enrico Daga, Aldo Gangemi: Experimenting with eXtreme Design. EKAW 2010: 120-134
35. User stories
Continuous feedback
New emerging requirements and errors
arco-project@googlegroups.com
Methods and tools to collect requirements from users with heterogeneous expertise
36. User stories
Goal: Look for the story of an artwork that was confiscated from organised crime, to appreciate the value of social return through this type of confiscation
High-level requirements (What requests should this application reply to?): Visualise the chronology of an artwork that was confiscated from organised crime
37. User stories
Goal: To enable identifying the cataloguing activity of diverse organisations in specific areas
High-level requirements (What requests should this application reply to?): How many catalogue records describing cultural properties in a certain region have been produced? How many of them have been filled in by Heritage Protection Agencies? How many by other organisations (universities, regions, etc.)?
38. Data about residential estates
Data about cultural heritage
Data about touristic services
Archeological data
Archival data
CH data
Touristic services and touristic-cultural
itineraries
Accessibility
Consultation of CH data
39. From stories to competency questions and constraints
Competency questions:
• What are the geographical coordinates of cultural property X?
• What cultural events involved cultural property X?
• What is the conservation status of cultural property X? And what interventions have been proposed for it?
• When was cultural property X realised? And what is its history? And why?
• Who are the attributed authors of cultural property X?
• …
Constraints:
• A cultural property can be associated with different types of locations, each possibly having a temporal validity.
• Tangible and intangible cultural properties are disjoint.
• Tangible cultural properties can be either movable or immovable, not both.
• …
testing
team
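As a minimal sketch of how the two disjointness constraints on this slide could be checked by a testing team, the Python below flags records that violate them. It is not ArCo's actual test suite: the class `CulturalProperty`, its boolean fields and the function `violations` are hypothetical names introduced only for illustration, and a real check would run against the OWL axioms and the knowledge graph rather than against in-memory objects.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class CulturalProperty:
    """Toy stand-in for a cultural property (hypothetical, not an ArCo class)."""
    identifier: str
    tangible: bool = False
    intangible: bool = False
    movable: bool = False
    immovable: bool = False


def violations(cp: CulturalProperty) -> List[str]:
    """Return the slide's constraints that this cultural property violates."""
    problems = []
    # Constraint: tangible and intangible cultural properties are disjoint.
    if cp.tangible and cp.intangible:
        problems.append("tangible and intangible are disjoint")
    # Constraint: a tangible property is movable or immovable, not both.
    if cp.movable and cp.immovable:
        problems.append("movable and immovable are disjoint")
    return problems


if __name__ == "__main__":
    # Deliberately inconsistent data, in the spirit of error provocation.
    broken = CulturalProperty("cp-2", tangible=True, intangible=True)
    print(violations(broken))
```

Feeding deliberately inconsistent data to such a check is one way to confirm that a constraint is actually enforced.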
40. Lessons learned
• Depending on the domain under analysis you may need requirements
from beyond domain experts
• Social aspects
• Terminology
• Administrative or even political constraints
• Diverse means to collect requirements
• Different tools for different elicitators
• Bias towards cataloguing standards
43. Ontology Design Patterns
Shortcut binary relations along with N-ary relations
Multiple languages (ita + eng)
Detailed documentation
(comments, usage examples, diagrams)
Design principles
Direct and indirect reuse
44. Direct and Indirect reuse
Direct reuse
• Delegating the conceptualisation of predicates and axioms to external ontologies
• e.g. dul:Event as type of individuals in my ontology
• When?
• you want, or have, to comply with an ontology
• Effects
• Changes in the reused external ontology impact the semantics of your ontology
• Less design effort
Indirect reuse
• Defining predicates and axioms in your ontology and aligning them to external ontologies
• e.g. myont:CulturalEvent rdfs:subClassOf dolce:Event
• When?
• you want to be interoperable but avoid dependency on external resources
• Effects
• If external changes impact the semantics of your ontology, you may accept them or remove the alignments
• More design effort
Valentina Presutti, Giorgia Lodi, Andrea Giovanni Nuzzolese, Aldo Gangemi, Silvio Peroni, Luigi Asprino: The Role of Ontology Design Patterns in Linked Data Projects. ER 2016: 113-121
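The difference between the two reuse styles can be illustrated with a toy sketch in which triples are plain Python tuples. It only assumes the two examples given on the slide (`dul:Event` for direct reuse, `myont:CulturalEvent rdfs:subClassOf dolce:Event` for indirect reuse). The individual `ex:exhibition1` and the helper `drop_alignments` are hypothetical, and a real project would use an RDF library rather than tuples.

```python
# Toy triples as (subject, predicate, object) tuples; all names are illustrative.
RDF_TYPE = "rdf:type"
SUBCLASS_OF = "rdfs:subClassOf"

# Direct reuse: individuals are typed with the external class itself.
direct = {
    ("ex:exhibition1", RDF_TYPE, "dul:Event"),
}

# Indirect reuse: a local class is defined and then aligned to the external one.
indirect = {
    ("myont:CulturalEvent", SUBCLASS_OF, "dolce:Event"),  # alignment axiom
    ("ex:exhibition1", RDF_TYPE, "myont:CulturalEvent"),  # local typing
}


def drop_alignments(triples, external_prefix):
    """Remove subclass alignments to an external ontology, keeping everything else."""
    return {
        (s, p, o)
        for (s, p, o) in triples
        if not (p == SUBCLASS_OF and o.startswith(external_prefix))
    }


if __name__ == "__main__":
    # With indirect reuse the alignment can be removed and the data stay typed locally.
    print(drop_alignments(indirect, "dolce:"))
    # With direct reuse there is no alignment to remove: the data depend on dul:Event.
    print(drop_alignments(direct, "dul:"))
```

In the indirect case, dropping the alignment leaves the individual typed by the local class. In the direct case there is nothing to drop, which is the dependency the slide warns about.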
45. ArCo: direct reuse
• OntoPiA: ontology network for Italian PA data
https://w3id.org/italia/onto/FULL
49. Modelling issues
• Our main source is a catalogue, which is about cultural properties
• A catalogue record describes a cultural property and includes information about its owner or administrator, as well as other administrative roles
• A change in the cultural property or in the information available about it may cause a new version of its associated catalogue record
• ArCo wants to model both the catalogue and the entities it is about
51. Open challenges and research questions
• Investigating the dynamics between catalogue record changes, the
cataloguing process, and the evolution of the cultural property over
time
• Knowledge graphs such as ArCo may be empirically studied to this aim
• Persistence of physical objects (e.g. cultural properties) vs fluency of
information objects
• cidoc:Spacetime_Volume subsumes (union?) cidoc:Presence,
cidoc:Physical_Thing, and cidoc:Period
• cidoc:Place equivalent to dul:space-region (an abstract place then?)
• cidoc:Actor is subsumed by cidoc:Persistent_Item (not a fluent?)
54. 9 categories of Cultural Properties
generalising over 30 more specific types:
e.g. musical, natural, numismatic,
scientific and technological
2 main orthogonal distinctions:
immovable vs. movable
tangible vs intangible
57. Modelling issues
• A cultural property may be associated with different types of locations
• A cultural property’s location has a temporal validity
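The two modelling issues above (typed locations and temporal validity) can be sketched as a small data structure. This is an illustration under assumptions, not ArCo's ontology: the class name `TimeIndexedTypedLocation`, its fields, the helper `locations_on` and the location-type labels are all chosen here for the example.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class TimeIndexedTypedLocation:
    """A location of a given type, valid over an optional time interval."""
    place: str
    location_type: str                 # illustrative labels, e.g. "current"
    valid_from: Optional[date] = None  # None means no known start
    valid_to: Optional[date] = None    # None means still valid / no known end

    def holds_on(self, day: date) -> bool:
        after_start = self.valid_from is None or self.valid_from <= day
        before_end = self.valid_to is None or day <= self.valid_to
        return after_start and before_end


def locations_on(
    locations: List[TimeIndexedTypedLocation],
    day: date,
    location_type: Optional[str] = None,
) -> List[TimeIndexedTypedLocation]:
    """Locations valid on a given day, optionally filtered by location type."""
    return [
        loc
        for loc in locations
        if loc.holds_on(day)
        and (location_type is None or loc.location_type == location_type)
    ]


if __name__ == "__main__":
    history = [
        TimeIndexedTypedLocation("Place B", "previous", date(1950, 1, 1), date(1999, 12, 31)),
        TimeIndexedTypedLocation("Place A", "current", date(2000, 1, 1)),
    ]
    print([loc.place for loc in locations_on(history, date(1980, 6, 1))])
```

Keeping the type and the validity interval on the location itself, rather than on the cultural property, lets one property carry several locations at once.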
63. Modelling issues
• Cultural properties can be involved in exhibitions, or other types of cultural events
• Cultural events always involve one or more cultural properties
• There are events that belong to series having a more-or-less regular frequency
64. ArCo: Event and Recurrent Event Series
http://ontologydesignpatterns.org/wiki/Submissions:RecurrentEventSeries
68. Lessons learned
To favor and facilitate reuse:
• Favoring local constraints
• e.g. constraints using general predicates
• Annotating patterns is tedious but precious
• Call for tools and incentives!
• Identifying potentially relevant ontologies: how to make it easier?
• F.A.I.R. + tools
70. F.A.I.R.
• Findable:
• ArCo has permanent URIs (w3id) to identify its entities
• ArCo knowledge graph has its DOI (10.5281/zenodo.2630447)
• ArCo is indexed on Linked Open Vocabularies
• Accessible:
• Use of open standard protocols and query language (HTTP(S) and SPARQL)
• Interoperable:
• Through RDF/OWL, ontology patterns and ontology reuse
• Reusable:
• We release ArCo under CC BY-SA 4.0 license
72. ID generation: from strings to entities
1. Identifying a possible key in the XML source
<AUTN>Friscia Albert</AUTN>
2. Removing possible URI-illegal characters and converting the string to lower case
friscia-albert
3. Sorting in alphabetical order
albert-friscia
4. Computing the MD5 checksum
dcd4ca7b54dd3d7dac083dd4c54a9eef
https://w3id.org/arco/resource/Agent/dcd4ca7b54dd3d7dac083dd4c54a9eef
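The ID-generation steps on this slide could be sketched roughly as follows. This is a minimal illustration, not ArCo's actual conversion code: the function names `normalise_key` and `mint_uri` are hypothetical, and the exact cleaning rules (which characters count as URI-illegal, how tokens are split and joined, which text encoding is hashed) are assumptions. The resulting digest is therefore not guaranteed to match the one shown on the slide.

```python
import hashlib
import re

# Base URI pattern as it appears in the slide's example.
BASE_URI = "https://w3id.org/arco/resource"


def normalise_key(raw: str) -> str:
    """Lower-case the key, drop non-word characters, and sort its tokens."""
    # Assumption: anything that is not a word character (letter, digit,
    # underscore) or whitespace is treated as URI-illegal and replaced by a space.
    cleaned = re.sub(r"[^\w\s]", " ", raw.lower())
    # Sorting tokens makes "Friscia Albert" and "Albert Friscia" converge.
    tokens = sorted(cleaned.split())
    return "-".join(tokens)


def mint_uri(entity_type: str, raw_key: str) -> str:
    """Build an entity URI from the MD5 checksum of the normalised key."""
    digest = hashlib.md5(normalise_key(raw_key).encode("utf-8")).hexdigest()
    return f"{BASE_URI}/{entity_type}/{digest}"


if __name__ == "__main__":
    print(normalise_key("Friscia Albert"))
    print(mint_uri("Agent", "Friscia Albert"))
```

Because the key is normalised before hashing, different spellings of the same string map to the same URI. The same property means that two distinct people with the same name also collapse into one entity, which is the disambiguation problem addressed in the cleaning phase.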
73. Cleaning and enriching ArCo knowledge graph
• Deduplication: two different entities are generated for the same one
• “Andrea d'Agnolo” and “Andrea d'Agnolo detto del Sarto”
• Entity linking applied on ArCo against itself
• Disambiguation: the same entity is generated for two or more different ones
• Identification of entity fingerprints: e.g. active period of an author and types of artworks she’s associated with
• Entity linking
• 28,838 owl:sameAs links to 20,479 distinct entities in other datasets
• Mainly authors (8,884) and locations (9,862)
78. Lessons learned and open challenges
• Testing is useful for early error discovery and for detecting new modeling issues
• Huge manual effort
• TESTaLOD: Prototype for automatic regression tests
• Although the testing methodology supports a systematic approach,
some aspects can be easily overlooked:
• error provocation and inference verification are mainly pulled by designers
http://testalod.herokuapp.com/
79. Conclusion
• Towards a knowledge graph of Italian Cultural Heritage
• Evolving content, enabling diverse usage: from business applications
to science discovery
• Call for tools for facilitating reuse and testing
• Ontology patterns annotation, ontology discovery, automatic testing, etc.
• Open questions and material for reflections
• how to make different modeling approaches co-exist and be compatible
• what is the best way to handle evolution
• how to capture requirements from diverse “types of experts”
I want to share with you the experience of a project that my group has developed in collaboration with the Italian Ministry of Cultural Heritage.
First I’ll tell you what we have so far and then I will tell you how we got there. This will give me the opportunity to share some lessons learned, as well as problems we have faced along the way. For some of them we identified a possible solution, some others are still open.
The Ministry of Cultural Heritage and Activities (MiBAC), together with regions and local agencies, cooperatively catalogues the Italian CH they own.
ICCD (the Institute of the General Catalogue and Documentation) of MiBAC coordinates this activity by maintaining the “General Catalogue of Italian Cultural Heritage”.
SIGECweb: the collaborative platform to which CH administrators submit their catalogue records.
Only authorised CH administrators (public or private institutions and organisations) can submit records.
Once a catalogue record is submitted, it goes through a validation process: format and compliance with ICCD standards, and scientific assessment.
Let’s see what a catalogue record looks like:
This is what the ICCD cataloguing standards documentation looks like
The content of a catalogue record:
Good side:
The model is very rich: from metadata to restoration processes, measurements, location, associated theories e.g. attribution, physical descriptions, etc.
Compared to other data sources, SIGEC is based on ICCD standards, which are richly documented and whose conceptualisation is based on scientific competence
Some elements take their values from controlled vocabularies
Although most elements’ values are text, many of them converge to uniform descriptions, which can help knowledge extraction
Bad side:
Data are mainly textual descriptions
The content types mainly define the format of the values in terms of XML datatypes (e.g. text), rather than their semantic types
Labels of XML elements are in Italian; ArCo provides both English and Italian labels
The text values are in Italian: at the moment ArCo does not provide an English translation, but this is an ongoing process
Many records have only a few mandatory fields filled in
During the project and by inspecting the catalogue records we realised that they followed different schemas in some cases.
In fact, the standards have evolved through different versions, which are often backward incompatible
The evolution of versions has limited documentation in terms of mapping
ICCD experts had to provide this additional information to allow us to develop the automatic conversion of catalogue records into a LOD knowledge graph
An ODP includes a vocabulary, an axiomatization, a set of requirements expressed in terms of the competency questions that it addresses, examples of use, an optional implementation and possibly a source, e.g. a theory it is based on. It can be very specific, but usually it addresses concepts that are general enough to be relevant for diverse domains.
User stories are reformulated as Competency Questions, which are used by the design team for ODP selection and by the testing team in the testing phase
The customer and the testing teams can contribute continuous and updated feedback, which allows the design team to detect new emerging requirements and errors early, and to schedule them for the next releases
ArCo is the root of the network: it imports all other modules and defines the main taxonomy of cultural property types
General purpose concepts, reused by all ontologies.
Catalogue: catalogue records linked to the CP they describe
Denotative: characteristics of a CP that are measurable according to a reference system: measurements (e.g. length), constituting materials (e.g. clay), employed techniques (e.g. melting), conservation status (e.g. good, decent, bad).
Context: info about the CP that are not measurable but influence the knowledge of a CP or its ontological status: authors, collectors, copyright holders; relations to other objects such as inventories, bibliography, protective measures, collections; activities such as surveys, conservation interventions; involvement in situations, e.g. commission, coin issuance, estimate, legal proceedings
Location: spatial and geometrical information
A catalogue record is an entity that describes a cultural property
As it denotes a real-world object, it can be defined as an information object (a piece of information independent from how it’s realised)
A catalogue record is a fluent entity: it changes as the description of the cultural property changes.
A catalogue record update can be caused by an ontological change (its conservation status) of the cultural property or by an epistemological change (new discoveries)
Every change corresponds to a new information object: a version of the catalogue record
The catalogue record however has its persistence in describing the same real-world object independently from the different versions
A catalogue record is a persistent information object related to each of its versions, which are in turn information objects reflecting every change in the content
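A rough sketch of this record-and-versions structure, under stated assumptions, is given below. The class and field names (`CatalogueRecord`, `CatalogueRecordVersion`, `change_kind`, and so on) are hypothetical and do not reproduce ArCo's ontology terms. The sketch only shows a persistent record that keeps pointing at the same cultural property while each change yields a new version.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class CatalogueRecordVersion:
    """One snapshot of a record's content (itself an information object)."""
    number: int
    content: Dict[str, str]
    change_kind: str  # "ontological" or "epistemological", as in the notes


@dataclass
class CatalogueRecord:
    """The persistent record: same described object across all versions."""
    identifier: str
    describes: str  # identifier of the cultural property
    versions: List[CatalogueRecordVersion] = field(default_factory=list)

    def update(self, content: Dict[str, str], change_kind: str) -> CatalogueRecordVersion:
        """Every change in content produces a new version; nothing is overwritten."""
        version = CatalogueRecordVersion(len(self.versions) + 1, dict(content), change_kind)
        self.versions.append(version)
        return version

    @property
    def current(self) -> Optional[CatalogueRecordVersion]:
        return self.versions[-1] if self.versions else None


if __name__ == "__main__":
    record = CatalogueRecord("record-1", "cultural-property-1")
    record.update({"conservation_status": "good"}, "epistemological")
    record.update({"conservation_status": "bad"}, "ontological")
    print(record.describes, [v.number for v in record.versions])
```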
Why do we model cultural properties mostly focusing on their persistent nature, while we give importance to the temporal evolution of catalogue records, hence modeling them as fluents? More in general: why do we model certain objects as fluents and others by focusing on their persistent attributes? Maybe for catalogue records their changes characterise them more than their persistent attributes, while for cultural properties what interests us more are the aspects that make them recognisable as the same over time
CIDOC models the location of a cultural property as dependent on moving events. It distinguishes, by means of relations between a CP and a place, three types of locations: current, current or former, permanent. It also expresses a concept of an object being a section of a place.
Recurrent events:
Frequency: a fuzzy concept
Organisation and processes linked to the series
Other events that are part of each event belonging to the series (see the Biennale and its events)
Satellite events of recurrent events (e.g. the workshops of a conference)
Event is conceptualised as a change of state in some system, a different concept than our cultural event. An Activity is an intentional action carried out by an actor that may result in a change of state, an event.
This corresponds to ArCo’s Activity, which for example subsumes Interventions such as Restoration, etc.