1. Technical Challenges and Approaches to Build an Open Ecosystem of Heterogeneous Heritage Collections
Ricard de la Vega
Natalia Torres
Albert Martínez
3. 1. Introduction
Who are we?
All the code, specifications and documentation are available under an open source MIT license on the GitHub Echoes page: https://github.com/CSUC/ECHOES-Tools
Technological partner
4. 1. Introduction
What is Echoes?
Echoes provides open, easy and innovative access to digital cultural assets from different institutions and is available in several languages. Within a single, integrated platform, users have access to a wide range of information on archaeology, architecture, books, monuments, people, photography, etc. This can be explored using different criteria: concepts, digital objects, people, places and time. The platform can be installed for a region or a theme.
5. 1. Introduction
What is Echoes?
Echoes has developed tools to analyze, clean and transform data collections to the Europeana Data Model (EDM), as well as tools to validate, enrich and publish heterogeneous data to a normalized data lake that can be exploited as linked open data and used with different data visualizations.
7. 1. Introduction
An example of 1+1=3
Pilot with 3 different collections:
‒ Archaeological heritage
‒ Architectural heritage
‒ Institutional repository
Pilot sites: Roses, Port de la Selva, Vall de Boí
19. 2. Technical architecture | Data homogenization
Echoes is a project about interoperability between different data collections.
Integrating data is not just about putting it together in a repository, but also about facilitating access so that it can be properly exploited by the public.
20. 2. Technical architecture | Data homogenization
If garbage comes in, then garbage comes out
To simplify the reuse and visualization of the data, all the records inserted into the system should have the same structure and format.
There are two ways to ensure data coherence and consistency (clean & transform the data):
‒ A priori, before insertion into the system
‒ A posteriori, in real time when the data is used
21. 2. Technical architecture | Data homogenization
If garbage comes in, then garbage comes out
Due to the complexity and the high volume of the data, Echoes takes the a priori approach: records are cleaned and transformed before they are inserted into the system, rather than in real time when the data is used.
22. 2. Technical architecture | Data homogenization
The homogenization pipeline has five steps (a toy end-to-end sketch follows below):
1. Analyze: analyze content from a source to “know about” your data; items are downloaded from the source into local files (optional)
2. Transform: transform to EDM
3. Quality Assurance: review each item and, based on defined rules, decide whether it can be loaded into the Data Lake; a quality report is produced
4. Enrich: enrich metadata from different sources (optional)
5. Publish: publish items into the Data Lake; only valid items can be loaded
Demo on https://youtu.be/LQSheaKJOiY
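As an illustration of how the five steps chain together, here is a minimal, self-contained Python sketch. Every function body is a toy stand-in, not the actual ECHOES-Tools implementation, and all field names are invented:

```python
from typing import Dict, List, Tuple

Record = Dict[str, str]  # a metadata record as field -> value pairs

def analyze(records: List[Record]) -> Dict[str, int]:
    """Step 1 (optional): profile the data by counting filled-in fields."""
    counts: Dict[str, int] = {}
    for rec in records:
        for field, value in rec.items():
            if value.strip():
                counts[field] = counts.get(field, 0) + 1
    return counts

def to_edm(rec: Record) -> Record:
    """Step 2: map source fields to EDM-style properties (toy mapping)."""
    return {"dc:title": rec.get("title", ""),
            "dcterms:spatial": rec.get("place", "")}

def quality_assurance(recs: List[Record]) -> Tuple[List[Record], List[Record]]:
    """Step 3: apply defined rules; here a single mandatory-title rule."""
    valid = [r for r in recs if r["dc:title"].strip()]
    rejected = [r for r in recs if not r["dc:title"].strip()]
    return valid, rejected

def enrich(rec: Record) -> Record:
    """Step 4 (optional): complete metadata from other sources (stub)."""
    return rec

def publish(recs: List[Record]) -> None:
    """Step 5: load only the valid items into the data lake (stub)."""
    for r in recs:
        print("published:", r)

records = [{"title": "Church of Sant Climent", "place": "Vall de Boí"},
           {"title": "", "place": "Roses"}]
print(analyze(records))
valid, rejected = quality_assurance([to_edm(r) for r in records])
publish([enrich(r) for r in valid])
```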
24. 2. Technical architecture | Data homogenization
‒ Gives feedback on the data properties
‒ Useful for getting to know the contents of the data, especially if you didn’t create the dataset
‒ Gives the ability to determine the usefulness of the data when you want to enrich it (a minimal profiling sketch follows below).
Ex. If there are no places in the dataset, enrichment with coordinates is impossible
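To give a flavour of what such profiling looks like, here is a minimal sketch that counts the instances and blank values of each metadata element; the sample XML and its element names are invented for illustration:

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Invented sample dataset: two records, one with a blank title.
sample = """<records>
  <record><title>Monastery of Sant Pere de Rodes</title><place>Port de la Selva</place></record>
  <record><title></title><place>Roses</place></record>
</records>"""

instances, blanks = Counter(), Counter()
for record in ET.fromstring(sample):
    for field in record:
        instances[field.tag] += 1          # how often the element appears
        if not (field.text or "").strip():
            blanks[field.tag] += 1         # how often it appears but is empty

for tag in instances:
    print(f"{tag}: {instances[tag]} instances, {blanks[tag]} blank")
```

If a profile like this shows no place elements at all, you know up front that coordinate enrichment is impossible.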
25. 2. Technical architecture | Data homogenization
The ECHOES Analyze tool:
‒ Accepts data as: a URL harvested via the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH), or an uploaded file
‒ Supports: A2A, Dublin Core, TopX, EAD, CARARE and custom formats
‒ Delivers: an analysis report in XML*
* An XML file can be easily imported into your favorite reporting tool.
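For readers unfamiliar with OAI-PMH, a minimal harvesting request looks roughly like this. The endpoint URL is a placeholder for a real repository's OAI-PMH base URL, and the sketch only prints record identifiers rather than writing local files:

```python
import requests
import xml.etree.ElementTree as ET

BASE_URL = "https://example.org/oai"  # hypothetical OAI-PMH endpoint

# Ask the repository for its records in the oai_dc (Dublin Core) format.
resp = requests.get(BASE_URL, params={"verb": "ListRecords",
                                      "metadataPrefix": "oai_dc"})
resp.raise_for_status()

ns = {"oai": "http://www.openarchives.org/OAI/2.0/"}
root = ET.fromstring(resp.content)
for record in root.iter("{http://www.openarchives.org/OAI/2.0/}record"):
    header = record.find("oai:header", ns)
    print(header.findtext("oai:identifier", namespaces=ns))
```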
27. 2. Technical architecture | Data homogenization
The ECHOES Transform tool:
‒ Accepts data as: a URL harvested via OAI-PMH, or an uploaded file
‒ Supports: A2A, Dublin Core, TopX, EAD, CARARE and custom formats
‒ Delivers: your dataset as EDM
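To give a flavour of what “transform to EDM” means, here is a toy mapping of a Dublin Core record to an edm:ProvidedCHO in RDF/XML; the mapping rules are invented and far simpler than the module's real ones:

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
EDM = "http://www.europeana.eu/schemas/edm/"
DC = "http://purl.org/dc/elements/1.1/"

dc_record = {"title": "Castle of la Trinitat", "creator": "Unknown"}

# Build an rdf:RDF root with a single edm:ProvidedCHO inside it.
rdf = ET.Element(f"{{{RDF}}}RDF")
cho = ET.SubElement(rdf, f"{{{EDM}}}ProvidedCHO",
                    {f"{{{RDF}}}about": "http://example.org/item/1"})

# Copy over the DC fields we know how to map.
for dc_field in ("title", "creator"):
    if dc_field in dc_record:
        ET.SubElement(cho, f"{{{DC}}}{dc_field}").text = dc_record[dc_field]

print(ET.tostring(rdf, encoding="unicode"))
```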
32. 2. Technical architecture | Data lake
‒ Blazegraph™ DB is an ultra-high-performance graph database supporting the Blueprints and RDF/SPARQL APIs
‒ Ex. it powers the Wikimedia Foundation's Wikidata Query Service
‒ https://github.com/blazegraph/database
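For reference, Blazegraph speaks the standard SPARQL protocol, so querying it needs nothing more than an HTTP request. The endpoint path below is Blazegraph's usual default and may need adjusting for a given installation:

```python
import requests

ENDPOINT = "http://localhost:9999/blazegraph/namespace/kb/sparql"

# The simplest possible query: the first ten triples in the store.
query = "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 10"

resp = requests.get(ENDPOINT, params={"query": query},
                    headers={"Accept": "application/sparql-results+json"})
resp.raise_for_status()
for row in resp.json()["results"]["bindings"]:
    print(row["s"]["value"], row["p"]["value"], row["o"]["value"])
```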
45. 3. Challenges and Approaches
1. Different metadata schemas
2. Poor data quality
3. Data deduplication
4. Automatic enrichments
5. Too much data
6. Easy SPARQL queries
7. Different scope
8. User enrichments
46. 3.1. Challenges and Approaches | Different metadata schemas
‒ Different collections can have different metadata
schemas…
‒ Dublin Core (DC), A2A, EAD, Custom…
47. 3.1. Challenges and Approaches | Different metadata schemas
‒ It was necessary to have one metadata standard to map the datasets to
‒ We chose the Europeana Data Model (EDM)
‒ Transformation module: mapping to EDM from DC, A2A, EAD, TopX, custom metadata and CARARE
‒ The transformation tool is easily extensible to other formats: anyone who needs a format that is not on the list can create their own EDM mapping (and contribute it to the community), as sketched below
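The plug-in idea can be pictured as a registry of per-format mapping functions. This sketch is purely illustrative and does not reflect the actual ECHOES-Tools extension mechanism:

```python
from typing import Callable, Dict

# Registry of "source format name -> to-EDM mapping function".
MAPPERS: Dict[str, Callable[[dict], dict]] = {}

def register(fmt: str):
    """Decorator that registers a mapping function under a format name."""
    def wrap(fn: Callable[[dict], dict]) -> Callable[[dict], dict]:
        MAPPERS[fmt] = fn
        return fn
    return wrap

@register("oai_dc")
def dc_to_edm(record: dict) -> dict:
    return {"dc:title": record.get("title", "")}

@register("my_custom_format")  # a contributed mapping for a format not on the list
def custom_to_edm(record: dict) -> dict:
    return {"dc:title": record.get("naam", "")}

print(MAPPERS["my_custom_format"]({"naam": "Sint-Janskerk"}))
```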
49. 3.2. Challenges and Approaches | Poor data quality
‒ Sometimes the data quality is not as good as we would like it to be…
‒ This poor quality limits the exploitation of the data
‒ For example:
‒ A single field holding different geolocation levels: Bussum (municipality), Chicago (city), China (country)
‒ The same with dates (day and time, year, centuries…)
‒ Misspellings (Lide4n, Leideb, Lidedn, Leiden…)
50. 3.2. Challenges and Approaches | Poor data quality
3 modules have been developed:
‒ Analyze, focused on data profiling
Ex. blank cells, number of instances of each metadata element…
‒ Quality assurance, to validate the input data (a minimal rule sketch follows below)
Ex. empty mandatory fields, places without coordinates…
‒ Enrich, to complete some metadata
Ex. deriving coordinates (to show on a map) from a textual location
All the modules can be easily extended with new rules, statistics, checks and enrichments.
Quality reports can be used to improve the original datasets.
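Here is a minimal sketch of how such rules can be expressed, mirroring the slide's two examples (empty mandatory field, place without coordinates); the field names are illustrative:

```python
from typing import Dict, List

def check_mandatory(record: Dict) -> List[str]:
    """Rule 1: flag every mandatory field that is missing or empty."""
    return [f"empty mandatory field: {f}"
            for f in ("title", "identifier") if not record.get(f)]

def check_coordinates(record: Dict) -> List[str]:
    """Rule 2: a record with a place but no coordinates needs enrichment."""
    if record.get("place") and not record.get("coordinates"):
        return ["place without coordinates"]
    return []

def quality_report(record: Dict) -> List[str]:
    """Run all rules; an empty report means the record may be loaded."""
    return check_mandatory(record) + check_coordinates(record)

rec = {"title": "Church of Santa Maria", "place": "Roses"}
problems = quality_report(rec)
print(problems or "record is valid")
```

New rules are just more functions added to the report, which is how the "easily extended" claim above can be pictured.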
54. 3.3 Challenges and Approaches | Data deduplication
‒ Deduplication is easy if the items have identifying metadata.
‒ If not, different similarity and distance metrics (Levenshtein, Jaro-Winkler…) can be used to find duplicates with the Duke tool (see the sketch below).
‒ Useful to end up with a single value for places, dates…
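To make the metric idea concrete, here is a pure-Python Levenshtein distance used to flag near-duplicates of a place name; Duke itself is a separate Java tool with far richer matching:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

# The misspellings from the previous slide, plus one genuinely different name.
names = ["Leiden", "Lide4n", "Leideb", "Lidedn", "Utrecht"]
canonical = "Leiden"
dupes = [n for n in names
         if 0 < levenshtein(n.lower(), canonical.lower()) <= 2]
print(dupes)  # ['Lide4n', 'Leideb', 'Lidedn']
```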
55. 3.3 Challenges and Approaches | Data deduplication
‒ Ex. items from different Gencat and DIBA collections (with an id in the metadata)
‒ Matching is done using a custom identifier, BCIN or BCIL (local register identifiers for cultural assets)
57. 3.4 Challenges and Approaches | Automatic enrichments
‒ Which fields are candidates for enrichment?
We started with geolocations: A2A collections have a location but no coordinates, which are necessary to visualize the data on a map.
If the enrichment is mandatory, e.g. for proper presentation on a map, it is automatically done as the last step in the quality assurance module; if the enrichment is 'nice to have', it can be configured in the enrich module.
‒ Use existing or new metadata?
We extend the metadata schema to insert the enrichment (without modifying the original metadata), as sketched below.
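A sketch of this kind of enrichment, using GeoNames as one example geocoder (it requires a registered username; "demo" below is a placeholder). The coordinate field names are assumptions for illustration; the point is that they are new fields, so the original metadata stays untouched:

```python
import requests

def enrich_with_coordinates(record: dict, username: str = "demo") -> dict:
    """Look up coordinates for a textual place and add them as new fields."""
    resp = requests.get("http://api.geonames.org/searchJSON",
                        params={"q": record["place"], "maxRows": 1,
                                "username": username})
    hits = resp.json().get("geonames", [])
    enriched = dict(record)      # copy: the original record is not modified
    if hits:
        enriched["wgs84:lat"] = hits[0]["lat"]    # assumed new field names
        enriched["wgs84:long"] = hits[0]["lng"]
    return enriched

print(enrich_with_coordinates({"place": "Vall de Boí"}))
```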
59. 3.4 Challenges and Approaches | Automatic enrichments
‒ Data visualization of an A2A collection (without
original coordinates) on a map
60. 3.4 Challenges and Approaches | Automatic enrichments
‒ Another challenge: some third-party APIs have usage limitations, such as:
‒ Limits on the number of connections
‒ Premium options (€)
‒ One approach is to download part or all of the API's data (a cache), if possible… (see the sketch below)
Examples of the limitations found across the services used:
‒ MaxResults: 10,000; MaxQueryExecutionTime = 120'; MaxQueryCostEstimationTime = 1,500'; connection limit = 50; maximum request rate = 100
‒ A daily downloadable worldwide text file, plus a REST API
‒ A limit of 20,000 results; an hourly limit of 1,000 credits; premium subscription
‒ An endpoint refreshed monthly
‒ A web service with no information about limitations
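A minimal sketch of the caching approach: keep answered queries in a local JSON file so each distinct place name hits the remote API only once. The geocoder, parameters and file name are illustrative:

```python
import json
import os
import requests

CACHE_FILE = "geocode_cache.json"  # illustrative local cache store

def cached_lookup(place: str, username: str = "demo") -> dict:
    """Return the API answer for a place, calling the API only on a miss."""
    cache = {}
    if os.path.exists(CACHE_FILE):
        with open(CACHE_FILE) as f:
            cache = json.load(f)
    if place not in cache:
        resp = requests.get("http://api.geonames.org/searchJSON",
                            params={"q": place, "maxRows": 1,
                                    "username": username})
        cache[place] = resp.json()
        with open(CACHE_FILE, "w") as f:
            json.dump(cache, f)
    return cache[place]
```

With heritage data, place names repeat constantly across records, so even this naive cache cuts the request volume dramatically.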
62. 3.5. Challenges and Approaches | Too much data
What's this?
a) A flower?
b) A black hole?
c) A (not user-friendly) data visualization of a 450K-node graph?
63. 3.5. Challenges and Approaches | Too much data
‒ Divide and conquer strategy: pick a focus point (e.g. based on a search) and let the system compute the “optimal” relevant context given the user's current interests (see the sketch below).
‒ Don't aim to explore the whole database; focus on specific domains. Ex. different visualization tools are developed based on the type of information to be displayed (maps, timespans, graphs, etc.)
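The "expand on demand" idea from van Ham & Perer in miniature: starting from a focus node, reveal neighbours breadth-first until a size budget is reached, instead of rendering all 450K nodes at once. A toy sketch with an invented graph:

```python
from collections import deque

def relevant_context(graph: dict, focus: str, budget: int = 5) -> set:
    """Breadth-first expansion from a focus node, capped at `budget` nodes."""
    shown, queue = {focus}, deque([focus])
    while queue and len(shown) < budget:
        for neighbour in graph.get(queue.popleft(), []):
            if neighbour not in shown and len(shown) < budget:
                shown.add(neighbour)
                queue.append(neighbour)
    return shown

# Invented adjacency list standing in for the heritage graph.
g = {"Roses": ["Citadel", "Church"], "Citadel": ["Excavation"],
     "Church": ["Altarpiece"], "Excavation": ["Pottery"]}
print(relevant_context(g, "Roses"))  # at most 5 nodes around the focus
```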
65. 3.6. Challenges and Approaches | Easy SPARQL queries
‒ All the data is accessible in RDF format via a
linked open data endpoint.
‒ A user-friendly interface (YASGUI) is integrated
to access this data.
‒ User-friendly? Only if you know the SPARQL
language and database structure. So we found:
66. 3.6. Challenges and Approaches | Easy SPARQL queries
A visual SPARQL query system that lets you drag database elements onto a canvas and 'build' your query (Visual SPARQL Builder)
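For illustration, here is the kind of query such a builder assembles behind the scenes, written against common EDM properties; the query string can be pasted into YASGUI as-is:

```python
# SPARQL a visual builder might generate: titles and places of heritage
# objects, using standard EDM / Dublin Core prefixes.
QUERY = """
PREFIX dc:      <http://purl.org/dc/elements/1.1/>
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX edm:     <http://www.europeana.eu/schemas/edm/>

SELECT ?title ?place WHERE {
  ?cho a edm:ProvidedCHO ;
       dc:title ?title .
  OPTIONAL { ?cho dcterms:spatial ?place }
}
LIMIT 100
"""
print(QUERY)
```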
68. 3.7. Challenges and Approaches | Different scope
‒ From small institutions to regional thematic installations. One size fits all?
‒ The technology that is developed is scalable, so it covers many different scenarios. Performance tests have been designed to check behaviour with large collections.
‒ The modular approach enables (smaller) institutions to 'mix and match' modules; for example, using only the ECHOES transformation module to transform one collection to EDM or linked open data
70. 3.8 Challenges and Approaches | User enrichments
One of the objectives of the project is to give users the possibility to enrich the content.
Not initiated yet…
72. 4. Lessons Learned
Some decisions that we would take again:
‒ Use of an agile methodology. Flexible to changes (ex. the shift of focus to data quality). Team collaboration across iterations keeps everyone aligned in the same direction.
‒ A multidisciplinary team brings different points of view to solving the challenges; so do different countries.
‒ Start from the beginning: focus on the input data before the enrichments.
‒ Learning by doing: the best way to know if something works is to test it.
74. 5. Results and future development
After 2.5 years we have done…
‒ 27 one-month iteration sprints (“cookies meetings”)
‒ 7 releases
‒ 1 modular product, version 1.5
‒ 1 open source community (benevolent dictator for life model)
All the code, specifications and documentation are available under an open source MIT license on the GitHub Echoes page: https://github.com/CSUC/ECHOES-Tools
75. 5. Results and future development
The developed tools allow you to analyze, clean and transform data collections to the EDM standard, and to validate, enrich and publish heterogeneous data to a normalized data lake that can be exploited as linked open data and with different data visualizations.
77. 5. Results and future development
The current status of the development:
‒ Improving the data source mapping and transformation tools
‒ Focusing on the enrichments
‒ More users of the platform are expected, to help grow the community
Join us!
79. 6. References
• Ariela Netiv & Walther Hasselo, “ECHOES: cooperation across heritage disciplines, institutes and borders” (IS&T, Washington, 2018), pp. 70-74
• Lluís M. Anglada, Sandra Reoyo, Ramon Ros & Ricard de la Vega, “Doing it together: spreading ORCID among Catalan universities and researchers” (ORCID-CASRAI Joint Conference, Barcelona, 2015)
• Anisa Rula, Andrea Maurino & Carlo Batini, “Data Quality Issues in Linked Open Data” (part of the Data-Centric Systems and Applications book series, DCSA, 2016)
• Europeana Data Model (EDM): https://pro.europeana.eu/resources/standardization-tools/edm-documentation
• Duke, a tool to find duplicates: https://github.com/larsga/Duke
• Frank van Ham & Adam Perer, “Search, Show Context, Expand on Demand: Supporting Large Graph Exploration with Degree-of-Interest”: http://perer.org/papers/adamPerer-DOIGraphs-InfoVis2009.pdf
81. Thanks for your attention
Development team:
Ricard de la Vega | ricard.delavega@csuc.cat
Natalia Torres | natalia.torres@csuc.cat
Albert Martínez | albert.martinez@csuc.cat