1. Technical Challenges and Approaches to Build an Open Ecosystem of Heterogeneous Heritage Collections
Ricard de la Vega
Natalia Torres
Albert Martínez
3. 1. Introduction
Who are we?
All the code, specifications and documentation are available under an open source MIT license on the GitHub Echoes page: https://github.com/CSUC/ECHOES-Tools
Technological partner
4. 1. Introduction
What is Echoes?
Echoes provides open, easy and innovative access to digital cultural assets from different institutions and is available in several languages. Within a single, integrated platform, users have access to a wide range of information on archaeology, architecture, books, monuments, people, photography, etc. This can be explored using different criteria: concepts, digital objects, people, places and time. The platform can be installed for a region or a theme.
5. 1. Introduction
What is Echoes?
Echoes has developed tools to analyze, clean and transform data collections to the Europeana Data Model (EDM), as well as tools to validate, enrich and publish heterogeneous data to a normalized data lake that can be exploited as linked open data and used with different data visualizations.
7. 1. Introduction
An example of 1+1=3
Pilot with 3 different collections:
‒ Archaeological heritage
‒ Architectural heritage
‒ Institutional repository
Pilot sites: Roses, Port de la Selva, Vall de Boí
19. 2. Technical architecture | Data homogenization
Echoes is a project about interoperability between different data collections.
Integrating data is not just about putting it together in a repository, but also about facilitating access so that it can be properly exploited by the public.
20. 2. Technical architecture | Data homogenization
If garbage comes in, then garbage comes out
To simplify the reuse and visualization of the data, all the records inserted into the system should have the same structure and format.
There are two ways to ensure data coherence and consistency (clean & transform the data):
‒ A priori, before insertion into the system
‒ A posteriori, in real time when the data is used
21. 2. Technical architecture | Data homogenization
If garbage comes in, then garbage comes out
Due to the complexity and the high volume of the data, Echoes takes the a priori approach: records are cleaned and transformed before they are inserted into the system, rather than in real time when the data is used.
22. 2. Technical architecture | Data homogenization
The homogenization pipeline has five steps (a toy end-to-end sketch follows below):
1. Analyze: analyze content from a source to “know about” your data; items are downloaded from the source into local files (optional)
2. Transform: transform to EDM
3. Quality Assurance: review each item and, based on defined rules, decide whether it can be loaded into the Data Lake; a quality report is produced
4. Enrich: enrich metadata from different sources (optional)
5. Publish: publish items into the Data Lake; only valid items can be loaded
Demo on https://youtu.be/LQSheaKJOiY
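As an illustration of how the five steps chain together, here is a minimal, self-contained Python sketch. Every function body is a toy stand-in, not the actual ECHOES-Tools implementation, and all field names are invented:

```python
from typing import Dict, List, Tuple

Record = Dict[str, str]  # a metadata record as field -> value pairs

def analyze(records: List[Record]) -> Dict[str, int]:
    """Step 1 (optional): profile the data by counting filled-in fields."""
    counts: Dict[str, int] = {}
    for rec in records:
        for field, value in rec.items():
            if value.strip():
                counts[field] = counts.get(field, 0) + 1
    return counts

def to_edm(rec: Record) -> Record:
    """Step 2: map source fields to EDM-style properties (toy mapping)."""
    return {"dc:title": rec.get("title", ""),
            "dcterms:spatial": rec.get("place", "")}

def quality_assurance(recs: List[Record]) -> Tuple[List[Record], List[Record]]:
    """Step 3: apply defined rules; here a single mandatory-title rule."""
    valid = [r for r in recs if r["dc:title"].strip()]
    rejected = [r for r in recs if not r["dc:title"].strip()]
    return valid, rejected

def enrich(rec: Record) -> Record:
    """Step 4 (optional): complete metadata from other sources (stub)."""
    return rec

def publish(recs: List[Record]) -> None:
    """Step 5: load only the valid items into the data lake (stub)."""
    for r in recs:
        print("published:", r)

records = [{"title": "Church of Sant Climent", "place": "Vall de Boí"},
           {"title": "", "place": "Roses"}]
print(analyze(records))
valid, rejected = quality_assurance([to_edm(r) for r in records])
publish([enrich(r) for r in valid])
```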
24. 2. Technical architecture | Data homogenization
‒ Gives feedback on the data properties
‒ Useful for getting to know the contents of the data, especially if you didn’t create the dataset
‒ Gives the ability to determine the usefulness of the data when you want to enrich it (a minimal profiling sketch follows below).
Ex. If there are no places in the dataset, enrichment with coordinates is impossible
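To give a flavour of what such profiling looks like, here is a minimal sketch that counts the instances and blank values of each metadata element; the sample XML and its element names are invented for illustration:

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Invented sample dataset: two records, one with a blank title.
sample = """<records>
  <record><title>Monastery of Sant Pere de Rodes</title><place>Port de la Selva</place></record>
  <record><title></title><place>Roses</place></record>
</records>"""

instances, blanks = Counter(), Counter()
for record in ET.fromstring(sample):
    for field in record:
        instances[field.tag] += 1          # how often the element appears
        if not (field.text or "").strip():
            blanks[field.tag] += 1         # how often it appears but is empty

for tag in instances:
    print(f"{tag}: {instances[tag]} instances, {blanks[tag]} blank")
```

If a profile like this shows no place elements at all, you know up front that coordinate enrichment is impossible.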
25. 2. Technical architecture | Data homogenization
The ECHOES Analyze tool:
‒ Accepts data as: a URL harvested via the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH), or an uploaded file
‒ Supports: A2A, Dublin Core, TopX, EAD, CARARE and custom formats
‒ Delivers: an analysis report in XML*
* An XML file can be easily imported into your favorite reporting tool.
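For readers unfamiliar with OAI-PMH, a minimal harvesting request looks roughly like this. The endpoint URL is a placeholder for a real repository's OAI-PMH base URL, and the sketch only prints record identifiers rather than writing local files:

```python
import requests
import xml.etree.ElementTree as ET

BASE_URL = "https://example.org/oai"  # hypothetical OAI-PMH endpoint

# Ask the repository for its records in the oai_dc (Dublin Core) format.
resp = requests.get(BASE_URL, params={"verb": "ListRecords",
                                      "metadataPrefix": "oai_dc"})
resp.raise_for_status()

ns = {"oai": "http://www.openarchives.org/OAI/2.0/"}
root = ET.fromstring(resp.content)
for record in root.iter("{http://www.openarchives.org/OAI/2.0/}record"):
    header = record.find("oai:header", ns)
    print(header.findtext("oai:identifier", namespaces=ns))
```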
27. 2. Technical architecture | Data homogenization
The ECHOES Transform tool:
‒ Accepts data as: a URL harvested via OAI-PMH, or an uploaded file
‒ Supports: A2A, Dublin Core, TopX, EAD, CARARE and custom formats
‒ Delivers: your dataset as EDM
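To give a flavour of what “transform to EDM” means, here is a toy mapping of a Dublin Core record to an edm:ProvidedCHO in RDF/XML; the mapping rules are invented and far simpler than the module's real ones:

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
EDM = "http://www.europeana.eu/schemas/edm/"
DC = "http://purl.org/dc/elements/1.1/"

dc_record = {"title": "Castle of la Trinitat", "creator": "Unknown"}

# Build an rdf:RDF root with a single edm:ProvidedCHO inside it.
rdf = ET.Element(f"{{{RDF}}}RDF")
cho = ET.SubElement(rdf, f"{{{EDM}}}ProvidedCHO",
                    {f"{{{RDF}}}about": "http://example.org/item/1"})

# Copy over the DC fields we know how to map.
for dc_field in ("title", "creator"):
    if dc_field in dc_record:
        ET.SubElement(cho, f"{{{DC}}}{dc_field}").text = dc_record[dc_field]

print(ET.tostring(rdf, encoding="unicode"))
```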
32. 2. Technical architecture | Data lake
‒ Blazegraph™ DB is an ultra-high-performance graph database supporting the Blueprints and RDF/SPARQL APIs
‒ Ex. it powers the Wikimedia Foundation's Wikidata Query Service
‒ https://github.com/blazegraph/database
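For reference, Blazegraph speaks the standard SPARQL protocol, so querying it needs nothing more than an HTTP request. The endpoint path below is Blazegraph's usual default and may need adjusting for a given installation:

```python
import requests

ENDPOINT = "http://localhost:9999/blazegraph/namespace/kb/sparql"

# The simplest possible query: the first ten triples in the store.
query = "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 10"

resp = requests.get(ENDPOINT, params={"query": query},
                    headers={"Accept": "application/sparql-results+json"})
resp.raise_for_status()
for row in resp.json()["results"]["bindings"]:
    print(row["s"]["value"], row["p"]["value"], row["o"]["value"])
```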
45. 3. Challenges and Approaches
1. Different metadata schemas
2. Poor data quality
3. Data deduplication
4. Automatic enrichments
5. Too much data
6. Easy SPARQL queries
7. Different scope
8. User enrichments
46. 3.1. Challenges and Approaches | Different metadata schemas
‒ Different collections can have different metadata
schemas…
‒ Dublin Core (DC), A2A, EAD, Custom…
47. 3.1. Challenges and Approaches | Different metadata schemas
‒ It was necessary to have one metadata standard to map the datasets to
‒ We chose the Europeana Data Model (EDM)
‒ Transformation module: mapping to EDM from DC, A2A, EAD, TopX, custom metadata and CARARE
‒ The transformation tool is easily extensible to other formats: anyone who needs a format that is not on the list can create their own EDM mapping (and contribute it to the community), as sketched below
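The plug-in idea can be pictured as a registry of per-format mapping functions. This sketch is purely illustrative and does not reflect the actual ECHOES-Tools extension mechanism:

```python
from typing import Callable, Dict

# Registry of "source format name -> to-EDM mapping function".
MAPPERS: Dict[str, Callable[[dict], dict]] = {}

def register(fmt: str):
    """Decorator that registers a mapping function under a format name."""
    def wrap(fn: Callable[[dict], dict]) -> Callable[[dict], dict]:
        MAPPERS[fmt] = fn
        return fn
    return wrap

@register("oai_dc")
def dc_to_edm(record: dict) -> dict:
    return {"dc:title": record.get("title", "")}

@register("my_custom_format")  # a contributed mapping for a format not on the list
def custom_to_edm(record: dict) -> dict:
    return {"dc:title": record.get("naam", "")}

print(MAPPERS["my_custom_format"]({"naam": "Sint-Janskerk"}))
```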
49. 3.2. Challenges and Approaches | Poor data quality
‒ Sometimes the data quality is not as good as we would like it to be…
‒ This poor quality limits the exploitation of the data
‒ For example:
‒ A single field holding different geolocation levels: Bussum (municipality), Chicago (city), China (country)
‒ The same with dates (day and time, year, centuries…)
‒ Misspellings (Lide4n, Leideb, Lidedn, Leiden…)
50. 3.2. Challenges and Approaches | Poor data quality
3 modules have been developed:
‒ Analyze, focused on data profiling
Ex. blank cells, number of instances of each metadata element…
‒ Quality assurance, to validate the input data (a minimal rule sketch follows below)
Ex. empty mandatory fields, places without coordinates…
‒ Enrich, to complete some metadata
Ex. deriving coordinates (to show on a map) from a textual location
All the modules can be easily extended with new rules, statistics, checks and enrichments.
Quality reports can be used to improve the original datasets.
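Here is a minimal sketch of how such rules can be expressed, mirroring the slide's two examples (empty mandatory field, place without coordinates); the field names are illustrative:

```python
from typing import Dict, List

def check_mandatory(record: Dict) -> List[str]:
    """Rule 1: flag every mandatory field that is missing or empty."""
    return [f"empty mandatory field: {f}"
            for f in ("title", "identifier") if not record.get(f)]

def check_coordinates(record: Dict) -> List[str]:
    """Rule 2: a record with a place but no coordinates needs enrichment."""
    if record.get("place") and not record.get("coordinates"):
        return ["place without coordinates"]
    return []

def quality_report(record: Dict) -> List[str]:
    """Run all rules; an empty report means the record may be loaded."""
    return check_mandatory(record) + check_coordinates(record)

rec = {"title": "Church of Santa Maria", "place": "Roses"}
problems = quality_report(rec)
print(problems or "record is valid")
```

New rules are just more functions added to the report, which is how the "easily extended" claim above can be pictured.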
54. 3.3 Challenges and Approaches | Data deduplication
‒ Deduplication is easy if the items have identifying metadata.
‒ If not, different similarity and distance metrics (Levenshtein, Jaro-Winkler…) can be used to find duplicates with the Duke tool (see the sketch below).
‒ Useful to end up with a single value for places, dates…
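To make the metric idea concrete, here is a pure-Python Levenshtein distance used to flag near-duplicates of a place name; Duke itself is a separate Java tool with far richer matching:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

# The misspellings from the previous slide, plus one genuinely different name.
names = ["Leiden", "Lide4n", "Leideb", "Lidedn", "Utrecht"]
canonical = "Leiden"
dupes = [n for n in names
         if 0 < levenshtein(n.lower(), canonical.lower()) <= 2]
print(dupes)  # ['Lide4n', 'Leideb', 'Lidedn']
```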
55. 3.3 Challenges and Approaches | Data deduplication
‒ Ex. items from different Gencat and DIBA collections (with an id in the metadata)
‒ Matching is done using a custom identifier, BCIN or BCIL (local register identifiers for cultural assets)
57. 3.4 Challenges and Approaches | Automatic enrichments
‒ Which fields are candidates for enrichment?
We started with geolocations: A2A collections have a location but no coordinates, which are necessary to visualize the data on a map.
If the enrichment is mandatory, e.g. for proper presentation on a map, it is automatically done as the last step in the quality assurance module; if the enrichment is 'nice to have', it can be configured in the enrich module.
‒ Use existing or new metadata?
We extend the metadata schema to insert the enrichment (without modifying the original metadata), as sketched below.
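A sketch of this kind of enrichment, using GeoNames as one example geocoder (it requires a registered username; "demo" below is a placeholder). The coordinate field names are assumptions for illustration; the point is that they are new fields, so the original metadata stays untouched:

```python
import requests

def enrich_with_coordinates(record: dict, username: str = "demo") -> dict:
    """Look up coordinates for a textual place and add them as new fields."""
    resp = requests.get("http://api.geonames.org/searchJSON",
                        params={"q": record["place"], "maxRows": 1,
                                "username": username})
    hits = resp.json().get("geonames", [])
    enriched = dict(record)      # copy: the original record is not modified
    if hits:
        enriched["wgs84:lat"] = hits[0]["lat"]    # assumed new field names
        enriched["wgs84:long"] = hits[0]["lng"]
    return enriched

print(enrich_with_coordinates({"place": "Vall de Boí"}))
```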
59. 3.4 Challenges and Approaches | Automatic enrichments
‒ Data visualization of an A2A collection (without
original coordinates) on a map
60. 3.4 Challenges and Approaches | Automatic enrichments
‒ Another challenge: some third-party APIs have usage limitations, such as:
‒ Limits on the number of connections
‒ Premium options (€)
‒ One approach is to download part or all of the API's data (a cache), if possible… (see the sketch below)
Examples of the limitations found across the services used:
‒ MaxResults: 10,000; MaxQueryExecutionTime = 120'; MaxQueryCostEstimationTime = 1,500'; connection limit = 50; maximum request rate = 100
‒ A daily downloadable worldwide text file, plus a REST API
‒ A limit of 20,000 results; an hourly limit of 1,000 credits; premium subscription
‒ An endpoint refreshed monthly
‒ A web service with no information about limitations
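A minimal sketch of the caching approach: keep answered queries in a local JSON file so each distinct place name hits the remote API only once. The geocoder, parameters and file name are illustrative:

```python
import json
import os
import requests

CACHE_FILE = "geocode_cache.json"  # illustrative local cache store

def cached_lookup(place: str, username: str = "demo") -> dict:
    """Return the API answer for a place, calling the API only on a miss."""
    cache = {}
    if os.path.exists(CACHE_FILE):
        with open(CACHE_FILE) as f:
            cache = json.load(f)
    if place not in cache:
        resp = requests.get("http://api.geonames.org/searchJSON",
                            params={"q": place, "maxRows": 1,
                                    "username": username})
        cache[place] = resp.json()
        with open(CACHE_FILE, "w") as f:
            json.dump(cache, f)
    return cache[place]
```

With heritage data, place names repeat constantly across records, so even this naive cache cuts the request volume dramatically.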
62. 3.5. Challenges and Approaches | Too much data
What's this?
a) A flower?
b) A black hole?
c) A (not user-friendly) data visualization of a 450K-node graph?
63. 3.5. Challenges and Approaches | Too much data
‒ Divide and conquer strategy: pick a focus point (e.g. based on a search) and let the system compute the “optimal” relevant context given the user's current interests (see the sketch below).
‒ Don't aim to explore the whole database; focus on specific domains. Ex. different visualization tools are developed based on the type of information to be displayed (maps, timespans, graphs, etc.)
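The "expand on demand" idea from van Ham & Perer in miniature: starting from a focus node, reveal neighbours breadth-first until a size budget is reached, instead of rendering all 450K nodes at once. A toy sketch with an invented graph:

```python
from collections import deque

def relevant_context(graph: dict, focus: str, budget: int = 5) -> set:
    """Breadth-first expansion from a focus node, capped at `budget` nodes."""
    shown, queue = {focus}, deque([focus])
    while queue and len(shown) < budget:
        for neighbour in graph.get(queue.popleft(), []):
            if neighbour not in shown and len(shown) < budget:
                shown.add(neighbour)
                queue.append(neighbour)
    return shown

# Invented adjacency list standing in for the heritage graph.
g = {"Roses": ["Citadel", "Church"], "Citadel": ["Excavation"],
     "Church": ["Altarpiece"], "Excavation": ["Pottery"]}
print(relevant_context(g, "Roses"))  # at most 5 nodes around the focus
```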
65. 3.6. Challenges and Approaches | Easy SPARQL queries
‒ All the data is accessible in RDF format via a
linked open data endpoint.
‒ A user-friendly interface (YASGUI) is integrated
to access this data.
‒ User-friendly? Only if you know the SPARQL
language and database structure. So we found:
66. 3.6. Challenges and Approaches | Easy SPARQL queries
A visual SPARQL query system that lets you drag database elements onto a canvas and 'build' your query (Visual SPARQL Builder)
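For illustration, here is the kind of query such a builder assembles behind the scenes, written against common EDM properties; the query string can be pasted into YASGUI as-is:

```python
# SPARQL a visual builder might generate: titles and places of heritage
# objects, using standard EDM / Dublin Core prefixes.
QUERY = """
PREFIX dc:      <http://purl.org/dc/elements/1.1/>
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX edm:     <http://www.europeana.eu/schemas/edm/>

SELECT ?title ?place WHERE {
  ?cho a edm:ProvidedCHO ;
       dc:title ?title .
  OPTIONAL { ?cho dcterms:spatial ?place }
}
LIMIT 100
"""
print(QUERY)
```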
68. 3.7. Challenges and Approaches | Different scope
‒ From small institutions to regional thematic installations. One size fits all?
‒ The technology that is developed is scalable, so it covers many different scenarios. Performance tests have been designed to check behaviour with large collections.
‒ The modular approach enables (smaller) institutions to 'mix and match' modules; for example, using only the ECHOES transformation module to transform one collection to EDM or linked open data
70. 3.8 Challenges and Approaches | User enrichments
One of the objectives of the project is to give users the possibility to enrich the content.
Not initiated yet…
72. 4. Lessons Learned
Some decisions that we would take again:
‒ Use of an agile methodology. Flexible to changes (ex. the shift of focus to data quality). Team collaboration across iterations keeps everyone aligned in the same direction.
‒ A multidisciplinary team brings different points of view to solving the challenges; so do different countries.
‒ Start from the beginning: focus on the input data before the enrichments.
‒ Learning by doing: the best way to know if something works is to test it.
74. 5. Results and future development
After 2.5 years we have done…
‒ 27 one-month iteration sprints (“cookies meetings”)
‒ 7 releases
‒ 1 modular product, version 1.5
‒ 1 open source community (benevolent dictator for life model)
All the code, specifications and documentation are available under an open source MIT license on the GitHub Echoes page: https://github.com/CSUC/ECHOES-Tools
75. 5. Results and future development
The developed tools allow you to analyze, clean and transform data collections to the EDM standard, and to validate, enrich and publish heterogeneous data to a normalized data lake that can be exploited as linked open data and with different data visualizations.
77. 5. Results and future development
The current status of the development:
‒ Improving the data source mapping and transformation tools
‒ Focusing on the enrichments
‒ More users of the platform are expected, to help grow the community
Join us!
79. 6. References
• Ariela Netiv & Walther Hasselo, “ECHOES: cooperation across heritage disciplines, institutes and borders” (IS&T, Washington, 2018), pp. 70-74
• Lluís M. Anglada, Sandra Reoyo, Ramon Ros & Ricard de la Vega, “Doing it together: spreading ORCID among Catalan universities and researchers” (ORCID-CASRAI Joint Conference, Barcelona, 2015)
• Anisa Rula, Andrea Maurino & Carlo Batini, “Data Quality Issues in Linked Open Data” (part of the Data-Centric Systems and Applications book series, DCSA, 2016)
• Europeana Data Model (EDM): https://pro.europeana.eu/resources/standardization-tools/edm-documentation
• Duke, a tool to find duplicates: https://github.com/larsga/Duke
• Frank van Ham & Adam Perer, “Search, Show Context, Expand on Demand: Supporting Large Graph Exploration with Degree-of-Interest”: http://perer.org/papers/adamPerer-DOIGraphs-InfoVis2009.pdf
81. Thanks for your attention
Development team:
Ricard de la Vega | ricard.delavega@csuc.cat
Natalia Torres | natalia.torres@csuc.cat
Albert Martínez | albert.martinez@csuc.cat