Technical Challenges and
Approaches to build an
Open Ecosystem of
Heterogeneous Heritage
Collections
Ricard de la Vega
Natalia Torres
Albert Martínez
Agenda
1.Introduction
2.Technical Architecture
3.Challenges and Approaches
4.Lessons learned
5.Results and future developments
6.References
1. Introduction
Who are we?
All the code, specifications and documentation are available
under an open source MIT license on the Github Echoes
page: https://github.com/CSUC/ECHOES-Tools
Technological partner
1. Introduction
What is Echoes?
Echoes provides open, easy and innovative
access to digital cultural assets from different
institutions and is available in several languages.
Within a single and integrated platform, users have
access to a wide range of information on
archaeology, architecture, books, monuments,
people, photography etc. This can be explored
using different criteria: concepts, digital objects,
people, places and time. The platform can be
installed for a region or a theme.
1. Introduction
What is Echoes?
Echoes has developed tools to analyze, clean and
transform data collections to the Europeana Data
Model (EDM), as well as tools to validate, enrich
and publish heterogeneous data to a normalized
data lake that can be exploited as linked open
data and used with different data visualizations.
1. Introduction
What is Echoes?
1. Introduction
An example of 1+1=3
Pilot with 3 different collections
‒ Archaeological Heritage
‒ Architectural Heritage
‒ Institutional repository
Roses
Port de la Selva
Vall de Boí
1. Introduction
An example of 1+1=3
1. Introduction
How does Echoes work?
1. Data collections
2. Data homogenization
3. Data storage
4. Data access
Agenda
1.Introduction
2.Technical Architecture
3.Challenges and Approaches
4.Lessons learned
5.Results and future developments
6.References
2. Technical architecture
Modular approach
1. Input (data collections)
2. Mapping and transformation
tools (data homogenization)
3. Data lake (data storage)
4. Output
– SPARQL endpoint (RDF)
– Portal (WordPress)
– API-Rest, OAI-PMH
5. Enrichments
2. Technical architecture
Modular approach
1. Inputs
2. Mapping
3. Data lake
4. Output
– SPARQL
– Portal
– API-Rest,
– Enrichments
2. Technical architecture | Inputs
2. Technical architecture | Inputs | Examples
– ELO: 4K, 144K, 280K items (A2A)
– Tresoar: 21K, 36K, 2M items (A2A)
– Gencat: 1K (Custom)
[Bar chart: items per collection, ranging from 560 to 1,351,416 items, November 2018]
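Since collections like these are harvested over OAI-PMH, here is a minimal sketch of parsing an OAI-PMH ListRecords response to extract record identifiers and Dublin Core titles using only the standard library. The sample response and its single record are illustrative, not taken from a real repository.

```python
import xml.etree.ElementTree as ET

OAI = "{http://www.openarchives.org/OAI/2.0/}"
DC = "{http://purl.org/dc/elements/1.1/}"

def parse_list_records(xml_text):
    """Extract (identifier, title) pairs from an OAI-PMH ListRecords response."""
    root = ET.fromstring(xml_text)
    records = []
    for record in root.iter(OAI + "record"):
        header = record.find(OAI + "header")
        identifier = header.findtext(OAI + "identifier")
        title = record.find(".//" + DC + "title")   # dc:title inside the metadata
        records.append((identifier, title.text if title is not None else None))
    return records

sample = """<?xml version="1.0"?>
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Sant Climent de Taüll</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""

print(parse_list_records(sample))  # [('oai:example:1', 'Sant Climent de Taüll')]
```

A real harvester would fetch the XML page by page via the protocol's resumptionToken mechanism before saving the records to local files.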
2. Technical architecture | Data homogenization
2. Technical architecture | Data homogenization
Echoes is an interoperability project between
different data collections.
Integrating data is not just about putting it
together in a repository; it is also about
facilitating access so the data can be properly
exploited by the public.
2. Technical architecture | Data homogenization
If garbage comes in, then garbage comes out
To simplify the reuse and visualization of the data,
all the records inserted into the system should
have the same structure and format.
There are two ways to ensure data coherence
and consistency (cleaning and transforming the data):
‒ A priori, before insertion into the system, due to
the complexity and the high volume of the data
‒ A posteriori, in real time when the data is used
2. Technical architecture | Data homogenization
The pipeline has five steps: Analyze → Transform → Quality Assurance → Enrich → Publish
1. Analyze (optional): analyze content from a source to “know about” your data; download items into local files from a source
2. Transform: transform to EDM
3. Quality Assurance (optional): review each item and, based on defined rules, decide if it can be loaded into the Data Lake; produces a quality report
4. Enrich (optional): enrich metadata from different sources
5. Publish: publish items into the Data Lake; only valid items can be loaded
Demo on https://youtu.be/LQSheaKJOiY
2. Technical architecture | Data homogenization
‒ Gives feedback on the data properties
‒ Useful for getting to know the contents of the data,
especially if you didn’t create the dataset
‒ Gives the ability to determine the usefulness of
the data when you want to enrich it.
Ex. If there are no places in the dataset, enrichment
with coordinates is impossible
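A minimal sketch of the kind of data profiling the Analyze module performs (the field names and sample records below are made up for illustration): for each metadata field, count how many records fill it and how many leave it blank.

```python
from collections import Counter

def profile(records):
    """For each metadata field, return (filled, blank) counts across all records."""
    field_counts = Counter()
    for record in records:
        for field, value in record.items():
            if value not in (None, "", []):   # treat these as blank
                field_counts[field] += 1
    total = len(records)
    return {field: (n, total - n) for field, n in field_counts.items()}

# Two illustrative records: one is missing a usable dc:coverage value.
records = [
    {"dc:title": "Castell de la Trinitat", "dc:coverage": "Roses"},
    {"dc:title": "Sant Pere de Rodes", "dc:coverage": ""},
]
print(profile(records))  # {'dc:title': (2, 0), 'dc:coverage': (1, 1)}
```

A report like this immediately shows, for example, that half the records have no place value, so a coordinates enrichment would only cover part of the collection.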
2. Technical architecture | Data homogenization
ECHOES Workshop, Archiving 2019
ECHOES Analyze (optional)
Accepts data as a: URL (Open Archives Initiative protocol for metadata harvesting, OAI-PMH) or an uploaded file
Supports: A2A, Dublin Core, TopX, EAD, CARARE, Custom
Delivers: an analyze report in XML*
* An XML file can be easily imported in your favorite reporting tool.
2. Technical architecture | Data homogenization
ECHOES Transform
Accepts data as a: URL (Open Archives Initiative protocol for metadata harvesting, OAI-PMH) or an uploaded file
Supports: A2A, Dublin Core, TopX, EAD, CARARE, Custom
Delivers: your EDM dataset
Quality Assurance Module
2. Technical architecture | Data homogenization
Three checks (optional module): 1. Schema → 2. Semantics → 3. Content
1. Schema. Review: tags, mandatory fields. Results: OK → next step; Error → stop
2. Semantics. Review: Schematron rules. Results: OK → next step; Error → stop
3. Content. Review: metadata fields based on configurable specs. Results: OK → valid item; Error → stop; Warning → partially valid item; Info → valid item
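The three review stages can be sketched as a single validation function. The rules shown (the mandatory-field list, a date check standing in for a Schematron rule, and a place-needs-coordinates content spec) are illustrative assumptions, not the actual ECHOES rule set.

```python
MANDATORY = ["dc:title", "dc:identifier"]  # illustrative schema rule

def check_item(item):
    """Three-stage QA sketch: schema -> semantics -> content.
    Returns 'error', 'warning' or 'ok'; the first failing stage stops the check."""
    # 1. Schema: mandatory fields must be present
    for field in MANDATORY:
        if field not in item:
            return "error"
    # 2. Semantics: stand-in for a Schematron rule, e.g. a date must start with a year
    date = item.get("dc:date")
    if date is not None and not date[:4].isdigit():
        return "error"
    # 3. Content: configurable spec, e.g. a place should come with coordinates
    if "place" in item and "coordinates" not in item:
        return "warning"  # partially valid: loadable, but flagged in the quality report
    return "ok"

print(check_item({"dc:title": "t", "dc:identifier": "1", "place": "Leiden"}))  # warning
```

Only items returning "ok" or "warning" would reach the publish step; errors stop the item, matching the rule that only valid items can be loaded into the Data Lake.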
2. Technical architecture | Data lake
‒ Blazegraph™ DB is an ultra-high-performance
graph database supporting Blueprints and
RDF/SPARQL APIs
‒ Ex. it powers the Wikimedia Foundation’s Wikidata Query Service
‒ https://github.com/blazegraph/database
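As a sketch of how items end up in the data lake, the fragment below builds a SPARQL Update (`INSERT DATA`) and posts it to a SPARQL endpoint. The endpoint URL and the example triple are assumptions; a real EDM item would carry many more properties.

```python
from urllib import parse, request

def insert_data_query(triples):
    """Build a SPARQL Update that loads (subject, predicate, object) triples."""
    body = " .\n  ".join(f"{s} {p} {o}" for s, p, o in triples)
    return "INSERT DATA {\n  " + body + " .\n}"

def publish(endpoint, triples):
    """POST the update to a SPARQL endpoint (endpoint URL is an assumption)."""
    data = parse.urlencode({"update": insert_data_query(triples)}).encode()
    return request.urlopen(request.Request(endpoint, data=data))

q = insert_data_query([
    ("<urn:item:1>", "<http://purl.org/dc/elements/1.1/title>", '"Roses"'),
])
print(q)
# publish("http://localhost:9999/blazegraph/sparql", ...)  # hypothetical local endpoint
```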
2. Technical architecture | Outputs
https://echoes.community
The theory is sound, but there are still
many challenges to tackle…
Agenda
1.Introduction
2.Technical Architecture
3.Challenges and Approaches
4.Lessons learned
5.Results and future developments
6.References
3. Challenges and Approaches
1. Different metadata schemas
2. Poor data quality
3. Data deduplication
4. Automatic enrichments
5. Too much data
6. Easy SPARQL queries
7. Different scope
8. User enrichments
3.1. Challenges and Approaches | Different metadata schemas
‒ Different collections can have different metadata
schemas…
‒ Dublin Core (DC), A2A, EAD, Custom…
3.1. Challenges and Approaches | Different metadata schemas
‒ It was necessary to have one metadata standard
to map the datasets to
‒ We chose the Europeana Data Model (EDM)
‒ Transformation module: mapping to EDM from DC,
A2A, EAD, TopX, custom metadata and CARARE
‒ The transformation tool is easily extensible to
other formats; anyone who wants a format that is
not on the list can create their own EDM mapping
(and contribute it to the community)
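As an illustration of what such a mapping does, this sketch turns a flat Dublin Core record into a minimal `edm:ProvidedCHO` element. The field list and the input dict are illustrative assumptions; a real EDM record is far richer and includes aggregation classes.

```python
import xml.etree.ElementTree as ET

EDM = "http://www.europeana.eu/schemas/edm/"
DC = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("edm", EDM)
ET.register_namespace("dc", DC)

def dc_to_edm(record):
    """Map a flat Dublin Core dict to a minimal EDM ProvidedCHO element."""
    cho = ET.Element("{%s}ProvidedCHO" % EDM)
    for dc_field in ("title", "creator", "date"):  # illustrative subset of DC
        value = record.get("dc:" + dc_field)
        if value:
            el = ET.SubElement(cho, "{%s}%s" % (DC, dc_field))
            el.text = value
    return ET.tostring(cho, encoding="unicode")

xml = dc_to_edm({"dc:title": "Sant Climent de Taüll", "dc:date": "1123"})
print(xml)
```

A new source format would add its own mapping function of this shape, which is what makes the transformation module extensible.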
3. Challenges and Approaches
1. Different metadata schemas
2. Poor data quality
3. Data deduplication
4. Automatic enrichments
5. Too much data
6. Easy SPARQL queries
7. Different scope
8. User enrichments
3.2. Challenges and Approaches | Poor data quality
‒ Sometimes the data quality is not as good as we
would like it to be…
‒ This poor quality limits the exploitation of the data
‒ For example
‒ One single field holding different geolocation levels:
Bussum (municipality), Chicago (city), China (country)
‒ The same with dates (day and time, year, centuries…)
‒ Misspellings (Lide4n, Leideb, Lidedn, Leiden…)
3.2. Challenges and Approaches | Poor data quality
Three modules have been developed:
‒ Analyze, focused on data profiling
Ex. blank cells, number of instances of each metadata field…
‒ Quality assurance, to validate the input data
Ex. an empty mandatory field, a place without coordinates…
‒ Enrich, to complete some metadata
Ex. coordinates (to show on a map) derived from a textual location
All the modules can be easily extended with new rules,
statistics, checks and enrichments.
Quality reports can be used to improve the original data sets.
[Diagram: items with warnings are loaded with some metadata fields excluded; items with errors are not included]
3.2. Challenges and Approaches | Poor data quality
[Diagram: QA results per collection — one collection with 6 items OK; one with 1 warning, 1 error and 4 OK; one with 6 errors]
3. Challenges and Approaches
1. Different metadata schemas
2. Poor data quality
3. Data deduplication
4. Automatic enrichments
5. Too much data
6. Easy SPARQL queries
7. Different scope
8. User enrichments
3.3 Challenges and Approaches | Data deduplication
3.3 Challenges and Approaches | Data deduplication
‒ Deduplication is easy if the items have
identifying metadata.
‒ If not, different similarity and distance metrics
(Levenshtein, Jaro-Winkler…) can be used with the
Duke tool to find duplicates.
‒ Useful to keep only one value for places, dates…
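A sketch of the distance-based approach (Duke itself is a separate tool; this only shows the underlying idea): compute Levenshtein edit distances between values and pair up those within a threshold. The place names reuse the Leiden misspellings from the data-quality examples.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,            # deletion
                            curr[j - 1] + 1,        # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def find_duplicates(values, max_distance=2):
    """Pair up values whose case-insensitive edit distance is within the threshold."""
    pairs = []
    for i, v in enumerate(values):
        for w in values[i + 1:]:
            if levenshtein(v.lower(), w.lower()) <= max_distance:
                pairs.append((v, w))
    return pairs

pairs = find_duplicates(["Leiden", "Lide4n", "Leideb", "Barcelona"])
print(pairs)  # [('Leiden', 'Lide4n'), ('Leiden', 'Leideb')]
```

The threshold is a tuning knob: too low misses real duplicates, too high merges distinct places, which is why production tools like Duke combine several metrics per field.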
3.3 Challenges and Approaches | Data deduplication
‒ Ex. items from different Gencat and DIBA
collections (with an id in the metadata)
‒ Match done using a custom identifier, BCIN or BCIL
(local register identifiers for cultural assets)
3. Challenges and Approaches
1. Different metadata schemas
2. Poor data quality
3. Data deduplication
4. Automatic enrichments
5. Too much data
6. Easy SPARQL queries
7. Different scope
8. User enrichments
3.4 Challenges and Approaches | Automatic enrichments
‒ Which fields are candidates for enrichment?
We started with geolocations: A2A collections have
a location but no coordinates, which are necessary
to visualize the data on a map.
If the enrichment is mandatory, e.g. for proper
presentation on a map, it is done automatically in the
last step of the quality assurance module;
if it is “nice to have”, it can be configured in the
enrich module.
‒ Use existing or new metadata?
Extend the metadata schema to insert the enrichment
(without modifying the original metadata)
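A sketch of the geolocation enrichment: extract coordinates from a GeoNames search response and store them in new `geonames:lat`/`geonames:long` fields, leaving the original metadata untouched. The trimmed-down JSON is illustrative of what the public searchJSON service returns; a live call would also need a registered username.

```python
import json

def extract_coordinates(geonames_json):
    """Pick lat/long of the top hit from a GeoNames search response."""
    hits = json.loads(geonames_json).get("geonames", [])
    if not hits:
        return None
    top = hits[0]
    return float(top["lat"]), float(top["lng"])

# Trimmed-down illustration of a searchJSON response for "Leiden":
response = '{"geonames": [{"name": "Leiden", "lat": "52.15833", "lng": "4.49306"}]}'

record = {"edm:currentLocation": "Leiden"}  # illustrative record with a text location
coords = extract_coordinates(response)
if coords:  # store in new fields, without modifying the original metadata
    record["geonames:lat"], record["geonames:long"] = coords
print(record)
```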
3.4 Challenges and Approaches | Automatic enrichments
EDM metadata | Source | Automatic enrichment | Manual enrichment
wgs84_pos:lat | Geonames | geonames:lat | user:lat
wgs84_pos:long | Geonames | geonames:long | user:long
skos:prefLabel | Geonames | geonames:alternateName, geonames:coloquialName, geonames:historicalName, geonames:officialName, geonames:name | user:prefLabel
skos:prefLabel | DBPedia | foaf:name, rdfs:label, owl:sameAs | user:prefLabel
skos:prefLabel | Getty (TGN) | rdfs:label, skos:prefLabel, skos:altLabel | user:prefLabel
skos:altLabel | Getty (TGN) | rdfs:label, skos:prefLabel, skos:altLabel | user:altLabel
3.4 Challenges and Approaches | Automatic enrichments
‒ Data visualization of an A2A collection (without
original coordinates) on a map
3.4 Challenges and Approaches | Automatic enrichments
‒ Another challenge: some third-party APIs have
usage limitations like:
‒ A limited number of connections
‒ Premium options (€)
‒ One approach is to download part or all of the API
data (a cache), if possible…
Examples of provider limitations:
‒ MaxResults: 10,000; MaxQueryExecutionTime = 120’; MaxQueryCostEstimationTime = 1500’; connection limit = 50; maximum request rate = 100
‒ A large worldwide text file is generated daily for download, and a REST API is offered; limited to 20,000 results, with an hourly limit of 1,000 credits; premium subscription available
‒ Endpoint refreshed monthly; web service; no information about limitations
3. Challenges and Approaches
1. Different metadata schemas
2. Poor data quality
3. Data deduplication
4. Automatic enrichments
5. Too much data
6. Easy SPARQL queries
7. Different scope
8. User enrichments
3.5. Challenges and Approaches | Too much data
What's this?
a) A flower?
b) A black hole?
c) A (not user-friendly)
data visualization of a
450K-node graph?
3.5. Challenges and Approaches | Too much data
‒ Divide-and-conquer strategy. Pick a focus point
and let the system compute the “optimal” relevant
context given the user’s current interests.
‒ Don’t aim to explore the whole database; focus
on specific domains. Ex. different visualization
tools are developed based on the type of
information to be displayed (maps, timespans,
graphs, etc.)
3. Challenges and Approaches
1. Different metadata schemas
2. Poor data quality
3. Data deduplication
4. Automatic enrichments
5. Too much data
6. Easy SPARQL queries
7. Different scope
8. User enrichments
3.6. Challenges and Approaches | Easy SPARQL queries
‒ All the data is accessible in RDF format via a
linked open data endpoint.
‒ A user-friendly interface (YASGUI) is integrated
to access this data.
‒ User-friendly? Only if you know the SPARQL
language and the database structure. So we found:
3.6. Challenges and Approaches | Easy SPARQL queries
A visual SPARQL query system for dragging
database elements onto a canvas to “build” your
query (Visual SPARQL Builder)
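For readers who do know SPARQL, here is an illustrative query against such an endpoint, wrapped in a small helper that builds the request URL. The endpoint URL is an assumption, and the property names follow the wgs84_pos fields used in the enrichment table.

```python
from urllib import parse

# Illustrative query: all items with a title and coordinates, e.g. for a map view.
QUERY = """
PREFIX dc: <http://purl.org/dc/elements/1.1/>
PREFIX wgs84_pos: <http://www.w3.org/2003/01/geo/wgs84_pos#>
SELECT ?item ?title ?lat ?long WHERE {
  ?item dc:title ?title ;
        wgs84_pos:lat ?lat ;
        wgs84_pos:long ?long .
} LIMIT 100
"""

def query_url(endpoint, query):
    """Build the GET URL a SPARQL endpoint expects for a SELECT query."""
    return endpoint + "?" + parse.urlencode({"query": query, "format": "json"})

url = query_url("http://localhost:9999/blazegraph/sparql", QUERY)  # hypothetical endpoint
print(url[:80])
# urllib.request.urlopen(url) would return the JSON result bindings
```

A visual builder generates exactly this kind of query behind the scenes, so users never have to write the PREFIX and triple-pattern syntax by hand.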
3. Challenges and Approaches
1. Different metadata schemas
2. Poor data quality
3. Data deduplication
4. Automatic enrichments
5. Too much data
6. Easy SPARQL queries
7. Different scope
8. User enrichments
3.7. Challenges and Approaches | Different scope
‒ From small institutions to regional thematic installations: does one
size fit all?
‒ The technology developed is scalable, so it covers many different
scenarios. Performance tests have been designed to check
behaviour with large collections.
‒ The modular approach enables (smaller) institutions to “mix and
match” modules, for example using only the ECHOES transformation
module to transform one collection to EDM or linked open data
3. Challenges and Approaches
1. Different metadata schemas
2. Poor data quality
3. Data deduplication
4. Automatic enrichments
5. Too much data
6. Easy SPARQL queries
7. Different scope
8. User enrichments
3.8 Challenges and Approaches | User enrichments
One of the objectives of the project is to give
users the possibility to enrich the content
Not initiated yet…
Agenda
1.Introduction
2.Technical Architecture
3.Challenges and Approaches
4.Lessons learned
5.Results and future developments
6.References
4. Lessons Learned
Some decisions that we would take again:
‒ Use of an agile methodology. Flexible to changes (e.g.
the focus on data quality); team collaboration across
iterations keeps everyone aligned in the same direction.
‒ A multidisciplinary team brings different points of view
to solve the challenges; so do different countries.
‒ Start from the beginning. Focus on the input data
before the enrichments.
‒ Learning by doing: the best way to know if it works is
to test it.
Agenda
1.Introduction
2.Technical Architecture
3.Challenges and Approaches
4.Lessons learned
5.Results and future developments
6.References
5. Results and future development
After 2.5 years we have done…
‒ 27 one-month iteration sprints (“cookies meetings”)
‒ 7 releases
‒ 1 modular product, version 1.5
‒ 1 open source community
(benevolent-dictator-for-life model)
All the code, specifications and documentation are available under an
open source MIT license on the GitHub Echoes page:
https://github.com/CSUC/ECHOES-Tools
5. Results and future development
The developed tools allow you to analyze, clean and
transform data collections to the EDM standard, and to
validate, enrich and publish heterogeneous data to a
normalized data lake that can be exploited as linked
open data and with different data visualizations
5. Results and future development
Demo corner
- https://youtu.be/LQSheaKJOiY (Echoes Tool)
- https://youtu.be/LddOAUc9tig (End Point)
- https://youtu.be/bb3Sxyyx8aA (End Point)
- https://youtu.be/oa7aY6p4o5Y (Echoes Portal)
5. Results and future development
The current status of the development:
‒ Improving the data source mapping and
transformation tools
‒ Focusing on the enrichments
‒ More users of the platform are expected, helping
to grow the community
Join us!
Agenda
1.Introduction
2.Technical Architecture
3.Challenges and Approaches
4.Lessons learned
5.Results and future developments
6.References
6. References
• Ariela Netiv & Walther Hasselo, “ECHOES - cooperation across heritage disciplines,
institutes and borders” (IS&T, Washington, 2018), pp. 70-74
• Lluís M. Anglada & Sandra Reoyo & Ramon Ros & Ricard de la Vega, “Doing it
together spreading ORCID among Catalan universities and researchers” (ORCID-
CASRAI Joint conference, Barcelona, 2015)
• Anisa Rula & Andrea Maurino & Carlo Batini, “Data Quality Issues in Linked Open
Data”. (Part of the Data-Centric Systems and Applications book series, DCSA, 2016)
• Europeana Data Model (EDM) https://pro.europeana.eu/resources/standardization-
tools/edm-documentation.
• Duke. A tool to find duplicates. https://github.com/larsga/Duke
• Frank van Ham & Adam Perer, “Search, Show Context, expand on demand”:
Supporting Large Graph Exploration with Degree-of-interest.
http://perer.org/papers/adamPerer-DOIGraphs-InfoVis2009.pdf
Contact
Walther Hasselo
w.hasselo@erfgoedleiden.nl
Anna Busom
abusom@gencat.cat
Olav Kwakman
Olav.kwakman@tresoar.nl
Thanks for your attention
Development
team
Ricard de la Vega | ricard.delavega@csuc.cat
Natalia Torres | natalia.torres@csuc.cat
Albert Martínez | albert.martinez@csuc.cat
El fons Enrique Tierno Galván: recepció, tractament i difusióEl fons Enrique Tierno Galván: recepció, tractament i difusió
El fons Enrique Tierno Galván: recepció, tractament i difusió
 
El CIDMA: més enllà dels espais físics
El CIDMA: més enllà dels espais físicsEl CIDMA: més enllà dels espais físics
El CIDMA: més enllà dels espais físics
 
Els serveis del CSUC per a la comunitat CCUC
Els serveis del CSUC per a la comunitat CCUCEls serveis del CSUC per a la comunitat CCUC
Els serveis del CSUC per a la comunitat CCUC
 

Último

Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Safe Software
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
panagenda
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
WSO2
 

Último (20)

Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectors
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
 
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu SubbuApidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
 
Ransomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdfRansomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdf
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 

Technical Challenges and Approaches to Build an Open Ecosystem of Heterogeneous Heritage Collections

  • 12. 1. Introduction How Echoes works? Data access Data storage Data homogenization Data collections 4 3 2 1
  • 13. Agenda 1.Introduction 2.Technical Architecture 3.Challenges and Approaches 4.Lessons learned 5.Results and future developments 6.References
  • 14. 2. Technical architecture Modular approach 1. Input (data collections) 2. Mapping and transformation tools (data homogenization) 3. Data lake (data storage) 4. Output – SPARQL endpoint (RDF) – Portal (WordPress) – REST API, OAI-PMH 5. Enrichments
  • 17. 2. Technical architecture | Inputs | Examples ‒ ELO: 4K, 144K, 280K items (A2A) ‒ Tresoar: 21K, 36K, 2M items (A2A) ‒ Gencat: 1K (Custom) [Bar chart: collection sizes, up to ~1,350,000 items, November 2018]
  • 18. 2. Technical architecture | Data homogenization
  • 19. 2. Technical architecture | Data homogenization Echoes is a project of interoperability between different data collections. Integrating data is not just about putting it together in a repository, but also about facilitating access to it so it can be properly exploited by the public
  • 20. 2. Technical architecture | Data homogenization If garbage comes in, then garbage comes out. To simplify the reuse and visualization of the data, all the records inserted into the system should have the same structure and format. There are two ways to ensure data coherence and consistency (clean & transform data): ‒ A priori, before inserting into the system ‒ A posteriori, in real time when the data is used
  • 21. 2. Technical architecture | Data homogenization If garbage comes in, then garbage comes out. To simplify the reuse and visualization of the data, all the records inserted into the system should have the same structure and format. There are two ways to ensure data coherence and consistency (clean & transform data): ‒ A priori, before inserting into the system (the chosen approach, given the complexity and the high volume of data) ‒ A posteriori, in real time when the data is used
  • 22. 2. Technical architecture | Data homogenization. Five steps: 1) Analyze: analyze content from a source to "know about" your data (optional); 2) Transform: download items from a source into local files and transform them to EDM; 3) Quality Assurance: review each item and, based on defined rules, decide if it can be loaded into the Data Lake (produces a quality report; only valid items can be loaded); 4) Enrich: enrich metadata from different sources (optional); 5) Publish: publish items into the Data Lake. Demo on https://youtu.be/LQSheaKJOiY
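As a rough sketch, the five-step flow can be expressed as a chain of small functions. All function and field names below are illustrative assumptions, not the actual ECHOES-Tools API; the optional Analyze and Enrich steps are omitted for brevity.

```python
# Hypothetical sketch of the ECHOES pipeline: transform -> QA -> publish.
# Names and record shapes are invented for illustration only.

def transform(items):
    """Step 2: map each source item to a (much simplified) EDM-like record."""
    return [{"dc:title": i.get("title", ""), "dcterms:spatial": i.get("place")}
            for i in items]

def quality_assurance(records):
    """Step 3: only records passing the rules may enter the data lake.
    Here the only rule is: a title is mandatory."""
    return [r for r in records if r["dc:title"]]

def publish(data_lake, records):
    """Step 5: load the valid records into the data lake."""
    data_lake.extend(records)
    return data_lake

source = [{"title": "Church of Sant Climent", "place": "Vall de Boí"},
          {"title": ""}]  # the second item will fail QA
lake = publish([], quality_assurance(transform(source)))
print(len(lake))  # 1
```

Only one of the two source items survives quality assurance, which mirrors the "only valid items can be loaded" rule above.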
  • 24. 2. Technical architecture | Data homogenization ‒ Gives feedback on the data properties ‒ Useful for getting to know the contents of the data, especially if you did not create the dataset ‒ Helps determine the usefulness of the data for enrichment. Ex. if there are no places in the dataset, enrichment with coordinates is impossible
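The kind of feedback the Analyze step gives can be sketched as simple data profiling: counting, per metadata field, how many instances exist and how many are blank. Field names here are illustrative, not the ECHOES report format.

```python
# Minimal data-profiling sketch, in the spirit of the Analyze module:
# per-field instance counts and blank-value counts.
from collections import Counter

def profile(items):
    field_counts = Counter()  # how many items carry each field
    blank = Counter()         # how many of those values are empty
    for item in items:
        for field, value in item.items():
            field_counts[field] += 1
            if not str(value).strip():
                blank[field] += 1
    return {"instances": dict(field_counts), "blank": dict(blank)}

report = profile([{"dc:title": "Roses church", "dc:place": ""},
                  {"dc:title": ""}])
print(report)
```

A report like this immediately shows, for example, that a dataset has no usable place field, so coordinate enrichment would be impossible.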
  • 25. 2. Technical architecture | Data homogenization. ECHOES Analyze (ECHOES Workshop, Archiving 2019). Accepts data as a: URL (Open Archives Initiative Protocol for Metadata Harvesting, OAI-PMH) or an uploaded file. Supports: A2A, Dublin Core, TopX, EAD, CARARE, Custom. Delivers: an analysis report in XML (an XML file can be easily imported into your favorite reporting tool).
  • 27. 2. Technical architecture | Data homogenization. ECHOES Transform. Accepts data as a: URL (OAI-PMH) or an uploaded file. Supports: A2A, Dublin Core, TopX, EAD, CARARE, Custom. Delivers: your EDM dataset.
  • 29. 2. Technical architecture | Data homogenization. Quality Assurance Module, three checks: 1) Schema: reviews tags and mandatory fields; results: OK (next step) or Error (stop). 2) Semantics: reviews Schematron rules; results: OK (next step) or Error (stop). 3) Content: reviews metadata fields based on configurable specs; results: OK (valid item), Error (stop), Warning (partially valid), Info (valid item).
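The staged schema/semantics/content check can be sketched as below. The rule sets are invented placeholders (the real module uses Schematron and configurable specs), but the control flow matches: an Error at any stage stops the item, while a content Warning still yields a partially valid item.

```python
# Illustrative three-stage QA check; rules are placeholders, not ECHOES rules.

MANDATORY = {"dc:title", "dc:identifier"}  # assumed mandatory fields

def check_schema(record):
    """Stage 1: tags / mandatory fields must be present."""
    return "OK" if MANDATORY <= record.keys() else "Error"

def check_semantics(record):
    """Stage 2: Schematron-style rules; trivially OK in this sketch."""
    return "OK"

def check_content(record):
    """Stage 3: configurable content rules; a missing place is only a Warning."""
    return "OK" if record.get("dcterms:spatial") else "Warning"

def validate(record):
    if check_schema(record) == "Error":
        return "Error"               # stop: item rejected
    if check_semantics(record) == "Error":
        return "Error"               # stop: item rejected
    return check_content(record)     # OK / Warning: (partially) valid item

print(validate({"dc:title": "t", "dc:identifier": "1",
                "dcterms:spatial": "Roses"}))          # OK
print(validate({"dc:title": "t", "dc:identifier": "1"}))  # Warning
print(validate({"dc:title": "t"}))                        # Error
```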
  • 32. 2. Technical architecture | Data lake ‒ Blazegraph™ DB is an ultra-high-performance graph database supporting Blueprints and RDF/SPARQL APIs ‒ Used, for example, by the Wikimedia Foundation's Wikidata Query Service ‒ https://github.com/blazegraph/database
  • 36. 2. Technical architecture | Outputs https://echoes.community
  • 43. The theory is sound, but there are still many challenges to tackle…
  • 44. Agenda 1.Introduction 2.Technical Architecture 3.Challenges and Approaches 4.Lessons learned 5.Results and future developments 6.References
  • 45. 3. Challenges and Approaches 1. Different metadata schemas 2. Poor data quality 3. Data deduplication 4. Automatic enrichments 5. Too much data 6. Easy SPARQL queries 7. Different scope 8. User enrichments
  • 46. 3.1. Challenges and Approaches | Different metadata schemas ‒ Different collections can have different metadata schemas… ‒ Dublin Core (DC), A2A, EAD, Custom…
  • 47. 3.1. Challenges and Approaches | Different metadata schemas ‒ It was necessary to choose one metadata standard to map all the datasets to ‒ We chose the Europeana Data Model (EDM) ‒ Transformation module: mapping to EDM from DC, A2A, EAD, TopX, custom metadata and CARARE ‒ The transformation tool is easily extensible to other formats; if someone needs a format that is not on the list, they can create their own EDM mapping (and contribute it to the community)
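A minimal illustration of such a mapping is given below. The field correspondences are simplified assumptions for the sketch, not the project's actual EDM crosswalk, which is richer and schema-aware.

```python
# Toy Dublin Core -> EDM field mapping; correspondences are illustrative only.

DC_TO_EDM = {
    "dc:title": "dc:title",            # descriptive fields are kept
    "dc:creator": "dc:creator",
    "dc:coverage": "dcterms:spatial",  # assumed promotion to a spatial field
    "dc:date": "dcterms:created",      # assumed date mapping
}

def dc_to_edm(dc_record):
    """Map the known DC fields; unknown fields are dropped in this sketch."""
    return {DC_TO_EDM[k]: v for k, v in dc_record.items() if k in DC_TO_EDM}

print(dc_to_edm({"dc:title": "Map of Roses", "dc:coverage": "Roses"}))
```

Extending the tool to a new input format then amounts to supplying another mapping table of this kind.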
  • 48. 3. Challenges and Approaches 1. Different metadata schemas 2. Poor data quality 3. Data deduplication 4. Automatic enrichments 5. Too much data 6. Easy SPARQL queries 7. Different scope 8. User enrichments
  • 49. 3.2. Challenges and Approaches | Poor data quality ‒ Sometimes the data quality is not as good as we would like it to be… ‒ This poor quality limits the exploitation of the data ‒ For example: ‒ A single field mixing different geolocation levels: Bussum (municipality), Chicago (city), China (country) ‒ The same with dates (day and time, year, centuries…) ‒ Misspellings (Lide4n, Leideb, Lidedn, Leiden…)
  • 50. 3.2. Challenges and Approaches | Poor data quality 3 modules have been developed: ‒ Analyze, focused on data profiling. Ex. blank cells, number of instances of each metadata field… ‒ Quality assurance, to validate the input data. Ex. empty mandatory field, place without coordinates… ‒ Enrich, to complete some metadata. Ex. coordinates (to show on a map) derived from a textual location. All the modules can be easily extended with new rules, statistics, checks and enrichments. Quality reports can be used to improve the original datasets.
  • 51. 3.2. Challenges and Approaches | Poor data quality [Diagram: three example collections after validation. One with 6 items OK; one with 1 warning (some metadata fields not included), 1 error (item not included) and 4 items OK; one with 6 errors.]
  • 52. 3. Challenges and Approaches 1. Different metadata schemas 2. Poor data quality 3. Data deduplication 4. Automatic enrichments 5. Too much data 6. Easy SPARQL queries 7. Different scope 8. User enrichments
  • 53. 3.3 Challenges and Approaches | Data deduplication
  • 54. 3.3 Challenges and Approaches | Data deduplication ‒ Deduplication is easy if the items have identifying metadata. ‒ If not, different similarity and distance metrics (Levenshtein, Jaro-Winkler…) can be used with the Duke tool to find duplicates. ‒ Useful to get only one value for places, dates…
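A toy version of distance-based duplicate detection is shown below, using the misspelled place names from the earlier data-quality example. The project uses the Duke tool for this; the pure-Python Levenshtein implementation here only illustrates the idea, and the distance threshold is an arbitrary choice.

```python
# Toy duplicate detection via Levenshtein edit distance (illustrative only;
# the project itself uses the Duke deduplication tool).

def levenshtein(a, b):
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def find_duplicates(names, max_distance=2):
    """Pair up values whose edit distance is within the threshold."""
    return [(x, y) for i, x in enumerate(names) for y in names[i + 1:]
            if levenshtein(x.lower(), y.lower()) <= max_distance]

print(find_duplicates(["Leiden", "Lide4n", "Leideb", "Barcelona"]))
# [('Leiden', 'Lide4n'), ('Leiden', 'Leideb')]
```

The misspellings cluster around the correct value "Leiden", while "Barcelona" is correctly left alone.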
  • 55. 3.3 Challenges and Approaches | Data deduplication ‒ Ex. items from different Gencat and DIBA collections (with an id in the metadata) ‒ Matching done using a custom identifier, BCIN or BCIL (local register identifiers for cultural assets)
  • 56. 3. Challenges and Approaches 1. Different metadata schemas 2. Poor data quality 3. Data deduplication 4. Automatic enrichments 5. Too much data 6. Easy SPARQL queries 7. Different scope 8. User enrichments
  • 57. 3.4 Challenges and Approaches | Automatic enrichments ‒ Which fields are candidates for enrichment? We started with geolocation: A2A collections have a location but no coordinates, which are necessary to visualize the data on a map. If the enrichment is mandatory, e.g. for proper presentation on a map, it is automatically done in the last step of the quality assurance module; if the enrichment is 'nice to have', it can be configured in the enrich module. ‒ Use existing or new metadata? Extend the metadata schema to insert the enrichment (without modifying the original metadata)
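The "extend the schema, don't modify the original" approach can be sketched as follows. A real installation would query the GeoNames web service; here a tiny local gazetteer with approximate, illustrative coordinates stands in for it.

```python
# Sketch: add coordinates as new geonames:* fields, leaving the original
# metadata untouched. GAZETTEER is a stand-in for real GeoNames lookups;
# coordinates are approximate and for illustration only.

GAZETTEER = {
    "Roses": ("42.26", "3.18"),
    "Vall de Boí": ("42.51", "0.80"),
}

def enrich_place(record):
    coords = GAZETTEER.get(record.get("dcterms:spatial", ""))
    enriched = dict(record)  # copy: the original record is never modified
    if coords:
        enriched["geonames:lat"], enriched["geonames:long"] = coords
    return enriched

print(enrich_place({"dc:title": "Citadel", "dcterms:spatial": "Roses"}))
```

Because the enrichment lands in separate fields, the original values can always be recovered and the enrichment re-run against a better source later.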
  • 58. 3.4 Challenges and Approaches | Automatic enrichments

EDM metadata | Source (automatic enrichment) | Source metadata (automatic enrichment) | Manual enrichment
wgs84_pos:lat | Geonames | geonames:lat | user:lat
wgs84_pos:long | Geonames | geonames:long | user:long
skos:prefLabel | Geonames | geonames:alternateName, geonames:coloquialName, geonames:historicalName, geonames:officialName, geonames:name | user:prefLabel
skos:prefLabel | DBPedia | foaf:name, rdfs:label |
owl:sameAs | Getty (TGN) | rdfs:label, skos:prefLabel, skos:altLabel |
skos:altLabel | Getty (TGN) | rdfs:label, skos:prefLabel, skos:altLabel | user:altLabel
  • 59. 3.4 Challenges and Approaches | Automatic enrichments ‒ Data visualization of an A2A collection (without original coordinates) on a map
  • 60. 3.4 Challenges and Approaches | Automatic enrichments ‒ Another challenge: some third-party APIs have usage limitations, such as: ‒ limits on the number of connections ‒ premium options (€) ‒ One approach is to download part or all of the API data (a cache), if possible… Examples of limitations found: ‒ MaxResults: 10,000 ‒ MaxQueryExecutionTime = 120' ‒ MaxQueryCostEstimationTime = 1500' ‒ Connection limit = 50 ‒ Maximum request rate = 100 ‒ Daily creation of a downloadable large worldwide text file, plus a REST API ‒ Limit of 20,000 results; the hourly limit is 1,000 credits ‒ Premium subscription ‒ Endpoint refreshed monthly ‒ Webservice ‒ No information about limitations
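The caching approach can be sketched as a small memoising, rate-limited wrapper around any lookup function. This is an illustrative pattern under the stated constraints (request-rate limits, repeated lookups of the same place), not the ECHOES implementation.

```python
# Sketch: cache third-party lookups locally so repeated queries don't count
# against connection or rate limits; throttle the calls that do go out.
import time

class CachedClient:
    def __init__(self, fetch, min_interval=1.0):
        self.fetch = fetch                # the real (rate-limited) lookup
        self.min_interval = min_interval  # seconds between outgoing calls
        self.cache = {}
        self._last = 0.0

    def get(self, key):
        if key not in self.cache:
            wait = self.min_interval - (time.monotonic() - self._last)
            if wait > 0:
                time.sleep(wait)          # stay under the request-rate limit
            self.cache[key] = self.fetch(key)
            self._last = time.monotonic()
        return self.cache[key]
```

With a pre-downloaded dump (like the daily worldwide text file mentioned above), the cache can even be warmed offline so most lookups never hit the remote API at all.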
  • 61. 3. Challenges and Approaches 1. Different metadata schemas 2. Poor data quality 3. Data deduplication 4. Automatic enrichments 5. Too much data 6. Easy SPARQL queries 7. Different scope 8. User enrichments
  • 62. 3.5. Challenges and Approaches | Too much data What's this? a) A flower? b) A black hole? c) A (not user-friendly) data visualization of a 450K-node graph?
  • 63. 3.5. Challenges and Approaches | Too much data ‒ Divide-and-conquer strategy: pick a focus point and let the system compute the "optimal" relevant context given the user's current interests. ‒ Don't try to explore the whole database; focus on specific domains. Ex. different visualization tools are developed based on the type of information to be displayed (maps, timelines, graphs, etc.)
  • 64. 3. Challenges and Approaches 1. Different metadata schemas 2. Poor data quality 3. Data deduplication 4. Automatic enrichments 5. Too much data 6. Easy SPARQL queries 7. Different scope 8. User enrichments
  • 65. 3.6. Challenges and Approaches | Easy SPARQL queries ‒ All the data is accessible in RDF format via a linked open data endpoint. ‒ A user-friendly interface (YASGUI) is integrated to access this data. ‒ User-friendly? Only if you know the SPARQL language and the database structure. So we found:
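For illustration, the kind of query a user would otherwise have to write by hand might look like the string below. The shape assumes descriptive fields sit on the ore:Proxy, as in EDM; the exact prefixes and graph layout of the ECHOES data lake may differ, so treat this as a template to adapt in YASGUI.

```python
# An example SPARQL query (as a Python string) listing ten item titles.
# Property paths are an assumption based on EDM, not the verified ECHOES schema.

QUERY = """
PREFIX dc:  <http://purl.org/dc/elements/1.1/>
PREFIX ore: <http://www.openarchives.org/ore/terms/>

SELECT ?proxy ?title WHERE {
  ?proxy a ore:Proxy ;
         dc:title ?title .
}
LIMIT 10
"""

print(QUERY)
```

Needing to know this much syntax and schema up front is exactly the barrier the visual query builder on the next slide removes.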
  • 66. 3.6. Challenges and Approaches | Easy SPARQL queries A visual SPARQL query system to drag the database elements to a canvas and 'build' your query (Visual SPARQL Builder)
  • 67. 3. Challenges and Approaches 1. Different metadata schemas 2. Poor data quality 3. Data deduplication 4. Automatic enrichments 5. Too much data 6. Easy SPARQL queries 7. Different scope 8. User enrichments
  • 68. 3.7. Challenges and Approaches | Different scope ‒ From small institutions to regional thematic installations: one size fits all? ‒ The technology developed is scalable, so it covers many different scenarios. Performance tests have been designed to check the behaviour with large collections. ‒ The modular approach enables (smaller) institutions to 'mix and match' modules: for example, using only the ECHOES transformation module to transform one collection to EDM or linked open data
  • 69. 3. Challenges and Approaches 1. Different metadata schemas 2. Poor data quality 3. Data deduplication 4. Automatic enrichments 5. Too much data 6. Easy SPARQL queries 7. Different scope 8. User enrichments
  • 70. 3.8 Challenges and Approaches | User enrichments One of the objectives of the project is to give users the possibility to enrich the content. Not initiated yet…
  • 71. Agenda 1.Introduction 2.Technical Architecture 3.Challenges and Approaches 4.Lessons learned 5.Results and future developments 6.References
  • 72. 4. Lessons Learned Some decisions that we would take again: ‒ Use of an agile methodology. Flexible to changes (ex. the shift of focus to data quality). Team collaboration across iterations keeps everyone aligned in the same direction. ‒ A multidisciplinary team brings different points of view to solve the challenges; so do partners from different countries. ‒ Start from the beginning: focus on input data before enrichments. ‒ Learning by doing: the best way to know if it works is to test it.
  • 73. Agenda 1.Introduction 2.Technical Architecture 3.Challenges and Approaches 4.Lessons learned 5.Results and future developments 6.References
  • 74. 5. Results and future development After 2.5 years we have done: ‒ 27 one-month iteration sprints ("cookies" meetings) ‒ 7 releases ‒ 1 modular product, version 1.5 ‒ 1 open source community ("benevolent dictator for life" model) All the code, specifications and documentation are available under an open source MIT license on the GitHub Echoes page: https://github.com/CSUC/ECHOES-Tools
  • 75. 5. Results and future development The developed tools allow you to analyze, clean and transform data collections to the EDM standard, and to validate, enrich and publish heterogeneous data into a normalized data lake that can be exploited as linked open data and through different data visualizations
  • 76. 5. Results and future development Demo corner: - https://youtu.be/LQSheaKJOiY (Echoes Tool) - https://youtu.be/LddOAUc9tig (Endpoint) - https://youtu.be/bb3Sxyyx8aA (Endpoint) - https://youtu.be/oa7aY6p4o5Y (Echoes Portal)
  • 77. 5. Results and future development The next steps in the development are: ‒ To improve the data source mapping and transformation tools ‒ To focus on the enrichments ‒ More users of the platform are expected, helping to grow the community. Join us!
  • 78. Agenda 1.Introduction 2.Technical Architecture 3.Challenges and Approaches 4.Lessons learned 5.Results and future developments 6.References
  • 79. 6. References • Ariela Netiv & Walther Hasselo, "ECHOES: cooperation across heritage disciplines, institutes and borders" (IS&T, Washington, 2018), pp. 70-74 • Lluís M. Anglada, Sandra Reoyo, Ramon Ros & Ricard de la Vega, "Doing it together: spreading ORCID among Catalan universities and researchers" (ORCID-CASRAI Joint Conference, Barcelona, 2015) • Anisa Rula, Andrea Maurino & Carlo Batini, "Data Quality Issues in Linked Open Data" (Data-Centric Systems and Applications book series, DCSA, 2016) • Europeana Data Model (EDM): https://pro.europeana.eu/resources/standardization-tools/edm-documentation • Duke, a tool to find duplicates: https://github.com/larsga/Duke • Frank van Ham & Adam Perer, "Search, Show Context, Expand on Demand: Supporting Large Graph Exploration with Degree-of-Interest": http://perer.org/papers/adamPerer-DOIGraphs-InfoVis2009.pdf
  • 81. Thanks for your attention Development team Ricard de la Vega | ricard.delavega@csuc.cat Natalia Torres | natalia.torres@csuc.cat Albert Martínez | albert.martinez@csuc.cat