Presentation of agINFRA project (www.aginfra.eu) in the EGI-APARSEN workshop, Amsterdam, 4-6 March 2014
“Managing, computing and preserving big data for research”
https://indico.egi.eu/indico/conferenceDisplay.py?confId=2052
1. agINFRA
A data infrastructure to support agricultural scientific communities
Andreas Drakos, University of Alcala
EGI-APARSEN workshop, Amsterdam, 4-6 March 2014
2. Our project
In agINFRA we will:
share agricultural research…
…over a data e-infrastructure
3. Agricultural research data
• Primary data:
– Structured, e.g. datasets as tables
– Digitized: images, videos, etc.
• Secondary data (elaborations, e.g. a dendrogram)
• Provenance information, incl. authors, their organizations and projects
• Methods and procedures followed
• Reports, including papers
• Secondary documents, e.g. training resources
• Metadata about the above
• Social data, tags, ratings, etc.
4. agINFRA values: scientific data must be
A | Open | Must be open and interlinked
NOT subject to barriers, based on standard formats and avoiding building data silos due to lack of interrelatedness and ad-hoc APIs.
B | Meaningful | Must be meaningful through explicit semantics
Reusing the semantics already provided in mature terminologies and ontologies that are exposed and interlinked through the Web.
C | Reliable | Must be reliable, traceable and accessible
Any kind of research objects can be stored in the data infrastructure, and there are NO barriers to expressing relations between these objects to capture the context of research activities.
D | Actionable | Must be actionable via services that empower research
Data is not useful without flexible and adaptable services that allow researchers to act on the data in the ways they need.
5. There is a lot of data
6. [Diagram: content providers registering as data sources]
• Content provider with unorganised collection (e.g. listed at Web site or in DVD-ROM)
• Content provider with CMS that does not support sharing (e.g. proprietary DB)
• Content provider with CMS that supports sharing (e.g. OAI-PMH, RSS, ...)
Step labels in the diagram: chooses sharing-compliant tool; (meta)data export in proprietary format & mapping to known; ingestion in sharing-compliant tool; register as data source; hosted over agINFRA; computed over agINFRA
7. [Diagram: data sources sharing (meta)data with an aggregator]
• Data sources (hosted over agINFRA or computed over agINFRA) share (meta)data, e.g. through OAI-PMH
• (Meta)data aggregator, computed over agINFRA
• Indexed & available through CIARD RING, served through agINFRA
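The sharing step named on this slide can be sketched in a few lines of Python. This is only an illustration, assuming a provider exposes Dublin Core records through the standard OAI-PMH `ListRecords` verb; the base URL `http://repo.example.org/oai`, the function names and the sample response are hypothetical and do not come from agINFRA.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

OAI_NS = "http://www.openarchives.org/OAI/2.0/"
DC_NS = "http://purl.org/dc/elements/1.1/"


def list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None):
    """Build an OAI-PMH ListRecords request URL (no network access here)."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec
    return base_url + "?" + urlencode(params)


def extract_titles(xml_text):
    """Return the dc:title values found in an OAI-PMH ListRecords response."""
    root = ET.fromstring(xml_text)
    titles = []
    for record in root.iter("{%s}record" % OAI_NS):
        for title in record.iter("{%s}title" % DC_NS):
            if title.text:
                titles.append(title.text.strip())
    return titles


# Hypothetical response, shaped like what an aggregator would receive.
SAMPLE_RESPONSE = """
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:repo.example.org:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Soil survey of a sample region</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>
""".strip()

url = list_records_url("http://repo.example.org/oai")
titles = extract_titles(SAMPLE_RESPONSE)
print(url)
print(titles)
```

An aggregator would fetch `url` over HTTP and follow resumption tokens for large collections; both are left out to keep the sketch self-contained.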
8. [Diagram: further components labelled "hosted over agINFRA" and "computed over agINFRA" …]
9. Actors over the infrastructure
[Diagram: infrastructure components]
• Registry of data sources: Datasets and collections
• Registry of vocabularies: LOD Vocabularies (agINFRA RDF vocabularies, agINFRA LOD KOSs)
• Registry of APIs and tools: APIs (Public REST APIs, Grid jobs, Grid workflows); Cloud / SaaS tools (Productivity Tools, Information services)
10. Actors over the infrastructure
[Diagram: the infrastructure components of the previous slide, surrounded by their actors]
• Developers
• Information systems providers
• Taxonomists
• Data providers
• Researchers
• Policy makers
11. An existing data community
• CIARD: a global community movement to make agricultural research information and knowledge publicly accessible to all
– http://www.ciard.net
agINFRA 2nd Review Meeting, 13th of December 2013
12. A core registry service
• CIARD RING (Routemap to Information Nodes and Gateways)
– global registry to give access to any kind of information sources pertaining to agricultural research for development
– principal tool created through CIARD to allow information providers to register their services in various categories and facilitate discovery of sources of agriculture-related information across the world
15. RING data registry usage scenario 1
• data aggregators registering their data providers to CIARD RING
– asking directly to be registered there (AGRIS)
– federating own smaller registries (GLN)
16. RING data registry usage scenario 2
• new data providers using agINFRA cloud tools can be automatically registered to CIARD RING
– cloud-hosted AgriDrupal or AgriOceanDSpace instances for document repositories
– cloud-hosted agLR instances for learning repositories
• agINFRA Cloud hosting services
– In collaboration with other cloud communities (e.g. OKEANOS/GRNET)
– In collaboration with CHAIN-REDS project etc.
17. Data provider scenario 1
Data provider in need of hosting & storage of small-scale CMS
• Use a cloud hosted CMS (Cloud / SaaS tools)
• sets up own CMS instance
[Diagram: the same infrastructure components as on slide 9]
18. Data provider scenario 2
Data provider in need of large-scale hosting & replication CMS
• Requests space/accounts in large-scale CMS
[Diagram: the same infrastructure components as on slide 9]
19. A semantic backbone for agINFRA
• to help all data providers declare, publish & link their metadata properties and value spaces
– Publishing their KOSs using VocBench and their metadata vocabularies using Neologism
– Linking them to existing vocabularies, e.g. AGROVOC for KOSs, Dublin Core for metadata
• guidelines & tools to support data providers in adopting such a LOD framework
– e.g. LODE-BD recommendations
• to provide an entry point to existing relevant vocabularies
20. Exposing to the e-infrastructure scenario
Data provider hosting CMS at own or external/commercial infrastructure
• Interested in exposing (meta)data to the e-infrastructure
[Diagram: the same infrastructure components as on slide 9]
21. agINFRA LOD layer usage scenario 1
• A data owner wants to share their data as Linked Data
• The data owner uses non-LOD vocabularies and KOSs and wants to publish them as LOD and link them to existing vocabularies
• agINFRA offers tools for publishing vocabularies and KOSs
Once the vocabularies are published, all metadata and all concepts have URIs and can be referenced by any other system
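As a rough illustration of what "publishing a concept with a URI and linking it to an existing vocabulary" can look like, the sketch below emits N-Triples for one SKOS concept. It is not the output of VocBench or any agINFRA tool; the function names and both example.org URIs are hypothetical placeholders (in practice the link target would be a concept in an existing vocabulary such as AGROVOC).

```python
SKOS = "http://www.w3.org/2004/02/skos/core#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def escape_literal(text):
    """Escape backslashes and double quotes for an N-Triples string literal."""
    return text.replace("\\", "\\\\").replace('"', '\\"')


def concept_triples(concept_uri, label, lang="en", exact_match=None):
    """Return N-Triples lines describing one SKOS concept."""
    lines = [
        "<%s> <%s> <%sConcept> ." % (concept_uri, RDF_TYPE, SKOS),
        '<%s> <%sprefLabel> "%s"@%s .'
        % (concept_uri, SKOS, escape_literal(label), lang),
    ]
    if exact_match:
        # Link the local concept to its counterpart in another vocabulary.
        lines.append(
            "<%s> <%sexactMatch> <%s> ." % (concept_uri, SKOS, exact_match)
        )
    return lines


# Hypothetical local concept and hypothetical link target.
triples = concept_triples(
    "http://vocab.example.org/crops/maize",
    "maize",
    exact_match="http://thesaurus.example.org/c_0000",
)
print("\n".join(triples))
```

Once such triples are served at the concept's URI, any other system can reference the concept directly instead of copying its label.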
22. agINFRA LOD layer usage scenario 2
• Once KOSs are published, all metadata and all concepts have URIs and can be referenced by any other system
• Data aggregators like AGRIS and GLN can create mash-ups between their core data and other agricultural data types (e.g. germplasm, soil maps, statistics, …) by using the LOD semantic backbone as a crosswalk between metadata formalizations and concepts in different vocabularies
23. agINFRA LOD layer usage scenario 2
Example: LOD-based mash-ups in AGRIS
[Diagram: AGRIS bibliographic metadata (AGRIS RDF store) linked to external datasets]
• Metadata and link keys: Journal, Topic, Geographic metadata, Thematic metadata, Scientific names
• Linked sources: AGRIS Journals, DBpedia, FAO Country Profiles, FAO Fisheries, WorldBank indicators by country
• Resulting information: Info on journal, Info on topic, Info on country, Info on species, Specific indicators on country
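The mash-up idea on this slide (following shared URIs from a bibliographic record to other datasets) can be sketched as a simple lookup. This is a toy illustration under stated assumptions: the record layout, the source names and every URI are hypothetical placeholders, and a real deployment would query RDF stores rather than in-memory dictionaries.

```python
def mash_up(record, sources):
    """Attach external information to a record by following its URIs.

    `record["links"]` maps a metadata field to a concept URI;
    `sources` maps a source name to a {uri: information} dictionary.
    """
    linked = {}
    for field, uri in record["links"].items():
        for source_name, entries in sources.items():
            if uri in entries:
                linked.setdefault(field, {})[source_name] = entries[uri]
    enriched = dict(record)  # leave the original record untouched
    enriched["linked_info"] = linked
    return enriched


# Hypothetical record and sources, for illustration only.
record = {
    "title": "Example bibliographic record",
    "links": {
        "topic": "http://vocab.example.org/concept/rice",
        "country": "http://vocab.example.org/country/XX",
    },
}
sources = {
    "topic_descriptions": {
        "http://vocab.example.org/concept/rice": "Placeholder description of the topic",
    },
    "country_profiles": {
        "http://vocab.example.org/country/XX": {"name": "Placeholder country"},
    },
}

result = mash_up(record, sources)
print(result["linked_info"])
```

The join works only because both sides use the same URIs, which is the role the slide assigns to the LOD semantic backbone.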
24. Workflow architecture
[Diagram: metadata processing workflow]
• Components: Ariadne harvester; Filtering component; Identification and de-duplication component; Link checking component; Transformation component; PostProcessing/Enrichment component
• Stores: File system (DC, IEEE LOM, MODS XML), shown twice; File system (XMLs); MySQL
• Other labels: Get unique ID; Stores Duplicates; Records with Broken Links; Store metadata in JSON; To be ported on the Grid
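A minimal sketch of how the components named in this workflow could fit together is given below. It is not the agINFRA implementation: the order of the steps, the filtering rule, the choice of SHA-1 over the identifier as the unique ID, and all function names are assumptions made for illustration, and link checking is passed in as a function so the example runs without network access.

```python
import hashlib
import json
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"


def parse_dc(xml_text):
    """Collect Dublin Core elements of one record into {name: [values]}."""
    record = {}
    for element in ET.fromstring(xml_text).iter():
        if element.tag.startswith("{%s}" % DC_NS) and element.text:
            value = element.text.strip()
            if value:
                record.setdefault(element.tag.split("}")[1], []).append(value)
    return record


def keep(record):
    """Filtering (assumed rule): require a title and an identifier."""
    return bool(record.get("title")) and bool(record.get("identifier"))


def unique_id(record):
    """Identification (assumed rule): hash the normalised identifiers."""
    key = "|".join(record["identifier"]).lower()
    return hashlib.sha1(key.encode("utf-8")).hexdigest()


def run_pipeline(xml_documents, link_ok):
    """Filter, de-duplicate, link-check and transform records to JSON."""
    stored, duplicates, broken = {}, [], []
    for xml_text in xml_documents:
        record = parse_dc(xml_text)
        if not keep(record):
            continue
        uid = unique_id(record)
        if uid in stored:
            duplicates.append(record)
        elif not all(link_ok(url) for url in record["identifier"]):
            broken.append(record)
        else:
            stored[uid] = json.dumps(record, sort_keys=True)
    return stored, duplicates, broken


# Hypothetical harvested documents.
DOC = ('<record xmlns:dc="http://purl.org/dc/elements/1.1/">'
       "<dc:title>%s</dc:title><dc:identifier>%s</dc:identifier></record>")
docs = [
    DOC % ("Maize trial", "http://repo.example.org/1"),
    DOC % ("Maize trial (copy)", "http://repo.example.org/1"),
    DOC % ("Dead link", "http://repo.example.org/gone"),
    '<record xmlns:dc="http://purl.org/dc/elements/1.1/">'
    "<dc:title>No identifier</dc:title></record>",
]

# Stand-in link checker: treat URLs ending in "/gone" as broken.
stored, duplicates, broken = run_pipeline(
    docs, lambda url: not url.endswith("/gone")
)
print(len(stored), len(duplicates), len(broken))
```

Each record is processed independently of the others except for the duplicate check, which is one reason such a workflow is a plausible candidate for the Grid porting mentioned in the diagram.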