This document is part of a project that has received funding from the European Union’s Horizon 2020 research and innovation
programme under grant agreement No 732064. It is the property of the DataBio consortium and shall not be distributed or
reproduced without the formal approval of the DataBio Management Committee.
Project Acronym: DataBio
Grant Agreement number: 732064 (H2020-ICT-2016-1 – Innovation Action)
Project Full Title: Data-Driven Bioeconomy
Project Coordinator: INTRASOFT International
DELIVERABLE
D6.2 – Data Management Plan
Dissemination level PU - Public
Type of Document Report
Contractual date of delivery M06 – 30/6/2017
Deliverable Leader CREA
Status - version, date Final – v1.0, 30/6/2017
WP / Task responsible WP6
Keywords: Data management plan, big data, bioeconomy
D6.2 – Data Management Plan
H2020 Contract No. 732064 Final – v1.0, 30/6/2017
Dissemination level: PU -Public
Page 2
Executive Summary
This document presents DataBio's deliverable D6.2, the Data Management Plan (DMP), a key
element of good data management. DataBio participates in the European Commission H2020
Programme's extended open research data pilot and hence a DMP is required. Consequently,
the DataBio project's datasets will be as open as possible and as closed as necessary,
focusing on sound big data management in line with best research practice, in order to
create value and to foster knowledge and technology from big datasets for the public good.
The deliverable describes the data management life cycle for the data to be
collected, processed and/or generated by DataBio project, accounting also for the necessity
to make research data findable, accessible, interoperable and reusable (FAIR).
DataBio’s partners will be encouraged to adhere to sound data management to ensure that
data are well managed, archived and preserved. Data preservation goes hand in hand with
data relevance, since: (1) data can then be reused by other researchers, (2) the data collector
can direct requests for data to the database rather than address them individually, (3) preserved
data have the potential to lead to new, unanticipated discoveries, (4) preserved data prevent
duplication of scientific studies that have already been conducted, and (5) archiving data
insures against loss by the data collector. The main issues addressed in this deliverable
include: (1) the purpose of data collection, (2) data type, format, size, velocity, beneficiaries,
and provenance, (3) use of historical data, (4) making data FAIR, (5) data management
support, (6) data security, and (7) ethical aspects.
Big data is a new paradigm that is driving change in businesses and other organizations.
Some entities in the EU are starting to manage the massive data sets and non-traditional
data structures typical of big data, and/or are extending their data management skills
and portfolios of data management software to do so. Big data
management empowers those entities to efficiently automate business operations, operate
closer to real time, and through analytics, add value and learn valuable new facts about
business operations, customers, partners, etc. Within the DataBio framework, big data
management (BDM) is a mixture of conventional and new best practices, skills, teams, data
types, and in-house or vendor-built functionality. All of these are being realigned under the
DataBio platform, built upon the partners' own experiences and tools. It is anticipated that DataBio
will provide a solution that assumes datasets will be distributed among different
infrastructures and that their accessibility could be complex, requiring mechanisms
that facilitate data retrieval, processing, manipulation and visualization as seamlessly as
possible. The infrastructure will open new possibilities for the ICT sector, including SMEs,
to develop new Bioeconomy 4.0 solutions, and will also open new possibilities for
companies from the Earth Observation sector.
Some partners have scaled up pre-existing applications and databases to handle burgeoning
volumes of relational big data, or they have acquired new data management platforms that
are purpose-built for managing and analyzing multi-structured big data, including streaming
big data. Others are evaluating big data platforms, drawing on a brisk market of vendor
products and services for managing and harnessing big data. The Hadoop Distributed File
System (HDFS), MapReduce, various Hadoop tools, complex event processing (for streaming
big data), NoSQL databases (for schema-free big data), in-memory databases (for real-time
analytic processing of big data), private clouds, in-database analytics, and grid computing will
be some of the software products implemented within the DataBio framework.
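To make the processing style named above concrete, the core MapReduce pattern can be sketched in plain Python. This is an illustrative sketch only, not DataBio code; the record format and field names are hypothetical:

```python
from collections import defaultdict

# Hypothetical raw sensor records as CSV lines: field_id,soil_moisture
lines = [
    "field-1,0.31",
    "field-2,0.45",
    "field-1,0.35",
    "field-2,0.41",
]

# "Map" phase: parse each line into a (key, value) pair
def map_record(line):
    field_id, value = line.split(",")
    return field_id, float(value)

mapped = [map_record(line) for line in lines]

# "Shuffle" phase: group values by key
groups = defaultdict(list)
for key, value in mapped:
    groups[key].append(value)

# "Reduce" phase: aggregate each group (here, the mean moisture per field)
averages = {key: sum(values) / len(values) for key, values in groups.items()}
```

A production deployment would run the same map/shuffle/reduce logic in parallel across an HDFS cluster rather than in a single process.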
During the lifecycle of the DataBio project, big data will be collected, that is, very large data
sets (multi-terabyte or larger) consisting of a wide range of data types (relational, text,
multi-structured data, etc.) from numerous sources. Most data will come from farm and forestry
machinery, fishing vessels, remote and proximal sensors and imagery, and many other
technologies. DataBio is purposefully collecting big data, specifically:
• Forestry: Big Data methods are expected to both increase the value of forests and
decrease costs, within the sustainability limits set by natural growth and ecological
considerations. The key is to gather ever more accurate information about the trees
from a host of sources, including new-generation satellites, UAV images, laser
scanning, mobile devices (through crowdsourcing) and machines operating in the
forests.
• Agriculture: Big Data in agriculture is currently a hot topic. DataBio aims at building a
European vision of Big Data for agriculture: solutions that will increase the role of Big
Data in agri-food chains in Europe, together with recommendations for future big data
development in Europe.
• Fisheries: The ambition of this project is to herald and promote the use of Big Data
analytical tools within fisheries applications by initiating several pilots that will
demonstrate the benefits of using Big Data analytically for fisheries, such as
improved analysis of operational data, tools for planning and operational choices,
and crowdsourcing methods for fish stock estimation.
This is the first version of the DataBio DMP; it will be updated over the course of the project as
warranted by significant changes arising during the project implementation, and the
requirements of the project consortium. At least two updates will be prepared, on Months
18 and 36 of the project.
Deliverable Leader: Ephrem Habyarimana (CREA)
Contributors:
Jaroslav Šmejkal (ZETOR), Tomas Mildorf (UWB), Bernard
Stevenot (SPACEBEL), Irene Matzakou (INTRASOFT), Ingo
Simonis (OGSE), Christian Zinke (INFAI), Karel Charvat (LESPRO)
Reviewers:
Kyrill Meyer (INFAI), Tomas Mildorf (UWB), Erwin Goor (VITO),
Fabiana Fournier (IBM), Marco Folegani (MEEO)
Approved by: Athanasios Poulakidas (INTRASOFT)
Document History
Version Date Contributor(s) Description
0.1.1-2 12/05/2017 Ephrem Habyarimana TOC
0.1.3 22/05/2017 Ephrem Habyarimana Reviewed TOC, first assignments
0.2 30/05/2017 Tomas Mildorf Section 4.1 FAIR data costs
0.3 05/06/2017 Bernard Stevenot Section 6 Ethical issues
0.4 09/06/2017 Irene Matzakou, Athanasios Poulakidas Sections 5.4 - 5.5 Privacy and sensitive data management
0.5.1 21/06/2017 Ingo Simonis Sections 3.3 and 3.4 added
0.5.2 22/06/2017 Christian Zinke, Jaroslav Šmejkal Sections 2.2.4.4 Machine-generated data and 4.2 added
0.6 23/06/2017 Ephrem Habyarimana Added: Executive summary, sections 1.2 & 2.1, and chapter 7
0.7 27/06/2017 Ephrem Habyarimana Added section 1.3 and made edits throughout the document.
0.8 28/06/2017 Tomas Mildorf Update of Sections 2.2.4.3, 2.5.4, 2.5.5, 3.1.3 and 4.1
0.9 30/06/2017 Ephrem Habyarimana Included all tables for currently described DataBio's datasets; overall edit of entire document.
1.0 30/06/2017 Athanasios Poulakidas Compliance to submission format and minor changes.
Table of Contents
EXECUTIVE SUMMARY.....................................................................................................................................2
TABLE OF CONTENTS........................................................................................................................................6
TABLE OF FIGURES ...........................................................................................................................................8
LIST OF TABLES ................................................................................................................................................8
DEFINITIONS, ACRONYMS AND ABBREVIATIONS.............................................................................................9
INTRODUCTION ....................................................................................................................................10
1.1 PROJECT SUMMARY.....................................................................................................................................10
1.2 DOCUMENT SCOPE......................................................................................................................................13
1.3 DOCUMENT STRUCTURE ...............................................................................................................................14
DATA SUMMARY ..................................................................................................................................15
2.1 PURPOSE OF DATA COLLECTION......................................................................................................................15
2.2 DATA TYPES AND FORMATS ...........................................................................................................................17
2.2.1 Structured data.............................................................................................................................17
2.2.2 Semi-structured data ....................................................................................................................17
2.2.3 Unstructured data.........................................................................................................................19
2.2.4 New generation big data ..............................................................................................................19
2.3 HISTORICAL DATA........................................................................................................................................25
2.4 EXPECTED DATA SIZE AND VELOCITY.................................................................................................................26
2.5 DATA BENEFICIARIES ....................................................................................................................................26
2.5.1 Agricultural Sector ........................................................................................................................27
2.5.2 Forestry Sector..............................................................................................................................27
2.5.3 Fishery Sector................................................................................................................................28
2.5.4 Technical Staff...............................................................................................................................28
2.5.5 ICT sector ......................................................................................................................................28
2.5.6 Research and education................................................................................................................30
2.5.7 Policy making bodies.....................................................................................................................30
FAIR DATA ............................................................................................................................................31
3.1 DATA FINDABILITY .......................................................................................................................................31
3.1.1 Data discoverability and metadata provision...............................................................................31
3.1.2 Data identification, naming mechanisms and search keyword approaches.................................33
3.1.3 Data lineage..................................................................................................................................34
3.2 DATA ACCESSIBILITY .....................................................................................................................................37
3.2.1 Open data and closed data...........................................................................................................37
3.2.2 Data access mechanisms, software and tools ..............................................................................38
3.2.3 Big data warehouse architectures and database management systems .....................................38
3.3 DATA INTEROPERABILITY ...............................................................................................................................40
3.3.1 Interoperability mechanisms ........................................................................................................41
3.3.2 Inter-discipline interoperability and ontologies ............................................................................41
3.4 PROMOTING DATA REUSE..............................................................................................................................42
DATA MANAGEMENT SUPPORT............................................................................................................43
4.1 FAIR DATA COSTS........................................................................................................................................43
4.2 BIG DATA MANAGERS...................................................................................................................................43
4.2.1 Project manager ...........................................................................................................................43
4.2.2 Business Analysts ..........................................................................................................................44
4.2.3 Data Scientists ..............................................................................................................................44
4.2.4 Data Engineer / Architect .............................................................................................................44
4.2.5 Platform architects .......................................................................................................................44
4.2.6 IT/Operation manager..................................................................................................................44
4.2.7 Consultant.....................................................................................................................................45
4.2.8 Business User ................................................................................................................................45
4.2.9 Pilot experts ..................................................................................................................................45
DATA SECURITY ....................................................................................................................................46
5.1 INTRODUCTION...........................................................................................................................................46
5.2 DATA RECOVERY..........................................................................................................................................47
5.3 PRIVACY AND SENSITIVE DATA MANAGEMENT ...................................................................................................48
5.3.1 Introduction ..................................................................................................................................48
5.3.2 Enterprise Data (commercial sensitive data)................................................................................48
5.3.3 Personal Data................................................................................................................................49
5.4 GENERAL PRIVACY CONCERNS ........................................................................................................................50
ETHICAL ISSUES.....................................................................................................................................51
CONCLUSIONS ......................................................................................................................................52
REFERENCES .........................................................................................................................................54
APPENDIX A DATABIO DATASETS ...........................................................................................................55
A.1 SMART POI DATA SET (UWB - D03.01) ....................................................................................................56
A.2 OPEN TRANSPORT MAP (UWB - D03.02) .................................................................................................58
A.3 SENTINELS SCIENTIFIC HUB DATASETS VIA FEDEO GATEWAY (SPACEBEL -D07.01)..........................................60
A.4 NASA CMR LANDSAT DATASETS VIA FEDEO GATEWAY (SPACEBEL - D07.02)...............................................61
A.5 OPEN LAND USE (LESPRO - D02.01) .........................................................................................................62
A.6 FOREST RESOURCE DATA (METSAK - D18.01)............................................................................................64
A.7 CUSTOMER AND FOREST ESTATE DATA (METSAK - D18.02)..........................................................................65
A.8 STORM DAMAGE OBSERVATIONS AND POSSIBLE RISK AREAS (METSAK - D18.03)..............................................67
A.9 QUALITY CONTROL DATA (METSAK - D18.04) ...........................................................................................68
A.10 ONTOLOGY FOR (PRECISION) AGRICULTURE (PSNC - D09.01).......................................................................69
A.11 WUUDIS DATA (MHGS - D20.01)............................................................................................................71
A.12 SIGPAC (TRAGSA - D11.05)....................................................................................................................72
A.13 FIELD DATA - PILOT B2 (TRAGSA - D11.07).................................................................................................74
A.14 IACS (NP - D13.01)..............................................................................................................................75
A.15 SENTINEL DATA......................................................................................................................................76
A.16 TREE SPECIES MAP (FMI - D14.03) ..........................................................................................................76
A.17 STAND AGE MAP (FMI - D14.04) .............................................................................................................77
A.18 CANOPY HEIGHT MAP (FMI - D14.05).......................................................................................................78
A.19 LEAF AREA INDEX (FMI - D14.06).............................................................................................................79
A.20 FOREST DAMAGE (FMI - D14.07).............................................................................................................80
A.21 HYPERSPECTRAL IMAGE ORTHOMOSAIC (SENOP - D44.02) ............................................................................81
A.22 GAIATRONS IOT (DS13.01) ...................................................................................................................81
A.23 PHENOMICS, METABOLOMICS, GENOMICS AND ENVIRONMENTAL DATASETS (CERTH - DS40.01) .........................82
Table of Figures
FIGURE 1: DATABIO’S ANALYTICS AND BIG DATA VALUE APPROACH .....................................................................................16
FIGURE 2: THE PROCESSING DATA LIFECYCLE ...................................................................................................................36
FIGURE 3: THE “DISCIPLINARY DATA INTEGRATION PLATFORM: WHERE DO YOU SIT?” (SOURCE: WYBORN)..................41
FIGURE 4: DATABIO’S DATA MANAGERS.........................................................................................................................45
FIGURE 5: DATA LIFECYCLE ..........................................................................................................................................46
FIGURE 6: THE DATA MODEL OF SMART POINTS OF INTEREST ............................................................................................58
FIGURE 7: THE DATA MODEL OF OPEN TRANSPORT MAP...................................................................................................60
FIGURE 8: FEDEO CLIENT (C07.05) .............................................................................................................................61
List of Tables
TABLE 1: THE DATABIO CONSORTIUM PARTNERS.............................................................................................................10
TABLE 2: SENSOR DATA TOOLS, RESOLUTION AND SPATIAL DENSITY .....................................................................................20
TABLE 3: GEOSPATIAL DATA TOOLS, FORMAT AND ORIGIN .................................................................................................24
TABLE 4: GENOMIC, BIOCHEMICAL AND METABOLOMIC DATA TOOLS, DESCRIPTION AND ACQUISITION........................................25
Definitions, Acronyms and Abbreviations
Acronym/
Abbreviation
Title
BDVA Big Data Value Association
EC European Commission
EO Earth Observation
ETL Extract Transform Load
DMP Data Management Plan
GSM Global System for Mobile Communications
GPS Global Positioning System
FAIR Findable, Accessible, Interoperable and Reusable
HDFS Hadoop Distributed File System
ICT Information and Communications Technology
IoT Internet of Things
JDBC Java DataBase Connectivity
JSON JavaScript Object Notation
NoSQL Not Only SQL
ODBC Open Database Connectivity
OEM Object Exchange Model
OGC Open Geospatial Consortium
REST Representational State Transfer
RFID Radio-Frequency IDentification
RPAS Remotely Piloted Aircraft Systems
SME Small and Medium-sized Enterprise
SOAP Simple Object Access Protocol
SQL Structured Query Language
UAV Unmanned Aerial Vehicle
UI User Interface
WP Work Package
XML eXtensible Markup Language
Introduction
1.1 Project Summary
The data-intensive target sector on which the DataBio project focuses is the Data-Driven
Bioeconomy. DataBio focuses on utilizing Big Data to contribute to the production of the
best possible raw materials from agriculture, forestry and fishery (aquaculture) for the
bioeconomy industry, as well as their further processing into food, energy and
biomaterials, while taking into account various accountability and sustainability issues.
DataBio will deploy state-of-the-art big data technologies and existing partners’ infrastructure
and solutions, linked together through the DataBio Platform. These will aggregate Big Data
from the three identified sectors (agriculture, forestry and fishery), intelligently process them
and allow the three sectors to selectively utilize numerous platform components, according
to their requirements. The execution will be through continuous cooperation of end user and
technology provider companies, bioeconomy and technology research institutes, and
stakeholders from the big data value PPP programme.
DataBio is driven by the development, use and evaluation of a large number of pilots in the
three identified sectors, where associated partners and additional stakeholders are also
involved. The selected pilot concepts will be transformed to pilot implementations utilizing
co-innovative methods and tools. The pilots select and utilize the best suitable market-ready
or almost market-ready ICT, Big Data and Earth Observation methods, technologies, tools and
services to be integrated to the common DataBio Platform.
Based on the pilot results and the new DataBio Platform, new solutions and new business
opportunities are expected to emerge. DataBio will organize a series of trainings and
hackathons to support its uptake and to enable developers outside the consortium to design
and develop new tools, services and applications based on and for the DataBio Platform.
The DataBio consortium is listed in Table 1. For more information about the project see [REF-01].
Table 1: The DataBio consortium partners
Number Name Short name Country
1 (CO) INTRASOFT INTERNATIONAL SA INTRASOFT Belgium
2 LESPROJEKT SLUZBY SRO LESPRO Czech Republic
3 ZAPADOCESKA UNIVERZITA V PLZNI UWB Czech Republic
4 FRAUNHOFER GESELLSCHAFT ZUR FOERDERUNG DER ANGEWANDTEN FORSCHUNG E.V. Fraunhofer Germany
5 ATOS SPAIN SA ATOS Spain
6 STIFTELSEN SINTEF SINTEF ICT Norway
7 SPACEBEL SA SPACEBEL Belgium
8 VLAAMSE INSTELLING VOOR TECHNOLOGISCH ONDERZOEK N.V. VITO Belgium
9 INSTYTUT CHEMII BIOORGANICZNEJ POLSKIEJ AKADEMII NAUK PSNC Poland
10 CIAOTECH Srl CiaoT Italy
11 EMPRESA DE TRANSFORMACION AGRARIA SA TRAGSA Spain
12 INSTITUT FUR ANGEWANDTE INFORMATIK (INFAI) EV INFAI Germany
13 NEUROPUBLIC AE PLIROFORIKIS & EPIKOINONION NP Greece
14 Ústav pro hospodářskou úpravu lesů Brandýs nad Labem UHUL FMI Czech Republic
15 INNOVATION ENGINEERING SRL InnoE Italy
16 Teknologian tutkimuskeskus VTT Oy VTT Finland
17 SINTEF FISKERI OG HAVBRUK AS SINTEF Fishery Norway
18 SUOMEN METSAKESKUS-FINLANDS SKOGSCENTRAL METSAK Finland
19 IBM ISRAEL - SCIENCE AND TECHNOLOGY LTD IBM Israel
20 MHG SYSTEMS OY - MHGS MHGS Finland
21 NB ADVIES BV NB Advies Netherlands
22 CONSIGLIO PER LA RICERCA IN AGRICOLTURA E L'ANALISI DELL'ECONOMIA AGRARIA CREA Italy
23 FUNDACION AZTI - AZTI FUNDAZIOA AZTI Spain
24 KINGS BAY AS KingsBay Norway
25 EROS AS Eros Norway
26 ERVIK & SAEVIK AS ESAS Norway
27 LIEGRUPPEN FISKERI AS LiegFi Norway
28 E-GEOS SPA e-geos Italy
29 DANMARKS TEKNISKE UNIVERSITET DTU Denmark
30 FEDERUNACOMA SRL UNIPERSONALE Federu Italy
31 CSEM CENTRE SUISSE D'ELECTRONIQUE ET DE MICROTECHNIQUE SA - RECHERCHE ET DEVELOPPEMENT CSEM Switzerland
32 UNIVERSITAET ST. GALLEN UStG Switzerland
33 NORGES SILDESALGSLAG SA Sildes Norway
34 EXUS SOFTWARE LTD EXUS United Kingdom
35 CYBERNETICA AS CYBER Estonia
36 GAIA EPICHEIREIN ANONYMI ETAIREIA PSIFIAKON YPIRESION GAIA Greece
37 SOFTEAM Softeam France
38 FUNDACION CITOLIVA, CENTRO DE INNOVACION Y TECNOLOGIA DEL OLIVAR Y DEL ACEITE CITOLIVA Spain
39 TERRASIGNA SRL TerraS Romania
40 ETHNIKO KENTRO EREVNAS KAI TECHNOLOGIKIS ANAPTYXIS CERTH Greece
41 METEOROLOGICAL AND ENVIRONMENTAL EARTH OBSERVATION SRL MEEO Italy
42 ECHEBASTAR FLEET SOCIEDAD LIMITADA ECHEBF Spain
43 NOVAMONT SPA Novam Italy
44 SENOP OY Senop Finland
45 UNIVERSIDAD DEL PAIS VASCO/ EUSKAL HERRIKO UNIBERTSITATEA EHU/UPV Spain
46 OPEN GEOSPATIAL CONSORTIUM (EUROPE) LIMITED LBG OGCE United Kingdom
47 ZETOR TRACTORS AS ZETOR Czech Republic
48 COOPERATIVA AGRICOLA CESENATE SOCIETA COOPERATIVA AGRICOLA CAC Italy
1.2 Document Scope
This document outlines DataBio’s data management plan (DMP), formally documenting how
data will be handled both during the implementation and upon natural termination of the
project. Many DMP aspects will be considered including metadata generation, data
preservation, data security and ethics, accounting for the FAIR (Findable, Accessible,
Interoperable, Re-usable) data principles. DataBio, the Data-Driven Bioeconomy project, is a
big-data-intensive innovation action involving a public-private partnership to promote the
productivity of EU companies in three major bioeconomy sectors, namely agriculture,
forestry and fishery. Experience from the US shows that the bioeconomy can get a significant boost
from Big Data. In Europe, this sector has until now attracted few large ICT vendors. A central
goal of DataBio is to increase the participation of the European ICT industry in the development of
Big Data systems for boosting the lagging bioeconomy productivity. As a case in point,
European agriculture, forestry and fishery can benefit greatly from the European Copernicus
space programme, which has to date launched its third Sentinel satellite, as well as from
telemetry, IoT, UAVs, etc.
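As a hedged illustration of the metadata generation mentioned above (not a DataBio specification), a minimal machine-readable dataset record supporting findability can be assembled and serialized as JSON; all field names and values below are hypothetical placeholders:

```python
import json

# Hypothetical dataset description; keys loosely follow common catalogue
# vocabularies (identifier, title, description, keywords, license, etc.).
metadata = {
    "identifier": "doi:10.0000/example-dataset",  # placeholder identifier
    "title": "Example field-sensor time series",
    "description": "Soil moisture readings from in-field sensors.",
    "keywords": ["agriculture", "soil moisture", "IoT"],
    "license": "CC-BY-4.0",
    "temporalCoverage": "2017-01-01/2017-06-30",
    "format": "text/csv",
}

# Serialize so a catalogue or harvester can index the record
record = json.dumps(metadata, indent=2)
print(record)
```

A persistent identifier and rich keywords are what allow such a record to be harvested and searched by data catalogues.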
Farm and forestry machinery and fishing vessels in use today collect large quantities of data
at an unprecedented rate. Remote and proximal sensors and imagery, and many other
technologies, are all working together to give details about crop and soil properties, marine
environment, weeds and pests, sunlight and shade, and many other primary production
relevant variables. Deploying big data analytics on these data can help farmers, foresters
and fishers to adjust and improve the productivity of their business operations. On the other
hand, large data sets such as those coming from the Copernicus earth monitoring
infrastructure, are increasingly available on different levels of granularity, but they are
heterogeneous, at times also unstructured, hard to analyze and distributed across various
sectors and different providers. It is here that the data management plan comes in. It is
anticipated that DataBio will provide a solution that assumes datasets will be
distributed among different infrastructures and that their accessibility could be complex,
requiring mechanisms that facilitate data retrieval, processing, manipulation and
visualization as seamlessly as possible. The infrastructure will open new possibilities for the
ICT sector, including SMEs, to develop new Bioeconomy 4.0 solutions, and will also open new
possibilities for companies from the Earth Observation sector.
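As a purely illustrative sketch of such analytics (the readings, window size and threshold below are hypothetical, not taken from any DataBio pilot), even a simple rolling average over in-field sensor readings can flag when a field may need attention:

```python
from collections import deque

def rolling_mean(window):
    return sum(window) / len(window)

# Hypothetical hourly soil-moisture readings for one field (fraction of saturation)
readings = [0.42, 0.40, 0.37, 0.33, 0.30, 0.28, 0.27]

WINDOW = 3          # hours to average over
THRESHOLD = 0.30    # illustrative irrigation trigger, not a real agronomic value

window = deque(maxlen=WINDOW)
alerts = []
for hour, value in enumerate(readings):
    window.append(value)
    # Flag the hour once the rolling average drops below the threshold
    if len(window) == WINDOW and rolling_mean(window) < THRESHOLD:
        alerts.append(hour)

print(alerts)  # hours at which the 3-hour average fell below the threshold
```

Real pilots would of course combine many such signals (weather, imagery, soil maps) rather than a single threshold rule.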
This DMP will be updated over the course of the DataBio project whenever significant changes
arise. The updates of this document will provide increasing depth on DataBio's DMP
strategies, with particular attention to the findability, accessibility, interoperability
D6.2 – Data Management Plan
H2020 Contract No. 732064 Final – v1.0, 30/6/2017
Dissemination level: PU -Public
Page 14
and reusability of the Big Data the project produces. At least two updates will be prepared,
on Month 18 and Month 36 of the project.
1.3 Document Structure
This document comprises the following chapters:
Chapter 1 presents an introduction to the project and the document.
Chapter 2 presents the data summary including the purpose of data collection, data size, type
and format, historical data reuse and data beneficiaries.
Chapter 3 outlines DataBio’s FAIR data strategies.
Chapter 4 describes data management support.
Chapter 5 describes data security.
Chapter 6 describes ethical issues.
Chapter 7 presents the concluding remarks.
Appendix A presents the managed data sets.
2 Data Summary
2.1 Purpose of data collection
During the lifecycle of the DataBio project, big data will be collected, that is, very large data
sets (multi-terabyte or larger) consisting of a wide range of data types (relational, text, multi-
structured data, etc.) from numerous sources, including relatively new-generation big data
(machines, sensors, genomics, etc.). The ultimate purpose of data collection is to use the data
as a source of information in the implementation of a variety of big data analytics algorithms,
services and applications DataBio will deploy to create value, new business facts and insights
with a particular focus on the bioeconomy industry. The big datasets are part of the building
blocks of the DataBio’s big data technology platform (Figure 1) that was designed to help
European companies increase productivity. Big Data experts provide common analytic
technology support for the main common and typical Bioeconomy applications/analytics that
are now emerging through the pilots in the project. Data from the past will be managed and
analyzed, covering many different kinds of data sources: descriptive analytics and classical
query/reporting (in need of variety management, i.e. handling and analysis of all of the data
from the past, including performance data, transactional data, attitudinal data, descriptive
data, behavioural data, location-related data and interactional data, from many different
sources). Big data from the present will be harnessed in the process of monitoring and
real-time analytics pilot services (in need of velocity processing, i.e. handling of real-time
data from the present), triggering alarms, actuators, etc.
Harnessing big data for the future includes forecasting, prediction and recommendation
analytics pilot services (in need of volume processing, i.e. processing of large amounts of
data combining knowledge from the past and present, and from models, to provide insight
for the future).
Figure 1: DataBio’s analytics and big data value approach
Specifically:
• Forestry: Big Data methods are expected to bring the possibility to both increase the
value of the forests as well as to decrease the costs within sustainability limits set by
natural growth and ecological aspects. The key technology is to gather more and more
accurate information about the trees from a host of sensors including new generation
of satellites, UAV images, laser scanning, mobile devices through crowdsourcing and
machines operating in the forests.
• Agriculture: Big Data in Agriculture is currently a hot topic. The DataBio intention is to
build a European vision of Big Data for agriculture. This vision is to offer solutions
which will increase the role of Big Data in Agri-Food chains in Europe: a
perspective which will prepare recommendations for future big data development in
Europe.
• Fisheries: The ambition is to herald and promote the use of Big Data analytical tools
within fisheries applications by initiating several pilots which will demonstrate
benefits of using Big Data in an analytical way for the fisheries, such as improved
analysis of operational data, tools for planning and operational choices,
crowdsourcing methods for fish stock estimation.
• The use of Big data analytics will bring about innovation. It will generate significant
economic value, extend the relevant market sectors, and herald novel
business/organizational models. The cross-cutting character of the geo-spatial Big
Data solutions allows the straightforward extension of the scope of applications
beyond the bio-economy sectors. Such extensions of the market for the Big Data
technologies are foreseen in economic sectors, such as: Urban planning, Water
quality, Public safety (incl. technological and natural hazards), Protection of critical
infrastructures, Waste management. On the other hand, the Big Data technologies
revolutionize the business approach in the geospatial market and foster the
emergence of innovative business/organizational models; indeed, to achieve the cost
effectiveness of the services to the customers, it is necessary to organize the offer to
the market on a territorial/local basis, as the users share the same geospatial sources
of data and are best served by local players (service providers). This can be illustrated
by a network of European services providers, developing proximity relationships with
their customers and sharing their knowledge through the network.
2.2 Data types and formats
The DataBio-specific data types, formats and sources are listed in detail in Appendix A; the
key features of the data used in the project are described below.
2.2.1 Structured data
Structured data refers to any data that resides in a fixed field within a record or file. This
includes data contained in relational databases, spreadsheets, and data in forms of events
such as sensor data. Structured data first depends on creating a data model – a model of the
types of business data that will be recorded and how they will be stored, processed and
accessed. This includes defining what fields of data will be stored and how that data will be
stored: data type (numeric, currency, alphabetic, name, date, address) and any restrictions
on the data input (number of characters; restricted to certain terms such as Mr., Ms. or Dr.;
M or F).
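As a minimal sketch of such a data model, the structure above could be expressed as a relational table; the table, field names, types and restricted terms below are illustrative assumptions, not an actual DataBio schema:

```python
import sqlite3

# Hypothetical structured-data model: a fixed-field table with a data
# type per column and a restriction on allowed input terms.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE sensor_reading (
        station_id   TEXT NOT NULL,
        observed_at  TEXT NOT NULL,            -- ISO 8601 date/time
        parameter    TEXT CHECK (parameter IN  -- restricted terms
                       ('air_temp', 'humidity', 'rainfall')),
        value        REAL
    )
""")
conn.execute(
    "INSERT INTO sensor_reading VALUES (?, ?, ?, ?)",
    ("ST-01", "2017-06-30T10:00:00", "air_temp", 21.4),
)
row = conn.execute("SELECT parameter, value FROM sensor_reading").fetchone()
print(row)  # ('air_temp', 21.4)
```

The CHECK constraint illustrates how input restrictions are enforced by the model itself rather than by the applications that read the data.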
2.2.2 Semi-structured data
Semi-structured data is a cross between structured and unstructured data. It is a type of
structured data, but lacks the strict data model structure. With semi-structured data, tags or
other types of markers are used to identify certain elements within the data, but the data
doesn't have a rigid structure. For example, word processing software now can include
metadata showing the author's name and the date created, with the bulk of the document
just being unstructured text. Emails have the sender, recipient, date, time and other fixed
fields added to the unstructured data of the email message content and any attachments.
Photos or other graphics can be tagged with keywords such as the creator, date, location and
keywords, making it possible to organize and locate graphics. XML and other markup
languages are often used to manage semi-structured data. Semi-structured data is therefore
a form of structured data that does not conform with the formal structure of data models
associated with relational databases or other forms of data tables, but nonetheless contains
tags or other markers to separate semantic elements and enforce hierarchies of records and
fields within the data. Therefore, it is also known as having a self-describing structure. In semi-
structured data, the entities belonging to the same class may have different attributes even
though they are grouped together, and the attributes' order is not important. Semi-structured
data are increasingly occurring since the advent of the Internet where full-text documents
and databases are not the only forms of data anymore, and different applications need a
medium for exchanging information. In object-oriented databases, one often finds semi-
structured data.
XML and other markup languages, email, and EDI are all forms of semi-structured data. OEM
(Object Exchange Model) was created prior to XML as a means of self-describing a data
structure. XML has been popularized by web services that are developed utilizing SOAP
principles. Some types of data described here as "semi-structured", especially XML, suffer
from the impression that they are incapable of structural rigor at the same functional level as
Relational Tables and Rows. Indeed, the view of XML as inherently semi-structured
(previously, it was referred to as "unstructured") has handicapped its use for a widening range
of data-centric applications. Even documents, normally thought of as the epitome of semi-
structure, can be designed with virtually the same rigor as database schema, enforced by the
XML schema and processed by both commercial and custom software programs without
reducing their usability by human readers.
In view of this fact, XML might be referred to as having "flexible structure" capable of human-
centric flow and hierarchy as well as highly rigorous element structure and data typing. The
concept of XML as "human-readable", however, can only be taken so far. Some
implementations/dialects of XML, such as the XML representation of the contents of a
Microsoft Word document, as implemented in Office 2007 and later versions, utilize dozens
or even hundreds of different kinds of tags that reflect a particular problem domain - in
Word's case, formatting at the character and paragraph and document level, definitions of
styles, inclusion of citations, etc. - which are nested within each other in complex ways.
Understanding even a portion of such an XML document by reading it, let alone catching
errors in its structure, is impossible without a very deep prior understanding of the specific
XML implementation, along with assistance by software that understands the XML schema
that has been employed. Such text is not "human-understandable" any more than a book
written in Swahili (which uses the Latin alphabet) would be to an American or Western
European who does not know a word of that language: the tags are symbols that are
meaningless to a person unfamiliar with the domain.
JSON, or JavaScript Object Notation, is an open standard format that uses human-readable
text to transmit data objects consisting of attribute–value pairs. It is used primarily to transmit
data between a server and web application, as an alternative to XML. JSON has been
popularized by web services developed utilizing REST principles. There is a new breed of
databases such as MongoDB and Couchbase that store data natively in JSON format,
leveraging the pros of semi-structured data architecture.
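The self-describing, flexible nature of semi-structured data can be illustrated with a small JSON payload; the records and field names below are invented examples, not DataBio data:

```python
import json

# Two records of the same class carry different attributes - the tags
# in each record describe its own structure, so no fixed schema is
# needed in advance.
payload = """
[
  {"sensor": "atmo-1", "air_temp": 21.4, "humidity": 63},
  {"sensor": "soil-7", "soil_temp": 14.9, "depth_cm": 30, "unit": "C"}
]
"""
records = json.loads(payload)
for rec in records:
    # Each record can be inspected via its own attribute-value pairs.
    print(rec["sensor"], sorted(rec.keys()))
```

This is exactly the property that document stores such as MongoDB exploit when persisting JSON natively.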
2.2.3 Unstructured data
Unstructured data (or unstructured information) refers to information that either does not
have a pre-defined data model or is not organized in a pre-defined manner. This results in
irregularities and ambiguities that make it difficult to understand using traditional programs
as compared to data stored in “field” form in databases or annotated (semantically tagged)
in documents. Unstructured data can't be so readily classified and fit into a neat box: photos
and graphic images, videos, streaming instrument data, webpages, PDF files, PowerPoint
presentations, emails, blog entries, wikis and word processing documents.
In 1998, Merrill Lynch cited a rule of thumb that somewhere around 80-90% of all potentially
usable business information may originate in unstructured form. This rule of thumb is not
based on primary or any quantitative research, but nonetheless is accepted by some. IDC and
EMC project that data will grow to 40 zettabytes by 2020, resulting in a 50-fold growth from
the beginning of 2010. Computer World states that unstructured information might account
for more than 70%–80% of all data in organizations.
Software that creates machine-processable structure can utilize the linguistic, auditory, and
visual structure that exist in all forms of human communication. Algorithms can infer this
inherent structure from text, for instance, by examining word morphology, sentence syntax,
and other small- and large-scale patterns. Unstructured information can then be enriched and
tagged to address ambiguities and relevancy-based techniques then used to facilitate search
and discovery. Examples of "unstructured data" may include books, journals, documents,
metadata, health records, audio, video, analog data, images, files, and unstructured text such
as the body of an e-mail message, Web page, or word-processor document. While the main
content being conveyed does not have a defined structure, it generally comes packaged in
objects (e.g. in files or documents, …) that themselves have structure and are thus a mix of
structured and unstructured data, but collectively this is still referred to as "unstructured
data".
2.2.4 New generation big data
The new generation big data is in particular focusing on semi-structured and unstructured
data, often in combination with structured data.
In the BDVA reference model for big data technologies, a distinction is made between six
different big data types.
2.2.4.1 Sensor data
Within the Databio pilots, several key parameters will be monitored through sensorial
platforms and sensor data will be collected along the way to support the project activities.
Three types of sensor data have already been identified, namely: a) IoT data from in-situ
sensors and telemetric stations, b) imagery data from unmanned aerial sensing platforms
(drones), and c) imagery from hand-held or mounted optical sensors.
2.2.4.1.1 Internet of Things data
The IoT data are a major subgroup of sensor data involved in multiple pilot activities in the
DataBio project. IoT data are sent via TCP/UDP protocols in various formats (e.g. txt files with
time series data, JSON strings) and can be further divided into the following categories:
• Agro-climatic/Field telemetry stations which contribute with raw data (numerical
values) related to several parameters. As different pilots focus on different application
scenarios, the following table summarizes several IoT-based monitoring approaches
to be followed.
Table 2: Sensor data tools, resolution and spatial density
• Pilots A1.1, B1.2, C1.1, C2.2: NP's GAIAtrons, which are telemetry IoT stations with a
modular/expandable design, will be used to monitor ambient temperature, humidity,
solar radiation, leaf wetness, rainfall volume, wind speed and direction and
barometric pressure (GAIAtron atmo), as well as soil temperature and humidity at
multiple depths (GAIAtron soil). Time step for data collection: every 10 minutes. One
station per microclimate zone (300 ha - 1100 ha for atmo, 300 ha - 3300 ha for soil).
• Pilots A1.2, B1.3: Field-bound sensors will be used to monitor air temperature, air
moisture, solar radiation, leaf wetness, rainfall, wind speed and direction, soil
moisture, soil temperature, soil EC/salinity, PAR and barometric pressure. These
sensors consist of a technology platform of retriever-and-pups wireless sensor
networks and SpecConnect, a cloud-based crop data management solution. The time
step for data collection is customizable from 1 to 60 minutes. Field sensors will be
used to monitor 5 tandemly located sites at the following densities: a) air
temperature, air moisture, rainfall, wind data and solar radiation: one block of
sensors per 5 ha; b) leaf wetness: two sensors per ha; c) soil moisture, soil
temperature and soil EC/salinity: one combined sensor per ha.
• Pilot A2.1: Environmental indoor: air temperature, air relative humidity, solar
radiation, crop leaf temperature (remotely and in contact), soil/substrate water
content. Environmental outdoor: wind speed and direction, evaporation, rain, UVA,
UVB. Resolution and spatial density: to be determined.
• Pilot B1.1: Agro-climatic IoT stations monitoring temperature, relative and absolute
humidity and wind parameters. Resolution and spatial density: to be determined.
• Control data in the parcels/fields measuring sprinklers, drippers, metering devices,
valves, alarm settings, heating, pumping state, pressure switches, etc.
• Contact sensing data that determine problems with great precision, speeding up the
use of techniques which help to solve problems
• Vessel and buoy-based stations which contribute with raw data (numerical values),
typically hydro acoustic and machinery data
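A telemetry stream of the kind described above (e.g. JSON strings carrying 10-minute time-series readings) could be aggregated as sketched below; the station name, field names and values are invented for illustration:

```python
import json
from collections import defaultdict
from statistics import mean

# Hypothetical telemetry messages, one JSON string per 10-minute
# reading (field names and values are assumptions).
messages = [
    '{"station": "A1", "time": "2017-06-30T10:%02d:00", "air_temp": %.1f}'
    % (m, t)
    for m, t in [(0, 20.8), (10, 21.0), (20, 21.4),
                 (30, 21.2), (40, 21.6), (50, 22.0)]
]

# Group the readings by hour and compute an hourly mean.
hourly = defaultdict(list)
for msg in messages:
    rec = json.loads(msg)
    hour = rec["time"][:13]  # "YYYY-MM-DDTHH"
    hourly[hour].append(rec["air_temp"])

for hour, temps in sorted(hourly.items()):
    print(hour, round(mean(temps), 2))  # 2017-06-30T10 21.33
```

Such downsampling is a typical first processing step before the data feed any of the analytics services described in Section 2.1.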
2.2.4.1.2 Drone data
A specific subset of sensor data generated and processed within DataBio project is images
produced by cameras on-board drones or RPAS (Remotely Piloted Aircraft Systems). In
particular, some DataBio pilots will use optical (RGB), thermal or multispectral images and 3D
point-clouds acquired from RPAS. The information generated by drone-airborne cameras is
usually Image Data (JPEG or JPEG2000). A general description of the workflow is provided
below.
Data acquired by the RGB sensor
The RGB sensor acquires individual pictures in .JPG format, together with their ‘geotag’ files,
which are downloaded from the RPAS and processed into:
• .LAS files: 3D point clouds (x, y, z), which are then processed to produce Digital Models
(Terrain- DTM, Surface-DSM, Elevation-DEM, Vegetation-DVM)
• .TIF files: which are then processed into an orthorectified mosaic. In order to obtain
smaller files, mosaics are usually exported to compressed .ECW format.
Data acquired by the thermal sensor
The Thermal sensor acquires a video file which is downloaded from the RPAS and:
• split into frames in .TIF format (pixels contain Digital Numbers: 0-255)
• 1 of every 10 frames is selected (with an overlap of about 80%, so as not to process an
excessive amount of information)
Data acquired by the multispectral sensor
The multispectral sensor acquires individual pictures from the 6 spectral channels in .RAW
format, which are downloaded from the RPAS and processed into:
• .TIF files (16 bits), which are then processed to produce a 6-bands .TIF mosaic (pixels
contain Digital Numbers: 0-255)
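A common product derived from such multispectral digital numbers is a vegetation index such as NDVI; the band assignment (red, near-infrared) and the pixel values below are assumptions for illustration, not a DataBio processing chain:

```python
# Tiny 2x2 rasters of digital numbers (0-255) for two assumed bands.
red = [[60, 80], [70, 90]]      # red-band digital numbers
nir = [[200, 180], [190, 210]]  # near-infrared digital numbers

# NDVI = (NIR - RED) / (NIR + RED), computed per pixel.
ndvi = [
    [(n - r) / (n + r) if (n + r) else 0.0
     for r, n in zip(r_row, n_row)]
    for r_row, n_row in zip(red, nir)
]
print([[round(v, 2) for v in row] for row in ndvi])
```

In practice this computation would run on the full orthorectified mosaics rather than on raw frames.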
2.2.4.1.3 Data from hand-held or mounted optical sensors
Images from hand-held or mounted cameras will be collected using truck-mounted or hand-
held full-range/high-resolution UV-VIS-NIR-SWIR spectroradiometers.
2.2.4.2 Machine-generated data
Machine-generated data in the DataBio project are data produced by ships, boats and
machinery used in agriculture and in forestry (such as tractors). These data will serve for
further analysis and optimisation of processes in the bio-economy sector.
For illustration purposes, examples of data collected by tractors in agriculture are described
below. Tractors are equipped with the following units:
• Control units for data control, data collection and analyses including dashboards,
transmission control unit, hydrostatic or hydrodynamic system control unit, engine
control unit.
• Global Positioning System (GPS) units or Global System for Mobile Communications
(GSM) units for tractor tracking.
• Unit for displaying field/soil characteristics including area, quality, boundaries and
yields.
These units generate the following data:
• Identification of tractor + identification of driver by code or by RFID module.
• Identification of the current operation status.
• Time identification by the date and the current time.
• Precise tractor location tracking (daily route, starts, stops, speed).
• Tractor hours - monitoring working hours in time and place.
• Information from tachometer [Σ km] and [Σ working hrs and min].
• Identification of the current maintenance status.
• Tractor diagnostic: failure modes or failure codes
• Information about the date of the last calibration of each tractor systems +
information about setting, information about SW version, last update, etc.
• The amount of fuel in the fuel tank [L].
• Online information about sudden loss of fuel in the fuel tank.
• Fuel consumption per trip / per time period / per kilometer (monitoring of fuel
consumption in various dependencies e.g. motor load).
• Total fuel consumption per day [L/day].
• Engine speed [rev/min].
• Possibility to set up engine speed online within a range [rev/min, from - to], with
signaling when limits are exceeded.
• Current position of accelerator pedal [% from scale 0-100 %].
• Charging level of the main battery [V].
• Current temperature of the cooling water [°C or °F].
• Current temperature of the motor oil [°C or °F].
• Current temperature of the after-treatment system [°C or °F].
• Current temperature of the transmission oil [°C or °F].
• Diagnosis gear shift [grades backward and forward].
• Current engine load [% from scale 0-100 %]
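A record covering a small, hypothetical subset of the tractor telemetry listed above might look as follows; the field names, identifiers and alarm thresholds are assumptions for illustration:

```python
from dataclasses import dataclass

@dataclass
class TractorStatus:
    """One telemetry snapshot (illustrative subset of the fields above)."""
    tractor_id: str
    driver_rfid: str        # driver identification by RFID module
    timestamp: str          # date and current time
    fuel_level_l: float     # amount of fuel in the tank [L]
    engine_speed_rpm: int
    engine_load_pct: int    # [% from scale 0-100]

def engine_speed_alarm(status, low=800, high=2200):
    """Signal when the engine speed leaves the configured range."""
    return not (low <= status.engine_speed_rpm <= high)

s = TractorStatus("TR-07", "RFID-4411", "2017-06-30T10:00:00",
                  142.5, 2350, 85)
print(engine_speed_alarm(s))  # True: 2350 rpm exceeds the 2200 limit
```

The same pattern extends naturally to the fuel-loss and maintenance-status signals in the list above.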
2.2.4.3 Geospatial data
The DataBio pilots will collect earth observation (EO) data from a number of sources which
will be refined during the project. Currently, it is confirmed that the following EO data will be
collected and used as input data:
Table 3: Geospatial data tools, format and origin
• Sentinel-1, C-SAR; formats: SLC, GRD; origin: Copernicus Open Access Hub
(https://scihub.copernicus.eu/)
• Sentinel-2, MSI; format: L1C; origin: Copernicus Open Access Hub
(https://scihub.copernicus.eu/)
Information about the expected sizes will be added, when the information becomes available.
In addition to EO data, DataBio will utilise other geospatial data from EU, national, local,
private and open repositories including Land Parcel Identification System data, cadastral data,
Open Land Use map (http://sdi4apps.eu/open_land_use/), Urban Atlas and Corine Land
Cover, Proba-V data (www.vito-eodata.be).
The meteo-data will be collected mainly from EO-based systems, drawing on European data
sources such as COPERNICUS products and EUMETSAT H-SAF products; other EO data
sources such as VIIRS, MODIS and ASTER will also be considered. As complementary data
sources, the weather forecast models output (ECMWF) and the regional weather services
output, usually based on ground weather stations, can be considered according to the
specific target areas of the pilots.
2.2.4.4 Genomics data
Within the DataBio Pilot 1.1.2, different data will be collected and produced. Three categories
of data have already been identified for the pilot, namely: a) in-situ sensor (including image
capture) and farm data, b) genomic data from plant breeding efforts in greenhouses,
produced using Next Generation Sequencers (NGS), and c) biochemical data of tomato fruits
produced by chromatographs (LC/MS/MS, GS/MS, HPLC).
In-situ sensors/Environmental outdoor: Wind speed and direction, Evaporation, Rain, Light
intensity, UVA, UVB.
In-situ sensors/Environmental indoor: Air temperature, Air relative humidity, Crop leaf
temperature (remotely and in contact), Soil/substrate water content, crop type, etc.
Farm Data:
• In-Situ measurements: Soil nutritional status.
• Farm logs (work calendar, technical practices at farm level, irrigation information).
• Farm profile (static farm information, such as size).
Table 4: Genomic, biochemical and metabolomic data tools, description and acquisition
• Genomic data. Mission, instrument: to characterize the genetic diversity of local
tomato varieties used for breeding, to use the genetic-genomic information to guide
the breeding efforts (as a selection tool for higher performance), and to develop a
model to predict the final breeding result, in order to achieve varieties of higher
performance rapidly and with less financial burden. Data will be produced using two
Illumina NGS machines. Data description and acquisition: data produced from
Illumina machines are stored in compressed text files (FASTQ). Data will be produced
from plant biological samples (leaf and fruit); collection will be done at 2 different
plant stages (plantlets and mature plants). Genomic data will be produced using
standard and customized protocols at CERTH. Genomic data, although plain text in
format, are big-volume data and pose challenges in their storage, handling and
processing. Preliminary analysis will be performed using the local HPC computational
facility.
• Biochemical, metabolomic data. Mission, instrument: to characterize the biochemical
profile of fruits from tomato varieties used for breeding. Data will be produced from
different chromatographs and mass spectrometers. Data description and acquisition:
data will be mainly proprietary binary-based archives converted to XML or other open
formats. Data will be acquired from biological samples of tomato fruits.
While genomic data are stored in raw format as files, environmental data, which are
generated using a network of sensors, will be stored in a database along with the time
information and will be processed as time series data.
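FASTQ, the compressed text format mentioned above, stores each read as four lines (identifier, sequence, separator, per-base quality); a minimal reader might look as follows, where the two records are invented examples rather than real sequencing output:

```python
# Two invented FASTQ records (in practice the file would be
# gzip-compressed and opened with gzip.open).
raw = """@read_001
ACGTACGT
+
IIIIHHHG
@read_002
TTGACCTA
+
GGGFFFEE
"""

def parse_fastq(text):
    """Yield (read_id, sequence, quality) tuples from FASTQ text."""
    lines = text.strip().split("\n")
    for i in range(0, len(lines), 4):
        header, seq, _, qual = lines[i:i + 4]
        yield header[1:], seq, qual  # strip the leading '@'

reads = list(parse_fastq(raw))
print(len(reads), reads[0][0])  # 2 read_001
```

Real sequencing runs produce millions of such records per sample, which is why the document notes that these plain-text data are big-volume and need HPC facilities.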
2.3 Historical data
In the context of doing machine learning and predictive and prescriptive analytics it is
important to be able to use historical data for training and validation purposes. Machine
learning algorithms will use existing historical data as training data both for supervised and
unsupervised learning. Information about datasets and the time periods concerned with
historical datasets to be used for DataBio can be found in Appendix A. Historical data can also
serve for testing complex event processing applications: in this case, historical data is injected
as if "happening in real time", thereby allowing the complex event-driven application at hand
to be tested before running it in a real environment.
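Replaying historical events "as if happening in real time" can be sketched as below; the event contents, timing offsets and speed-up factor are illustrative assumptions:

```python
import time

# Hypothetical historical events as (time offset in seconds, payload).
historical_events = [
    (0.0, {"vessel": "V-1", "catch_kg": 120}),
    (0.2, {"vessel": "V-1", "catch_kg": 95}),
    (0.5, {"vessel": "V-2", "catch_kg": 210}),
]

def replay(events, handler, speedup=10.0):
    """Feed events to a handler, preserving (scaled) inter-arrival times."""
    prev = events[0][0]
    for offset, payload in events:
        time.sleep((offset - prev) / speedup)  # compress time for testing
        prev = offset
        handler(payload)

received = []
replay(historical_events, received.append)
print(len(received))  # 3
```

The handler stands in for the event-driven application under test; in production it would be the live CEP engine consuming the real-time stream.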
2.4 Expected data size and velocity
The big data “V” characteristics of Volume and Velocity are described for each of the
identified data sets in the DataBio project, typically with measurements of total historical
volumes and new/additional data per time unit. The DataBio-specific data volumes and
velocities (or ingestion rates) can be found in Appendix A.
2.5 Data beneficiaries
In this section, this document analyses the key data beneficiaries who will benefit from the
use of big data in several fields such as analytics, data sets, business value, sales or marketing.
This section considers both tangible and intangible concepts.
In examining the value of big data, it is necessary to evaluate who is affected by the data and
their usage. In some cases, the individual whose data is processed directly receives a benefit.
Nevertheless, regarding the Data-Driven Bioeconomy, the benefit to the individual can be
considered indirect. In other cases, the relevant individual receives no attributable benefit,
with big data value reaped by business, government, or society at large.
Concerning the general community, the collection and use of an individual’s data benefits not
only that individual, but also members of a proximate class, such as users of a similar product
or residents of a geographical area. In the case of organizations, Big Data analysis often
benefits those organizations that collect and harness the data. Data-driven profits may be
viewed as enhancing allocative efficiency by facilitating the free economy. The emergence,
expansion, and widespread use of innovative products and services at decreasing marginal
costs have revolutionized global economies and societal structures, facilitating access to
technology and knowledge and fomenting social change. With more data, businesses can
optimize distribution methods, efficiently allocate credit, and robustly combat fraud,
benefitting consumers as a whole.
On the other hand, big data analysis can provide a direct benefit to those individuals whose
information is being used. However, the DataBio project is not directly involved in those
specific cases (see Chapter 6 about ethical issues).
Regarding general benefits, big data is creating enormous value for the global economy,
driving innovation, productivity, efficiency, and growth. Data has become the driving force
behind almost every interaction between individuals, businesses, and governments. The uses
of big data can be transformative and are sometimes difficult to anticipate at the time of initial
collection.
This section does not provide a comprehensive taxonomy of big data benefits. It would be
presumptuous to do so, ranking the relative importance of weighty social goals. Rather, it posits
that such benefits must be accounted for by rigorous analysis considering the priorities of a
nation, society, or economy. Only then, can benefits be assessed within an economic
framework.
Besides those general concepts on Big Data Beneficiaries, it is possible to analyse the impact
of DataBio project results regarding the final users of the different technologies, tools and
services to be developed. Using this approach, and taking into account that more detailed
information is available at Deliverables D1.1, D2.1 and D3.1 regarding Agricultural, Forestry
and Fishery pilots definition, the main beneficiaries of big data are described in the following
sections.
2.5.1 Agricultural Sector
One of the proposed agricultural pilots concerns the use of tractor units able to send
information about current operations online to the driver or farmer. The prototypes will be
equipped with units for tracking and tracing (GPS - Global Positioning System or GSM - Global
System for Mobile Communications) and a unit for displaying soil characteristics.
The proposed solution will meet farmers’ requests for cost reduction and improved
productivity, in order to increase their economic benefits while also following sustainable
agriculture practices.
In another case, smart farming services such as irrigation, provided through flexible
mechanisms and UIs (web, mobile and tablet compatible), will promote the adoption of
technological tools (IoT, data analytics) and collaboration with certified professionals to
optimize farm productivity. Therefore, farming cooperatives will again obtain cost reduction
and improved productivity, migrating from standard to sustainable smart-agriculture
practices. In summary, the main beneficiaries of DataBio will be farming cooperatives,
farmers and land owners.
2.5.2 Forestry Sector
Data sharing and a collaborative environment enable improved tools for sustainable forest
management decisions and operations. Forest management services make data accessible for
forest owners, and other end users, and integrate this data for e-contracting, online purchase
and sales of timber and biomass. Higher data volumes and better data accessibility increase
the probability that the data will be updated and maintained.
DataBio WP2 will develop and pilot standardized procedures for collecting and transferring
Big Data based on DataBio WP4 platform from silvicultural activities executed in the forest.
As a summary, the Big Data beneficiaries related to WP2 – Forestry Pilots activities will be:
• Forest owners (private, public, timberland investors)
• Forest authority experts
• Forest companies
• Contractors and service providers
2.5.3 Fishery Sector
Regarding WP3 – Fisheries Pilot, in Pilot A2: Small pelagic fisheries immediate operational
choices, the main users and beneficiaries of this pilot will be the ship owners and masters on
board small pelagic vessels. Modern pelagic vessels are equipped with increasingly complex
machinery systems for propulsion, manoeuvring and power generation. Due to that, the
vessel is always in an operational state, but the configuration of the vessel systems imposes
constraints on operation. The captain is tasked with safe operation of the vessel, while the
efficiency of the vessel systems may be increased if the captain is informed about the actual
operational state, potential for improvement and expected results of available actions.
The goal of the pilot B2: Oceanic tuna fisheries planning is to create tools that aid in trip
planning by presenting historical catch data as well as attempting to forecast where the fish
might be in the near future. The forecast model will be constructed from historical catch data
combined with the data available to the skippers at that moment (oceanographic data, buoy
data, etc.). In this case, the main beneficiaries of the DataBio development will be tuna fishery
companies. In summary, the DataBio WP3 beneficiaries will be a broad range of fisheries
stakeholders, from companies to captains and vessel owners.
2.5.4 Technical Staff
Adoption rates aside, the potential benefits of utilising big data and related technologies are
significant in both scale and scope. They include, for example, better and more targeted
marketing activities, improved business decision-making, cost reduction and operational
efficiencies, enhanced planning and strategic decision-making, increased business agility,
fraud detection, waste reduction and customer retention. Obviously, the ability of firms to
realize business benefits will depend on company characteristics such as size, data
dependency and the nature of their business activity.
A core concern voiced by many of those participating in big data focused studies is the ability
of employers to find and attract the talent needed for both a) the successful implementation
of big data solutions and b) the subsequent realisation of associated business benefits.
Although ‘Data Scientist’ may currently be the most requested profile in big data, the
recruitment of Data Scientists (in volume terms at least) appears relatively low down the wish
list of recruiters. Instead, the openings most commonly arising in the big data field (as is the
case for IT recruitment) are development positions.
2.5.5 ICT sector
2.5.5.1 Developers
The generic title of developer is normally employed together with a detailed description of
the specific technical skills required for the post, and it is this description that defines
the specific type of development activity undertaken. The technical skills most often cited by
recruiters in adverts for big data Developers are: NoSQL (MongoDB in particular), Java, SQL,
JavaScript, MySQL, Linux, Oracle, Hadoop (especially Cassandra), HTML and Spring.
2.5.5.2 Architects
Applicants for these positions are required to hold skills in a range of technical disciplines,
including Oracle (in particular, BI EE), Java, SQL, Hadoop and SQL Server, whilst the main
generic areas of technical knowledge and competence required were Data Modelling, ETL,
Enterprise Architecture, Open Source and Analytics.
2.5.5.3 Analysts
Particular process/methodological skills required from applicants for analyst positions were
primarily in respect of: Data Modelling, ETL, Analytics and Data.
2.5.5.4 Administrators
In general, the technical skills most often requested by employers from big data
Administrators at that time were: Linux, MySQL and Puppet, Hadoop and Oracle, whilst the
process and methodological competences most often requested were in the areas of
Configuration Management, Disaster Recovery, Clustering and ETL.
2.5.5.5 Project Managers
The specific types of Project Manager most often required by big data recruiters are Oracle
Project Managers, Technical Project Managers and Business Intelligence Project Managers.
Aside from Oracle (and in particular BI EE, EBS and EBS R12), which was specified in over two-
thirds of all adverts for big data related Project Management posts, other technical skills often
needed by applicants for this type of position were: Netezza, Business Objects and Hyperion.
Process and methodological skills commonly required included ETL and Agile Software
Development together with a range of more ‘business focused’ skills, i.e. PRINCE2 and
Stakeholder Management.
2.5.5.6 Data Designers
The technical skills most commonly requested for these posts appear to have been Oracle
(particularly BIEE) and SQL, followed by Netezza, SQL Server, MySQL and UNIX.
Common process and methodological skills needed were: ETL, Data Modelling, Analytics, CSS,
Unit Testing, Data Integration and Data Mining, whilst more general knowledge requirements
related to the need for experience and understanding of Business Intelligence, Data
Warehouse, Big Data, Migration and Middleware.
2.5.5.7 Data Scientists
The core technical skills needed to secure a position as a Data Scientist are found to be:
Hadoop, Java, NoSQL and C++. As was the case for other big data positions, adverts for Data
Scientists often made reference to a need for various process and methodological skills and
competences. Interestingly however, in this case, such references were found to be much
more commonplace and (perhaps as would be expected) most often focused upon data
and/or statistical themes, i.e. Statistics, Analytics and Mathematics.
2.5.6 Research and education
Researchers, scientists and academics are one of the largest groups for data reuse. DataBio
data published as open data will be used for further research and for educational purposes
(e.g. thesis).
2.5.7 Policy making bodies
The DataBio data and results will serve as a basis for decision making bodies, especially for
policy evaluation and feedback on policy implementation. This includes mainly the European
Commission, national and regional public authorities.
3 FAIR Data
The FAIR principles ensure that data can be discovered through catalogues or search engines,
is accessible through open interfaces, complies with standards enabling interoperable
processing, and can therefore easily be reused.
3.1 Data findability
3.1.1 Data discoverability and metadata provision
Metadata is, as its name implies, data about data. It describes the properties of a dataset.
Metadata can cover various types of information. Descriptive metadata includes elements
such as the title, abstract, author and keywords, and is mostly used to discover and identify a
dataset. Another type is administrative metadata with elements such as the license,
intellectual property rights, when and how the dataset was created, who has access to it, etc.
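To make the distinction concrete, a minimal metadata record separating descriptive from administrative elements could be sketched as follows; the element names and values are illustrative, not a normative DataBio schema:

```python
import json

# Illustrative metadata record; the element names and values are
# examples only, not the DataBio metadata schema.
record = {
    "descriptive": {           # used to discover and identify the dataset
        "title": "Sentinel-2 NDVI time series, pilot area A",
        "abstract": "Weekly NDVI composites derived from Sentinel-2 imagery.",
        "author": "DataBio pilot partner",
        "keywords": ["agriculture", "NDVI", "remote sensing"],
    },
    "administrative": {        # governs use and stewardship of the dataset
        "license": "CC-BY-4.0",
        "created": "2017-05-15",
        "access": "public",
    },
}

print(json.dumps(record, indent=2))
```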
The datasets on the DataBio Infrastructure are either added locally, by a user, harvested from
existing data portals, or fetched from operational systems or IoT ecosystems. In DataBio, the
definition of a set of metadata elements is necessary in order to allow identification of the
vast amount of information resources managed for which metadata is created, their
classification, identification of their geographic location and temporal reference, quality and
validity, conformity with implementing rules on the interoperability of spatial data sets and
services, constraints related to access and use, and the organization responsible for the
resource.
In addition, metadata elements related to the metadata record itself are also necessary to
monitor that the metadata created are kept up to date, and for identifying the organization
responsible for the creation and maintenance of the metadata. Such minimum set of
metadata elements is also necessary to comply with Directive 2007/2/EC and does not
preclude the possibility for organizations to document the information resources more
extensively with additional elements derived from international standards or working
practices in their community of interest.
Metadata referred to datasets and dataset series (particularly relevant for DataBio will be the
EO products derived from satellite imagery) should adhere to the profile originating from the
INSPIRE Metadata regulation with added theme-specific metadata elements for the
agriculture, forestry and fishery domains if necessary. This approach will ensure that
metadata created for the datasets, dataset series and services will be compliant with the
INSPIRE requirements as well international standards ISO EN 19115 (Geographic Information
– Metadata; with special emphasis in ISO 19115-2:2009 Geographic information -- Metadata
-- Part 2: Extensions for imagery and gridded data), ISO EN 19119 (Geographic Information –
Services), ISO EN 19139 (Geographic Information – Metadata – Metadata XML Schema) and
ISO EN ISO 19156 (Earth Observation Metadata profile of Observations & Measurements).
Besides, INSPIRE conformant metadata may be expressed also through the DCAT Application
Profile1, which defines a minimum set of metadata elements to ensure cross-domain and
cross-border interoperability between metadata schemas used in European data portals. If
adopted by DataBio, such a mapping could support the inclusion of INSPIRE metadata in the
Pan-European Open Data Portal for wider discovery across sectors beyond the geospatial
domain.
In DCAT, a Distribution represents a way in which the data is made available. DCAT is a rather
small vocabulary that deliberately leaves many details open; it welcomes "application
profiles", more specific specifications built on top of DCAT, such as GeoDCAT-AP as its
geospatial extension.
For sensors we will focus on SensorML, which can be used to describe a wide range of
sensors, including both dynamic and stationary platforms and both in-situ and remote
sensors. Another possibility is the Semantic Sensor Network (SSN) Ontology, which describes
sensors and observations and related concepts. It does not describe domain concepts, time,
locations, etc.; these are intended to be included from other ontologies via OWL imports. This
ontology was developed by the W3C Semantic Sensor Networks Incubator Group (SSN-XG).
In DataBio, there is a need for metadata harmonization of the spatial and non-spatial datasets
and services. GeoDCAT-AP was an obvious choice due to the strong focus on geographic
datasets. The main advantage is that it enables users to query all datasets in a uniform way.
GeoDCAT-AP is still very new, and the implementation of the new standard within DataBio
can provide feedback to OGC, W3C and JRC from both a technical and an end-user point of
view.
Several software components available in the DataBio architecture have varying support for
GeoDCAT-AP, namely Micka2, CKAN3 and GeoNetwork4. For DataBio purposes we will also
need to integrate the Semantic Sensor Network Ontology and SensorML.
For enabling compatibility with COPERNICUS, INSPIRE and GEOSS, the DataBio project will
make three extensions: i) a module for extended harvesting of INSPIRE metadata to DCAT,
based on XSLT and easy configuration; ii) a module for user-friendly visualisation of INSPIRE
metadata in CKAN; and iii) a module to output metadata in GeoDCAT-AP or SensorDCAT. We
plan to use the Micka and CKAN systems. Micka is a complex system for metadata
management used for building Spatial Data Infrastructure (SDI) and geoportal solutions. It
contains tools for editing and managing metadata for spatial data, services and other sources
(documents, websites, etc.). CKAN supports DCAT to import or export its datasets. CKAN
enables harvesting data from OGC CSW catalogues, but not all mandatory INSPIRE metadata
elements are supported. Unfortunately, the DCAT output does not fulfil all INSPIRE
requirements, nor is GeoDCAT-AP fully supported.
1 https://joinup.ec.europa.eu/asset/dcat_application_profile/description
2 http://micka.bnhelp.cz/
3 https://ckan.org/
4 http://geonetwork-opensource.org/
An ongoing programme of spatial data infrastructure projects, undertaken with academic and
commercial partners, enables DataBio to contribute to the creation of standard data
specifications and policies. This ensures their databases remain of high quality, compatible
and can interact with one another to deliver data which provides practical and tangible
benefits for European society. The network's mission is to provide and disseminate statistical
information that is objective, independent and of high quality, and available to everybody:
politicians, authorities, businesses and citizens.
3.1.2 Data identification, naming mechanisms and search keyword approaches
For data identification, naming and search keywords we will use INSPIRE data registry. The
INSPIRE infrastructure involves a number of items, which require clear descriptions and the
possibility to be referenced through unique identifiers. Examples for such items include
INSPIRE themes, code lists, application schemas or discovery services. Registers provide a
means to assign identifiers to items and their labels, definitions and descriptions (in different
languages). The INSPIRE Registry is a service giving access to INSPIRE semantic assets (e.g.
application schemas, meta/data codelists, themes), and assigning to each of them a persistent
URI. As such, this service can be considered also as a metadata directory/catalogue for
INSPIRE, as well as a registry for the INSPIRE "terminology". Starting from June 2013, when
the INSPIRE Registry was first published, a number of versions have been released,
implementing new features based on the community's feedback. Recently, a new version of
the INSPIRE Registry has been published which, among other features, also makes its content
available in RDF/XML:
http://inspire.ec.europa.eu/registry/5
The INSPIRE registry provides a central access point to a number of centrally managed INSPIRE
registers6. These include:
● INSPIRE application schema register
● INSPIRE code list register
● INSPIRE enumeration register
● INSPIRE feature concept dictionary
● INSPIRE glossary
● INSPIRE layer register
● INSPIRE media-types register
● INSPIRE metadata code list register
● INSPIRE reference document register
● INSPIRE theme register
5 https://www.rd-alliance.org/group/metadata-ig/post/inspire-registry-rdf-representation-now-
supported.html
6 http://inspire.ec.europa.eu/registry/
Most relevant for naming in metadata is the INSPIRE metadata code list register, which
contains the code lists and their values as defined in the INSPIRE implementing rules on
metadata.7
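Since every registry item is referenced through a persistent URI, composing such identifiers can be sketched as follows. The base URL is that of the INSPIRE registry; the specific code list and value shown are illustrative examples, and a real client should resolve the URI against the registry rather than assume it exists:

```python
# Sketch: composing persistent URIs for INSPIRE registry items.
# The base URL comes from the INSPIRE registry; the code-list and
# value identifiers below are illustrative examples.
REGISTRY_BASE = "http://inspire.ec.europa.eu"

def codelist_value_uri(codelist: str, value: str) -> str:
    """Build the persistent URI of a value in an INSPIRE metadata code list."""
    return f"{REGISTRY_BASE}/metadata-codelist/{codelist}/{value}"

uri = codelist_value_uri("DegreeOfConformity", "conformant")
print(uri)
```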
3.1.3 Data lineage
Data lineage refers to the sources of information, such as entities and processes, involved in
producing or delivering an artifact. Data lineage records the derivation history of a data
product. The history could include the algorithms used, the process steps taken, the
computing environment run, data sources input to the processes, the organization/person
responsible for the product, etc. Provenance provides important information to data users
for them to determine the usability and reliability of the product. In the science domain, the
data provenance is especially important since scientists need to use the information to
determine the scientific validity of a data product and to decide if such a product can be used
as the basis for further scientific analysis. The provenance of information is crucial to making
determinations about whether information is trusted, how to integrate diverse information
sources, and how to give credit to originators when reusing information [REF-02]. In an open
and inclusive environment such as the Web, users find information that is often contradictory
or questionable. Reasoners in the Semantic Web will need explicit representations of
provenance information in order to make trust judgments about the information they use.
With the arrival of massive amounts of Semantic Web data (e.g. via the Linked Open Data
community), information about the origin of that data, i.e. provenance, becomes an
important factor in developing new Semantic Web applications. Therefore, a crucial enabler
of Semantic Web deployment is the explicit representation of provenance information that is
accessible to machines, not just to humans. Data provenance is the information about how
data was derived; together with metadata, it is critical to the ability to interpret a particular
data item.
Provenance is often conflated with metadata and trust. Metadata is used to represent
properties of objects. Many of those properties have to do with provenance, so the two are
often equated. Trust is derived from provenance information, and typically is a subjective
judgment that depends on context and use [REF-03].
W3C PROV Family of Documents defines a model, corresponding serializations and other
supporting definitions to enable the interoperable interchange of provenance information in
heterogeneous environments such as the Web [REF-04]. Current standards include [REF-05]:
PROV-DM: The PROV Data Model [REF-06] - PROV-DM is a core data model for provenance
for building representations of the entities, people and processes involved in producing a
piece of data or thing in the world. PROV-DM is domain-agnostic, but with well-defined
extensibility points allowing further domain-specific and application-specific extensions to be
defined. It is accompanied by PROV-ASN, a technology-independent abstract syntax notation,
which allows serializations of PROV-DM instances to be created for human consumption,
7 http://inspire.ec.europa.eu/metadata-codelist
which facilitates its mapping to concrete syntax, and which is used as the basis for a formal
semantics.
PROV-O: The PROV Ontology [REF-07] - This specification defines the PROV Ontology as the
normative representation of the PROV Data Model using the Web Ontology Language
(OWL2). This document is part of a set of specifications being created to address the issue of
provenance interchange in Web applications.
Constraints of the PROV Data Model [REF-08] - PROV-DM, the PROV data model, is a data
model for provenance that describes the entities, people and activities involved in producing
a piece of data or thing. PROV-DM is structured in six components, dealing with: (1) entities
and activities, and the time at which they were created, used, or ended; (2) agents bearing
responsibility for entities that were generated and activities that happened; (3) derivations of
entities from entities; (4) properties to link entities that refer to a same thing; (5) collections
forming a logical structure for its members; (6) a simple annotation mechanism.
PROV-N: The Provenance Notation [REF-09] - PROV-N is a notation designed to write
instances of the PROV data model in a compact, human-readable textual form. It is used to
illustrate the data model throughout the PROV family of documents and serves as the basis
for a formal semantics of PROV.
Figure 2 [REF-10] shows a generic data lifecycle in the context of a data processing
environment, where data are first discovered by the user with the help of metadata and
provenance catalogues.
Figure 2: The processing data lifecycle
During the data processing phase, data replica information may be entered in replica
catalogues (which contain metadata about the data location), data may be transferred
between storage and execution sites, and software components may be staged to the
execution sites as well. While data are being processed, provenance information can be
automatically captured and then stored in a provenance store. The resulting derived data
products (both intermediate and final) can also be stored in an archive, with metadata about
them stored in a metadata catalogue and location information stored in a replica catalogue.
Data Provenance is also addressed in W3C DCAT Metadata model [REF-11].
dcat:CatalogRecord describes a dataset entry in the catalog. It is used to capture provenance
information about dataset entries in a catalog. This class is optional and not all catalogs will
use it. It exists for catalogs where a distinction is made between metadata about a dataset
and metadata about the dataset's entry in the catalog. For example, the publication date
property of the dataset reflects the date when the information was originally made available
by the publishing agency, while the publication date of the catalog record is the date when
the dataset was added to the catalog. In cases where both dates differ, or where only the
latter is known, the publication date should only be specified for the catalog record. The W3C
PROV Ontology [REF-07] allows describing further provenance information, such as the
details of the process and the agent involved in a particular change to a dataset. A detailed
specification of data provenance is also an additional requirement for the DCAT-AP
specification effort [REF-12].
3.2 Data accessibility
Through DataBio experiments with a large number of tools and technologies identified in WP4
and WP5, a common data access pattern shall be developed. Ideally, this pattern is based on
internationally adopted standards, such as OGC WFS for feature data, OGC WCS for coverage
data, OGC WMS for maps, or OGC SOS for sensor data.
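As an illustration of this standardized access pattern, a WFS GetFeature request is just a key-value encoded HTTP URL; the endpoint and feature type name below are placeholders, not DataBio services:

```python
from urllib.parse import urlencode

# Sketch of the common OGC web-service access pattern: standardized
# key-value requests over HTTP. Endpoint and type name are placeholders.
def wfs_get_feature_url(endpoint: str, type_name: str, max_features: int = 10) -> str:
    """Build an OGC WFS 2.0 GetFeature request URL."""
    params = {
        "service": "WFS",
        "version": "2.0.0",
        "request": "GetFeature",
        "typeNames": type_name,
        "count": max_features,
    }
    return f"{endpoint}?{urlencode(params)}"

url = wfs_get_feature_url("https://example.org/geoserver/wfs", "databio:parcels")
print(url)
```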
3.2.1 Open data and closed data
Everyone from citizens to civil servants, researchers and entrepreneurs can benefit from open
data. In this respect, the aim is to make effective use of Open Data. This data is already
available in public domains and is not within the control of the DataBio project.
All data rests on a scale between closed and open because there are variances in how
information is shared between the two points in the continuum. Closed data might be shared
with specific individuals within a corporate setting. Open data may require attribution to the
contributing source, but still be completely available to the end user.
Generally, open data differs from closed data in three key ways8:
1. Open data is accessible, usually via a data warehouse on the internet.
2. It is available in a readable format.
3. It’s licensed as open source, which allows anyone to use the data or share it for non-
commercial or commercial gain.
Closed data restricts access to the information in several potential ways:
1. It is only available to certain individuals within an organization.
2. The data is patented or proprietary.
3. The data is semi-restricted to certain groups.
4. It is open to the public only through a licence fee or other prerequisite.
5. It is difficult to access, for example paper records that have not been digitized.
Typical examples of closed data are information that requires a security clearance;
health-related information collected by a hospital or insurance carrier; or, on a smaller scale,
your own personal tax returns.
There are also other datasets used for the pilots, e.g. cartography, 3D or land-use data, but
those are stored in databases which are not available through the Open Data portals.
Once the use case specification and requirements have been completed these data may also
be needed for the processing and visualisation within the DataBio applications. However, this
data – in its raw format – may not be made available to external stakeholders for further use
due to licensing and/or privacy issues. Therefore, at this stage, the data management plan
will not cover these datasets.
8 www.opendatasoft.com
3.2.2 Data access mechanisms, software and tools
Data access is the process of entering a database to store or retrieve data. Data access tools
are end-user-oriented tools that allow users to build Structured Query Language (SQL) queries
by pointing and clicking on the list of tables and fields in the data warehouse.
Throughout computing history, different methods and languages have been used for data
access, varying depending on the type of data warehouse. A data warehouse contains a rich
repository of data pertaining to organizational business rules, policies, events and histories.
These warehouses store data in different and incompatible formats, so several data access
tools have been developed to overcome these incompatibilities.
Recent advances in information technology have brought new and innovative software
applications with more standardized languages, formats and methods to serve as interfaces
among different data formats. Some of the more popular standards include SQL, ODBC,
ADO.NET, JDBC, XML, XPath, XQuery and Web Services.
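As a minimal sketch of SQL-based data access — the kind of query such point-and-click tools generate behind the scenes — the following uses an in-memory SQLite database with an illustrative table:

```python
import sqlite3

# Sketch: the kind of SQL a data access tool generates when a user points
# and clicks on tables and fields. Table and column names are illustrative.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE harvest (farm TEXT, crop TEXT, tonnes REAL)")
con.executemany(
    "INSERT INTO harvest VALUES (?, ?, ?)",
    [("A", "wheat", 120.5), ("A", "barley", 80.0), ("B", "wheat", 95.2)],
)

# Selecting fields and filtering corresponds to clicking tables/fields in a tool.
rows = con.execute(
    "SELECT farm, SUM(tonnes) FROM harvest WHERE crop = ? GROUP BY farm ORDER BY farm",
    ("wheat",),
).fetchall()
print(rows)  # [('A', 120.5), ('B', 95.2)]
```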
3.2.3 Big data warehouse architectures and database management systems
Depending on the project needs, there are different possibilities to store data:
3.2.3.1 Relational Database
This is a digital database whose organization is based on the relational model of data. The
various software systems used to maintain relational databases are known as a relational
database management system (RDBMS). Virtually all relational database systems use SQL
(Structured Query Language) as the language for querying and maintaining the database. A
relational database has the important advantage of being easy to extend. After the original
database creation, a new data category can be added without requiring that all existing
applications be modified.
This model organizes data into one or more tables (or "relations") of columns and rows, with
a unique key identifying each row. Rows are also called records or tuples. Generally, each
table/relation represents one "entity type" (such as customer or product); the rows represent
instances of that entity type and the columns represent values attributed to each instance.
The definition of a relational database results in a table of metadata or formal descriptions of
the tables, columns, domains, and constraints.
When creating a relational database, the domain of possible values for a data column can be
defined, together with further constraints that may apply to those values. For example, a
domain of possible customers could allow up to ten customer names, while a constraint on
one table could restrict it to only three of those names.
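The idea of column domains and constraints can be sketched with a CHECK constraint; the schema and the admissible customer names below are illustrative, using SQLite:

```python
import sqlite3

# Sketch of column domains and constraints: the CHECK clause restricts
# the admissible values of a column. Schema and names are illustrative.
con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer TEXT NOT NULL
            CHECK (customer IN ('Alice', 'Bob', 'Carol'))  -- constrained domain
    )
""")
con.execute("INSERT INTO orders (customer) VALUES ('Alice')")  # accepted

try:
    con.execute("INSERT INTO orders (customer) VALUES ('Mallory')")  # rejected
except sqlite3.IntegrityError as e:
    print("rejected:", e)
```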
An example of a relational database management system is the Microsoft SQL Server,
developed by Microsoft. As a database server, it is a software product with the primary
function of storing and retrieving data as requested by other software applications—which
may run either on the same computer or on another computer across a network (including
the Internet). Microsoft makes SQL Server available in multiple editions, with different feature
sets and targeting different users.
PostgreSQL: PostgreSQL, often simply Postgres, is an object-relational
database management system (ORDBMS) with an emphasis on extensibility and standards
compliance. As a database server, its primary functions are to store data securely and return
that data in response to requests from other software applications. It can handle workloads
ranging from small single-machine applications to large Internet-facing applications (or for
data warehousing) with many concurrent users; on macOS Server, PostgreSQL is the default
database. It is also available for Microsoft Windows and Linux.
PostgreSQL is developed by the PostgreSQL Global Development Group, a diverse group of
many companies and individual contributors. It is free and open-source, released under the
terms of the PostgreSQL License, a permissive software license. Furthermore, it is ACID-
compliant and transactional. PostgreSQL provides updatable views and materialized views,
triggers and foreign keys; it supports functions and stored procedures, and offers other
extensibility features.
3.2.3.2 Big Data storage solutions
A NoSQL (originally referring to "non-SQL", "non-relational" or "not only SQL") database
provides a mechanism for storage and retrieval of data which is modeled in means other than
the tabular relations used in relational databases. Such databases have existed since the late
1960s, but did not obtain the "NoSQL" moniker until a surge of popularity in the early twenty-
first century, triggered by the needs of Web 2.0 companies such as Facebook, Google, and
Amazon.com. NoSQL databases are increasingly used in big data and real-time web
applications. NoSQL systems are also sometimes called "Not only SQL" to emphasize that they
may support SQL-like query languages.
Motivations for this approach include: simplicity of design, simpler "horizontal" scaling to
clusters of machines (which is a problem for relational databases), and finer control over
availability. The data structures used by NoSQL databases (e.g. key-value, wide column, graph,
or document) are different from those used by default in relational databases, making some
operations faster in NoSQL. The particular suitability of a given NoSQL database depends on
the problem it must solve. Sometimes the data structures used by NoSQL databases are also
viewed as "more flexible" than relational database tables.
MongoDB: MongoDB (from "humongous") is a free and open-source cross-platform
document-oriented database program. Classified as a NoSQL database program, MongoDB
uses JSON-like documents with schemas. MongoDB is developed by MongoDB Inc. and is
published under a combination of the GNU Affero General Public License and the Apache
License.
MongoDB supports field queries, range queries and regular-expression searches. Queries can
return specific fields of documents and can also include user-defined JavaScript functions. Queries can
also be configured to return a random sample of results of a given size. MongoDB can be used
as a file system with load balancing and data replication features over multiple machines for
storing files. This function, called Grid File System, is included with MongoDB drivers.
MongoDB exposes functions for file manipulation and content to developers. GridFS is used
in plugins for NGINX and lighttpd. GridFS divides a file into parts, or chunks, and stores each
of those chunks as a separate document.
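The chunking idea behind GridFS can be sketched in a few lines of plain Python. This is a conceptual illustration only, not the GridFS implementation; the real default chunk size is on the order of hundreds of kilobytes, while 8 bytes are used here so the behaviour is easy to see:

```python
# Sketch of the GridFS idea: split a byte stream into fixed-size chunks,
# each stored as its own "document" carrying the file id and chunk index.

def split_into_chunks(file_id: str, data: bytes, chunk_size: int = 8):
    """Return one dict per chunk, mimicking GridFS's chunk documents."""
    return [
        {"files_id": file_id, "n": i, "data": data[off:off + chunk_size]}
        for i, off in enumerate(range(0, len(data), chunk_size))
    ]

def reassemble(chunks):
    """Rebuild the original byte stream by ordering chunks on their index."""
    return b"".join(c["data"] for c in sorted(chunks, key=lambda c: c["n"]))

payload = b"example payload stored via chunking"
chunks = split_into_chunks("file-1", payload)
assert reassemble(chunks) == payload
```

Because each chunk is an ordinary document, the database's replication and load-balancing mechanisms apply to file content just as they do to any other data.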
GeoRocket, developed by Fraunhofer IGD, is based on MongoDB (but not restricted to it). It provides high-performance data storage and is schema-agnostic and format-preserving. For more information, please refer to D4.1, which describes the components applied in the DataBio project.
3.3 Data interoperability
Data can be made available in many different formats implementing different information
models. The heterogeneity of these models reduces the level of interoperability that can be
achieved. In principle, the combination of a standardized data access interface, a standardized transport protocol, and a standardized data model ensures seamless integration of data across platforms, tools, domains, and communities.
When the amount of data grows, mechanisms have to be explored to ensure interoperability
while handling large volumes of data. Currently, the amount of data can still be handled using
OGC models and data exchange services. We will need to review this element during the
course of the project. For now, data interoperability is envisioned to be ensured through
compliance with internationally adopted standards.
Ultimately, interoperability takes different forms ("phenotypes") when applied in various disciplinary settings. The following figure illustrates that concept (source: Wyborn 2017).
Figure 3: The "disciplinary data integration platform": where do you sit? (source: Wyborn)
The intra-disciplinary type remains within a single discipline. The level of standardization needs to cover that discipline's needs, but little attention is usually paid to cross-discipline standards. The multi-disciplinary situation has many people from different domains working together, but in the end they all remain within their silos and data exchange is limited to the bare minimum.
The cross-disciplinary setting is what we are experiencing at the beginning of DataBio. All
disciplines are interfacing and reformatting their data to make it fit. The model works as long
as data exchange is minor, but does not scale, as it requires bilateral agreements between
various parties. The interdisciplinary approach is the one targeted in DataBio. The goal here is to adhere to a minimum set of standards; ideally, the specific characteristics are standardized between all partners upfront. This model adds minimal overhead for all parties, as only a single mapping needs to be implemented per party (or, even better, the new model is used natively from then on). The transdisciplinary approach starts with data already provided as linked data, with links across the various disciplines, well-defined vocabularies, and a set of mapping rules to ensure usability of data generated in arbitrary disciplines.
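The interdisciplinary model described above can be sketched as a single mapping per party onto a shared minimal model, avoiding N-squared bilateral conversions. All field names in the following Python sketch are hypothetical and not taken from any actual DataBio pilot:

```python
# Each partner supplies exactly one mapping from its native field names to a
# shared minimal model; partners then exchange data only in the shared form.
# All field names are illustrative.

SHARED_FIELDS = {"lat", "lon", "observed_at", "value"}

# One mapping table per party (native name -> shared name).
PARTY_MAPPINGS = {
    "agri-pilot": {"latitude": "lat", "longitude": "lon",
                   "timestamp": "observed_at", "yield_t_ha": "value"},
    "forestry-pilot": {"y": "lat", "x": "lon",
                       "obs_time": "observed_at", "stem_count": "value"},
}

def to_shared(party: str, record: dict) -> dict:
    """Translate one party's native record into the shared model."""
    mapping = PARTY_MAPPINGS[party]
    shared = {mapping[k]: v for k, v in record.items() if k in mapping}
    missing = SHARED_FIELDS - shared.keys()
    if missing:
        raise ValueError(f"{party} record missing shared fields: {missing}")
    return shared

rec = {"y": 61.5, "x": 23.8, "obs_time": "2017-06-30", "stem_count": 412}
assert to_shared("forestry-pilot", rec)["lat"] == 61.5
```

With this arrangement, adding a new partner means writing one new mapping table rather than negotiating a bilateral agreement with every existing partner.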
3.3.1 Interoperability mechanisms
Key to interoperable data exchange are standardized interfaces. Currently, the number of data processing and exchange tools is very large. We expect a consolidation in the number of tools during the first 15 months of the project. We will regularly review the requirements set by the various pilots and the data sets made available, to ensure that proper recommendations can be given at any time.
3.3.2 Inter-discipline interoperability and ontologies
A key element of interoperability within and across disciplines is shared semantics, but the Semantic Web is still in its infancy and it is not clear to what extent it will become widely accepted within data-intensive communities in the near future. It requires graph structures for data and/or metadata, well-defined vocabularies and ontologies, and lacks both the
  • 1. This document is part of a project that has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 732064. It is the property of the DataBio consortium and shall not be distributed or reproduced without the formal approval of the DataBio Management Committee. Project Acronym: DataBio Grant Agreement number: 732064 (H2020-ICT-2016-1 – Innovation Action) Project Full Title: Data-Driven Bioeconomy Project Coordinator: INTRASOFT International DELIVERABLE D6.2 – Data Management Plan Dissemination level PU -Public Type of Document Report Contractual date of delivery M06 – 30/6/2017 Deliverable Leader CREA Status - version, date Final – v1.0, 30/6/2017 WP / Task responsible WP6 Keywords: Data management plan, big data, bioeconomy
requests for data to the database, rather than address requests individually, (3) preserved data have the potential to lead to new, unanticipated discoveries, (4) preserved data prevent duplication of scientific studies that have already been conducted, and (5) archiving data insures against loss by the data collector.

The main issues addressed in this deliverable include: (1) the purpose of data collection, (2) data type, format, size, velocity, beneficiaries, and provenance, (3) use of historical data, (4) making data FAIR, (5) data management support, (6) data security, and (7) ethical aspects.

Big data is a new paradigm and is driving change in businesses and other organizations.
A few entities in the EU are starting to manage the massive data sets and non-traditional data structures that are typical of big data, and/or are managing big data by extending their data management skills and their portfolios of data management software. Big data management empowers those entities to efficiently automate business operations, operate closer to real time and, through analytics, add value and learn valuable new facts about business operations, customers, partners, etc.

Within the DataBio framework, big data management (BDM) is a mixture of conventional and new best practices, skills, teams, data types, and in-house grown or vendor-built functionality. All of these are being realigned under the DataBio platform, built upon the partners' own experiences and tools. It is anticipated that DataBio will provide a solution which assumes that datasets will be distributed among different infrastructures and that their accessibility could be complex, requiring mechanisms that facilitate data retrieval, processing, manipulation and visualization as seamlessly as possible. The infrastructure will open new possibilities for the ICT sector, including SMEs, to develop new Bioeconomy 4.0 solutions, and will also open new possibilities for companies from the Earth Observation sector.

Some partners have scaled up pre-existing applications and databases to handle burgeoning volumes of relational big data, or they have acquired new data management platforms that
are purpose-built for managing and analyzing multi-structured big data, including streaming big data. Others are evaluating big data platforms amid a brisk market of vendor products and services for managing and harnessing big data. The Hadoop Distributed File System (HDFS), MapReduce, various Hadoop tools, complex event processing (for streaming big data), NoSQL databases (for schema-free big data), in-memory databases (for real-time analytic processing of big data), private clouds, in-database analytics, and grid computing will be some of the software products implemented within the DataBio framework.

During the lifecycle of the DataBio project, big data will be collected, that is, very large data sets (multi-terabyte or larger) consisting of a wide range of data types (relational, text, multi-structured data, etc.) from numerous sources. Most data will come from farm and forestry machinery, fishing vessels, remote and proximal sensors and imagery, and many other technologies. DataBio is purposefully collecting big data, specifically:

• Forestry: Big Data methods are expected to both increase the value of the forests and decrease costs, within sustainability limits set by natural growth and ecological aspects. The key technology is to gather ever more accurate information about the trees from a host of sensors, including the new generation of satellites, UAV images, laser scanning, mobile devices through crowdsourcing, and machines operating in the forests.

• Agriculture: Big Data in agriculture is currently a hot topic. DataBio aims at building a European vision of Big Data for agriculture. This vision is to offer solutions that increase the role of Big Data in agri-food chains in Europe and to prepare recommendations for future big data development in Europe.
• Fisheries: the ambition of this project is to herald and promote the use of Big Data analytical tools within fisheries applications by initiating several pilots which will demonstrate the benefits of using Big Data analytically for fisheries, such as improved analysis of operational data, tools for planning and operational choices, and crowdsourcing methods for fish stock estimation.

This is the first version of the DataBio DMP; it will be updated over the course of the project as warranted by significant changes arising during the project implementation and the requirements of the project consortium. At least two updates will be prepared, in Months 18 and 36 of the project.
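Among the candidate technologies listed above, MapReduce is the programming model behind much of the Hadoop stack. As a purely illustrative sketch (not DataBio code; the tiny corpus below is invented), its map, shuffle and reduce phases can be mimicked in a few lines of Python:

```python
from collections import defaultdict
from itertools import chain

# Toy corpus standing in for two distributed input splits (hypothetical data).
splits = [
    "soil moisture sensor reading",
    "satellite image soil reading",
]

def map_phase(split):
    # Map: emit a (key, 1) pair for every word in one input split.
    return [(word, 1) for word in split.split()]

def reduce_phase(pairs):
    # Shuffle + reduce: group pairs by key and sum the counts per key.
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

counts = reduce_phase(chain.from_iterable(map_phase(s) for s in splits))
```

In a real Hadoop deployment the splits would live in HDFS and the map and reduce tasks would run in parallel across the cluster; the dataflow, however, is the same.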
Deliverable Leader: Ephrem Habyarimana (CREA)

Contributors: Jaroslav Šmejkal (ZETOR), Tomas Mildorf (UWB), Bernard Stevenot (SPACEBEL), Irene Matzakou (INTRASOFT), Ingo Simonis (OGCE), Christian Zinke (INFAI), Karel Charvat (LESPRO)

Reviewers: Kyrill Meyer (INFAI), Tomas Mildorf (UWB), Erwin Goor (VITO), Fabiana Fournier (IBM), Marco Folegani (MEEO)

Approved by: Athanasios Poulakidas (INTRASOFT)

Document History
Version  Date        Contributor(s)                          Description
0.1.1-2  12/05/2017  Ephrem Habyarimana                      TOC
0.1.3    22/05/2017  Ephrem Habyarimana                      Reviewed TOC, first assignments
0.2      30/05/2017  Tomas Mildorf                           Section 4.1 FAIR data costs
0.3      05/06/2017  Bernard Stevenot                        Section 6 Ethical issues
0.4      09/06/2017  Irene Matzakou, Athanasios Poulakidas   Sections 5.4-5.5 Privacy and sensitive data management
0.5.1    21/06/2017  Ingo Simonis                            Sections 3.3 and 3.4 added
0.5.2    22/06/2017  Christian Zinke, Jaroslav Šmejkal       Sections 2.2.4.4 Machine-generated data and 4.2 added
0.6      23/06/2017  Ephrem Habyarimana                      Added: Executive Summary, Sections 1.2 & 2.1, and Chapter 7
0.7      27/06/2017  Ephrem Habyarimana                      Added Section 1.3 and made edits throughout the document
0.8      28/06/2017  Tomas Mildorf                           Update of Sections 2.2.4.3, 2.5.4, 2.5.5, 3.1.3 and 4.1
0.9      30/06/2017  Ephrem Habyarimana                      Included all tables for currently described DataBio datasets; overall edit of entire document
1.0      30/06/2017  Athanasios Poulakidas                   Compliance to submission format and minor changes
Table of Contents

EXECUTIVE SUMMARY ..... 2
TABLE OF CONTENTS ..... 6
TABLE OF FIGURES ..... 8
LIST OF TABLES ..... 8
DEFINITIONS, ACRONYMS AND ABBREVIATIONS ..... 9
INTRODUCTION ..... 10
1.1 PROJECT SUMMARY ..... 10
1.2 DOCUMENT SCOPE ..... 13
1.3 DOCUMENT STRUCTURE ..... 14
DATA SUMMARY ..... 15
2.1 PURPOSE OF DATA COLLECTION ..... 15
2.2 DATA TYPES AND FORMATS ..... 17
2.2.1 Structured data ..... 17
2.2.2 Semi-structured data ..... 17
2.2.3 Unstructured data ..... 19
2.2.4 New generation big data ..... 19
2.3 HISTORICAL DATA ..... 25
2.4 EXPECTED DATA SIZE AND VELOCITY ..... 26
2.5 DATA BENEFICIARIES ..... 26
2.5.1 Agricultural Sector ..... 27
2.5.2 Forestry Sector ..... 27
2.5.3 Fishery Sector ..... 28
2.5.4 Technical Staff ..... 28
2.5.5 ICT sector ..... 28
2.5.6 Research and education ..... 30
2.5.7 Policy making bodies ..... 30
FAIR DATA ..... 31
3.1 DATA FINDABILITY ..... 31
3.1.1 Data discoverability and metadata provision ..... 31
3.1.2 Data identification, naming mechanisms and search keyword approaches ..... 33
3.1.3 Data lineage ..... 34
3.2 DATA ACCESSIBILITY ..... 37
3.2.1 Open data and closed data ..... 37
3.2.2 Data access mechanisms, software and tools ..... 38
3.2.3 Big data warehouse architectures and database management systems ..... 38
3.3 DATA INTEROPERABILITY ..... 40
3.3.1 Interoperability mechanisms ..... 41
3.3.2 Inter-discipline interoperability and ontologies ..... 41
3.4 PROMOTING DATA REUSE ..... 42
DATA MANAGEMENT SUPPORT ..... 43
4.1 FAIR DATA COSTS ..... 43
4.2 BIG DATA MANAGERS ..... 43
4.2.1 Project manager ..... 43
4.2.2 Business Analysts ..... 44
4.2.3 Data Scientists ..... 44
4.2.4 Data Engineer / Architect ..... 44
4.2.5 Platform architects ..... 44
4.2.6 IT/Operation manager ..... 44
4.2.7 Consultant ..... 45
4.2.8 Business User ..... 45
4.2.9 Pilot experts ..... 45
DATA SECURITY ..... 46
5.1 INTRODUCTION ..... 46
5.2 DATA RECOVERY ..... 47
5.3 PRIVACY AND SENSITIVE DATA MANAGEMENT ..... 48
5.3.1 Introduction ..... 48
5.3.2 Enterprise Data (commercial sensitive data) ..... 48
5.3.3 Personal Data ..... 49
5.4 GENERAL PRIVACY CONCERNS ..... 50
ETHICAL ISSUES ..... 51
CONCLUSIONS ..... 52
REFERENCES ..... 54
APPENDIX A DATABIO DATASETS ..... 55
A.1 SMART POI DATA SET (UWB - D03.01) ..... 56
A.2 OPEN TRANSPORT MAP (UWB - D03.02) ..... 58
A.3 SENTINELS SCIENTIFIC HUB DATASETS VIA FEDEO GATEWAY (SPACEBEL - D07.01) ..... 60
A.4 NASA CMR LANDSAT DATASETS VIA FEDEO GATEWAY (SPACEBEL - D07.02) ..... 61
A.5 OPEN LAND USE (LESPRO - D02.01) ..... 62
A.6 FOREST RESOURCE DATA (METSAK - D18.01) ..... 64
A.7 CUSTOMER AND FOREST ESTATE DATA (METSAK - D18.02) ..... 65
A.8 STORM DAMAGE OBSERVATIONS AND POSSIBLE RISK AREAS (METSAK - D18.03) ..... 67
A.9 QUALITY CONTROL DATA (METSAK - D18.04) ..... 68
A.10 ONTOLOGY FOR (PRECISION) AGRICULTURE (PSNC - D09.01) ..... 69
A.11 WUUDIS DATA (MHGS - D20.01) ..... 71
A.12 SIGPAC (TRAGSA - D11.05) ..... 72
A.13 FIELD DATA - PILOT B2 (TRAGSA - D11.07) ..... 74
A.14 IACS (NP - D13.01) ..... 75
A.15 SENTINEL DATA ..... 76
A.16 TREE SPECIES MAP (FMI - D14.03) ..... 76
A.17 STAND AGE MAP (FMI - D14.04) ..... 77
A.18 CANOPY HEIGHT MAP (FMI - D14.05) ..... 78
A.19 LEAF AREA INDEX (FMI - D14.06) ..... 79
A.20 FOREST DAMAGE (FMI - D14.07) ..... 80
A.21 HYPERSPECTRAL IMAGE ORTHOMOSAIC (SENOP - D44.02) ..... 81
A.22 GAIATRONS IOT (DS13.01) ..... 81
A.23 PHENOMICS, METABOLOMICS, GENOMICS AND ENVIRONMENTAL DATASETS (CERTH - DS40.01) ..... 82
Table of Figures

Figure 1: DataBio's analytics and big data value approach ..... 16
Figure 2: The processing data lifecycle ..... 36
Figure 3: The disciplinary data integration platform: where do you sit? (Source: Wyborn) ..... 41
Figure 4: DataBio's data managers ..... 45
Figure 5: Data lifecycle ..... 46
Figure 6: The data model of Smart Points of Interest ..... 58
Figure 7: The data model of Open Transport Map ..... 60
Figure 8: FedEO Client (C07.05) ..... 61

List of Tables

Table 1: The DataBio consortium partners ..... 10
Table 2: Sensor data tools, resolution and spatial density ..... 20
Table 3: Geospatial data tools, format and origin ..... 24
Table 4: Genomic, biochemical and metabolomic data tools, description and acquisition ..... 25
Definitions, Acronyms and Abbreviations

Acronym/Abbreviation  Title
BDVA   Big Data Value Association
EC     European Commission
EO     Earth Observation
ETL    Extract Transform Load
DMP    Data Management Plan
GSM    Global System for Mobile
GPS    Global Positioning System
FAIR   Findable, Accessible, Interoperable and Reusable
HDFS   Hadoop Distributed File System
ICT    Information and Communications Technology
IoT    Internet of Things
JDBC   Java DataBase Connectivity
JSON   JavaScript Object Notation
NoSQL  Not Only SQL
ODBC   Open Database Connectivity
OEM    Object Exchange Model
OGC    Open Geospatial Consortium
REST   Representational State Transfer
RFID   Radio-Frequency IDentification
RPAS   Remotely Piloted Aircraft Systems
SME    Small-Medium Enterprise
SOAP   Simple Object Access Protocol
SQL    Structured Query Language
UAV    Unmanned Air Vehicle
UI     User Interface
WP     Work Package
XML    eXtensible Markup Language
Introduction

1.1 Project Summary

The data-intensive target sector on which the DataBio project focuses is the Data-Driven Bioeconomy. DataBio focuses on utilizing Big Data to contribute to the production of the best possible raw materials from agriculture, forestry and fishery (aquaculture) for the bioeconomy industry, as well as their further processing into food, energy and biomaterials, while taking into account various accountability and sustainability issues.

DataBio will deploy state-of-the-art big data technologies and existing partners' infrastructure and solutions, linked together through the DataBio Platform. These will aggregate Big Data from the three identified sectors (agriculture, forestry and fishery), intelligently process them and allow the three sectors to selectively utilize numerous platform components, according to their requirements. The execution will be through continuous cooperation of end user and technology provider companies, bioeconomy and technology research institutes, and stakeholders from the big data value PPP programme.

DataBio is driven by the development, use and evaluation of a large number of pilots in the three identified sectors, where associated partners and additional stakeholders are also involved. The selected pilot concepts will be transformed to pilot implementations utilizing co-innovative methods and tools. The pilots select and utilize the best suitable market-ready or almost market-ready ICT, Big Data and Earth Observation methods, technologies, tools and services, to be integrated into the common DataBio Platform. Based on the pilot results and the new DataBio Platform, new solutions and new business opportunities are expected to emerge.
DataBio will organize a series of trainings and hackathons to support its uptake and to enable developers outside the consortium to design and develop new tools, services and applications based on and for the DataBio Platform.

The DataBio consortium is listed in Table 1. For more information about the project see [REF-01].

Table 1: The DataBio consortium partners

Number  Name  Short name  Country
1 (CO)  INTRASOFT INTERNATIONAL SA  INTRASOFT  Belgium
2  LESPROJEKT SLUZBY SRO  LESPRO  Czech Republic
3  ZAPADOCESKA UNIVERZITA V PLZNI  UWB  Czech Republic
4  FRAUNHOFER GESELLSCHAFT ZUR FOERDERUNG DER ANGEWANDTEN FORSCHUNG E.V.  Fraunhofer  Germany
5  ATOS SPAIN SA  ATOS  Spain
6  STIFTELSEN SINTEF  SINTEF ICT  Norway
7  SPACEBEL SA  SPACEBEL  Belgium
8  VLAAMSE INSTELLING VOOR TECHNOLOGISCH ONDERZOEK N.V.  VITO  Belgium
9  INSTYTUT CHEMII BIOORGANICZNEJ POLSKIEJ AKADEMII NAUK  PSNC  Poland
10  CIAOTECH Srl  CiaoT  Italy
11  EMPRESA DE TRANSFORMACION AGRARIA SA  TRAGSA  Spain
12  INSTITUT FUR ANGEWANDTE INFORMATIK (INFAI) EV  INFAI  Germany
13  NEUROPUBLIC AE PLIROFORIKIS & EPIKOINONION  NP  Greece
14  Ústav pro hospodářskou úpravu lesů Brandýs nad Labem  UHUL FMI  Czech Republic
15  INNOVATION ENGINEERING SRL  InnoE  Italy
16  Teknologian tutkimuskeskus VTT Oy  VTT  Finland
17  SINTEF FISKERI OG HAVBRUK AS  SINTEF Fishery  Norway
18  SUOMEN METSAKESKUS-FINLANDS SKOGSCENTRAL  METSAK  Finland
19  IBM ISRAEL - SCIENCE AND TECHNOLOGY LTD  IBM  Israel
20  MHG SYSTEMS OY - MHGS  MHGS  Finland
21  NB ADVIES BV  NB Advies  Netherlands
22  CONSIGLIO PER LA RICERCA IN AGRICOLTURA E L'ANALISI DELL'ECONOMIA AGRARIA  CREA  Italy
23  FUNDACION AZTI - AZTI FUNDAZIOA  AZTI  Spain
24  KINGS BAY AS  KingsBay  Norway
25  EROS AS  Eros  Norway
26  ERVIK & SAEVIK AS  ESAS  Norway
27  LIEGRUPPEN FISKERI AS  LiegFi  Norway
28  E-GEOS SPA  e-geos  Italy
29  DANMARKS TEKNISKE UNIVERSITET  DTU  Denmark
30  FEDERUNACOMA SRL UNIPERSONALE  Federu  Italy
31  CSEM CENTRE SUISSE D'ELECTRONIQUE ET DE MICROTECHNIQUE SA - RECHERCHE ET DEVELOPPEMENT  CSEM  Switzerland
32  UNIVERSITAET ST. GALLEN  UStG  Switzerland
33  NORGES SILDESALGSLAG SA  Sildes  Norway
34  EXUS SOFTWARE LTD  EXUS  United Kingdom
35  CYBERNETICA AS  CYBER  Estonia
36  GAIA EPICHEIREIN ANONYMI ETAIREIA PSIFIAKON YPIRESION  GAIA  Greece
37  SOFTEAM  Softeam  France
38  FUNDACION CITOLIVA, CENTRO DE INNOVACION Y TECNOLOGIA DEL OLIVAR Y DEL ACEITE  CITOLIVA  Spain
39  TERRASIGNA SRL  TerraS  Romania
40  ETHNIKO KENTRO EREVNAS KAI TECHNOLOGIKIS ANAPTYXIS  CERTH  Greece
41  METEOROLOGICAL AND ENVIRONMENTAL EARTH OBSERVATION SRL  MEEO  Italy
42  ECHEBASTAR FLEET SOCIEDAD LIMITADA  ECHEBF  Spain
43  NOVAMONT SPA  Novam  Italy
44  SENOP OY  Senop  Finland
45  UNIVERSIDAD DEL PAIS VASCO/ EUSKAL HERRIKO UNIBERTSITATEA  EHU/UPV  Spain
46  OPEN GEOSPATIAL CONSORTIUM (EUROPE) LIMITED LBG  OGCE  United Kingdom
47  ZETOR TRACTORS AS  ZETOR  Czech Republic
48  COOPERATIVA AGRICOLA CESENATE SOCIETA COOPERATIVA AGRICOLA  CAC  Italy

1.2 Document Scope

This document outlines DataBio's data management plan (DMP), formally documenting how data will be handled both during the implementation and upon natural termination of the project. Many DMP aspects will be considered, including metadata generation, data preservation, data security and ethics, in keeping with the FAIR (Findable, Accessible, Interoperable, Re-usable) data principles.

DataBio, the Data-Driven Bioeconomy project, is a big data intensive innovation action involving a public-private partnership to promote the productivity of EU companies in three of the major bioeconomy sectors, namely agriculture, forestry and fishery. Experience from the US shows that the bioeconomy can get a significant boost from Big Data. In Europe, this sector has until now attracted few large ICT vendors. A central goal of DataBio is to increase the participation of the European ICT industry in the development of Big Data systems for boosting the lagging bioeconomy productivity.

As a good case in point, European agriculture, forestry and fishery can benefit greatly from the European Copernicus space programme, which has currently launched its third Sentinel satellite, telemetry IoT, UAVs, etc. Farm and forestry machinery and fishing vessels in use today collect large quantities of data at an unprecedented rate. Remote and proximal sensors and imagery, and many other technologies, are all working together to give details about crop and soil properties, the marine environment, weeds and pests, sunlight and shade, and many other variables relevant to primary production. Deploying big data analytics on these data can help farmers, foresters and fishers to adjust and improve the productivity of their business operations.
On the other hand, large data sets such as those coming from the Copernicus earth monitoring infrastructure are increasingly available at different levels of granularity, but they are heterogeneous, at times also unstructured, hard to analyze and distributed across various sectors and different providers. It is here that the data management plan comes in. It is anticipated that DataBio will provide a solution which assumes that datasets will be distributed among different infrastructures and that their accessibility could be complex, requiring mechanisms that facilitate data retrieval, processing, manipulation and visualization as seamlessly as possible. The infrastructure will open new possibilities for the ICT sector, including SMEs, to develop new Bioeconomy 4.0 solutions, and will also open new possibilities for companies from the Earth Observation sector.

This DMP will be updated over the course of the DataBio project whenever significant changes arise. The updates of this document will provide increasing depth on DataBio's DMP strategies, with particular interest in the aspects of findability, accessibility, interoperability
and reusability of the Big Data the project produces. At least two updates will be prepared, in Month 18 and Month 36 of the project.

1.3 Document Structure

This document comprises the following chapters:

Chapter 1 presents an introduction to the project and the document.
Chapter 2 presents the data summary, including the purpose of data collection, data size, type and format, historical data reuse and data beneficiaries.
Chapter 3 outlines DataBio's FAIR data strategies.
Chapter 4 describes data management support.
Chapter 5 describes data security.
Chapter 6 describes ethical issues.
Chapter 7 presents the concluding remarks.
Appendix A presents the managed data sets.
Data Summary

2.1 Purpose of data collection

During the lifecycle of the DataBio project, big data will be collected, that is, very large data sets (multi-terabyte or larger) consisting of a wide range of data types (relational, text, multi-structured data, etc.) from numerous sources, including relatively new-generation big data (machines, sensors, genomics, etc.). The ultimate purpose of data collection is to use the data as a source of information in the implementation of the variety of big data analytics algorithms, services and applications DataBio will deploy to create value, new business facts and insights, with a particular focus on the bioeconomy industry. The big datasets are part of the building blocks of DataBio's big data technology platform (Figure 1), which was designed to help European companies increase productivity. Big data experts provide common analytic technology support for the main common and typical bioeconomy applications/analytics that are now emerging through the pilots in the project.

Data from the past will be managed and analyzed, covering many different kinds of data sources: descriptive analytics and classical query/reporting (in need of variety management, i.e. handling and analysis of all of the data from the past, including performance, transactional, attitudinal, descriptive, behavioural, location-related and interactional data from many different sources). Big data from the present will be harnessed in the process of monitoring and real-time analytics pilot services (in need of velocity processing, i.e. handling of real-time data from the present), triggering alarms, actuators, etc.
Harnessing big data for the future includes forecasting, prediction and recommendation analytics pilot services (in need of volume processing, i.e. processing of large amounts of data, combining knowledge from the past and present, and from models, to provide insight for the future).
Figure 1: DataBio's analytics and big data value approach

Specifically:
• Forestry: Big data methods are expected to make it possible both to increase the value of the forests and to decrease costs, within the sustainability limits set by natural growth and ecological aspects. The key technology is gathering more, and more accurate, information about the trees from a host of sensors, including a new generation of satellites, UAV images, laser scanning, mobile devices through crowdsourcing, and machines operating in the forests.
• Agriculture: Big data in agriculture is currently a hot topic. The DataBio intention is to build a European vision of big data for agriculture. This vision is to offer solutions which will increase the role of big data in agri-food chains in Europe: a perspective which will prepare recommendations for future big data development in Europe.
• Fisheries: The ambition is to herald and promote the use of big data analytical tools within fisheries applications by initiating several pilots which will demonstrate the benefits of using big data in an analytical way for fisheries, such as improved analysis of operational data, tools for planning and operational choices, and crowdsourcing methods for fish stock estimation.
• The use of big data analytics will bring about innovation. It will generate significant economic value, extend the relevant market sectors, and herald novel business/organizational models. The cross-cutting character of geospatial big data solutions allows the straightforward extension of the scope of applications beyond the bioeconomy sectors. Such extensions of the market for big data technologies are foreseen in economic sectors such as urban planning, water quality, public safety (incl. technological and natural hazards), protection of critical infrastructures, and waste management. On the other hand, big data technologies revolutionize the business approach in the geospatial market and foster the emergence of innovative business/organizational models; indeed, to achieve cost-effective services to customers, it is necessary to organize the offer to the market on a territorial/local basis, as the users share the same geospatial sources of data and are best served by local players (service providers). This can be illustrated by a network of European service providers developing proximity relationships with their customers and sharing their knowledge through the network.

2.2 Data types and formats

The DataBio-specific data types, formats and sources are listed in detail in Appendix A; key features of the data used in the project are described below.

2.2.1 Structured data

Structured data refers to any data that resides in a fixed field within a record or file. This includes data contained in relational databases and spreadsheets, and data in the form of events, such as sensor data. Structured data first depends on creating a data model: a model of the types of business data that will be recorded and how they will be stored, processed and accessed.
This includes defining what fields of data will be stored and how that data will be stored: the data type (numeric, currency, alphabetic, name, date, address) and any restrictions on the data input (number of characters; restriction to certain terms such as Mr., Ms. or Dr.; M or F).

2.2.2 Semi-structured data

Semi-structured data is a cross between structured and unstructured data. It is a type of structured data, but lacks the strict structure of a data model. With semi-structured data, tags or other types of markers are used to identify certain elements within the data, but the data doesn't have a rigid structure. For example, word processing software can now include metadata showing the author's name and the creation date, with the bulk of the document just being unstructured text. Emails have the sender, recipient, date, time and other fixed fields added to the unstructured data of the email message content and any attachments. Photos and other graphics can be tagged with keywords such as the creator, date and location, making it possible to organize and locate graphics. XML and other markup languages are often used to manage semi-structured data. Semi-structured data is therefore
a form of structured data that does not conform to the formal structure of data models associated with relational databases or other forms of data tables, but nonetheless contains tags or other markers to separate semantic elements and enforce hierarchies of records and fields within the data. It is therefore also known as having a self-describing structure. In semi-structured data, entities belonging to the same class may have different attributes even though they are grouped together, and the order of the attributes is not important. Semi-structured data has become increasingly common since the advent of the Internet, where full-text documents and databases are no longer the only forms of data and different applications need a medium for exchanging information. Semi-structured data is often found in object-oriented databases.

XML and other markup languages, email, and EDI are all forms of semi-structured data. OEM (Object Exchange Model) was created prior to XML as a means of self-describing a data structure. XML has been popularized by web services that are developed utilizing SOAP principles. Some types of data described here as "semi-structured", especially XML, suffer from the impression that they are incapable of structural rigor at the same functional level as relational tables and rows. Indeed, the view of XML as inherently semi-structured (previously, it was referred to as "unstructured") has handicapped its use for a widening range of data-centric applications. Even documents, normally thought of as the epitome of semi-structure, can be designed with virtually the same rigor as a database schema, enforced by an XML schema and processed by both commercial and custom software programs without reducing their usability by human readers.
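To make the notion of self-describing, tag-delimited records concrete, the following is a minimal sketch (the element and attribute names are invented for illustration and are not a DataBio schema) of how tagged elements in a semi-structured XML document can be located programmatically even when individual records carry different fields:

```python
import xml.etree.ElementTree as ET

# A semi-structured record set: fixed tags identify semantic elements,
# but records need not share the same set of fields.
doc = """
<observations>
  <observation station="station-01" time="2017-06-30T10:00:00Z">
    <airTemperature unit="C">21.4</airTemperature>
    <leafWetness>0.12</leafWetness>
  </observation>
  <observation station="station-02" time="2017-06-30T10:00:00Z">
    <airTemperature unit="C">19.8</airTemperature>
    <!-- this record has no leafWetness element: attributes may differ per record -->
  </observation>
</observations>
"""

root = ET.fromstring(doc)
for obs in root.findall("observation"):
    temp = obs.findtext("airTemperature")
    wetness = obs.findtext("leafWetness")  # None when the element is absent
    print(obs.get("station"), temp, wetness)
```

Note that the consumer tolerates the missing `leafWetness` element rather than failing, which is precisely the flexibility (and the validation burden) that distinguishes semi-structured data from rows in a relational table.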
In view of this fact, XML might be referred to as having a "flexible structure", capable of human-centric flow and hierarchy as well as highly rigorous element structure and data typing. The concept of XML as "human-readable", however, can only be taken so far. Some implementations/dialects of XML, such as the XML representation of the contents of a Microsoft Word document as implemented in Office 2007 and later versions, utilize dozens or even hundreds of different kinds of tags that reflect a particular problem domain (in Word's case: formatting at the character, paragraph and document level, definitions of styles, inclusion of citations, etc.), which are nested within each other in complex ways. Understanding even a portion of such an XML document by reading it, let alone catching errors in its structure, is impossible without a very deep prior understanding of the specific XML implementation, along with assistance from software that understands the XML schema that has been employed. Such text is no more "human-understandable" than a book written in Swahili (which uses the Latin alphabet) would be to an American or Western European who does not know a word of that language: the tags are symbols that are meaningless to a person unfamiliar with the domain.

JSON, or JavaScript Object Notation, is an open standard format that uses human-readable text to transmit data objects consisting of attribute-value pairs. It is used primarily to transmit data between a server and a web application, as an alternative to XML. JSON has been
popularized by web services developed utilizing REST principles. A new breed of databases, such as MongoDB and Couchbase, store data natively in JSON format, leveraging the advantages of a semi-structured data architecture.

2.2.3 Unstructured data

Unstructured data (or unstructured information) refers to information that either does not have a pre-defined data model or is not organized in a pre-defined manner. This results in irregularities and ambiguities that make it difficult to understand using traditional programs, as compared to data stored in "field" form in databases or annotated (semantically tagged) in documents. Unstructured data cannot be so readily classified and fitted into a neat box: photos and graphic images, videos, streaming instrument data, webpages, PDF files, PowerPoint presentations, emails, blog entries, wikis and word processing documents. In 1998, Merrill Lynch cited a rule of thumb that somewhere around 80-90% of all potentially usable business information may originate in unstructured form. This rule of thumb is not based on primary or any quantitative research, but is nonetheless accepted by some. IDC and EMC project that data will grow to 40 zettabytes by 2020, a 50-fold growth from the beginning of 2010. Computerworld states that unstructured information might account for more than 70-80% of all data in organizations.

Software that creates machine-processable structure can utilize the linguistic, auditory and visual structure that exists in all forms of human communication. Algorithms can infer this inherent structure from text, for instance, by examining word morphology, sentence syntax, and other small- and large-scale patterns. Unstructured information can then be enriched and tagged to address ambiguities, and relevancy-based techniques can then be used to facilitate search and discovery.
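The enrichment-and-tagging idea above can be sketched very simply: derive lightweight structure (keyword tags and counts) from free text so that it becomes searchable. The domain term list below is invented for illustration; a real pipeline would use morphological analysis or a trained model rather than a fixed vocabulary.

```python
import re
from collections import Counter

# Hypothetical domain vocabulary used to tag free text (illustrative only).
DOMAIN_TERMS = {"soil", "crop", "moisture", "temperature", "yield"}

def tag_text(text: str) -> dict:
    """Attach keyword tags and term counts to an unstructured text snippet."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w in DOMAIN_TERMS)
    return {"text": text, "tags": sorted(counts), "term_counts": dict(counts)}

record = tag_text("Soil moisture dropped overnight; crop stress likely if soil stays dry.")
print(record["tags"])  # ['crop', 'moisture', 'soil']
```

The output record is itself semi-structured: the original unstructured text is preserved, while the added fields make it findable by relevancy-based search.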
Examples of "unstructured data" may include books, journals, documents, metadata, health records, audio, video, analog data, images, files, and unstructured text such as the body of an e-mail message, web page or word-processor document. While the main content being conveyed does not have a defined structure, it generally comes packaged in objects (e.g. files or documents) that themselves have structure and are thus a mix of structured and unstructured data; collectively, however, this is still referred to as "unstructured data".

2.2.4 New generation big data

New generation big data focuses in particular on semi-structured and unstructured data, often in combination with structured data. The BDVA reference model for big data technologies distinguishes between six different big data types.
2.2.4.1 Sensor data

Within the DataBio pilots, several key parameters will be monitored through sensor platforms, and sensor data will be collected along the way to support the project activities. Three types of sensor data have already been identified, namely: a) IoT data from in-situ sensors and telemetry stations, b) imagery data from unmanned aerial sensing platforms (drones), and c) imagery from hand-held or mounted optical sensors.

2.2.4.1.1 Internet of Things data

IoT data are a major subgroup of sensor data involved in multiple pilot activities in the DataBio project. IoT data are sent via TCP/UDP protocols in various formats (e.g. txt files with time series data, JSON strings) and can be further divided into the following categories:
• Agro-climatic/field telemetry stations, which contribute raw data (numerical values) related to several parameters. As different pilots focus on different application scenarios, the following table summarizes several IoT-based monitoring approaches to be followed.

Table 2: Sensor data tools, resolution and spatial density

Pilot | Mission, instrument | Data resolution and spatial density
A1.1, B1.2, C1.1, C2.2 | NP's GAIAtrons, telemetry IoT stations with a modular/expandable design, will be used to monitor ambient temperature, humidity, solar radiation, leaf wetness, rainfall volume, wind speed and direction, and barometric pressure (GAIAtron atmo), as well as soil temperature and humidity at multiple depths (GAIAtron soil). | Time step for data collection: every 10 minutes. One station per microclimate zone (300 ha - 1100 ha for atmo, 300 ha - 3300 ha for soil).
A1.2, B1.3 | Field-bound sensors will be used to monitor air temperature, air moisture, solar radiation, leaf wetness, rainfall, wind speed and direction, soil moisture, soil temperature, soil EC/salinity, PAR, and barometric pressure. These sensors consist of a technology platform of "retriever and pups" wireless sensor networks and SpecConnect, a cloud-based crop data management solution. | Time step for data collection is customizable from 1 to 60 minutes. Field sensors will be used to monitor 5 tandemly located sites at the following densities: a) air temperature, air moisture, rainfall, wind data and solar radiation: one block of sensors per 5 ha; b) leaf wetness: two sensors per ha; c) soil moisture, soil temperature and soil EC/salinity: one combined sensor per ha.
A2.1 | Environmental indoor: air temperature, air relative humidity, solar radiation, crop leaf temperature (remotely and in contact), soil/substrate water content. Environmental outdoor: wind speed and direction, evaporation, rain, UVA, UVB. | To be determined.
B1.1 | Agro-climatic IoT stations monitoring temperature, relative and absolute humidity, and wind parameters. | To be determined.

• Control data in the parcels/fields, measuring sprinklers, drippers, metering devices, valves, alarm settings, heating, pumping state, pressure switches, etc.
• Contact sensing data that determine problems with great precision, speeding up the use of techniques which help to solve problems.
• Vessel- and buoy-based stations, which contribute raw data (numerical values), typically hydroacoustic and machinery data.

2.2.4.1.2 Drone data

A specific subset of the sensor data generated and processed within the DataBio project is images produced by cameras on board drones, or RPAS (Remotely Piloted Aircraft Systems). In particular, some DataBio pilots will use optical (RGB), thermal or multispectral images and 3D
point clouds acquired from RPAS. The information generated by drone-airborne cameras is usually image data (JPEG or JPEG2000). A general description of the workflow is provided below.

Data acquired by the RGB sensor
The RGB sensor acquires individual pictures in .JPG format, together with their 'geotag' files, which are downloaded from the RPAS and processed into:
• .LAS files: 3D point clouds (x, y, z), which are then processed to produce digital models (Terrain (DTM), Surface (DSM), Elevation (DEM), Vegetation (DVM))
• .TIF files, which are then processed into an orthorectified mosaic. In order to obtain smaller files, mosaics are usually exported to the compressed .ECW format.

Data acquired by the thermal sensor
The thermal sensor acquires a video file which is downloaded from the RPAS and:
• split into frames in .TIF format (pixels contain Digital Numbers: 0-255)
• 1 of every 10 frames is selected (with an overlap of about 80%, so as not to process an excessive amount of information)

Data acquired by the multispectral sensor
The multispectral sensor acquires individual pictures from the 6 spectral channels in .RAW format, which are downloaded from the RPAS and processed into:
• .TIF files (16 bits), which are then processed to produce a 6-band .TIF mosaic (pixels contain Digital Numbers: 0-255)

2.2.4.1.3 Data from hand-held or mounted optical sensors

Images from hand-held or mounted cameras will be collected using a truck-held or hand-held full-range/high-resolution UV-VIS-NIR-SWIR spectroradiometer.

2.2.4.2 Machine-generated data

Machine-generated data in the DataBio project are data produced by ships, boats and machinery used in agriculture and forestry (such as tractors). These data will serve for further analysis and optimization of processes in the bioeconomy sector.
For illustration purposes, examples of data collected by tractors in agriculture are described. Tractors are equipped with the following units:
• Control units for data control, data collection and analyses, including dashboards, the transmission control unit, the hydrostatic or hydrodynamic system control unit, and the engine control unit.
• Global Positioning System (GPS) units or Global System for Mobile Communications (GSM) units for tractor tracking.
• Unit for displaying field/soil characteristics, including area, quality, boundaries and yields.

These units generate the following data:
• Identification of the tractor + identification of the driver by code or by RFID module.
• Identification of the current operation status.
• Time identification by date and current time.
• Precise tractor location tracking (daily route, starts, stops, speed).
• Tractor hours: monitoring working hours in time and place.
• Information from the tachometer [Σ km] and [Σ working hrs and min].
• Identification of the current maintenance status.
• Tractor diagnostics: failure modes or failure codes.
• Information about the date of the last calibration of each tractor system + information about settings, SW version, last update, etc.
• The amount of fuel in the fuel tank [L].
• Online information about sudden loss of fuel from the fuel tank.
• Fuel consumption per trip / per time period / per kilometre (monitoring of fuel consumption in various dependencies, e.g. on motor load).
• Total fuel consumption per day [L/day].
• Engine speed [rev/min].
• Possibility to set up the engine speed range online [rev/min, from - to], with signalling when limits are exceeded.
• Current position of the accelerator pedal [% on a scale of 0-100%].
• Charging level of the main battery [V].
• Current temperature of the cooling water [°C or °F].
• Current temperature of the motor oil [°C or °F].
• Current temperature of the aftertreatment system [°C or °F].
• Current temperature of the transmission oil [°C or °F].
• Gear shift diagnosis [grades backward and forward].
• Current engine load [% on a scale of 0-100%].

2.2.4.3 Geospatial data

The DataBio pilots will collect earth observation (EO) data from a number of sources, which will be refined during the project.
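As a rough illustration of how such machine-generated records might feed the analyses mentioned above, the following sketch derives per-trip fuel consumption from telemetry samples. The record layout (trip identifier, odometer reading, fuel tank level) is invented for illustration; real tractor bus data would arrive in manufacturer-specific formats.

```python
from collections import defaultdict

# Hypothetical telemetry samples: (trip_id, odometer_km, fuel_tank_l).
samples = [
    ("trip-1", 1200.0, 180.0),
    ("trip-1", 1235.0, 171.2),
    ("trip-2", 1235.0, 171.2),
    ("trip-2", 1251.0, 168.0),
]

# Group samples by trip.
trips = defaultdict(list)
for trip_id, km, fuel in samples:
    trips[trip_id].append((km, fuel))

results = {}
for trip_id, points in sorted(trips.items()):
    points.sort()                                   # order by odometer reading
    distance = points[-1][0] - points[0][0]         # km driven on the trip
    fuel_used = points[0][1] - points[-1][1]        # tank level drops as fuel burns
    per_100km = 100 * fuel_used / distance if distance else float("nan")
    results[trip_id] = (distance, fuel_used, per_100km)
    print(f"{trip_id}: {distance:.1f} km, {fuel_used:.1f} L, {per_100km:.1f} L/100km")
```

A production version would also handle refuelling events (tank level rising mid-trip) and the sudden-fuel-loss alarms listed above, which this sketch deliberately omits.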
Currently, it is confirmed that the following EO data will be collected and used as input data:
Table 3: Geospatial data tools, format and origin

Mission, instrument | Format | Origin
Sentinel-1, C-SAR | SLC, GRD | Copernicus Open Access Hub (https://scihub.copernicus.eu/)
Sentinel-2, MSI | L1C | Copernicus Open Access Hub (https://scihub.copernicus.eu/)

Information about the expected sizes will be added when the information becomes available. In addition to EO data, DataBio will utilise other geospatial data from EU, national, local, private and open repositories, including Land Parcel Identification System data, cadastral data, the Open Land Use map (http://sdi4apps.eu/open_land_use/), the Urban Atlas and Corine Land Cover, and Proba-V data (www.vito-eodata.be). The meteo data will be collected mainly from EO-based systems and from European data sources such as Copernicus products and EUMETSAT H-SAF products, but other EO data sources such as VIIRS, MODIS and ASTER will also be considered. As complementary data sources, the output of weather forecast models (ECMWF) and of regional weather services, usually based on ground weather stations, can be considered according to the specific target areas of the pilots.

2.2.4.4 Genomics data

Within DataBio Pilot 1.1.2, different data will be collected and produced. Three categories of data have already been identified for the pilot, namely: a) in-situ sensor (including image capture) and farm data, b) genomic data from plant breeding efforts in greenhouses, produced using Next Generation Sequencers (NGS), and c) biochemical data of tomato fruits produced by chromatographs (LC/MS/MS, GC/MS, HPLC).

In-situ sensors / environmental outdoor: wind speed and direction, evaporation, rain, light intensity, UVA, UVB.
In-situ sensors / environmental indoor: air temperature, air relative humidity, crop leaf temperature (remotely and in contact), soil/substrate water content, crop type, etc.
Farm data:
• In-situ measurements: soil nutritional status.
• Farm logs (work calendar, technical practices at farm level, irrigation information).
• Farm profile (static farm information, such as size).

Table 4: Genomic, biochemical and metabolomic data tools, description and acquisition (Pilot A1.1.2)

Genomic data | The aim is to characterize the genetic diversity of local tomato varieties used for breeding, to use the genetic-genomic information to guide the breeding efforts (as a selection tool for higher performance), and to develop a model to predict the final breeding result in order to achieve, rapidly and with less financial burden, varieties of higher performance. Data will be produced using two Illumina NGS machines and stored in compressed text files (FASTQ). Data will be produced from plant biological samples (leaf and fruit); collection will be done at two different plant stages (plantlets and mature plants). Genomic data will be produced using standard and customized protocols at CERTH. Genomic data, although plain text in format, are big-volume data and pose challenges in their storage, handling and processing. Preliminary analysis will be performed using the local HPC computational facility.

Biochemical, metabolomic data | The aim is to characterize the biochemical profile of fruits from tomato varieties used for breeding. Data will be produced from different chromatographs and mass spectrometers, mainly as proprietary binary archives converted to XML or other open formats. Data will be acquired from biological samples of tomato fruits.

While genomic data are stored in raw format as files, environmental data, which are generated using a network of sensors, will be stored in a database along with the time information and will be processed as time series data.
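The compressed FASTQ text files mentioned above have a simple four-line record layout (identifier, sequence, separator, per-base quality), which is what makes them easy to stream but voluminous to store. A minimal reader sketch, using invented example reads rather than real pilot data:

```python
import gzip
import io

# Two invented example reads, gzip-compressed to mimic a .fastq.gz file.
fastq_bytes = gzip.compress(
    b"@read1\nACGTACGT\n+\nIIIIIIII\n@read2\nTTGACNNA\n+\nIIII##II\n"
)

def read_fastq(handle):
    """Yield (read_id, sequence, quality) tuples from a FASTQ text stream."""
    while True:
        header = handle.readline().strip()
        if not header:                    # end of file
            return
        seq = handle.readline().strip()
        handle.readline()                 # '+' separator line, discarded
        qual = handle.readline().strip()
        yield header[1:], seq, qual       # drop the leading '@'

with io.TextIOWrapper(gzip.GzipFile(fileobj=io.BytesIO(fastq_bytes))) as fh:
    records = list(read_fastq(fh))
print(len(records))  # 2
```

Because the format is line-oriented, reads can be processed one at a time without loading the whole multi-gigabyte file into memory, which is the usual approach on the HPC facility side.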
2.3 Historical data

In the context of machine learning and predictive and prescriptive analytics, it is important to be able to use historical data for training and validation purposes. Machine learning algorithms will use existing historical data as training data for both supervised and unsupervised learning. Information about the datasets, and about the time periods covered by the historical datasets to be used in DataBio, can be found in Appendix A. Historical data can also
serve for training complex event processing applications. In this case, historical data is injected as if "happening in real time", thereby serving to test the complex event driven application at hand before running it in a real environment.

2.4 Expected data size and velocity

The big data "V" characteristics of volume and velocity are described for each of the identified data sets in the DataBio project, typically with measurements of total historical volumes and new/additional data per time unit. The DataBio-specific data volumes and velocities (or injection rates) can be found in Appendix A.

2.5 Data beneficiaries

In this section, this document analyses the key beneficiaries who will benefit from the use of big data in several fields, such as analytics, data sets, business value, sales or marketing. This section considers both tangible and intangible concepts. In examining the value of big data, it is necessary to evaluate who is affected by the data and their usage. In some cases, the individual whose data is processed directly receives a benefit. Nevertheless, regarding the data-driven bioeconomy, the benefit to the individual can be considered indirect. In other cases, the relevant individual receives no attributable benefit, with big data value reaped by business, government, or society at large. Concerning the general community, the collection and use of an individual's data benefits not only that individual, but also members of a proximate class, such as users of a similar product or residents of a geographical area. In the case of organizations, big data analysis often benefits those organizations that collect and harness the data. Data-driven profits may be viewed as enhancing allocative efficiency by facilitating the free economy.
The emergence, expansion and widespread use of innovative products and services at decreasing marginal costs have revolutionized global economies and societal structures, facilitating access to technology and knowledge and fomenting social change. With more data, businesses can optimize distribution methods, efficiently allocate credit, and robustly combat fraud, benefiting consumers as a whole. On the other hand, big data analysis can provide a direct benefit to those individuals whose information is being used; however, the DataBio project is not directly involved in those specific cases (see Chapter 6 on ethical issues). Regarding general benefits, big data is creating enormous value for the global economy, driving innovation, productivity, efficiency and growth. Data has become the driving force behind almost every interaction between individuals, businesses and governments. The uses of big data can be transformative and are sometimes difficult to anticipate at the time of initial collection.
This section does not provide a comprehensive taxonomy of big data benefits; it would be pretentious to do so, ranking the relative importance of weighty social goals. Rather, it posits that such benefits must be accounted for by rigorous analysis considering the priorities of a nation, society or economy. Only then can benefits be assessed within an economic framework. Besides those general concepts of big data beneficiaries, it is possible to analyse the impact of the DataBio project results with regard to the final users of the different technologies, tools and services to be developed. Using this approach, and taking into account that more detailed information is available in Deliverables D1.1, D2.1 and D3.1 regarding the definition of the agricultural, forestry and fishery pilots, the main beneficiaries of big data are described in the following sections.

2.5.1 Agricultural Sector

One of the proposed agricultural pilots concerns the use of tractor units able to send information online regarding current operations to the driver or farmer. The prototypes will be equipped with units for tracking and tracing (GPS - Global Positioning System, or GSM - Global System for Mobile Communications) and a unit for displaying soil characteristics. The proposed solution will meet farmers' requests for cost reduction and improved productivity, in order to increase their economic benefits while also following sustainable agriculture practices. In another case, smart farming services such as irrigation, provided through flexible mechanisms and UIs (web, mobile and tablet compatible), will promote the adoption of technological tools (IoT, data analytics) and collaboration with certified professionals to optimize farm productivity.
Therefore, farming cooperatives will likewise obtain cost reduction and improved productivity, migrating from standard to sustainable smart-agriculture practices. In summary, the main beneficiaries of DataBio will be farming cooperatives, farmers and landowners.

2.5.2 Forestry Sector

Data sharing and a collaborative environment enable improved tools for sustainable forest management decisions and operations. Forest management services make data accessible for forest owners and other end users, and integrate this data for e-contracting and for online purchase and sales of timber and biomass. Higher data volumes and better data accessibility increase the probability that the data will be updated and maintained. DataBio WP2 will develop and pilot standardized procedures for collecting and transferring big data from silvicultural activities executed in the forest, based on the DataBio WP4 platform. In summary, the big data beneficiaries related to the WP2 Forestry Pilots activities will be:
• Forest owners (private, public, timberland investors)
• Forest authority experts
• Forest companies
• Contractors and service providers

2.5.3 Fishery Sector

Regarding WP3 - Fisheries Pilot: in Pilot A2, Small pelagic fisheries immediate operational choices, the main users and beneficiaries will be the ship owners and masters on board small pelagic vessels. Modern pelagic vessels are equipped with increasingly complex machinery systems for propulsion, manoeuvring and power generation. As a result, the vessel is always in an operational state, but the configuration of the vessel systems imposes constraints on operation. The captain is tasked with the safe operation of the vessel, while the efficiency of the vessel systems may be increased if the captain is informed about the actual operational state, the potential for improvement and the expected results of available actions.

The goal of Pilot B2, Oceanic tuna fisheries planning, is to create tools that aid trip planning by presenting historical catch data as well as attempting to forecast where the fish might be in the near future. The forecast model will be constructed from historical catch data together with the data available to the skippers at that moment (oceanographic data, buoy data, etc.). In that case, the main beneficiaries of the DataBio development will be tuna fisheries companies. Therefore, in summary, the DataBio WP3 beneficiaries will be the broad range of fisheries stakeholders, from companies and captains to vessel owners.
2.5.4 Technical Staff

Adoption rates aside, the potential benefits of utilising big data and related technologies are significant in both scale and scope and include, for example: better and more targeted marketing activities, improved business decision making, cost reduction and operational efficiencies, enhanced planning, strategic decision making and business agility, fraud detection, waste reduction and customer retention, to name but a few. Obviously, the ability of firms to realize business benefits will depend on company characteristics such as size, data dependency and the nature of the business activity. A core concern voiced by many of those participating in big-data-focused studies is the ability of employers to find and attract the talent needed for both (a) the successful implementation of big data solutions and (b) the subsequent realisation of the associated business benefits. Although ‘Data Scientist’ may currently be the most requested profile in big data, the recruitment of Data Scientists (in volume terms at least) appears relatively low down the wish list of recruiters. Instead, the openings most commonly arising in the big data field (as is the case for IT recruitment in general) are development positions.

2.5.5 ICT sector

2.5.5.1 Developers

The generic title of developer is normally employed together with a detailed description of the specific technical skills required for the post, and it is this description that defines
the specific type of development activity undertaken. The technical skills most often cited by recruiters in adverts for big data Developers are: NoSQL (MongoDB in particular), Java, SQL, JavaScript, MySQL, Linux, Oracle, Hadoop (especially Cassandra), HTML and Spring.

2.5.5.2 Architects

Applicants for these positions are required to hold skills in a range of technical disciplines including Oracle (in particular, BI EE), Java, SQL, Hadoop and SQL Server, whilst the main generic areas of technical knowledge and competence required are: Data Modelling, ETL, Enterprise Architecture, Open Source and Analytics.

2.5.5.3 Analysts

The particular process/methodological skills required from applicants for analyst positions are primarily in respect of: Data Modelling, ETL, Analytics and Data.

2.5.5.4 Administrators

In general, the technical skills most often requested by employers from big data Administrators are: Linux, MySQL, Puppet, Hadoop and Oracle, whilst the process and methodological competences most often requested are in the areas of Configuration Management, Disaster Recovery, Clustering and ETL.

2.5.5.5 Project Managers

The specific types of Project Manager most often required by big data recruiters are Oracle Project Managers, Technical Project Managers and Business Intelligence Project Managers. Aside from Oracle (in particular BI EE, EBS and EBS R12), which was specified in over two-thirds of all adverts for big data related Project Management posts, other technical skills often needed by applicants for this type of position are: Netezza, Business Objects and Hyperion. Process and methodological skills commonly required include ETL and Agile Software Development, together with a range of more ‘business focused’ skills such as PRINCE2 and Stakeholder Management.
2.5.5.6 Data Designers

The most commonly requested technical skills associated with these posts are found to be Oracle (particularly BI EE) and SQL, followed by Netezza, SQL Server, MySQL and UNIX. Common process and methodological skills needed are: ETL, Data Modelling, Analytics, CSS, Unit Testing, Data Integration and Data Mining, whilst more general knowledge requirements relate to the need for experience and understanding of Business Intelligence, Data Warehouses, Big Data, Migration and Middleware.

2.5.5.7 Data Scientists

The core technical skills needed to secure a position as a Data Scientist are found to be: Hadoop, Java, NoSQL and C++. As was the case for other big data positions, adverts for Data Scientists often made reference to a need for various process and methodological skills and
competences. Interestingly, however, in this case such references were found to be much more commonplace and (perhaps as would be expected) most often focused upon data and/or statistical themes, i.e. Statistics, Analytics and Mathematics.

2.5.6 Research and education

Researchers, scientists and academics are one of the largest groups of data reusers. DataBio data published as open data will be used for further research and for educational purposes (e.g. theses).

2.5.7 Policy making bodies

The DataBio data and results will serve as a basis for decision making bodies, especially for policy evaluation and feedback on policy implementation. This includes mainly the European Commission and national and regional public authorities.
FAIR Data

The FAIR principles ensure that data can be discovered through catalogues or search engines, accessed through open interfaces, processed interoperably thanks to standards compliance, and therefore easily reused.

3.1 Data findability

3.1.1 Data discoverability and metadata provision

Metadata is, as its name implies, data about data. It describes the properties of a dataset. Metadata can cover various types of information. Descriptive metadata includes elements such as the title, abstract, author and keywords, and is mostly used to discover and identify a dataset. Another type is administrative metadata, with elements such as the license, intellectual property rights, when and how the dataset was created, who has access to it, etc. The datasets on the DataBio infrastructure are either added locally by a user, harvested from existing data portals, or fetched from operational systems or IoT ecosystems. In DataBio, the definition of a set of metadata elements is necessary to allow, for the vast amount of information resources for which metadata is created: identification of the resource; its classification; identification of its geographic location and temporal reference; its quality and validity; conformity with implementing rules on the interoperability of spatial data sets and services; constraints related to access and use; and the organization responsible for the resource. In addition, metadata elements related to the metadata record itself are necessary to monitor that the metadata created are kept up to date, and to identify the organization responsible for the creation and maintenance of the metadata.
Such a minimum set of metadata elements is also necessary to comply with Directive 2007/2/EC, and does not preclude the possibility for organizations to document their information resources more extensively with additional elements derived from international standards or working practices in their community of interest. Metadata referring to datasets and dataset series (particularly relevant for DataBio will be the EO products derived from satellite imagery) should adhere to the profile originating from the INSPIRE Metadata Regulation, with added theme-specific metadata elements for the agriculture, forestry and fishery domains if necessary. This approach will ensure that metadata created for the datasets, dataset series and services is compliant with the INSPIRE requirements as well as with the international standards EN ISO 19115 (Geographic Information – Metadata; with special emphasis on ISO 19115-2:2009, Geographic information – Metadata – Part 2: Extensions for imagery and gridded data), EN ISO 19119 (Geographic Information – Services), EN ISO 19139 (Geographic Information – Metadata – XML Schema Implementation) and EN ISO 19156 (Geographic information – Observations and Measurements, on which the Earth Observation Metadata profile of O&M builds). Besides, INSPIRE-conformant metadata may also be expressed through the DCAT Application
Profile [1], which defines a minimum set of metadata elements to ensure cross-domain and cross-border interoperability between metadata schemas used in European data portals. If adopted by DataBio, such a mapping could support the inclusion of INSPIRE metadata in the Pan-European Open Data Portal for wider discovery across sectors beyond the geospatial domain. A Distribution represents a way in which the data is made available. DCAT is a rather small vocabulary and deliberately leaves many details open; it welcomes “application profiles”, more specific specifications built on top of DCAT, such as GeoDCAT-AP as its geospatial extension. For sensors we will focus on SensorML, which can be used to describe a wide range of sensors, including both dynamic and stationary platforms and both in-situ and remote sensors. Another possibility is the Semantic Sensor Network (SSN) Ontology, which describes sensors and observations and related concepts. It does not describe domain concepts, time, locations, etc.; these are intended to be included from other ontologies via OWL imports. This ontology was developed by the W3C Semantic Sensor Networks Incubator Group (SSN-XG). In DataBio, there is a need for harmonization of the metadata of spatial and non-spatial datasets and services. GeoDCAT-AP was an obvious choice due to its strong focus on geographic datasets. The main advantage is that it enables users to query all datasets in a uniform way. GeoDCAT-AP is still very new, and the implementation of the new standard within DataBio can provide feedback to OGC, W3C and JRC from both a technical and an end-user point of view. Several software components available in the DataBio architecture have varying support for GeoDCAT-AP, namely Micka [2], CKAN [3] and GeoNetwork [4]. For the DataBio purposes we will also need to integrate the Semantic Sensor Network Ontology and SensorML.
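To make the DCAT element mapping concrete, the sketch below builds a minimal dcat:Dataset description as RDF triples and serializes it as N-Triples using only the Python standard library. The dataset URI, title, keywords and license value are hypothetical, illustrative choices, not actual DataBio catalogue entries; a production pipeline would typically use an RDF library instead of hand-rolled serialization.

```python
# Minimal, illustrative DCAT dataset description (hypothetical values).
# Only descriptive metadata elements are shown: title, description,
# keywords and license, mirroring the element types discussed above.

DCT = "http://purl.org/dc/terms/"
DCAT = "http://www.w3.org/ns/dcat#"
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

def ntriple(s, p, o):
    """Serialize one triple in N-Triples syntax; plain strings become literals."""
    obj = f"<{o}>" if o.startswith("http") else f'"{o}"'
    return f"<{s}> <{p}> {obj} ."

# Hypothetical dataset identifier for illustration only.
dataset = "http://example.org/databio/dataset/field-observations-2017"
triples = [
    (dataset, RDF + "type", DCAT + "Dataset"),
    (dataset, DCT + "title", "Field observations, pilot area (illustrative)"),
    (dataset, DCT + "description", "Hypothetical agricultural sensor dataset."),
    (dataset, DCAT + "keyword", "agriculture"),
    (dataset, DCAT + "keyword", "sensor data"),
    (dataset, DCT + "license", "http://creativecommons.org/licenses/by/4.0/"),
]

document = "\n".join(ntriple(*t) for t in triples)
print(document)
```

A GeoDCAT-AP record would extend such a description with the geospatial elements (spatial and temporal extent, reference system, lineage) drawn from the INSPIRE metadata discussed above.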
For enabling compatibility with COPERNICUS, INSPIRE and GEOSS, the DataBio project will make three extensions: (i) a module for extended harvesting of INSPIRE metadata to DCAT, based on XSLT and easy configuration; (ii) a module for user-friendly visualisation of INSPIRE metadata in CKAN; and (iii) a module to output metadata in GeoDCAT-AP or SensorDCAT, respectively. We plan to use the Micka and CKAN systems. MICKA is a complex system for metadata management used for building Spatial Data Infrastructure (SDI) and geoportal solutions. It contains tools for editing and managing metadata for spatial data, services and other sources (documents, websites, etc.). CKAN supports DCAT to import or export its datasets. CKAN enables harvesting data from OGC CSW catalogues, but not all mandatory INSPIRE metadata elements are supported. Unfortunately, the DCAT output does not fulfil all INSPIRE requirements, nor is GeoDCAT-AP fully supported.

[1] https://joinup.ec.europa.eu/asset/dcat_application_profile/description
[2] http://micka.bnhelp.cz/
[3] https://ckan.org/
[4] http://geonetwork-opensource.org/
An ongoing programme of spatial data infrastructure projects, undertaken with academic and commercial partners, enables DataBio to contribute to the creation of standard data specifications and policies. This ensures that the participating databases remain of high quality and compatible, and can interact with one another to deliver data which provides practical and tangible benefits for European society. The network’s mission is to provide and disseminate statistical information which has to be objective, independent and of high quality, and which is available to everybody: politicians, authorities, businesses and citizens.

3.1.2 Data identification, naming mechanisms and search keyword approaches

For data identification, naming and search keywords we will use the INSPIRE data registry. The INSPIRE infrastructure involves a number of items which require clear descriptions and the possibility to be referenced through unique identifiers. Examples of such items include INSPIRE themes, code lists, application schemas and discovery services. Registers provide a means to assign identifiers to items and to their labels, definitions and descriptions (in different languages). The INSPIRE Registry is a service giving access to INSPIRE semantic assets (e.g. application schemas, metadata and data code lists, themes), assigning a persistent URI to each of them. As such, this service can also be considered a metadata directory/catalogue for INSPIRE, as well as a registry for the INSPIRE "terminology". Since June 2013, when the INSPIRE Registry was first published, a number of versions have been released, implementing new features based on the community's feedback.
Recently, a new version of the INSPIRE Registry has been published which, among other features, makes its content available also in RDF/XML: http://inspire.ec.europa.eu/registry/ [5] The INSPIRE Registry provides a central access point to a number of centrally managed INSPIRE registers [6]. These include:
● INSPIRE application schema register
● INSPIRE code list register
● INSPIRE enumeration register
● INSPIRE feature concept dictionary
● INSPIRE glossary
● INSPIRE layer register
● INSPIRE media-types register
● INSPIRE metadata code list register
● INSPIRE reference document register
● INSPIRE theme register

[5] https://www.rd-alliance.org/group/metadata-ig/post/inspire-registry-rdf-representation-now-supported.html
[6] http://inspire.ec.europa.eu/registry/
Most relevant for naming in metadata is the INSPIRE metadata code list register, which contains the code lists and their values as defined in the INSPIRE implementing rules on metadata. [7]

3.1.3 Data lineage

Data lineage refers to the sources of information, such as entities and processes, involved in producing or delivering an artifact; it records the derivation history of a data product. This history can include the algorithms used, the process steps taken, the computing environment run, the data sources input to the processes, the organization/person responsible for the product, etc. Provenance provides important information to data users for determining the usability and reliability of the product. In the science domain, data provenance is especially important, since scientists need this information to determine the scientific validity of a data product and to decide whether such a product can be used as the basis for further scientific analysis. The provenance of information is crucial to making determinations about whether information is trusted, how to integrate diverse information sources, and how to give credit to originators when reusing information [REF-02]. In an open and inclusive environment such as the Web, users find information that is often contradictory or questionable. Reasoners in the Semantic Web will need explicit representations of provenance information in order to make trust judgments about the information they use. With the arrival of massive amounts of Semantic Web data (e.g. via the Linked Open Data community), information about the origin of that data, i.e. provenance, becomes an important factor in developing new Semantic Web applications.
Therefore, a crucial enabler of Semantic Web deployment is the explicit representation of provenance information that is accessible to machines, not just to humans. Data provenance is the information about how data was derived, and is critical to the ability to interpret a particular data item. Provenance is often conflated with metadata and trust. Metadata is used to represent properties of objects; many of those properties have to do with provenance, so the two are often equated. Trust is derived from provenance information, and typically is a subjective judgment that depends on context and use [REF-03]. The W3C PROV family of documents defines a model, corresponding serializations and other supporting definitions to enable the interoperable interchange of provenance information in heterogeneous environments such as the Web [REF-04]. Current standards include [REF-05]:

PROV-DM: The PROV Data Model [REF-06] – PROV-DM is a core data model for provenance, for building representations of the entities, people and processes involved in producing a piece of data or thing in the world. PROV-DM is domain-agnostic, but with well-defined extensibility points allowing further domain-specific and application-specific extensions to be defined. It is accompanied by PROV-ASN, a technology-independent abstract syntax notation, which allows serializations of PROV-DM instances to be created for human consumption,

[7] http://inspire.ec.europa.eu/metadata-codelist
which facilitates its mapping to concrete syntaxes, and which is used as the basis for a formal semantics.

PROV-O: The PROV Ontology [REF-07] – This specification defines the PROV Ontology as the normative representation of the PROV Data Model using the Web Ontology Language (OWL2). This document is part of a set of specifications being created to address the issue of provenance interchange in Web applications.

Constraints of the PROV Data Model [REF-08] – PROV-DM describes the entities, people and activities involved in producing a piece of data or thing. It is structured in six components, dealing with: (1) entities and activities, and the time at which they were created, used, or ended; (2) agents bearing responsibility for entities that were generated and activities that happened; (3) derivations of entities from entities; (4) properties to link entities that refer to the same thing; (5) collections forming a logical structure for their members; (6) a simple annotation mechanism. This document defines the constraints that well-formed PROV-DM instances must satisfy.

PROV-N: The Provenance Notation [REF-09] – a notation for writing PROV-DM instances that is aimed at human consumption.
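The PROV-DM core structures can be illustrated with a small sketch: entities linked to the activity that generated them (component 1), the agent associated with that activity (component 2), and derivation links between entities (component 3). All identifiers below are hypothetical examples, and real interchange would use the PROV-O or PROV-N serializations rather than in-memory objects.

```python
# Toy model of PROV-DM entities, activities and agents (illustrative only).
from dataclasses import dataclass, field

@dataclass
class Entity:
    id: str
    was_derived_from: list = field(default_factory=list)  # derivation links
    was_generated_by: str = None                          # generating activity id

@dataclass
class Activity:
    id: str
    used: list = field(default_factory=list)              # input entity ids
    was_associated_with: str = None                       # responsible agent id

# Hypothetical provenance chain: raw imagery -> processing -> derived product.
raw = Entity("ex:raw-imagery")
processing = Activity("ex:ndvi-processing", used=[raw.id],
                      was_associated_with="ex:databio-pilot-team")
product = Entity("ex:ndvi-map", was_derived_from=[raw.id],
                 was_generated_by=processing.id)

def lineage(entity, entities):
    """Walk wasDerivedFrom links back to the original sources."""
    chain = [entity.id]
    for src in entity.was_derived_from:
        chain += lineage(entities[src], entities)
    return chain

entities = {e.id: e for e in (raw, product)}
print(lineage(product, entities))  # ['ex:ndvi-map', 'ex:raw-imagery']
```

Walking the derivation links as above is exactly the query a user would pose to a provenance store to judge the validity of a derived product.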
Figure 2 [REF-10] shows a generic data lifecycle in the context of a data processing environment, where data are first discovered by the user with the help of metadata and provenance catalogues.
Figure 2: The processing data lifecycle

During the data processing phase, data replica information may be entered in replica catalogues (which contain metadata about the data location), data may be transferred between storage and execution sites, and software components may be staged to the execution sites as well. While data are being processed, provenance information can be automatically captured and then stored in a provenance store. The resulting derived data products (both intermediate and final) can also be stored in an archive, with metadata about them stored in a metadata catalogue and location information stored in a replica catalogue. Data provenance is also addressed in the W3C DCAT metadata model [REF-11]. dcat:CatalogRecord describes a dataset entry in the catalog and is used to capture provenance information about dataset entries. This class is optional and not all catalogs will use it. It exists for catalogs where a distinction is made between metadata about a dataset and metadata about the dataset's entry in the catalog. For example, the publication date property of the dataset reflects the date when the information was originally made available by the publishing agency, while the publication date of the catalog record is the date when the dataset was added to the catalog. In cases where both dates differ, or where only the latter is known, the publication date should only be specified for the catalog record. The W3C PROV Ontology [prov-o] allows describing further provenance information, such as the details of the process and the agent involved in a particular change to a dataset. Detailed specification of data provenance is also among the additional requirements for the DCAT-AP specification effort [REF-12].
3.2 Data accessibility

Through DataBio experiments with the large number of tools and technologies identified in WP4 and WP5, a common data access pattern shall be developed. Ideally, this pattern is based on internationally adopted standards, such as OGC WFS for feature data, OGC WCS for coverage data, OGC WMS for maps, or OGC SOS for sensor data.

3.2.1 Open data and closed data

Everyone from citizens to civil servants, researchers and entrepreneurs can benefit from open data. In this respect, the aim is to make effective use of open data. This data is already available in public domains and is not within the control of the DataBio project. All data rests on a scale between closed and open, because there are variances in how information is shared between the two points in the continuum. Closed data might be shared with specific individuals within a corporate setting. Open data may require attribution to the contributing source, but still be completely available to the end user. Generally, open data differs from closed data in three key ways [8]:
1. Open data is accessible, usually via a data warehouse on the internet.
2. It is available in a readable format.
3. It is licensed as open source, which allows anyone to use the data or share it for non-commercial or commercial gain.
Closed data restricts access to the information in several potential ways:
1. It is only available to certain individuals within an organization.
2. The data is patented or proprietary.
3. The data is semi-restricted to certain groups.
4. The data is open to the public only through a licence fee or other prerequisite.
5. The data is difficult to access, for example paper records that have not been digitized.
Typical examples of closed data include information that requires a security clearance; health-related information collected by a hospital or insurance carrier; or, on a smaller scale, your own personal tax returns. There are also other datasets used for the pilots, such as cartography, 3D or land use data, but these are stored in databases which are not available through open data portals. Once the use case specifications and requirements have been completed, these data may also be needed for the processing and visualisation within the DataBio applications. However, this data – in its raw format – may not be made available to external stakeholders for further use due to licensing and/or privacy issues. Therefore, at this stage, the data management plan does not cover these datasets.

[8] www.opendatasoft.com
3.2.2 Data access mechanisms, software and tools

Data access is the process of entering a database to store or retrieve data. Data access tools are end-user-oriented tools that allow users to build structured query language (SQL) queries by pointing and clicking on the list of tables and fields in the data warehouse. Throughout computing history, different methods and languages have been used for data access, varying with the type of data warehouse. Data warehouses contain rich repositories of data pertaining to organizational business rules, policies, events and histories, and they store data in different and incompatible formats, so several data access tools have been developed to overcome the resulting data incompatibilities. Recent advances in information technology have brought about new and innovative software applications with more standardized languages, formats and methods to serve as an interface between different data formats. Some of the more popular standards include SQL, ODBC, ADO.NET, JDBC, XML, XPath, XQuery and Web Services.

3.2.3 Big data warehouse architectures and database management systems

Depending on the project needs, there are different possibilities to store data:

3.2.3.1 Relational Database

This is a digital database whose organization is based on the relational model of data. The various software systems used to maintain relational databases are known as relational database management systems (RDBMS). Virtually all relational database systems use SQL (Structured Query Language) as the language for querying and maintaining the database. A relational database has the important advantage of being easy to extend: after the original database creation, a new data category can be added without requiring that all existing applications be modified.
This model organizes data into one or more tables (or "relations") of columns and rows, with a unique key identifying each row. Rows are also called records or tuples. Generally, each table/relation represents one "entity type" (such as customer or product); the rows represent instances of that entity type and the columns represent values attributed to that instance. The definition of a relational database results in a table of metadata, or formal descriptions of the tables, columns, domains and constraints. When creating a relational database, the domain of possible values of a data column can be defined, together with further constraints that may apply to those values. For example, a domain of possible customers could allow up to ten possible customer names, but one table could be constrained to allowing only three of these customer names.
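The relational concepts above (unique keys, typed columns, domain constraints) can be illustrated with the standard-library sqlite3 module. The table and column names below are hypothetical examples chosen for this sketch, not part of any DataBio schema.

```python
# Minimal relational-model illustration using Python's built-in sqlite3.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE customer (
        id      INTEGER PRIMARY KEY,   -- unique key identifying each row/tuple
        name    TEXT NOT NULL,
        sector  TEXT CHECK (sector IN ('agriculture', 'forestry', 'fishery'))
    )
""")
conn.executemany("INSERT INTO customer (name, sector) VALUES (?, ?)",
                 [("Coop Alpha", "agriculture"), ("Timber Ltd", "forestry")])

# The CHECK constraint enforces the column's domain of possible values:
try:
    conn.execute("INSERT INTO customer (name, sector) VALUES (?, ?)",
                 ("Ghost Inc", "mining"))
except sqlite3.IntegrityError:
    print("rejected: 'mining' is outside the declared domain")

rows = conn.execute("SELECT name FROM customer ORDER BY id").fetchall()
print(rows)  # [('Coop Alpha',), ('Timber Ltd',)]
```

Extending the schema later, e.g. with an `ALTER TABLE customer ADD COLUMN country TEXT`, illustrates the extensibility advantage noted above: existing applications continue to work unchanged.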
An example of a relational database management system is Microsoft SQL Server, developed by Microsoft. As a database server, it is a software product with the primary function of storing and retrieving data as requested by other software applications, which may run either on the same computer or on another computer across a network (including the Internet). Microsoft makes SQL Server available in multiple editions, with different feature sets targeting different users. PostgreSQL, often simply Postgres, is an object-relational database management system (ORDBMS) with an emphasis on extensibility and standards compliance. As a database server, its primary functions are to store data securely and return that data in response to requests from other software applications. It can handle workloads ranging from small single-machine applications to large Internet-facing applications (or data warehousing) with many concurrent users; on macOS Server, PostgreSQL is the default database. It is also available for Microsoft Windows and Linux. PostgreSQL is developed by the PostgreSQL Global Development Group, a diverse group of many companies and individual contributors. It is free and open-source, released under the terms of the PostgreSQL License, a permissive software license. Furthermore, it is ACID-compliant and transactional. PostgreSQL has updatable views and materialized views, triggers and foreign keys; it supports functions and stored procedures, among other extensibility features.

3.2.3.2 Big Data storage solutions

A NoSQL (originally referring to "non-SQL", "non-relational" or "not only SQL") database provides a mechanism for storage and retrieval of data which is modeled in means other than the tabular relations used in relational databases.
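The contrast with the tabular model can be sketched with a toy in-memory "document" collection: records are nested, JSON-like documents rather than rows, and each document may carry different fields. The query helper below is a simplified stand-in written for this illustration; it is not the API of MongoDB or any real document store, and all document contents are hypothetical.

```python
# Toy illustration of the schema-less document data model (not a real DB API).
documents = [
    {"_id": 1, "type": "vessel", "name": "Nordlys", "length_m": 62},
    {"_id": 2, "type": "vessel", "name": "Albacora", "gear": "purse seine"},
    {"_id": 3, "type": "buoy", "position": {"lat": 14.2, "lon": -17.5}},
]

def find(collection, criteria):
    """Return documents whose fields match every key/value pair in criteria."""
    return [doc for doc in collection
            if all(doc.get(k) == v for k, v in criteria.items())]

vessels = find(documents, {"type": "vessel"})
print([d["name"] for d in vessels])  # ['Nordlys', 'Albacora']
```

Note that the two "vessel" documents carry different fields (`length_m` vs `gear`) without any schema change, which is precisely the flexibility, and the consistency burden, that distinguishes document stores from relational tables.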
Such databases have existed since the late 1960s, but did not obtain the "NoSQL" moniker until a surge of popularity in the early twenty-first century, triggered by the needs of Web 2.0 companies such as Facebook, Google and Amazon.com. NoSQL databases are increasingly used in big data and real-time web applications. NoSQL systems are also sometimes called "Not only SQL" to emphasize that they may support SQL-like query languages. Motivations for this approach include: simplicity of design, simpler "horizontal" scaling to clusters of machines (which is a problem for relational databases), and finer control over availability. The data structures used by NoSQL databases (e.g. key-value, wide-column, graph or document) differ from those used by default in relational databases, making some operations faster in NoSQL. The particular suitability of a given NoSQL database depends on the problem it must solve. The data structures used by NoSQL databases are also sometimes viewed as "more flexible" than relational database tables.

MongoDB: MongoDB (from "humongous") is a free and open-source cross-platform document-oriented database program. Classified as a NoSQL database program, MongoDB uses JSON-like documents with schemas. MongoDB is developed by MongoDB Inc. and is published under a combination of the GNU Affero General Public License and the Apache License. MongoDB supports field queries, range queries and regular-expression searches. Queries can return specific fields of documents, can include user-defined JavaScript functions, and can be configured to return a random sample of results of a given size. MongoDB can be used as a file system with load balancing and data replication features over multiple machines for storing files. This function, called Grid File System (GridFS), is included with MongoDB drivers: MongoDB exposes functions for file manipulation and content to developers, and GridFS is used in plugins for NGINX and lighttpd. GridFS divides a file into parts, or chunks, and stores each of those chunks as a separate document. Based on MongoDB (but not restricted to it) is GeoRocket, developed by Fraunhofer IGD: it provides high-performance data storage and is schema-agnostic and format-preserving. For more information please refer to D4.1, which describes the components applied in the DataBio project.

3.3 Data interoperability

Data can be made available in many different formats implementing different information models. The heterogeneity of these models reduces the level of interoperability that can be achieved. In principle, the combination of a standardized data access interface, a standardized transport protocol and a standardized data model ensures seamless integration of data across platforms, tools, domains and communities. When the amount of data grows, mechanisms have to be explored that ensure interoperability while handling large volumes of data. Currently, the amount of data can still be handled using OGC models and data exchange services; we will need to review this during the course of the project.
For now, data interoperability is envisioned to be ensured through compliance with internationally adopted standards. Ultimately, interoperability takes on different forms when applied in various “disciplinary” settings. The following figure illustrates that concept (source: Wyborn 2017).
Figure 3: The “disciplinary” data integration platform: where do you sit? (source: Wyborn)

The intra-disciplinary type remains within a single discipline. The level of standardization needs to cover the discipline's needs, but little attention is usually paid to cross-discipline standards. In the multi-disciplinary situation, many people from different domains work together, but they all remain within their silos and data exchange is limited to the bare minimum. The cross-disciplinary setting is what we are experiencing at the beginning of DataBio: all disciplines are interfacing and reformatting their data to make it fit. This model works as long as data exchange is minor, but it does not scale, as it requires bilateral agreements between the various parties. The interdisciplinary approach is what DataBio targets. The goal here is to adhere to a minimum set of standards; ideally, the specific characteristics are standardized between all partners upfront. This model adds minimum overhead for all parties, as a single mapping needs to be implemented per party (or, even better, the new model is used natively from then on). The transdisciplinary approach starts with data already provided as linked data, with links across the various disciplines, well-defined vocabularies, and a set of mapping rules to ensure usability of data generated in arbitrary disciplines.

3.3.1 Interoperability mechanisms

Key to interoperable data exchange are standardized interfaces. Currently, the number of data processing and exchange tools is extremely large. We expect a consolidation of the number of tools during the first 15 months of the project. We will regularly revise the requirements set by the various pilots and the data sets made available, to ensure that proper recommendations can be given at any time.
3.3.2 Inter-discipline interoperability and ontologies

A key element of interoperability within and across disciplines is shared semantics, but the Semantic Web is still in its infancy and it is not clear to what extent it will become widely accepted within data-intensive communities in the near future. It requires graph structures for data and/or metadata, well-defined vocabularies and ontologies, and lacks both the