Odam: Open Data, Access and Mining

Give open access to your data
and make it ready to be mined
Daniel Jacob
UMR 1332 BFP – Metabolism Group
Bordeaux Metabolomics Facility
May 2016
Open Data for Access and Mining
A data explorer as a bonus
EDTMS
ODAM

Daniel Jacob – INRA UMR 1332 – May 2016
The experimental context: needs / wishes
Experimental workflow: seeding, harvesting, samples preparation, samples analysis
Diagram elements: sample identifiers, Experiment Design, Experiment Data Tables, Web API
1. Identifiers centrally managed
2. Data sharing & data availability
3. Facilitate the subsequent data mining
- Make both metadata and data available for data mining
- Develop, if needed, lightweight tools: R scripts (Galaxy), lightweight GUI (R shiny)
Open Data for Access and Mining: The core idea in one shot
Implementation of an Experiment Data Tables Management System (EDTMS)
Merely dropping data files in a data repository (e.g. a local NAS or distant
storage space) should allow users to access them through a web API.
No database schema, no programming code and no additional configuration on the server side.
- Data capture, minimal effort (PUT): the data repository is mounted on the server (myhost.org)
- Data access (GET): through http://myhost.org/, data can be downloaded, explored and mined

Preparation and cleaning of the data subset files
Data subset files: plants.tsv, harvests.tsv, samples.tsv, compounds.tsv, enzymes.tsv
• Whatever the kind of experiment, this assumes a design of experiment
(DoE) involving individuals, samples or other items as the main
objects of study (e.g. plants, tissues, bacteria, …)
• This also assumes the observation of dependent variables resulting from
the effects of some controlled experimental factors.
• Moreover, each object of study usually has its own identifier,
and the variables can be quantitative or qualitative.
• There can be one type of object of study or several, but in
the latter case there must be a relationship between the object types,
which we assume to be of the "obtainedFrom" type.

Classification of each column within its right category
Data subset files: plants.tsv, harvests.tsv, samples.tsv, compounds.tsv, enzymes.tsv
Categories: identifier, factor, quantitative, qualitative, link
Connections between the dataset files based on identifiers
Data subset files and their associated metadata files must comply
with the TSV (tab-separated values) format.
• You have to organize your data subsets so that links can be established between them.
• In practice, this means adding a column containing the identifiers of the entity
to which you want to connect the subset, implying an 'obtainedFrom' relation.
• Note that this duplication of identifiers must be the only redundant
information across all data subsets.
Data subset files and their entities (concepts): plants.tsv (Plants), harvests.tsv (Harvests),
samples.tsv (Samples), compounds.tsv (Compounds), enzymes.tsv (Enzymes)
Legend:
- identifier: identifier of the central entity of the subset
- link: link between 2 subsets, carried by identifiers (implies an 'obtainedFrom' relation)
Categories: identifier, factor, quantitative, qualitative, link
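The rule that the link column should be the only information shared between two subsets can be checked before deposit. The sketch below is one possible way to do it in Python. It is not part of ODAM, and the column names (PlantID, HarvestDate, Tissue) and values are invented for illustration.

```python
import csv
import io

# Toy TSV content; column names and values are invented for illustration.
HARVESTS_TSV = "Lot\tPlantID\tHarvestDate\nA\t1\t2010-06-01\nB\t2\t2010-06-15\n"
SAMPLES_TSV = "SampleID\tLot\tTissue\n10\tA\tpericarp\n11\tB\tpericarp\n"


def header(tsv_text):
    """Return the column names found on the first line of a TSV text."""
    return next(csv.reader(io.StringIO(tsv_text), delimiter="\t"))


def redundant_columns(child_tsv, parent_tsv, link_column):
    """List columns shared by two subsets other than the expected link column."""
    shared = set(header(child_tsv)) & set(header(parent_tsv))
    if link_column not in shared:
        raise ValueError(f"link column {link_column!r} is missing from one subset")
    return sorted(shared - {link_column})


# An empty list means the link column is the only shared column.
extra = redundant_columns(SAMPLES_TSV, HARVESTS_TSV, "Lot")
print(extra)
```

Any column name returned by redundant_columns is a candidate for removal from one of the two files.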

Creation of the metadata files
Supplementary files
To allow data to be explored and mined, some minimal but relevant
metadata must be added. Two metadata files are required:
• s_subsets.tsv: a file that associates each data subset with a key
concept corresponding to the main entity of the subset, and that defines the
"obtainedFrom" relations between these concepts
• a_attributes.tsv: a file that annotates each attribute
(concept/variable) with some minimal but relevant
metadata
Data subset files and their associated metadata files must comply with the TSV (tab-separated values) format.
Note: TSV is an alternative to the common comma-separated values (CSV) format, which often causes difficulties because of the need to escape commas.
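To illustrate the note on commas, the short Python sketch below writes the same row once as TSV and once as CSV with the standard csv module. The row content is invented for illustration.

```python
import csv
import io

# A row whose second field contains commas (invented example values).
row = ["S1", "pericarp, frozen, ground", "12.5"]

# Write the row as TSV: the commas need no special treatment.
tsv_buffer = io.StringIO()
csv.writer(tsv_buffer, delimiter="\t", lineterminator="\n").writerow(row)
tsv_line = tsv_buffer.getvalue()

# Write the same row as CSV: the field with commas must be quoted.
csv_buffer = io.StringIO()
csv.writer(csv_buffer, lineterminator="\n").writerow(row)
csv_line = csv_buffer.getvalue()

print(repr(tsv_line))
print(repr(csv_line))

# Reading the TSV line back gives the original fields.
parsed = next(csv.reader(io.StringIO(tsv_line), delimiter="\t"))
```

The TSV line stays free of quoting, whereas the CSV writer has to wrap the second field in double quotes.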

s_subsets.tsv: this metadata file associates a key concept with each data subset file
Each row describes one data subset with:
- a unique rank number for the data subset (1 to 5 in this example; Plants has rank 1)
- the key concept (i.e. the main entity) associated with the subset, as a short name
- the identifier of the central entity of the subset
- the link between 2 subsets (implies an 'obtainedFrom' relation)
- the data file name
Key concept / identifier shown on the slide / data file name:
Plants / PlanteID / plants.tsv
Harvests / Lot / harvests.tsv
Samples / SampleID / samples.tsv
Compounds / SampleID / compounds.tsv
Enzymes / SampleID / enzymes.tsv
Categories: identifier, factor, quantitative, qualitative

a_attributes.tsv: this metadata file annotates each attribute (variable) with
some minimal but relevant metadata
Categories: identifier, factor, quantitative, qualitative
Subsets shown: Plants, Harvests, Samples, Compounds, …
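As a rough illustration of how attribute annotations can be used, the Python sketch below selects attributes by category from a small a_attributes-like table. Only the 'entry' and 'category' columns are named on the slides. The 'subset' and 'attribute' columns, and all the values, are assumptions made for this example and may differ from the real file layout.

```python
import csv
import io

# Hypothetical a_attributes.tsv content. 'entry' and 'category' are named on
# the slides; 'subset' and 'attribute' are assumed column names.
ATTRIBUTES_TSV = (
    "subset\tattribute\tentry\tcategory\n"
    "plants\tPlantID\tplantid\tidentifier\n"
    "plants\tTreatment\ttreatment\tfactor\n"
    "samples\tSampleID\tsampleid\tidentifier\n"
    "samples\tFruitWeight\t\tquantitative\n"
)

# Category names listed on the slides.
CATEGORIES = {"identifier", "factor", "quantitative", "qualitative"}


def attributes_by_category(tsv_text, category):
    """Return (subset, attribute) pairs annotated with the given category."""
    if category not in CATEGORIES:
        raise ValueError(f"unknown category: {category!r}")
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [(r["subset"], r["attribute"]) for r in reader if r["category"] == category]


identifiers = attributes_by_category(ATTRIBUTES_TSV, "identifier")
print(identifiers)
```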

Updating the metadata files
Additional subsets / attributes can be added to s_subsets.tsv and a_attributes.tsv
step by step, as soon as data are produced.

Uploading your datasets in the data repository
Merely dropping data files in the data repository (e.g. NAS) should allow
users to access them through a web API.
No database schema, no programming code and no additional configuration on the server side.
- Your data subset files go into your dataset entry (named 'frim1' as an example) within
the data repository, shown here as Z: (Storage)
- Data capture, minimal effort (PUT): the data repository is mounted on myhost.org
- Data access (GET)
Data subset files and their associated metadata files must comply with the TSV
(tab-separated values) format.

Checking online whether your data subset files are consistent
http://myhost.org/check/frim1
The server (myhost.org) reads the dataset from the data repository (NAS storage).
Many checks can be done automatically for you.
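The slides do not list which checks the online service performs. As an assumption about what such checks might cover, the Python sketch below tests two properties locally: identifiers are unique within a subset, and every link value points to an existing parent identifier. The data and column names are invented.

```python
from collections import Counter

# Invented toy subsets; the third harvest row is deliberately inconsistent.
plants = [{"PlantID": "1"}, {"PlantID": "2"}]
harvests = [
    {"Lot": "A", "PlantID": "1"},
    {"Lot": "B", "PlantID": "2"},
    {"Lot": "B", "PlantID": "3"},
]


def duplicated_identifiers(rows, id_column):
    """Return the identifier values that occur more than once in a subset."""
    counts = Counter(row[id_column] for row in rows)
    return sorted(value for value, n in counts.items() if n > 1)


def dangling_links(child_rows, parent_rows, link_column):
    """Return link values in the child subset that match no parent identifier."""
    parent_ids = {row[link_column] for row in parent_rows}
    return sorted({row[link_column] for row in child_rows} - parent_ids)


print(duplicated_identifiers(harvests, "Lot"))
print(dangling_links(harvests, plants, "PlantID"))
```

Both functions return an empty list when the subset passes the corresponding check.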

Open Data, Access and Mining: web API
Minimal effort (PUT), maximal efficiency (GET)
Experimental workflow: seeding, harvesting, samples preparation, samples analysis
Steps before access (PUT): preparation and cleaning of the data subset files, TSV data format,
data linking, data storage, check
After depositing your complete dataset as described previously:
• Open access is given to your data through the web API
• They are ready to be mined
• No specific code or additional configuration is needed
Example dataset: FRIM1 (*)
(*) https://www.erasysbio.net/index.php?index=266

Open Data, Access and Mining: web API
REST services: hierarchical tree of resource naming (URL)
GET http://myhost.org/getdata/<data format>/<dataset name>/< … >/< … >
Tree levels: <data format>, <dataset name>, <subset> or (<subset>), then <entry> or <category>, then <value>
- <data format>: xml/tsv/json
- <dataset name>: e.g. frim1, for the FRIM1 dataset (*)
- Retrieving data or retrieving metadata, depending on the path
Categories: identifier, factor, quantitative, qualitative, link
(*) https://doi.org/10.5281/zenodo.154041

Open Data, Access and Mining: web API
Field: description (example)
• <data format>: format of the retrieved data; possible values are 'xml', 'tsv' or 'json' (example: xml)
• <dataset name>: short name (tag) of your dataset (example: frim1)
• <subset>: short name of a data subset (example: samples)
• <entry>: name of an attribute entry, defined by the user in the a_attributes.tsv file, column 'entry' (example: sampleid)
• <category>: name of the attribute category, assigned by the user in the a_attributes.tsv file, column 'category';
possible values are 'identifier', 'factor', 'qualitative', 'quantitative' (example: quantitative)
• (<subset>): set of data subsets obtained by merging all the subsets with a lower rank than the
specified subset, following the pathway defined by the "obtainedFrom" links
(example: (samples) = plants + harvests + samples)
• <value>: exact value of the desired entry or category (examples: 1, factor)

Open Data, Access and Mining: web API
• Get the subset list of a dataset
http://myhost.org/getdata/<data format>/<dataset name>
• Get all values within a data subset
http://myhost.org/getdata/<data format>/<dataset name>/<subset>
• Get values within a data subset for a specific value of an entry
http://myhost.org/getdata/<data format>/<dataset name>/<subset>/<entry>/<value>
• Get all values within a set of data subsets
http://myhost.org/getdata/<data format>/<dataset name>/(<subset>)
• Get values within a set of data subsets for a specific value of an entry
http://myhost.org/getdata/<data format>/<dataset name>/(<subset>)/<entry>/<value>
• Get the attribute list within a set of data subsets for a specific category
http://myhost.org/getdata/<data format>/<dataset name>/(<subset>)/<category>
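The URL patterns above can be assembled programmatically. The helper below is a hypothetical convenience function written for this illustration. It is not part of ODAM or Rodam, and it only builds the URL string without contacting any server. The accepted formats follow the 'xml/tsv/json' list shown on the REST services slide.

```python
from urllib.parse import quote

# Data formats as listed on the REST services slide.
DATA_FORMATS = ("xml", "tsv", "json")


def getdata_url(host, data_format, dataset, subset=None, *selectors, merged=False):
    """Build a getdata URL; merged=True wraps the subset in parentheses."""
    if data_format not in DATA_FORMATS:
        raise ValueError(f"unsupported data format: {data_format!r}")
    if subset is None and selectors:
        raise ValueError("selectors require a subset")
    parts = ["getdata", data_format, dataset]
    if subset is not None:
        parts.append(f"({subset})" if merged else subset)
    parts.extend(str(s) for s in selectors)
    # Keep parentheses readable; percent-encode anything else that is unsafe.
    return f"http://{host}/" + "/".join(quote(p, safe="()") for p in parts)


print(getdata_url("myhost.org", "xml", "frim1"))
print(getdata_url("myhost.org", "xml", "frim1", "harvests", "lot", 1))
print(getdata_url("myhost.org", "xml", "frim1", "compounds", "quantitative", merged=True))
```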

Open Data Access via web API: Examples based on FRIM1
Metadata:
http://myhost.org/getdata/xml/frim1
http://myhost.org/getdata/xml/frim1/(compounds)/quantitative
Data:
http://myhost.org/getdata/xml/frim1/plants
http://myhost.org/getdata/xml/frim1/harvests/lot/1

Open Data Access via web API: Examples based on FRIM1
http://myhost.org/getdata/xml/frim1/(samples)/treatment/Control
(samples) = plants + harvests + samples: the set of data subsets obtained by merging all the
subsets with a lower rank than the specified subset, following the pathway defined by the
"obtainedFrom" links.
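To make the meaning of (samples) concrete, the Python sketch below merges a subset with all of its ancestors along the 'obtainedFrom' links and then keeps the rows where the treatment is Control. This is only a local illustration of the idea, not the server's implementation. The in-memory layout, column names and values are invented.

```python
# Invented toy dataset: each subset names its parent and the linking column.
SUBSETS = {
    "plants": {
        "parent": None, "link": None,
        "rows": [{"PlantID": "1", "Treatment": "Control"},
                 {"PlantID": "2", "Treatment": "WaterStress"}],
    },
    "harvests": {
        "parent": "plants", "link": "PlantID",
        "rows": [{"Lot": "A", "PlantID": "1"}, {"Lot": "B", "PlantID": "2"}],
    },
    "samples": {
        "parent": "harvests", "link": "Lot",
        "rows": [{"SampleID": "10", "Lot": "A"},
                 {"SampleID": "11", "Lot": "B"},
                 {"SampleID": "12", "Lot": "A"}],
    },
}


def merged_set(name):
    """Merge a subset with all its ancestors along the obtainedFrom links."""
    subset = SUBSETS[name]
    rows = [dict(r) for r in subset["rows"]]
    while subset["parent"] is not None:
        parent = SUBSETS[subset["parent"]]
        link = subset["link"]
        parent_by_id = {r[link]: r for r in parent["rows"]}
        # Add the parent's columns to every row through the link value.
        rows = [{**parent_by_id[r[link]], **r} for r in rows]
        subset = parent
    return rows


# Local equivalent of .../(samples)/treatment/Control on the toy data.
control = [r for r in merged_set("samples") if r["Treatment"] == "Control"]
print([r["SampleID"] for r in control])
```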

Open Data Access via web API: Application layer
Minimal effort, maximal efficiency
Use existing tools
- Spreadsheets, R studio, BioStatFlow, Galaxy, Cytoscape, …

Retrieving data within R: the R package Rodam

Open Data Access via web API: Rodam package
GET http://www.bordeaux.inra.fr/pmb/getdata/tsv/frim1/(samples)/sample/365
- <data format>: tsv
- <dataset name>: frim1
- (<subset>): (samples)
- <entry>: sample
- <value>: 365
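The slides use the Rodam R package for this request. For readers working outside R, the sketch below shows how the same kind of TSV response could be read with the Python standard library. It is not Rodam, the helper names are hypothetical, and the UTF-8 encoding and the sample response body are assumptions. Whether the server is still reachable at this address is not known, so the online call is left as a comment.

```python
import csv
import io
from urllib.request import urlopen


def parse_tsv(text):
    """Turn a TSV response body into a list of dicts keyed by column name."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def fetch_tsv(url, timeout=30):
    """Download a getdata URL in tsv format and parse it (UTF-8 assumed)."""
    with urlopen(url, timeout=timeout) as response:
        return parse_tsv(response.read().decode("utf-8"))


# Offline demonstration with an invented response body.
sample_body = "SampleID\tTreatment\n365\tControl\n"
records = parse_tsv(sample_body)
print(records[0]["Treatment"])

# Online use (not executed here):
# rows = fetch_tsv("http://www.bordeaux.inra.fr/pmb/getdata/tsv/frim1/(samples)/sample/365")
```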

Open Data Access via web API: Rodam package
Read metadata, i.e. category types within the data
Get the data subset 'activome' along with its metadata
GET http://www.bordeaux.inra.fr/pmb/getdata/tsv/frim1/(activome)/factor
- <data format>: tsv
- <dataset name>: frim1
- (<subset>): (activome)
- <category>: factor

ODAM facilitates the subsequent data mining (Rodam package)
Make both metadata and data available for data mining.
From experimentation / analysis to data / metadata, then to data mining (MFA, rCCA, pLDA, …)
Figure: activome and qNMR_metabo subsets (log10 transformed), Water Stress and Control,
all development stages, all treatments

Minimal effort, maximal efficiency
Use existing tools
- Spreadsheets, R studio, BioStatFlow, Galaxy, Cytoscape, …
Develop, if needed, lightweight tools
- R scripts (Galaxy), lightweight GUI (R shiny)

FRIM - Fruit Integrative Modelling
Data explorer: http://www.bordeaux.inra.fr/pmb/dataexplorer/?ds=frim1
To remove an item from the selection: i) click on it, and then ii) press the 'Suppr' (Delete) key
Explore several possibilities by interacting with the graph

To summarize
1. Preparation and cleaning of the data subset files
2. Classification of each column within its right category
3. Connections between the dataset files based on identifiers
4. Creation of the definition files, namely s_subsets.tsv and a_attributes.tsv
5. Deposit of the dataset files in the data repository
6. Checking online whether your data subset files are consistent
7. Testing online the web services on your dataset
8. Use of the web API through an application layer (R scripts, data explorer, ...)
Data subset files and their associated metadata files must
comply with the TSV (tab-separated values) format.
Note: TSV is an alternative to the common comma-separated values (CSV) format, which often causes difficulties because of the need to escape commas
(See https://en.wikipedia.org/wiki/Tab-separated_values)

Advantages of this approach
Data sharing & data availability
- The "plants" table can be created even before planting the seeds.
- Similarly, the "harvests" table can be created as soon as the harvests are done,
before any analysis.
- Thus, these tables are generated only once in the project, and sharing can be set up
as soon as the seeds are planted. Each analysis then complements the dataset
as soon as it produces its own data subset.
- Data are accessible to everyone as soon as they are produced.
Identifiers centrally managed
- Data are archived and compiled, so a laborious investigation
to find out who holds the right identifiers, etc., is no longer needed.

Facilitate the subsequent publication of data
- Data are already readily available online through the web API.
- Nothing prevents using these data to fill in existing databases, adding more
elaborate annotations.
- Neither administrator privileges nor programming skills are required.
Data capture (PUT): minimal effort. Data analysis/mining (GET): maximum efficiency.

Minimal effort, maximum efficiency
Format the data
- Based on TSV: a choice to keep the scientists' familiar way of working with
worksheets, thus i) using the same tool for both data files and metadata
definition files, ii) requiring no programming skills
Give access through a web services layer
- Based on current standards (REST)
Use existing tools
- Spreadsheets, R studio, BioStatFlow (biostatflow.org), Galaxy, Cytoscape, …
- R scripts, lightweight GUI (R shiny)

Have fun!
https://hub.docker.com/r/odam/getdata/
An online example: http://www.bordeaux.inra.fr/pmb/dataexplorer/
https://github.com/INRA/ODAM
https://cran.r-project.org/package=Rodam
https://zenodo.org/record/154041
