SlideShare una empresa de Scribd logo
1 de 36
Give an open access to your data
and make them ready to be mined
Daniel Jacob
UMR 1332 BFP – Metabolism Group
Bordeaux Metabolomics Facility
May 2016
Open Data for Access and Mining
A data explorer as bonus
EDTMS
ODAM
Daniel Jacob – INRA UMR 1332 –May 2016
The experimental context: needs / wishesseeding harvesting
samples
preparation
samples analysis
Sample
identifiers
2
Experiment
Data Tables
Experiment Design
Web API
Develop if needed, lightweight tools
- R scripts (Galaxy), lightweight GUI
(R shiny)
Make both metadata and data
available for data mining
identifiers centrally
managed
data sharing & data availability
facilitate the subsequent
data mining
1
2
3
EDTMS
ODAM Open Data for Access and Mining : The core idea in one shot
Daniel Jacob – INRA UMR 1332 –May 2016
Data repository
Data capture Minimal effort (PUT)
PUT
myhost.org
http://myhost.org/
mount
GET
Implementation of an
Experiment Data Tables Management System
(EDTMS)
Experiment
Data Tables
Merely dropping data files in a data
repository (e.g. a local NAS or distant
storage space) should allow users to
access them by web API
Data can be downloaded,
explored and mined
No database schema, no programming code and no additional configuration on the server side.
Open Data for Access and Mining : The core idea in one shot
EDTMS
ODAM
3
Daniel Jacob – INRA UMR 1332 –May 2016
plants.tsv
harvests.tsv
samples.tsv
compounds.tsv
Data subset files
enzymes.tsv
• Whatever the kind of experiment, this assumes a design of experiment
(DoE) involving individuals, samples or whatever things, as the main
objects of study (e.g. plants, tissues, bacteria, …)
• This also assumes the observation of dependent variables resulting of
effects of some controlled experimental factors.
• Moreover, the objects of study have usually an identifier for each of
them, and the variables can be quantitative or qualitative.
• We can have either one object type of study or several kinds, but in
this latter case, it must exist a relationship between object types that
we assume of “obtainedFrom" type.
Preparation and cleaning of the data sub-sets of files
EDTMS
ODAM
4
Daniel Jacob – INRA UMR 1332 –May 2016
plants.tsv
harvests.tsv
samples.tsv
compounds.tsv
Classification of each column within its right category
enzymes.tsv
Data subset files
factor
quantitative
qualitative
identifier
link
categories
EDTMS
ODAM
5
Data subsets files and their associated metadata files must be compliant
with the TSV standard (Tab-Separator-Values)
• You have to organize your data subsets so that links could be established between them.
• In practical, it means to add a column containing the identifiers corresponding to the entity
to which you want to connect the subset, implying a ‘obtainedFrom’ relation.
• It is to be noted that this duplication of identifiers must be the only redundant
information, through all data subsets.
Daniel Jacob – INRA UMR 1332 –May 2016
plants.tsv harvests.tsv
samples.tsv
enzymes.tsv
Data subset files
compounds.tsv
Plants
Harvests
Samples
Compounds
Enzymes
Connections between the dataset files based on identifiers
Entities
(concepts)
Link between 2 subsets being carried out from identifiers
(implies a ‘obtainedFrom’ relation)
Identifier of the central entity of the subset
EDTMS
ODAM
factor
quantitative
qualitative
identifier
link
categories
6
Daniel Jacob – INRA UMR 1332 –May 2016
Supplementary files
In order to allow data to be explored and mined, we have to adjoin some
minimal but relevant metadata:
For that, 2 metadata files are required
• s_subsets.tsv: a file allowing to associate with each subset of data a key
concept corresponding to the main entity of the subset and the relations
of the type "obtainedFrom" between these concepts
• a_attributes.tsv: a metadata file allowing each attribute
(concept/variable) to be annotated with some minimal but relevant
metadata
Creation of the metadata files
EDTMS
ODAM
7
Data subsets files and their associated metadata files must be compliant with the TSV standard (Tab-Separator-Values)Note:
TSV is an alternative to the common comma-separated values (CSV) format, which often causes difficulties because of the need to escape commas
Daniel Jacob – INRA UMR 1332 –May 2016
s_subsets.tsv This metadata file allows to associate a key concept to each data subset file
Creation of the metadata files
EDTMS
ODAM
8
Plants
Compounds
Enzymes
Harvests
Samples
plants.tsv
PlanteID
harvests.tsv
Lot samples.tsv
SampleID
compounds.tsv
enzymes.tsv
SampleID
SampleID
1
2
3
4
5
Identifier of the central entity of the subset
Link between 2 subsets (implies a ‘obtainedFrom’ relation)
Unique rank number of the data subset
Key concept (i.e. the main entity) associated to the subset in the form of a short name
Plants1
factor
quantitative
qualitative
identifier
categories
PlanteID plants.tsv
Data file name
Daniel Jacob – INRA UMR 1332 –May 2016
a_attributes.tsv This metadata file allows each attribute (variable) to be annotated with
some minimal but relevant metadata
Creation of the metadata files
EDTMS
ODAM
9
factor
quantitative
qualitative
identifier
categories
Plants
Harvests
Samples
Compounds
…
…
Daniel Jacob – INRA UMR 1332 –May 2016
s_subsets.tsv
a_attributes.tsv
…
…
Additional subsets/ attributes can be
added step by step, as soon as data
are produced.
Updating the metadata files
EDTMS
ODAM
Daniel Jacob – INRA UMR 1332 –May 2016
Uploading your datasets in the data repository
EDTMS
ODAM
No database schema, no programming code and no additional configuration on the server side.
Your data subset files
Your dataset entry (named
‘frim1’ as example) within
the data repository
Z: (Storage)
Merely dropping data files on the data repository (e.g. NAS) should allow
users to access them by web API
Data subsets files and their
associated metadata files must be
compliant with the TSV standard
(Tab-Separator-Values)
Data repository
PUT
myhost.orgmount
GET
Data capture
Minimal effort (PUT)
Daniel Jacob – INRA UMR 1332 –May 2016
http://myhost.org/check/frim1
myhost.org
StorageDataRepos
NAS
Checking online if your the data subset files are consistent
EDTMS
ODAM
Many test checks can
be automatically
done for you
Daniel Jacob – INRA UMR 1332 –May 2016
EDTMS
ODAM
Data storage
seeding
harvesting samples analysis
samples
preparation
13
GET
, maximal efficiency (GET)
After depositing your complete dataset as described previously:
• An open access is given to your data through web API
• They are ready to be mined
• No specific code or additional configuration are needed (*) https://www.erasysbio.net/index.php?index=266
minimal effort (PUT)
PUT
Format
TSV
Data
Data Linking
Preparation and cleaning of the data sub-sets of files
FRIM1(*)
Check
Open Data, Access and Mining : web API
Daniel Jacob – INRA UMR 1332 –May 2016
Data
Format
TSV
EDTMS
ODAM
Data linking
Open Data, Access and Mining : web API
REST Services: hierarchical tree of resource naming (URL)
Retrieving data
Retrieving metadata
<data format>
<dataset name>
<subset>
(<subset>)
<entry><category>
<value> <value> <value>
<entry>
GET http://myhost.org/getdata/<data format>/<dataset name>/< … >/< … >
factor
quantitative
qualitative
identifier
link
categories
FRIM1 (*)
xml/tsv/json
frim1
14
(*) https://doi.org/10.5281/zenodo.154041
Daniel Jacob – INRA UMR 1332 –May 2016
EDTMS
ODAM Open Data, Access and Mining : web API
REST Services: hierarchical tree of resource naming (URL)
15
GET http://myhost.org/getdata/<data format>/<dataset name>/< … >/< … >
Field Description Examples
<data format> format of the retrieved data; possible values are: 'xml' or 'csv' xml
<dataset name> Short name (tag) of your dataset frim1
<subset> Short name of a data subset samples
<entry> Name of an attribute entry (defined by the user in the a_attribute file
(column ‘entry’)
sampleid
<category> Name of the attribute category; (assigned by the user in the a_attribute file
(column ‘category’)
possible values are: ‘identifier’, ‘factor’, ‘qualitative’, ‘quantitative’
quantitative
(<subset>) Set of data subsets by merging all the subsets with lower rank than the
specified subset and following the pathway defined by the "is_part_of"
links.
(samples) 
plants + harvests
+ samples
<value> Exact value of the desired entry or category 1, factor
Daniel Jacob – INRA UMR 1332 –May 2016
EDTMS
ODAM Open Data, Access and Mining : web API
REST Services: hierarchical tree of resource naming (URL)
16
GET http://myhost.org/getdata/<data format>/<dataset name>/< … >/< … >
http://myhost.org/getdata/<data format>/<dataset name>/<subset>/<entry>/<value>
http://myhost.org/getdata/<data format>/<dataset name>/(<subset>)/<category>
http://myhost.org/getdata/<data format>/<dataset name>
http://myhost.org/getdata/<data format>/<dataset name>/(<subset>)/<entry>/<value>
http://myhost.org/getdata/<data format>/<dataset name>/<subset>
http://myhost.org/getdata/<data format>/<dataset name>/(<subset>)
• Get the subset list of a dataset
• Get all values within a data subset
• Get values within a data subset for a specific value of an entry
• Get all values within a set of data subsets
• Get values within a set of data subsets for a specific value of an entry
• Get the attribute list within a set of data subsets for a specific category
Daniel Jacob – INRA UMR 1332 –May 2016
http://myhost.org/getdata/xml/frim1 http://myhost.org/getdata/xml/frim1/plants
http://myhost.org/getdata/xml/frim1/harvests/lot/1
http://myhost.org/getdata/xml/frim1/(compounds)/quantitative
Metadata
Metadata
Data
Data
Open Data Access via web API: Examples based on FRIM1
EDTMS
ODAM
FRIM1
17
Daniel Jacob – INRA UMR 1332 –May 2016
http://myhost.org/getdata/xml/frim1/(samples)/treatment/Control
Set of data subsets by merging all the subsets with lower rank than the specified
subset and following the pathway defined by the “obtainedFrom" links.
(samples)  plants + harvests + samples
Open Data Access via web API: Examples based on FRIM1
EDTMS
ODAM
FRIM1
18
Daniel Jacob – INRA UMR 1332 –May 2016
Data
Format
TSV
minimal effort, maximal efficiency
EDTMS
ODAM
Data linking
Open Data Access via web API: Application layer
FRIM1
19
…
Use existing tools
- Spreadsheets, R studio,
BioStatFlow, Galaxy,
Cytoscape, …
Daniel Jacob – INRA UMR 1332 –May 2016
Retrieving Data within R
Open Data Access via web API: Application layer
The R package
Rodam
EDTMS
ODAM
20
Daniel Jacob – INRA UMR 1332 –May 2016
Open Data Access via web API Rodam package
21
<data format>
<dataset name>
<subset>
(<subset>)
<entry><category>
<value> <value> <value>
<entry>
tsv
frim1
samples
sample
365
GET http://www.bordeaux.inra.fr/pmb/getdata/tsv/frim1/(samples)/sample/365
Daniel Jacob – INRA UMR 1332 –May 2016
Open Data Access via web API
Read metadata
i.e. category types within the data
Get the data subset ‘activome’
along with its metadata
22
<data format>
<dataset name>
<subset>
(<subset>)
<entry>
<category>
<value>
<value>
<entry>
tsv
frim1
activome
factor
GET http://www.bordeaux.inra.fr/pmb/getdata/tsv/frim1/(activome)/factor
Rodam package
Daniel Jacob – INRA UMR 1332 –May 2016
Open Data Access via web API
23
Rodam package
Daniel Jacob – INRA UMR 1332 –May 2016
Data / Metadata
Data Mining
?
Make both
metadata and data
available for
data mining.
Experimentation
/ Analysis
MFA
rCCA
pLDA
…
Open Data Access via web API
activome qNMR_metabo
Water StressControl
ODAM facilitates the subsequent data mining
All Dev. Stages
All Treatments
ODAM facilitates the subsequent data mining
(log10 transformed)
24
Rodam package
Daniel Jacob – INRA UMR 1332 –May 2016
Develop if needed, lightweight tools
- R scripts (Galaxy), lightweight GUI (R shiny)
minimal effort, maximal efficiency
…
Use existing tools
- Spreadsheets, R studio,
BioStatFlow, Galaxy,
Cytoscape, …
EDTMS
ODAM
Data
Format
TSV
Data linking
Open Data Access via web API: Application layer
FRIM1
25
Daniel Jacob – INRA UMR 1332 –May 2016
FRIM - Fruit Integrative Modelling
EDTMS
ODAM
26
http://www.bordeaux.inra.fr/pmb/dataexplorer/?ds=frim1
Daniel Jacob – INRA UMR 1332 –May 2016
FRIM - Fruit Integrative Modelling
EDTMS
ODAM
27
http://www.bordeaux.inra.fr/pmb/dataexplorer/?ds=frim1
Daniel Jacob – INRA UMR 1332 –May 2016
FRIM - Fruit Integrative Modelling
EDTMS
ODAM
28
Daniel Jacob – INRA UMR 1332 –May 2016
FRIM - Fruit Integrative Modelling
EDTMS
ODAM
29
To remove an item
from the selection: i)
click on it, and then
ii) click on the
‘Suppr’ key
Daniel Jacob – INRA UMR 1332 –May 2016
FRIM - Fruit Integrative Modelling
EDTMS
ODAM
30
Daniel Jacob – INRA UMR 1332 –May 2016
FRIM - Fruit Integrative Modelling
EDTMS
ODAM
31
Explore several
possibilities by
interacting with
the graph
Daniel Jacob – INRA UMR 1332 –May 2016
To summarize
1. Preparation and cleaning of the data sub-sets of files
2. Classification of each column within its right category
3. Connections between the dataset files based on identifiers
4. Creation of the definition files namely s_subsets.tsv and a_attributes.tsv
5. Deposit of the dataset files in the data repository
6. Checking online if your the data subset files are consistent
7. Testing online the web-services on your dataset
8. Use of the web API through an application layer (R scripts, data explorer, ... )
EDTMS
ODAM
Data subsets files and their associated metadata files must be
compliant with the TSV standard (Tab-Separator-Values)
Note:
TSV is an alternative to the common comma-separated values (CSV) format, which often causes difficulties because of the need to escape commas
(See https://en.wikipedia.org/wiki/Tab-separated_values)
Daniel Jacob – INRA UMR 1332 –May 2016
Advantages of this approach
data sharing & data availability
- The array of the "plants" may be created even before planting the seeds.
- Similarly, the array of the "harvests" can be created as soon as the harvests are done,
and this before any analysis.
- Thus, these arrays are generated only once in the project and we can set up the
sharing soon the seed planting. Then each analysis comes to complement the set of
data as soon as they produce their own sub-dataset.
- data are accessible to everyone as soon as they are produced,
identifiers centrally managed
- data are archived and compiled, so that it becomes useless to proceed a laborious
investigation to find out who possesses the right identifiers, etc.
EDTMS
ODAM
seeding harvesting samples analysis
Sample
identifiers
samples
preparation
Daniel Jacob – INRA UMR 1332 –May 2016
Advantages of this approach
facilitate the subsequent publication of data
- data are already readily available online by web API,
- But nothing prevents to take this data to fill in existing databases, by adjoining more
elaborate annotations.
- Neither administrator privileges nor any programmatic skills are required
EDTMS
ODAM
Data
Format
TSV
Data linking
PUT
GET
Data capture
Minimal effortData analysis/mining
Maximum efficiency
Daniel Jacob – INRA UMR 1332 –May 2016
minimal effort, maximum efficiency
Format the data
- Based on TSV: choice to keep the good old way of scientist to use
worksheets, thus i) using the same tool for both data files and metadata
definition files, ii) no programmatic skill are required
Give an access through a web services layer
- based on current standards (REST)
Use existing tools
- Spreadsheets, R studio, BioStatFlow, Galaxy, Cytoscape, …
Develop if needed, lightweight tools
- R scripts, lightweight GUI (R shiny)
Advantages of this approach
biostatflow.org
EDTMS
ODAM
Daniel Jacob – INRA UMR 1332 –May 2016
Have a good fun !!
Daniel Jacob
UMR 1332 BFP – Metabolism Group
Bordeaux Metabolomics Facility
May 2016
Open Data for Access and Mining
https://hub.docker.com/r/odam/getdata/
http://www.bordeaux.inra.fr/pmb/dataexplorer/
https://github.com/INRA/ODAM
https://cran.r-project.org/package=Rodam
https://zenodo.org/record/154041
An online example

Más contenido relacionado

La actualidad más candente

Introduction To Data Mining
Introduction To Data Mining   Introduction To Data Mining
Introduction To Data Mining Phi Jack
 
Cluster2
Cluster2Cluster2
Cluster2work
 
data mining and data warehousing
data mining and data warehousingdata mining and data warehousing
data mining and data warehousingSunny Gandhi
 
Data miningppt378
Data miningppt378Data miningppt378
Data miningppt378nitttin
 
Data mining and its concepts
Data mining and its conceptsData mining and its concepts
Data mining and its conceptsBharadwaj Sharma
 
Introduction to Datamining Concept and Techniques
Introduction to Datamining Concept and TechniquesIntroduction to Datamining Concept and Techniques
Introduction to Datamining Concept and TechniquesSơn Còm Nhom
 
Data Mining Concepts and Techniques
Data Mining Concepts and TechniquesData Mining Concepts and Techniques
Data Mining Concepts and TechniquesPratik Tambekar
 
introduction to data mining tutorial
introduction to data mining tutorial introduction to data mining tutorial
introduction to data mining tutorial Salah Amean
 
Data Warehouse and Data Mining
Data Warehouse and Data MiningData Warehouse and Data Mining
Data Warehouse and Data MiningRanak Ghosh
 
Dwdmunit1 a
Dwdmunit1 aDwdmunit1 a
Dwdmunit1 abhagathk
 
Cssu dw dm
Cssu dw dmCssu dw dm
Cssu dw dmsumit621
 
Introduction to Data mining
Introduction to Data miningIntroduction to Data mining
Introduction to Data miningHadi Fadlallah
 
Data warehouse and olap technology
Data warehouse and olap technologyData warehouse and olap technology
Data warehouse and olap technologyDataminingTools Inc
 

La actualidad más candente (20)

Introduction To Data Mining
Introduction To Data Mining   Introduction To Data Mining
Introduction To Data Mining
 
02 Data Mining
02 Data Mining02 Data Mining
02 Data Mining
 
Cluster2
Cluster2Cluster2
Cluster2
 
data mining and data warehousing
data mining and data warehousingdata mining and data warehousing
data mining and data warehousing
 
Introduction to Data Mining
Introduction to Data MiningIntroduction to Data Mining
Introduction to Data Mining
 
Data miningppt378
Data miningppt378Data miningppt378
Data miningppt378
 
Data mining and its concepts
Data mining and its conceptsData mining and its concepts
Data mining and its concepts
 
Introduction to Datamining Concept and Techniques
Introduction to Datamining Concept and TechniquesIntroduction to Datamining Concept and Techniques
Introduction to Datamining Concept and Techniques
 
Data Mining Concepts and Techniques
Data Mining Concepts and TechniquesData Mining Concepts and Techniques
Data Mining Concepts and Techniques
 
introduction to data mining tutorial
introduction to data mining tutorial introduction to data mining tutorial
introduction to data mining tutorial
 
Data Warehouse and Data Mining
Data Warehouse and Data MiningData Warehouse and Data Mining
Data Warehouse and Data Mining
 
Dwdmunit1 a
Dwdmunit1 aDwdmunit1 a
Dwdmunit1 a
 
Chapter 1: Introduction to Data Mining
Chapter 1: Introduction to Data MiningChapter 1: Introduction to Data Mining
Chapter 1: Introduction to Data Mining
 
Cssu dw dm
Cssu dw dmCssu dw dm
Cssu dw dm
 
Introduction to Data mining
Introduction to Data miningIntroduction to Data mining
Introduction to Data mining
 
Database
DatabaseDatabase
Database
 
Data mining
Data miningData mining
Data mining
 
3 Data Mining Tasks
3  Data Mining Tasks3  Data Mining Tasks
3 Data Mining Tasks
 
Data warehouse and olap technology
Data warehouse and olap technologyData warehouse and olap technology
Data warehouse and olap technology
 
Data mining
Data miningData mining
Data mining
 

Destacado

How I data mined my text message history
How I data mined my text message historyHow I data mined my text message history
How I data mined my text message historyJoe Cannatti Jr.
 
Data Mining: Concepts and techniques classification _chapter 9 :advanced methods
Data Mining: Concepts and techniques classification _chapter 9 :advanced methodsData Mining: Concepts and techniques classification _chapter 9 :advanced methods
Data Mining: Concepts and techniques classification _chapter 9 :advanced methodsSalah Amean
 
Data Mining: Mining ,associations, and correlations
Data Mining: Mining ,associations, and correlationsData Mining: Mining ,associations, and correlations
Data Mining: Mining ,associations, and correlationsDatamining Tools
 
Data Mining:Concepts and Techniques, Chapter 8. Classification: Basic Concepts
Data Mining:Concepts and Techniques, Chapter 8. Classification: Basic ConceptsData Mining:Concepts and Techniques, Chapter 8. Classification: Basic Concepts
Data Mining:Concepts and Techniques, Chapter 8. Classification: Basic ConceptsSalah Amean
 
3.2 partitioning methods
3.2 partitioning methods3.2 partitioning methods
3.2 partitioning methodsKrish_ver2
 
Mining Frequent Patterns, Association and Correlations
Mining Frequent Patterns, Association and CorrelationsMining Frequent Patterns, Association and Correlations
Mining Frequent Patterns, Association and CorrelationsJustin Cletus
 
1.8 discretization
1.8 discretization1.8 discretization
1.8 discretizationKrish_ver2
 
Data Warehousing and Data Mining
Data Warehousing and Data MiningData Warehousing and Data Mining
Data Warehousing and Data Miningidnats
 
Data Mining: Classification and analysis
Data Mining: Classification and analysisData Mining: Classification and analysis
Data Mining: Classification and analysisDataminingTools Inc
 
Data cube computation
Data cube computationData cube computation
Data cube computationRashmi Sheikh
 
Support Vector Machines for Classification
Support Vector Machines for ClassificationSupport Vector Machines for Classification
Support Vector Machines for ClassificationPrakash Pimpale
 
Data mining (lecture 1 & 2) conecpts and techniques
Data mining (lecture 1 & 2) conecpts and techniquesData mining (lecture 1 & 2) conecpts and techniques
Data mining (lecture 1 & 2) conecpts and techniquesSaif Ullah
 
Data mining slides
Data mining slidesData mining slides
Data mining slidessmj
 

Destacado (16)

How I data mined my text message history
How I data mined my text message historyHow I data mined my text message history
How I data mined my text message history
 
Data Mining: Concepts and techniques classification _chapter 9 :advanced methods
Data Mining: Concepts and techniques classification _chapter 9 :advanced methodsData Mining: Concepts and techniques classification _chapter 9 :advanced methods
Data Mining: Concepts and techniques classification _chapter 9 :advanced methods
 
Data Mining: Mining ,associations, and correlations
Data Mining: Mining ,associations, and correlationsData Mining: Mining ,associations, and correlations
Data Mining: Mining ,associations, and correlations
 
Data Mining:Concepts and Techniques, Chapter 8. Classification: Basic Concepts
Data Mining:Concepts and Techniques, Chapter 8. Classification: Basic ConceptsData Mining:Concepts and Techniques, Chapter 8. Classification: Basic Concepts
Data Mining:Concepts and Techniques, Chapter 8. Classification: Basic Concepts
 
3.2 partitioning methods
3.2 partitioning methods3.2 partitioning methods
3.2 partitioning methods
 
Mining Frequent Patterns, Association and Correlations
Mining Frequent Patterns, Association and CorrelationsMining Frequent Patterns, Association and Correlations
Mining Frequent Patterns, Association and Correlations
 
Data visualization
Data visualizationData visualization
Data visualization
 
1.8 discretization
1.8 discretization1.8 discretization
1.8 discretization
 
Data Warehousing and Data Mining
Data Warehousing and Data MiningData Warehousing and Data Mining
Data Warehousing and Data Mining
 
Data Mining: Classification and analysis
Data Mining: Classification and analysisData Mining: Classification and analysis
Data Mining: Classification and analysis
 
Data cube computation
Data cube computationData cube computation
Data cube computation
 
Support Vector Machines for Classification
Support Vector Machines for ClassificationSupport Vector Machines for Classification
Support Vector Machines for Classification
 
Data Mining: Association Rules Basics
Data Mining: Association Rules BasicsData Mining: Association Rules Basics
Data Mining: Association Rules Basics
 
Data mining (lecture 1 & 2) conecpts and techniques
Data mining (lecture 1 & 2) conecpts and techniquesData mining (lecture 1 & 2) conecpts and techniques
Data mining (lecture 1 & 2) conecpts and techniques
 
Data mining slides
Data mining slidesData mining slides
Data mining slides
 
Data mining
Data miningData mining
Data mining
 

Similar a Odam: Open Data, Access and Mining

Enabling Precise Identification and Citability of Dynamic Data: Recommendatio...
Enabling Precise Identification and Citability of Dynamic Data: Recommendatio...Enabling Precise Identification and Citability of Dynamic Data: Recommendatio...
Enabling Precise Identification and Citability of Dynamic Data: Recommendatio...Research Data Alliance
 
Make your data great now
Make your data great nowMake your data great now
Make your data great nowDaniel JACOB
 
Environment Canada's Data Management Service
Environment Canada's Data Management ServiceEnvironment Canada's Data Management Service
Environment Canada's Data Management ServiceSafe Software
 
Enabling Precise Identification and Citability of Dynamic Data: Recommendatio...
Enabling Precise Identification and Citability of Dynamic Data: Recommendatio...Enabling Precise Identification and Citability of Dynamic Data: Recommendatio...
Enabling Precise Identification and Citability of Dynamic Data: Recommendatio...LEARN Project
 
Force11 JDDCP workshop presentation, @ Force2015, Oxford
Force11 JDDCP workshop presentation, @ Force2015, OxfordForce11 JDDCP workshop presentation, @ Force2015, Oxford
Force11 JDDCP workshop presentation, @ Force2015, OxfordMark Wilkinson
 
A Generic Scientific Data Model and Ontology for Representation of Chemical Data
A Generic Scientific Data Model and Ontology for Representation of Chemical DataA Generic Scientific Data Model and Ontology for Representation of Chemical Data
A Generic Scientific Data Model and Ontology for Representation of Chemical DataStuart Chalk
 
Make your data great again - Ver 2
Make your data great again - Ver 2Make your data great again - Ver 2
Make your data great again - Ver 2Daniel JACOB
 
Dataset description: DCAT and other vocabularies
Dataset description: DCAT and other vocabulariesDataset description: DCAT and other vocabularies
Dataset description: DCAT and other vocabulariesValeria Pesce
 
Dataset Catalogs as a Foundation for FAIR* Data
Dataset Catalogs as a Foundation for FAIR* DataDataset Catalogs as a Foundation for FAIR* Data
Dataset Catalogs as a Foundation for FAIR* DataTom Plasterer
 
ChemConnect: Characterizing CombusAon KineAc Data with ontologies and meta-­‐...
ChemConnect: Characterizing CombusAon KineAc Data with ontologies and meta-­‐...ChemConnect: Characterizing CombusAon KineAc Data with ontologies and meta-­‐...
ChemConnect: Characterizing CombusAon KineAc Data with ontologies and meta-­‐...Edward Blurock
 
DataFinder concepts and example: General (20100503)
DataFinder concepts and example: General (20100503)DataFinder concepts and example: General (20100503)
DataFinder concepts and example: General (20100503)Data Finder
 
Jeff Grethe: CAMERA
Jeff Grethe: CAMERAJeff Grethe: CAMERA
Jeff Grethe: CAMERAIddo
 
eTRIKS Data Harmonization Service Platform
eTRIKS Data Harmonization Service PlatformeTRIKS Data Harmonization Service Platform
eTRIKS Data Harmonization Service Platformibemam
 
Big data & hadoop framework
Big data & hadoop frameworkBig data & hadoop framework
Big data & hadoop frameworkTu Pham
 
A Look into the Apache OODT Ecosystem
A Look into the Apache OODT EcosystemA Look into the Apache OODT Ecosystem
A Look into the Apache OODT EcosystemChris Mattmann
 
Data Wrangling and Visualization Using Python
Data Wrangling and Visualization Using PythonData Wrangling and Visualization Using Python
Data Wrangling and Visualization Using PythonMOHITKUMAR1379
 
Metadata & brokering - a modern approach #2
Metadata & brokering - a modern approach #2Metadata & brokering - a modern approach #2
Metadata & brokering - a modern approach #2Daniele Bailo
 
Building a modern Application with DataFrames
Building a modern Application with DataFramesBuilding a modern Application with DataFrames
Building a modern Application with DataFramesDatabricks
 

Similar a Odam: Open Data, Access and Mining (20)

Enabling Precise Identification and Citability of Dynamic Data: Recommendatio...
Enabling Precise Identification and Citability of Dynamic Data: Recommendatio...Enabling Precise Identification and Citability of Dynamic Data: Recommendatio...
Enabling Precise Identification and Citability of Dynamic Data: Recommendatio...
 
Chachra, "Improving Discovery Systems Through Post Processing of Harvested Data"
Chachra, "Improving Discovery Systems Through Post Processing of Harvested Data"Chachra, "Improving Discovery Systems Through Post Processing of Harvested Data"
Chachra, "Improving Discovery Systems Through Post Processing of Harvested Data"
 
Make your data great now
Make your data great nowMake your data great now
Make your data great now
 
Environment Canada's Data Management Service
Environment Canada's Data Management ServiceEnvironment Canada's Data Management Service
Environment Canada's Data Management Service
 
Enabling Precise Identification and Citability of Dynamic Data: Recommendatio...
Enabling Precise Identification and Citability of Dynamic Data: Recommendatio...Enabling Precise Identification and Citability of Dynamic Data: Recommendatio...
Enabling Precise Identification and Citability of Dynamic Data: Recommendatio...
 
Force11 JDDCP workshop presentation, @ Force2015, Oxford
Force11 JDDCP workshop presentation, @ Force2015, OxfordForce11 JDDCP workshop presentation, @ Force2015, Oxford
Force11 JDDCP workshop presentation, @ Force2015, Oxford
 
The CIARD RINGValeri
The CIARD RINGValeriThe CIARD RINGValeri
The CIARD RINGValeri
 
A Generic Scientific Data Model and Ontology for Representation of Chemical Data
A Generic Scientific Data Model and Ontology for Representation of Chemical DataA Generic Scientific Data Model and Ontology for Representation of Chemical Data
A Generic Scientific Data Model and Ontology for Representation of Chemical Data
 
Make your data great again - Ver 2
Make your data great again - Ver 2Make your data great again - Ver 2
Make your data great again - Ver 2
 
Dataset description: DCAT and other vocabularies
Dataset description: DCAT and other vocabulariesDataset description: DCAT and other vocabularies
Dataset description: DCAT and other vocabularies
 
Dataset Catalogs as a Foundation for FAIR* Data
Dataset Catalogs as a Foundation for FAIR* DataDataset Catalogs as a Foundation for FAIR* Data
Dataset Catalogs as a Foundation for FAIR* Data
 
ChemConnect: Characterizing CombusAon KineAc Data with ontologies and meta-­‐...
ChemConnect: Characterizing CombusAon KineAc Data with ontologies and meta-­‐...ChemConnect: Characterizing CombusAon KineAc Data with ontologies and meta-­‐...
ChemConnect: Characterizing CombusAon KineAc Data with ontologies and meta-­‐...
 
DataFinder concepts and example: General (20100503)
DataFinder concepts and example: General (20100503)DataFinder concepts and example: General (20100503)
DataFinder concepts and example: General (20100503)
 
Jeff Grethe: CAMERA
Jeff Grethe: CAMERAJeff Grethe: CAMERA
Jeff Grethe: CAMERA
 
eTRIKS Data Harmonization Service Platform
eTRIKS Data Harmonization Service PlatformeTRIKS Data Harmonization Service Platform
eTRIKS Data Harmonization Service Platform
 
Big data & hadoop framework
Big data & hadoop frameworkBig data & hadoop framework
Big data & hadoop framework
 
A Look into the Apache OODT Ecosystem
A Look into the Apache OODT EcosystemA Look into the Apache OODT Ecosystem
A Look into the Apache OODT Ecosystem
 
Data Wrangling and Visualization Using Python
Data Wrangling and Visualization Using PythonData Wrangling and Visualization Using Python
Data Wrangling and Visualization Using Python
 
Metadata & brokering - a modern approach #2
Metadata & brokering - a modern approach #2Metadata & brokering - a modern approach #2
Metadata & brokering - a modern approach #2
 
Building a modern Application with DataFrames
Building a modern Application with DataFramesBuilding a modern Application with DataFrames
Building a modern Application with DataFrames
 

Último

2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSINGmarianagonzalez07
 
Call Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceCall Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceSapana Sha
 
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...Florian Roscheck
 
DBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfDBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfJohn Sterrett
 
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfPredicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfBoston Institute of Analytics
 
Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Colleen Farrelly
 
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理e4aez8ss
 
Machine learning classification ppt.ppt
Machine learning classification  ppt.pptMachine learning classification  ppt.ppt
Machine learning classification ppt.pptamreenkhanum0307
 
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...Biometric Authentication: The Evolution, Applications, Benefits and Challenge...
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...GQ Research
 
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档208367051
 
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort servicejennyeacort
 
Multiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfMultiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfchwongval
 
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)jennyeacort
 
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...ssuserf63bd7
 
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDINTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDRafezzaman
 
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...limedy534
 
ASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel CanterASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel Cantervoginip
 
Learn How Data Science Changes Our World
Learn How Data Science Changes Our WorldLearn How Data Science Changes Our World
Learn How Data Science Changes Our WorldEduminds Learning
 

Último (20)

2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
 
Call Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceCall Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts Service
 
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
 
DBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfDBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdf
 
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfPredicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
 
Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024
 
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
 
Machine learning classification ppt.ppt
Machine learning classification  ppt.pptMachine learning classification  ppt.ppt
Machine learning classification ppt.ppt
 
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...Biometric Authentication: The Evolution, Applications, Benefits and Challenge...
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...
 
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
 
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
 
Multiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfMultiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdf
 
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
 
Call Girls in Saket 99530🔝 56974 Escort Service
Call Girls in Saket 99530🔝 56974 Escort ServiceCall Girls in Saket 99530🔝 56974 Escort Service
Call Girls in Saket 99530🔝 56974 Escort Service
 
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
 
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
 
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDINTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
 
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
 
ASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel CanterASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel Canter
 
Learn How Data Science Changes Our World
Learn How Data Science Changes Our WorldLearn How Data Science Changes Our World
Learn How Data Science Changes Our World
 

Odam: Open Data, Access and Mining

  • 1. Give an open access to your data and make them ready to be mined Daniel Jacob UMR 1332 BFP – Metabolism Group Bordeaux Metabolomics Facility May 2016 Open Data for Access and Mining A data explorer as bonus EDTMS ODAM
  • 2. Daniel Jacob – INRA UMR 1332 –May 2016 The experimental context: needs / wishesseeding harvesting samples preparation samples analysis Sample identifiers 2 Experiment Data Tables Experiment Design Web API Develop if needed, lightweight tools - R scripts (Galaxy), lightweight GUI (R shiny) Make both metadata and data available for data mining identifiers centrally managed data sharing & data availability facilitate the subsequent data mining 1 2 3 EDTMS ODAM Open Data for Access and Mining : The core idea in one shot
  • 3. Daniel Jacob – INRA UMR 1332 –May 2016 Data repository Data capture Minimal effort (PUT) PUT myhost.org http://myhost.org/ mount GET Implementation of an Experiment Data Tables Management System (EDTMS) Experiment Data Tables Merely dropping data files in a data repository (e.g. a local NAS or distant storage space) should allow users to access them by web API Data can be downloaded, explored and mined No database schema, no programming code and no additional configuration on the server side. Open Data for Access and Mining : The core idea in one shot EDTMS ODAM 3
  • 4. Daniel Jacob – INRA UMR 1332 –May 2016 plants.tsv harvests.tsv samples.tsv compounds.tsv Data subset files enzymes.tsv • Whatever the kind of experiment, this assumes a design of experiment (DoE) involving individuals, samples or whatever things, as the main objects of study (e.g. plants, tissues, bacteria, …) • This also assumes the observation of dependent variables resulting of effects of some controlled experimental factors. • Moreover, the objects of study have usually an identifier for each of them, and the variables can be quantitative or qualitative. • We can have either one object type of study or several kinds, but in this latter case, it must exist a relationship between object types that we assume of “obtainedFrom" type. Preparation and cleaning of the data sub-sets of files EDTMS ODAM 4
  • 5. Daniel Jacob – INRA UMR 1332 –May 2016 plants.tsv harvests.tsv samples.tsv compounds.tsv Classification of each column within its right category enzymes.tsv Data subset files factor quantitative qualitative identifier link categories EDTMS ODAM 5 Data subsets files and their associated metadata files must be compliant with the TSV standard (Tab-Separator-Values) • You have to organize your data subsets so that links could be established between them. • In practical, it means to add a column containing the identifiers corresponding to the entity to which you want to connect the subset, implying a ‘obtainedFrom’ relation. • It is to be noted that this duplication of identifiers must be the only redundant information, through all data subsets.
  • 6. Daniel Jacob – INRA UMR 1332 –May 2016 plants.tsv harvests.tsv samples.tsv enzymes.tsv Data subset files compounds.tsv Plants Harvests Samples Compounds Enzymes Connections between the dataset files based on identifiers Entities (concepts) Link between 2 subsets being carried out from identifiers (implies a ‘obtainedFrom’ relation) Identifier of the central entity of the subset EDTMS ODAM factor quantitative qualitative identifier link categories 6
  • 7. Daniel Jacob – INRA UMR 1332 –May 2016 Supplementary files In order to allow data to be explored and mined, we have to adjoin some minimal but relevant metadata: For that, 2 metadata files are required • s_subsets.tsv: a file allowing to associate with each subset of data a key concept corresponding to the main entity of the subset and the relations of the type "obtainedFrom" between these concepts • a_attributes.tsv: a metadata file allowing each attribute (concept/variable) to be annotated with some minimal but relevant metadata Creation of the metadata files EDTMS ODAM 7 Data subsets files and their associated metadata files must be compliant with the TSV standard (Tab-Separator-Values)Note: TSV is an alternative to the common comma-separated values (CSV) format, which often causes difficulties because of the need to escape commas
  • 8. Daniel Jacob – INRA UMR 1332 –May 2016 s_subsets.tsv This metadata file allows to associate a key concept to each data subset file Creation of the metadata files EDTMS ODAM 8 Plants Compounds Enzymes Harvests Samples plants.tsv PlanteID harvests.tsv Lot samples.tsv SampleID compounds.tsv enzymes.tsv SampleID SampleID 1 2 3 4 5 Identifier of the central entity of the subset Link between 2 subsets (implies a ‘obtainedFrom’ relation) Unique rank number of the data subset Key concept (i.e. the main entity) associated to the subset in the form of a short name Plants1 factor quantitative qualitative identifier categories PlanteID plants.tsv Data file name
  • 9. Daniel Jacob – INRA UMR 1332 –May 2016 a_attributes.tsv This metadata file allows each attribute (variable) to be annotated with some minimal but relevant metadata Creation of the metadata files EDTMS ODAM 9 factor quantitative qualitative identifier categories Plants Harvests Samples Compounds … …
  • 10. Daniel Jacob – INRA UMR 1332 –May 2016 s_subsets.tsv a_attributes.tsv … … Additional subsets/ attributes can be added step by step, as soon as data are produced. Updating the metadata files EDTMS ODAM
  • 11. Daniel Jacob – INRA UMR 1332 –May 2016 Uploading your datasets in the data repository EDTMS ODAM No database schema, no programming code and no additional configuration on the server side. Your data subset files Your dataset entry (named ‘frim1’ as example) within the data repository Z: (Storage) Merely dropping data files on the data repository (e.g. NAS) should allow users to access them by web API Data subsets files and their associated metadata files must be compliant with the TSV standard (Tab-Separator-Values) Data repository PUT myhost.orgmount GET Data capture Minimal effort (PUT)
  • 12. Daniel Jacob – INRA UMR 1332 –May 2016 http://myhost.org/check/frim1 myhost.org StorageDataRepos NAS Checking online if your the data subset files are consistent EDTMS ODAM Many test checks can be automatically done for you
  • 13. Daniel Jacob – INRA UMR 1332 –May 2016 EDTMS ODAM Data storage seeding harvesting samples analysis samples preparation 13 GET , maximal efficiency (GET) After depositing your complete dataset as described previously: • An open access is given to your data through web API • They are ready to be mined • No specific code or additional configuration are needed (*) https://www.erasysbio.net/index.php?index=266 minimal effort (PUT) PUT Format TSV Data Data Linking Preparation and cleaning of the data sub-sets of files FRIM1(*) Check Open Data, Access and Mining : web API
  • 14. Daniel Jacob – INRA UMR 1332 –May 2016 Data Format TSV EDTMS ODAM Data linking Open Data, Access and Mining : web API REST Services: hierarchical tree of resource naming (URL) Retrieving data Retrieving metadata <data format> <dataset name> <subset> (<subset>) <entry><category> <value> <value> <value> <entry> GET http://myhost.org/getdata/<data format>/<dataset name>/< … >/< … > factor quantitative qualitative identifier link categories FRIM1 (*) xml/tsv/json frim1 14 (*) https://doi.org/10.5281/zenodo.154041
  • 15. Daniel Jacob – INRA UMR 1332 –May 2016 EDTMS ODAM Open Data, Access and Mining : web API REST Services: hierarchical tree of resource naming (URL) 15 GET http://myhost.org/getdata/<data format>/<dataset name>/< … >/< … > Field Description Examples <data format> format of the retrieved data; possible values are: 'xml' or 'csv' xml <dataset name> Short name (tag) of your dataset frim1 <subset> Short name of a data subset samples <entry> Name of an attribute entry (defined by the user in the a_attribute file (column ‘entry’) sampleid <category> Name of the attribute category; (assigned by the user in the a_attribute file (column ‘category’) possible values are: ‘identifier’, ‘factor’, ‘qualitative’, ‘quantitative’ quantitative (<subset>) Set of data subsets by merging all the subsets with lower rank than the specified subset and following the pathway defined by the "is_part_of" links. (samples)  plants + harvests + samples <value> Exact value of the desired entry or category 1, factor
  • 16. Daniel Jacob – INRA UMR 1332 –May 2016 EDTMS ODAM Open Data, Access and Mining : web API REST Services: hierarchical tree of resource naming (URL) 16 GET http://myhost.org/getdata/<data format>/<dataset name>/< … >/< … > http://myhost.org/getdata/<data format>/<dataset name>/<subset>/<entry>/<value> http://myhost.org/getdata/<data format>/<dataset name>/(<subset>)/<category> http://myhost.org/getdata/<data format>/<dataset name> http://myhost.org/getdata/<data format>/<dataset name>/(<subset>)/<entry>/<value> http://myhost.org/getdata/<data format>/<dataset name>/<subset> http://myhost.org/getdata/<data format>/<dataset name>/(<subset>) • Get the subset list of a dataset • Get all values within a data subset • Get values within a data subset for a specific value of an entry • Get all values within a set of data subsets • Get values within a set of data subsets for a specific value of an entry • Get the attribute list within a set of data subsets for a specific category
  • 17. Daniel Jacob – INRA UMR 1332 –May 2016 http://myhost.org/getdata/xml/frim1 http://myhost.org/getdata/xml/frim1/plants http://myhost.org/getdata/xml/frim1/harvests/lot/1 http://myhost.org/getdata/xml/frim1/(compounds)/quantitative Metadata Metadata Data Data Open Data Access via web API: Examples based on FRIM1 EDTMS ODAM FRIM1 17
  • 18. Daniel Jacob – INRA UMR 1332 –May 2016 http://myhost.org/getdata/xml/frim1/(samples)/treatment/Control Set of data subsets by merging all the subsets with lower rank than the specified subset and following the pathway defined by the “obtainedFrom" links. (samples)  plants + harvests + samples Open Data Access via web API: Examples based on FRIM1 EDTMS ODAM FRIM1 18
  • 19. Daniel Jacob – INRA UMR 1332 –May 2016 Data Format TSV minimal effort, maximal efficiency EDTMS ODAM Data linking Open Data Access via web API: Application layer FRIM1 19 … Use existing tools - Spreadsheets, R studio, BioStatFlow, Galaxy, Cytoscape, …
  • 20. Daniel Jacob – INRA UMR 1332 –May 2016 Retrieving Data within R Open Data Access via web API: Application layer The R package Rodam EDTMS ODAM 20
  • 21. Daniel Jacob – INRA UMR 1332 –May 2016 Open Data Access via web API Rodam package 21 <data format> <dataset name> <subset> (<subset>) <entry><category> <value> <value> <value> <entry> tsv frim1 samples sample 365 GET http://www.bordeaux.inra.fr/pmb/getdata/tsv/frim1/(samples)/sample/365
  • 22. Daniel Jacob – INRA UMR 1332 –May 2016 Open Data Access via web API Read metadata i.e. category types within the data Get the data subset ‘activome’ along with its metadata 22 <data format> <dataset name> <subset> (<subset>) <entry> <category> <value> <value> <entry> tsv frim1 activome factor GET http://www.bordeaux.inra.fr/pmb/getdata/tsv/frim1/(activome)/factor Rodam package
  • 23. Daniel Jacob – INRA UMR 1332 –May 2016 Open Data Access via web API 23 Rodam package
  • 24. Daniel Jacob – INRA UMR 1332 –May 2016 Data / Metadata Data Mining ? Make both metadata and data available for data mining. Experimentation / Analysis MFA rCCA pLDA … Open Data Access via web API activome qNMR_metabo Water StressControl ODAM facilitates the subsequent data mining All Dev. Stages All Treatments ODAM facilitates the subsequent data mining (log10 transformed) 24 Rodam package
  • 25. Daniel Jacob – INRA UMR 1332 –May 2016 Develop if needed, lightweight tools - R scripts (Galaxy), lightweight GUI (R shiny) minimal effort, maximal efficiency … Use existing tools - Spreadsheets, R studio, BioStatFlow, Galaxy, Cytoscape, … EDTMS ODAM Data Format TSV Data linking Open Data Access via web API: Application layer FRIM1 25
  • 26. Daniel Jacob – INRA UMR 1332 –May 2016 FRIM - Fruit Integrative Modelling EDTMS ODAM 26 http://www.bordeaux.inra.fr/pmb/dataexplorer/?ds=frim1
  • 27. Daniel Jacob – INRA UMR 1332 –May 2016 FRIM - Fruit Integrative Modelling EDTMS ODAM 27 http://www.bordeaux.inra.fr/pmb/dataexplorer/?ds=frim1
  • 28. Daniel Jacob – INRA UMR 1332 –May 2016 FRIM - Fruit Integrative Modelling EDTMS ODAM 28
  • 29. Daniel Jacob – INRA UMR 1332 –May 2016 FRIM - Fruit Integrative Modelling EDTMS ODAM 29 To remove an item from the selection: i) click on it, and then ii) click on the ‘Suppr’ key
  • 30. Daniel Jacob – INRA UMR 1332 –May 2016 FRIM - Fruit Integrative Modelling EDTMS ODAM 30
  • 31. Daniel Jacob – INRA UMR 1332 –May 2016 FRIM - Fruit Integrative Modelling EDTMS ODAM 31 Explore several possibilities by interacting with the graph
  • 32. Daniel Jacob – INRA UMR 1332 –May 2016 To summarize 1. Preparation and cleaning of the data sub-sets of files 2. Classification of each column within its right category 3. Connections between the dataset files based on identifiers 4. Creation of the definition files namely s_subsets.tsv and a_attributes.tsv 5. Deposit of the dataset files in the data repository 6. Checking online if your the data subset files are consistent 7. Testing online the web-services on your dataset 8. Use of the web API through an application layer (R scripts, data explorer, ... ) EDTMS ODAM Data subsets files and their associated metadata files must be compliant with the TSV standard (Tab-Separator-Values) Note: TSV is an alternative to the common comma-separated values (CSV) format, which often causes difficulties because of the need to escape commas (See https://en.wikipedia.org/wiki/Tab-separated_values)
  • 33. Daniel Jacob – INRA UMR 1332 –May 2016 Advantages of this approach data sharing & data availability - The array of the "plants" may be created even before planting the seeds. - Similarly, the array of the "harvests" can be created as soon as the harvests are done, and this before any analysis. - Thus, these arrays are generated only once in the project and we can set up the sharing soon the seed planting. Then each analysis comes to complement the set of data as soon as they produce their own sub-dataset. - data are accessible to everyone as soon as they are produced, identifiers centrally managed - data are archived and compiled, so that it becomes useless to proceed a laborious investigation to find out who possesses the right identifiers, etc. EDTMS ODAM seeding harvesting samples analysis Sample identifiers samples preparation
  • 34. Daniel Jacob – INRA UMR 1332 –May 2016 Advantages of this approach facilitate the subsequent publication of data - data are already readily available online by web API, - But nothing prevents to take this data to fill in existing databases, by adjoining more elaborate annotations. - Neither administrator privileges nor any programmatic skills are required EDTMS ODAM Data Format TSV Data linking PUT GET Data capture Minimal effortData analysis/mining Maximum efficiency
  • 35. Daniel Jacob – INRA UMR 1332 –May 2016 minimal effort, maximum efficiency Format the data - Based on TSV: choice to keep the good old way of scientist to use worksheets, thus i) using the same tool for both data files and metadata definition files, ii) no programmatic skill are required Give an access through a web services layer - based on current standards (REST) Use existing tools - Spreadsheets, R studio, BioStatFlow, Galaxy, Cytoscape, … Develop if needed, lightweight tools - R scripts, lightweight GUI (R shiny) Advantages of this approach biostatflow.org EDTMS ODAM
  • 36. Daniel Jacob – INRA UMR 1332 –May 2016 Have a good fun !! Daniel Jacob UMR 1332 BFP – Metabolism Group Bordeaux Metabolomics Facility May 2016 Open Data for Access and Mining https://hub.docker.com/r/odam/getdata/ http://www.bordeaux.inra.fr/pmb/dataexplorer/ https://github.com/INRA/ODAM https://cran.r-project.org/package=Rodam https://zenodo.org/record/154041 An online example