The document discusses data curation in the context of data lakes. It describes the data lake paradigm of collecting all data and making it searchable, then discusses the importance of data curation and normalization for generating value from large and diverse datasets. Examples show how sample annotations can be normalized and structured to enable complex queries across multiple datasets. The document reflects on the challenges of quantifying the value of data curation, and on the growing need for curation as data volumes increase.
From data lakes to actionable data (adventures in data curation)
1. From data lakes to actionable data
(adventures in data curation)
Andrea Splendiani, PhD
BioData, Basel
November 29th, 2018
NIBR Informatics
2. NIBR Informatics, TMS
What we do How we think Perspectives
Content
From data to knowledge:
The What and Why of data curation (and data lakes)
Public use2
3. NIBR Informatics, TMS
The data lake paradigm
What & Why: data lakes
Noise vs actionable data
The data lake paradigm:
1. Collect “all” data
2. Index it, make it searchable
3. …
4. Analyze it
5. …
6. Generate value
4. NIBR Informatics, TMS
What & Why: an example
• A data item (sample annotation):
– It is published
– It is ingested in a larger repository
– Some extraction/normalization is done
– The data item is now in a larger context (it can be queried with other data)
5. NIBR Informatics, TMS
What & Why: an example
Sample annotation: structured and unstructured information
6. NIBR Informatics, TMS
What & Why: an example
• Information is incorporated in a repository
• Some normalization/mapping of information, e.g.:
– Adult -> EFO:0001272
– Ethanol -> CHEBI:16236
• Structured representation
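The normalization/mapping step above (free-text terms mapped to ontology identifiers such as EFO:0001272 and CHEBI:16236) can be sketched as a simple lookup. The table below is illustrative only; a real pipeline would resolve terms against an ontology service rather than a hard-coded dictionary.

```python
# Minimal sketch of term-to-ontology normalization, as described above.
# The mapping table is illustrative; in practice terms would be resolved
# against an ontology lookup service, not a hard-coded dictionary.
TERM_TO_CURIE = {
    "adult": "EFO:0001272",     # development stage (EFO)
    "ethanol": "CHEBI:16236",   # storage condition (ChEBI)
    "muscle": "UBERON:0001015", # organism part (UBERON)
}

def normalize(term):
    """Map a free-text annotation value to an ontology CURIE, if known."""
    return TERM_TO_CURIE.get(term.strip().lower())

print(normalize("Adult"))    # EFO:0001272
print(normalize("Ethanol"))  # CHEBI:16236
```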
7. NIBR Informatics, TMS
What & Why: an example
• Information is put in a larger context and can be queried across different datasets:
– E.g.: all samples treated with alcohol (CHEBI:16236)
– E.g.: differentially expressed genes for samples treated with alcohol
An ideal lake
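Once annotations share ontology identifiers, a cross-dataset query such as “all samples treated with alcohol” reduces to a filter on the CURIE. The sample records below are purely illustrative:

```python
# Sketch of a cross-dataset query over normalized annotations.
# The records are illustrative; only SAMN03105804 appears in the slides.
samples = [
    {"id": "SAMN03105804", "dataset": "A", "storage_conditions": "CHEBI:16236"},
    {"id": "SAMN00000001", "dataset": "B", "storage_conditions": "CHEBI:15377"},
    {"id": "SAMN00000002", "dataset": "B", "treatment": "CHEBI:16236"},
]

def samples_with(curie, records):
    """Return records annotated with the given ontology identifier in any field."""
    return [r for r in records if curie in r.values()]

hits = samples_with("CHEBI:16236", samples)
print([r["id"] for r in hits])  # ['SAMN03105804', 'SAMN00000002']
```

Because the filter operates on identifiers rather than free text, it finds the match whether the source annotation said “Ethanol”, “alcohol”, or “EtOH”.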
8. NIBR Informatics, TMS
What & Why: an example
Sample ID: SAMN03105804
Description: Model organism or animal sample from Hoplostethus atlanticus
Biological characteristics (structured description):

Property               | Value                                             | Ontology (annotation from EBI)
biomaterial provider   | Peter Ritchie (Victoria University of Wellington) | EFO_0000001 (experimental factor)
development stage      | adult                                             | EFO_0001272
latitude and longitude | 46.50 S 166.00 E                                  | EFO_0000001
organism part          | Muscle                                            | UBERON_0001015
strain                 | wild caught                                       | EFO_0000001
geographic location    | New Zealand: Puysegur                             | EFO_0000001
storage conditions     | Ethanol                                           | CHEBI_16236
sample code            | OR00579                                           | EFO_0000001

All genes affected by alcohol, a close look at results.
9. NIBR Informatics, TMS
What & Why: an example
The value of the results of queries across large data assets depends on the quality of the data harmonization.
FAIR
In the previous examples, errors would not have impacted the overall results. But as we lose track of details, can we know how errors propagate?
10. NIBR Informatics, TMS
What we do
How we think Perspectives
Content
From data to knowledge:
The What and Why of data curation (and data lakes)
11. NIBR Informatics, TMS
What we do
Proactive and reactive approaches to data curation:
Reactive: cleanse (meta)data already produced.
Proactive: influence the production of (meta)data.
12. NIBR Informatics, TMS
Data Curation Framework
• Why: formalize curation processes
– Efficiency
– Reproducibility
• What: a rule-based environment to design “curation protocols”.
– Embed atomic operations, such as NLP-based ontology mapping, text extraction, computations…
• How: build-by-example approach
• Who: “power user”, with a stake in the standard definition process
Can we “augment” a curator with NLP that scales?
Can we make human processes reproducible?
13. NIBR Informatics, TMS
Data Curation Framework (the theory behind)
Framing the data curation process: multiple dimensions made explicit

Example, extracted from NCBI GEO GSM701607 (only a subset of fields from the previous slide is considered):

Field Name (the “location” in the source): ID, taxID, Organism, Gender, age
Value: GSM701607, 10090, Mus Musculus, 6 weeks old
Semantic type¹ (Meaning): Identifier about Sample; ID² about Organism; Name about Organism; Name about Gender; Identifier about Gender; Description about Age; Age Unit about Age
Curation goal (The need): Required, for each field
Validation state (Confidence): Valid, Valid, Valid

¹ All semantic types are expressed via an ontology (here presented as a simplified definition)
² Identifiers also require a domain specification
14. NIBR Informatics, TMS
Data Curation Framework (the theory behind): abstract rules, operators

Compute missing identifier:
if (E.X.type = “Identifier” and E.X.Goal = “Required” and E.X.Value = “” and
    exists(E.Y : E.Y.type.about = E.X.type.about and E.Y.type = “Description” and E.Y.Value != “”))
then E.X.Value = extract(isAbout(E.Y.type), E.Y.value)
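The abstract rule above can be made runnable under simple assumptions: entries are dictionaries carrying (type, about, goal, value), and `extract` — the NLP/text-extraction operator named in the rule — is stood in for here by a toy regex that pulls a GSM-style accession out of free text.

```python
import re

# Runnable sketch of the "compute missing identifier" rule.
# `extract` is a hypothetical stand-in for the real extraction operator.
def extract(about, text):
    """Toy extractor: pull a GSM-style accession out of free text."""
    m = re.search(r"GSM\d+", text)
    return m.group(0) if m else ""

def apply_rule(entries):
    """If a required Identifier is empty, fill it from a non-empty
    Description about the same thing, via the extract operator."""
    for x in entries:
        if x["type"] == "Identifier" and x["goal"] == "Required" and x["value"] == "":
            for y in entries:
                if (y["type"] == "Description" and y["about"] == x["about"]
                        and y["value"] != ""):
                    x["value"] = extract(y["about"], y["value"])
    return entries

entries = [
    {"type": "Identifier", "about": "Sample", "goal": "Required", "value": ""},
    {"type": "Description", "about": "Sample", "goal": "Optional",
     "value": "Expression profile of sample GSM701607"},
]
apply_rule(entries)
print(entries[0]["value"])  # GSM701607
```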
15. NIBR Informatics, TMS
Data Curation Framework (from theory to practice): rules to set curation tasks
This curation rule can be saved, shared, and executed…
16. NIBR Informatics, TMS
What we do
Proactive and reactive approaches to data curation:
Reactive: cleanse (meta)data already produced.
Proactive: influence the production of (meta)data.
17. NIBR Informatics, TMS
Templates: end user interaction
sampleAnnot-template
• Why: propagate data standards.
• What:
– a simple Excel-like template (can be shared via a URL)
– a central system to serve and process templates
• How:
– rules behind the template allow normalization.
– a central repository can capture variations.
• Who: power users design a template, end users use it (and change it).
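The proactive mechanism above — rules behind the template normalize submitted values, while a central repository captures unmapped variations for curators — can be sketched as follows. All field names and rules here are hypothetical.

```python
# Illustrative sketch of template-backed normalization at data entry.
# Rules map known variants to the standard; anything unmapped is
# captured as a "variation" for curator review. Names are hypothetical.
TEMPLATE_RULES = {
    "sex": {"m": "male", "male": "male", "f": "female", "female": "female"},
}

def process_row(row, variations):
    """Normalize one submitted row; record unmapped values per field."""
    out = {}
    for field, value in row.items():
        rules = TEMPLATE_RULES.get(field, {})
        key = value.strip().lower()
        if key in rules:
            out[field] = rules[key]
        else:
            variations.setdefault(field, []).append(value)  # capture for curators
            out[field] = value
    return out

variations = {}
print(process_row({"sex": "M"}, variations))    # {'sex': 'male'}
print(process_row({"sex": "n/a"}, variations))  # {'sex': 'n/a'}
print(variations)                               # {'sex': ['n/a']}
```

Capturing the variations centrally is what lets the power user evolve the template: frequently seen variants can be promoted into new rules.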
18. NIBR Informatics, TMS
What we do
How we think
Perspectives
Content
From data to knowledge:
The What and Why of data curation (and data lakes)
19. NIBR Informatics, TMS
Sample schema mapping (the basic problem)
Standardized list of fields:
Sample Source Species
Sample Source Anatomical part
Sample Source Sex
Sample Storage Conditions
…
20. NIBR Informatics, TMS
Can we combine lexical similarities and data distributions for better predictions?
Assumption: within the same study, properties with the same name have the same meaning.
Tamr / ML / interactive training
21. NIBR Informatics, TMS
Can we combine lexical similarities and
data distributions for better predictions?
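The idea of combining lexical similarity on field names with a data-distribution signal can be sketched as a weighted score; here the distribution signal is a simple overlap of observed values, and the weighting is arbitrary (this is an illustration, not the Tamr model).

```python
from difflib import SequenceMatcher

# Sketch: score a candidate mapping from a source field to a standardized
# field by combining name similarity with overlap of observed values.
def lexical_sim(a, b):
    """String similarity between two field names (0..1)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def value_overlap(vals_a, vals_b):
    """Jaccard overlap of the observed value sets (0..1)."""
    a, b = set(vals_a), set(vals_b)
    return len(a & b) / len(a | b) if a | b else 0.0

def mapping_score(name_a, vals_a, name_b, vals_b, w_lex=0.5):
    """Weighted combination; w_lex is an arbitrary illustration weight."""
    return (w_lex * lexical_sim(name_a, name_b)
            + (1 - w_lex) * value_overlap(vals_a, vals_b))

# "gender" vs the standard "Sample Source Sex": names differ, values agree,
# so the distribution signal rescues the mapping.
s1 = mapping_score("gender", ["male", "female"],
                   "Sample Source Sex", ["male", "female"])
s2 = mapping_score("gender", ["male", "female"],
                   "Sample Storage Conditions", ["ethanol"])
print(s1 > s2)  # True
```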
22. NIBR Informatics, TMS
Experimental process:
Hypothesis: what can we leverage? → Problem framing → Execution (training/learning in Tamr) → Data review → Assessment of results (sampling)
27. NIBR Informatics, TMS
Ideas
• Properties are not independent: can we use co-occurrence?
– Two species detected. Rule: need to specify if transplant or not.
• Can we use context for better mappings? Source? Submitter?
28. NIBR Informatics, TMS
What we do How we think
Perspectives
Content
From data to knowledge:
The What and Why of data curation (and data lakes)
29. NIBR Informatics, TMS
Reflection points
• What is the value of “curation”?
• What model for curation?
• What metrics?
• What measurements?
30. NIBR Informatics, TMS
What is the value of data curation?
• 80% of analyst time is spent discovering and preparing data [1]. Why are there no solutions?
• The major proportion of published data is irreproducible [3]
– Lost know-how about the data.
• The cost of producing data (e.g. genomics) is decreasing exponentially: what is the value of past data? [2]
• Is it at all possible to have homogeneous data? Do we have homogeneous knowledge?
A devil’s advocate perspective: is data re-use really valuable?
31. NIBR Informatics, TMS
What model for curation?
• Curation at source: can we anticipate all use cases?
• Curation at ingestion: hard-coding a use case
• Curation on demand: at all feasible?
32. NIBR Informatics, TMS
What metrics?
• Can we quantify how much data is “curated enough”?
• Can we quantify the value of data curation?
• FAIRness metrics, measures of quality:
– Can we assess how data fits a purpose?
– (if you had to invest X money in N datasets, which criteria would you use to choose?)
http://bit.ly/valueOfData
33. NIBR Informatics, TMS
What measurements?
• Data (and its context) evolve.
• Whichever measure of “value” we choose, it will change over time.
• How do we monitor such “value”?
http://yummydata.org
34. NIBR Informatics, TMS
Conclusions/recap
1. The more data we have in data lakes, the more we need to think about how to relate data together
– Especially if data is observational and coming from different sources
2. We can implement both reactive and proactive approaches to normalize data.
3. Is curation meta-“data science”?
4. How can we quantify the value of curation?
35. NIBR Informatics, TMS
Acknowledgments
• Daniel Cronenberger (SW Engineering)
• Frederic Sutter (SW Engineering)
• Dorothy Reilly (Data curation)
• Jean Marc Von-Allmen (Data curation)
• Anosha Siripala (Data curation)
• Joseph Kunkel (Data science)
• Martin Zablocki (Data science, Trivadis)
• Ted Snyder (Data science, Tamr)