The document discusses data curation in the context of data lakes. It describes the data lake paradigm of collecting all data and making it searchable, then discusses the importance of data curation and normalization for generating value from large and diverse datasets. Examples show how sample annotations can be normalized and structured to enable complex queries across multiple datasets. The document reflects on the challenges of quantifying the value of data curation, and on the growing need for curation as data volumes increase.
From data lakes to actionable data (adventures in data curation)
1. From data lakes to actionable data
(adventures in data curation)
Andrea Splendiani, PhD
BioData, Basel
November 29th, 2018
NIBR Informatics
2. NIBR Informatics, TMS
What we do How we think Perspectives
Content
From data to knowledge:
The What and Why of data curation (and data lakes)
Public use2
3. NIBR Informatics, TMS
The data lake paradigm
What & Why: data lakes
Noise vs actionable data
The data lake paradigm:
1. Collect “all” data
2. Index it, make it searchable
3. …
4. Analyze it
5. …
6. Generate value
4. NIBR Informatics, TMS
What & Why: an example
• A data item (sample annotation):
– It is published
– It is ingested in a larger repository
– Some extraction/normalization is done
– The data item is now in a larger context (it can be queried with other data)
5. NIBR Informatics, TMS
What & Why: an example
Sample annotation: structured and unstructured information
6. NIBR Informatics, TMS
What & Why: an example
• Information is incorporated in a repository
• Some normalization/mapping of information, e.g.:
– Adult -> EFO:0001272
– Ethanol -> CHEBI:16236
• Structured representation
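The normalization/mapping step above (free-text terms mapped to ontology identifiers such as EFO:0001272 and CHEBI:16236) can be sketched as a simple lookup. The table below is illustrative only; a real pipeline would resolve terms against an ontology service rather than a hard-coded dictionary.

```python
# Minimal sketch of term-to-ontology normalization, as described above.
# The mapping table is illustrative; in practice terms would be resolved
# against an ontology lookup service, not a hard-coded dictionary.
TERM_TO_CURIE = {
    "adult": "EFO:0001272",     # development stage (EFO)
    "ethanol": "CHEBI:16236",   # storage condition (ChEBI)
    "muscle": "UBERON:0001015", # organism part (UBERON)
}

def normalize(term):
    """Map a free-text annotation value to an ontology CURIE, if known."""
    return TERM_TO_CURIE.get(term.strip().lower())

print(normalize("Adult"))    # EFO:0001272
print(normalize("Ethanol"))  # CHEBI:16236
```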
7. NIBR Informatics, TMS
What & Why: an example
• Information is put in a larger context and can be queried across different datasets:
– E.g.: all samples treated with alcohol (CHEBI:16236)
– E.g.: differentially expressed genes for samples treated with alcohol
An ideal lake
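Once annotations share ontology identifiers, a cross-dataset query such as “all samples treated with alcohol” reduces to a filter on the CURIE. The sample records below are purely illustrative:

```python
# Sketch of a cross-dataset query over normalized annotations.
# The records are illustrative; only SAMN03105804 appears in the slides.
samples = [
    {"id": "SAMN03105804", "dataset": "A", "storage_conditions": "CHEBI:16236"},
    {"id": "SAMN00000001", "dataset": "B", "storage_conditions": "CHEBI:15377"},
    {"id": "SAMN00000002", "dataset": "B", "treatment": "CHEBI:16236"},
]

def samples_with(curie, records):
    """Return records annotated with the given ontology identifier in any field."""
    return [r for r in records if curie in r.values()]

hits = samples_with("CHEBI:16236", samples)
print([r["id"] for r in hits])  # ['SAMN03105804', 'SAMN00000002']
```

Because the filter operates on identifiers rather than free text, it finds the match whether the source annotation said “Ethanol”, “alcohol”, or “EtOH”.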
8. NIBR Informatics, TMS
What & Why: an example
Sample ID: SAMN03105804
Description: Model organism or animal sample from Hoplostethus atlanticus
Biological characteristics (structured description):

Property               | Value                                             | Ontology (annotation from EBI)
biomaterial provider   | Peter Ritchie (Victoria University of Wellington) | EFO_0000001 (experimental factor)
development stage      | adult                                             | EFO_0001272
latitude and longitude | 46.50 S 166.00 E                                  | EFO_0000001
organism part          | Muscle                                            | UBERON_0001015
strain                 | wild caught                                       | EFO_0000001
geographic location    | New Zealand: Puysegur                             | EFO_0000001
storage conditions     | Ethanol                                           | CHEBI_16236
sample code            | OR00579                                           | EFO_0000001

All genes affected by alcohol, a close look at results.
9. NIBR Informatics, TMS
What & Why: an example
The value of the results of queries across large data assets depends on the quality of the data harmonization.
FAIR
In the previous examples, errors would not have impacted the overall results. But as we lose track of details, can we know how errors propagate?
10. NIBR Informatics, TMS
What we do
How we think Perspectives
Content
From data to knowledge:
The What and Why of data curation (and data lakes)
11. NIBR Informatics, TMS
What we do
Proactive and reactive approaches to data curation:
Reactive: cleanse (meta)data already produced.
Proactive: influence the production of (meta)data.
12. NIBR Informatics, TMS
Data Curation Framework
• Why: formalize curation processes
– Efficiency
– Reproducibility
• What: a rule-based environment to design “curation protocols”.
– Embed atomic operations, such as NLP-based ontology mapping, text extraction, computations…
• How: build-by-example approach
• Who: “power user”, with a stake in the standard definition process
Can we “augment” a curator with NLP that scales?
Can we make human processes reproducible?
13. NIBR Informatics, TMS
Data Curation Framework (the theory behind)
Framing the data curation process: multiple dimensions made explicit

Example, extracted from NCBI GEO GSM701607 (only a subset of fields from the previous slide is considered):

Field Name (the “location” in the source): ID, taxID, Organism, Gender, age
Value: GSM701607, 10090, Mus Musculus, 6 weeks old
Semantic type¹ (Meaning): Identifier about Sample; ID² about Organism; Name about Organism; Name about Gender; Identifier about Gender; Description about Age; Age Unit about Age
Curation goal (The need): Required, for each field
Validation state (Confidence): Valid, Valid, Valid

¹ All semantic types are expressed via an ontology (here presented as a simplified definition)
² Identifiers also require a domain specification
14. NIBR Informatics, TMS
Data Curation Framework (the theory behind): abstract rules, operators

Compute missing identifier:
if (E.X.type = “Identifier” and E.X.Goal = “Required” and E.X.Value = “” and
    exists(E.Y : E.Y.type.about = E.X.type.about and E.Y.type = “Description” and E.Y.Value != “”))
then E.X.Value = extract(isAbout(E.Y.type), E.Y.value)
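The abstract rule above can be made runnable under simple assumptions: entries are dictionaries carrying (type, about, goal, value), and `extract` — the NLP/text-extraction operator named in the rule — is stood in for here by a toy regex that pulls a GSM-style accession out of free text.

```python
import re

# Runnable sketch of the "compute missing identifier" rule.
# `extract` is a hypothetical stand-in for the real extraction operator.
def extract(about, text):
    """Toy extractor: pull a GSM-style accession out of free text."""
    m = re.search(r"GSM\d+", text)
    return m.group(0) if m else ""

def apply_rule(entries):
    """If a required Identifier is empty, fill it from a non-empty
    Description about the same thing, via the extract operator."""
    for x in entries:
        if x["type"] == "Identifier" and x["goal"] == "Required" and x["value"] == "":
            for y in entries:
                if (y["type"] == "Description" and y["about"] == x["about"]
                        and y["value"] != ""):
                    x["value"] = extract(y["about"], y["value"])
    return entries

entries = [
    {"type": "Identifier", "about": "Sample", "goal": "Required", "value": ""},
    {"type": "Description", "about": "Sample", "goal": "Optional",
     "value": "Expression profile of sample GSM701607"},
]
apply_rule(entries)
print(entries[0]["value"])  # GSM701607
```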
15. NIBR Informatics, TMS
Data Curation Framework (from theory to practice): rules to set curation tasks
This curation rule can be saved, shared, and executed…
16. NIBR Informatics, TMS
What we do
Proactive and reactive approaches to data curation:
Reactive: cleanse (meta)data already produced.
Proactive: influence the production of (meta)data.
17. NIBR Informatics, TMS
Templates: end user interaction
sampleAnnot-template
• Why: propagate data standards.
• What:
– a simple Excel-like template (can be shared via a URL)
– a central system to serve and process templates
• How:
– rules behind the template allow normalization.
– a central repository can capture variations.
• Who: power users design a template, end users use it (and change it).
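The proactive mechanism above — rules behind the template normalize submitted values, while a central repository captures unmapped variations for curators — can be sketched as follows. All field names and rules here are hypothetical.

```python
# Illustrative sketch of template-backed normalization at data entry.
# Rules map known variants to the standard; anything unmapped is
# captured as a "variation" for curator review. Names are hypothetical.
TEMPLATE_RULES = {
    "sex": {"m": "male", "male": "male", "f": "female", "female": "female"},
}

def process_row(row, variations):
    """Normalize one submitted row; record unmapped values per field."""
    out = {}
    for field, value in row.items():
        rules = TEMPLATE_RULES.get(field, {})
        key = value.strip().lower()
        if key in rules:
            out[field] = rules[key]
        else:
            variations.setdefault(field, []).append(value)  # capture for curators
            out[field] = value
    return out

variations = {}
print(process_row({"sex": "M"}, variations))    # {'sex': 'male'}
print(process_row({"sex": "n/a"}, variations))  # {'sex': 'n/a'}
print(variations)                               # {'sex': ['n/a']}
```

Capturing the variations centrally is what lets the power user evolve the template: frequently seen variants can be promoted into new rules.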
18. NIBR Informatics, TMS
What we do
How we think
Perspectives
Content
From data to knowledge:
The What and Why of data curation (and data lakes)
19. NIBR Informatics, TMS
Sample schema mapping (the basic problem)
Standardized list of fields:
Sample Source Species
Sample Source Anatomical part
Sample Source Sex
Sample Storage Conditions
…
20. NIBR Informatics, TMS
Can we combine lexical similarities and data distributions for better predictions?
Assumption: within the same study, properties with the same name have the same meaning.
Tamr / ML / interactive training
21. NIBR Informatics, TMS
Can we combine lexical similarities and
data distributions for better predictions?
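The idea of combining lexical similarity on field names with a data-distribution signal can be sketched as a weighted score; here the distribution signal is a simple overlap of observed values, and the weighting is arbitrary (this is an illustration, not the Tamr model).

```python
from difflib import SequenceMatcher

# Sketch: score a candidate mapping from a source field to a standardized
# field by combining name similarity with overlap of observed values.
def lexical_sim(a, b):
    """String similarity between two field names (0..1)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def value_overlap(vals_a, vals_b):
    """Jaccard overlap of the observed value sets (0..1)."""
    a, b = set(vals_a), set(vals_b)
    return len(a & b) / len(a | b) if a | b else 0.0

def mapping_score(name_a, vals_a, name_b, vals_b, w_lex=0.5):
    """Weighted combination; w_lex is an arbitrary illustration weight."""
    return (w_lex * lexical_sim(name_a, name_b)
            + (1 - w_lex) * value_overlap(vals_a, vals_b))

# "gender" vs the standard "Sample Source Sex": names differ, values agree,
# so the distribution signal rescues the mapping.
s1 = mapping_score("gender", ["male", "female"],
                   "Sample Source Sex", ["male", "female"])
s2 = mapping_score("gender", ["male", "female"],
                   "Sample Storage Conditions", ["ethanol"])
print(s1 > s2)  # True
```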
22. NIBR Informatics, TMS
Experimental process:
Hypothesis: what can we leverage? → Problem framing → Execution (training/learning in Tamr) → Data review → Assessment of results (sampling)
27. NIBR Informatics, TMS
Ideas
• Properties are not independent: can we use co-occurrence?
– Two species detected. Rule: need to specify if transplant or not.
• Can we use context for better mappings? Source? Submitter?
28. NIBR Informatics, TMS
What we do How we think
Perspectives
Content
From data to knowledge:
The What and Why of data curation (and data lakes)
29. NIBR Informatics, TMS
Reflection points
• What is the value of “curation”?
• What model for curation?
• What metrics?
• What measurements?
30. NIBR Informatics, TMS
What is the value of data curation?
• 80% of analyst time is spent discovering and preparing data [1]. Why are there no solutions?
• The major proportion of published data is irreproducible [3]
– Lost know-how about the data.
• The cost of producing data (e.g. genomics) is decreasing exponentially: what is the value of past data? [2]
• Is it at all possible to have homogeneous data? Do we have homogeneous knowledge?
A devil’s advocate perspective: is data re-use really valuable?
31. NIBR Informatics, TMS
What model for curation?
• Curation at source: can we anticipate all use cases?
• Curation at ingestion: hard-coding a use case
• Curation on demand: at all feasible?
32. NIBR Informatics, TMS
What metrics?
• Can we quantify how much data is “curated enough”?
• Can we quantify the value of data curation?
• FAIRness metrics, measures of quality:
– Can we assess how data fits a purpose?
– (if you had to invest X money in N datasets, which criteria would you use to choose?)
http://bit.ly/valueOfData
33. NIBR Informatics, TMS
What measurements?
• Data (and its context) evolve.
• Whichever measure of “value” we choose, it will change over time.
• How do we monitor such “value”?
http://yummydata.org
34. NIBR Informatics, TMS
Conclusions/recap
1. The more data we have in data lakes, the more we need to think about how to relate data together
– Especially if data is observational and coming from different sources
2. We can implement both reactive and proactive approaches to normalize data.
3. Is curation meta-“data science”?
4. How can we quantify the value of curation?
35. NIBR Informatics, TMS
Acknowledgments
• Daniel Cronenberger (SW Engineering)
• Frederic Sutter (SW Engineering)
• Dorothy Reilly (Data curation)
• Jean Marc Von-Allmen (Data curation)
• Anosha Siripala (Data curation)
• Joseph Kunkel (Data science)
• Martin Zablocki (Data science, Trivadis)
• Ted Snyder (Data science, Tamr)