Presented at the European Respiratory Society, Berlin, October 2017. A high-level talk for a mixed audience of clinicians and scientists on the difficulties of biomedical analysis, including practical, statistical and data issues.
3. The 4-headed beast
● The 4 heads
○ Acquisition
○ Storage
○ Analysis
○ Sharing
● Big data 4Vs
○ Velocity
○ Volume
○ Variety
○ Veracity
4. The problems of biomedical data
Many ...
● Types
● Formats
● Silos
● Gaps
● Interactions
Difficult analysis
● The curse of dimensionality
● Multiple hypothesis testing & false discovery
● Batch effects
● Life history
● Biased sampling
● Need for integrative analysis
Practical issues
● Unstructured data
● Managing big data
● Security
● Legal & privacy
5. Future medicine
A mix of promise & peril
● More data
○ Genomic medicine
○ Other “omic” medicine
○ Wearables
○ EHR & digital health
● P4 medicine
○ Stratification
○ Analysis at the bedside
○ Patient participation
● Translational medicine
○ Leveraging health data for research
6. Scientific data doubles every 18 months
A new paper is published every 30 seconds
Most papers are never cited or even read
No new principle will declare itself from below a heap of facts. (Peter Medawar)
15. Standards
● Clinical descriptions
● Measurements:
○ Blood pressure
○ White cells
● Cross-study
Yes!
● Allows combining & comparing studies
● CDISC
● HPO
But!
● A lot of work
16. Data formats & storage
Yes!
● Plain text
● Open formats
● Structured formats
● Advantages:
○ Human & machine readable
○ Unambiguous
○ WYSIWYG
● Examples:
○ Open bio formats
○ CSV, TSV
No!
● Homebrew formats
● Proprietary / closed formats
● Binary formats
● Excel
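The "human & machine readable" point can be illustrated with a minimal sketch using only Python's standard `csv` module. The column names and values below are invented for illustration and do not come from the talk.

```python
import csv
import io

# Hypothetical measurements; column names and values are invented.
rows = [
    {"patient_id": "P001", "systolic_mmHg": "128", "white_cells_10e9_per_L": "6.1"},
    {"patient_id": "P002", "systolic_mmHg": "141", "white_cells_10e9_per_L": "9.4"},
]


def to_tsv(records):
    """Serialise a list of dicts to tab-separated text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=list(records[0]), delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


def from_tsv(text):
    """Parse tab-separated text back into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


tsv_text = to_tsv(rows)
print(tsv_text)  # the file content is readable as-is in any text editor
round_tripped = from_tsv(tsv_text)
```

The same text can be read by a person and parsed back without loss. Values are kept as strings here so that nothing is silently reinterpreted on the way in or out.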
17. Workflow systems & notebooks
● Analysis as:
○ An executable recipe
○ A document or commentary
● Many candidates:
○ Workflows:
■ Snakemake
■ Nextflow
■ CWL etc.
○ Computational notebooks:
■ Jupyter / IPython
■ RMarkdown
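The idea of an analysis as "an executable recipe" can be sketched in plain Python: steps declare what they depend on, and the recipe runs them in a valid order. This is not the API of Snakemake, Nextflow or CWL. The step names are hypothetical and the steps only record that they ran.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical analysis steps: each step maps to the set of steps it depends on.
dependencies = {
    "download_data": set(),
    "align_reads": {"download_data"},
    "call_variants": {"align_reads"},
    "plot_results": {"call_variants"},
    "write_report": {"call_variants", "plot_results"},
}


def run_step(name, log):
    """Stand-in for real work: just record that the step was executed."""
    log.append(name)


def run_workflow(deps):
    """Run every step after all of its dependencies have run."""
    log = []
    for step in TopologicalSorter(deps).static_order():
        run_step(step, log)
    return log


order = run_workflow(dependencies)
print(" -> ".join(order))
```

Real workflow systems add the parts that matter in practice, such as tracking input and output files, re-running only what changed, and distributing work across machines.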
18. Deep learning / machine learning
How do you know a biologist is using deep learning in their research?
Don’t worry, they’ll tell you.
● “Just” optimization and search techniques
● Takes a set of features and produces a model that performs a classification or a regression
● A series of layers that assemble features into higher level features
● Several high-quality toolkits
● Some need for specialised hardware (GPU)
● Interpretability
● Ground truths
● Needs lots of data
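The claim that these methods are "just" optimization can be made concrete with a deliberately tiny sketch: a one-feature regression fitted by gradient descent. It is not deep learning and uses no toolkit. The toy data, learning rate and step count are assumptions chosen for illustration.

```python
# Toy data generated from y = 2x + 1 (invented for illustration).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]


def fit_linear(xs, ys, learning_rate=0.05, steps=2000):
    """Fit y = w*x + b by gradient descent on the mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Prediction error for every sample under the current parameters.
        errors = [(w * x + b) - y for x, y in zip(xs, ys)]
        # Gradients of the mean squared error with respect to w and b.
        grad_w = 2.0 / n * sum(e * x for e, x in zip(errors, xs))
        grad_b = 2.0 / n * sum(errors)
        # Step downhill.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


w, b = fit_linear(xs, ys)
print(f"w = {w:.3f}, b = {b:.3f}")
```

A deep network repeats the same loop with many more parameters arranged in layers, which is where the needs for large datasets and GPUs come from.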
20. Batch effects
● Technical sources of variation
○ Reagents
○ Technician
○ Platform
○ ...
● Solutions:
○ Plot data
○ Don’t batch
○ ComBat etc. (but with loss of information)
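The "loss of information" caveat can be shown with a crude stand-in for batch correction: subtracting each batch's mean. This is not ComBat, which fits a more elaborate model. The sample values are invented, and the two designs (balanced and confounded) are assumptions made for illustration.

```python
from statistics import mean


def centre_by_batch(records):
    """Subtract each batch's mean from its values (a crude correction, not ComBat)."""
    batches = {batch for batch, _, _ in records}
    batch_means = {
        batch: mean(value for b, _, value in records if b == batch)
        for batch in batches
    }
    return [(batch, group, value - batch_means[batch]) for batch, group, value in records]


def group_difference(records):
    """Mean of the 'case' values minus the mean of the 'control' values."""
    case = mean(v for _, g, v in records if g == "case")
    control = mean(v for _, g, v in records if g == "control")
    return case - control


# Hypothetical (batch, group, value) records; batch B carries a technical offset.
balanced = [
    ("A", "control", 10.0), ("A", "case", 12.0),
    ("B", "control", 15.0), ("B", "case", 17.0),
]
# Same idea, but every control is in batch A and every case is in batch B.
confounded = [
    ("A", "control", 10.0), ("A", "control", 11.0),
    ("B", "case", 15.0), ("B", "case", 16.0),
]

print(group_difference(balanced), group_difference(centre_by_batch(balanced)))
print(group_difference(confounded), group_difference(centre_by_batch(confounded)))
```

When cases and controls are spread across batches, the correction keeps the group difference. When batch and group coincide, the correction removes the group difference along with the batch offset, and no method can tell the two apart afterwards.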
21. Omnigenics
What if every gene affected every other gene?
● Pritchard et al 2017
● FOAF / six degrees of separation effect
● Implicated genes are a few drivers and an enormous number of “related” loci
● Context?
22. The garden of forking paths
Multiple hypothesis testing
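A short sketch of why multiple testing matters. The first function gives the chance of at least one false positive across several tests, assuming the tests are independent and every null hypothesis is true. The second is the Benjamini-Hochberg procedure, one common way to control the false discovery rate; the talk does not name a specific method, so its use here is an assumption. The p-values are invented.

```python
def familywise_error_rate(alpha, n_tests):
    """Probability of at least one false positive across independent tests."""
    return 1.0 - (1.0 - alpha) ** n_tests


def benjamini_hochberg(p_values, fdr=0.05):
    """Return the indices of hypotheses rejected by the Benjamini-Hochberg procedure."""
    m = len(p_values)
    # Indices of the p-values, ordered from smallest to largest.
    ranked = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, idx in enumerate(ranked, start=1):
        # Keep the largest rank whose p-value falls under its own threshold.
        if p_values[idx] <= rank / m * fdr:
            cutoff_rank = rank
    return sorted(ranked[:cutoff_rank])


# Hypothetical p-values from ten tests.
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]

naive_hits = [i for i, p in enumerate(p_values) if p < 0.05]
bh_hits = benjamini_hochberg(p_values)

print(f"Chance of a false positive in 20 tests: {familywise_error_rate(0.05, 20):.2f}")
print("Naive p < 0.05:", naive_hits)
print("Benjamini-Hochberg:", bh_hits)
```

With these invented values, five tests pass the naive threshold but only two survive the correction. The forking paths problem is harder still, because the tests that were considered but never run are not counted anywhere.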
23. Conclusion
Taming the 4-headed beast
Acquiring: interpret EHR
Storing: data formats & systems
Analysing: statistics, correct for batch effects, integrative analysis, deep learning
Sharing: standards, data formats, workflow systems