A talk given to the Alan Turing Institute, UK, Oct 2021, reporting on the preliminary results and ongoing research in our lab, on self-monitoring using accelerometers for healthcare applications
Digital biomarkers for preventive personalised healthcare
1.
Digital biomarkers for preventive personalised healthcare
Oct 18th, 2021
Paolo Missier
Newcastle University, School of Computing
2.
The Team
Prof. Paolo Missier, Newcastle University (PI)
Prof. Michael Catt, Newcastle University and Closed Loop Medicine, Cambridge (Co-I)
Dr. Jaume Bacardit, Newcastle University (Co-I)
Key contributors:
Dr. Ossama Alshabrawy (PhD student, now Lecturer at Northumbria University)
Ben Lam, PhD student
Dr. Jacek Cala, Sr. Research Associate
In collaboration with the IMI DIRECT Consortium
https://www.imi.europa.eu/projects-results/project-factsheets/direct
Diabetes research on patient stratification
3.
Data-Driven, Personalised, Predictive, Preventive, Participatory Medicine (D2P4)
Part I:
The role of physical activity monitoring to support Type II Diabetes studies
Can we learn useful representations for a person’s daily activities from accelerometry?
Part II:
Generating synthetic physical activity data
How do we simulate plausible physical activity patterns and why?
4.
Data-Driven, Personalised, Predictive, Preventive, Participatory Medicine (D2P4)
Part I:
The role of physical activity monitoring to support Type II Diabetes studies
Can we learn useful representations for a person’s daily activities from accelerometry?
Main contributors:
Dr. Ossama Alshabrawy (PhD student, now Lecturer at Northumbria University)
Benjamin Lam, PhD student
5.
Activity traces archive from the UK Biobank
All UK Biobank participants: 502,664
Filter: Accelerometry study? 103,712
Split criteria: Type 2 Diabetes?
- T2D at baseline: 2,755
- T2D through EHR analysis: 1,321
- T2D total: 4,076
- Non-Diabetes: 99,636
Filter: EHR data available? 19,852
Filter: QC on activity traces
Positives: 3,103
Physical Impairment analysis
- Severe impairment: 1,666
- No impairment: 8,463
Comparisons: T2D vs Norm-0, T2D vs Norm-2
Is there enough signal in the traces to segregate T2D from Norm?
6.
Extracting High Level Activity Features (HLAF)
Feature extraction (*): 60 features / day, aggregated to week
(*) Doherty A, Jackson D, et al. (2017), Large scale population assessment of physical activity using wrist worn accelerometers: the UK Biobank study. PLOS ONE. 12(2):e0169649. https://github.com/activityMonitoring/biobankAccelerometerAnalysis
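A minimal sketch of the day-to-week aggregation step, under stated assumptions: the slide does not say which aggregate was used, so a plain mean per feature is assumed here, and the feature names and values are hypothetical.

```python
from statistics import mean

def aggregate_week(daily_features):
    """Average each feature over the days of one week.

    daily_features: list of per-day dicts mapping feature name -> value.
    A day that lacks a feature (e.g. non-wear) is skipped for that feature.
    Assumption: the weekly value is the mean of the available daily values.
    """
    names = sorted({name for day in daily_features for name in day})
    weekly = {}
    for name in names:
        values = [day[name] for day in daily_features if name in day]
        weekly[name] = mean(values)
    return weekly

# Hypothetical feature names and values, for illustration only.
week = [
    {"pct_sedentary": 0.60, "pct_light": 0.25, "hours_asleep": 7.0},
    {"pct_sedentary": 0.70, "pct_light": 0.15, "hours_asleep": 6.0},
    {"pct_sedentary": 0.65, "pct_light": 0.20},  # sleep not available this day
]
weekly_vector = aggregate_week(week)
print(weekly_vector)
```

With 60 daily features the same function would yield one 60-dimensional vector per participant-week.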
7.
Selected results: Clustering
- Cluster 2 contains almost entirely T2D positives (99.8%)
- phenotypes associated with increased risk of T2D (increasing age, high body fat percentage and a sedentary lifestyle) are also highly expressed in Cluster 2
8.
Selected results: classification
Negatives:  HLAF            SDL             HLAF+SDL
            Norm-0  Norm-2  Norm-0  Norm-2  Norm-0  Norm-2
RF          .80     .68     .83     .78     .86     .77
LR          .79     .70     .83     .78     .86     .78
XGB         .78     .66     .80     .74     .85     .75
Lam B, Catt M, Cassidy S, Bacardit J, Darke P, Butterfield S, Alshabrawy O, Trenell M, Missier P. Using Wearable Activity Trackers to Predict Type 2 Diabetes: Machine Learning–Based Cross-sectional Study of the UK Biobank Accelerometer Cohort. JMIR Diabetes, Vol 6, No 1, 19/3/2021:e23364
SDL: Socio-Demographic and Lifestyle variables
9.
Lessons learnt
• Signal is weak and noisy when used in the context of a complex metabolic disease
• “Controls” may actually be physically impaired and this is hard to determine
• UK Biobank had no QC protocol; “a random week in life” provides poor indicators
Are we mapping raw traces to the best possible feature space?
10.
Learning embedded representation spaces
DIRECT DB
• ~3,000 individuals total
• Follow-ups at 18, 36, 48 months
Pipeline: representation learning (LSTM Autoencoder) produces an embedded feature space, which is used for:
- Classification, with covariates and outcomes (e.g. insulin sensitivity)
- Clustering, followed by cluster interpretation
11.
Autoencoder Architecture
LSTM Autoencoder
Final reconstruction loss: 0.46
(early termination, 9/150 epochs to prevent overfitting)
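The early-termination idea can be sketched as follows. This is not the LSTM autoencoder from the slide: a tiny linear autoencoder trained with NumPy stands in for it so the example stays self-contained, and the patience rule, learning rate and synthetic data are all assumptions made for illustration.

```python
import numpy as np

def train_with_early_stopping(train, val, hidden=2, max_epochs=150,
                              patience=3, lr=0.01, seed=0):
    """Train a linear autoencoder; stop once validation loss stops improving."""
    rng = np.random.default_rng(seed)
    n_features = train.shape[1]
    enc = rng.normal(scale=0.1, size=(n_features, hidden))  # encoder weights
    dec = rng.normal(scale=0.1, size=(hidden, n_features))  # decoder weights
    history, best, stale = [], np.inf, 0
    best_weights = (enc.copy(), dec.copy())
    for epoch in range(1, max_epochs + 1):
        # Full-batch gradient step on squared reconstruction error
        # (gradients are correct up to a constant positive factor).
        code = train @ enc
        err = code @ dec - train
        grad_dec = code.T @ err / len(train)
        grad_enc = train.T @ (err @ dec.T) / len(train)
        enc -= lr * grad_enc
        dec -= lr * grad_dec
        # Reconstruction loss on held-out data drives the stopping rule.
        val_loss = float(np.mean((val @ enc @ dec - val) ** 2))
        history.append(val_loss)
        if val_loss < best - 1e-6:
            best, stale = val_loss, 0
            best_weights = (enc.copy(), dec.copy())
        else:
            stale += 1
            if stale >= patience:  # no improvement for `patience` epochs
                break
    return {"epochs_run": epoch, "best_val_loss": best,
            "history": history, "weights": best_weights}

# Synthetic data: 6 observed features driven by 2 latent factors plus noise.
rng = np.random.default_rng(1)
latent = rng.normal(size=(200, 2))
data = latent @ rng.normal(size=(2, 6)) + 0.1 * rng.normal(size=(200, 6))
result = train_with_early_stopping(data[:150], data[150:])
print(result["epochs_run"], round(result["best_val_loss"], 3))
```

Keeping the weights from the best validation epoch, rather than the last one, is what lets training stop well short of the epoch budget without paying for the final unproductive epochs.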
12.
Clustering in the high level and embedded spaces
[Figure: cluster plots for K-means, Hierarchical, Affinity Propagation and Spectral clustering; each method is shown for Embedded features and for High-level features]
13.
Clusters quality
Method                Silhouette  Calinski-Harabasz  Davies-Bouldin
Affinity Propagation  0.634       2220.021           0.895
Spectral              0.677       2600.836           0.839
DBSCAN                0.274       73.642             1.808
Hierarchical          0.466       2292.27            0.879
K-means               0.482       2617.19            0.839
Silhouette: Bounded between -1 and 1 (closer to 1, the better)
Calinski-Harabasz: Unbounded (the higher the score, the better)
Davies-Bouldin (not well suited to density methods): Lower bound of 0, no upper bound (closer to 0, the better)
Many of the other cluster validity indices require knowledge of the ground truth labels, so they are not suitable for this study
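To make two of these indices concrete, here is a small pure-Python sketch of the standard silhouette and Davies-Bouldin definitions. The toy points and labels are invented for illustration and have nothing to do with the study data. The mean silhouette always lies in [-1, 1], while Davies-Bouldin is non-negative with no upper bound.

```python
from math import dist

def _group(points, labels):
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    return clusters

def silhouette_score(points, labels):
    """Mean silhouette over all points; each per-point value is in [-1, 1]."""
    clusters = _group(points, labels)
    total = 0.0
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # a singleton contributes 0 by convention
        # a: mean distance to the other members of the point's own cluster
        a = sum(dist(p, q) for q in own) / (len(own) - 1)
        # b: mean distance to the nearest other cluster
        b = min(sum(dist(p, q) for q in other) / len(other)
                for other_lab, other in clusters.items() if other_lab != lab)
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)

def davies_bouldin_score(points, labels):
    """Mean, over clusters, of the worst scatter-to-separation ratio."""
    clusters = _group(points, labels)
    centroid = {lab: tuple(sum(c) / len(pts) for c in zip(*pts))
                for lab, pts in clusters.items()}
    scatter = {lab: sum(dist(p, centroid[lab]) for p in pts) / len(pts)
               for lab, pts in clusters.items()}
    worst = [max((scatter[i] + scatter[j]) / dist(centroid[i], centroid[j])
                 for j in clusters if j != i)
             for i in clusters]
    return sum(worst) / len(worst)

# Toy data: two tight, well-separated groups.
toy = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
good_labels = [0, 0, 0, 1, 1, 1]   # matches the geometry
bad_labels = [0, 1, 0, 1, 0, 1]    # deliberately mixes the two groups

good_sil = silhouette_score(toy, good_labels)
bad_sil = silhouette_score(toy, bad_labels)
good_db = davies_bouldin_score(toy, good_labels)
bad_db = davies_bouldin_score(toy, bad_labels)
print(good_sil, bad_sil, good_db, bad_db)
```

The deliberately poor labelling gives a Davies-Bouldin value above 1, which is consistent with the DBSCAN row of the table.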
14.
Cluster interpretation: clinical and activity variables
2 clusters binary classification: are the clinical variables good predictors for the clusters?
Classifiers: Logistic regression, AdaBoost classifier, Random forest classifier, XGBoost
Distribution of physical variables
Significant p-values from t-tests:
- percent time light-tasks daily: 0.009
- percent time sedentary daily: 0.3
- avg num hrs asleep daily: 0.005
- avg daily MET level: 0.01
15.
Data-Driven, Personalised, Predictive, Preventive, Participatory Medicine (D2P4)
Part II:
Generating synthetic physical activity data
How do we simulate plausible physical activity patterns and why?
Main contributor: Dr. Jacek Cala, Sr. Research Associate
16.
Motivation
From the EPSRC Healthcare Technologies Grand Challenges (*)
“[Design] An intelligent 'companion' that is fully aware of an individual's healthcare history and
experience, empowering them to self-manage their health and care by providing directly relevant
feedback, information and advice.”
(*) https://epsrc.ukri.org/research/ourportfolio/themes/healthcaretechnologies/strategy/grandchallenges/
Scoping this down…
How do we design an AI agent that
- Knows our (wellness, fitness, health) goals
- Understands our current state through physical activity monitoring
- Can suggest personalised interventions to achieve our goals
Idea: Reinforcement learning
17.
Longitudinal and profile-specific data scarcity
The Good: Annotated sensor data are widely available and useful to train an AI agent
The Bad: Difficult to find / create protocols where:
• Participants are followed for any length of time: no longitudinal dimension (months, years)
• Responses to interventions can be observed
• Activity traces are available for specific conditions, pathologies, patient groups...
18.
A little puppetry
Approach:
1. Use 24x7 traces to:
• Learn to generate new synthetic traces for a catalogue A1… An of activities
• Model unfolding daily activity patterns
2. Simulate:
Generate syntraces and combine them into controlled plausible daily patterns
Limited to basic activity types
- Sedentary, Light tasks, Moderate, Vigorous
- Sleep
Goal:
to simulate a variety of physical activity patterns that unfold in time
- Realistic
- Useful in practice to boost existing training sets
19.
Learn: Generating synthetic activity traces
(*) https://github.com/activityMonitoring/biobankAccelerometerAnalysis
Training:
- Traces: UK Biobank / 24x7 / 27 individuals
- HAR: Oxford accelerometry analysis tool (*)
- Traces broken down by (predicted) activity type
- A separate model trained for each activity type
- Notes: sleep excluded, traces trimmed to limit training times
[Figure: training pipeline. For each subject: raw data, preprocess, 126-dimensional feature vectors, classify, classified activity trace, split vectors by activity (walking, sleep, moderate, ...). The per-activity feature vectors are then used to train one low-level model per activity (walking model, moderate model, sleep model, ...)]
Approach: Generative Neural Networks
BasicGAN from Synthetic Data Vault: https://sdv.dev/
20.
Preliminary results
Validation: Oxford activity classifier used as discriminator
186 synthetic traces
• Walking activity easiest to simulate: 120 correctly classified
• Moderate activity hardest (196) – only 4 correctly classified
Problem: some of the correctly classified traces look unrealistic
21.
Model and simulate: whole-day activity profiles
Goal: to realistically combine bouts of single activities into “virtual days”
Approach: parametric multi-state modelling
• transition probability from si to sj increases as more time is spent in si
Objective: Use real 24x7 sequences to learn:
- Realistic lengths of each activity bout
- Activity transitions, e.g. walk to sit
Selected traces of 24-hour synthetic activity profiles generated by the semi-Markov generalised gamma model
(a), (b) show plausible traces; (c), (d) less realistic
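A rough sketch of how a semi-Markov simulation of a “virtual day” might look. It substitutes a plain gamma distribution for the generalised gamma named on the slide, because the Python standard library provides one directly, and every parameter value below is hypothetical rather than fitted to real traces. With a gamma shape above 1, the chance of leaving a state grows with the time already spent in it, which matches the behaviour described in the approach.

```python
import random

# Hypothetical parameters for illustration; none are fitted to real traces.
# Bout length in minutes is drawn from Gamma(shape, scale) for each state.
BOUT_LENGTH = {
    "sleep": (9.0, 50.0),
    "sedentary": (2.0, 30.0),
    "light": (2.0, 15.0),
    "walking": (2.0, 10.0),
    "moderate": (2.0, 8.0),
}
# Transition probabilities to a different state (each row sums to 1).
NEXT_STATE = {
    "sleep": {"sedentary": 0.6, "light": 0.4},
    "sedentary": {"light": 0.5, "walking": 0.3, "moderate": 0.1, "sleep": 0.1},
    "light": {"sedentary": 0.5, "walking": 0.3, "moderate": 0.2},
    "walking": {"sedentary": 0.6, "light": 0.3, "moderate": 0.1},
    "moderate": {"sedentary": 0.5, "light": 0.5},
}

def simulate_day(rng, start="sleep", minutes=24 * 60):
    """Return (state, start_minute, length) bouts that tile one day."""
    bouts, clock, state = [], 0.0, start
    while clock < minutes:
        shape, scale = BOUT_LENGTH[state]
        # Draw the bout length, truncating the last bout at the end of the day.
        end = min(clock + rng.gammavariate(shape, scale), float(minutes))
        bouts.append((state, clock, end - clock))
        clock = end
        # Pick the next state from the current state's transition row.
        options = NEXT_STATE[state]
        state = rng.choices(list(options), weights=list(options.values()))[0]
    return bouts

day = simulate_day(random.Random(7))
for state, start, length in day[:5]:
    print(f"{start:7.1f} min  {state:<10} {length:6.1f} min")
```

Separating bout lengths from the transition table mirrors the two things the slide proposes to learn from real 24x7 sequences: realistic bout durations and activity transitions.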
22.
Summary and open research
Part I: The role of physical activity monitoring to support Type II Diabetes studies
- Single sensor, free-living, poor QC: weak and noisy signal
- Good clustering of patients but signal inadequate for specific outcomes, e.g. insulin sensitivity
- Signal either stable over time or too noisy to track disease progression
Next: multi-sensor monitoring
Part II: Generating synthetic physical activity data
- Plausible activity patterns
Next: use syndata for training using reinforcement learning?
23.
Leveraged resources and Future plans
New collaboration:
- Physical activity monitoring to support a study on “long covid”-induced frailty. Consortium of 5 hospitals (Italy + Israel), about 300 patients. Funded by Gilead
Potential collaborations:
- Closed Loop Medicine, Digital Healthcare, Cambridge (through Prof. Catt)
- Fully-funded CDT PhD studentship aligned with the project since its inception (Ben Lam)
- New PhD student started October 2021 (Naif Alzahrani)
24.
Key outputs
Publications:
• Lam B, Catt M, Cassidy S, Bacardit J, Darke P, Butterfield S, Alshabrawy O, Trenell M, Missier P. Using Wearable Activity Trackers to Predict Type 2 Diabetes: Machine Learning–Based Cross-sectional Study of the UK Biobank Accelerometer Cohort. JMIR Diabetes, Vol 6, No 1, 19/3/2021:e23364
• Ferrari D, Milic J, Tonelli R, Ghinelli F, Meschiari M, et al. (2020) Machine learning in predicting respiratory failure in
patients with COVID-19 pneumonia—Challenges, strengths, and opportunities in a global health emergency.
PLOS ONE 15(11): e0239172. https://doi.org/10.1371/journal.pone.0239172
Invited Presentations:
- Data Science for (Health) Science: tales from a challenging front line, a talk given to the School of Information Sciences, Center for Informatics Research in Science and Scholarship, University of Illinois Urbana-Champaign, USA (March 2021)
- Digital markers from physical activity traces to support research into type 2 diabetes, Talk given to the IMI DIRECT consortium (April 2020)
- Prediction & prevention of age-related diseases through Machine Learning, Talk given to Newcastle BRC/NIHR group (Jan 2020)
- Exploring the role of digital and genetic biomarkers to learn personalized predictive models of metabolic diseases, Talk given at the Turing Health Programme workshop, Manchester (March 2019)
Editor's notes
Aims:
1. To understand the role {potential, limitations} of physical activity monitoring to support the detection of complex metabolic diseases (Type II Diabetes)
2. To investigate the potential of synthetic activity traces to address the scarcity of longitudinal activity datasets for research
Clustering within the space. Learning classifiers for specific clinical outcomes.
Spectral clustering is a technique with roots in graph theory, where the approach is used to identify communities of nodes in a graph based on the edges connecting them. The method is flexible and allows us to cluster non-graph data as well. Spectral clustering uses information from the eigenvalues (spectrum) of the Laplacian built from a graph representation of the data set.
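The note above can be illustrated with a minimal two-cluster sketch, assuming NumPy is available. It builds a Gaussian-affinity graph from the points, forms the unnormalised Laplacian, and splits on the sign of the eigenvector of the second-smallest eigenvalue (the Fiedler vector). General spectral clustering typically runs k-means on several eigenvectors; the sign split is a common simplification for two clusters. The kernel width and the toy points are assumptions.

```python
import numpy as np

def spectral_bipartition(points, sigma=1.0):
    """Split points into two clusters via the Fiedler vector of the Laplacian."""
    x = np.asarray(points, dtype=float)
    # Pairwise squared distances, turned into Gaussian edge weights.
    sq_dists = ((x[:, None, :] - x[None, :, :]) ** 2).sum(axis=-1)
    affinity = np.exp(-sq_dists / (2 * sigma ** 2))
    np.fill_diagonal(affinity, 0.0)  # no self-loops
    # Unnormalised graph Laplacian L = D - W.
    laplacian = np.diag(affinity.sum(axis=1)) - affinity
    # eigh returns eigenvalues in ascending order for symmetric matrices.
    _, eigenvectors = np.linalg.eigh(laplacian)
    fiedler = eigenvectors[:, 1]
    return (fiedler > 0).astype(int)

# Toy data: two small groups of points (not graph data to begin with).
toy_points = [(0, 0), (0, 1), (1, 0), (4, 4), (4, 5), (5, 4)]
toy_labels = spectral_bipartition(toy_points)
print(toy_labels)
```

Which group receives label 0 or 1 depends on the arbitrary sign of the eigenvector, so only the grouping itself is meaningful.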