Hackathon on data from the Healthy Birth, Growth and Development knowledge integration (HBGDki) initiative of the Bill and Melinda Gates Foundation.
The analysis uses the R package caret for data preparation and modelling, and includes an example of trajectory clustering.
Posted on http://bioinfoblog.it
Hacking Global Health London 2016
1. Giovanni M. Dall’Olio
Hacking Global Health: lessons learned from an Open Data Science Hackathon
https://github.com/dalloliogm/HBGDki-London/tree/master/Ultrasound/notebooks
2. Background – the HBGDki initiative
Bill and Melinda Gates Foundation
Presentation title 2
Slides credit: http://www.slideshare.net/JessicaWillis13/odsc-hackathon-for-health-october-2016
3. The HBGDki data
Objective of HBGDki:
• Understand which factors affect child development
Variables in full dataset (curated from 122 studies):
• Motor, Cognitive, Language Development
• Environment, Socioeconomic status
• Parents’ Reasoning skills and Depressive Symptoms
• Infant temperament, Breastfeeding, Micronutrients, Growth velocity, HAZ, enteric infections
4. Observations on HBGDki data?
Bias towards US studies
• 90% of the data comes from US studies
• US data may be collected in a more systematic way or with better tools
Data collected from several sources
• Inconsistent data (different procedures used), although manually curated
• Incomplete data
Future plans ahead
• HBGDki plans to use insights from the current dataset to launch a global data collection study
• The scope of the Hackathon is to see which types of analysis can be done and where efforts should be concentrated
5. The Hackathon Challenge
Predicting weight at birth, given ultrasound measurements
• Predicting the weight at birth during pregnancy makes it possible to detect underweight babies and act in advance
• Birth weight can be predicted from ultrasound measurements
• The current methods are relatively good, but the objective of the hackathon is to improve them
Slides credit: http://www.slideshare.net/JessicaWillis13/odsc-hackathon-for-health-october-2016
6. The Hackathon data
Data size
• 17,370 ultrasound scans from 2,525 samples collected from two studies
Variables
• GAGEDAYS: age of the foetus in days at the time of the ultrasound
• SUBJID, STUDYID, SEX: subject and study id, sex of the baby
• WTKG: predicted weight at birth, using best method in x
• BWT_40: predicted weight at birth, using best method in literature
• PARITY, GRAVIDA: number of times the mother has been pregnant before
• ABCIRCM, BPDCM, FEMURCM, HCIRCM: ultrasound measurements
• BPDCM: Biparietal Diameter
• HCIRCM: Head Circumference
• ABCIRCM: Abdominal Circumference
• FEMURCM: Femur Length
Slides credit: http://www.slideshare.net/JessicaWillis13/odsc-hackathon-for-health-october-2016
7. Exploratory 1: how much data, and how it is distributed
Number of ultrasounds per subject
Distribution of ultrasound measurements
8. Centering, scaling, and imputing data with caret
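library(caret)
# ultrasound.data is the data frame of scans; pp and ultrasound.clean are illustrative names
pp <- preProcess(ultrasound.data, method=c("center", "scale", "knnImpute", "YeoJohnson"))
ultrasound.clean <- predict(pp, ultrasound.data)  # apply the estimated transformations
Figure panels: Before transform, After transform
– The caret library in R can be used to center and scale the data, apply a Yeo-Johnson transform to make it closer to normally distributed, and impute missing values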
11. Exploratory 3: Differences between Studies
• One group plotted PARITY (number of pregnancies) by Study
• From the different distributions they hypothesized that Study 1 was from a high-income country, while Study 2 was from a low- or middle-income country
Study 1 Study 2
12. A PCA of the four ultrasound measurements confirms they are highly correlated
• We can merge these 4 variables into a single Principal Component, losing <1% of the variance
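The PCA step can be sketched in base R with prcomp. The data below is simulated (four measurements driven by one shared latent size factor) purely for illustration; it is not the hackathon data, and the object names us, pca and pc1 are hypothetical.

```r
# Simulated stand-in for the four ultrasound measurements (not the HBGDki data):
# each column is driven by one shared latent "size" factor plus small noise.
set.seed(1)
n <- 500
size <- rnorm(n)
us <- data.frame(
  ABCIRCM = 25 + 5.0 * size + rnorm(n, sd = 0.3),
  BPDCM   =  7 + 1.5 * size + rnorm(n, sd = 0.1),
  FEMURCM =  5 + 1.2 * size + rnorm(n, sd = 0.1),
  HCIRCM  = 26 + 5.0 * size + rnorm(n, sd = 0.3)
)

# PCA on centered and scaled variables
pca <- prcomp(us, center = TRUE, scale. = TRUE)

# Proportion of variance explained by each component
var.explained <- pca$sdev^2 / sum(pca$sdev^2)
print(round(var.explained, 3))

# First principal component score, one value per scan
pc1 <- pca$x[, "PC1"]
```

When the first entry of var.explained is close to 1, the PC1 scores can stand in for the four measurements in later steps, which is the idea used for the trajectory clustering.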
13. My plan: trajectory clustering
Use trajectory clustering to classify growth trajectories into different groups. For example, a group of individuals may grow slower or faster than the others, or follow different trajectories.
Use non-ultrasound variables to characterize the different trajectory groups, e.g. does male sex increase the odds of being in a fast-growing group?
The data on the right shows an example analysis on mousephenotype.org data:
https://github.com/dalloliogm/HBGDki-London/blob/master/Ultrasound/notebooks/prehackaton_mousephenotype_trajectoryclustering.ipynb
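As a rough illustration of the idea, the sketch below summarises each subject's trajectory by the intercept and slope of a per-subject linear fit and clusters those summaries with k-means. This is a simplified stand-in chosen so that it runs in base R, not necessarily the method used in the notebook, and the data is simulated.

```r
# Simulated longitudinal data (hypothetical, not the HBGDki data):
# 60 subjects, 6 scans each, two underlying growth speeds.
set.seed(42)
n.subj <- 60
true.slope <- rep(c(1.0, 1.5), each = n.subj / 2)
scans <- do.call(rbind, lapply(seq_len(n.subj), function(i) {
  days <- sort(sample(100:280, 6))
  data.frame(SUBJID = i,
             GAGEDAYS = days,
             PC1 = true.slope[i] * (days - 100) / 100 + rnorm(6, sd = 0.05))
}))

# Summarise each subject's trajectory by the coefficients of a linear fit
# (this works even when subjects are scanned at different gestational ages).
traj <- t(sapply(split(scans, scans$SUBJID), function(d) {
  coef(lm(PC1 ~ GAGEDAYS, data = d))
}))
colnames(traj) <- c("intercept", "slope")

# Cluster the per-subject summaries into 2 groups
km <- kmeans(scale(traj), centers = 2, nstart = 20)
tab <- table(cluster = km$cluster, true.slope = true.slope)
print(tab)
```

The cross-table compares the recovered clusters with the simulated growth speeds; on real data there is no such ground truth, and the number of clusters has to be chosen by inspection or by a quality criterion.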
14. Trajectory Clustering on PC1 of Ultrasound measurements
cluster n
1 1
2 12
3 5
4 578
– Unfortunately, trajectory clustering of the data doesn’t show much
– Almost all samples (578) follow the same trajectory
– A cluster of 12 samples (cluster 2) follows a slightly faster growth trajectory than the others
15. Characterizing Cluster 2
• Cluster 2 contains 12 babies that grow slightly faster than the other groups
• We can use a binomial (logistic) regression on other variables (Sex, Study ID, Parity) to determine whether they increase the odds of belonging to cluster 2
• Results are not exciting, but they at least indicate a possible new direction of analysis when new data is available
Logistic Regression: odds of belonging to cluster 2 given Sex, Study ID and Parity
Coefficients   Estimate   Std. Error   z-value   Pr(>|z|)
(Intercept)      9.2496     729.0359     0.013   0.989877
SEXMale          0.6564       0.2685     2.444   0.014517 *
STUDYID        -14.2373     729.0359    -0.02    0.984419
PARITY           0.508        0.133      3.82    0.000134 ***
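A regression of this kind can be fitted in base R with glm and family = binomial. The sketch below uses simulated data and a hypothetical 0/1 column in.cluster2; it only illustrates the shape of the analysis and does not reproduce the table above. As a side note, very large standard errors such as those for the intercept and STUDYID are often a sign of (quasi-)complete separation, for example when all cluster members come from a single study; that reading is an assumption here.

```r
# Simulated data (hypothetical, not the HBGDki data)
set.seed(7)
n <- 600
d <- data.frame(
  SEX     = factor(sample(c("Female", "Male"), n, replace = TRUE)),
  STUDYID = factor(sample(1:2, n, replace = TRUE)),
  PARITY  = rpois(n, 1.5)
)
# Membership of the fast-growing cluster, made more likely for males and higher parity
p <- plogis(-3 + 0.7 * (d$SEX == "Male") + 0.5 * d$PARITY)
d$in.cluster2 <- rbinom(n, 1, p)

# Logistic (binomial) regression of cluster membership on the covariates
fit <- glm(in.cluster2 ~ SEX + STUDYID + PARITY, data = d, family = binomial)
print(summary(fit)$coefficients)

# Exponentiated coefficients are odds ratios
print(exp(coef(fit)))
```

An odds ratio above 1 for SEXMale or PARITY means the variable increases the odds of belonging to the cluster, which is how the Estimate column of the table is read after exponentiation.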
16. Modeling with caret
• The caret library is an interface to several R packages for modelling, clustering and regression
• The train function can be used to:
• Preprocess the data (centering, scaling, normalization)
• Fit a model, regression, etc.
• Do resampling and cross-validation
• Select best fit based on a metric
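library(caret)
# Bootstrap resampling; the repeats argument only applies to method="repeatedcv", so it is dropped
ctrl <- trainControl(method="boot", number=10)
gbm.fit <- train(BWT_40 ~ .,
                 data=ultrasound.data,
                 method="gbm",
                 trControl=ctrl,                    # train() expects trControl, not trainControl
                 preProcess=c("center", "scale"),
                 verbose=FALSE)                     # passed through to gbm to silence its output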
17. Generalized boosting regression on ultrasound data
var rel.inf
ABCIRCM ABCIRCM 42.3102187
GAGEDAYS GAGEDAYS 34.7568922
FEMURCM FEMURCM 7.0196893
SEXMale SEXMale 6.5910654
BPDCM BPDCM 4.7765837
HCIRCM HCIRCM 2.6879421
PARITY PARITY 1.5100042
STUDYID STUDYID 0.3476046
• 25 resamplings
• Data centered, scaled, knnImputed with caret
• RMSE 0.294
18. Focusing the model on weeks 15-25 slightly improves performance
• 25 resamplings
• Data centered, scaled, knnImputed with caret
• RMSE 0.327
gbm variable importance
Overall
GAGEDAYS 100.00
ABCIRCM 93.12
HCIRCM 68.62
FEMURCM 46.02
BPDCM 29.54
SEXMale 21.96
PARITY 11.61
STUDYID 0.00
19. Caret is an interface to several R modelling packages
20. Models
Models tried:
• Linear regression
• Regularised regression (LASSO/Ridge)
• Decision trees + AdaBoost
• Random forests
Using:
• Last scan only
• Last two scans
• Last three scans
• All 6 scans (if available)
‘Best’ model
• Last three scans
• Elastic Net
• MAPE ≈ 7.4% (MAE ≈ 0.24 kg)
This can be improved by:
• Adding scans closer to delivery back in (MAPE ≈ 6.4%)
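For reference, the two error metrics quoted above can be computed as follows. This is a minimal sketch with made-up birth weights; the function names mae and mape are defined here and do not come from a package.

```r
# Mean absolute error, in the units of the response (kg here)
mae <- function(actual, predicted) mean(abs(actual - predicted))

# Mean absolute percentage error, in percent
mape <- function(actual, predicted) 100 * mean(abs((actual - predicted) / actual))

# Toy birth weights in kg (made-up numbers, not hackathon results)
actual    <- c(3.2, 2.8, 3.6, 4.0)
predicted <- c(3.0, 3.0, 3.6, 3.6)

print(mae(actual, predicted))   # 0.2 kg
print(mape(actual, predicted))  # about 5.85 %
```

MAE is expressed in kilograms, while MAPE scales each error by the true weight, so the same absolute error counts more for a small baby than for a large one.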
What did teams do
What did the winning team do better?
• Feature engineering
• Smart transform of features to predict brain volume, density, etc
• Unfortunately, their slides are no longer available
21. Lessons learned
Cleaning data takes time
• About 50% of the time was spent on cleaning and understanding the data
• HBGDki’s investment in data curation is well justified
Trajectory clustering
• An approach to classify longitudinal data, even if incomplete
• More samples and more variables would make it possible to characterize different classes of growth speed
Caret
• Common interface for several R modelling packages
• Also useful for data cleaning and exploration
Feature Engineering
• Models can be improved by understanding the variables and transforming them appropriately