Giovanni M. Dall’Olio
Hacking Global Health
Lessons learned from an Open Data Science Hackathon
https://github.com/dalloliogm/HBGDki-London/tree/master/Ultrasound/notebooks
Background – the HBGDki initiative
Bill and Melinda Gates Foundation
Slides credit: http://www.slideshare.net/JessicaWillis13/odsc-hackathon-for-health-october-2016
The HBGDki data
Objective of HBGDki:
• Understand which factors affect child development
Variables in full dataset (curated from 122 studies):
• Motor, Cognitive, Language Development
• Environment, Socioeconomic status
• Parents’ Reasoning skills and Depressive Symptoms
• Infant temperament, Breastfeeding, Micronutrients, Growth velocity, HAZ, enteric infections
Observations on HBGDki data?
Bias towards US studies
• 90% of the data comes from US studies
• US data may be collected in a more systematic way or with better tools
Data collected from several sources
• Inconsistent data (different procedures used), despite manual curation
• Incomplete data
Future plans
• HBGDki plans to use insights from the current dataset to launch a global data collection study
• The scope of the Hackathon is to see which types of analysis can be done and where efforts should be concentrated
The Hackathon Challenge
Predicting weight at birth, given ultrasound measurements
• Predicting the weight at birth during pregnancy makes it possible to detect underweight babies and act in advance
• Birth weight can be predicted from ultrasound measurements
• The current methods are relatively good, but the objective of the hackathon is to improve them
Slides credit: http://www.slideshare.net/JessicaWillis13/odsc-hackathon-for-health-october-2016
The Hackathon data
Data size
• 17,370 ultrasound scans from 2,525 samples collected from two studies
Variables
• GAGEDAYS: age of the foetus in days at the time of the ultrasound
• SUBJID, STUDYID, SEX: subject and study id, sex of the baby
• WTKG: predicted weight at birth, using best method in x
• BWT_40: predicted weight at birth, using best method in literature
• PARITY, GRAVIDA: number of times the mother has been pregnant before
• ABCIRCM, BPDCM, FEMURCM, HCIRCM: ultrasound measurements
Ultrasound measurements shown in the figure:
• BPDCM: Biparietal Diameter
• HCIRCM: Head Circumference
• ABCIRCM: Abdominal Circumference
• FEMURCM: Femur Length
Slides credit: http://www.slideshare.net/JessicaWillis13/odsc-hackathon-for-health-october-2016
Exploratory 1: how much data, and how it is distributed
Number of ultrasounds per subject
Distribution of ultrasound measurements
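A minimal sketch of how the number of ultrasounds per subject could be tabulated in base R. The data frame is a small made-up stand-in, not the hackathon data; only the column names SUBJID and GAGEDAYS come from the slides.

```r
# Toy stand-in for the hackathon data: one row per ultrasound scan.
# Values are invented; only the column names follow the slides.
scans <- data.frame(
  SUBJID   = c(1, 1, 1, 2, 2, 3),
  GAGEDAYS = c(120, 180, 240, 130, 200, 150)
)

# Number of scans recorded for each subject
scans_per_subject <- table(scans$SUBJID)

# How many subjects have 1, 2, 3, ... scans
scan_count_distribution <- table(scans_per_subject)

print(scans_per_subject)
print(scan_count_distribution)
```

The second table is the one a bar plot of "ultrasounds per subject" would be drawn from, for example with barplot(scan_count_distribution).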
Centering, scaling, and imputing data with caret
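library(caret)
# Estimate the transformations, then apply them (pp and ultrasound.clean are illustrative names)
pp <- preProcess(ultrasound.data, method = c("center", "scale", "knnImpute", "YeoJohnson"))
ultrasound.clean <- predict(pp, ultrasound.data)
Figure panels: before transform, after transform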
– The caret library in R can be used to center and scale the data, apply a Yeo-Johnson transform to make it closer to normally distributed, and impute missing values
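To make the Yeo-Johnson and centering/scaling steps less of a black box, here is a small base R sketch of what they compute. It is not caret's implementation: caret estimates lambda from the data, whereas this sketch takes lambda as an argument, assumes there are no missing values, and uses made-up skewed values.

```r
# Yeo-Johnson power transform for a fixed lambda (textbook definition).
# caret estimates lambda from the data; here it is supplied by hand.
yeo_johnson <- function(y, lambda) {
  out <- numeric(length(y))
  pos <- y >= 0
  if (abs(lambda) > 1e-8) {
    out[pos] <- ((y[pos] + 1)^lambda - 1) / lambda
  } else {
    out[pos] <- log(y[pos] + 1)
  }
  if (abs(lambda - 2) > 1e-8) {
    out[!pos] <- -(((-y[!pos] + 1)^(2 - lambda) - 1) / (2 - lambda))
  } else {
    out[!pos] <- -log(-y[!pos] + 1)
  }
  out
}

set.seed(1)
x <- rexp(200)  # made-up, right-skewed values

transformed  <- yeo_johnson(x, lambda = 0)       # equals log(x + 1) for non-negative x
standardized <- as.numeric(scale(transformed))   # center to mean 0, scale to sd 1

print(c(mean = mean(standardized), sd = sd(standardized)))
```

With lambda = 1 the transform leaves the data unchanged, which is a convenient sanity check.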
Exploratory 2: Correlation between variables
• The ggpairs function from GGally makes it quick to create pair plots
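A sketch of the kind of call involved, on made-up data rather than the hackathon dataset. The column names follow the slides, the values are simulated, and the plotting step only runs when GGally and ggplot2 are installed. The base R correlation matrix gives the numeric counterpart of the pair plot.

```r
set.seed(42)
# Made-up measurements that all grow with gestational age (not the real data)
n <- 100
gagedays <- runif(n, 100, 280)
toy <- data.frame(
  GAGEDAYS = gagedays,
  ABCIRCM  = 0.12 * gagedays + rnorm(n, sd = 1),
  HCIRCM   = 0.12 * gagedays + rnorm(n, sd = 1),
  FEMURCM  = 0.03 * gagedays + rnorm(n, sd = 0.3),
  STUDYID  = factor(sample(1:2, n, replace = TRUE))
)

measure_cols <- c("GAGEDAYS", "ABCIRCM", "HCIRCM", "FEMURCM")

# Numeric summary of what the pair plot shows
cor_matrix <- cor(toy[, measure_cols])
print(round(cor_matrix, 2))

# Pair plot of the first four columns, coloured by study,
# drawn only if GGally and ggplot2 are available
if (requireNamespace("GGally", quietly = TRUE) &&
    requireNamespace("ggplot2", quietly = TRUE)) {
  p <- GGally::ggpairs(toy, columns = 1:4,
                       mapping = ggplot2::aes(colour = STUDYID))
  print(p)
}
```

Mapping colour to STUDYID is one way to obtain a pair plot grouped by study.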
Correlation between variables, grouped by Study
Exploratory 3: Differences between Studies
• One group plotted PARITY (number of pregnancies) by Study
• From the different distributions they hypothesized that Study 1 came from a high-income country and Study 2 from a low- or middle-income country
Figure panels: Study 1, Study 2
A PCA of the four ultrasound measurements confirms they are highly correlated
• We can merge these four variables into a single Principal Component, losing <1% of the variance
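A minimal sketch of this step with base R's prcomp, on simulated measurements that all depend on gestational age. The data are invented, so the variance shares it prints are illustrative and are not the figure quoted on the slide.

```r
set.seed(7)
# Made-up measurements driven by a single underlying growth signal (not the real data)
n <- 200
gagedays <- runif(n, 100, 280)
toy <- data.frame(
  ABCIRCM = 0.12 * gagedays + rnorm(n, sd = 1),
  BPDCM   = 0.033 * gagedays + rnorm(n, sd = 0.3),
  FEMURCM = 0.027 * gagedays + rnorm(n, sd = 0.3),
  HCIRCM  = 0.12 * gagedays + rnorm(n, sd = 1)
)

# PCA on centered and scaled measurements
pca <- prcomp(toy, center = TRUE, scale. = TRUE)

# Share of the total variance carried by each principal component
variance_share <- pca$sdev^2 / sum(pca$sdev^2)
print(round(variance_share, 3))

# PC1 scores: one number per scan that summarises the four measurements
pc1 <- pca$x[, 1]
```

The pc1 vector is the kind of single summary variable that can then be followed over time for each subject.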
My plan: trajectory clustering
• Use trajectory clustering to classify growth trajectories into different groups. For example, one group of individuals may grow slower or faster than the others, or follow a differently shaped trajectory
• Use non-ultrasound variables to characterize the different trajectory groups, e.g. does male sex increase the odds of being in a fast-growing group?
• The accompanying figure shows an example analysis on mousephenotype.org data
https://github.com/dalloliogm/HBGDki-London/blob/master/Ultrasound/notebooks/prehackaton_mousephenotype_trajectoryclustering.ipynb
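The slides do not name the clustering algorithm used in the notebook, so the following is only a simple stand-in for the idea, not the author's method. Each subject's trajectory is reduced to a fitted growth rate, and the rates are grouped with k-means. The data, the PC1 column and the two simulated groups are all invented for illustration.

```r
set.seed(3)
# Made-up longitudinal data: 30 "typical" and 10 "fast" growers, 5 scans each.
# PC1 stands in for the first principal component of the measurements.
n_typical <- 30
n_fast <- 10
ages <- c(120, 150, 180, 210, 240)
subjects <- seq_len(n_typical + n_fast)
true_slope <- c(rep(1.0, n_typical), rep(1.5, n_fast))

toy <- do.call(rbind, lapply(subjects, function(id) {
  data.frame(
    SUBJID   = id,
    GAGEDAYS = ages,
    PC1      = true_slope[id] * (ages - 120) / 30 + rnorm(length(ages), sd = 0.1)
  )
}))

# Step 1: summarise each subject's trajectory by its fitted growth rate
growth_rate <- sapply(split(toy, toy$SUBJID), function(d) {
  coef(lm(PC1 ~ GAGEDAYS, data = d))[["GAGEDAYS"]]
})

# Step 2: group subjects with similar growth rates
clusters <- kmeans(growth_rate, centers = 2, nstart = 10)
print(table(clusters$cluster))
```

Because each subject gets its own regression, subjects with different numbers of scans can be handled, as long as each has at least two scans at different ages.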
Trajectory Clustering on PC1 of Ultrasound measurements
cluster  n
1        1
2        12
3        5
4        578
– Unfortunately, trajectory clustering of the data doesn’t show much
– Almost all samples (578) follow the same trajectory
– A cluster of 12 samples (cluster 2) follows a slightly faster growth trajectory than the others
Characterizing Cluster 2
• Cluster 2 contains 12 babies that grow slightly faster than the other groups
• We can use a binomial (logistic) regression on other variables (sex, study ID, parity) to determine whether they increase the odds of belonging to cluster 2
• Results are not exciting, but they at least indicate a possible new direction of analysis when new data is available
Logistic Regression: odds of belonging to cluster 2 given Sex, Study ID and Parity
Coefficients   Estimate   Std. Error   z-value   Pr(>|z|)
(Intercept)    9.2496     729.0359     0.013     0.989877
SEXMale        0.6564     0.2685       2.444     0.014517 *
STUDYID        -14.2373   729.0359     -0.02     0.984419
PARITY         0.508      0.133        3.82      0.000134 ***
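The estimates in this table are on the log-odds scale. A short sketch of how they translate into odds ratios, using only the two coefficients and standard errors reported above; the 95% interval is the usual Wald approximation and was not part of the slides.

```r
# Coefficients (log-odds) and standard errors as reported in the table
estimates  <- c(SEXMale = 0.6564, PARITY = 0.508)
std_errors <- c(SEXMale = 0.2685, PARITY = 0.133)

# Odds ratio = exp(coefficient)
odds_ratios <- exp(estimates)

# Approximate 95% Wald confidence interval on the odds-ratio scale
ci_lower <- exp(estimates - 1.96 * std_errors)
ci_upper <- exp(estimates + 1.96 * std_errors)

print(round(cbind(odds_ratio = odds_ratios, ci_lower, ci_upper), 2))
```

This gives odds ratios of roughly 1.93 for male sex and 1.66 per additional unit of parity. With the per-subject data, a model of this form would typically be fitted with glm(in_cluster2 ~ SEX + STUDYID + PARITY, family = binomial), where in_cluster2 is a hypothetical 0/1 column. The very large standard error on STUDYID is a pattern often seen when one study contains no members of the cluster, so that coefficient is probably best left uninterpreted.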
Modeling with caret
• The caret library is an interface to several R packages for modelling, clustering and regression
• The train function can be used to:
  • Preprocess the data (centering, scaling, normalization)
  • Fit a model, regression, etc.
  • Do resampling and cross-validation
  • Select the best fit based on a metric
library(caret)
# "repeats" only applies to method = "repeatedcv", so this assumes that
# 10-fold cross-validation repeated 3 times was intended
ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 3)
gbm.fit <- train(BWT_40 ~ .,
                 data = ultrasound.data,
                 method = "gbm",
                 trControl = ctrl,  # the argument is named trControl
                 preProcess = c("center", "scale"),
                 verbose = FALSE)
Generalized boosting regression on ultrasound data
var        rel.inf
ABCIRCM    42.3102187
GAGEDAYS   34.7568922
FEMURCM    7.0196893
SEXMale    6.5910654
BPDCM      4.7765837
HCIRCM     2.6879421
PARITY     1.5100042
STUDYID    0.3476046
• 25 resamplings
• Data centered, scaled, knnImputed with caret
• RMSE 0.294
Focusing model on weeks 15-25 slightly improves performance
• 25 resamplings
• Data centered, scaled, knnImputed with caret
• RMSE 0.327
gbm variable importance
           Overall
GAGEDAYS   100.00
ABCIRCM    93.12
HCIRCM     68.62
FEMURCM    46.02
BPDCM      29.54
SEXMale    21.96
PARITY     11.61
STUDYID    0.00
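The two importance tables are on different scales. The rel.inf values of the all-scans model sum to about 100, while the "Overall" column here runs from 0 to 100. The sketch below assumes the "Overall" column came from a min-max rescaling, which is what caret's varImp does by default as far as I know. It applies that rescaling to the rel.inf values of the all-scans model to show the convention. The result will not match the weeks 15-25 table, since that comes from a different fit.

```r
# Relative influence values from the first gbm table (all scans)
rel_inf <- c(
  ABCIRCM = 42.3102187, GAGEDAYS = 34.7568922, FEMURCM = 7.0196893,
  SEXMale = 6.5910654, BPDCM = 4.7765837, HCIRCM = 2.6879421,
  PARITY = 1.5100042, STUDYID = 0.3476046
)

# Min-max rescaling to 0-100, the convention assumed for the "Overall" column
rescale_0_100 <- function(x) (x - min(x)) / (max(x) - min(x)) * 100

overall <- rescale_0_100(rel_inf)
print(round(sort(overall, decreasing = TRUE), 2))
```

Under this convention the least important variable always shows 0.00, which means "least important in this model" rather than "no influence at all".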
Caret is an interface to several R modelling packages
Models
Models tried:
• Linear regression
• Regularised regression (LASSO/Ridge)
• Decision trees + AdaBoost
• Random forests
Using:
• Last scan only
• Last two scans
• Last three scans
• All 6 scans (if available)
‘Best’ model:
• Last three scans
• Elastic Net
• MAPE ≈ 7.4% (MAE ≈ 0.24 kg)
This can be improved by:
• Adding scans closer to delivery back in (MAPE ≈ 6.4%)
What did teams do
What did the winning team do better?
• Feature engineering
• Smart transforms of features to predict brain volume, density, etc.
• Unfortunately their slides are no longer available
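For reference, the error metrics quoted on these slides (RMSE for the gbm fits, MAE and MAPE for the other teams' models) can be computed as follows. The observed and predicted weights are made-up numbers for illustration, not hackathon results.

```r
# Error metrics for predicted vs. observed birth weight (kg)
rmse <- function(observed, predicted) sqrt(mean((observed - predicted)^2))
mae  <- function(observed, predicted) mean(abs(observed - predicted))
mape <- function(observed, predicted) {
  100 * mean(abs((observed - predicted) / observed))
}

# Made-up example values, not hackathon results
observed  <- c(3.2, 2.8, 3.6, 4.0)
predicted <- c(3.0, 3.0, 3.5, 4.4)

print(c(RMSE = rmse(observed, predicted),
        MAE  = mae(observed, predicted),
        MAPE = mape(observed, predicted)))
```

RMSE and MAE are in kilograms, while MAPE is a percentage of the observed weight, so the same absolute error counts for more on a small baby.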
Lessons learned
Cleaning data takes time
• About 50% of the time was spent on cleaning and understanding the data
• HBGDki’s investment in data curation is well justified
Trajectory clustering
• An approach to classify longitudinal data, even if incomplete
• More samples and more variables would make it possible to characterize different classes of growth speed
Caret
• Common interface for several R modelling packages
• Also useful for data cleaning and exploration
Feature Engineering
• Models can be improved by understanding the variables and transforming them appropriately

Hacking Global Health London 2016
