Predicting Health Outcomes from Data

What is health data science?
• Data-driven solutions to complex real-world health problems
• Deriving knowledge from unstructured and messy data
• It is an interdisciplinary field: biostatistics, computer science, epidemiology, public health, mathematics, etc.

Real-life health data science examples
• HIV:
• Visualising the pattern of early HIV transmission within the mucosal barrier
• COVID-19:
• What can predict COVID-19 neutralisation activity?
• Can we predict COVID-19 vaccine efficacy?

Early HIV transmission dynamics

Background
• Early HIV transmission events can occur during vaginal or anal sex
• We want to investigate whether the mucosal barrier (within the vaginal tissue) is effective in blocking HIV transmission

If the mucosal barrier is good at preventing viral
transmission, this is what we expect to see

If the mucosal barrier is not good at preventing
transmission, multiple viruses can be found
(random infection)

If the mucosal barrier is not good at preventing
transmission, multiple viruses can be found
(clustered infection)

Data Visualisation
Many viral variants are still visible:
no evidence that the vaginal tissue
is effective in blocking viral entry

Need a formal method
• How can we say (formally) whether infection is spatially clustered or not?
• Mantel test (or Mantel and Valand) -> relates a matrix of “geographical” distances to a matrix of “biological” distances
• So, we need to define the “geographical” matrix and the “biological” matrix first

“Geographical” distance
• Euclidean distance
$d_{i,j} = \sqrt{(x_i - x_j)^2 + (y_i - y_j)^2}$
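A minimal Python sketch of how the “geographical” matrix could be built, assuming each sampled site has (x, y) coordinates. The function name and the coordinates are made up for illustration and are not taken from the study.

```python
import math

def euclidean_distance_matrix(points):
    """Return the symmetric matrix of pairwise Euclidean distances."""
    return [[math.hypot(xi - xj, yi - yj) for (xj, yj) in points]
            for (xi, yi) in points]

# Hypothetical (x, y) positions of three tissue sampling sites.
sites = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
geo = euclidean_distance_matrix(sites)
```

Entry geo[i][j] is the distance between site i and site j, so the diagonal is zero and the matrix is symmetric.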

“Biological” distance
• Morisita – Horn index of overlap
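$MH = \dfrac{2 \sum_i \frac{n_{1i} n_{2i}}{N_1 N_2}}{\sum_i \left[ \left( \frac{n_{1i}}{N_1} \right)^2 + \left( \frac{n_{2i}}{N_2} \right)^2 \right]}$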
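The Morisita-Horn index can be sketched in a few lines of Python. This assumes the standard form of the index, with n1i and n2i read as the counts of variant i in samples 1 and 2, and N1 and N2 as the sample totals. The counts below are invented for illustration.

```python
def morisita_horn(counts1, counts2):
    """Morisita-Horn overlap between two samples of per-variant counts."""
    total1 = sum(counts1)
    total2 = sum(counts2)
    # Convert counts to proportions within each sample.
    p1 = [c / total1 for c in counts1]
    p2 = [c / total2 for c in counts2]
    numerator = 2 * sum(a * b for a, b in zip(p1, p2))
    denominator = sum(a * a for a in p1) + sum(b * b for b in p2)
    return numerator / denominator

# Hypothetical variant counts at three sites (same variant order in each list).
site_a = [10, 5, 0]
site_b = [20, 10, 0]  # same proportions as site_a
site_c = [0, 0, 7]    # shares no variants with site_a

overlap_ab = morisita_horn(site_a, site_b)
overlap_ac = morisita_horn(site_a, site_c)
```

The index is a similarity (1 for identical composition, 0 for no shared variants). One option for turning it into a distance is 1 - MH; the slides do not say whether the analysis did this.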

“Biological” distance
• Similarity between 1 and 2 = 0.98
• Similarity between 1 and 3 = 0.46

Mantel Test (or Mantel and Valand)
• Testing the association between two matrices
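• Mantel quantity (Zm) is given by:
$Z_m = \sum_i \sum_j g_{ij} b_{ij}$
where g_ij and b_ij are the entries of the “geographical” and “biological” matrices
• Basic idea -> permutation test
• Randomly permute the rows and columns of one of the two matrices (same permutation for rows and columns)
• Store the value of Zm for each permutation, then compare the observed Zm against these values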
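The permutation idea can be sketched in plain Python. This is an illustrative version only: the function names, the number of permutations, and the one-sided p-value are assumptions, and the toy matrices are invented so that the two "distances" are perfectly associated.

```python
import random

def mantel_z(g, b):
    """Mantel quantity: sum over all i, j of g[i][j] * b[i][j]."""
    n = len(g)
    return sum(g[i][j] * b[i][j] for i in range(n) for j in range(n))

def mantel_test(g, b, n_perm=999, seed=0):
    """Return the observed Zm and a one-sided permutation p-value."""
    rng = random.Random(seed)
    n = len(g)
    observed = mantel_z(g, b)
    order = list(range(n))
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(order)
        # Apply the same permutation to the rows and columns of b.
        permuted = [[b[order[i]][order[j]] for j in range(n)] for i in range(n)]
        if mantel_z(g, permuted) >= observed:
            hits += 1
    # Add one so the observed arrangement counts as one of the permutations.
    return observed, (hits + 1) / (n_perm + 1)

# Toy example: six sites on a line; "biological" distance equals "geographical".
positions = [0, 1, 2, 3, 4, 5]
geo = [[abs(a - c) for c in positions] for a in positions]
bio = [row[:] for row in geo]
z_obs, p_value = mantel_test(geo, bio)
```

A small p-value here means that few random relabellings of the sites produce a Zm as large as the observed one, which is the sense in which the two matrices are associated.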

Low p-values: infection is clustered locally
within the vaginal tissue

What can predict COVID-19 viral neutralisation activity?

Background
• Neutralising antibody (NAb): an antibody that defends the host against a specific pathogen
• Data: 41 convalescent adults, with 13 immunological parameters measured in total
• Goal: find which immunological parameters can be used to predict NAb in those 41 recovered patients

Methods
• Data visualisation is very important in data science
• First step: plot the correlation matrix for the whole dataset
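As a rough illustration of this first step, the sketch below computes a Pearson correlation matrix in plain Python. The column names echo parameters mentioned in these slides, but every value is a synthetic toy number, not study data.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def correlation_matrix(columns):
    """Nested dict of pairwise correlations between named columns."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names}
            for a in names}

# Synthetic toy values for illustration only.
data = {
    "microneut": [1.0, 2.0, 3.0, 4.0, 5.0],
    "rbd": [2.1, 3.9, 6.2, 8.0, 9.8],
    "ccr6_cxcr3neg": [5.0, 4.2, 2.9, 2.1, 1.0],
}
corr = correlation_matrix(data)
```

In practice the matrix would be drawn as a heat map so that strong positive and negative correlations stand out at a glance.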

Microneutralization is positively correlated
with SARS-CoV-2 RBD
Microneutralization is negatively correlated
with CCR6+CXCR3-

OK, not very informative:
too many things are correlated with microneutralization

Methods
• The correlation matrix shows that NAb is correlated with many parameters
• Next step: can we find hidden features in this dataset?
• Method: principal component analysis (PCA)

The main focus is microneutralization
If the angle between microneut and another variable is less
than 90°, then it’s a positive association
If the angle between microneut and another variable is greater
than 90°, then it’s a negative association

For instance, higher ELISA S trimer gives higher
microneutralization level (less than 90°)
For instance, higher CCR6+CXCR3- gives lower
microneutralization level (more than 90°)
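The angle rule can be checked numerically. The sketch below assumes NumPy is available, runs PCA via a singular value decomposition of the standardised data, and measures the angle between two variables' arrows in the plane of the first two components. The data are randomly generated stand-ins, not the study measurements.

```python
import numpy as np

def loading_angle_deg(data, i, j, n_components=2):
    """Angle (degrees) between the biplot arrows of variables i and j."""
    z = (data - data.mean(axis=0)) / data.std(axis=0, ddof=1)
    # Rows of vt are the principal axes; columns correspond to variables.
    _, s, vt = np.linalg.svd(z, full_matrices=False)
    # One arrow per variable, scaled by the singular values.
    arrows = (vt[:n_components] * s[:n_components, None]).T
    a, b = arrows[i], arrows[j]
    cos = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))

# Synthetic data: columns 0 and 1 move together, column 2 moves oppositely.
rng = np.random.default_rng(0)
base = rng.normal(size=41)
data = np.column_stack([
    base + 0.1 * rng.normal(size=41),
    base + 0.1 * rng.normal(size=41),
    -base + 0.1 * rng.normal(size=41),
])
angle_same = loading_angle_deg(data, 0, 1)
angle_opposite = loading_angle_deg(data, 0, 2)
```

With these toy columns the first angle comes out below 90° and the second above 90°, matching the positive and negative associations built into the data. The reading is only as reliable as the share of variance the first two components capture.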

Methods
• PCA visualisation is more informative than the correlation matrix
• But we still cannot pick out a single variable that predicts NAb
• Next step: identify the single best predictor of NAb
• Method: multiple linear regression with a backward model selection strategy
• The idea is to run a linear regression with all the variables, then iteratively remove non-significant predictors until all remaining predictors are significant
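A minimal sketch of backward elimination, assuming NumPy is available. To stay self-contained it uses a normal approximation for the p-values instead of the exact t-distribution, and drops the least significant predictor one at a time. The 0.05 cut-off, the variable names, and the data are all illustrative assumptions, not the study's settings.

```python
import math
import numpy as np

def ols_p_values(design, y):
    """Two-sided p-values for OLS coefficients (normal approximation)."""
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    dof = design.shape[0] - design.shape[1]
    sigma2 = resid @ resid / dof
    se = np.sqrt(np.diag(sigma2 * np.linalg.inv(design.T @ design)))
    return [math.erfc(abs(b / s) / math.sqrt(2)) for b, s in zip(beta, se)]

def backward_select(X, y, names, alpha=0.05):
    """Drop the least significant predictor until all have p <= alpha."""
    keep = list(range(X.shape[1]))
    while keep:
        design = np.column_stack([np.ones(len(y)), X[:, keep]])
        p = ols_p_values(design, y)[1:]  # skip the intercept
        worst = int(np.argmax(p))
        if p[worst] <= alpha:
            break
        del keep[worst]
    return [names[k] for k in keep]

# Synthetic example: only marker_a and marker_c truly drive the outcome.
rng = np.random.default_rng(1)
X = rng.normal(size=(41, 4))
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + 0.1 * rng.normal(size=41)
names = ["marker_a", "marker_b", "marker_c", "marker_d"]
selected = backward_select(X, y, names)
```

With only 41 observations a real analysis would use exact t-based p-values, for example from a statistics package. The two true predictors should survive the selection, although a noise variable can occasionally be retained by chance.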

Two main things are highly predictive of NAb

Predicting COVID-19 vaccine efficacy

Background
• At the end of the phase 2 trial, we get the immunogenicity data
(measuring the amount of antibody)
• Given the data from phase 2 trial (antibody data), can we predict
what the efficacy of the vaccine will be?
• Training dataset: efficacy and antibody data from all available vaccines

Methods
• The first step is always to visualise your data, so why don’t we plot
efficacy against antibody first?

High antibody = high efficacy
Low antibody = low efficacy
Can we simply build a classifier based on the
level of antibody?

Methods
• The model is a distribution-free binary classification model, based on
the threshold level of antibody
• The lower the antibody level, the higher the chance of infection, so vaccine efficacy will be lower
• The higher the antibody level, the lower the chance of infection, so vaccine efficacy will be higher
• We want to find this antibody threshold
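One possible reading of such a threshold model, sketched in Python: an individual counts as protected when their antibody level exceeds the threshold, and predicted efficacy is the fraction of protected individuals. This interpretation, the function name, the antibody levels, and the threshold value are all assumptions made for illustration.

```python
def predicted_efficacy(levels, threshold):
    """Fraction of individuals whose antibody level exceeds the threshold."""
    protected = sum(1 for level in levels if level > threshold)
    return protected / len(levels)

# Hypothetical antibody levels, expressed relative to a convalescent mean of 1.
high_response = [2.0, 3.5, 1.2, 4.0, 0.9]
low_response = [0.1, 0.3, 0.15, 0.5, 0.05]
threshold = 0.2  # hypothetical value, not the threshold fitted in the study

efficacy_high = predicted_efficacy(high_response, threshold)
efficacy_low = predicted_efficacy(low_response, threshold)
```

No distribution is assumed for the antibody levels, which fits the "distribution-free" description. Fitting the threshold would mean choosing the value that best reproduces the observed efficacy across the training vaccines.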

We normalised the antibody level to that of convalescent patients
(the mean for convalescent patients is one)
Covaxin data came out a bit later, so we used Covaxin to
validate our ‘classifier’ model
Using our classifier, as long as we have antibody data (from a
phase 2 trial), we can predict the efficacy of any vaccine

CureVac mRNA vaccine failure – why???

Simple data visualisation can help to answer this:
CureVac used a lower dose than Pfizer and Moderna
