Health data science.pptx

1. Health data science
2. Why study data science?
3. Why study data science?
4. What is health data science? • Data-driven solutions to complex real-world health problems • Or deriving knowledge from unstructured and messy data • It is an interdisciplinary field: biostatistics, computer science, epidemiology, public health, mathematics, etc.
5. But basically…
6. Real-life health data science examples • HIV: visualising the pattern of early HIV transmission within the mucosal barrier • COVID-19: what can predict COVID-19 neutralisation activity? Can we predict COVID-19 vaccine efficacy?
7. Early HIV transmission dynamics
8. Background • Early HIV transmission events may occur during vaginal or anal sex • We want to investigate whether the mucosal barrier (within the vaginal tissue) is effective in blocking HIV transmission
9. If the mucosal barrier is good at preventing viral transmission, this is what we expect to see
10. If the mucosal barrier is not good at preventing transmission, multiple viruses can be found (random infection)
11. If the mucosal barrier is not good at preventing transmission, multiple viruses can be found (clustered infection)
12. Animal experiment
13. Data
14. Data Visualisation • Many viral variants can still be seen: no evidence that the vaginal tissue is effective in blocking viral entry
15. Need a formal method • How can we say (formally) whether infection is spatially clustered or not? • Mantel test (or Mantel and Valand) -> relates a matrix of “geographical” distance to a matrix of “biological” distance • So, we first need to define the “geographical” matrix and the “biological” matrix
16. “Geographical” distance • Euclidean distance: $d_{i,j} = \sqrt{(x_i - x_j)^2 + (y_i - y_j)^2}$
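A minimal sketch of how the “geographical” matrix could be built, assuming each sample has an (x, y) position within the tissue. The coordinates and the names `euclidean_distance_matrix` and `sites` are hypothetical and only for illustration.

```python
import math

def euclidean_distance_matrix(points):
    """Return the symmetric matrix of pairwise Euclidean distances."""
    n = len(points)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            (xi, yi), (xj, yj) = points[i], points[j]
            d = math.hypot(xi - xj, yi - yj)  # sqrt((xi-xj)^2 + (yi-yj)^2)
            dist[i][j] = dist[j][i] = d
    return dist

# Hypothetical (x, y) positions of three tissue samples
sites = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
G = euclidean_distance_matrix(sites)
```

Entry `G[i][j]` is the distance between samples i and j; the diagonal is zero.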
17. “Biological” distance • Morisita-Horn index of overlap: $MH = \dfrac{2 \sum_i \frac{n_{1i} n_{2i}}{N_1 N_2}}{\sum_i \left( \frac{n_{1i}^2}{N_1^2} + \frac{n_{2i}^2}{N_2^2} \right)}$
18. “Biological” distance • Similarity between 1 and 2 = 0.98 • Similarity between 1 and 3 = 0.46
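A minimal sketch of the Morisita-Horn index, assuming its standard form, where n1i and n2i are the counts of viral variant i in samples 1 and 2, and N1 and N2 are the total counts in each sample. The variant counts below are hypothetical, not the study data. The index is a similarity (1 = identical composition, 0 = no shared variants); converting it to a distance with `1 - MH` is one common convention and is an assumption here.

```python
def morisita_horn(counts1, counts2):
    """Morisita-Horn overlap between two samples of variant counts."""
    N1, N2 = sum(counts1), sum(counts2)
    numerator = 2 * sum(a * b for a, b in zip(counts1, counts2)) / (N1 * N2)
    denominator = (sum(a * a for a in counts1) / N1 ** 2
                   + sum(b * b for b in counts2) / N2 ** 2)
    return numerator / denominator

# Hypothetical counts of three viral variants in three tissue samples
sample_1 = [10, 5, 0]
sample_2 = [20, 10, 0]   # same proportions as sample_1
sample_3 = [0, 0, 7]     # shares no variant with sample_1

mh_12 = morisita_horn(sample_1, sample_2)
mh_13 = morisita_horn(sample_1, sample_3)
biological_distance_13 = 1 - mh_13  # assumed convention: distance = 1 - similarity
```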
19. Mantel Test (or Mantel and Valand) • Testing the association between two matrices • Mantel quantity (Zm) is given by: $Z_m = \sum_i \sum_j g_{ij} b_{ij}$, where $g_{ij}$ and $b_{ij}$ are the entries of the “geographical” and “biological” matrices • Basic idea -> permutation test • Randomly permute the rows and columns of one of the two matrices (same permutation for rows and columns) • And store the value of Zm for each permutation
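The permutation idea can be sketched as follows. This is a simplified, one-sided version that assumes both matrices are square, symmetric and hold the same kind of quantity (both distances or both similarities). The function names, the number of permutations and the toy matrix are hypothetical choices, not the ones used in the study.

```python
import random

def mantel_statistic(G, B):
    """Zm = sum over i and j of g_ij * b_ij."""
    n = len(G)
    return sum(G[i][j] * B[i][j] for i in range(n) for j in range(n))

def mantel_test(G, B, n_perm=999, seed=0):
    """One-sided permutation p-value for a positive association."""
    rng = random.Random(seed)
    n = len(G)
    observed = mantel_statistic(G, B)
    idx = list(range(n))
    at_least_as_large = 0
    for _ in range(n_perm):
        rng.shuffle(idx)
        # Apply the same permutation to the rows and columns of B
        B_perm = [[B[idx[i]][idx[j]] for j in range(n)] for i in range(n)]
        if mantel_statistic(G, B_perm) >= observed:
            at_least_as_large += 1
    # The observed arrangement counts as one of the permutations
    p_value = (at_least_as_large + 1) / (n_perm + 1)
    return observed, p_value

# Toy check: six sites on a line, "biological" matrix identical to the
# "geographical" one, so the association is as strong as it can be
G_demo = [[abs(i - j) for j in range(6)] for i in range(6)]
observed, p_value = mantel_test(G_demo, G_demo)
```

A small p-value means that few random relabellings give a Zm as large as the observed one, which is the evidence for spatial clustering.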
20. Low p-values: infection is clustered locally within the vaginal tissue
21. What can predict COVID-19 viral neutralisation activity?
22. Background • Neutralising antibody (NAb): an antibody that can defend the host against a specific pathogen • Data: 41 convalescent adults; several immunological parameters measured (13 parameters in total) • Goal: find which immunological parameters can be used to predict NAb in those 41 recovered patients
23. Methods • Data visualisation is very important in data science • First step: plot the correlation matrix for the whole dataset
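A minimal sketch of the correlation-matrix step, using Pearson correlation written out by hand. The three columns and their values are made up for illustration and only borrow variable names mentioned on the slides; they are not the study data.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

def correlation_matrix(columns):
    """Nested dict: corr[a][b] is the correlation between columns a and b."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}

# Hypothetical measurements for five patients
toy = {
    "microneut": [1, 2, 3, 4, 5],
    "rbd": [2, 4, 6, 8, 11],
    "ccr6_cxcr3neg": [9, 7, 6, 4, 1],
}
corr = correlation_matrix(toy)
```

Plotting `corr` as a heatmap gives the kind of picture described on the slide.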
24. Microneutralization is positively correlated with SARS-CoV-2 RBD. Microneutralization is negatively correlated with CCR6+CXCR3-.
25. OK, not very informative… So many things are correlated with microneutralization
26. Methods • The correlation matrix shows that NAb is correlated with many things • Next step: Can I find some hidden features in this dataset? • Method: principal component analysis (PCA)
27. The main focus is microneutralization. If the angle between microneut and another variable is less than 90°, then it’s a positive association. If the angle between microneut and another variable is greater than 90°, then it’s a negative association.
28. For instance, higher ELISA S trimer gives a higher microneutralization level (angle less than 90°). For instance, higher CCR6+CXCR3- gives a lower microneutralization level (angle more than 90°).
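The angle rule can be checked numerically: the angle between two variable arrows in the PCA plot is below 90° exactly when their dot product is positive. The loading vectors below are hypothetical, not values from the study, and the reading is only approximate when the first two components do not capture most of the variance.

```python
import math

def angle_degrees(u, v):
    """Angle between two 2-D loading vectors, in degrees."""
    dot = u[0] * v[0] + u[1] * v[1]
    norm = math.hypot(u[0], u[1]) * math.hypot(v[0], v[1])
    cosine = max(-1.0, min(1.0, dot / norm))  # guard against rounding
    return math.degrees(math.acos(cosine))

# Hypothetical loadings on the first two principal components
microneut = (0.8, 0.3)
elisa_s_trimer = (0.7, 0.5)
ccr6_cxcr3neg = (-0.6, 0.4)

angle_elisa = angle_degrees(microneut, elisa_s_trimer)  # below 90: positive association
angle_ccr6 = angle_degrees(microneut, ccr6_cxcr3neg)    # above 90: negative association
```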
29. Methods • PCA visualisation is more informative than the correlation matrix • But we still cannot pick out a single thing that predicts NAb • Next step: I want to pick only one thing to predict NAb • Method: multiple linear regression with a backward model selection strategy • The idea is to run a linear regression with all the variables, and iteratively remove non-significant predictors until all the remaining predictors are significant
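A rough sketch of backward elimination in plain Python. As a stand-in for a p-value test, it treats a predictor as non-significant when its |t| statistic is below 2; this cutoff, the function names and the toy data are all assumptions for illustration, not the procedure or data used in the study.

```python
import math

def invert(A):
    """Invert a small square matrix by Gauss-Jordan elimination."""
    n = len(A)
    M = [[float(v) for v in row] + [1.0 if i == j else 0.0 for j in range(n)]
         for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        p = M[col][col]
        M[col] = [v / p for v in M[col]]
        for r in range(n):
            if r != col:
                f = M[r][col]
                M[r] = [a - f * b for a, b in zip(M[r], M[col])]
    return [row[n:] for row in M]

def ols_t_stats(X, y):
    """Least-squares fit with intercept; returns t-statistics (intercept first)."""
    n, k = len(X), len(X[0]) + 1
    D = [[1.0] + list(row) for row in X]  # design matrix with intercept column
    XtX = [[sum(D[r][a] * D[r][b] for r in range(n)) for b in range(k)] for a in range(k)]
    Xty = [sum(D[r][a] * y[r] for r in range(n)) for a in range(k)]
    inv = invert(XtX)
    beta = [sum(inv[a][b] * Xty[b] for b in range(k)) for a in range(k)]
    rss = sum((y[r] - sum(D[r][a] * beta[a] for a in range(k))) ** 2 for r in range(n))
    sigma2 = rss / (n - k)  # residual variance
    return [beta[a] / math.sqrt(sigma2 * inv[a][a]) for a in range(k)]

def backward_select(data, target, t_cut=2.0):
    """Drop the weakest predictor until every remaining one has |t| >= t_cut."""
    predictors = [name for name in data if name != target]
    while predictors:
        X = [[data[p][r] for p in predictors] for r in range(len(data[target]))]
        t = ols_t_stats(X, data[target])[1:]  # skip the intercept
        weakest = min(range(len(predictors)), key=lambda i: abs(t[i]))
        if abs(t[weakest]) >= t_cut:
            break
        predictors.pop(weakest)
    return predictors

# Hypothetical data: "nab" depends on marker_a only, plus a little noise
noise = [1, -1, 1, -1, -1, 1, -1, 1]
toy = {
    "marker_a": [1, 2, 3, 4, 5, 6, 7, 8],
    "marker_b": [1, -1, -1, 1, 1, -1, -1, 1],
}
toy["nab"] = [1 + 3 * a + 0.1 * e for a, e in zip(toy["marker_a"], noise)]
selected = backward_select(toy, "nab")
```

In practice a statistics package would supply proper p-values; the loop structure stays the same.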
30. Two main things are highly predictive of NAb
31. Predicting COVID-19 vaccine efficacy
32. Background
33. Background • At the end of the phase 2 trial, we get the immunogenicity data (measuring the amount of antibody) • Given the data from phase 2 trial (antibody data), can we predict what the efficacy of the vaccine will be? • Training dataset: efficacy and antibody data from all available vaccines
34. Methods • The first step is always to visualise your data, so why don’t we plot efficacy against antibody first?
35. High antibody = high efficacy. Low antibody = low efficacy. Can we simply use a classification method based on the level of antibody?
36. Methods • The model is a distribution-free binary classification model, based on a threshold level of antibody • The lower your antibody level, the higher your chance of being infected, so the vaccine efficacy will be lower • The higher your antibody level, the lower your chance of being infected, so the vaccine efficacy will be higher • We want to know what this antibody threshold is
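One possible reading of the threshold idea, as a sketch only: treat every vaccinated person whose antibody level is at or above the threshold as protected, predict efficacy as the protected fraction, and pick the threshold that best matches the observed efficacies. The fitting criterion (least squares over a grid of candidates), the function names and all numbers below are assumptions for illustration, not the fitted model or data.

```python
def predicted_efficacy(antibody_levels, threshold):
    """Fraction of individuals whose antibody level reaches the threshold."""
    protected = sum(1 for level in antibody_levels if level >= threshold)
    return protected / len(antibody_levels)

def fit_threshold(cohorts, candidates):
    """Pick the candidate threshold with the smallest squared error."""
    def squared_error(threshold):
        return sum((predicted_efficacy(levels, threshold) - efficacy) ** 2
                   for levels, efficacy in cohorts)
    return min(candidates, key=squared_error)

# Hypothetical training data: (individual antibody levels, observed efficacy)
cohorts = [
    ([0.1, 0.3, 0.6, 1.2], 0.5),
    ([0.5, 1.0, 2.0, 4.0], 1.0),
]
best_threshold = fit_threshold(cohorts, candidates=[0.2, 0.4, 0.8])

# Predict efficacy for a new (hypothetical) vaccine from its antibody data
new_vaccine_levels = [0.2, 0.5, 0.9, 1.5]
new_efficacy = predicted_efficacy(new_vaccine_levels, best_threshold)
```

No distribution is fitted to the antibody levels; only the individual measurements and the threshold are used.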
37. We normalised the antibody level to that of convalescent patients (the mean for convalescent patients is one). Covaxin data came out a bit later, so we used Covaxin to validate our ‘classifier’ model. Using our classifier, as long as we have antibody data (from a phase 2 trial), we can predict the efficacy of any vaccine.
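The normalisation step can be sketched as dividing each antibody measurement by the mean of the convalescent group, so that convalescent patients average to one. The slide does not say which mean is used; an arithmetic mean is assumed here, and the titre values are hypothetical.

```python
def normalise_to_convalescent(titres, convalescent_titres):
    """Express titres as multiples of the convalescent mean (arithmetic mean assumed)."""
    convalescent_mean = sum(convalescent_titres) / len(convalescent_titres)
    return [t / convalescent_mean for t in titres]

# Hypothetical antibody titres from one study
convalescent = [10.0, 30.0]
vaccinated = [20.0, 40.0]

vaccinated_normalised = normalise_to_convalescent(vaccinated, convalescent)
convalescent_normalised = normalise_to_convalescent(convalescent, convalescent)
```

After this step, antibody levels from different vaccines are on a common scale relative to convalescent patients.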
38. CureVac mRNA vaccine failure – why???
39. Simple data visualisation can help to answer: because of a lower dose than Pfizer and Moderna