ECML PKDD 2015, Porto, Portugal
Assessing the impact of a health intervention via user-generated Internet data
Data Mining and Knowledge Discovery 29(5), pp. 1434–1457, 2015
Vasileios Lampos, Elad Yom-Tov, Richard Pebody and Ingemar J. Cox
[Image: excerpt of a weekly report on statutory notifications of infectious diseases (NOIDs) in England and Wales, week 2015/33, week ending 16/08/2015]
๏ Background and motivation
๏ Nowcasting disease rates from online text
๏ Estimating the impact of a health intervention
๏ Case study: influenza vaccination impact
๏ Conclusions & future work
Assessing the impact of a health intervention via online content
Online, user-generated data
+ Social media, blogs, search engine query logs
+ Proxy of real-world (online + offline) behaviour
+ Complementary information sensors to more ‘traditional’ crowdsourcing efforts
+ Can answer questions difficult to resolve otherwise
+ Strong predictive power
Online, user-generated data: Applications
+ Politics
• voting intention (Lampos, Preotiuc-Pietro & Cohn, 2013)
• result of an election (Tumasjan et al., 2010)
+ Finance
• financial indices (Bollen, Mao & Zeng, 2011)
• tourism patterns (Choi & Varian, 2012)
+ User profiling
• age (Rao et al., 2010)
• gender (Burger et al., 2011)
• occupation (Preotiuc-Pietro, Lampos & Aletras, 2015)
Online, user-generated data for health
Traditional disease surveillance
- does not cover the entire population
- not present everywhere (cities / countries)
- not always timely
Digital disease surveillance
+ different or better population coverage
+ better geographical granularity
+ useful in underdeveloped parts of the world
+ almost instant
- noisy, unstructured information
e.g. (Lampos & Cristianini, 2010 & 2012), (Lamb, Paul & Dredze, 2013), (Lampos et al., 2015)
What this work is all about
Health intervention → impact? → disease rates
(Lampos, Yom-Tov, Pebody & Cox, 2015)
Estimating disease rates from online text
Variables
• $N$: time intervals
• $M$: n-grams
• $X \in \mathbb{R}^{N \times M}$: frequency of n-grams during the time intervals
• $y \in \mathbb{R}^{N}$: disease rates during the time intervals
Ridge regression (Hoerl & Kennard, 1970)
$$\underset{w,\,\beta}{\operatorname{argmin}} \left( \sum_{i=1}^{N} \left(x_i w + \beta - y_i\right)^2 + \lambda \sum_{j=1}^{M} w_j^2 \right)$$
Elastic net (Zou & Hastie, 2005)
$$\underset{w,\,\beta}{\operatorname{argmin}} \left( \sum_{i=1}^{N} \left(x_i w + \beta - y_i\right)^2 + \lambda_1 \sum_{j=1}^{M} |w_j| + \lambda_2 \sum_{j=1}^{M} w_j^2 \right)$$
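The two objectives can be illustrated with a small solver. What follows is a minimal sketch, assuming cyclic coordinate descent as the optimiser (the slides do not say which solver was used); the function names `soft_threshold` and `elastic_net` and the toy data are hypothetical. Setting `lam1 = 0` reduces the elastic net objective to the ridge regression one.

```python
def soft_threshold(value, threshold):
    """Shrink value towards zero by threshold (the L1 proximal step)."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def elastic_net(X, y, lam1=0.0, lam2=0.0, sweeps=500):
    """Minimise sum_i (x_i w + beta - y_i)^2 + lam1*sum|w_j| + lam2*sum w_j^2.

    Assumption: plain cyclic coordinate descent with an unpenalised intercept.
    """
    n, m = len(X), len(X[0])
    w = [0.0] * m
    beta = 0.0
    for _ in range(sweeps):
        for j in range(m):
            rho = 0.0   # correlation of feature j with the residual ignoring j
            norm = 0.0  # squared norm of feature j
            for i in range(n):
                partial = beta + sum(X[i][k] * w[k] for k in range(m) if k != j)
                rho += X[i][j] * (y[i] - partial)
                norm += X[i][j] ** 2
            denom = norm + lam2
            w[j] = soft_threshold(rho, lam1 / 2.0) / denom if denom > 0 else 0.0
        # The intercept is not penalised: set it to the mean residual.
        beta = sum(y[i] - sum(X[i][k] * w[k] for k in range(m)) for i in range(n)) / n
    return w, beta


# Toy data generated from y = 1 + 2*x1 - 1*x2 (illustrative, not from the study).
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
y = [3.0, 0.0, 2.0, 4.0]
w_ols, beta_ols = elastic_net(X, y)                  # no penalty
w_ridge, beta_ridge = elastic_net(X, y, lam2=1.0)    # ridge regression
w_sparse, beta_sparse = elastic_net(X, y, lam1=100.0)  # heavy L1 penalty
```

With no penalty the sketch recovers the generating weights; the L2 term shrinks them, and a large enough L1 term sets them to exactly zero, which is the variable-selection behaviour that distinguishes the elastic net from ridge regression.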
Estimating disease rates from online text: Gaussian Process
From the observation matrix $X$ we want to learn a function $f$ drawn from a GP prior:
$$f(x) \sim \mathcal{GP}\left(\mu(x) = 0,\; k(x, x')\right)$$
Squared exponential (SE, or RBF) kernel:
$$k_{\text{SE}}(x, x') = \sigma^2 \exp\left(-\frac{\|x - x'\|_2^2}{2\ell^2}\right)$$
where $\sigma^2$ describes the overall level of variance and $\ell$ is referred to as the characteristic length-scale parameter.
Rational Quadratic (RQ) covariance function (kernel), an infinite sum of squared exponential (RBF) kernels with different length-scales:
$$k_{\text{RQ}}(x, x') = \sigma^2 \left(1 + \frac{\|x - x'\|_2^2}{2\alpha\ell^2}\right)^{-\alpha}$$
$\alpha$ is a parameter that determines the relative weighting between small-scale and large-scale variations of input pairs. The RQ kernel can be used to model functions that are expected to vary smoothly across many length-scales.
One kernel per n-gram category (varied usage patterns, increasing semantic value):
$$k(x, x') = \left(\sum_{n=1}^{C} k_{\text{RQ}}(g_n, g'_n)\right) + k_{\text{N}}(x, x')$$
where $g_n$ is used to express the features of each n-gram category.
(Rasmussen & Williams, 2006); see also (Lampos et al., 2015)
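The kernels above are straightforward to write down directly. This is a minimal sketch in plain Python, not the implementation used in the study: the function names are hypothetical, and `k_noise` assumes that $k_{\text{N}}$ is a white-noise kernel, which the slides name but do not define.

```python
import math


def sq_dist(x, x_prime):
    """Squared Euclidean distance between two feature vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, x_prime))


def k_se(x, x_prime, sigma=1.0, length=1.0):
    """Squared exponential kernel."""
    return sigma ** 2 * math.exp(-sq_dist(x, x_prime) / (2.0 * length ** 2))


def k_rq(x, x_prime, sigma=1.0, length=1.0, alpha=1.0):
    """Rational Quadratic kernel."""
    return sigma ** 2 * (1.0 + sq_dist(x, x_prime) / (2.0 * alpha * length ** 2)) ** (-alpha)


def k_noise(x, x_prime, sigma_n=0.1):
    """Assumed white-noise kernel: variance on identical inputs, zero otherwise."""
    return sigma_n ** 2 if x == x_prime else 0.0


def k_composite(groups, groups_prime, params, sigma_n=0.1):
    """Sum of one RQ kernel per n-gram category plus the noise term.

    groups / groups_prime: lists of C feature vectors, one per category.
    params: list of C dicts with keys sigma, length, alpha.
    """
    total = sum(k_rq(g, gp, **p) for g, gp, p in zip(groups, groups_prime, params))
    return total + k_noise(groups, groups_prime, sigma_n)


# Illustrative inputs: two categories with two and one n-gram features.
g = [[0.1, 0.2], [0.3]]
params = [
    {"sigma": 1.0, "length": 1.0, "alpha": 1.0},
    {"sigma": 2.0, "length": 1.0, "alpha": 1.0},
]
same_point = k_composite(g, g, params)
```

For very large `alpha` the RQ kernel approaches the SE kernel with the same length-scale, which is one way to sanity-check an implementation.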
Estimating influenza-like illness (ILI) rates: Data
[Figure: ILI rate per 100 people (PHE), 2012 to 2014, with the period covered by the Bing data marked]
User-generated data, geolocated in England
• Twitter: May 2011 to April 2014 (308 million tweets)
• Bing: end of December 2012 to April 2014
ILI rates from Public Health England (PHE)
Estimating ILI rates: Feature extraction
• Start with a manually crafted list of 36 textual markers, e.g. flu, headache, doctor, cough
• Extract frequent co-occurring n-grams from a corpus of 30 million UK tweets (February & March, 2014) after removing stop-words
• Set of markers expanded to 205 n-grams (n ≤ 4), e.g. #flu, #cough, annoying cough, worst sore throat
• Relatively small set of features motivated by previous work (Culotta, 2013)
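The marker expansion step can be sketched as follows. This is only an illustration under stated assumptions: whitespace tokenisation, a tiny made-up stop-word list, and a hypothetical `min_count` threshold standing in for "frequent"; the slides do not specify any of these details.

```python
from collections import Counter

# Illustrative stop-word list; the list actually used is not given in the slides.
STOP_WORDS = {"a", "an", "the", "i", "have", "this", "is", "my", "and"}


def ngrams(tokens, max_n=4):
    """Yield every n-gram of the token list with 1 <= n <= max_n."""
    for n in range(1, max_n + 1):
        for start in range(len(tokens) - n + 1):
            yield " ".join(tokens[start:start + n])


def expand_markers(tweets, seed_markers, min_count=2, max_n=4):
    """Return n-grams that occur in at least min_count tweets containing a seed marker."""
    counts = Counter()
    for tweet in tweets:
        tokens = [t for t in tweet.lower().split() if t not in STOP_WORDS]
        # Keep only tweets mentioning a seed marker (hashtag form included).
        if not any(t.lstrip("#") in seed_markers for t in tokens):
            continue
        counts.update(set(ngrams(tokens, max_n)))  # count each n-gram once per tweet
    return {gram for gram, count in counts.items() if count >= min_count}


tweets = [
    "I have an annoying cough",
    "this annoying cough is the worst",
    "nice weather today",
]
expanded = expand_markers(tweets, seed_markers={"cough"})
```

On this toy corpus the seed marker "cough" is expanded with "annoying" and "annoying cough", since both appear in two tweets that mention the seed.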
Estimating ILI rates: Experimental setup
Two time intervals based on the different temporal coverage of Twitter and Bing data
• Dt1: 154 weeks (May 2011 to April 2014)
• Dt2: 67 weeks (December 2012 to April 2014)
Stratified 10-fold cross validation
Error metrics
• Pearson correlation (r)
• Mean Absolute Error (MAE)
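For reference, the two metrics can be computed as below. This is a minimal sketch with hypothetical function names and toy inputs; it does not reproduce the stratified cross-validation procedure.

```python
import math


def pearson_r(a, b):
    """Pearson correlation between two equally long sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def mean_absolute_error(truth, predicted):
    """Average absolute difference between true and predicted rates."""
    return sum(abs(t - p) for t, p in zip(truth, predicted)) / len(truth)


# Toy values, not taken from the study.
r_up = pearson_r([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
r_down = pearson_r([1.0, 2.0, 3.0], [3.0, 2.0, 1.0])
mae = mean_absolute_error([1.0, 2.0, 3.0], [2.0, 2.0, 5.0])
```

Pearson correlation rewards tracking the shape of the ILI curve regardless of scale, while MAE penalises errors in the actual rate values, so the two are complementary.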
Estimating ILI rates: Performance
[Bar chart: Pearson correlation (r) per user-generated data source (Twitter (Dt1), Twitter (Dt2), Bing (Dt2)) for Ridge Regression, Elastic Net and Gaussian Process. Bar labels as extracted: 0.952, 0.924, 0.845, 0.867, 0.744, 0.718, 0.814, 0.698, 0.64]
[Bar chart: MAE × 10³ per user-generated data source (Twitter (Dt1), Twitter (Dt2), Bing (Dt2)) for Ridge Regression, Elastic Net and Gaussian Process. Bar labels as extracted: 1.598, 1.999, 2.196, 2.564, 3.198, 2.828, 2.963, 4.084, 3.074]
Estimating the impact of a health intervention
1. Disease intervention launched (to a set of areas)
2. Define a distinct set of control areas
3. Estimate disease rates in all areas
4. Identify pairs of areas with strong historical correlation in their disease rates
5. Use this relationship during and slightly after the intervention to infer disease rates in the affected areas had the intervention not taken place
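Step 4 can be sketched as a simple search over candidate control areas. This is an illustration only: the area names and rates are made up, and picking the single best-correlated control is an assumption for the example.

```python
import math


def pearson_r(a, b):
    """Pearson correlation between two equally long sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def best_control(target_rates, control_rates_by_area):
    """Return (area, r) for the control area whose pre-intervention rates
    correlate most strongly with the target area's rates."""
    scored = [(pearson_r(target_rates, rates), area)
              for area, rates in control_rates_by_area.items()]
    r, area = max(scored)
    return area, r


# Hypothetical pre-intervention disease rates.
target = [1.0, 2.0, 3.0, 4.0]
controls = {
    "area_a": [2.0, 4.0, 6.0, 8.5],
    "area_b": [4.0, 1.0, 3.0, 2.0],
}
chosen_area, chosen_r = best_control(target, controls)
```

Only data from before the intervention should enter this search, otherwise the intervention's own effect would leak into the choice of control.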
Estimating the impact of a health intervention
• $\tau = \{t_1, \dots, t_N\}$: time interval(s) before the intervention
• $v$: location(s) where the intervention took place
• $c$: control location(s)
• $q_v^{\tau}$: disease rate(s) in affected location before intervention
• $q_c^{\tau}$: disease rate(s) in control location before intervention
• $r(q_v^{\tau}, q_c^{\tau})$: high
Learn $f(w, \beta): \mathbb{R} \to \mathbb{R}$ such that
$$\underset{w,\,\beta}{\operatorname{argmin}} \sum_{i=1}^{N} \left(q_c^{t_i} w + \beta - q_v^{t_i}\right)^2$$
• $q_v^{*} = q_c w + b$: estimate projected rate(s) in affected location during/after intervention
• $q_v$: disease rate(s) in affected location during/after intervention
• $\delta_v = q_v - q_v^{*}$: absolute difference
• $\theta_v = \dfrac{q_v - q_v^{*}}{q_v^{*}}$: relative difference (impact) (Lambert & Pregibon, 2008)
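A worked example may make the projection concrete. The sketch below fits the one-dimensional least-squares mapping on pre-intervention rates and then computes the absolute and relative differences. Two assumptions are made and are not confirmed by the slides: `post_c` holds the control rates during/after the intervention, and rates are averaged over that period before differencing. All names and numbers are illustrative.

```python
def fit_line(q_c, q_v):
    """Least-squares w, b minimising sum_i (q_c[i] * w + b - q_v[i])^2."""
    n = len(q_c)
    mean_c, mean_v = sum(q_c) / n, sum(q_v) / n
    s_cc = sum((c - mean_c) ** 2 for c in q_c)
    s_cv = sum((c - mean_c) * (v - mean_v) for c, v in zip(q_c, q_v))
    w = s_cv / s_cc
    return w, mean_v - w * mean_c


def impact(pre_c, pre_v, post_c, post_v):
    """Return (delta, theta): absolute and relative difference between the
    observed and the projected mean rate in the affected location."""
    w, b = fit_line(pre_c, pre_v)
    projected = [c * w + b for c in post_c]  # rates had there been no intervention
    mean_obs = sum(post_v) / len(post_v)
    mean_proj = sum(projected) / len(projected)
    delta = mean_obs - mean_proj
    return delta, delta / mean_proj


# Toy rates: before the intervention q_v = 2 * q_c + 0.5 exactly.
pre_c, pre_v = [1.0, 2.0, 3.0, 4.0], [2.5, 4.5, 6.5, 8.5]
post_c, post_v = [2.0, 4.0], [3.6, 6.8]
delta, theta = impact(pre_c, pre_v, post_c, post_v)
```

Here the projected mean rate is 6.5 and the observed mean is 5.2, so `delta` is -1.3 and `theta` is -0.2, i.e. a 20% reduction relative to the projection.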
Live Attenuated Influenza Vaccine (LAIV) campaign
[Figure: ILI rate per 100 people (PHE/RCGP), 2012 to 2014, with the LAIV and post-LAIV periods ($\Delta t_v$) highlighted]
• LAIV programme for children (4 to 11 years) in pilot areas of England during the 2013/14 flu season
• Vaccination period (blue): Sept. 2013 to Jan. 2014
• Post-vaccination period (green): Feb. to April 2014
Target (vaccinated) & control areas
Vaccinated areas
Bury • Cumbria • Gateshead • Leicester • East Leicestershire • Rutland • South-East Essex • Havering (London) • Newham (London)
Control areas
Brighton • Bristol • Cambridge • Exeter • Leeds • Liverpool • Norwich • Nottingham • Plymouth • Sheffield • Southampton • York
Applying the impact estimation framework
Target vs. control areas
• Use previous flu season only to establish relationships
• Find the best correlated areas or supersets of them
Confidence intervals
• Bootstrap sampling of the regression residuals (mapping function of control to vaccinated areas)
• Bootstrap sampling of data prior to the application of the bootstrapped regressor
• 10⁵ bootstraps; use the .025 and .975 quantiles
Statistical significance assessment
• Impact estimate (abs.) > 2σ of the bootstrap estimates
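The confidence-interval procedure can be sketched as a two-stage bootstrap. This is one possible reading of the bullets above rather than the authors' code: stage 1 resamples the regression residuals to obtain a bootstrapped regressor, and stage 2 resamples the weeks the regressor is applied to. The function names, the toy rates and the small default number of bootstraps are assumptions.

```python
import random


def fit_line(q_c, q_v):
    """Least-squares w, b minimising sum_i (q_c[i] * w + b - q_v[i])^2."""
    n = len(q_c)
    mean_c, mean_v = sum(q_c) / n, sum(q_v) / n
    s_cc = sum((c - mean_c) ** 2 for c in q_c)
    s_cv = sum((c - mean_c) * (v - mean_v) for c, v in zip(q_c, q_v))
    w = s_cv / s_cc
    return w, mean_v - w * mean_c


def bootstrap_theta(pre_c, pre_v, post_c, post_v, n_boot=1000, seed=0):
    """Return (lower, upper, sd) of the bootstrapped relative impact theta."""
    rng = random.Random(seed)
    w, b = fit_line(pre_c, pre_v)
    fitted = [c * w + b for c in pre_c]
    residuals = [v - f for v, f in zip(pre_v, fitted)]
    post = list(zip(post_c, post_v))
    thetas = []
    for _ in range(n_boot):
        # Stage 1: residual bootstrap of the control-to-vaccinated mapping.
        noisy_v = [f + rng.choice(residuals) for f in fitted]
        w_b, b_b = fit_line(pre_c, noisy_v)
        # Stage 2: resample the weeks the bootstrapped regressor is applied to.
        sample = [rng.choice(post) for _ in post]
        mean_proj = sum(c * w_b + b_b for c, _ in sample) / len(sample)
        mean_obs = sum(v for _, v in sample) / len(sample)
        thetas.append((mean_obs - mean_proj) / mean_proj)
    thetas.sort()
    lower = thetas[int(0.025 * (n_boot - 1))]
    upper = thetas[int(0.975 * (n_boot - 1))]
    mean = sum(thetas) / n_boot
    sd = (sum((t - mean) ** 2 for t in thetas) / (n_boot - 1)) ** 0.5
    return lower, upper, sd


def is_significant(theta, sd):
    """Significance rule from the slide: |impact estimate| > 2 sigma."""
    return abs(theta) > 2.0 * sd


# Toy rates with some noise in the pre-intervention mapping.
pre_c = [1.0, 2.0, 3.0, 4.0, 5.0]
pre_v = [2.6, 4.4, 6.6, 8.4, 10.5]
post_c, post_v = [2.0, 3.0, 4.0], [3.6, 5.1, 6.9]
lower, upper, sd = bootstrap_theta(pre_c, pre_v, post_c, post_v, n_boot=500)
```

The interval is read off the sorted bootstrap estimates at the .025 and .975 quantiles, and the same estimates supply the standard deviation for the significance rule.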
Relationship between vaccinated & control areas
[Scatter plots: ILI rates in vaccinated areas against ILI rates in control areas, for the pre-vaccination period and during/after LAIV; axes normalised from 0 to 1]
• Twitter, all areas: r = .86
• Bing, all areas: r = .87
• Twitter, London areas: r = .74
• Bing, London areas: r = .85
Impact estimation results (strongly correlated controls)
Source   Target        r     δ × 10³            θ (%)
Twitter  All areas     .861  -2.5 (-4.1, -1.0)  -32.8 (-47.4, -15.6)
Bing     All areas     .866  -1.9 (-3.2, -0.7)  -21.7 (-32.1, -9.10)
Twitter  London areas  .738  -1.7 (-2.5, -0.9)  -30.5 (-41.8, -17.5)
Bing     London areas  .848  -2.8 (-4.1, -1.6)  -28.4 (-36.7, -17.9)
Impact estimation results (stat. sig.)
[Bar chart: -θ (%) per target (All areas, London areas, Newham, Cumbria, Gateshead) for Twitter and Bing. Bar labels as extracted: 30.2, 28.7, 21.7, 21.1, 30.4, 30.5, 32.8]
Projected vs. inferred ILI rates in vaccinated locations
[Line plots, one panel each for Twitter (all areas), Bing (all areas), Twitter (London areas) and Bing (London areas): inferred and projected ILI rates per 100 people over the weeks during and after the vaccination programme, October to April]
Sensitivity of impact estimates to variable controls
• Repeat the impact estimation for the N controls (up to 100) with r ≥ 95% of the best r → μ(δ) and μ(θ) (%)
• Measure % of difference, Δ(θ), between θ and μ(θ)
Source   Target        N    μ(r)  μ(δ) × 10³   μ(θ) (%)     Δθ (%)
Twitter  All areas     100  0.84  -2.5 (0.2)   -32.7 (2.1)  0.10
Bing     All areas     46   0.85  -1.4 (0.4)   -16.4 (3.6)  24.4
Twitter  London areas  79   0.70  -1.5 (0.1)   -27.9 (2.0)  8.32
Bing     London areas  100  0.84  -1.4 (0.2)   -16.9 (1.8)  40.4
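The sensitivity check can be sketched in two small functions. The formula for Δθ is an inference, not something the slides state: taking the absolute difference between θ and μ(θ) as a percentage of |θ| reproduces the Bing "All areas" row from the rounded table values, while the other rows match only approximately, presumably because the table was computed from unrounded numbers. Function names and the correlation values are hypothetical.

```python
def select_controls(correlations, fraction=0.95, limit=100):
    """Keep controls whose r is at least `fraction` of the best r, best first,
    capped at `limit` controls."""
    best = max(correlations.values())
    kept = [area for area, r in correlations.items() if r >= fraction * best]
    kept.sort(key=lambda area: correlations[area], reverse=True)
    return kept[:limit]


def percent_difference(theta, mu_theta):
    """Assumed definition of the % difference between theta and mu(theta)."""
    return abs(theta - mu_theta) / abs(theta) * 100.0


# Hypothetical correlations for three candidate controls.
kept = select_controls({"area_a": 0.86, "area_b": 0.83, "area_c": 0.70})

# Bing, all areas: theta = -21.7 and mu(theta) = -16.4 in the tables above.
bing_all = percent_difference(-21.7, -16.4)
```

A small Δθ, as for Twitter over all areas, indicates that the impact estimate barely depends on which well-correlated control is chosen; the larger Bing values indicate more sensitivity to that choice.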
Conclusions & points for discussion
• Framework for estimating the impact of a health intervention based on online content
• Access to different & larger parts of the population
Evaluation is hard, however:
• PHE’s impact estimates: -66% based on sentinel surveillance, -24% laboratory confirmed (Pebody et al., 2014)
• Correlation between actual vaccination uptake and our study’s estimated impacts
Why are Bing and Twitter estimations different?
• Different user demographics (?); this can be useful
• Different temporal resolution
Potential future work directions
• Improve supervised learning models
- better natural language processing / machine learning modelling
- combination of different data sources
• Work on unsupervised techniques
- inferring / understanding the demographics of the online medium will be essential
• More rigorous evaluation
Collaborators, acknowledgements & material
Elad Yom-Tov, Microsoft Research
Richard Pebody, Public Health England
Ingemar J. Cox, UCL & University of Copenhagen
Jens Geyti, UCL (Software Engineer)
Simon de Lusignan, University of Surrey & RCGP
Slides: ow.ly/RN7MZ
Paper: ow.ly/RN9J2
i-sense.org.uk
References
Bollen, Mao & Zeng. Twitter mood predicts the stock market. J Comp Science, 2011.
Burger, Henderson, Kim & Zarrella. Discriminating Gender on Twitter. EMNLP, 2011.
Choi & Varian. Predicting the Present with Google Trends. Economic Record, 2012.
Culotta. Lightweight methods to estimate influenza rates and alcohol sales volume from Twitter messages. Lang Resour Eval, 2013.
Hoerl & Kennard. Ridge regression: biased estimation for nonorthogonal problems. Technometrics, 1970.
Lamb, Paul & Dredze. Separating Fact from Fear: Tracking Flu Infections on Twitter. NAACL, 2013.
Lambert & Pregibon. Online effects of offline ads. Data Mining & Audience Intelligence for Advertising, 2008.
Lampos & Cristianini. Tracking the flu pandemic by monitoring the Social Web. CIP, 2010.
Lampos & Cristianini. Nowcasting Events from the Social Web with Statistical Learning. ACM TIST, 2012.
Lampos, Miller, Crossan & Stefansen. Advances in nowcasting influenza-like illness rates using search query logs. Sci Rep, 2015.
Lampos, Yom-Tov, Pebody & Cox. Assessing the impact of a health intervention via user-generated Internet content. DMKD, 2015.
Pebody et al. Uptake and impact of a new live attenuated influenza vaccine programme in England: early results of a pilot in primary school-age children, 2013/14 influenza season. Eurosurveillance, 2014.
Preotiuc-Pietro, Lampos & Aletras. An analysis of the user occupational class through Twitter content. ACL, 2015.
Rao, Yarowsky, Shreevats & Gupta. Classifying Latent User Attributes in Twitter. SMUC, 2010.
Rasmussen & Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.
Tumasjan, Sprenger, Sandner & Welpe. Predicting Elections with Twitter: What 140 characters Reveal about Political Sentiment. ICWSM, 2010.
Zou & Hastie. Regularization and variable selection via the elastic net. J R Stat Soc Series B Stat Methodol, 2005.

Porella : features, morphology, anatomy, reproduction etc.
Silpa
 
POGONATUM : morphology, anatomy, reproduction etc.
POGONATUM : morphology, anatomy, reproduction etc.POGONATUM : morphology, anatomy, reproduction etc.
POGONATUM : morphology, anatomy, reproduction etc.
Silpa
 
The Mariana Trench remarkable geological features on Earth.pptx
The Mariana Trench remarkable geological features on Earth.pptxThe Mariana Trench remarkable geological features on Earth.pptx
The Mariana Trench remarkable geological features on Earth.pptx
seri bangash
 
LUNULARIA -features, morphology, anatomy ,reproduction etc.
LUNULARIA -features, morphology, anatomy ,reproduction etc.LUNULARIA -features, morphology, anatomy ,reproduction etc.
LUNULARIA -features, morphology, anatomy ,reproduction etc.
Silpa
 
THE ROLE OF BIOTECHNOLOGY IN THE ECONOMIC UPLIFT.pptx
THE ROLE OF BIOTECHNOLOGY IN THE ECONOMIC UPLIFT.pptxTHE ROLE OF BIOTECHNOLOGY IN THE ECONOMIC UPLIFT.pptx
THE ROLE OF BIOTECHNOLOGY IN THE ECONOMIC UPLIFT.pptx
ANSARKHAN96
 

Último (20)

Bhiwandi Bhiwandi ❤CALL GIRL 7870993772 ❤CALL GIRLS ESCORT SERVICE In Bhiwan...
Bhiwandi Bhiwandi ❤CALL GIRL 7870993772 ❤CALL GIRLS  ESCORT SERVICE In Bhiwan...Bhiwandi Bhiwandi ❤CALL GIRL 7870993772 ❤CALL GIRLS  ESCORT SERVICE In Bhiwan...
Bhiwandi Bhiwandi ❤CALL GIRL 7870993772 ❤CALL GIRLS ESCORT SERVICE In Bhiwan...
 
Cyathodium bryophyte: morphology, anatomy, reproduction etc.
Cyathodium bryophyte: morphology, anatomy, reproduction etc.Cyathodium bryophyte: morphology, anatomy, reproduction etc.
Cyathodium bryophyte: morphology, anatomy, reproduction etc.
 
biology HL practice questions IB BIOLOGY
biology HL practice questions IB BIOLOGYbiology HL practice questions IB BIOLOGY
biology HL practice questions IB BIOLOGY
 
FAIRSpectra - Enabling the FAIRification of Analytical Science
FAIRSpectra - Enabling the FAIRification of Analytical ScienceFAIRSpectra - Enabling the FAIRification of Analytical Science
FAIRSpectra - Enabling the FAIRification of Analytical Science
 
Digital Dentistry.Digital Dentistryvv.pptx
Digital Dentistry.Digital Dentistryvv.pptxDigital Dentistry.Digital Dentistryvv.pptx
Digital Dentistry.Digital Dentistryvv.pptx
 
TransientOffsetin14CAftertheCarringtonEventRecordedbyPolarTreeRings
TransientOffsetin14CAftertheCarringtonEventRecordedbyPolarTreeRingsTransientOffsetin14CAftertheCarringtonEventRecordedbyPolarTreeRings
TransientOffsetin14CAftertheCarringtonEventRecordedbyPolarTreeRings
 
Porella : features, morphology, anatomy, reproduction etc.
Porella : features, morphology, anatomy, reproduction etc.Porella : features, morphology, anatomy, reproduction etc.
Porella : features, morphology, anatomy, reproduction etc.
 
module for grade 9 for distance learning
module for grade 9 for distance learningmodule for grade 9 for distance learning
module for grade 9 for distance learning
 
Thyroid Physiology_Dr.E. Muralinath_ Associate Professor
Thyroid Physiology_Dr.E. Muralinath_ Associate ProfessorThyroid Physiology_Dr.E. Muralinath_ Associate Professor
Thyroid Physiology_Dr.E. Muralinath_ Associate Professor
 
300003-World Science Day For Peace And Development.pptx
300003-World Science Day For Peace And Development.pptx300003-World Science Day For Peace And Development.pptx
300003-World Science Day For Peace And Development.pptx
 
Proteomics: types, protein profiling steps etc.
Proteomics: types, protein profiling steps etc.Proteomics: types, protein profiling steps etc.
Proteomics: types, protein profiling steps etc.
 
Climate Change Impacts on Terrestrial and Aquatic Ecosystems.pptx
Climate Change Impacts on Terrestrial and Aquatic Ecosystems.pptxClimate Change Impacts on Terrestrial and Aquatic Ecosystems.pptx
Climate Change Impacts on Terrestrial and Aquatic Ecosystems.pptx
 
Genetics and epigenetics of ADHD and comorbid conditions
Genetics and epigenetics of ADHD and comorbid conditionsGenetics and epigenetics of ADHD and comorbid conditions
Genetics and epigenetics of ADHD and comorbid conditions
 
FAIRSpectra - Enabling the FAIRification of Spectroscopy and Spectrometry
FAIRSpectra - Enabling the FAIRification of Spectroscopy and SpectrometryFAIRSpectra - Enabling the FAIRification of Spectroscopy and Spectrometry
FAIRSpectra - Enabling the FAIRification of Spectroscopy and Spectrometry
 
POGONATUM : morphology, anatomy, reproduction etc.
POGONATUM : morphology, anatomy, reproduction etc.POGONATUM : morphology, anatomy, reproduction etc.
POGONATUM : morphology, anatomy, reproduction etc.
 
The Mariana Trench remarkable geological features on Earth.pptx
The Mariana Trench remarkable geological features on Earth.pptxThe Mariana Trench remarkable geological features on Earth.pptx
The Mariana Trench remarkable geological features on Earth.pptx
 
LUNULARIA -features, morphology, anatomy ,reproduction etc.
LUNULARIA -features, morphology, anatomy ,reproduction etc.LUNULARIA -features, morphology, anatomy ,reproduction etc.
LUNULARIA -features, morphology, anatomy ,reproduction etc.
 
PATNA CALL GIRLS 8617370543 LOW PRICE ESCORT SERVICE
PATNA CALL GIRLS 8617370543 LOW PRICE ESCORT SERVICEPATNA CALL GIRLS 8617370543 LOW PRICE ESCORT SERVICE
PATNA CALL GIRLS 8617370543 LOW PRICE ESCORT SERVICE
 
Gwalior ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Gwalior ESCORT SERVICE❤CALL GIRL
Gwalior ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Gwalior ESCORT SERVICE❤CALL GIRLGwalior ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Gwalior ESCORT SERVICE❤CALL GIRL
Gwalior ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Gwalior ESCORT SERVICE❤CALL GIRL
 
THE ROLE OF BIOTECHNOLOGY IN THE ECONOMIC UPLIFT.pptx
THE ROLE OF BIOTECHNOLOGY IN THE ECONOMIC UPLIFT.pptxTHE ROLE OF BIOTECHNOLOGY IN THE ECONOMIC UPLIFT.pptx
THE ROLE OF BIOTECHNOLOGY IN THE ECONOMIC UPLIFT.pptx
 

Assessing the impact of a health intervention via user-generated Internet content

  • 1. ECML PKDD 2015, Porto, Portugal. Assessing the impact of a health intervention via user-generated Internet data. Data Mining and Knowledge Discovery 29(5), pp. 1434–1457, 2015. Vasileios Lampos, Elad Yom-Tov, Richard Pebody and Ingemar J. Cox. [Background image: NOIDs weekly report of statutory notifications of infectious diseases in England and Wales, week 2015/33, week ending 16/08/2015]
  • 2. Assessing the impact of a health intervention via online content (progress: 1%)
      ๏ Background and motivation
      ๏ Nowcasting disease rates from online text
      ๏ Estimating the impact of a health intervention
      ๏ Case study: influenza vaccination impact
      ๏ Conclusions & future work
  • 3. Online, user-generated data
      + Social media, blogs, search engine query logs
      + Proxy of real-world (online + offline) behaviour
      + Complementary information sensors to more 'traditional' crowdsourcing efforts
      + Can answer questions difficult to resolve otherwise
      + Strong predictive power
  • 4. Online, user-generated data: Applications
      + Politics
        • voting intention
        • result of an election
      + Finance
        • financial indices
        • tourism patterns
      + User profiling
        • age
        • gender
        • occupation
      (Preotiuc-Pietro, Lampos & Aletras, 2015), (Burger et al., 2011), (Rao et al., 2010), (Bollen, Mao & Zeng, 2011), (Choi & Varian, 2012), (Lampos, Preotiuc-Pietro & Cohn, 2013), (Tumasjan et al., 2010)
  • 5. Online, user-generated data for health
      Traditional disease surveillance
      - does not cover the entire population
      - not present everywhere (cities / countries)
      - not always timely
      Digital disease surveillance
      + different or better population coverage
      + better geographical granularity
      + useful in underdeveloped parts of the world
      + almost instant
      - noisy, unstructured information
      e.g. (Lampos & Cristianini, 2010 & 2012), (Lamb, Paul & Dredze, 2013), (Lampos et al., 2015)
  • 7. What this work is all about: what is the impact of a health intervention on disease rates? (Lampos, Yom-Tov, Pebody & Cox, 2015)
  • 8. Assessing the impact of a health intervention via online content (progress: 15%)
      ✓ Background and motivation
      ๏ Estimating disease rates from online text
      ๏ Estimating the impact of a health intervention
      ๏ Case study: influenza vaccination impact
      ๏ Conclusions & future work
  • 9. Estimating disease rates from online text
      Variables
      $N$: time intervals
      $M$: n-grams
      $X \in \mathbb{R}^{N \times M}$: frequency of n-grams during the time intervals
      $y \in \mathbb{R}^{N}$: disease rates during the time intervals
      Ridge regression (Hoerl & Kennard, 1970)
      $$\operatorname*{argmin}_{w,\beta} \left( \sum_{i=1}^{N} (x_i w + \beta - y_i)^2 + \lambda \sum_{j=1}^{M} w_j^2 \right)$$
      Elastic net (Zou & Hastie, 2005)
      $$\operatorname*{argmin}_{w,\beta} \left( \sum_{i=1}^{N} (x_i w + \beta - y_i)^2 + \lambda_1 \sum_{j=1}^{M} |w_j| + \lambda_2 \sum_{j=1}^{M} w_j^2 \right)$$
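The ridge objective above has a closed-form solution, which the following minimal sketch illustrates with NumPy only. The function name `fit_ridge`, the toy data and the choice to leave the intercept β unpenalised (by centring X and y) are assumptions of this sketch, not details taken from the slides; the elastic net needs an iterative solver and is not shown.

```python
import numpy as np

def fit_ridge(X, y, lam):
    """Ridge regression with an unpenalised intercept (closed form).

    Minimises sum_i (x_i w + beta - y_i)^2 + lam * sum_j w_j^2.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean = X.mean(axis=0)
    y_mean = y.mean()
    Xc = X - x_mean   # centring removes beta from the penalised problem
    yc = y - y_mean
    M = X.shape[1]
    w = np.linalg.solve(Xc.T @ Xc + lam * np.eye(M), Xc.T @ yc)
    beta = y_mean - x_mean @ w   # optimal intercept given w
    return w, beta

# Hypothetical toy data: 6 time intervals, 2 n-gram frequencies.
X_toy = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0],
                  [2.0, 1.0], [1.0, 2.0], [3.0, 0.5]])
y_toy = X_toy @ np.array([0.5, -0.2]) + 0.1

w_hat, beta_hat = fit_ridge(X_toy, y_toy, lam=0.0)    # no shrinkage
w_shrunk, _ = fit_ridge(X_toy, y_toy, lam=10.0)       # weights pulled towards zero
```

With `lam=0.0` the fit reduces to ordinary least squares; increasing `lam` shrinks the norm of the weight vector.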
  • 11. Estimating disease rates from online text
      Gaussian Process (Rasmussen & Williams, 2006); see also (Lampos et al., 2015)
      Given the observation matrix $X$, we want to learn a function $f: \mathbb{R}^{M} \to \mathbb{R}$ drawn from a GP prior
      $$f(x) \sim \mathcal{GP}\left( \mu(x) = 0,\; k(x, x') \right)$$
      Squared exponential (SE, also RBF) kernel
      $$k_{\mathrm{SE}}(x, x') = \sigma^2 \exp\left( -\frac{\|x - x'\|_2^2}{2\ell^2} \right)$$
      where $\sigma^2$ describes the overall level of variance and $\ell$ is referred to as the characteristic length-scale parameter.
      Rational Quadratic (RQ) covariance function (kernel): an infinite sum of SE kernels with different length-scales
      $$k_{\mathrm{RQ}}(x, x') = \sigma^2 \left( 1 + \frac{\|x - x'\|_2^2}{2\alpha\ell^2} \right)^{-\alpha}$$
      where $\alpha$ is a parameter that determines the relative weighting between small-scale and large-scale variations of input pairs. The RQ kernel can be used to model functions that are expected to vary smoothly across many length-scales.
      One kernel per n-gram category (varied usage patterns, increasing semantic value)
      $$k(x, x') = \left( \sum_{n=1}^{C} k_{\mathrm{RQ}}(g_n, g'_n) \right) + k_{\mathrm{N}}(x, x')$$
      where $g_n$ is used to express the features of each n-gram category.
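The kernels on this slide can be written down directly; the sketch below is one possible NumPy rendering. The function names, the toy inputs, the split of columns into categories and the form of the noise term (a variance added only on identical inputs) are assumptions made for illustration, since the slide only names $k_{\mathrm{N}}$.

```python
import numpy as np

def sq_dists(A, B):
    """Pairwise squared Euclidean distances between rows of A and rows of B."""
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)
    d2 = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.maximum(d2, 0.0)  # guard against tiny negative round-off

def k_se(A, B, sigma, ell):
    """Squared exponential kernel."""
    return sigma**2 * np.exp(-sq_dists(A, B) / (2.0 * ell**2))

def k_rq(A, B, sigma, ell, alpha):
    """Rational Quadratic kernel."""
    return sigma**2 * (1.0 + sq_dists(A, B) / (2.0 * alpha * ell**2)) ** (-alpha)

def k_combined(A, B, categories, params, sigma_noise):
    """Sum of one RQ kernel per n-gram category plus a noise term.

    categories: list of column-index lists, one per n-gram category (g_n)
    params:     list of (sigma, ell, alpha) tuples, one per category
    The noise term is assumed to be sigma_noise^2 on identical inputs.
    """
    same = A is B
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)
    K = np.zeros((A.shape[0], B.shape[0]))
    for cols, (sigma, ell, alpha) in zip(categories, params):
        K += k_rq(A[:, cols], B[:, cols], sigma, ell, alpha)
    if same:
        K += sigma_noise**2 * np.eye(A.shape[0])
    return K

# Hypothetical toy inputs: 3 time intervals, 3 n-gram frequencies.
X_toy = np.array([[0.1, 0.0, 0.3], [0.2, 0.1, 0.0], [0.0, 0.4, 0.2]])
cats = [[0, 1], [2]]                       # e.g. 1-grams in columns 0-1, 2-grams in column 2
pars = [(1.0, 0.5, 2.0), (0.5, 1.0, 1.0)]  # (sigma, ell, alpha) per category
K = k_combined(X_toy, X_toy, cats, pars, sigma_noise=0.1)
```

For very large `alpha` the RQ kernel approaches the SE kernel with the same `sigma` and `ell`, which is a quick sanity check on the implementation.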
  • 12. Estimating influenza-like illness (ILI) rates: Data
      User-generated data, geolocated in England
      • Twitter: May 2011 to April 2014 (308 million tweets)
      • Bing: end of December 2012 to April 2014
      ILI rates from Public Health England (PHE)
      [Figure: ILI rate per 100 people, 2012 to 2014; legend: ILI rates (PHE), Bing]
  • 13. Estimating ILI rates: Feature extraction
      • Start with a manually crafted list of 36 textual markers, e.g. flu, headache, doctor, cough
      • Extract frequent co-occurring n-grams from a corpus of 30 million UK tweets (February & March, 2014) after removing stop-words
      • Set of markers expanded to 205 n-grams (n ≤ 4), e.g. #flu, #cough, annoying cough, worst sore throat
      • Relatively small set of features motivated by previous work (Culotta, 2013)
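Once a set of n-gram markers is fixed, each row of the observation matrix holds their frequencies in one time interval. The sketch below shows one way such a row might be counted; it does not implement the co-occurrence expansion itself. The function names, the whitespace tokenisation and the normalisation by the number of tweets are assumptions of this sketch.

```python
from collections import Counter

def ngrams(tokens, max_n=4):
    """Yield all n-grams (as space-joined strings) with 1 <= n <= max_n."""
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])

def marker_frequencies(tweets, markers, stop_words=frozenset()):
    """Per-marker count divided by the number of tweets in the interval.

    Tokenisation is a plain lowercase whitespace split, and the
    normalisation by tweet count is an assumption of this sketch.
    """
    counts = Counter()
    for text in tweets:
        tokens = [t for t in text.lower().split() if t not in stop_words]
        counts.update(g for g in ngrams(tokens) if g in markers)
    total = max(len(tweets), 1)
    return {m: counts[m] / total for m in markers}

# Hypothetical tweets for one time interval.
week = ["Worst sore throat ever", "annoying cough again", "off to the doctor #flu"]
freqs = marker_frequencies(
    week,
    {"#flu", "annoying cough", "worst sore throat", "cough"},
    stop_words={"the", "to"},
)
```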
  • 14. Estimating ILI rates: Experimental setup
      Two time intervals based on the different temporal coverage of Twitter and Bing data
      • Dt1: 154 weeks (May 2011 to April 2014)
      • Dt2: 67 weeks (December 2012 to April 2014)
      Stratified 10-fold cross validation
      Error metrics
      • Pearson correlation (r)
      • Mean Absolute Error (MAE)
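The two error metrics and a fold assignment can be sketched in a few lines of NumPy. The slides do not say how the folds were stratified for a continuous target; dealing the samples into folds in order of their disease rate is an assumption of this sketch, as are all function names.

```python
import numpy as np

def pearson_r(y_true, y_pred):
    """Pearson correlation between true and inferred rates."""
    a = np.asarray(y_true, dtype=float)
    b = np.asarray(y_pred, dtype=float)
    a = a - a.mean()
    b = b - b.mean()
    return float((a @ b) / np.sqrt((a @ a) * (b @ b)))

def mean_absolute_error(y_true, y_pred):
    """Mean Absolute Error between true and inferred rates."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(diff)))

def stratified_folds(y, k=10):
    """Assign each sample to one of k folds so every fold spans the range of y.

    Samples are sorted by y and dealt round-robin into folds (an assumed
    stratification scheme for a continuous target).
    """
    y = np.asarray(y, dtype=float)
    order = np.argsort(y)
    folds = np.empty(len(y), dtype=int)
    folds[order] = np.arange(len(y)) % k
    return folds

# Hypothetical usage on 20 weekly rates.
folds = stratified_folds(np.arange(20.0), k=10)
```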
  • 15. Estimating ILI rates: Performance
      [Bar chart: Pearson correlation (r) per user-generated data source (Twitter (Dt1), Twitter (Dt2), Bing (Dt2)) and model (Ridge Regression, Elastic Net, Gaussian Process); bar labels as extracted: 0.952, 0.924, 0.845, 0.867, 0.744, 0.718, 0.814, 0.698, 0.64]
  • 16. Estimating ILI rates: Performance
      [Bar chart: MAE × 10³ per user-generated data source (Twitter (Dt1), Twitter (Dt2), Bing (Dt2)) and model (Ridge Regression, Elastic Net, Gaussian Process); bar labels as extracted: 1.598, 1.999, 2.196, 2.564, 3.198, 2.828, 2.963, 4.084, 3.074]
  • 17. Assessing the impact of a health intervention via online content (progress: 41%)
      ✓ Background and motivation
      ✓ Estimating disease rates from online text
      ๏ Estimating the impact of a health intervention
      ๏ Case study: influenza vaccination impact
      ๏ Conclusions & future work
  • 18. Estimating the impact of a health intervention
      1. Disease intervention launched (to a set of areas)
      2. Define a distinct set of control areas
      3. Estimate disease rates in all areas
      4. Identify pairs of areas with strong historical correlation in their disease rates
      5. Use this relationship during and slightly after the intervention to infer disease rates in the affected areas had the intervention not taken place
  • 19. Estimating the impact of a health intervention
      $\tau = \{t_1, \dots, t_N\}$: time interval(s) before the intervention
      $v$: location(s) where the intervention took place
      $c$: control location(s)
      $q_v^{\tau}$: disease rate(s) in affected location before intervention
      $q_c^{\tau}$: disease rate(s) in control location before intervention
      $r(q_v^{\tau}, q_c^{\tau})$: high
      $f(w, \beta): \mathbb{R} \to \mathbb{R}$ such that
      $$\operatorname*{argmin}_{w,\beta} \sum_{i=1}^{N} \left( q_c^{t_i} w + \beta - q_v^{t_i} \right)^2$$
  • 21. Estimating the impact of a health intervention
      $f(w, \beta): \mathbb{R} \to \mathbb{R}$ such that
      $$\operatorname*{argmin}_{w,\beta} \sum_{i=1}^{N} \left( q_c^{t_i} w + \beta - q_v^{t_i} \right)^2$$
      $q_v$: disease rate(s) in affected location during/after intervention
      $q_v^{*} = q_c w + b$: estimate of projected rate(s) in affected location during/after intervention
      $\delta_v = q_v - q_v^{*}$: absolute difference
      $\theta_v = \dfrac{q_v - q_v^{*}}{q_v^{*}}$: relative difference (impact) (Lambert & Pregibon, 2008)
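The framework on these slides amounts to a one-dimensional least squares fit followed by a projection. The sketch below is a minimal illustration under stated assumptions: the function name and toy numbers are hypothetical, the offset b is taken to be the fitted intercept β, and δ and θ are computed from the mean rates over the during/after weeks, which the slide does not specify.

```python
import numpy as np

def estimate_impact(q_c_pre, q_v_pre, q_c_post, q_v_post):
    """Fit q_v ~ q_c * w + beta before the intervention, then project.

    Returns (r, w, beta, delta, theta); delta and theta use the mean
    rates over the during/after period (an assumption of this sketch).
    """
    q_c_pre, q_v_pre, q_c_post, q_v_post = (
        np.asarray(a, dtype=float) for a in (q_c_pre, q_v_pre, q_c_post, q_v_post))
    r = np.corrcoef(q_c_pre, q_v_pre)[0, 1]           # historical correlation
    A = np.column_stack([q_c_pre, np.ones_like(q_c_pre)])
    (w, beta), *_ = np.linalg.lstsq(A, q_v_pre, rcond=None)
    q_v_star = q_c_post * w + beta                    # projected rates, no intervention
    delta = q_v_post.mean() - q_v_star.mean()         # absolute difference
    theta = delta / q_v_star.mean()                   # relative difference (impact)
    return r, w, beta, delta, theta

# Hypothetical rates: the affected area tracks 2 * control + 0.001 beforehand,
# then sits 20% below its projection during/after the intervention.
r, w, beta, delta, theta = estimate_impact(
    q_c_pre=[0.01, 0.02, 0.03, 0.04],
    q_v_pre=[0.021, 0.041, 0.061, 0.081],
    q_c_post=[0.02, 0.03],
    q_v_post=[0.0328, 0.0488],
)
```

A negative θ means the inferred rates in the affected area are below what the control area would have predicted.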
  • 22. Assessing the impact of a health intervention via online content (progress: 52%)
      ✓ Background and motivation
      ✓ Estimating disease rates from online text
      ✓ Estimating the impact of a health intervention
      ๏ Case study: influenza vaccination impact
      ๏ Conclusions & future work
  • 23. Live Attenuated Influenza Vaccine (LAIV) campaign
      • LAIV programme for children (4 to 11 years) in pilot areas of England during the 2013/14 flu season
      • Vaccination period (blue): Sept. 2013 to Jan. 2014
      • Post-vaccination period (green): Feb. to April 2014
      [Figure: ILI rate per 100 people (PHE/RCGP), 2012 to 2014, with the LAIV and post-LAIV periods (Δt_v) highlighted]
  • 24. Target (vaccinated) & control areas
      Vaccinated areas: Bury • Cumbria • Gateshead • Leicester • East Leicestershire • Rutland • South-East Essex • Havering (London) • Newham (London)
      Control areas: Brighton • Bristol • Cambridge • Exeter • Leeds • Liverpool • Norwich • Nottingham • Plymouth • Sheffield • Southampton • York
  • 25. Applying the impact estimation framework
      Target vs. control areas
      • Use the previous flu season only to establish relationships
      • Find the best correlated areas or supersets of them
      Confidence intervals
      • Bootstrap sampling of the regression residuals (mapping function of control to vaccinated areas)
      • Bootstrap sampling of data prior to the application of the bootstrapped regressor
      • 10⁵ bootstraps; use the .025 and .975 quantiles
      Statistical significance assessment
      • Impact estimate (abs.) > 2σ of the bootstrap estimates
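The confidence-interval procedure can be sketched as a two-stage bootstrap. This is an interpretation of the bullets above rather than the authors' code: the function name, the toy data, the default number of bootstraps and the way the during/after weeks are resampled are all assumptions.

```python
import numpy as np

def bootstrap_impact(q_c_pre, q_v_pre, q_c_post, q_v_post, n_boot=1000, seed=0):
    """Bootstrap the relative impact theta.

    Returns (theta_point, thetas, ci_low, ci_high), where the interval
    uses the .025 and .975 quantiles of the bootstrap estimates.
    """
    rng = np.random.default_rng(seed)
    q_c_pre, q_v_pre, q_c_post, q_v_post = (
        np.asarray(a, dtype=float) for a in (q_c_pre, q_v_pre, q_c_post, q_v_post))
    A = np.column_stack([q_c_pre, np.ones_like(q_c_pre)])
    coef, *_ = np.linalg.lstsq(A, q_v_pre, rcond=None)
    fitted = A @ coef
    resid = q_v_pre - fitted
    proj0 = q_c_post * coef[0] + coef[1]
    theta_point = (q_v_post.mean() - proj0.mean()) / proj0.mean()

    thetas = np.empty(n_boot)
    for b in range(n_boot):
        # 1) residual bootstrap of the control-to-vaccinated mapping
        y_b = fitted + rng.choice(resid, size=resid.size, replace=True)
        (w_b, beta_b), *_ = np.linalg.lstsq(A, y_b, rcond=None)
        # 2) resample the during/after weeks before applying the mapping
        idx = rng.integers(0, q_c_post.size, size=q_c_post.size)
        proj = q_c_post[idx] * w_b + beta_b
        thetas[b] = (q_v_post[idx].mean() - proj.mean()) / proj.mean()
    ci_low, ci_high = np.quantile(thetas, [0.025, 0.975])
    return theta_point, thetas, ci_low, ci_high

# Hypothetical noisy pre-intervention data and a 20% reduction afterwards.
rng_toy = np.random.default_rng(1)
q_c_pre = np.linspace(0.005, 0.03, 20)
q_v_pre = 2.0 * q_c_pre + 0.001 + rng_toy.normal(0.0, 0.0005, size=20)
q_c_post = np.linspace(0.01, 0.02, 8)
q_v_post = 0.8 * (2.0 * q_c_post + 0.001)

theta_point, thetas, ci_low, ci_high = bootstrap_impact(
    q_c_pre, q_v_pre, q_c_post, q_v_post, n_boot=500)
# Significance rule from the slide: |impact estimate| > 2 sigma of the bootstraps.
significant = abs(theta_point) > 2.0 * thetas.std()
```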
  • 26. Relationship between vaccinated & control areas
      [Scatter plots: ILI rates in vaccinated areas against ILI rates in control areas, axes normalised from 0 to 1; markers: pre-vaccination period, during/after LAIV; panels: Twitter, all areas (r = .86); Bing, all areas (r = .87)]
  • 27. Relationship between vaccinated & control areas
      [Scatter plots: ILI rates in vaccinated areas against ILI rates in control areas, axes normalised from 0 to 1; markers: pre-vaccination period, during/after LAIV; panels: Twitter, London areas (r = .74); Bing, London areas (r = .85)]
  • 28. Impact estimation results (strongly correlated controls)
      Source   Target        r     δ × 10³            θ (%)
      Twitter  All areas     .861  -2.5 (-4.1, -1.0)  -32.8 (-47.4, -15.6)
      Bing     All areas     .866  -1.9 (-3.2, -0.7)  -21.7 (-32.1, -9.10)
      Twitter  London areas  .738  -1.7 (-2.5, -0.9)  -30.5 (-41.8, -17.5)
      Bing     London areas  .848  -2.8 (-4.1, -1.6)  -28.4 (-36.7, -17.9)
  • 32. Impact estimation results (stat. sig.)
      [Bar chart: -θ (%) for Twitter and Bing across All areas, London areas, Newham, Cumbria and Gateshead; bar labels as extracted: 30.2, 28.7, 21.7, 21.1, 30.4, 30.5, 32.8]
  • 33. Projected vs. inferred ILI rates in vaccinated locations
      [Line plots: inferred and projected ILI rates per 100 people over the weeks during and after the vaccination programme (Oct to Apr); panels: Twitter, all areas; Bing, all areas]
  • 34. Projected vs. inferred ILI rates in vaccinated locations
      [Line plots: inferred and projected ILI rates per 100 people over the weeks during and after the vaccination programme (Oct to Apr); panels: Twitter, London areas; Bing, London areas]
  • 35. Sensitivity of impact estimates to variable controls
      • Repeat the impact estimation for the N controls (up to 100) with r ≥ 95% of the best r → μ(δ) and μ(θ) (%)
      • Measure the % difference, Δ(θ), between θ and μ(θ)
      Source   Target        N    μ(r)  μ(δ) × 10³  μ(θ) (%)     Δθ (%)
      Twitter  All areas     100  0.84  -2.5 (0.2)  -32.7 (2.1)  0.10
      Bing     All areas     46   0.85  -1.4 (0.4)  -16.4 (3.6)  24.4
      Twitter  London areas  79   0.70  -1.5 (0.1)  -27.9 (2.0)  8.32
      Bing     London areas  100  0.84  -1.4 (0.2)  -16.9 (1.8)  40.4
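The slide does not spell out how Δθ is computed. A plausible reading, assumed here, is the absolute difference between θ and μ(θ) expressed as a percentage of θ; the helper name is hypothetical. Because the tabulated θ and μ(θ) values are rounded, this reproduces the reported figures only approximately for some rows.

```python
def relative_change(theta, mu_theta):
    """Percentage difference between theta and mu(theta), relative to theta."""
    return abs(theta - mu_theta) / abs(theta) * 100.0

# Bing, all areas: theta = -21.7 %, mu(theta) = -16.4 % (values from the tables).
delta_theta_bing_all = relative_change(-21.7, -16.4)
```

A large Δθ, as in the Bing rows, indicates that the impact estimate depends noticeably on which control areas are chosen.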
  • 38. Assessing the impact of a health intervention via online content (progress: 89%)
      ✓ Background and motivation
      ✓ Estimating disease rates from online text
      ✓ Estimating the impact of a health intervention
      ✓ Case study: influenza vaccination impact
      ๏ Conclusions & future work
  • 39. Conclusions & points for discussion
      • Framework for estimating the impact of a health intervention based on online content
      • Access to different & larger parts of the population
      Evaluation is hard, however:
      • PHE's impact estimates: -66% based on sentinel surveillance, -24% laboratory confirmed (Pebody et al., 2014)
      • Correlation between actual vaccination uptake and our study's estimated impacts
      Why are Bing and Twitter estimations different?
      • Different user demographics (?); this can be useful
      • Different temporal resolution
  • 40. Potential future work directions
      • Improve supervised learning models
        - better natural language processing / machine learning modelling
        - combination of different data sources
      • Work on unsupervised techniques
        - inferring / understanding the demographics of the online medium will be essential
      • More rigorous evaluation
  • 41. Collaborators, acknowledgements & material
      Elad Yom-Tov, Microsoft Research
      Richard Pebody, Public Health England
      Ingemar J. Cox, UCL & University of Copenhagen
      Jens Geyti, UCL (Software Engineer)
      Simon de Lusignan, University of Surrey & RCGP
      Slides: ow.ly/RN7MZ
      Paper: ow.ly/RN9J2
      i-sense.org.uk
  • 42. References
      Bollen, Mao & Zeng. Twitter mood predicts the stock market. J Comp Science, 2011.
      Burger, Henderson, Kim & Zarrella. Discriminating Gender on Twitter. EMNLP, 2011.
      Choi & Varian. Predicting the Present with Google Trends. Economic Record, 2012.
      Culotta. Lightweight methods to estimate influenza rates and alcohol sales volume from Twitter messages. Lang Resour Eval, 2013.
      Hoerl & Kennard. Ridge regression: biased estimation for nonorthogonal problems. Technometrics, 1970.
      Lamb, Paul & Dredze. Separating Fact from Fear: Tracking Flu Infections on Twitter. NAACL, 2013.
      Lambert & Pregibon. Online effects of offline ads. Data Mining & Audience Intelligence for Advertising, 2008.
      Lampos & Cristianini. Tracking the flu pandemic by monitoring the Social Web. CIP, 2010.
      Lampos & Cristianini. Nowcasting Events from the Social Web with Statistical Learning. ACM TIST, 2012.
      Lampos, Miller, Crossan & Stefansen. Advances in nowcasting influenza-like illness rates using search query logs. Sci Rep, 2015.
      Lampos, Yom-Tov, Pebody & Cox. Assessing the impact of a health intervention via user-generated Internet content. DMKD, 2015.
      Pebody et al. Uptake and impact of a new live attenuated influenza vaccine programme in England: early results of a pilot in primary school-age children, 2013/14 influenza season. Eurosurveillance, 2014.
      Preotiuc-Pietro, Lampos & Aletras. An analysis of the user occupational class through Twitter content. ACL, 2015.
      Rao, Yarowsky, Shreevats & Gupta. Classifying Latent User Attributes in Twitter. SMUC, 2010.
      Rasmussen & Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.
      Tumasjan, Sprenger, Sandner & Welpe. Predicting Elections with Twitter: What 140 characters Reveal about Political Sentiment. ICWSM, 2010.
      Zou & Hastie. Regularization and variable selection via the elastic net. J R Stat Soc Series B Stat Methodol, 2005.