Here's a copy of the slides I used for the talk at the AAAS session on Web Surveillance: Fighting Terrorism and Infectious Diseases in Vancouver, February 2012.
Text mining in action: early detection of disease outbreaks from online media
1. Text mining in action: early detection of disease outbreaks from online media
Nigel Collier
Associate Professor
National Institute of Informatics, Tokyo
and Japan Science and Technology Agency SAKIGAKE program
collier@nii.ac.jp
http://sites.google.com/site/nhcollier/
PI of “BioCaster” project (JST, Sakigake grant-in-aid)
AAAS Annual Meeting, Vancouver, Saturday 19th February 2012 (13:30-16:30)
3. [Figure: information sources plotted by time against certainty: rumours, sentinel networks, GP reports, field workers, laboratory reports]
Blog rumour> “Ahh! Really bad throat.”
Blog rumour> “I’m sick with a chest infection”
Blog rumour> “Still getting worse. Staying at home temp is up to 39.5.”
News report> “Influenza starts early this year.”
News report> “Mystery illness causes concern.”
5. http://born.nii.ac.jp
Ontology browsing
Trend graphs
Email/GeoRSS alerting
Watchboard, etc.
Event database search
Up to date news in 12 languages
Event summaries
Real time Twitter analysis
Event alerts
WHO; GHSAG partners: US, IT, UK, JP, FR, CA, DE
6. Technical challenges
REAL TIME
SCALING
X0,000 news providers
30,000-40,000 news items/day
900 on topic/day
200 events/day
4 alerts/day
7. Technical challenges
X0,000 news providers
REAL TIME
SCALING
MULTILINGUALITY
鳥インフルエンザ / Avian Flu / Influenza aviaire / Cúm gia cầm / 조류인플루엔자
[Chart: Percentage of News by Language: English, Chinese, German, Russian, Korean, French, Vietnamese, Portuguese, Other]
[Chart: News event counts for porcine foot-and-mouth outbreak in South Korea 2010-2011]
Increased sensitivity and timeliness from multilingual news
8. Technical challenges
X0,000 news providers
REAL TIME
SCALING
MULTILINGUALITY
AMBIGUITY
Temporal identification: “The Spanish flu outbreak…”
Entity identification: “Obama fever builds as Americans await a new era”
Toponym grounding: “Equine influenza in Camden”: Camden (UK), Camden (AU), Camden (CA) + 19 others
Variant transliterations: Tajoura, Tajura, Tajoora…
Coreference: “Two British holidaymakers fell ill…” “Two male pensioners died…” 2 or 4 victims?
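The variant transliteration problem above can be illustrated with simple fuzzy string matching. This is only a minimal sketch using Python's standard `difflib`; the gazetteer entries, the function names and the 0.8 threshold are hypothetical and are not taken from BioCaster.

```python
from difflib import SequenceMatcher

# Hypothetical mini-gazetteer of canonical place names (illustrative only).
GAZETTEER = ["Tajoura", "Camden", "Tripoli"]

def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity ratio in [0, 1] computed by difflib."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def normalise_toponym(mention: str, threshold: float = 0.8):
    """Return the closest gazetteer entry, or None if nothing is similar enough."""
    best = max(GAZETTEER, key=lambda name: similarity(mention, name))
    return best if similarity(mention, best) >= threshold else None

for variant in ["Tajoura", "Tajura", "Tajoora"]:
    print(variant, "->", normalise_toponym(variant))
```

String similarity only groups spelling variants. It does nothing to decide which Camden is meant, which needs context from the rest of the article.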
10. A snapshot of the BioCaster ontology
[1] Kawazoe, A., Chanlekha, H., Shigematsu, M. and Collier, N. (2008), “Structuring an event ontology for disease outbreak detection”, BMC Bioinformatics, 9 (Suppl 3):S8.
[2] Collier, N., Kawazoe, A., Jin, L., Shigematsu, M., Dien, D., Barrero, R., Takeuchi, K. and Kawtrakul, A. (2007), “A multilingual ontology for infectious disease surveillance: rationale, design and challenges”, Language Resources and Evaluation, Springer, DOI: 10.1007/s10579-007-9019-7.
11. Extant technology gaps
– How can we understand ‘norms’ and detect their violations?
• Time series analysis and summarization
– How do we integrate event features?
• Across languages
• Across media types
• Across ontologies/granularities
– How do we rapidly adapt surveillance systems to new vocabulary/event types/domains?
12.
13. 5 detection algorithms
1. Early aberration reporting system (EARS) C2 algorithm
– captures the number of standard deviations by which the current count exceeds the history mean;
– $S_t = \max\left(0, \frac{C_t - (\mu_t + k\sigma_t)}{\sigma_t}\right)$
2. EARS C3 algorithm
– similar to C2 except that C3 uses a weighted sum of the previous 3 days for the current period;
3. W2 algorithm
– a modified version of C2 which ignores history counts on Saturdays and Sundays to compensate
for day of week effects;
4. F statistic
– compares the variance in the history window to the variance in the current window;
– $S_t = \sigma_t^2 + \sigma_b^2$
5. Exponential Weighted Moving Average (EWMA)
– provides less weight to days in the history that are further from the test day.
– $S_t = \frac{Y_t - \mu_t}{\sigma_t \left(\lambda / (2 - \lambda)\right)^{1/2}}$, where $Y_1 = C_1$ and $Y_t = \lambda C_t + (1 - \lambda) Y_{t-1}$
Model parameters were estimated based on an additional 5 epidemic data sets from
ProMED-mail (data not shown)
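The C2 and EWMA formulas above can be sketched in a few lines of Python. This is an illustrative sketch, not the implementation used in the study: the 7-day baseline, the 2-day lag before the test day, `k`, `lam`, the `min_sigma` floor and the daily counts are all assumptions chosen for the example.

```python
from statistics import mean, stdev

def baseline_stats(counts, t, baseline=7, lag=2, min_sigma=0.2):
    """Mean and standard deviation of the `baseline` days ending `lag` days before day t."""
    start = t - lag - baseline
    if start < 0:
        raise ValueError("not enough history before day t")
    history = counts[start:t - lag]
    # Floor sigma so a flat history does not divide by zero (an assumption).
    return mean(history), max(stdev(history), min_sigma)

def c2_score(counts, t, k=1.0, **window):
    """S_t = max(0, (C_t - (mu_t + k*sigma_t)) / sigma_t)."""
    mu, sigma = baseline_stats(counts, t, **window)
    return max(0.0, (counts[t] - (mu + k * sigma)) / sigma)

def ewma_score(counts, t, lam=0.4, **window):
    """S_t = (Y_t - mu_t) / (sigma_t * sqrt(lam / (2 - lam)))."""
    mu, sigma = baseline_stats(counts, t, **window)
    y = counts[0]  # Y_1 = C_1
    for c in counts[1:t + 1]:
        y = lam * c + (1 - lam) * y  # Y_t = lam*C_t + (1 - lam)*Y_{t-1}
    return (y - mu) / (sigma * (lam / (2 - lam)) ** 0.5)

# Hypothetical daily news event counts with a surge on the last day.
daily_counts = [2, 3, 2, 4, 3, 2, 3, 3, 2, 15]
print("C2:  ", c2_score(daily_counts, t=9))
print("EWMA:", ewma_score(daily_counts, t=9))
```

An alarm would be raised when a score exceeds a chosen threshold. The threshold itself is one of the parameters that has to be estimated from data.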
[3] Burkom, H. S. (2005), “Accessible Alerting Algorithms for Biosurveillance”, National Syndromic Surveillance Conference.
[4] Jackson, M. L. et al. (2007), “A simulation study comparing aberration detection algorithms for syndromic surveillance”, BMC Medical Informatics and Decision Making, 7:6, DOI: 10.1186/1472-6947-7-6.
[5] Madoff, L. (2004), “ProMED-mail: An early warning system for emerging diseases”, Clin Infect Dis, 39(2): 227–232.
14. Test Data
#   Disease            Country      ProMED-alerts
1   Hand, foot, mouth  PR China     9
2   Ebola              Congo        17
3   Yellow fever       Brazil       28
4   Influenza          USA          21
5   Cholera            Iraq         5
6   Chikungunya        Singapore    8
7   Anthrax            USA          15
8   Yellow fever       Argentina    5
9   Ebola Reston       Philippines  15
10  Influenza          Egypt        49
11  Plague             USA          8
12  Dengue             Brazil       27
13  Dengue             Indonesia    14
14  Measles            UK           13
15  Chikungunya        Malaysia     15
16  Yellow fever       Senegal      0
17  Influenza          Indonesia    35
18  Influenza          Bangladesh   3
• 14 countries and 11 infectious disease types
• 366 days of news data were collected from BioCaster for each disease and country
• The study period is 17th June 2008 to 17th June 2009
16. Time from outbreak news to outbreak detection
Outbreak characteristics: Early surge vs multi-modal transmission
[Figure: News event frequency over time. Source: BioCaster]
Testing data sets for a range of diseases used in Collier, N. (2010), “Towards cross-lingual alerting for bursty epidemic events”, J. Biomed. Semantics, 2 (Suppl 5):S10.
Best performance using EARS C3 algorithm on multilingual news event counts: 4 days earlier than ProMED with an F-measure of 0.56 and 12.0 alarms/100 days.
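To make the two reported metrics concrete, here is a minimal sketch of how an F-measure and an alarm rate could be computed. It assumes a simple day-level comparison between alarm days and gold-standard alert days; the slides do not describe the exact matching protocol, and the function name and the day indices below are hypothetical.

```python
def evaluate_alarms(alarm_days, gold_days, n_days):
    """Day-level precision, recall, F-measure and alarms per 100 days."""
    alarms, gold = set(alarm_days), set(gold_days)
    true_pos = len(alarms & gold)  # alarm days that coincide with a gold alert day
    precision = true_pos / len(alarms) if alarms else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
        "alarms_per_100_days": 100 * len(alarms) / n_days,
    }

# Hypothetical alarm and gold-standard days within a 366-day study period.
result = evaluate_alarms(alarm_days=[3, 10, 20, 40], gold_days=[10, 20, 30], n_days=366)
print(result)
```

Timeliness, the number of days by which an alarm precedes the reference alert, would need a separate calculation on matched alarm and alert dates.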
17. The landscape of Web sensing for public health
GPHIN (Ginsberg et al. 2009) EpiSpider (Tolentino et al. 2007)
MiTaP (Damianos et al. 2002) BioCaster (Collier et al. 2008)
Argus (Wilson et al. 2008) Medisys (Yangarber et al. 2007)
HealthMap (Freifeld et al. 2008) ProMED-mail (Madoff 2004)
[Figure: systems arranged by media type, with labels Newswire, Radio, SMS/microblog, Online query signals, Social networks, Lifestream, Livecast, Share and Discuss. Systems shown: MiTaP (?) (Damianos et al. 2002), Ushahidi (Okolloh et al. 2009), Twitter Earthquake Detector (Guy et al. 2010), Google Flu Trends (Ginsberg et al. 2009), HealthMap (Freifeld et al. 2008), BioCaster (Collier et al. 2008)]
18. Classification scheme
• Disease spread can be strongly influenced by behavioural changes [7]
• After surveying Twitter messages we conflated Jones and Salathé’s groupings into three categories and added two new ones:
– (A) Avoiding behaviour
• Avoid people who cough/sneeze, Avoid large gatherings of people, Avoid
public transportation, Avoid travel to infected areas
– (I) Increased sanitation
• Wash hands more often, use disinfectant
– (W) Wearing a mask
– (P) Pharmaceutical intervention
• Seeking clinical advice or using medicines or vaccines to prevent disease
– (S) Self reported diagnosis
• User reports that they have the flu
[7] Jones, J., Salathé, M. (2009), “Early assessment of anxiety and behavioral response to novel swine-origin influenza A(H1N1)”, PLoS One, 4(12):e8032.
[8] Collier, N. (2009), “OMG U got flu? Analysis of shared health messages for bio-surveillance”, in Proc. 4th Symposium on Semantic Mining in Biomedicine (SMBM’10).
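As a rough illustration of the five-way scheme above, the sketch below tags a message with every category whose keyword pattern it matches. The patterns are hypothetical examples written for this sketch; they are not the features or the classifier used in the study.

```python
import re

# Hypothetical keyword patterns per category (illustrative only).
PATTERNS = {
    "A": r"\bavoid(ing|ed|s)?\b",
    "I": r"\bwash(ing|ed)? (my |your )?hands\b|\bdisinfectant\b",
    "W": r"\bmasks?\b",
    "P": r"\bvaccin\w*|\btamiflu\b|\bflu shot\b",
    "S": r"\bi (have|got) (the )?flu\b",
}

def classify(message: str):
    """Return the category codes (A, I, W, P, S) whose pattern matches the message."""
    return [
        code
        for code, pattern in PATTERNS.items()
        if re.search(pattern, message, flags=re.IGNORECASE)
    ]

print(classify("Washing my hands every hour and wearing a mask on the bus"))
print(classify("I have the flu, got my Tamiflu today"))
```

A message can fall into several categories at once, so the function returns a list instead of a single label.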
19. Anxiety indicators have moderately strong correlation with CDC A(H1N1) lab data 2009-2010
[Figure: weekly series for CDC, A, S, I, P, A+I+P and A+I+P+S over weeks 46 to 52 and 1 to 5; two vertical axes, 0 to 3000 and 0 to 450]
Category   Spearman’s Rho   P-value
A          0.66             0.020
S          0.66             0.021
I          0.58             0.048
P          0.67             0.017
A+I+P      0.68             0.008
A+I+P+S    0.67             0.017
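For readers unfamiliar with Spearman's rho, the sketch below shows how such a rank correlation can be computed between a weekly message count and a weekly laboratory count. It is a plain-Python sketch and the two series are hypothetical numbers, not the data behind the table.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for pos in range(i, j + 1):
            ranks[order[pos]] = avg
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Pearson correlation of the rank-transformed series."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5

# Hypothetical weekly counts: tagged messages vs. positive lab results.
weekly_messages = [120, 150, 90, 60, 40]
weekly_lab_positives = [2500, 2100, 1500, 900, 700]
print(spearman_rho(weekly_messages, weekly_lab_positives))
```

This sketch does not compute p-values. A statistics package such as SciPy (`scipy.stats.spearmanr`) returns both the coefficient and a p-value.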
20. DIZIE: Text mining from personal health reports on Twitter
Syndromic surveillance for gastrointestinal, respiratory, neurological, dermatological, haemorrhagic and musculoskeletal syndromes from tweets in 40 world cities.
21. Significance and connections
• PH analysis is a highly skilled human task made easier by text mining from
open sources
• Value in transparent evaluation of core technologies using gold standards
– Good understanding now of intrinsic components
– More extrinsic evaluations needed to broaden uptake among PH community
– Community discussion needed on utility of evaluation strategies.
• Power of integrating sources needs to be explored
[Figure: heat map showing lowest ranked countries by number of reports per ’000 population gathered by BioCaster]
22. Special thanks
• Funding
– Japan Science and Technology Agency’s SAKIGAKE fund
– JSPS Young Researcher type A fund
• Postdoctoral students:
– Son Doan, PhD., Mike Conway, PhD. (now at UCSD), Reiko Goodwin, PhD.
(Fordham U.), Ai Kawazoe, PhD. (now at Tsuda U.)
• Ph.D. students
– John McCrae, PhD. (now at Bielefeld U.), Hutchatai Chanlekha, PhD. (now at
Kasetsart U.)
• Intern students
– Wita Ratsameetip (Chulalongkorn University, Thailand), Nguyen Truong Son (Vietnam National University, Ho Chi Minh City, Vietnam), Nguyen Thi Ngoc Mai (Vietnam National University, Ho Chi Minh City, Vietnam), Aurelie Chabord (ENSIMAG-Grenoble INP, France), Therawat Tooumnauy (Kasetsart University, Thailand), Nam Xuan Cao (Vietnam National University, Ho Chi Minh City, Vietnam), Hoang Cong Duy Vu (Vietnam National University, Ho Chi Minh City, Vietnam), Nghiem Quoc Minh (Vietnam National University, Ho Chi Minh City, Vietnam), Van Chi Nam (Vietnam National University, Ho Chi Minh City, Vietnam), Nguyen Thi Hong Nhung (Vietnam National University, Ho Chi Minh City, Vietnam), Pham Thao Thi Xuan (Vietnam National University, Ho Chi Minh City, Vietnam), Ngo Quoc Hung (Vietnam National University, Ho Chi Minh City, Vietnam), Tran Tri Quoc (Vietnam National University, Ho Chi Minh City, Vietnam)