Text mining in action: early detection of
disease outbreaks from online media

Nigel Collier
Associate Professor
National Institute of Informatics, Tokyo
and Japan Science and Technology Agency SAKIGAKE program
collier@nii.ac.jp
http://sites.google.com/site/nhcollier/
PI of “BioCaster” project (JST, Sakigake grant-in-aid)


AAAS Annual Meeting, Vancouver, Saturday 19th February 2012 (13:30-16:30)
>>> From fiction to fact




Contagion
© Warner Bros, 2011




                      World Health Organization, Timeline of Influenza A(H1N1), 2009, © WHO
[Figure: signal sources plotted by Time against Certainty: Rumours, Sentinel networks, GP reports, Field workers, Laboratory reports]

Example signals from online media:
  Blog rumour> “Ahh! Really bad throat.”
  News report> “Influenza starts early this year.”
  Blog rumour> “Still getting worse. Staying at home temp is up to 39.5.”
  News report> “Mystery illness causes concern.”
  Blog rumour> “I’m sick with a chest infection”
Alerting real world events
    1. Biological reality (Cholera, 2007, Iraq)
    2. News media response
    3. Detect event signals
    4. Select anomalous events
    5. PH Partner validate, analyse and communicate

[Chart: News volume (0-400) against Time, with Alert level marked]
http://born.nii.ac.jp
•   Event database search
•   Up to date news in 12 languages
•   Event summaries
•   Trend graphs
•   Ontology browsing
•   Email/GeoRSS alerting
•   Watchboard, etc.
•   Event alerts
•   Real time Twitter analysis

GHSAG partners: WHO, IT, JP, CA, US, UK, FR, DE
Technical challenges

    REAL TIME SCALING

    X0,000 news providers
    30,000-40,000 news items/day
    900 on topic/day
    200 events/day
    4 alerts/day
Technical challenges

    X0,000 news providers

    REAL TIME SCALING
    MULTILINGUALITY
     • One disease, many surface forms: 鳥インフルエンザ, Avian Flu, Influenza aviaire, Cúm gia cầm, 조류인플루엔자
     • Increased sensitivity and timeliness from multilingual news

[Chart: Percentage of News by Language: English, Chinese, German, Russian, Korean, French, Vietnamese, Portuguese, Other]
[Chart: News event counts for porcine foot-and-mouth outbreak in South Korea 2010-2011]
Technical challenges

    X0,000 news providers

    REAL TIME SCALING
    MULTILINGUALITY
    AMBIGUITY
     • Temporal identification: “The Spanish flu outbreak…”
     • Entity identification: “Obama fever builds as Americans await a new era”
     • Toponym grounding: “Equine influenza in Camden”: Camden (UK), Camden (AU), Camden (CA) + 19 others
     • Variant transliterations: Tajoura, Tajura, Tajoora…
     • Coreference: “Two British holidaymakers fell ill… ” and “Two male pensioners died…”: 2 or 4 victims?
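The variant transliteration problem above can be illustrated with a minimal sketch that maps spelling variants onto a canonical place name by string similarity. The gazetteer, the `normalise_place` helper and the 0.8 cutoff are all hypothetical choices made for illustration; the slides do not say how BioCaster handles this, and string similarity alone does nothing for toponym grounding (it cannot tell which Camden is meant).

```python
from difflib import get_close_matches

# Hypothetical gazetteer of canonical place names (illustrative only).
GAZETTEER = ["Tajoura", "Camden", "Tokyo", "Vancouver"]


def normalise_place(name, gazetteer=GAZETTEER, cutoff=0.8):
    """Return the closest canonical name, or None if nothing is similar enough.

    The cutoff is an assumed similarity threshold in [0, 1].
    """
    matches = get_close_matches(name, gazetteer, n=1, cutoff=cutoff)
    return matches[0] if matches else None


for variant in ["Tajura", "Tajoora", "Paris"]:
    print(variant, "->", normalise_place(variant))
```

A real system would need language-aware transliteration rules and a much larger gazetteer; this only shows the shape of the lookup.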
BioCaster's semantic enrichment workflow




                                           MOSES (underway)
A snapshot of the BioCaster ontology




[1] Kawazoe, A., Chanlekha, H., Shigematsu, M. and Collier, N. (2008), “Structuring an event ontology for disease outbreak detection”, BMC Bioinformatics, 9 (Suppl 3):S8.
[2] Collier, N., Kawazoe, A., Jin, L., Shigematsu, M., Dien, D., Barrero, R., Takeuchi, K. and Kawtrakul, A. (2007), “A multilingual ontology for infectious disease surveillance: rationale, design and challenges”, Language Resources and Evaluation, Springer, DOI: 10.1007/s10579-007-9019-7.
Extant technology gaps

  – How can we understand 'norms' and detect their violations?
     • Time series analysis and summarization
  – How do we integrate event features?
     • Across languages
     • Across media types
     • Across ontologies/granularities
  – How do we rapidly adapt surveillance systems to new
    vocabulary/event types/domains
5 detection algorithms
  1. Early aberration reporting system (EARS) C2 algorithm
          –    captures the number of standard deviations by which the current count exceeds the history mean;
          –    $S_t = \max\left(0, \frac{C_t - (\mu_t + k\sigma_t)}{\sigma_t}\right)$
  2. EARS C3 algorithm
          –    similar to C2 except that C3 uses a weighted sum of the previous 3 days for the current period;
  3. W2 algorithm
          –    a modified version of C2 which ignores history counts on Saturdays and Sundays to compensate
               for day of week effects;
  4. F statistic
          –    compares the variance in the history window to the variance in the current window;
          –    $S_t = \sigma_t^2 + \sigma_b^2$
  5. Exponentially Weighted Moving Average (EWMA)
          –    gives less weight to days in the history that are further from the test day;
          –    $S_t = \frac{Y_t - \mu_t}{\sigma_t \sqrt{\lambda / (2 - \lambda)}}$, where $Y_1 = C_1$ and $Y_t = \lambda C_t + (1 - \lambda) Y_{t-1}$


  Model parameters were estimated based on an additional 5 epidemic data sets from
  ProMED-mail (data not shown)
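
To make the C2 and EWMA formulas above concrete, here is a minimal sketch that scores one test day against a history window. The 7-day baseline, 2-day guard band, k = 1 and λ = 0.4 are assumptions chosen for illustration, not the parameters estimated in the study, and the fallback for a zero standard deviation is likewise an assumption.

```python
from statistics import mean, pstdev


def _history(counts, t, baseline, guard):
    """Counts in the baseline window that ends `guard` days before day t."""
    start = t - guard - baseline
    if start < 0:
        raise ValueError("not enough history before day t")
    return counts[start:t - guard]


def c2_score(counts, t, baseline=7, guard=2, k=1.0):
    """S_t = max(0, (C_t - (mu_t + k*sigma_t)) / sigma_t)."""
    hist = _history(counts, t, baseline, guard)
    mu = mean(hist)
    sigma = pstdev(hist) or 1.0  # assumption: avoid division by zero
    return max(0.0, (counts[t] - (mu + k * sigma)) / sigma)


def ewma_series(counts, lam=0.4):
    """Y_1 = C_1 and Y_t = lam*C_t + (1 - lam)*Y_{t-1}."""
    y = [float(counts[0])]
    for c in counts[1:]:
        y.append(lam * c + (1 - lam) * y[-1])
    return y


def ewma_score(counts, t, baseline=7, guard=2, lam=0.4):
    """S_t = (Y_t - mu_t) / (sigma_t * sqrt(lam / (2 - lam)))."""
    hist = _history(counts, t, baseline, guard)
    mu = mean(hist)
    sigma = pstdev(hist) or 1.0  # assumption: avoid division by zero
    y_t = ewma_series(counts[:t + 1], lam)[-1]
    return (y_t - mu) / (sigma * (lam / (2 - lam)) ** 0.5)


# Hypothetical daily news counts: a quiet period followed by a surge on the last day.
daily = [2, 3, 2, 3, 2, 3, 2, 2, 3, 10]
print(c2_score(daily, 9), ewma_score(daily, 9))
```

An alarm would be raised when the score exceeds a chosen threshold; the threshold is not given on the slide, so it is left out here.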

[3] Burkom H. S. (2005), “Accessible Alerting Algorithms for Biosurveillance”. National Syndromic Surveillance Conference.
[4] Jackson M. L. et al. (2007), “A simulation study comparing aberration detection algorithms for syndromic surveillance”. BMC Medical Informatics and Decision Making, 7(6), DOI: 10.1186/1472-6947-7-6.
[5] Madoff L. (2004), “ProMED-mail: An early warning system for emerging diseases”. Clin Infect Dis, 39(2): 227–232.
Test Data
#    Disease             Country       ProMED-alerts
1    Hand, foot, mouth   PR China      9
2    Ebola               Congo         17
3    Yellow fever        Brazil        28
4    Influenza           USA           21
5    Cholera             Iraq          5
6    Chikungunya         Singapore     8
7    Anthrax             USA           15
8    Yellow fever        Argentina     5
9    Ebola Reston        Philippines   15
10   Influenza           Egypt         49
11   Plague              USA           8
12   Dengue              Brazil        27
13   Dengue              Indonesia     14
14   Measles             UK            13
15   Chikungunya         Malaysia      15
16   Yellow fever        Senegal       0
17   Influenza           Indonesia     35
18   Influenza           Bangladesh    3

    •   14 countries and 11 infectious disease types
    •   366 days of news data were collected from BioCaster for each disease and country
    •   The study period is 17th June 2008 to 17th June 2009
Evaluation of time series algorithms
                    C3                  C2                  W2                  F-statistic         EWMA
Sensitivity         0.74 (0.69-0.78)    0.66 (0.61-0.72)    0.66 (0.60-0.71)    0.78 (0.74-0.82)    0.73 (0.68-0.78)
Specificity         0.96 (0.95-0.96)    0.98 (0.98-0.98)    0.98 (0.98-0.99)    0.92 (0.91-0.92)    0.95 (0.94-0.96)
PPV                 0.55 (0.98-0.99)    0.64 (0.98-0.99)    0.65 (0.98-0.99)    0.46 (0.98-0.99)    0.47 (0.98-0.99)
NPV                 0.98 (0.98-0.99)    0.98 (0.98-0.99)    0.98 (0.98-0.99)    0.98 (0.98-0.98)    0.98 (0.98-0.99)
Alarms/100 days     6.48                4.52                4.17                12.34               7.85
F-measure           0.63                0.65                0.66                0.58                0.58

Results in parentheses show 95% confidence intervals

[6] Collier, N. (2009), “What’s unusual in online disease outbreak news?”, J. Biomed. Semantics, 1(2).
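
As a reading aid for the evaluation table, the following sketch shows how the reported measures are conventionally derived from alarm/outbreak day counts. The confusion-matrix counts are made up for illustration. It also assumes the F-measure is the harmonic mean of sensitivity and PPV; that assumption reproduces the C2 and C3 values from the table's rounded figures, but the slides do not state the definition.

```python
def rates(tp, fp, tn, fn):
    """Standard detection measures from true/false positive/negative day counts."""
    return {
        "sensitivity": tp / (tp + fn),  # alarms raised on true outbreak days
        "specificity": tn / (tn + fp),  # quiet days correctly left unflagged
        "ppv": tp / (tp + fp),          # share of alarms that were real
        "npv": tn / (tn + fn),          # share of non-alarms that were truly quiet
    }


def f_measure(sensitivity, ppv):
    """Harmonic mean of sensitivity (recall) and PPV (precision)."""
    return 2 * sensitivity * ppv / (sensitivity + ppv)


# Hypothetical counts, not taken from the study.
print(rates(tp=8, fp=2, tn=85, fn=5))

# Consistency check against the table's rounded sensitivity and PPV.
print(round(f_measure(0.66, 0.64), 2))  # C2
print(round(f_measure(0.74, 0.55), 2))  # C3
```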
Time from outbreak news to outbreak detection
 Outbreak characteristics: Early surge vs multi-modal
 transmission
 News event frequency over time




Testing data sets for a range of diseases used in Collier, N. (2010), “Towards cross-lingual alerting for bursty epidemic events”, J. Biomed. Semantics, 2 (Suppl 5):S10.

Best performance using EARS C3 algorithm on multilingual news event counts: 4 days earlier than ProMED with an F-measure of 0.56 and 12.0 alarms/100 days.

Source: BioCaster
The landscape of Web sensing for public health
                                       GPHIN (Ginsberg et al. 2009)              EpiSpider (Tolentino et al. 2007)
                                       MiTaP (Damianos et al. 2002)              BioCaster (Collier et al. 2008)
                                       Argus (Wilson et al .2008)                Medisys (Yangarber et al. 2007)
                                       HealthMap (Friefeld et al. 2008)          ProMed-mail (Madoff 2004)

[Figure: types of online signals (Newswire, Radio, SMS/microblog, Lifestream, Discuss, Livecast, Social networks, Query, Share) with example systems: MiTaP (?) (Damianos et al. 2002), Ushahidi (Okolloh et al. 2009), Twitter Earthquake Detector (Guy et al. 2010), HealthMap (Friefeld et al. 2008), BioCaster (Collier et al. 2008), Google Flu Trends (Ginsberg et al. 2009)]
Classification scheme

• Disease spread can be strongly influenced by behavioural changes [7]
• After surveying Twitter messages we conflated Jones and Salathé's
  groupings into three plus two new categories:
        – (A) Avoiding behaviour
                • Avoid people who cough/sneeze, Avoid large gatherings of people, Avoid
                  public transportation, Avoid travel to infected areas
        – (I) Increased sanitation
                • Wash hands more often, use disinfectant
        – (W) Wearing a mask
        – (P) Pharmaceutical intervention
                • Seeking clinical advice or using medicines or vaccines to prevent disease
        – (S) Self reported diagnosis
                • User reports that they have the flu
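
The five categories above can be pictured as labels attached to individual messages. The sketch below is a toy keyword matcher only: the cue phrases and the `categorise` helper are hypothetical and are not the classifier used in the study, which the slides do not describe.

```python
# Hypothetical cue phrases per category (illustrative only, not from the study).
CUES = {
    "A": ["avoid", "staying away from", "cancelled my trip"],
    "I": ["wash my hands", "washing my hands", "hand sanitizer", "disinfectant"],
    "W": ["wearing a mask", "face mask"],
    "P": ["vaccine", "flu shot", "tamiflu", "saw the doctor"],
    "S": ["i have the flu", "i've got the flu", "i got the flu"],
}


def categorise(message):
    """Return the set of category codes whose cue phrases occur in the message."""
    text = message.lower()
    return {cat for cat, cues in CUES.items() if any(cue in text for cue in cues)}


print(categorise("Ugh, I have the flu. Got Tamiflu today"))
print(categorise("Wearing a mask and washing my hands a lot"))
```

Substring matching like this will misfire on negation and sarcasm, which is one reason a trained classifier with annotated data is normally preferred.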

[7] Jones, J., Salathé, M. (2009), “Early assessment of anxiety and behavioral response to novel swine-origin influenza A(H1N1)”, PLoS One, 4(12):e8032.
[8] Collier, N. (2009), “OMG U got flu? Analysis of shared health messages for bio-surveillance”, in Proc. 4th Symposium on Semantic Mining in Biomedicine (SMBM’10).
Anxiety indicators have moderately strong correlation
with CDC A(H1N1) lab data 2009-2010
[Chart: weekly series for CDC, A, S, I, P, A+I+P and A+I+P+S over weeks 46-52 and 1-5; left axis 0-3000, right axis 0-450]

Category   Spearman’s Rho   P-value
A          0.66             0.020
S          0.66             0.021
I          0.58             0.048
P          0.67             0.017
A+I+P      0.68             0.008
A+I+P+S    0.67             0.017
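
For readers unfamiliar with the statistic in the table, Spearman's rho is the Pearson correlation of the two series after each is replaced by its ranks. The sketch below computes it in plain Python with tied values given their average rank; the weekly numbers are invented for illustration and are not the Twitter or CDC data.

```python
def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman_rho(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical weekly counts: tweets in one category vs. lab-confirmed cases.
tweets = [120, 150, 90, 200, 170, 60]
lab = [900, 1400, 1100, 2500, 2100, 500]
print(round(spearman_rho(tweets, lab), 2))
```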
DIZIE: Text mining from personal health reports
on Twitter




    Syndromic surveillance for gastrointestinal, respiratory, neurological, dermatological, haemorrhagic,
    musculoskeletal from Tweets in 40 world cities.
Significance and connections
•   PH analysis is a highly skilled human task made easier by text mining from
    open sources

•   Value in transparent evaluation of core technologies using gold standards
     –   Good understanding now of intrinsic components
     –   More extrinsic evaluations needed to broaden uptake among PH community
     –   Community discussion needed on utility of evaluation strategies.


•   Power of integrating sources needs to be explored




                                                  Heat map showing lowest ranked countries by
                                                  number of reports per '000 population gathered by
                                                  BioCaster
Special thanks

•   Funding
     – Japan Science and Technology Agency's SAKIGAKE fund
     – JSPS Young Researcher type A fund
•   Postdoctoral students:
     – Son Doan, PhD., Mike Conway, PhD. (now at UCSD), Reiko Goodwin, PhD.
       (Fordham U.), Ai Kawazoe, PhD. (now at Tsuda U.)
•   Ph.D. students
     – John McCrae, PhD. (now at Bielefeld U.), Hutchatai Chanlekha, PhD. (now at
       Kasetsart U.)
•   Intern students
     –   Wita Ratsameetip (Chulalongkorn University, Thailand), Nguyen Trurong Son (Vietnam National University, Ho Chi
         Minh City, Vietnam), Nguyen Thi Ngoc Mai (Vietnam National University, Ho Chi Minh City, Vietnam), Aurelie
         Chabord (ENSIMAG-Grenoble INP, France), Therawat Tooumnauy (Kasetsart University, Thailand), Nam Xuan Cao
         (Vietnam National University, Ho Chi Minh City, Vietnam), Hoang Cong Duy Vu (Vietnam National University, Ho Chi
         Minh City, Vietnam), Nghiem Quoc Minh (Vietnam National University, Ho Chi Minh City, Vietnam), Van Chi Nam
         (Vietnam National University, Ho Chi Minh City, Vietnam), Nguyen Thi Hong Nhung (Vietnam National University, Ho
         Chi Minh City, Vietnam), Pham Thao Thi Xuan (Vietnam National University, Ho Chi Minh City, Vietnam), Ngo Quoc
         Hung (Vietnam National University, Ho Chi Minh City, Vietnam), Tran Tri Quoc (Vietnam National University, Ho Chi
         Minh City, Vietnam)

Más contenido relacionado

Destacado

Student Work - Diabetes
Student Work - DiabetesStudent Work - Diabetes
Student Work - Diabetes
jeremyschriner
 
Hybrid Technique for Associative Classification of Heart Diseases
Hybrid Technique for Associative Classification of Heart DiseasesHybrid Technique for Associative Classification of Heart Diseases
Hybrid Technique for Associative Classification of Heart Diseases
Jagdeep Singh Malhi
 
A Novel Approach for Breast Cancer Detection using Data Mining Techniques
A Novel Approach for Breast Cancer Detection using Data Mining TechniquesA Novel Approach for Breast Cancer Detection using Data Mining Techniques
A Novel Approach for Breast Cancer Detection using Data Mining Techniques
ahmad abdelhafeez
 
Predicting Hospital Readmission Using TreeNet
Predicting Hospital Readmission Using TreeNetPredicting Hospital Readmission Using TreeNet
Predicting Hospital Readmission Using TreeNet
Salford Systems
 
1.PPT (1.PREDICTION OF DISEASES New)
1.PPT (1.PREDICTION OF DISEASES New)1.PPT (1.PREDICTION OF DISEASES New)
1.PPT (1.PREDICTION OF DISEASES New)
Jashvant Shah
 

Destacado (14)

Student Work - Diabetes
Student Work - DiabetesStudent Work - Diabetes
Student Work - Diabetes
 
Medical Data Mining
Medical Data MiningMedical Data Mining
Medical Data Mining
 
4part1
4part14part1
4part1
 
Hybrid Technique for Associative Classification of Heart Diseases
Hybrid Technique for Associative Classification of Heart DiseasesHybrid Technique for Associative Classification of Heart Diseases
Hybrid Technique for Associative Classification of Heart Diseases
 
A Novel Approach for Breast Cancer Detection using Data Mining Techniques
A Novel Approach for Breast Cancer Detection using Data Mining TechniquesA Novel Approach for Breast Cancer Detection using Data Mining Techniques
A Novel Approach for Breast Cancer Detection using Data Mining Techniques
 
Support vector machine parameters tuning using grey wolf optimization
Support vector machine parameters tuning using grey wolf optimizationSupport vector machine parameters tuning using grey wolf optimization
Support vector machine parameters tuning using grey wolf optimization
 
Machine Learning - Breast Cancer Diagnosis
Machine Learning - Breast Cancer DiagnosisMachine Learning - Breast Cancer Diagnosis
Machine Learning - Breast Cancer Diagnosis
 
Data Mining Techniques In Computer Aided Cancer Diagnosis
Data Mining Techniques In Computer Aided Cancer DiagnosisData Mining Techniques In Computer Aided Cancer Diagnosis
Data Mining Techniques In Computer Aided Cancer Diagnosis
 
Predicting Hospital Readmission Using TreeNet
Predicting Hospital Readmission Using TreeNetPredicting Hospital Readmission Using TreeNet
Predicting Hospital Readmission Using TreeNet
 
1.PPT (1.PREDICTION OF DISEASES New)
1.PPT (1.PREDICTION OF DISEASES New)1.PPT (1.PREDICTION OF DISEASES New)
1.PPT (1.PREDICTION OF DISEASES New)
 
a novel approach for breast cancer detection using data mining tool weka
a novel approach for breast cancer detection using data mining tool wekaa novel approach for breast cancer detection using data mining tool weka
a novel approach for breast cancer detection using data mining tool weka
 
Data mining ppt
Data mining pptData mining ppt
Data mining ppt
 
Breast Cancer Diagnosis using a Hybrid Genetic Algorithm for Feature Selectio...
Breast Cancer Diagnosis using a Hybrid Genetic Algorithm for Feature Selectio...Breast Cancer Diagnosis using a Hybrid Genetic Algorithm for Feature Selectio...
Breast Cancer Diagnosis using a Hybrid Genetic Algorithm for Feature Selectio...
 
Data mining for diabetes readmission
Data mining for diabetes readmissionData mining for diabetes readmission
Data mining for diabetes readmission
 

Último

Call Girl In Indore 📞9235973566📞 Just📲 Call Inaaya Indore Call Girls Service ...
Call Girl In Indore 📞9235973566📞 Just📲 Call Inaaya Indore Call Girls Service ...Call Girl In Indore 📞9235973566📞 Just📲 Call Inaaya Indore Call Girls Service ...
Call Girl In Indore 📞9235973566📞 Just📲 Call Inaaya Indore Call Girls Service ...
Sheetaleventcompany
 
💚Call Girls In Amritsar 💯Anvi 📲🔝8725944379🔝Amritsar Call Girl No💰Advance Cash...
💚Call Girls In Amritsar 💯Anvi 📲🔝8725944379🔝Amritsar Call Girl No💰Advance Cash...💚Call Girls In Amritsar 💯Anvi 📲🔝8725944379🔝Amritsar Call Girl No💰Advance Cash...
💚Call Girls In Amritsar 💯Anvi 📲🔝8725944379🔝Amritsar Call Girl No💰Advance Cash...
Sheetaleventcompany
 
Difference Between Skeletal Smooth and Cardiac Muscles
Difference Between Skeletal Smooth and Cardiac MusclesDifference Between Skeletal Smooth and Cardiac Muscles
Difference Between Skeletal Smooth and Cardiac Muscles
MedicoseAcademics
 
💚Chandigarh Call Girls Service 💯Piya 📲🔝8868886958🔝Call Girls In Chandigarh No...
💚Chandigarh Call Girls Service 💯Piya 📲🔝8868886958🔝Call Girls In Chandigarh No...💚Chandigarh Call Girls Service 💯Piya 📲🔝8868886958🔝Call Girls In Chandigarh No...
💚Chandigarh Call Girls Service 💯Piya 📲🔝8868886958🔝Call Girls In Chandigarh No...
Sheetaleventcompany
 
👉 Chennai Sexy Aunty’s WhatsApp Number 👉📞 7427069034 👉📞 Just📲 Call Ruhi Colle...
👉 Chennai Sexy Aunty’s WhatsApp Number 👉📞 7427069034 👉📞 Just📲 Call Ruhi Colle...👉 Chennai Sexy Aunty’s WhatsApp Number 👉📞 7427069034 👉📞 Just📲 Call Ruhi Colle...
👉 Chennai Sexy Aunty’s WhatsApp Number 👉📞 7427069034 👉📞 Just📲 Call Ruhi Colle...
rajnisinghkjn
 
Premium Call Girls Nagpur {9xx000xx09} ❤️VVIP POOJA Call Girls in Nagpur Maha...
Premium Call Girls Nagpur {9xx000xx09} ❤️VVIP POOJA Call Girls in Nagpur Maha...Premium Call Girls Nagpur {9xx000xx09} ❤️VVIP POOJA Call Girls in Nagpur Maha...
Premium Call Girls Nagpur {9xx000xx09} ❤️VVIP POOJA Call Girls in Nagpur Maha...
Sheetaleventcompany
 
Goa Call Girl Service 📞9xx000xx09📞Just Call Divya📲 Call Girl In Goa No💰Advanc...
Goa Call Girl Service 📞9xx000xx09📞Just Call Divya📲 Call Girl In Goa No💰Advanc...Goa Call Girl Service 📞9xx000xx09📞Just Call Divya📲 Call Girl In Goa No💰Advanc...
Goa Call Girl Service 📞9xx000xx09📞Just Call Divya📲 Call Girl In Goa No💰Advanc...
Sheetaleventcompany
 
Call Girl in Chennai | Whatsapp No 📞 7427069034 📞 VIP Escorts Service Availab...
Call Girl in Chennai | Whatsapp No 📞 7427069034 📞 VIP Escorts Service Availab...Call Girl in Chennai | Whatsapp No 📞 7427069034 📞 VIP Escorts Service Availab...
Call Girl in Chennai | Whatsapp No 📞 7427069034 📞 VIP Escorts Service Availab...
amritaverma53
 
Premium Call Girls Dehradun {8854095900} ❤️VVIP ANJU Call Girls in Dehradun U...
Premium Call Girls Dehradun {8854095900} ❤️VVIP ANJU Call Girls in Dehradun U...Premium Call Girls Dehradun {8854095900} ❤️VVIP ANJU Call Girls in Dehradun U...
Premium Call Girls Dehradun {8854095900} ❤️VVIP ANJU Call Girls in Dehradun U...
Sheetaleventcompany
 

Último (20)

Call Girls Rishikesh Just Call 9667172968 Top Class Call Girl Service Available
Call Girls Rishikesh Just Call 9667172968 Top Class Call Girl Service AvailableCall Girls Rishikesh Just Call 9667172968 Top Class Call Girl Service Available
Call Girls Rishikesh Just Call 9667172968 Top Class Call Girl Service Available
 
Call Girl In Indore 📞9235973566📞 Just📲 Call Inaaya Indore Call Girls Service ...
Call Girl In Indore 📞9235973566📞 Just📲 Call Inaaya Indore Call Girls Service ...Call Girl In Indore 📞9235973566📞 Just📲 Call Inaaya Indore Call Girls Service ...
Call Girl In Indore 📞9235973566📞 Just📲 Call Inaaya Indore Call Girls Service ...
 
💚Call Girls In Amritsar 💯Anvi 📲🔝8725944379🔝Amritsar Call Girl No💰Advance Cash...
💚Call Girls In Amritsar 💯Anvi 📲🔝8725944379🔝Amritsar Call Girl No💰Advance Cash...💚Call Girls In Amritsar 💯Anvi 📲🔝8725944379🔝Amritsar Call Girl No💰Advance Cash...
💚Call Girls In Amritsar 💯Anvi 📲🔝8725944379🔝Amritsar Call Girl No💰Advance Cash...
 
Difference Between Skeletal Smooth and Cardiac Muscles
Difference Between Skeletal Smooth and Cardiac MusclesDifference Between Skeletal Smooth and Cardiac Muscles
Difference Between Skeletal Smooth and Cardiac Muscles
 
Cardiac Output, Venous Return, and Their Regulation
Cardiac Output, Venous Return, and Their RegulationCardiac Output, Venous Return, and Their Regulation
Cardiac Output, Venous Return, and Their Regulation
 
❤️Chandigarh Escorts Service☎️9814379184☎️ Call Girl service in Chandigarh☎️ ...
❤️Chandigarh Escorts Service☎️9814379184☎️ Call Girl service in Chandigarh☎️ ...❤️Chandigarh Escorts Service☎️9814379184☎️ Call Girl service in Chandigarh☎️ ...
❤️Chandigarh Escorts Service☎️9814379184☎️ Call Girl service in Chandigarh☎️ ...
 
Cheap Rate Call Girls Bangalore {9179660964} ❤️VVIP BEBO Call Girls in Bangal...
Cheap Rate Call Girls Bangalore {9179660964} ❤️VVIP BEBO Call Girls in Bangal...Cheap Rate Call Girls Bangalore {9179660964} ❤️VVIP BEBO Call Girls in Bangal...
Cheap Rate Call Girls Bangalore {9179660964} ❤️VVIP BEBO Call Girls in Bangal...
 
Race Course Road } Book Call Girls in Bangalore | Whatsapp No 6378878445 VIP ...
Race Course Road } Book Call Girls in Bangalore | Whatsapp No 6378878445 VIP ...Race Course Road } Book Call Girls in Bangalore | Whatsapp No 6378878445 VIP ...
Race Course Road } Book Call Girls in Bangalore | Whatsapp No 6378878445 VIP ...
 
ANATOMY AND PHYSIOLOGY OF REPRODUCTIVE SYSTEM.pptx
ANATOMY AND PHYSIOLOGY OF REPRODUCTIVE SYSTEM.pptxANATOMY AND PHYSIOLOGY OF REPRODUCTIVE SYSTEM.pptx
ANATOMY AND PHYSIOLOGY OF REPRODUCTIVE SYSTEM.pptx
 
💚Chandigarh Call Girls Service 💯Piya 📲🔝8868886958🔝Call Girls In Chandigarh No...

Text mining in action: early detection of disease outbreaks from online media

  • 1. Text mining in action: early detection of disease outbreaks from online media Nigel Collier Associate Professor National Institute of Informatics, Tokyo and Japan Science and Technology Agency SAKIGAKE program collier@nii.ac.jp http://sites.google.com/site/nhcollier/ PI of “BioCaster” project (JST, Sakigake grant-in-aid) AAAS Annual Meeting, Vancouver, Saturday 19th February 2012 (13:30-16:30)
  • 2. >>> From fiction to fact. Image credits: Contagion, © Warner Bros, 2011; World Health Organization, Timeline of Influenza A(H1N1), 2009, © WHO
  • 3. Figure (axes: Time, Certainty): rumours, sentinel networks, GP reports, field workers, laboratory reports. Blog rumours: “Ahh! Really bad throat.”; “Still getting worse. Staying at home temp is up to 39.5.”; “I’m sick with a chest infection”. News reports: “Influenza starts early this year.”; “Mystery illness causes concern.”
  • 4. Alerting real world events: 1. Biological reality; 2. News media response; 3. Detect event signals; 4. Select anomalous events; 5. PH Partner validate, analyse and communicate. Example chart: Cholera, 2007, Iraq (news volume and alert level over time).
  • 5. http://born.nii.ac.jp Features: ontology browsing; trend graphs; email/GeoRSS alerting; Watchboard, etc.; event database search; up to date news in 12 languages; event summaries; event alerts; real time Twitter analysis. Partners: WHO, GHSAG, US, IT, UK, JP, FR, CA, DE.
  • 6. Technical challenges: REAL TIME SCALING. X0,000 news providers; 30,000-40,000 news items/day; 900 on topic/day; 200 events/day; 4 alerts/day
  • 7. Technical challenges: REAL TIME SCALING (X0,000 news providers); MULTILINGUALITY. “Avian Flu” across languages: 鳥インフルエンザ, Influenza aviaire, Cúm gia cầm, 조류인플루엔자. Chart: Percentage of News by Language (English, Chinese, German, Russian, Korean, French, Vietnamese, Portuguese, Other). Chart: News event counts for porcine foot-and-mouth outbreak in South Korea 2010-2011. Increased sensitivity and timeliness from multilingual news
  • 8. Technical challenges: REAL TIME SCALING (X0,000 news providers); MULTILINGUALITY; AMBIGUITY. Temporal identification: “The Spanish flu outbreak…”. Entity identification: “Obama fever builds as Americans await a new era”. Toponym grounding: “Equine influenza in Camden”: Camden (UK), Camden (AU), Camden (CA) + 19 others. Variant transliterations: Tajoura, Tajura, Tajoora… Coreference: “Two British holidaymakers fell ill… ” and “Two male pensioners died…”: 2 or 4 victims?
  • 9. BioCaster’s semantic enrichment workflow MOSES (underway)
  • 10. A snapshot of the BioCaster ontology [1] Kawazoe, A., Chanlekha, H., Shigematsu, M. and Collier, N. (2008), “Structuring an event ontology for disease outbreak detection”, in BMC Bioinformatics, 9 (Suppl 3):S8. [2] Collier, N., Kawazoe, A., Jin, L., Shigematsu, M., Dien, D., Barrero, R., Takeuchi, K. and Kawtrakul, A. (2007), “A multilingual ontology for infectious disease surveillance: rationale, design and challenges”, Language Resources and Evaluation, Elsevier, DOI: 10.1007/s10579-007-9019-7.
  • 11. Extant technology gaps – How can we understand ‘norms’ and detect their violations? • Time series analysis and summarization – How do we integrate event features? • Across languages • Across media types • Across ontologies/granularities – How do we rapidly adapt surveillance systems to new vocabulary/event types/domains?
  • 12.
  • 13. 5 detection algorithms
      1. Early Aberration Reporting System (EARS) C2 algorithm
         – captures the number of standard deviations by which the current count exceeds the history mean;
         – $S_t = \max(0, (C_t - (\mu_t + k\sigma_t))/\sigma_t)$
      2. EARS C3 algorithm
         – similar to C2 except that C3 uses a weighted sum of the previous 3 days for the current period;
      3. W2 algorithm
         – a modified version of C2 which ignores history counts on Saturdays and Sundays to compensate for day of week effects;
      4. F statistic
         – compares the variance in the history window to the variance in the current window;
         – $S_t = \sigma_t^2 / \sigma_b^2$
      5. Exponential Weighted Moving Average (EWMA)
         – gives less weight to days in the history that are further from the test day.
         – $S_t = (Y_t - \mu_t) / [\sigma_t (\lambda/(2-\lambda))^{1/2}]$, where $Y_1 = C_1$ and $Y_t = \lambda C_t + (1-\lambda) Y_{t-1}$
      Model parameters were estimated based on an additional 5 epidemic data sets from ProMED-mail (data not shown)
      [3] Burkom H. S. (2005), “Accessible Alerting Algorithms for Biosurveillance”. National Syndromic Surveillance Conference.
      [4] Jackson M. L. et al. (2007), “A simulation study comparing aberration detection algorithms for syndromic surveillance”, BMC Medical Informatics and Decision Making, 7(6), DOI: 10.1186/1472-6947-7-6.
      [5] Madoff L. (2004), “ProMED-mail: An early warning system for emerging diseases”. Clin Infect Dis, 39(2): 227–232.
  • 14. Test Data
      #   Disease            Country      ProMED-alerts
      1   Hand, foot, mouth  PR China      9
      2   Ebola              Congo        17
      3   Yellow fever       Brazil       28
      4   Influenza          USA          21
      5   Cholera            Iraq          5
      6   Chikungunya        Singapore     8
      7   Anthrax            USA          15
      8   Yellow fever       Argentina     5
      9   Ebola Reston       Philippines  15
      10  Influenza          Egypt        49
      11  Plague             USA           8
      12  Dengue             Brazil       27
      13  Dengue             Indonesia    14
      14  Measles            UK           13
      15  Chikungunya        Malaysia     15
      16  Yellow fever       Senegal       0
      17  Influenza          Indonesia    35
      18  Influenza          Bangladesh    3
      • 14 countries and 11 infectious disease types
      • 366 days of news data were collected from BioCaster for each disease and country
      • The study period is 17th June 2008 to 17th June 2009
  • 15. Evaluation of time series algorithms
                       C3                C2                W2                F-statistic       EWMA
      Sensitivity      0.74 (0.69-0.78)  0.66 (0.61-0.72)  0.66 (0.60-0.71)  0.78 (0.74-0.82)  0.73 (0.68-0.78)
      Specificity      0.96 (0.95-0.96)  0.98 (0.98-0.98)  0.98 (0.98-0.99)  0.92 (0.91-0.92)  0.95 (0.94-0.96)
      PPV              0.55 (0.98-0.99)  0.64 (0.98-0.99)  0.65 (0.98-0.99)  0.46 (0.98-0.99)  0.47 (0.98-0.99)
      NPV              0.98 (0.98-0.99)  0.98 (0.98-0.99)  0.98 (0.98-0.99)  0.98 (0.98-0.98)  0.98 (0.98-0.99)
      Alarms/100 days  6.48              4.52              4.17              12.34             7.85
      F-measure        0.63              0.65              0.66              0.58              0.58
      Results in parentheses show 95% confidence intervals
      [6] Collier, N. (2009), “What’s unusual in online disease outbreak news?”, in BMC Biomedical Semantics, 1(2).
  • 16. Time from outbreak news to outbreak detection. Outbreak characteristics: early surge vs multi-modal transmission. Chart: News event frequency over time (Source: BioCaster). Testing data sets for a range of diseases used in Collier, N. (2010), “Towards cross-lingual alerting for bursty epidemic events”, J. Biomed. Semantics, 2 (Suppl 5):S10. Best performance using EARS C3 algorithm on multilingual news event counts: 4 days earlier than ProMED with an F-measure of 0.56 and 12.0 alarms/100 days.
  • 17. The landscape of Web sensing for public health. Figure: systems placed by media type. Systems: GPHIN (Ginsberg et al. 2009); EpiSpider (Tolentino et al. 2007); MiTaP (Damianos et al. 2002); BioCaster (Collier et al. 2008); Argus (Wilson et al. 2008); Medisys (Yangarber et al. 2007); HealthMap (Freifeld et al. 2008); ProMed-mail (Madoff 2004); Ushahidi (Okolloh et al. 2009); Twitter Earthquake Detector (Guy et al. 2010); Google Flu Trends (Ginsberg et al. 2009). Media labels: Newswire, Radio, SMS/microblog, Query, Online Signals, Social networks, Lifestream, Share, Discuss, Livecast.
  • 18. Classification scheme
      • Disease spread can be strongly influenced by behavioural changes [7]
      • After surveying Twitter messages we conflated Jones and Salathe’s groupings into three plus two new categories:
         – (A) Avoiding behaviour: avoid people who cough/sneeze, avoid large gatherings of people, avoid public transportation, avoid travel to infected areas
         – (I) Increased sanitation: wash hands more often, use disinfectant
         – (W) Wearing a mask
         – (P) Pharmaceutical intervention: seeking clinical advice or using medicines or vaccines to prevent disease
         – (S) Self reported diagnosis: user reports that they have the flu
      [7] Jones, J., Salathe, M. (2009), “Early assessment of anxiety and behavioral response to novel swine-origin influenza A(H1N1)”, PLoS One, 4(12):e8032.
      [8] Collier, N. (2009), “OMG U got flu? Analysis of shared health messages for bio-surveillance”, in Proc. 4th Symposium on Semantic Mining in Biomedicine (SMBM’10).
  • 19. Anxiety indicators have moderately strong correlation with CDC A(H1N1) lab data 2009-2010
      Category   Spearman’s Rho   P-value
      A          0.66             0.020
      S          0.66             0.021
      I          0.58             0.048
      P          0.67             0.017
      A+I+P      0.68             0.008
      A+I+P+S    0.67             0.017
      Chart: series for CDC, A, S, I, P, A+I+P and A+I+P+S (x-axis ticks 46-52 and 1-5; two y-axes, 0-3000 and 0-450)
  • 20. DIZIE: Text mining from personal health reports on Twitter. Syndromic surveillance for gastrointestinal, respiratory, neurological, dermatological, haemorrhagic and musculoskeletal syndromes from Tweets in 40 world cities.
  • 21. Significance and connections
      • PH analysis is a highly skilled human task made easier by text mining from open sources
      • Value in transparent evaluation of core technologies using gold standards
         – Good understanding now of intrinsic components
         – More extrinsic evaluations needed to broaden uptake among PH community
         – Community discussion needed on utility of evaluation strategies.
      • Power of integrating sources needs to be explored
      Figure: Heat map showing lowest ranked countries by number of reports per ’000 population gathered by BioCaster
  • 22. Special thanks
      • Funding
         – Japan Science and Technology Agency’s SAKIGAKE fund
         – JSPS Young Researcher type A fund
      • Postdoctoral students:
         – Son Doan, PhD., Mike Conway, PhD. (now at UCSD), Reiko Goodwin, PhD. (Fordham U.), Ai Kawazoe, PhD. (now at Tsuda U.)
      • Ph.D. students
         – John McCrae, PhD. (now at Bielefeld U.), Hutchatai Chanlekha, PhD. (now at Kasetsart U.)
      • Intern students
         – Wita Ratsameetip (Chulalongkorn University, Thailand), Nguyen Trurong Son (Vietnam National University, Ho Chi Minh City, Vietnam), Nguyen Thi Ngoc Mai (Vietnam National University, Ho Chi Minh City, Vietnam), Aurelie Chabord (ENSIMAG-Grenoble INP, France), Therawat Tooumnauy (Kasetsart University, Thailand), Nam Xuan Cao (Vietnam National University, Ho Chi Minh City, Vietnam), Hoang Cong Duy Vu (Vietnam National University, Ho Chi Minh City, Vietnam), Nghiem Quoc Minh (Vietnam National University, Ho Chi Minh City, Vietnam), Van Chi Nam (Vietnam National University, Ho Chi Minh City, Vietnam), Nguyen Thi Hong Nhung (Vietnam National University, Ho Chi Minh City, Vietnam), Pham Thao Thi Xuan (Vietnam National University, Ho Chi Minh City, Vietnam), Ngo Quoc Hung (Vietnam National University, Ho Chi Minh City, Vietnam), Tran Tri Quoc (Vietnam National University, Ho Chi Minh City, Vietnam)
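
As a rough illustration of the EARS C2 and EWMA statistics listed on slide 13, the following sketch computes both scores for one test day. It is not the BioCaster implementation. The 7-day history window, the 2-day gap between history and test day, k = 1, λ = 0.4, the `min_sd` floor and the daily counts are all assumptions chosen for the example; the slides do not state these values.

```python
from statistics import mean, stdev


def baseline(counts, t, window=7, guard=2):
    """Mean and sample SD of the history window ending `guard` days before day t.

    The window and guard lengths are assumptions, not values from the slides.
    """
    start = t - guard - window
    if start < 0:
        raise ValueError("not enough history before day t")
    history = counts[start:t - guard]
    return mean(history), stdev(history)


def c2_score(counts, t, k=1.0, window=7, guard=2, min_sd=0.5):
    """S_t = max(0, (C_t - (mu_t + k*sigma_t)) / sigma_t)."""
    mu, sd = baseline(counts, t, window, guard)
    sd = max(sd, min_sd)  # assumed floor so a flat history cannot divide by zero
    return max(0.0, (counts[t] - (mu + k * sd)) / sd)


def ewma_series(counts, lam=0.4):
    """Y_1 = C_1 and Y_t = lam*C_t + (1 - lam)*Y_{t-1}."""
    smoothed = [float(counts[0])]
    for c in counts[1:]:
        smoothed.append(lam * c + (1 - lam) * smoothed[-1])
    return smoothed


def ewma_score(counts, t, lam=0.4, window=7, guard=2, min_sd=0.5):
    """S_t = (Y_t - mu_t) / (sigma_t * sqrt(lam / (2 - lam)))."""
    mu, sd = baseline(counts, t, window, guard)
    sd = max(sd, min_sd)  # same assumed floor as in c2_score
    y_t = ewma_series(counts[: t + 1], lam)[-1]
    return (y_t - mu) / (sd * (lam / (2 - lam)) ** 0.5)


# Invented daily news-event counts: a quiet period followed by a spike on day 9.
daily_counts = [2, 3, 2, 4, 3, 2, 3, 3, 2, 10]
print("C2 score on day 9:", round(c2_score(daily_counts, t=9), 2))
print("EWMA score on day 9:", round(ewma_score(daily_counts, t=9), 2))
```

An alert would be raised when a score exceeds a chosen threshold; the threshold itself is a tuning parameter that the slides do not give.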
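
The evaluation measures on slide 15 can be sketched from day-level counts of true and false alarms. The definitions below are the standard ones; the reported F-measures appear consistent with the harmonic mean of PPV and sensitivity (up to rounding), but that reading is an assumption, since the slide does not define the measure. The confusion counts in the example are invented.

```python
def detection_metrics(tp, fp, fn, tn):
    """Standard measures from day-level alarm counts.

    tp: alarm days that match a gold-standard alert
    fp: alarm days with no gold-standard alert
    fn: gold-standard alert days with no alarm
    tn: quiet days with no alarm
    """
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
    }


def f_measure(ppv, sensitivity):
    """Harmonic mean of PPV (precision) and sensitivity (recall)."""
    return 2 * ppv * sensitivity / (ppv + sensitivity)


# Invented counts for a 100-day series, for illustration only.
metrics = detection_metrics(tp=8, fp=2, fn=2, tn=88)
print(metrics)
print("F-measure:", round(f_measure(metrics["ppv"], metrics["sensitivity"]), 2))

# Cross-check against the C3 column on slide 15 (PPV 0.55, sensitivity 0.74).
print("C3 F-measure from slide values:", round(f_measure(0.55, 0.74), 2))
```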
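
Slide 19 reports Spearman's rho between weekly tweet-category counts and CDC laboratory data. A minimal sketch of that statistic, using only the standard library, is shown below: it ranks both series (averaging tied ranks) and takes the Pearson correlation of the ranks. The two weekly series are invented for illustration and are not the study data, and the p-values on the slide are not reproduced here.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # average of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation of the rank-transformed series."""
    return pearson(average_ranks(x), average_ranks(y))


# Invented weekly counts (hypothetical, 12 weeks) for illustration only.
tweets_per_week = [120, 150, 170, 160, 140, 110, 90, 95, 80, 70, 75, 60]
lab_positives = [2100, 2500, 2400, 2300, 1800, 1500, 1200, 1000, 900, 950, 700, 600]
print("Spearman's rho:", round(spearman_rho(tweets_per_week, lab_positives), 2))
```

Where SciPy is available, `scipy.stats.spearmanr` returns both the coefficient and a p-value.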