Enhancing Twitter Data Analysis with Simple Semantic Filtering: Example in Tracking Influenza-Like Illnesses
Son Doan (1), Lucila Ohno-Machado (1), Nigel Collier (2)
(1) Division of Biomedical Informatics, University of California San Diego
(2) National Institute of Informatics, Japan

IEEE HISB 2012
UCSD, La Jolla, CA, Sep 27-28, 2012
[Figure: information sources for outbreak detection plotted against time and certainty: rumors, sentinel networks, PCP reports, field workers, laboratory reports]
Example rumors shown in the figure:
• Twitter: “Ahh! Really bad throat.”
• Twitter: “Still getting worse. Staying at home temp is up to 39.5.”
• Twitter: “I’m sick with a chest infection”
• News report: “Influenza starts early this year.”
• News report: “Mystery illness causes concern.”
Social media in event tracking
• Event tracking/predicting:
   – Predicting elections and gasoline prices: O’Connor et al. (2010)
   – Predicting the stock market: Bollen et al. (2011)
   – Earthquake warning: Sakaki et al. (2010), Guy et al. (2010)
   – Public mood tracking: Golder and Macy (2011), Doan and Collier (2011)

• Predicting the Influenza-Like Illness rate:
   – Google Flu Trends: Ginsberg et al. (2009), Valdivia et al. (2010), now extended to dengue tracking (Chan et al. (2012)) → used query logs, but the query data is closed
   – Culotta (2009), Lampos and Cristianini (2010), Signorini et al. (2011), Chew and Eysenbach (2011), Doan et al. (2012) → used Twitter
Twitter characteristics

• Twitter posts (tweets) are limited to 140 characters
    – High use of abbreviations and aliases
    – Dynamic lexicon of semantic tags (hashtags)

• Very high volume of data: 430 million tweets generated per day
• High number of users: over 500 million active users
• Metadata: geo-tagging, time stamping, user profile
• Event reports sometimes ahead of newswire, e.g. Iranian presidential protests, swine flu outbreak reports from CDC, deaths of famous people (Petrovic et al. 2010)
Twitter corpus
Timeline: 36 weeks for the US 2009 influenza season (Aug 30, 2009 to May 8, 2010), ‘Gardenhose’ data sampling method (~5% sampling rate from the whole data)

Name        Total
Tweets      587,290,394
Users       23,571,765
URLs        136,034,309
Hashtags    96,399,587

[Figure: chart with a vertical axis from 5 mil to 25 mil]

Thanks to Brendan O’Connor (CMU) and Twitter Inc.
Existing methods: empirical approach for predicting the ILI rate

Pipeline: Twitter corpus → ILI-related keywords filtering → ILI-related tweets

Keyword sets:
  Culotta4         Signorini3        Chew3
  flu              swine             h1n1
  cough            flu               swine flu
  headache         influenza         swineflu
  sore throat

Case definition from CDC:
Influenza-like Illness (ILI) = fever (> 100°F)* AND cough and/or sore throat (in the absence of a known cause other than influenza)
*Temperature can be measured in the office or at home

Every year: 3–5 million cases of severe illness, 250,000–500,000 deaths (WHO 2009)

Gold standard from laboratory data reported by the US Outpatient Influenza-Like Illness Surveillance Network (ILINet) (CDC)
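
The keyword filtering used by the existing methods can be sketched roughly as follows. The three keyword lists come from the slide; the matching rule (case-insensitive, whole word or phrase) and the names `KEYWORD_SETS`, `matches_any` and `filter_tweets` are assumptions made for illustration, not the cited authors' implementations.

```python
import re

# Keyword sets as listed on the slide.
KEYWORD_SETS = {
    "Culotta4": ["flu", "cough", "headache", "sore throat"],
    "Signorini3": ["swine", "flu", "influenza"],
    "Chew3": ["h1n1", "swine flu", "swineflu"],
}


def matches_any(text, keywords):
    """Return True if any keyword occurs as a whole word or phrase (assumed rule)."""
    lowered = text.lower()
    return any(
        re.search(r"\b" + re.escape(keyword) + r"\b", lowered)
        for keyword in keywords
    )


def filter_tweets(tweets, set_name):
    """Keep only the tweets that match at least one keyword of the chosen set."""
    keywords = KEYWORD_SETS[set_name]
    return [tweet for tweet in tweets if matches_any(tweet, keywords)]
```

With whole-word matching, "coughed" does not match "cough" and "#swineflu" does not match "flu"; a stemmed or substring match would behave differently, so the choice of matching rule affects which tweets are counted.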
Our approach: two-step filtering

Pipeline: Twitter corpus → Step 1: syndrome-related filtering → Step 2: semantic filtering

• Step 1, knowledge-based approach:
   – Syndrome only
   – Syndrome + “flu”
   – Syndrome + “flu” - URL
• Step 2, semantic level:
   – Negation
   – Emoticon
   – HashTags
   – Humor
   – Geo
Knowledge-based approach
If the tweeter is referring to someone else’s symptom then filter out. Only retain if the tweeter is referring to their own symptoms.

• Syndrome only: tweets containing syndrome keywords
   – Example: Barber just coughed on me in the chair.
• Syndrome + “flu”: tweets containing syndrome keywords and “flu”
   – Example: I got flu n coughed a lot.
• Syndrome + “flu” - URL: tweets containing syndrome keywords and “flu”, remove links
   – Example: 7-year-old boy dies of flu, pneumonia <URL>
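
The three step-1 variants can be sketched as nested conditions. This is a minimal illustration under stated assumptions: `SYNDROME_KEYWORDS` is only a small subset of the respiratory keywords, prefix matching is used so that "coughed" counts as "cough", and "remove links" is read as dropping tweets that contain a link. None of these choices is confirmed by the slides, and all function names are hypothetical.

```python
import re

# Small illustrative subset of the respiratory syndrome keywords (assumption).
SYNDROME_KEYWORDS = ["cough", "sore throat", "runny nose", "pneumonia", "bronchitis"]

# Matches real links and the "<URL>" placeholder used in the slide example.
URL_PATTERN = re.compile(r"https?://\S+|<\s*URL\s*>", re.IGNORECASE)


def has_syndrome(text):
    """True if any syndrome keyword starts a word in the text (prefix match)."""
    lowered = text.lower()
    return any(re.search(r"\b" + re.escape(k), lowered) for k in SYNDROME_KEYWORDS)


def has_flu(text):
    """True if the standalone word "flu" appears."""
    return re.search(r"\bflu\b", text.lower()) is not None


def has_url(text):
    """True if the tweet contains a link or a link placeholder."""
    return URL_PATTERN.search(text) is not None


def step1(text, variant):
    """Apply one of the three step-1 variants named on the slide."""
    if variant == "syndrome_only":
        return has_syndrome(text)
    if variant == "syndrome_flu":
        return has_syndrome(text) and has_flu(text)
    if variant == "syndrome_flu_no_url":
        return has_syndrome(text) and has_flu(text) and not has_url(text)
    raise ValueError(f"unknown variant: {variant}")
```

Each variant is stricter than the previous one, so the set of retained tweets can only shrink from "Syndrome only" to "Syndrome + flu - URL".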
Snapshot of BioCaster ontology
Extract syndrome-related keywords from BioCaster ontology
We extracted keywords only from the respiratory syndrome (37 respiratory syndrome keywords):

achy chest               cold symptom      respiratory failure
apnea                    cough             runny nose
asthma                   dyspnea           short of breath
asthmatic                dyspnoea          shortness of breath
blocked nose             gasping for air   sinusitis
breathing difficulties   lung sounds       sore throat
breathing trouble        pneumonia         stop breathing
bronchitis               rales             stuffy nose
…                        …                 …
Semantic level filtering

• Negation: remove negation in tweets
   – Example: I don’t have flu
• Emoticon: remove tweets containing smiley emoticons, e.g., :-), :D
   – Example: Glad to hear that you’re beating the flu. :-) Hope you don’t get the nasty cough that everyone’s getting this year
• HashTags: keep tweets containing the keyword “flu”
   – Example: Still coughing smh #swineflu #h1n1
• Humor: remove humor features in tweets, e.g., “haha”, “hihi”, “***cough … cough***”
   – Example: Hm Im kinda wanting to go to NYC really soon ***cough … cough*** @Ctmomofsix =)
• Geo: tweets from geographical locations (e.g., US)
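
A rough sketch of the emoticon, humor and hashtag checks is given below. The regular expressions are assumptions built only from the examples on the slide, and reading the HashTags rule as "some hashtag contains flu" is one possible interpretation; the authors' actual patterns are not shown in the slides.

```python
import re

# Smiley emoticons taken from the slide examples (not an exhaustive list).
EMOTICON = re.compile(r":-\)|:\)|:D|=\)")

# Humor markers: "haha", "hihi" and the "***cough ... cough***" pattern.
HUMOR = re.compile(r"\b(?:haha|hihi)\w*|\*+\s*cough.*?cough\s*\*+", re.IGNORECASE)

HASHTAG = re.compile(r"#(\w+)")


def has_emoticon(text):
    """True if the tweet contains one of the listed smiley emoticons."""
    return EMOTICON.search(text) is not None


def has_humor(text):
    """True if the tweet contains one of the listed humor markers."""
    return HUMOR.search(text) is not None


def hashtag_mentions_flu(text):
    """True if any hashtag contains "flu" (assumed reading of the HashTags rule)."""
    return any("flu" in tag.lower() for tag in HASHTAG.findall(text))


def passes_emoticon_and_humor(text):
    """Keep a tweet only if it has neither a smiley emoticon nor a humor marker."""
    return not has_emoticon(text) and not has_humor(text)
```

Negation and Geo are left out here because they need a tagger and location metadata respectively.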
Detecting negation in Twitter
[Figure: semantic tags and a tagged example tweet]
Rule A: If VBZ is followed by XX then that sentence is negative
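
Rule A can be read as a check over adjacent tags in a tagged sentence. The sketch below assumes the tweet has already been tagged and is given as (token, tag) pairs; the tag names VBZ and XX are taken from the rule as written, and the tagged sentences in the usage example are hypothetical, since the slide does not show the tagger or define XX.

```python
def is_negative(tagged_tokens, first_tag="VBZ", second_tag="XX"):
    """Return True if a token tagged first_tag is immediately followed by one tagged second_tag."""
    tags = [tag for _token, tag in tagged_tokens]
    return any(
        current == first_tag and following == second_tag
        for current, following in zip(tags, tags[1:])
    )


# Hypothetical tagger output, for illustration only.
negated = [("She", "PRP"), ("does", "VBZ"), ("not", "XX"), ("cough", "VB")]
plain = [("She", "PRP"), ("coughs", "VBZ")]

print(is_negative(negated))
print(is_negative(plain))
```

Further rules of the same shape could be added by calling the function with other tag pairs.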
Correlation to the CDC data

Approach                 Method                                        Pearson corr (%)
Empirical approach       Culotta4                                      94.85
                         Signorini4                                    94.73
                         Chew3                                         94.48
Knowledge-based approach Syndrome only                                 88.60
                         Syndrome + “flu”                              97.13
                         Syndrome + “flu” - URL                        97.52* (p=0.06)
Semantic-based level     Negation                                      97.65
                         Emoticon                                      97.52
                         HashTags                                      97.61
                         Humor                                         97.65
                         Geo                                           98.39
                         Negation + Emoticon + HashTags + Humor + Geo  98.46* (p=0.007)

Note: Google Flu Trends got 99.12%!!! (using whole Google query logs)
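
The Pearson correlation reported in the table compares a weekly series derived from the filtered tweets with the weekly CDC ILI series. A minimal pure-Python sketch of the coefficient follows; the function name `pearson` and the sample numbers in the usage lines are made up for illustration and are not data from the study.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("series must have the same length and at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return covariance / (spread_x * spread_y)


# Made-up weekly values, for illustration only.
weekly_tweet_share = [0.8, 1.1, 1.9, 2.4, 1.6]
weekly_ili_rate = [1.0, 1.3, 2.2, 2.6, 1.9]
print(f"{100 * pearson(weekly_tweet_share, weekly_ili_rate):.2f}%")
```

Multiplying by 100 gives the percentage form used in the table.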
Correlation to the CDC data (cont’d)
[Figure: plot with a percentage (%) axis]
Semantic-level filtered tweets

Tweet samples by type:
• Influenza confirmation: I got flu n coughed a lot. Now my voice is like monster’s voice. Rrr
• Influenza symptoms: My day: flu-like symptoms (headache, body aches, cough, chills, 100.9 fever). Swine flu not ruled out. #H1N1
• Flu shots: I’m still getting flu shots, nothing is worth flu turning into bronchitis into pneumonia
• Self protection: Cover your mouth if coughing, use a tissue, wash your hands often & get a flu shot - protect and defend your community from #H1N1
• Medication: Wondering why I didn’t take the flu shot, laying in bed with cough drops, medicine, and the remote
Challenges

• Technical issues:
   – Data sampling: only ~5% sampling rate
• Semantic issues:
   – Metaphoric symptoms: Cabin fever setting in right now.
   – Interrogative sentences: wonder how long u get off work with
     swine flu?
   – Hypothetical sentences: I can ignore this sore throat no longer.
     And, um, maybe I should have gotten that H1N1 vaccine.
   – Others: Too much lemonade. My throat is burning.
Summary

• We proposed a general and extendable approach for tweet
  filtering based on an ontology of infectious diseases
  (BioCaster Ontology)
   – This methodology can be applied to other languages, e.g., Spanish,
     Japanese


• Our best results showed significant improvement in
  comparison to state-of-the-art keyword filtering methods

• Using simple semantic filtering in Twitter can improve
  correlation with CDC data
DIZIE: system for syndromic surveillance on Twitter
http://born.nii.ac.jp/dizie/

• 40 main world cities
• Syndromes: Gastrointestinal, Respiratory, Neurological, Dermatological, Haemorrhagic, Musculoskeletal

Collier and Doan. eHealth 2012;186-95
Acknowledgements

• Assoc. Prof. Wendy W. Chapman, PhD, DBMI, UCSD
• Mike Conway, PhD, DBMI, UCSD
• Grant-in-aid funding from the National Institute of
  Informatics, Japan

Editor's notes

  1. Timely, well-sourced information helps governments take the right actions to reduce the length and severity of an infectious disease outbreak. This information is important not only for pandemic influenza but also for many other diseases such as measles and mumps, as well as more exotic diseases like chikungunya. Governments in advanced countries like Japan have access to many sources of information within their own borders. These range from very reliable sources, such as laboratory reports, to statistics about how many drugs are being sold. However, the quickest source of information is often rumours. These can be individual messages published on Web sites like Twitter or news reports published in the media.
  2. Twitter is an example of a microblogging service. Users post messages (tweets) up to 140 characters in length. This enables them to post personal information on the go from mobile SMS devices wherever they happen to be. Hand in hand with the short messaging style is a highly abbreviated form of vocabulary. We often see special abbreviations and semantic tags called hashtags that are developed on the fly to describe new concepts such as H1N1 influenza. Volumes also tend to be very high. Although official statistics are hard to find, the Twitter developer’s conference mentioned 106 million users in 2010 and the BBC mentioned over 200 million users in 2011. Although this is a fraction of the total world population, it still might be possible to use Twitter messages for alerting in major cities where there is a high density of users.
  3. Talk here about the difficult cases – how they are classified and how we might overcome them in the future.