In this paper, we present an event-based Epidemic Intelligence (EI) system framework leveraging social media data, e.g., Twitter messages (or tweets) for providing public health officials the necessary tools to survey and sift through relevant information, namely, disease outbreak events. There exists three main research challenges in gathering epidemic intelligence from social media streams: 1) dynamic classification to enable message filtering, 2) signal generation producing reliable warnings based on observed term frequency changes in the filtered messages, and 3) providing search and recommendation functionalities to domain experts, for better assessment of the potential outbreak threats associated with the generated signals. We outline possible approaches to solve these important challenges as well as discuss areas where further research is required. The aim of this paper is to provide guidance for similar endeavors, and to give prospective event-based Epidemic Intelligence system builders a more realistic view on the benefits and issues of social media stream analysis.
CTAC 2024 Valencia - Henrik Hanke - Reduce to the max - slideshare.pdf
Why Is It Difficult to Detect Outbreaks in Twitter?
1. Why Is It Difficult to Detect Outbreaks
in Twitter?
Avaré Stewart, Nattiya Kanhabua, Sara Romano
Ernesto Diaz-Aviles, Wolf Siberski, and Wolfgang Nejdl
L3S Research Center / Leibniz Universität Hannover, Germany
SIGIR 2013 Workshop on Health Search and Discovery
1 August 2013, Dublin, Ireland
2. Motivation
• Numerous works use Twitter to infer the existence
and magnitude of real-world events in real-time
– Earthquake [Sakaki et al., 2010]
– Predicting financial time series [Ruiz et al., 2012]
– Influenza epidemics [Culotta, 2010; Lampos et al.,
2011; Paul et al., 2011]
4. Health related tweets
• User status updates or news related to
public health are common in Twitter
– I have the mumps...am I alone?
– my baby girl has a Gastroenteritis so great!! Please
do not give it to meee
– #Cholera breaks out in #Dadaab refugee camp in
#Kenya http://t.co/....
– As many as 16 people have been found infected with
Anthrax in Shahjadpur upazila of the Sirajganj district
in Bangladesh.
9. Data Collection
• Official outbreak reports
– ~3,000 ProMED-mail reports from 2011
– WHO reports have very small coverage
• Twitter data
– ~1,200 health-related terms (i.e., infectious
diseases, their synonyms, pathogens and symptoms)
– Over 112 millions of tweets from 2011
• Series of NLP tools including
– OpenNLP (tokenization, sentence splitting, POS
tagging)
– OpenCalais (named entity recognition)
– HeidelTime (temporal expression extraction)
11. Event Extraction
• An event is a sentence containing two entities
– (1) medical condition and (2) geographic expression
– A minimum requirement by domain experts
• A victim and the time of an event can be identified
from the sentence itself, or its surrounding context
• Output: a set of event candidates
Reported by World Health Organization (WHO) on
29 July 2012 about an ongoing Ebola outbreak
in Uganda since the beginning of July 2012
[Kanhabua et al., TAIA’ 12]
12. Message Filtering: Challenges
• Ambiguity
– having several meanings
– used in different contexts
• Incompleteness
– missing or under-reported events
– data processing errors
13. Message Filtering: Challenges
• Ambiguity
– having several meanings
– used in different contexts
• Incompleteness
– missing or under-reported events
– data processing errors
Category Example tweet
Literature A two hour train journey, Love In the Time of Cholera ...
Music Dengue Fever’s “Uku,” Mixed by Paul Dreux Smith
Universal Audio...
Marketing Exclusive distributor of high quality #HIV/AIDS Blood &
Urine and #Hepatitis #Self -testers.
General Identification of genotype 4 Hepatitis E virus binding
proteins on swine liver cells: Hepatitis E virus...
Negative i dont have sniffles and no real coughing..well its
coughing but not like an influenza cough.
Joke Thought I had Bieber Fever. Ends up I just had a combo
of the mumps, mono, measles & the hershey squ...
16. Approach for Noisy Data
• MedISys1
– providing a list of negative keywords
created by medical experts
• Urban Dictionary2
– a Web-based dictionary of slang, ethnic
culture words or phrases
1
http://medusa.jrc.it/medisys/homeedition/en/home.html
2
http://www.urbandictionary.com/
17. Approach for Noisy Data
1
http://medusa.jrc.it/medisys/homeedition/en/home.html
2
http://www.urbandictionary.com/
21. Signal Generation: Challenges
• Temporal Dynamics
– seasonal infectious diseases
– rare and spontaneous outbreaks
• Location Dynamics
– frequency and duration
– levels of prevalence or severity
22. Signal Generation: Challenges
• Temporal Dynamics
– seasonal infectious diseases
– rare and spontaneous outbreaks
• Location Dynamics
– frequency and duration
– levels of prevalence or severity
[Rortais et al., 2010 in Journal of Food Research International]
23. Signal Generation: Challenges
• Temporal Dynamics
– seasonal infectious diseases
– rare and spontaneous outbreaks
• Location Dynamics
– frequency and duration
– levels of prevalence or severity
31. Approach
• Personalized Tweet Ranking for Epidemic
Intelligence
– Learning to rank and recommender systems
– User's context as implicit criteria for
recommendation
[Diaz-Aviles et al., WWW’ 12,
Diaz-Aviles et al., ICWSM’ 12]
34. Future Work
• Real-Time Analysis of Big and Fast
Social Web Streams
– Scalable, efficient methods for filtering and
generating signals in real-time
– Effective methods for aggregating and
visualizing information in a meaningful way
36. References
• [Culotta, 2010] A. Culotta. Towards detecting influenza epidemics by analyzing twitter messages. In
Proceedings of the First Workshop on Social Media Analytics (SOMA’2010), 2010.
• [Diaz-Aviles et al., 2012a] E. Diaz-Aviles, A. Stewart, E. Velasco, K. Denecke, and W. Nejdl. Towards
personalized learning to rank for epidemic intelligence based on social media streams. In Proceedings of
the 21st World Wide Web Conference (WWW ‘2012), 2012.
• [Diaz-Aviles et al., 2012b] E. Diaz-Aviles, A. Stewart, E. Velasco, K. Denecke, and W. Nejdl. Epidemic
intelligence for the crowd, by the crowd. In Proceedings of International AAAI Conference on Weblogs
and Social Media (ICWSM’2012), 2012.
• [Kanhabua et al., 2012a] N. Kanhabua, Sara Romano, and A. Stewart, Identifying Relevant Temporal
Expressions for Real-world Events, In SIGIR 2012 Workshop on Time-aware Information Access
(TAIA'2012), 2012.
• [Kanhabua et al., 2012b] N. Kanhabua, Sara Romano, and A. Stewart and W. Nejdl. Supporting
Temporal Analytics for Health Related Events in Microblogs. In Proceedings of CIKM'2012, 2012.
• [Kanhabua and Nejdl 2013] N. Kanhabua and W. Nejdl. Understanding the Diversity of Tweets in the
Time of Outbreaks. In Proceedings of the First International Web Observatory Workshop (WOW'2013) at
WWW'2013, 2013.
• [Lampos et al., 2011] V. Lampos and N. Cristianini. Nowcasting events from the social web with
statistical learning. ACM TIST, 3, 2011.
• [Paul et al., 2011] M. J. Paul and M. Dredze. You are what you tweet: Analyzing twitter for public health.
In Proceedings of International AAAI Conference on Weblogs and Social Media (ICWSM’2011), 2011.
• [Ruiz et al., 2012] E. J. Ruiz, V. Hristidis, C. Castillo, A. Gionis, and A. Jaimes. Correlating financial time
series with micro-blogging activity. In Proceedings of WSDM’2012, 2012.
• [Sakaki et al., 2010] T. Sakaki, M. Okazaki, and Y. Matsuo. Earthquake shakes twitter users: real-time
event detection by social sensors. In Proceedings of WWW’2010, 2010.
Notas del editor
To exploit this timeliness potential, we present an event-based Epidemic Intelligence (EI) system, which has emerged as a type of intelligence gathering aimed to detect events of interest to the public health from unstructured text on the Web
In the medical domain, there has been a surge in detecting health related tweets for early warning
Allow a rapid response from authorities [Diaz-Aviles et al., 2012]
Note that, there are existing EI systems, such as, the Bio- Caster Global Health Monitor1 or HealthMap 2. However, they differ from our proposed system in the level of analysis, information sources, language coverage and visualization.
Frequencies of cases reported to RKI and number of tweets mentioning the name of the disease: EHEC. Pearson correlation coefficient = 0.864. The monitor of Twitter allowed M-Eco to generate the first signals on Friday, May 20th, 2011.
We study and propose solutions to three main research challenges in gathering epidemic intelligence from social media streams: 1) dynamic classification to enable message filtering, 2) producing reliable warning signals (temporal anomalyies found) based on observe term frequency changes in these messages, using biosurveillance algorithms, and 3) providing suitable information and recommendations to domain experts, for better assessment of the potential outbreak threats associated with the generated signals.
Part I. Ground truth creation
Official outbreak reports
World Health Organization1
ProMED-mail2
Part II. Creating Twitter time series
medical condition
disease name, synonyms, pathogens, symptoms
location
geographic expressions, geo-location, or user profile
3 levels: country, continent, latitude
M-Eco strives to detect a large variety of infectious diseases, so we make use of a list of 1,258 terms consisting of infectious diseases, their synonyms, pathogens and symptoms, which are provided by the domain experts in two languages, namely English and German, for an initial filtering step. All documents and tweets are annotated with locations, medical conditions and
temporal expressions using a series of language processing tools, including OpenNLP2 for tokenization, sentence splitting and part-of-speech tagging, HeidelTime [34] for temporal expression extraction
Hash-tags co-occurring with #EHEC during May 23 and June 19, 2011, the main period of the
outbreak. The hash-tags are classified as entities of type Medical Condition, Location, or Complementary
Context, hash-tags out of these categories are discarded.
Hash-tags co-occurring with #EHEC during May 23 and June 19, 2011, the main period of the
outbreak. The hash-tags are classified as entities of type Medical Condition, Location, or Complementary
Context, hash-tags out of these categories are discarded.
Our approach builds upon [18] and extends it by: 1) incorporating the use of an orthogonal vector, which is learned by a Support Vector Machine (SVM), as a description of the feature change; and 2) computing a novelty score that lets the system identify
those tweets that contribute to the feature change, so that their true labels can be obtained.
In order to detect outbreak events for early warning, we exploit different state-of-the-art Biosurveillance algorithms as anomaly detectors in disease-related Twitter messages: \textbf{C1}, \textbf{C2}, \textbf{C3}, F-Statistic (\textbf{FS}), Experimental Weighted Moving Average (\textbf{EWMA}) and Farrington (\textbf{FA})~\cite{basseville1993detection, farrington_1996}. Traditional bio-surveillance systems usually exploit information from official sources, e.g., laboratory results, mortality rates, or the number of reported patients suffering from a disease outbreak. In recent years, researchers in the medical domain have begun to leverage real-time, social Web data, such as, tweets.
In order to detect outbreak events for early warning, we exploit different state-of-the-art Biosurveillance algorithms as anomaly detectors in disease-related Twitter messages: \textbf{C1}, \textbf{C2}, \textbf{C3}, F-Statistic (\textbf{FS}), Experimental Weighted Moving Average (\textbf{EWMA}) and Farrington (\textbf{FA})~\cite{basseville1993detection, farrington_1996}. Traditional bio-surveillance systems usually exploit information from official sources, e.g., laboratory results, mortality rates, or the number of reported patients suffering from a disease outbreak. In recent years, researchers in the medical domain have begun to leverage real-time, social Web data, such as, tweets.
Identified topics show similar trends during the known time periods of real-world outbreaks
Diversity reflects how the language (i.e., terms and locations) are used differently
Div(entity) highly correlates with topic dynamics for some diseases, i.e., mumps, ebola, botulism and ehec
Div(term) shows correlation with topic dynamics for cholera, anthrax and rubella
Algorithms: SampleDJ, TrackDJ (claims and proofs in [Deng et al., 2012])
Algorithms: SampleDJ, TrackDJ (claims and proofs in [Deng et al., 2012])
(1) Rely upon abundant user interactions and/or the availability of explicit feedback (e.g., ratings, likes, dislikes)
(2) Within M-Eco, we use the tweets from signals in developing techniques to provide a personalized short list of tweets that meets the context of the investigation. In this section, we review one of them; namely Personalized Tweet Ranking for Epidemic Intelligence (PTR4EI) [13, 14] and discuss the evaluation conducted during a major EHEC outbreak in Germany.
where t is a discrete Time interval, MCu the set of Medical Conditions, and Lu the set of Locations of user interest.
More precisely, we expand the user's context, Cu, using latent topics computed with LDA [5] on: 1) an indexed collection of tweets for epidemic intelligence; and 2) the hash-tags that co-occur with this context.
(1) Rely upon abundant user interactions and/or the availability of explicit feedback (e.g., ratings, likes, dislikes)
(2) Within M-Eco, we use the tweets from signals in developing techniques to provide a personalized short list of tweets that meets the context of the investigation. In this section, we review one of them; namely Personalized Tweet Ranking for Epidemic Intelligence (PTR4EI) [13, 14] and discuss the evaluation conducted during a major EHEC outbreak in Germany.
where t is a discrete Time interval, MCu the set of Medical Conditions, and Lu the set of Locations of user interest.
More precisely, we expand the user's context, Cu, using latent topics computed with LDA [5] on: 1) an indexed collection of tweets for epidemic intelligence; and 2) the hash-tags that co-occur with this context.