Semantic Filtering as an example of Semantic technologies for real-time analysis. This presentation emphasizes the value of semantics for social data filtering, specifically for the challenges faced during dynamically evolving event analysis.
1. Semantic Filtering
An example of Semantic technologies for real-time
analysis
Pavan Kapanipathi
Ohio Center of Excellence in Knowledge-enabled Computing (Kno.e.sis)
Wright State University, USA
Tutorial @ Kno.e.sis Centre: Semantics Approach to Big Data and Event Processing, Oct 7-9, 2015
3. Information Overload
500M users generate 500M tweets per day
It’s not information overload.
It’s filter failure
-- Clay Shirky
4. Each of our projects faces
Information Overload
• Disaster Management
• Hazards SEES
• Healthcare Issues
• Depression
• Societal Issues
• Edrug Trends
• Harassment
5. • Filtering is necessary
• Understanding the
requirements and utilizing
semantics for filtering is
important
Semantic Filtering
6. Two Main Topics
• Twarql
• Streaming annotation and flexible
querying on Twitter
• Continuous Semantics
• Tracking dynamic topics on Twitter
7. Twarql
Tracking health
care debate in the
United States on
Social Media
Health Care Reform
Health Care
Reform
Healthcare reform
legislation in the
United States
Patient Protection
and Affordable Care
Act (Obamacare)
Health Care Reform
16. Dynamic Topics
Manually updating keywords
to get topic relevant tweets is
not feasible
“indianelection”
“modi”
“bjp”
“congress”
“jan25”
“egypt”
“tunisia”
“arabspring”
“sandy”
“newyork”
“redcross”
“fema”
“swineflu” “ebola”
17. Problem
How can we automatically update the
filters to track a dynamically evolving
topic on Twitter
18. Hashtags as Filters
• Identify a topic on Twitter
• Tweets with hashtags are more
informative
• Users have a lot of freedom to
create them
• Some get popular, most die
23. Determining Relevancy of Co-occurring
Hashtags
#indianelection2015
#modikisarkar
Too many
co-occurring hashtags
24. Determining Relevancy of Co-occurring
Hashtags
#indianelection2015
#modikisarkar
Co-occurring:
Threshold δ
Preferably a prominent hashtag
25. Does Hashtag Co-occurrence Work?
o No. Co-occurrence alone does not work
o Many noisy or unrelated hashtags co-occur
o Determine the “dynamic” relevance of the top
co-occurring hashtag with the dynamic topic
26. Determining Relevancy of Co-occurring
Hashtags
#indianelection2015
#modikisarkar
Co-occurring:
Threshold
Latest K (200,500)
Narendra Modi: 0.9
BJP: 0.7
NDA: 0.6
India: 0.4
Elections: 0.2
Rahul Gandhi: 0.2
Congress: 0.2
Entity Extraction
and Scoring
δ
Normalized
Frequency
Scoring
(Vector Space Model)
27. Determining Relevancy of Co-occurring
Hashtags (Vector Space Model)
#indianelection2015
#modikisarkar
Co-occurring:
Threshold
Latest K (200,500)
Narendra Modi: 0.9
BJP: 0.7
NDA: 0.6
India: 0.4
Elections: 0.2
Rahul Gandhi: 0.2
Congress: 0.2
Entity Extraction
and Scoring
Indian General
Election,_2014
Dynamically Updated
Background Knowledge
δ
30. o Entities mentioned on the Event page of
Wikipedia are relevant to the Event
Event Relevant Background
Knowledge
31. o Wikipedia’s Hyperlink structure is very rich
o Page-Page (Wikipedia) links
Indian General
Election, 2014
Narendra Modi
Rahul Gandhi
NDA (India)
UPA (India)
BJP
Indian National
Congress
Event Relevant Background
Knowledge – Graph Structure
32. Determining Relevancy of Co-occurring
Hashtags (Vector Space Model)
#indianelection2015
#modikisarkar
Co-occurring:
Threshold
Latest K (200,500)
Narendra Modi: 0.9
BJP: 0.7
NDA: 0.6
India: 0.4
Elections: 0.2
Rahul Gandhi: 0.2
Congress: 0.2
Entity Extraction
and Scoring
Indian General
Election,_2014
Extract, Periodically
Update Hyperlink structure
One hop from Event
Page
δ
33. o Hyperlink structure is dynamically updated
Indian General
Election, 2014
Narendra Modi
Rahul Gandhi
NDA (India)
UPA (India)
BJP
Indian National
Congress
10 May 2010
Event Relevant Background
Knowledge
34. o Hyperlink structure is dynamically updated
Indian General
Election, 2014
Narendra Modi
Rahul Gandhi
NDA (India)
UPA (India)
BJP
Indian National
Congress
10 May 2010
29 March 2013
29 March 2013 29 March 2013
29 March 2013
Event Relevant Background
Knowledge
35. o Hyperlink structure is dynamically updated
Indian General
Election, 2014
Narendra Modi
Rahul Gandhi
NDA (India)
UPA (India)
BJP
Indian National
Congress
10 May 2010
29 March 2013
29 March 2013 29 March 2013
29 March 2013
20 May 2013
20 May 2013
Event Relevant Background
Knowledge
36. Determining Relevancy of Co-occurring
Hashtags (Vector Space Model)
#indianelection2015
#modikisarkar
Co-occurring:
Threshold
Latest K (200,500)
Narendra Modi: 0.9
BJP: 0.7
NDA: 0.6
India: 0.4
Elections: 0.2
Rahul Gandhi: 0.2
Congress: 0.2
Entity Extraction
and Scoring
Indian General
Election,_2014
Extract, Periodically
Update Hyperlink structure
Entity scoring based
on relevance to the Event
One hop from Event
Page
δ
37. o Edge Based Measure
o Link Overlap Measure: Jaccard similarity
o Out(c) are the links in Wikipedia page “c”
o Final Score: r(c,E) = ed(c,E) + oco(c,E)
Hyperlink Entity Scoring
[Figure: example links between the pages Indian General Election, 2014 and Narendra Modi, and between Indian General Election, 2014 and Indian General Election, 2009, labeled "Mutually Important", ed(c,E) = 1, and ed(c,E) = 2]
38. Determining Relevancy of Co-occurring
Hashtags (Vector Space Model)
#indianelection2015
#modikisarkar
Co-occurring:
Threshold
Latest K (200,500)
Narendra Modi: 0.9
BJP: 0.7
NDA: 0.6
India: 0.4
Elections: 0.2
Rahul Gandhi: 0.2
Congress: 0.2
Entity Extraction
and Scoring
Indian General
Election,_2014
Extract, Periodically
Update Hyperlink structure
Entity scoring based
on relevance to the Event
One hop from Event
Page
Indian General Elec: 1.0
India: 0.9
Elections: 0.7
UPA: 0.6
BJP: 0.3
NDA: 0.3
Narendra Modi: 0.3
δ
39. Determining Relevancy of Co-occurring
Hashtags (Vector Space Model)
#indianelection2015
#modikisarkar
Co-occurring:
Threshold
Latest K (200,500)
Narendra Modi: 0.9
BJP: 0.7
NDA: 0.6
India: 0.4
Elections: 0.2
Rahul Gandhi: 0.2
Congress: 0.2
Entity Extraction
and Scoring
Indian General
Election,_2014
Extract, Periodically
Update Hyperlink structure
Entity scoring based
on relevance to the Event
One hop from Event
Page
Indian General Elec: 1.0
India: 0.9
Elections: 0.7
UPA: 0.6
BJP: 0.3
NDA: 0.3
Narendra Modi: 0.3
Similarity
Check
Relevance Score: 0.6
δ
40. o Set Based
o Jaccard Similarity
o Considers the entities without the scores
o Vector Based
o Symmetric
o Cosine Similarity
o Asymmetric
o Subsumption Similarity
Similarity Check
43. o 2 events
o US Presidential Elections (#election2012)
o Hurricane Sandy (#sandy)
o Top 25 co-occurring hashtags
Evaluation – Dataset
44. o Ranking Problem
o Rank the Top 25 hashtags based on the relevancy
of tweets to the event
o Experiment with all the similarity metrics
o Manually annotated the tweets of these hashtags
as relevant/irrelevant (Gold Standard)
o Ranking Evaluation Metrics
o Mean Average Precision
o NDCG
Evaluation –
Strategy
47. Conclusions
• Semantic Technologies for Real-time filtering of Social
Data
– Wikipedia as a Dynamic Knowledge base for events
– Determining relevant hashtags using an asymmetric similarity
measure
– More hashtags in turn increase the coverage of Tweets for
events
• Hashtag Analysis
– Co-occurrence technique can be used to detect event
relevant hashtags
– More popular hashtags are easier to detect via
co-occurrence
Streams are everywhere. I am sure that by now Dr. Sheth has convinced you of this.
In this presentation, we will focus on social data streams.
So a journalist needs to track the debate and opts for a keyphrase such as "healthcare reform".
What would happen if we included semantics?
Examples such as epidemics, natural disasters, political events, and civil unrest are dynamic events.
They evolve continuously and involve many other entities. For example, during the Indian elections, Modi, Rahul Gandhi, Congress, and BJP were related to the event. In many cases this entity-event relevance changes over time. For example, in the case of Hurricane Sandy, NYC was part of the event for 2 days.
A naïve approach to getting tweets relevant to an event is to use keywords such as these as filters. Twitter's streaming API allows up to 400 keywords (unpaid) as filters.
However, we need to update these keywords as the topics evolve. Doing this manually is tedious and not feasible.
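A minimal sketch of the naïve keyword filter described above. It works on tweet text that has already been collected and makes no calls to the Twitter API; the function names, the sample tweets, and the token-level matching rule are assumptions made for illustration.

```python
MAX_KEYWORDS = 400  # limit quoted in the talk for the unpaid streaming API


def matches_keywords(tweet_text, keywords):
    """Return True if any keyword appears as a whole token (or hashtag) in the tweet."""
    # Lowercase each token and strip surrounding punctuation, '#' and '@'.
    tokens = {tok.strip("#@.,!?:;\"'").lower() for tok in tweet_text.split()}
    return any(kw.lower() in tokens for kw in keywords)


def filter_tweets(tweets, keywords):
    """Keep only the tweets that match at least one keyword."""
    if len(keywords) > MAX_KEYWORDS:
        raise ValueError("more keywords than the filter limit allows")
    return [t for t in tweets if matches_keywords(t, keywords)]


# Hypothetical sample data.
tweets = [
    "Stay safe everyone #Sandy",
    "FEMA opens shelters in New York",
    "Great pizza tonight",
]
keywords = ["sandy", "newyork", "redcross", "fema"]
relevant = filter_tweets(tweets, keywords)
```

The keyword list is static here: a tweet about a new aspect of the event is missed until someone adds a matching keyword by hand, which is the maintenance burden described above.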
We focus on this problem: automatically updating the filters.
The number of tweets collected for the Colorado shooting (CS) is around 125,000, whereas for Occupy Wall Street (OWS) we collected approximately 6M tweets. The total number of distinct hashtags found in these tweets is 12k and 191k, respectively. In order to retrieve all the tweets of each event we need 7k hashtags for CS and 21k hashtags for OWS. In other words, if we had to automatically update all the filters, which are hashtags, we would need 7k hashtags to get all the tweets of the Colorado shooting and 21k hashtags for OWS. It is important to note that we are crawling only for tweets with hashtags for now.
The top 1% of these hashtags retrieve around 85% of the tweets. So, practically speaking, in order to retrieve the tweets we need an approach that automatically reaches these top event-related hashtags on the go.
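The coverage figure above can be computed along the following lines. This is a sketch, assuming each tweet has already been reduced to the set of hashtags it contains; the function name and the toy data are illustrative and not taken from the study.

```python
from collections import Counter


def coverage_of_top_hashtags(tweet_hashtags, fraction):
    """Share of tweets that contain at least one of the most frequent hashtags.

    tweet_hashtags: list of sets of hashtags, one set per tweet.
    fraction: portion of the distinct hashtags to keep, e.g. 0.01 for the top 1%.
    """
    if not tweet_hashtags:
        return 0.0
    counts = Counter(tag for tags in tweet_hashtags for tag in tags)
    k = max(1, int(len(counts) * fraction))  # keep at least one hashtag
    top = {tag for tag, _ in counts.most_common(k)}
    covered = sum(1 for tags in tweet_hashtags if tags & top)
    return covered / len(tweet_hashtags)


# Toy data: 4 distinct hashtags, so the top 25% is a single hashtag.
sample = [{"sandy"}, {"sandy", "nyc"}, {"sandy", "fema"}, {"nyc"}, {"pizza"}]
cov = coverage_of_top_hashtags(sample, 0.25)
```

In the toy data the single most frequent hashtag already covers three of the five tweets, which mirrors on a small scale the skew reported above.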
This is how the co-occurrence graph of
From this analysis we learn the following.
Only a small percentage of the hashtags need to be detected to retrieve most tweets of the event.
These hashtags can be reached via co-occurrence.
If we start with a popular event-related hashtag, we can reach other popular ones quickly due to its clustering coefficient.
However, if we use co-occurrence as the primary strategy, we will reach a lot of hashtags as filters.
Now, coming back to the approach. The input is a hashtag that is relevant to the event; it is added manually, and we hope that it is a prominent hashtag. Using a threshold delta, we get the other relevant hashtags.
Hence we start with a manually added event-relevant hashtag and periodically determine the dynamic relevance of the top co-occurring hashtag.
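The first step of this loop, finding the hashtags that co-occur most often with the seed, might look like the sketch below. Reading delta as a cut-off on the relevance score computed later is an assumption, as are the function names and the sample data.

```python
from collections import Counter


def top_cooccurring(tweet_hashtags, seed, n):
    """Return the n hashtags that most often appear together with `seed`.

    tweet_hashtags: list of hashtag lists, one per tweet.
    Result: list of (hashtag, co-occurrence count), most frequent first.
    """
    counts = Counter(
        tag
        for tags in tweet_hashtags
        if seed in tags
        for tag in set(tags)  # count each hashtag once per tweet
        if tag != seed
    )
    return counts.most_common(n)


def accept(relevance_scores, delta):
    """Keep candidate hashtags whose relevance score reaches the threshold delta."""
    return [tag for tag, score in relevance_scores.items() if score >= delta]


# Hypothetical sample: hashtags of four tweets.
sample = [
    ["indianelection2015", "modikisarkar"],
    ["indianelection2015", "modikisarkar", "bjp"],
    ["indianelection2015", "cricket"],
    ["cricket"],
]
candidates = top_cooccurring(sample, "indianelection2015", 1)
```

Running this periodically over the newest tweets yields fresh candidates; each candidate still has to pass the relevance check before it is used as a filter.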
The latest 500 tweets of the hashtag are considered, and entities from these tweets are scored based on their normalized frequency. Hence the hashtag is represented as a vector of entities. These entities build a semantic context for the hashtag.
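A sketch of how such a hashtag vector could be built. Entity extraction is assumed to have happened upstream, so the input is one list of entity names per tweet. Normalizing by the number of tweets considered is one plausible choice; the talk does not spell out the normalization, so treat it as an assumption.

```python
from collections import Counter


def hashtag_entity_vector(tweet_entities, k=500):
    """Represent a hashtag as {entity: normalized frequency}.

    tweet_entities: entity lists for the hashtag's tweets, oldest first.
    k: number of most recent tweets to consider.
    """
    latest = tweet_entities[-k:]
    if not latest:
        return {}
    # Count each entity once per tweet, then divide by the number of tweets.
    counts = Counter(e for entities in latest for e in set(entities))
    return {entity: n / len(latest) for entity, n in counts.items()}


# Hypothetical entities extracted from four tweets of one hashtag.
sample = [
    ["Narendra Modi", "BJP"],
    ["Narendra Modi"],
    ["BJP", "NDA"],
    ["Narendra Modi", "India"],
]
vector = hashtag_entity_vector(sample)
```

With this normalization every weight lies between 0 and 1 and reads as the share of recent tweets that mention the entity.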
Next we utilize the Wikipedia page of the event and extract all the entities relevant to the event. The relevant entities are the ones present in the Wikipedia page.
There are Wikipedia pages for events.
Arab Spring
The links to other Wikipedia pages form a rich source of information. They can serve as the relevant semi-structured knowledge for the event. For example, UPA, NDA, BJP, Rahul Gandhi, and Narendra Modi are entities mentioned in the event Wikipedia page and are relevant to the event.
A graph structure can be created with all the entities on the topic Wikipedia page. This is a subgraph for Indian General Election 2014, with links between entities of the event, such as Narendra Modi and Indian National Congress.
It is also important that the knowledge base be dynamically updated based on changes in the event.
Also, it's dynamically updated.
Since Wikipedia is crowdsourced, its pages are updated in near real time. We are expecting the knowledge to be
We score these entities based on their relevance to the event, then compare the entity vector of the event with that of the hashtag.
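The entity score from the slides, r(c,E) = ed(c,E) + oco(c,E), can be sketched as follows. The reading of the edge measure (2 when the two pages link to each other, 1 when only one links to the other, 0 otherwise) is an interpretation of the slide figure and should be treated as an assumption, as should the tiny link table. The slides show event scores between 0 and 1, so some normalization presumably follows; it is not shown here.

```python
def edge_score(c, event, out_links):
    """ed(c, E): number of directions in which c and the event page link to each other."""
    c_to_event = event in out_links.get(c, set())
    event_to_c = c in out_links.get(event, set())
    return int(c_to_event) + int(event_to_c)


def link_overlap(c, event, out_links):
    """oco(c, E): Jaccard similarity of Out(c) and Out(E)."""
    a = out_links.get(c, set())
    b = out_links.get(event, set())
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def relevance(c, event, out_links):
    """r(c, E) = ed(c, E) + oco(c, E)."""
    return edge_score(c, event, out_links) + link_overlap(c, event, out_links)


# Hypothetical out-link table; real pages have far more links.
EVENT = "Indian general election, 2014"
out_links = {
    EVENT: {"Narendra Modi", "Rahul Gandhi", "BJP"},
    "Narendra Modi": {EVENT, "BJP"},
    "Rahul Gandhi": {"Indian National Congress"},
}
# Candidates are the pages one hop from the event page.
scores = {c: relevance(c, EVENT, out_links) for c in out_links[EVENT]}
```

Re-extracting `out_links` periodically and recomputing `scores` is what keeps the background knowledge in step with the event.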
We use three different types of semantic similarity measures: (1) set-based, (2) symmetric vector-based, and (3) asymmetric.
The idea is to penalize according to how big a subset of the event the hashtag is. #modisarkar cannot cover the whole of the Indian elections, whereas the reverse is possible. Better way to explain this. What we want to penalize
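The three kinds of measure can be sketched as below. Jaccard and cosine are the standard definitions. The `subsumption` function is only one possible asymmetric formulation, chosen for illustration: it normalizes by the hashtag vector alone, and it should not be read as the exact measure used in this work. The two vectors reuse the example values from the slides.

```python
import math


def jaccard(h, e):
    """Set-based: compares the entity sets and ignores the scores."""
    a, b = set(h), set(e)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cosine(h, e):
    """Symmetric vector-based similarity."""
    dot = sum(w * e.get(k, 0.0) for k, w in h.items())
    norm_h = math.sqrt(sum(w * w for w in h.values()))
    norm_e = math.sqrt(sum(w * w for w in e.values()))
    return dot / (norm_h * norm_e) if norm_h and norm_e else 0.0


def subsumption(h, e):
    """Asymmetric (illustrative): how much of the hashtag's weight the event accounts for."""
    total = sum(h.values())
    return sum(w * e.get(k, 0.0) for k, w in h.items()) / total if total else 0.0


hashtag = {"Narendra Modi": 0.9, "BJP": 0.7, "NDA": 0.6, "India": 0.4,
           "Elections": 0.2, "Rahul Gandhi": 0.2, "Congress": 0.2}
event = {"Indian General Elec": 1.0, "India": 0.9, "Elections": 0.7, "UPA": 0.6,
         "BJP": 0.3, "NDA": 0.3, "Narendra Modi": 0.3}

set_score = jaccard(hashtag, event)
cos_score = cosine(hashtag, event)
sub_score = subsumption(hashtag, event)
```

Swapping the arguments leaves `jaccard` and `cosine` unchanged but changes `subsumption`, which is the property that lets a narrow hashtag be judged against a broad event without requiring the reverse.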
To evaluate, we picked two other events: the US Presidential Elections and Hurricane Sandy. We picked #election2012 and #sandy as the starting points for the crawl. We retrieved 5000 tweets in real time and extracted the top 25 co-occurring hashtags. For each of these hashtags we picked the latest 200 tweets for analysis.
We rank the top 25 hashtags based on their relevance to the event. Hence, we transformed it into a ranking problem using all the similarity measures. For evaluation, all the tweets of the 25 hashtags were manually annotated as relevant/irrelevant to the event.
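For reference, the two ranking metrics named on the slide can be computed as in this sketch. It assumes each ranked hashtag carries a relevance label or gain derived from the manual annotation (for example the share of its tweets judged relevant); how the gold standard was turned into per-hashtag gains is not stated in the talk, so that step is an assumption.

```python
import math


def average_precision(relevance):
    """Average precision for one ranked list of binary relevance labels."""
    hits, total = 0, 0.0
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            total += hits / rank  # precision at this relevant position
    return total / hits if hits else 0.0


def mean_average_precision(rankings):
    """Mean of the average precision over several ranked lists (e.g. one per event)."""
    return sum(average_precision(r) for r in rankings) / len(rankings)


def ndcg(gains):
    """Normalized discounted cumulative gain for one ranked list of gains."""
    def dcg(values):
        return sum(g / math.log2(i + 1) for i, g in enumerate(values, start=1))

    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal else 0.0


# Hypothetical ranking of four hashtags: 1 = relevant, 0 = irrelevant.
ranking = [1, 0, 1, 0]
ap = average_precision(ranking)
score = ndcg(ranking)
```

Both metrics reward rankings that place relevant hashtags near the top, which is why they suit a comparison of the similarity measures against plain co-occurrence.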
Subsumption similarity works best, whereas using just the co-occurrence of hashtags as a ranking mechanism performs poorly compared to using Wikipedia as a knowledge base.
Interesting to hear the findings. More about the results.