Semantic Filtering
An example of Semantic technologies for real-time
analysis
Pavan Kapanipathi
Ohio Center of Excellence in Knowledge-enabled Computing (Kno.e.sis)
Wright State University, USA
Tutorial @ Kno.e.sis Centre: Semantics Approach to Big Data and Event Processing, Oct 7-9, 2015
Streams are everywhere
Social Data
Text
Images
Videos
Sensor Data
Streams
Information Overload
500M users generate 500M tweets per day
3
It’s not information overload.
It’s filter failure
-- Clay Shirky
Each of our projects faces
Information Overload
• Disaster Management
• Hazards SEES
• Healthcare Issues
• Depression
• Societal Issues
• Edrug Trends
• Harassment
• Filtering is necessary
• Understanding the
requirements and utilizing
semantics for filtering is
important
Semantic Filtering
Two Main Topics
• Twarql
• Streaming annotation and flexible
querying on Twitter
• Continuous Semantics
• Tracking dynamic topics on Twitter
Twarql
Tracking the health care debate in the United States on social media
Health Care Reform
• Healthcare reform legislation in the United States
• Patient Protection and Affordable Care Act (Obamacare)
Twarql
Extraction Pipeline - Tweet
I think it’s good deal Apple Ipad Tablet (3G, wifi, WiFi + 3G) Hard
Nylon Cube Carrying Case for ipad ( iPad.. http://bit.ly/cry6LF)
Dbpedia:Ipad
Dbpedia:Tablet
URLs
http://penguinkang.com/tweetprobe/
RDF
• RDF Annotation
• Common RDF/OWL Data formats.
• FOAF, SIOC, OPO, MOAT
: Health_care_reform
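A minimal sketch of the annotation step on this slide, assuming a tiny dictionary-based entity spotter rather than Twarql's actual extractor. The gazetteer, the tweet URI, and the choice of the `sioc:content`, `sioc:links_to` and `moat:taggedWith` properties are illustrative assumptions and should be checked against the vocabularies before use.

```python
import re

# Hypothetical gazetteer: lowercase surface form -> resource label from the slide.
GAZETTEER = {
    "ipad": "http://dbpedia.org/resource/Ipad",
    "tablet": "http://dbpedia.org/resource/Tablet",
}

# Namespace prefixes; the property names appended below are assumptions.
SIOC = "http://rdfs.org/sioc/ns#"
MOAT = "http://moat-project.org/ns#"

URL_RE = re.compile(r"https?://[^\s)]+")
WORD_RE = re.compile(r"[A-Za-z0-9]+")

TWEET = ("I think it's good deal Apple Ipad Tablet (3G, wifi, WiFi + 3G) Hard "
         "Nylon Cube Carrying Case for ipad ( iPad.. http://bit.ly/cry6LF)")


def annotate(tweet_uri, text):
    """Return (subject, predicate, object) triples describing one tweet."""
    triples = [(tweet_uri, SIOC + "content", text)]
    seen = set()
    # Entity spotting: case-insensitive dictionary lookup on each token.
    for token in WORD_RE.findall(text):
        resource = GAZETTEER.get(token.lower())
        if resource and resource not in seen:
            seen.add(resource)
            triples.append((tweet_uri, MOAT + "taggedWith", resource))
    # URL extraction: every link in the tweet becomes its own triple.
    for url in URL_RE.findall(text):
        triples.append((tweet_uri, SIOC + "links_to", url))
    return triples


if __name__ == "__main__":
    for triple in annotate("http://example.org/tweet/1", TWEET):
        print(triple)
```

Once tweets are in this shape, a query over the triples can select, for example, all tweets tagged with a given entity.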
Twarql – Use Case
Demo
http://knoesis.wright.edu/library/tools/twarql/demo.swf
Continuous Semantics
13
Dynamic Topics
Continuously
Evolving on
Twitter
Entity – Event
relevance
changes
Many entities are
involved
14
Dynamic Topics
Manually crawl using
keywords
“indianelection” “jan25” “sandy”
“swineflu” “ebola”
15
Dynamic Topics
Manually updating keywords
to get topic relevant tweets is
not feasible
“indianelection”
“modi”
“bjp”
“congress”
“jan25”
“egypt”
“tunisia”
“arabspring”
“sandy”
“newyork”
“redcross”
“fema”
“swineflu” “ebola”
16
Problem
How can we automatically update the
filters to track a dynamically evolving
topic on Twitter?
17
Hashtags as Filters
• Identify a topic on Twitter
• Tweets with hashtags are more
informative
• Users have a lot of freedom to
create them
• Some get popular, most die
18
Exploring Hashtags as Evolving Filters for
Dynamic Topics
                   Colorado Shooting (CS)   Occupy Wall Street (OWS)
Tweets             122,062                  6,077,378
Tags               192,512                  15,963,209
Distinct           12,350                   191,602
100% Retrieval     7,763                    21,314
HASHTAG FILTERS
19
Top 1% retrieves
around 85% of the
tweets
Hashtag distributions
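A rough sketch of how a coverage figure like the one on this slide could be measured, assuming each tweet is given as a list of its hashtags. The function name and the toy data are hypothetical; the 85% figure itself comes from the slide, not from this code.

```python
from collections import Counter


def coverage_of_top_hashtags(tweets, fraction):
    """Share of tweets that contain at least one of the top `fraction` of hashtags.

    `tweets` is a list of hashtag lists, one list per tweet.
    """
    if not tweets:
        return 0.0
    # Count each hashtag once per tweet.
    counts = Counter(tag for tags in tweets for tag in set(tags))
    if not counts:
        return 0.0
    k = max(1, round(len(counts) * fraction))
    top = {tag for tag, _ in counts.most_common(k)}
    covered = sum(1 for tags in tweets if top & set(tags))
    return covered / len(tweets)


if __name__ == "__main__":
    toy = [["a"], ["a", "b"], ["a"], ["c"]]
    print(coverage_of_top_hashtags(toy, 0.34))
```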
20
Colorado Shooting Occupy Wall Street
Event Related
Hashtags co-occur
with each other
Hashtag Filters Co-occurrence Graph
21
Summarizing Hashtag Analysis
Starting with one of the event relevant
hashtags, by co-occurrence we can reach
other relevant hashtags
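The idea of reaching other hashtags from a seed by co-occurrence can be sketched as follows. This is an illustrative counter over hashtag lists, with a hypothetical function name; the default of 25 mirrors the "Top 25 co-occurring hashtags" used later in the evaluation.

```python
from collections import Counter


def cooccurring_hashtags(tweets, seed, top_n=25):
    """Return the hashtags that most often appear together with `seed`.

    `tweets` is a list of hashtag lists; matching is case-insensitive.
    """
    seed = seed.lower()
    counts = Counter()
    for tags in tweets:
        tags = {t.lower() for t in tags}
        if seed in tags:
            counts.update(tags - {seed})
    return counts.most_common(top_n)


if __name__ == "__main__":
    toy = [["#sandy", "#nyc"], ["#sandy", "#fema", "#nyc"], ["#ows"]]
    print(cooccurring_hashtags(toy, "#sandy"))
```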
22
Determining Relevancy of Co-occurring
Hashtags
#indianelection2015
#modikisarkar
Too many
co-occurring hashtags
23
Determining Relevancy of Co-occurring
Hashtags
#indianelection2015
#modikisarkar
Co-occurring:
Threshold δ
Preferably a prominent hashtag
24
Does hashtag co-occurrence work?
o No. Co-occurrence alone does not work
o Many noisy or unrelated hashtags co-occur
o Determine the “dynamic” relevance of the top
co-occurring hashtag to the dynamic topic
25
Determining Relevancy of Co-occurring
Hashtags
#indianelection2015
#modikisarkar
Co-occurring:
Threshold
Latest K (200,500)
Narendra Modi: 0.9
BJP: 0.7
NDA: 0.6
India: 0.4
Elections: 0.2
Rahul Gandhi: 0.2
Congress: 0.2
Entity Extraction
and Scoring
δ
Normalized
Frequency
Scoring
26
(Vector Space Model)
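A small sketch of the entity scoring step, assuming entities have already been extracted per tweet and that "normalized frequency" means dividing each count by the largest count. The slide does not state the normalization, so that choice and the function name are assumptions.

```python
from collections import Counter


def entity_vector(tweet_entities, k=200):
    """Score entities over the latest `k` tweets of a hashtag.

    `tweet_entities` is a list of entity lists, oldest tweet first.
    Scores are counts divided by the maximum count (an assumed normalization).
    """
    latest = tweet_entities[-k:]
    counts = Counter(e for entities in latest for e in entities)
    if not counts:
        return {}
    top = max(counts.values())
    return {entity: count / top for entity, count in counts.items()}


if __name__ == "__main__":
    toy = [["Narendra Modi", "BJP"], ["Narendra Modi"]]
    print(entity_vector(toy, k=200))
```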
Determining Relevancy of Co-occurring
Hashtags (Vector Space Model)
#indianelection2015
#modikisarkar
Co-occurring:
Threshold
Latest K (200,500)
Narendra Modi: 0.9
BJP: 0.7
NDA: 0.6
India: 0.4
Elections: 0.2
Rahul Gandhi: 0.2
Congress: 0.2
Entity Extraction
and Scoring
Indian General
Election,_2014
Dynamically Updated
Background Knowledge
δ
27
Event Relevant Background
Knowledge
o Wikipedia Event Pages
28
o Wikipedia Event Pages
Event Relevant Background
Knowledge
29
o Entities mentioned on the Event page of
Wikipedia are relevant to the Event
Event Relevant Background
Knowledge
30
o Wikipedia’s Hyperlink structure is very rich
o Page-Page (Wikipedia) links
Indian General
Election, 2014
Narendra Modi
Rahul Gandhi
NDA (India)
UPA (India)
BJP
Indian National
Congress
Event Relevant Background
Knowledge – Graph Structure
31
Determining Relevancy of Co-occurring
Hashtags (Vector Space Model)
#indianelection2015
#modikisarkar
Co-occurring:
Threshold
Latest K (200,500)
Narendra Modi: 0.9
BJP: 0.7
NDA: 0.6
India: 0.4
Elections: 0.2
Rahul Gandhi: 0.2
Congress: 0.2
Entity Extraction
and Scoring
Indian General
Election,_2014
Extract, Periodically
Update Hyperlink structure
One hop from Event
Page
δ
32
o Hyperlink structure is dynamically updated
Indian General
Election, 2014
Narendra Modi
Rahul Gandhi
NDA (India)
UPA (India)
BJP
Indian National
Congress
10 May 2010
Event Relevant Background
Knowledge
33
o Hyperlink structure is dynamically updated
Indian General
Election, 2014
Narendra Modi
Rahul Gandhi
NDA (India)
UPA (India)
BJP
Indian National
Congress
10 May 2010
29 March 2013
29 March 2013 29 March 2013
29 March 2013
Event Relevant Background
Knowledge
34
o Hyperlink structure is dynamically updated
Indian General
Election, 2014
Narendra Modi
Rahul Gandhi
NDA (India)
UPA (India)
BJP
Indian National
Congress
10 May 2010
29 March 2013
29 March 2013 29 March 2013
29 March 2013
20 May 2013
20 May 2013
Event Relevant Background
Knowledge
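A toy sketch of a dynamically updated one-hop link set, assuming each link from the event page is stored with the date it first appeared. The pairing of dates with pages below is made up for illustration (the slide only shows the dates on the diagram), and the function name is hypothetical.

```python
from datetime import date

# Hypothetical link log for the event page "Indian General Election, 2014":
# (linked page, date the link is assumed to have appeared).
LINK_LOG = [
    ("Indian National Congress", date(2010, 5, 10)),
    ("BJP", date(2013, 3, 29)),
    ("NDA (India)", date(2013, 3, 29)),
    ("UPA (India)", date(2013, 3, 29)),
    ("Narendra Modi", date(2013, 5, 20)),
    ("Rahul Gandhi", date(2013, 5, 20)),
]


def one_hop_links(link_log, as_of):
    """Pages linked from the event page on or before `as_of`."""
    return {page for page, added in link_log if added <= as_of}


if __name__ == "__main__":
    for day in (date(2010, 5, 10), date(2013, 3, 29), date(2013, 5, 20)):
        print(day, sorted(one_hop_links(LINK_LOG, day)))
```

Re-running the snapshot periodically gives the background knowledge that the later scoring steps compare hashtags against.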
35
Determining Relevancy of Co-occurring
Hashtags (Vector Space Model)
#indianelection2015
#modikisarkar
Co-occurring:
Threshold
Latest K (200,500)
Narendra Modi: 0.9
BJP: 0.7
NDA: 0.6
India: 0.4
Elections: 0.2
Rahul Gandhi: 0.2
Congress: 0.2
Entity Extraction
and Scoring
Indian General
Election,_2014
Extract, Periodically
Update Hyperlink structure
Entity scoring based
on relevance to the Event
One hop from Event
Page
δ
36
Hyperlink Entity Scoring
o Edge Based Measure
o Link Overlap Measure: Jaccard similarity
o Out(c) are the links in Wikipedia page “c”
o Final Score: r(c,E) = ed(c,E) + oco(c,E)
[Diagram: Indian General Election, 2014 and Narendra Modi; Indian General Election, 2014 and Indian General Election, 2009; labels: 1, Mutually Important, ed(c,E) = 1, ed(c,E) = 2]
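A hedged sketch of the scoring r(c,E) = ed(c,E) + oco(c,E). It assumes that ed counts link directions between c and the event page (1 for a one-way link, 2 for a mutual link), which is one reading of the diagram labels, and that oco is the Jaccard similarity of the two pages' out-links. The toy graph and function names are hypothetical.

```python
def edge_score(c, event, out_links):
    """ed(c, E): number of link directions between c and the event page (0, 1 or 2)."""
    forward = c in out_links.get(event, set())
    backward = event in out_links.get(c, set())
    return int(forward) + int(backward)


def link_overlap(c, event, out_links):
    """oco(c, E): Jaccard similarity of Out(c) and Out(E)."""
    a = out_links.get(c, set())
    b = out_links.get(event, set())
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def relevance(c, event, out_links):
    """r(c, E) = ed(c, E) + oco(c, E)."""
    return edge_score(c, event, out_links) + link_overlap(c, event, out_links)


if __name__ == "__main__":
    E = "Indian General Election, 2014"
    # Toy out-link sets, invented for illustration.
    graph = {
        E: {"Narendra Modi", "BJP", "Indian General Election, 2009"},
        "Narendra Modi": {E, "BJP"},
        "Indian General Election, 2009": {"BJP"},
    }
    for page in ("Narendra Modi", "Indian General Election, 2009"):
        print(page, relevance(page, E, graph))
```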
37
Determining Relevancy of Co-occurring
Hashtags (Vector Space Model)
#indianelection2015
#modikisarkar
Co-occurring:
Threshold
Latest K (200,500)
Narendra Modi: 0.9
BJP: 0.7
NDA: 0.6
India: 0.4
Elections: 0.2
Rahul Gandhi: 0.2
Congress: 0.2
Entity Extraction
and Scoring
Indian General
Election,_2014
Extract, Periodically
Update Hyperlink structure
Entity scoring based
on relevance to the Event
One hop from Event
Page
Indian General Elec: 1.0
India: 0.9
Elections: 0.7
UPA: 0.6
BJP: 0.3
NDA: 0.3
Narendra Modi: 0.3
δ
38
Determining Relevancy of Co-occurring
Hashtags (Vector Space Model)
#indianelection2015
#modikisarkar
Co-occurring:
Threshold
Latest K (200,500)
Narendra Modi: 0.9
BJP: 0.7
NDA: 0.6
India: 0.4
Elections: 0.2
Rahul Gandhi: 0.2
Congress: 0.2
Entity Extraction
and Scoring
Indian General
Election,_2014
Extract, Periodically
Update Hyperlink structure
Entity scoring based
on relevance to the Event
One hop from Event
Page
Indian General Elec: 1.0
India: 0.9
Elections: 0.7
UPA: 0.6
BJP: 0.3
NDA: 0.3
Narendra Modi: 0.3
Similarity
Check
Relevance Score: 0.6
δ
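The last step of the pipeline, comparing a hashtag's relevance score with the threshold δ, can be sketched as below. Treating "score ≥ δ" as the acceptance rule is an assumption, as are the function name and the example value of δ.

```python
def update_filters(filters, candidate_scores, delta):
    """Add every candidate hashtag whose relevance score reaches `delta`.

    `filters` is the current set of hashtags used to crawl the topic;
    `candidate_scores` maps co-occurring hashtags to relevance scores.
    """
    accepted = {tag for tag, score in candidate_scores.items() if score >= delta}
    return set(filters) | accepted


if __name__ == "__main__":
    current = {"#indianelection2015"}
    scores = {"#modikisarkar": 0.6, "#followback": 0.1}
    print(update_filters(current, scores, delta=0.5))
```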
39
o Set Based
o Jaccard Similarity
o Considers the entities without the scores
o Vector Based
o Symmetric
o Cosine Similarity
o Asymmetric
o Subsumption Similarity
Similarity Check
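A sketch of the three kinds of similarity check, using the example vectors from the pipeline slides. Jaccard and cosine follow their standard definitions. The slides do not define the subsumption similarity, so the asymmetric variant here is only an assumed stand-in: cosine similarity computed after dropping event entities that the hashtag never mentions, so that those entities are ignored instead of penalized.

```python
import math


def jaccard(h, e):
    """Set-based: compares entity sets and ignores the scores."""
    a, b = set(h), set(e)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cosine(h, e):
    """Vector-based, symmetric."""
    dot = sum(w * e.get(k, 0.0) for k, w in h.items())
    norm = math.sqrt(sum(w * w for w in h.values())) * math.sqrt(
        sum(w * w for w in e.values()))
    return dot / norm if norm else 0.0


def asymmetric(h, e):
    """Assumed asymmetric variant: event entities absent from h are ignored."""
    restricted = {k: e[k] for k in h if k in e}
    return cosine(h, restricted)


HASHTAG = {"Narendra Modi": 0.9, "BJP": 0.7, "NDA": 0.6, "India": 0.4,
           "Elections": 0.2, "Rahul Gandhi": 0.2, "Congress": 0.2}
EVENT = {"Indian General Elec": 1.0, "India": 0.9, "Elections": 0.7,
         "UPA": 0.6, "BJP": 0.3, "NDA": 0.3, "Narendra Modi": 0.3}

if __name__ == "__main__":
    print("jaccard   ", jaccard(HASHTAG, EVENT))
    print("cosine    ", cosine(HASHTAG, EVENT))
    print("asymmetric", asymmetric(HASHTAG, EVENT))
```

Under this assumption the asymmetric score is never lower than the cosine score for the same pair, because the event vector loses only dimensions that contribute nothing to the dot product.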
40
Intuition behind Asymmetric Similarity
[Diagram comparing Indian General Election 2014 and Narendra Modi under Symmetric and Asymmetric similarity; labels: Penalized, Ignored]
41
Determining Relevancy of Co-occurring
Hashtags (Vector Space Model)
#indianelection2015
#modikisarkar
Co-occurring:
Threshold
Latest K (200,500)
Narendra Modi: 0.9
BJP: 0.7
NDA: 0.6
India: 0.4
Elections: 0.2
Rahul Gandhi: 0.2
Congress: 0.2
Entity Extraction
and Scoring
Indian General
Election,_2014
Extract, Periodically
Update Hyperlink structure
Entity scoring based
on relevance to the Event
One hop from Event
Page
Indian General Elec: 1.0
India: 0.9
Elections: 0.7
UPA: 0.6
BJP: 0.3
NDA: 0.3
Narendra Modi: 0.3
Similarity
Check
Relevance Score: 0.6
δ
42
o 2 events
o US Presidential Elections (#election2012)
o Hurricane Sandy (#sandy)
o Top 25 co-occurring hashtags
Evaluation – Dataset
43
o Ranking Problem
o Rank the Top 25 hashtags based on the relevancy
of tweets to the event
o Experiment with all the similarity metrics
o Manually annotated the tweets of these hashtags
as relevant/irrelevant (Gold Standard)
o Ranking Evaluation Metrics
o Mean Average Precision
o NDCG
Evaluation –
Strategy
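For reference, the two ranking metrics named on this slide can be sketched with their usual textbook definitions. The function names are hypothetical, and the average precision here is normalized by the number of relevant items found in the ranked list, which assumes every relevant item appears in it.

```python
import math


def average_precision(relevance):
    """Average precision for one ranked list of 0/1 relevance labels."""
    hits, total = 0, 0.0
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


def mean_average_precision(rankings):
    """Mean of the average precision over several ranked lists."""
    if not rankings:
        return 0.0
    return sum(average_precision(r) for r in rankings) / len(rankings)


def ndcg_at_k(gains, k):
    """NDCG@k for graded gains listed in ranked order."""
    def dcg(values):
        return sum(g / math.log2(i + 2) for i, g in enumerate(values[:k]))

    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal else 0.0


if __name__ == "__main__":
    print(average_precision([1, 0, 1]))
    print(ndcg_at_k([3, 2, 1], k=3))
```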
44
Evaluation
45
Evaluation
Evaluated tweets containing the top-relevant
hashtags detected for dynamic topics
• NDCG - 92% at top-5 Mean Average
Precision
46
Conclusions
• Semantic Technologies for Real-time filtering of Social
Data
– Wikipedia as a Dynamic Knowledge base for events
– Determining relevant hashtags using Asymmetric similarity
measure
– More hashtags in turn increase the coverage of Tweets for
events
• Hashtag Analysis
– Co-occurrence technique can be used to detect event
relevant hashtags
– More popular hashtags are easier to detect via co-occurrence
47
Thanks
Contact: @pavankaps
pavan@knoesis.org

Más contenido relacionado

Destacado

Semantic Technologies for Big Data
Semantic Technologies for Big DataSemantic Technologies for Big Data
Semantic Technologies for Big Data
Marin Dimitrov
 

Destacado (13)

LOD2 Webinar Series: SILK
LOD2 Webinar Series: SILKLOD2 Webinar Series: SILK
LOD2 Webinar Series: SILK
 
Twarql Architecture - Streaming Annotated Tweets
Twarql Architecture - Streaming Annotated TweetsTwarql Architecture - Streaming Annotated Tweets
Twarql Architecture - Streaming Annotated Tweets
 
Stream Reasoning: mastering the velocity and variety dimensions of Big Data...
Stream Reasoning: mastering the velocity and variety dimensions of Big Data...Stream Reasoning: mastering the velocity and variety dimensions of Big Data...
Stream Reasoning: mastering the velocity and variety dimensions of Big Data...
 
Integrating Sensor and Social Data for Understanding City Events
Integrating Sensor and Social Data for Understanding City EventsIntegrating Sensor and Social Data for Understanding City Events
Integrating Sensor and Social Data for Understanding City Events
 
Examples of Applied Semantic Technologies: Application of Semantic Sensor Net...
Examples of Applied Semantic Technologies: Application of Semantic Sensor Net...Examples of Applied Semantic Technologies: Application of Semantic Sensor Net...
Examples of Applied Semantic Technologies: Application of Semantic Sensor Net...
 
Mastering the variety dimension of Big Data with semantic technologies: high ...
Mastering the variety dimension of Big Data with semantic technologies: high ...Mastering the variety dimension of Big Data with semantic technologies: high ...
Mastering the variety dimension of Big Data with semantic technologies: high ...
 
Semantics Approach to Big Data and Event Processing: an introduction focused ...
Semantics Approach to Big Data and Event Processing: an introduction focused ...Semantics Approach to Big Data and Event Processing: an introduction focused ...
Semantics Approach to Big Data and Event Processing: an introduction focused ...
 
Listening to the pulse of our cities fusing Social Media Streams and Call Dat...
Listening to the pulse of our cities fusing Social Media Streams and Call Dat...Listening to the pulse of our cities fusing Social Media Streams and Call Dat...
Listening to the pulse of our cities fusing Social Media Streams and Call Dat...
 
Examples of Applied Semantic Technologies: Social Data Annotation
Examples of Applied Semantic Technologies:  Social Data AnnotationExamples of Applied Semantic Technologies:  Social Data Annotation
Examples of Applied Semantic Technologies: Social Data Annotation
 
Examples of Real-World Big Data Application
Examples of Real-World Big Data ApplicationExamples of Real-World Big Data Application
Examples of Real-World Big Data Application
 
Mastering the Velocity Dimension of Big Data
Mastering the Velocity Dimension of Big DataMastering the Velocity Dimension of Big Data
Mastering the Velocity Dimension of Big Data
 
RDF Streams and Continuous SPARQL (C-SPARQL)
RDF Streams and Continuous SPARQL (C-SPARQL)RDF Streams and Continuous SPARQL (C-SPARQL)
RDF Streams and Continuous SPARQL (C-SPARQL)
 
Semantic Technologies for Big Data
Semantic Technologies for Big DataSemantic Technologies for Big Data
Semantic Technologies for Big Data
 

Similar a Knoesis-Semantic filtering-Tutorials

Harnessing Volume and Velocity Challenge on the Social Web using Crowd-Source...
Harnessing Volume and Velocity Challenge on the Social Web using Crowd-Source...Harnessing Volume and Velocity Challenge on the Social Web using Crowd-Source...
Harnessing Volume and Velocity Challenge on the Social Web using Crowd-Source...
Artificial Intelligence Institute at UofSC
 
India and Bharat: A Social Media Story
India and Bharat: A Social Media StoryIndia and Bharat: A Social Media Story
India and Bharat: A Social Media Story
Germin8
 
Causal data mining: Identifying causal effects at scale
Causal data mining: Identifying causal effects at scaleCausal data mining: Identifying causal effects at scale
Causal data mining: Identifying causal effects at scale
Amit Sharma
 

Similar a Knoesis-Semantic filtering-Tutorials (20)

2012 Presidential Elections on Twitter - An Analysis of How the US and French...
2012 Presidential Elections on Twitter - An Analysis of How the US and French...2012 Presidential Elections on Twitter - An Analysis of How the US and French...
2012 Presidential Elections on Twitter - An Analysis of How the US and French...
 
Harnessing Volume and Velocity Challenge on the Social Web using Crowd-Source...
Harnessing Volume and Velocity Challenge on the Social Web using Crowd-Source...Harnessing Volume and Velocity Challenge on the Social Web using Crowd-Source...
Harnessing Volume and Velocity Challenge on the Social Web using Crowd-Source...
 
Haystack keynote 2019: What is Search Relevance? - Max Irwin
Haystack keynote 2019: What is Search Relevance? - Max IrwinHaystack keynote 2019: What is Search Relevance? - Max Irwin
Haystack keynote 2019: What is Search Relevance? - Max Irwin
 
Actively Learning to Rank Semantic Associations for Personalized Contextual E...
Actively Learning to Rank Semantic Associations for Personalized Contextual E...Actively Learning to Rank Semantic Associations for Personalized Contextual E...
Actively Learning to Rank Semantic Associations for Personalized Contextual E...
 
Knowledge discovery in social media mining for market analysis
Knowledge discovery in social media mining for market analysisKnowledge discovery in social media mining for market analysis
Knowledge discovery in social media mining for market analysis
 
DaCENA Personalized Exploration of Knowledge Graphs Within a Context. Seminar...
DaCENA Personalized Exploration of Knowledge Graphs Within a Context. Seminar...DaCENA Personalized Exploration of Knowledge Graphs Within a Context. Seminar...
DaCENA Personalized Exploration of Knowledge Graphs Within a Context. Seminar...
 
Mining and analyzing social media part 1 - hicss47 tutorial - dave king
Mining and analyzing social media   part 1 - hicss47 tutorial - dave kingMining and analyzing social media   part 1 - hicss47 tutorial - dave king
Mining and analyzing social media part 1 - hicss47 tutorial - dave king
 
Metodologia para el analisis de redes sociales
Metodologia para el analisis de redes socialesMetodologia para el analisis de redes sociales
Metodologia para el analisis de redes sociales
 
How to use Big Data to drive product strategy and adoption
How to use Big Data to drive product strategy and adoptionHow to use Big Data to drive product strategy and adoption
How to use Big Data to drive product strategy and adoption
 
Data Analytics
Data AnalyticsData Analytics
Data Analytics
 
Twitter Based Sentiment Analysis of Each Presidential Candidate Using Long Sh...
Twitter Based Sentiment Analysis of Each Presidential Candidate Using Long Sh...Twitter Based Sentiment Analysis of Each Presidential Candidate Using Long Sh...
Twitter Based Sentiment Analysis of Each Presidential Candidate Using Long Sh...
 
India and Bharat: A Social Media Story
India and Bharat: A Social Media StoryIndia and Bharat: A Social Media Story
India and Bharat: A Social Media Story
 
SCA2013 Presentation: A Web-Based Content Analysis Tool
SCA2013 Presentation: A Web-Based Content Analysis ToolSCA2013 Presentation: A Web-Based Content Analysis Tool
SCA2013 Presentation: A Web-Based Content Analysis Tool
 
Causal data mining: Identifying causal effects at scale
Causal data mining: Identifying causal effects at scaleCausal data mining: Identifying causal effects at scale
Causal data mining: Identifying causal effects at scale
 
Data Science for Social Good
Data Science for Social GoodData Science for Social Good
Data Science for Social Good
 
2015 hypertext-election prediction
2015 hypertext-election prediction2015 hypertext-election prediction
2015 hypertext-election prediction
 
JDO 2019: Data Science for Developers - Matthew Renze
JDO 2019: Data Science for Developers -  Matthew RenzeJDO 2019: Data Science for Developers -  Matthew Renze
JDO 2019: Data Science for Developers - Matthew Renze
 
Close encounters of the Digital kind
Close encounters of the Digital kindClose encounters of the Digital kind
Close encounters of the Digital kind
 
Social Video Analytics: From Demography to Psychography of User Behaviour
Social Video Analytics: From Demography to Psychography of User BehaviourSocial Video Analytics: From Demography to Psychography of User Behaviour
Social Video Analytics: From Demography to Psychography of User Behaviour
 
The Value of Social Data
The Value of Social DataThe Value of Social Data
The Value of Social Data
 

Más de Pavan Kapanipathi

Personalized and Adaptive Semantic Information Filtering for Social Media
Personalized and Adaptive Semantic Information Filtering for Social MediaPersonalized and Adaptive Semantic Information Filtering for Social Media
Personalized and Adaptive Semantic Information Filtering for Social Media
Pavan Kapanipathi
 
Hierarchical Interest Graphs from Twitter
Hierarchical Interest Graphs from TwitterHierarchical Interest Graphs from Twitter
Hierarchical Interest Graphs from Twitter
Pavan Kapanipathi
 

Más de Pavan Kapanipathi (9)

Improving Natural Language Inference Using External Knowledge in the Science ...
Improving Natural Language Inference Using External Knowledge in the Science ...Improving Natural Language Inference Using External Knowledge in the Science ...
Improving Natural Language Inference Using External Knowledge in the Science ...
 
Personalized and Adaptive Semantic Information Filtering for Social Media
Personalized and Adaptive Semantic Information Filtering for Social MediaPersonalized and Adaptive Semantic Information Filtering for Social Media
Personalized and Adaptive Semantic Information Filtering for Social Media
 
Adressing Volume and Velocity Challenge on the Social Web using Crowd Sourced...
Adressing Volume and Velocity Challenge on the Social Web using Crowd Sourced...Adressing Volume and Velocity Challenge on the Social Web using Crowd Sourced...
Adressing Volume and Velocity Challenge on the Social Web using Crowd Sourced...
 
Hierarchical Interest Graphs from Twitter
Hierarchical Interest Graphs from TwitterHierarchical Interest Graphs from Twitter
Hierarchical Interest Graphs from Twitter
 
User Interests Identification From Twitter using Hierarchical Knowledge Base
User Interests Identification From Twitter using Hierarchical Knowledge BaseUser Interests Identification From Twitter using Hierarchical Knowledge Base
User Interests Identification From Twitter using Hierarchical Knowledge Base
 
Random walk on Graphs
Random walk on GraphsRandom walk on Graphs
Random walk on Graphs
 
SemPuSH: ISWC 2011 Poster
SemPuSH: ISWC 2011 PosterSemPuSH: ISWC 2011 Poster
SemPuSH: ISWC 2011 Poster
 
Privacy Aware Semantic Dissemination
Privacy Aware Semantic DisseminationPrivacy Aware Semantic Dissemination
Privacy Aware Semantic Dissemination
 
Personalized Filtering of Twitter Stream
Personalized Filtering of Twitter StreamPersonalized Filtering of Twitter Stream
Personalized Filtering of Twitter Stream
 

Último

Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
ZurliaSoop
 

Último (20)

General Principles of Intellectual Property: Concepts of Intellectual Proper...
General Principles of Intellectual Property: Concepts of Intellectual  Proper...General Principles of Intellectual Property: Concepts of Intellectual  Proper...
General Principles of Intellectual Property: Concepts of Intellectual Proper...
 
Python Notes for mca i year students osmania university.docx
Python Notes for mca i year students osmania university.docxPython Notes for mca i year students osmania university.docx
Python Notes for mca i year students osmania university.docx
 
Key note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdfKey note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdf
 
2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx
2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx
2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx
 
SKILL OF INTRODUCING THE LESSON MICRO SKILLS.pptx
SKILL OF INTRODUCING THE LESSON MICRO SKILLS.pptxSKILL OF INTRODUCING THE LESSON MICRO SKILLS.pptx
SKILL OF INTRODUCING THE LESSON MICRO SKILLS.pptx
 
This PowerPoint helps students to consider the concept of infinity.
This PowerPoint helps students to consider the concept of infinity.This PowerPoint helps students to consider the concept of infinity.
This PowerPoint helps students to consider the concept of infinity.
 
Unit-IV; Professional Sales Representative (PSR).pptx
Unit-IV; Professional Sales Representative (PSR).pptxUnit-IV; Professional Sales Representative (PSR).pptx
Unit-IV; Professional Sales Representative (PSR).pptx
 
Application orientated numerical on hev.ppt
Application orientated numerical on hev.pptApplication orientated numerical on hev.ppt
Application orientated numerical on hev.ppt
 
ICT Role in 21st Century Education & its Challenges.pptx
ICT Role in 21st Century Education & its Challenges.pptxICT Role in 21st Century Education & its Challenges.pptx
ICT Role in 21st Century Education & its Challenges.pptx
 
Towards a code of practice for AI in AT.pptx
Towards a code of practice for AI in AT.pptxTowards a code of practice for AI in AT.pptx
Towards a code of practice for AI in AT.pptx
 
Dyslexia AI Workshop for Slideshare.pptx
Dyslexia AI Workshop for Slideshare.pptxDyslexia AI Workshop for Slideshare.pptx
Dyslexia AI Workshop for Slideshare.pptx
 
Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
 
HMCS Max Bernays Pre-Deployment Brief (May 2024).pptx
HMCS Max Bernays Pre-Deployment Brief (May 2024).pptxHMCS Max Bernays Pre-Deployment Brief (May 2024).pptx
HMCS Max Bernays Pre-Deployment Brief (May 2024).pptx
 
How to Give a Domain for a Field in Odoo 17
How to Give a Domain for a Field in Odoo 17How to Give a Domain for a Field in Odoo 17
How to Give a Domain for a Field in Odoo 17
 
SOC 101 Demonstration of Learning Presentation
SOC 101 Demonstration of Learning PresentationSOC 101 Demonstration of Learning Presentation
SOC 101 Demonstration of Learning Presentation
 
On National Teacher Day, meet the 2024-25 Kenan Fellows
On National Teacher Day, meet the 2024-25 Kenan FellowsOn National Teacher Day, meet the 2024-25 Kenan Fellows
On National Teacher Day, meet the 2024-25 Kenan Fellows
 
FSB Advising Checklist - Orientation 2024
FSB Advising Checklist - Orientation 2024FSB Advising Checklist - Orientation 2024
FSB Advising Checklist - Orientation 2024
 
Accessible Digital Futures project (20/03/2024)
Accessible Digital Futures project (20/03/2024)Accessible Digital Futures project (20/03/2024)
Accessible Digital Futures project (20/03/2024)
 
Kodo Millet PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...
Kodo Millet  PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...Kodo Millet  PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...
Kodo Millet PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...
 
Holdier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdfHoldier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdf
 

Knoesis-Semantic filtering-Tutorials

  • 1. Semantic Filtering An example of Semantic technologies for real-time analysis Pavan Kapanipathi Ohio Center of Excellence in Knowledge-enabled Computing (Kno.e.sis) Wright State University, USA Tutorial @ Kno.e.sis Centre: Semantics Approach to Big Data and Event Processing, Oct 7-9, 2015
  • 2. Streams are everywhere Social Data Text Images Videos Sensor Data Streams
  • 3. Information Overload 500M users generate 500M tweets per day 3 It’s not information overload. It’s filter failure -- Clay Shirky
  • 4. Each of our projects face Information Overload • Disaster Management • Hazards SEES • Healthcare Issues • Depression • Societal Issues • Edrug Trends • Harassment
  • 5. • Filtering is necessary • Understanding the requirements and utilizing semantics for filtering is important Semantic Filtering
  • 6. Two Main Topics • Twarql • Streaming annotation and flexible querying on Twitter • Continuous Semantics • Tracking dynamic topics on Twitter
  • 7. Twarql Tracking health care debate in the United States on Social Media Health Care Reform Health Care Reform Healthcare reform legislation in the United States Patient Protection and Affordable Care Act (Obamacare) Health Care Reform
  • 9. Extraction Pipeline - Tweet I think it’s good deal Apple Ipad Tablet (3G, wifi, WiFi + 3G) Hard Nylon Cube Carrying Case for ipad ( iPad.. http://bit.ly/cry6LF) Dbpedia:Ipad Dbpedia:Tablet URLs http://penguinkang.com/tweetprobe/
  • 10. RDF • RDF Annotation • Common RDF/OWL Data formats. • FOAF, SIOC, OPO, MOAT
  • 14. Dynamic Topics Continuously Evolving on Twitter Entity – Event relevance changes Many entities are involved 14
  • 15. Dynamic Topics Manually crawl using keywords “indianelection”“jan25” “sandy” “swineflu” “ebola” 15
  • 16. Dynamic Topics Manually updating keywords to get topic relevant tweets is not feasible “indianelection” “modi” “bjp” “congress” “jan25” “egypt” “tunisia” “arabspring” “sandy” “newyork” “redcross” “fema” “swineflu” “ebola” 16
  • 17. Problem How can we automatically update the filters to track a dynamically evolving topic on Twitter 17
  • 18. Hashtags as Filters • Identify a topic on Twitter • Tweets with hashtags are more informative • Users have a lot of freedom to create them • Some get popular, most die 18
  • 19. Exploring Hashtags as Evolving Filters for Dynamic Topics Colorado Shooting Occupy Wall Street CS OWS Tweets: 122,062 Tweets: 6,077,378 Tags: 192,512 Distinct: 12,350 100% Retrieval: 7,763 Tags: 15,963,209 Distinct: 191,602 100% Retrieval: 21,314 HASHTAG FILTERS 19
  • 20. Top 1% retrieves around 85% of the tweets Hashtag distributions 20
  • 21. Colorado Shooting Occupy Wall Street Event Related Hashtags co-occur with each other Hashtag Filters Co-occurrence Graph 21
  • 22. Summarizing Hashtag Analysis Starting with one of the event relevant hashtags, by co-occurrence we can reach other relevant hashtags 22
  • 23. Determining Relevancy of Co-occurring Hashtags #indianelection2015 #modikisarkar Too many co-occurring hashtags 23
  • 24. Determining Relevancy of Co-occurring Hashtags #indianelection2015 #modikisarkar Co-occurring: Threshold δ Preferably a prominent hashtag 24
  • 25. Hashtag Co-occurrence works? o No. Just co-occurrence does not work o Many noisy or unrelated hashtags co-occurs o Determine the “dynamic” relevance of the top co-occurring hashtag with the dynamic topic 25
  • 26. Determining Relevancy of Co-occurring Hashtags #indianelection2015 #modikisarkar Co-occurring: Threshold Latest K (200,500) Narendra Modi: 0.9 BJP: 0.7 NDA: 0.6 India: 0.4 Elections: 0.2 Rahul Gandhi: 0.2 Congress: 0.2 Entity Extraction and Scoring δ Normalized Frequency Scoring 26 (Vector Space Model)
  • 27. Determining Relevancy of Co-occurring Hashtags (Vector Space Model) #indianelection2015 #modikisarkar Co-occurring: Threshold Latest K (200,500) Narendra Modi: 0.9 BJP: 0.7 NDA: 0.6 India: 0.4 Elections: 0.2 Rahul Gandhi: 0.2 Congress: 0.2 Entity Extraction and Scoring Indian General Election,_2014 Dynamically Updated Background Knowledge δ 27
  • 28. Event Relevant Background Knowledge o Wikipedia Event Pages 28
  • 29. o Wikipedia Event Pages Event Relevant Background Knowledge 29
  • 30. o Entities mentioned on the Event page of Wikipedia are relevant to the Event Event Relevant Background Knowledge 30
  • 31. o Wikipedia’s Hyperlink structure is very rich o Page-Page (Wikipedia) links Indian General Election, 2014 Narendra Modi Rahul Gandhi NDA (India)UPA (India) BJP Indian National Congress Event Relevant Background Knowledge – Graph Structure 31
  • 32. Determining Relevancy of Co-occurring Hashtags (Vector Space Model) #indianelection2015 #modikisarkar Co-occurring: Threshold Latest K (200,500) Narendra Modi: 0.9 BJP: 0.7 NDA: 0.6 India: 0.4 Elections: 0.2 Rahul Gandhi: 0.2 Congress: 0.2 Entity Extraction and Scoring Indian General Election,_2014 Extract, Periodically Update Hyperlink structure One hop from Event Page δ 32
  • 33. o Hyperlink structure is dynamically updated Indian General Election, 2014 Narendra Modi Rahul Gandhi NDA (India)UPA (India) BJP Indian National Congress 10 May 2010 Event Relevant Background Knowledge 33
  • 34. o Hyperlink structure is dynamically updated Indian General Election, 2014 Narendra Modi Rahul Gandhi NDA (India)UPA (India) BJP Indian National Congress 10 May 2010 29 March 2013 29 March 2013 29 March 2013 29 March 2013 Event Relevant Background Knowledge 34
  • 35. o Hyperlink structure is dynamically updated Indian General Election, 2014 Narendra Modi Rahul Gandhi NDA (India)UPA (India) BJP Indian National Congress 10 May 2010 29 March 2013 29 March 2013 29 March 2013 29 March 2013 20 May 2013 20 May 2013 Event Relevant Background Knowledge 35
  • 36. Determining Relevancy of Co-occurring Hashtags (Vector Space Model) #indianelection2015 #modikisarkar Co-occurring: Threshold Latest K (200,500) Narendra Modi: 0.9 BJP: 0.7 NDA: 0.6 India: 0.4 Elections: 0.2 Rahul Gandhi: 0.2 Congress: 0.2 Entity Extraction and Scoring Indian General Election,_2014 Extract, Periodically Update Hyperlink structure Entity scoring based on relevance to the Event One hop from Event Page δ 36
  • 37. o Edge Based Measure o Link Overlap Measure: Jaccard similarity o Out(c) are the links in Wikipedia page “c” o Final Score: r(c,E) = ed(c,E) + oco(c,E) Hyperlink Entity Scoring India General Election, 2014 Narendra Modi India General Election, 2014 India General Election, 2009 1 Mutually Important ed (c,E) = 1 ed (c,E) = 2 37
  • 38. Determining Relevancy of Co-occurring Hashtags (Vector Space Model). Hashtags: #indianelection2015, #modikisarkar. Co-occurring: Threshold. Latest K (200, 500). Entity Extraction and Scoring: Narendra Modi: 0.9; BJP: 0.7; NDA: 0.6; India: 0.4; Elections: 0.2; Rahul Gandhi: 0.2; Congress: 0.2. Event page Indian_General_Election,_2014: Extract, Periodically Update Hyperlink structure; One hop from Event Page; Entity scoring based on relevance to the Event: Indian General Elec: 1.0; India: 0.9; Elections: 0.7; UPA: 0.6; BJP: 0.3; NDA: 0.3; Narendra Modi: 0.3. δ
  • 39. Determining Relevancy of Co-occurring Hashtags (Vector Space Model). Hashtags: #indianelection2015, #modikisarkar. Co-occurring: Threshold. Latest K (200, 500). Entity Extraction and Scoring: Narendra Modi: 0.9; BJP: 0.7; NDA: 0.6; India: 0.4; Elections: 0.2; Rahul Gandhi: 0.2; Congress: 0.2. Event page Indian_General_Election,_2014: Extract, Periodically Update Hyperlink structure; One hop from Event Page; Entity scoring based on relevance to the Event: Indian General Elec: 1.0; India: 0.9; Elections: 0.7; UPA: 0.6; BJP: 0.3; NDA: 0.3; Narendra Modi: 0.3. Similarity Check: Relevance Score: 0.6. δ
  • 40. Similarity Check: o Set Based: Jaccard Similarity (considers the entities without the scores) o Vector Based: Symmetric: Cosine Similarity; Asymmetric: Subsumption Similarity
  • 41. Intuition behind Asymmetric Similarity. Diagram: India General Election 2014 and Narendra Modi, shown twice; labels: Symmetric, Asymmetric, Penalized, Ignored.
  • 42. Determining Relevancy of Co-occurring Hashtags (Vector Space Model). Hashtags: #indianelection2015, #modikisarkar. Co-occurring: Threshold. Latest K (200, 500). Entity Extraction and Scoring: Narendra Modi: 0.9; BJP: 0.7; NDA: 0.6; India: 0.4; Elections: 0.2; Rahul Gandhi: 0.2; Congress: 0.2. Event page Indian_General_Election,_2014: Extract, Periodically Update Hyperlink structure; One hop from Event Page; Entity scoring based on relevance to the Event: Indian General Elec: 1.0; India: 0.9; Elections: 0.7; UPA: 0.6; BJP: 0.3; NDA: 0.3; Narendra Modi: 0.3. Similarity Check: Relevance Score: 0.6. δ
  • 43. Evaluation – Dataset: o 2 events: US Presidential Elections (#election2012); Hurricane Sandy (#sandy) o Top 25 co-occurring hashtags
  • 44. Evaluation – Strategy: o Ranking Problem: rank the Top 25 hashtags based on the relevancy of tweets to the event; experiment with all the similarity metrics; manually annotated the tweets of these hashtags as relevant/irrelevant (Gold Standard) o Ranking Evaluation Metrics: Mean Average Precision; NDCG
  • 46. Evaluation: Evaluated tweets comprising top-relevant hashtags detected for dynamic topics • NDCG - 92% at top-5 Mean Average Precision
  • 47. Conclusions • Semantic Technologies for Real-time filtering of Social Data – Wikipedia as a Dynamic Knowledge base for events – Determining relevant hashtags using Asymmetric similarity measure – More hashtags in turn increase the coverage of Tweets for events • Hashtag Analysis – Co-occurrence technique can be used to detect event relevant hashtags – More popular hashtags are easier to detect via co-occurrence
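
The similarity check on slides 39 to 42 compares a hashtag's entity vector with the event's entity vector. A minimal Python sketch of the set-based and symmetric vector-based measures, using the example scores shown on the slides, might look like the following. The function names, the `DELTA` value, and the `weighted_coverage` measure are assumptions made for illustration. In particular, `weighted_coverage` is only a simple stand-in for an asymmetric measure; it is not the subsumption similarity named on slide 40, whose formula is not given in the slides.

```python
import math

# Entity scores copied from the example on slides 38-42.
hashtag_vec = {
    "Narendra Modi": 0.9, "BJP": 0.7, "NDA": 0.6, "India": 0.4,
    "Elections": 0.2, "Rahul Gandhi": 0.2, "Congress": 0.2,
}
event_vec = {
    "Indian General Elec": 1.0,  # label is truncated on the slide
    "India": 0.9, "Elections": 0.7, "UPA": 0.6,
    "BJP": 0.3, "NDA": 0.3, "Narendra Modi": 0.3,
}


def jaccard(a, b):
    """Set-based similarity: shared entities over all entities, scores ignored."""
    keys_a, keys_b = set(a), set(b)
    union = keys_a | keys_b
    return len(keys_a & keys_b) / len(union) if union else 0.0


def cosine(a, b):
    """Symmetric vector-based similarity over the entity scores."""
    dot = sum(weight * b[key] for key, weight in a.items() if key in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def weighted_coverage(hashtag, event):
    """Hypothetical asymmetric stand-in: the share of the hashtag's score mass
    that falls on entities of the event. Event entities that the hashtag never
    mentions do not lower the score. NOT the slides' subsumption similarity."""
    total = sum(hashtag.values())
    shared = sum(w for key, w in hashtag.items() if key in event)
    return shared / total if total else 0.0


DELTA = 0.5  # hypothetical threshold; the slides only name it as a delta


def is_relevant(score, delta=DELTA):
    """Keep a co-occurring hashtag as a filter when its score reaches delta."""
    return score >= delta


if __name__ == "__main__":
    for name, measure in [("jaccard", jaccard), ("cosine", cosine),
                          ("weighted_coverage", weighted_coverage)]:
        score = measure(hashtag_vec, event_vec)
        print(f"{name}: {score:.3f} relevant={is_relevant(score)}")
```

With these example vectors, five of the nine distinct entities are shared, so the set-based score is 5/9. Swapping the arguments of `weighted_coverage` gives a different value, which is the property that distinguishes an asymmetric measure from cosine or Jaccard.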

Editor's notes

  1. Streams are everywhere. I am sure by now manualle, Dr. Sheth would have convinced you about this.
  2. In this presentation, we will focus on the social data streams
  3. So the journalist needs to track the debate, and he opts for a keyphrase such as "healthcare reform". What would happen if we included semantics?
  4. These examples such as epidemics, natural disasters, political events, and civil unrest are dynamic events.
  5. They are continuously evolving, and they involve many other entities. For example, during the Indian elections, Modi, Rahul Gandhi, Congress, and BJP were related to the event. In many cases, this entity-event relevance changes over time. For example, in the case of Hurricane Sandy, NYC was a part of it for 2 days.
  6. A naïve approach to get tweets relevant to an event is to use keywords such as these as filters. Twitter’s streaming API allows up to 400 keywords (unpaid) as filters.
  7. However, we need to update these keywords as and when the topics evolve. Doing this manually is very tedious and not feasible.
  8. We focus on this problem where we need to automatically update the filters
  9. We focus on this problem where we need to automatically update the filters
  10. The number of tweets collected for the Colorado shooting (CS) is around 125,000, whereas for OWS we collected approximately 6M tweets. The total numbers of distinct hashtags found in these tweets are 12k and 191k, respectively. In order to retrieve all the tweets of the event, we need 7k hashtags for CS and 21k hashtags for OWS. In other words, if we had to automatically update all the filters, which are hashtags, we would need 7k hashtags to get all the tweets of the Colorado shooting and 21k hashtags for OWS. It is important to note that we are just crawling for tweets with hashtags for now.
  11. And the top 1% of these hashtags retrieves around 85% of the tweets. So, practically speaking, in order to retrieve the tweets, we need to find an approach to automatically reach these top event-related hashtags on the go.
  12. This is how the co-occurrence graph of
  13. From this analysis we get to know that: (1) a small percentage of the hashtags should be detected to retrieve most tweets of the event; (2) these tweets can be reached via co-occurrence; (3) if we start with a popular event-related hashtag, we can reach other popular ones quickly due to its clustering coefficient.
  14. If we use co-occurrence as the primary strategy, we will reach a lot of hashtags as filters.
  15. Now, coming back to the approach. The input is a hashtag that is relevant to the event; it is added manually, and we hope that it is a prominent hashtag. Using a threshold delta, we get the other relevant hashtags.
  16. Hence we start with a manually added event-relevant hashtag and periodically determine the dynamic relevance of the top co-occurring hashtags.
  17. The latest 500 tweets of each hashtag are considered, and entities from these tweets are scored based on their normalized frequency. Hence the hashtag is represented using a vector of entities. These entities build a semantic context for the hashtag.
  18. Next, we utilize the Wikipedia page of the event and extract all the entities relevant to the event. The relevant entities are the ones present in the Wikipedia page.
  19. There are Wikipedia pages for events.
  20. Arab Springs
  21. The links to other Wikipedia pages form a rich source of information. These can be the relevant semi-structured knowledge for the event. For example, UPA, NDA, BJP, RG and NaMo are the entities mentioned in the event Wikipedia page and are relevant to the event.
  22. A graph structure can be created with all the entities on the topic's Wikipedia page. This is a subgraph of Indian General Election 2014, with links between entities of the event. Narendra Modi and Indian National Congress
  23. It is also important that the knowledge base be dynamically updated based on the changes in the event.
  24. Also, it's dynamically updated
  25. Since Wikipedia is crowd-sourced, its pages are updated in near real time. We are expecting the knowledge to be
  26. We score these entities based on their relevance to the event: compare the vector of entities of the event with that of the hashtag.
  27. We use three different types of semantic similarity measures: (1) Set based, (2) Symmetric Vector Based, (3) Asymmetric Similarity
  28. Penalize to check how big a subset the hashtag is compared to the event. #modisarkar cannot cover the whole of the Indian elections, whereas the reverse is possible. Better way to explain this. What we want to penalize
  29. To evaluate, we picked two other events. US Presidential Elections and Hurricane Sandy. We picked #election2012 and #sandy as the starting point for the crawl. We retrieved 5000 tweets in real-time and extracted the top 25 co-occurring hashtags. For each of these hashtags we pick the latest 200 tweets for analysis.
  30. We rank the top-25 hashtags based on their relevance to the event. Hence, we transformed it into a ranking problem using all the similarity measures. For evaluation, all the tweets of the 25 hashtags were manually evaluated as relevant/irrelevant to the event.
  31. The subsumption similarity works the best, whereas using just the co-occurrence of hashtags as a ranking mechanism performs considerably worse than using Wikipedia as a knowledge base.
  32. Interesting to hear the findings. More about the results.
  33. This presentation emphasizes the value of semantics for social data filtering, specifically for the challenges faced during dynamically evolving event analysis.
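
The ranking evaluation described in notes 30 and 31 relies on Mean Average Precision and NDCG. As a rough illustration of how those two metrics are computed, here is a minimal Python sketch that assumes each ranked item carries a binary relevant/irrelevant label. The function names and the toy labels are made up for this example and are not data from the study; the study's gold standard is annotated at the tweet level, so this is a simplification.

```python
import math


def average_precision(relevance):
    """Average of precision@rank taken at each rank that holds a relevant item."""
    hits = 0
    precision_sum = 0.0
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def mean_average_precision(rankings):
    """Mean of the average precision over several rankings (e.g. one per event)."""
    if not rankings:
        return 0.0
    return sum(average_precision(r) for r in rankings) / len(rankings)


def dcg_at_k(relevance, k):
    """Discounted cumulative gain: gains are discounted by log2(rank + 1)."""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevance[:k], start=1))


def ndcg_at_k(relevance, k):
    """DCG of the ranking divided by the DCG of the ideal (sorted) ranking."""
    ideal = dcg_at_k(sorted(relevance, reverse=True), k)
    return dcg_at_k(relevance, k) / ideal if ideal else 0.0


if __name__ == "__main__":
    # Made-up labels: 1 = relevant, 0 = irrelevant, in ranked order.
    toy_ranking = [1, 0, 1]
    print("AP:", round(average_precision(toy_ranking), 4))
    print("NDCG@3:", round(ndcg_at_k(toy_ranking, 3), 4))
```

A ranking that places every relevant item ahead of every irrelevant one scores 1.0 on both metrics, so values below 1.0 indicate how far a similarity measure's ordering is from the manually annotated ideal.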