Towards a Data-driven Approach to Identify
Crisis-Related Topics in Social Media Streams
Muhammad Imran (@mimran15) and Carlos Castillo (@ChaToX)
Qatar Computing Research Institute
Doha, Qatar.
SWDM’15 : WWW’15 May 18th 2015
Information Variability on Social Media
• Different events present different information
categories
• Even for recurring events, category proportions
change
Different Classification Approaches
• Various classification approaches exist:
– Manual classification by human experts
– Automatic classification using unsupervised or
supervised approaches (the latter need training data)
– Hybrid: Automatic + Manual
• Retrospective vs. real-time classification
– Batch processing (offline, training data availability)
– Stream processing (real-time, scarce training data)
Real-time Stream Classification
(Supervised)
• Fewer categories are better
– Decreases worker dropout
– More training data for each category, hence higher accuracy
– “7 plus/minus 2” rule [G. A. Miller, 1956]
• Categories need to be defined carefully
– Empty categories waste space and workers’ effort
– Categories that are too large introduce heterogeneity
Problem Statement
• How can we classify items arriving as a data
stream into a small number of categories, if
we cannot anticipate exactly which will be the
most frequent categories?
Our research improves crowdsourcing-based and
supervised learning-based systems (e.g. AIDR) by
finding latent categories in fast data streams.
Our Approach (top-down + bottom-up)
1. An expert defines information categories (top-down)
2. Messages are categorized into the initial set plus an
extra “Miscellaneous” category
3. Identify relevant and prevalent categories from the
messages in the “Miscellaneous” category (bottom-up)
   1. Generate candidate categories
   2. Learn characteristics of good categories
   3. Rank categories on good characteristics
How do we identify relevant categories?
Candidate Generation
We propose to apply Latent Dirichlet Allocation
(LDA) on the Miscellaneous category:
• Input: A set of n documents (all messages in
the Misc. category) and a number m (# of
topics to be generated)
• Output: n x m matrix in which cell(i, j) indicates
the extent to which document i corresponds to
topic j.
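A minimal sketch of how such an n x m matrix could be produced, assuming a collapsed Gibbs sampler written with the Python standard library only. The slides do not say which LDA implementation or hyperparameters were used; the function name `lda_gibbs`, the values of `alpha`, `beta` and `iters`, and the toy messages are illustrative assumptions.

```python
import random
from collections import Counter


def lda_gibbs(docs, m, iters=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA (illustrative, not the authors' code).

    docs: list of tokenized documents (lists of words).
    Returns an n x m matrix where cell (i, j) estimates how strongly
    document i corresponds to topic j; each row sums to 1.
    """
    rng = random.Random(seed)
    vocab_size = len({w for d in docs for w in d})
    # z[i][k] is the topic currently assigned to the k-th token of document i.
    z = [[rng.randrange(m) for _ in d] for d in docs]
    doc_topic = [[0] * m for _ in docs]          # tokens of doc i in topic j
    topic_word = [Counter() for _ in range(m)]   # count of word w in topic j
    topic_total = [0] * m                        # tokens assigned to topic j
    for i, d in enumerate(docs):
        for k, w in enumerate(d):
            j = z[i][k]
            doc_topic[i][j] += 1
            topic_word[j][w] += 1
            topic_total[j] += 1

    for _ in range(iters):
        for i, d in enumerate(docs):
            for k, w in enumerate(d):
                # Remove the token's current assignment from the counts.
                j = z[i][k]
                doc_topic[i][j] -= 1
                topic_word[j][w] -= 1
                topic_total[j] -= 1
                # Resample its topic from the conditional distribution.
                weights = [
                    (doc_topic[i][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(m)
                ]
                j = rng.choices(range(m), weights=weights)[0]
                z[i][k] = j
                doc_topic[i][j] += 1
                topic_word[j][w] += 1
                topic_total[j] += 1

    # Smoothed, normalized document-topic proportions.
    return [
        [(doc_topic[i][j] + alpha) / (len(d) + m * alpha) for j in range(m)]
        for i, d in enumerate(docs)
    ]


# Hypothetical "Miscellaneous" messages, already tokenized.
docs = [
    "animals rescued from flooded shelter".split(),
    "shelter needs help with rescued animals".split(),
    "flights cancelled at the airport".split(),
    "airport closed all flights delayed".split(),
]
theta = lda_gibbs(docs, m=2)
for row in theta:
    print([round(p, 2) for p in row])
```

With real data, a maintained library implementation of LDA would normally be preferable; the sampler above is only meant to make the input and output shapes concrete.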
Candidate Evaluation
To reduce the workload of the experts who decide
which categories to adopt, we propose the
following criteria:
• Volume: a category shouldn’t be too small
• Novelty: a category must not overlap or be
too similar to the existing categories
• Cohesiveness (intra- and inter-similarity): a
category should be cohesive (should have
small intra-topic and large inter-topic values)
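The three criteria could be computed along these lines. This is a sketch under stated assumptions, not the measures used in the study: messages are represented as bag-of-words counts, the intra- and inter-topic "values" are read as cosine distances (so small intra and large inter means cohesive), and all function names are hypothetical.

```python
import math
from collections import Counter
from statistics import mean


def cosine_distance(a, b):
    """1 - cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def centroid(vectors):
    """Summed word counts; cosine distance ignores the overall scale."""
    total = Counter()
    for v in vectors:
        total.update(v)
    return total


def volume(topic_docs, n_messages):
    """Share of all Miscellaneous messages that fall into this topic."""
    return len(topic_docs) / n_messages


def novelty(topic_docs, existing_categories):
    """Distance to the closest existing category (higher = more novel)."""
    c = centroid(topic_docs)
    return min(cosine_distance(c, centroid(docs))
               for docs in existing_categories.values())


def intra_distance(topic_docs):
    """Mean distance of a topic's messages to its own centroid (assumes non-empty)."""
    c = centroid(topic_docs)
    return mean(cosine_distance(d, c) for d in topic_docs)


def inter_distance(topic_docs, other_topics):
    """Mean distance from this topic's centroid to the other topics' centroids."""
    c = centroid(topic_docs)
    return mean(cosine_distance(c, centroid(o)) for o in other_topics)


def bag(text):
    return Counter(text.lower().split())


# Hypothetical toy data.
existing = {"Caution and advice": [bag("storm warning issued"), bag("warning lifted")]}
topic = [bag("animals rescued shelter"), bag("animals rescued shelter")]
other_topics = [[bag("flights cancelled airport")]]

print(volume(topic, 10), novelty(topic, existing),
      intra_distance(topic), inter_distance(topic, other_topics))
```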
Experimental Testing
• We used Twitter data from 17 crises (from the
CrisisLexT26 dataset at crisislex.org)
A. Affected individuals: deaths, injuries, missing, found.
B. Infrastructure and utilities: buildings, roads, services damage.
C. Donation and volunteering: needs, requests for food, shelter, supplies.
D. Caution and advice: warnings issued or lifted, guidance and tips.
E. Sympathy and emotional support: thoughts, prayers, gratitude, etc.
Z. Other useful information not covered by any of the above categories.
Candidate Generation Setup
• Applied LDA on the messages in the “Z”
category of each crisis
• 5 topics were generated for each crisis
• Considered messages with LDA score > 0.06 in
each topic
• Presented the LDA-generated topics to experts
in random order
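The selection and presentation steps of this setup can be sketched as follows. The 0.06 threshold and the random ordering come from the slide; the function names, the `seed` parameter and the toy score matrix are assumptions made for illustration.

```python
import random


def messages_per_topic(theta, messages, threshold=0.06):
    """Group messages by topic, keeping those whose LDA score exceeds the threshold.

    theta: n x m document-topic matrix; messages: the n source messages.
    A message may appear under several topics.
    """
    n_topics = len(theta[0]) if theta else 0
    topics = {j: [] for j in range(n_topics)}
    for message, row in zip(messages, theta):
        for j, score in enumerate(row):
            if score > threshold:
                topics[j].append(message)
    return topics


def presentation_order(topic_ids, seed=None):
    """Shuffle topic ids so experts see the topics in random order."""
    order = list(topic_ids)
    random.Random(seed).shuffle(order)
    return order


# Hypothetical scores for two messages and three topics.
theta = [[0.90, 0.05, 0.05],
         [0.06, 0.50, 0.44]]
topics = messages_per_topic(theta, ["msg a", "msg b"])
print(topics)
print(presentation_order(topics, seed=1))
```

Because the comparison is strict, a message scoring exactly 0.06 for a topic is not included in it.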
Candidate Annotation Setup
Recruited two experts from two international humanitarian
organizations in the crisis response domain
Results
• Topics with avg. score <= 2.5 are considered bad topics
• Topics with avg. score >= 3.5 are considered good topics
• Hit: the metric value of good topics is greater than that of bad topics
A crisis is excluded from the evaluation if all of its topics receive an average score below 3.0, or all receive an average score above 3.0.
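The hit rule can be written down as a small function. This sketch assumes that the "metric value" of a group of topics is the mean over that group and that a crisis with no good or no bad topics cannot be scored; neither detail is stated on the slide, and the name `is_hit` and the toy numbers are hypothetical.

```python
from statistics import mean


def is_hit(topics, metric):
    """Return True/False for a hit, or None if the crisis is not evaluated.

    topics: list of dicts holding the average expert 'score' and metric values.
    """
    scores = [t["score"] for t in topics]
    # Exclude crises whose topics all fall on the same side of 3.0.
    if all(s < 3.0 for s in scores) or all(s > 3.0 for s in scores):
        return None
    good = [t[metric] for t in topics if t["score"] >= 3.5]
    bad = [t[metric] for t in topics if t["score"] <= 2.5]
    if not good or not bad:
        return None  # assumption: nothing to compare
    return mean(good) > mean(bad)


# Hypothetical crisis with one good and one bad topic.
crisis = [
    {"score": 4.0, "novelty": 0.8},
    {"score": 2.0, "novelty": 0.3},
]
print(is_hit(crisis, "novelty"))
```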
Conclusion
• Novelty, intra-similarity and cohesiveness are
useful in identifying good topics
• Our approach combines top-down (manual)
and bottom-up (automatic) elements.
• Learned important characteristics of good
topics
• Future work includes candidate ranking, with
recommendations for adding, merging, or
dropping new, unseen categories
Data used in this study can be requested:
Contact: Muhammad Imran at
mimran@qf.org.qa OR @mimran15
Thank you!
Authors contact:
Muhammad Imran @mimran15
Carlos Castillo @ChaToX
