Project Update - July 11, 2013
The Eric & Wendy Schmidt
Data Science
for Social Good
Summer Fellowship 2013
www.dssg.io | dssg-ushahidi@googlegroups.com
Ushahidi Workflow
Ushahidi Workflow + DSSG
Data Sets
23,000 reports from 20 datasets
• 22% English
• 35% non-English
• 43% mixed languages
Each report includes text, category, location,
sometimes more data
Data Sets
• Additional datasets are unusable for various reasons (e.g. overly formulaic language)
• What is the quality of the existing "gold standard" annotation?
• Working on translations of
Data Set Differences
• Afghanistan election (peaceful)
• Kenyan election (less peaceful)
Current Task Status [July 11]
1) Suggest categories
2) Extract named entities (especially locations)
3) Detect language
End of presentation has more extensive technical details
Toy Demo
http://ec2-54-218-196-140.us-west-2.compute.amazonaws.com/home
Note this is ONLY a basic "toy" user interface to demonstrate the current prototype functionality.
Our plan is to deliver an open-source code library,
which Ushahidi will incorporate into the existing user interface.
If link doesn't work -- just look at the screenshots in the next slides. :)
Demo: Example #1
Demo: Example #2
Secondary Project Ideas
1. Detect private info to strip
2. Urgency assessment
3. Filtering irrelevant reports (not strictly spam)
4. Automatically proposing new [sub-]categories
5. Cluster similar (non-identical) reports
6. Hierarchical topic modelling / visualization
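
As a rough illustration of idea 1, phone numbers are one kind of private info that a pattern can catch; names and addresses are harder and are not attempted here. The pattern, the function name `strip_phone_numbers`, and the `[PHONE]` placeholder are assumptions made for this sketch, not part of the project's code.

```python
import re

# Assumed pattern: an optional "+", then at least 8 digits in total,
# possibly separated by spaces, hyphens or parentheses.
PHONE_PATTERN = re.compile(r"\+?\d[\d\s\-()]{6,}\d")


def strip_phone_numbers(text, placeholder="[PHONE]"):
    """Replace anything that looks like a phone number with a placeholder."""
    return PHONE_PATTERN.sub(placeholder, text)


print(strip_phone_numbers("Call me on +254 712 345678 now"))
```

Short numbers such as counts or years are left alone because the pattern requires at least eight digit-like characters in a row.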
Evaluation Plans
• Tap into Ushahidi and crisis mapping communities for feedback
• Simulate a past event with our system
• Success metrics:
o Increased annotator speed
o Increased annotator categorization accuracy
o Decreased annotator frustration/tedium
Feedback welcome!
Contact us at dssg-ushahidi@googlegroups.com
We would love your input!
See next 4 slides for technical details on our 4 tasks...
or skip if you're happy to stay unaware... :)
1) Suggest categories
Currently:
• Simple bag-of-words unigram features
• 1-vs.-all classification (scikit-learn)
• Little categories → fewer big categories
• Performance uninspiring :(
Future:
• Bigrams, word frequency filter, ...
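
To make the "bag-of-words unigram features" and "1-vs.-all" bullets concrete, here is a minimal dependency-free sketch. The prototype uses scikit-learn; this stand-in trains one small perceptron per category instead, and the toy reports, category names, and function names are all invented for illustration.

```python
import re
from collections import Counter, defaultdict


def unigrams(text):
    """Bag-of-words features: lowercase word -> count."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def train_one_vs_all(reports, epochs=10):
    """Train one binary perceptron per category (1-vs.-all).

    reports: list of (text, set_of_categories) pairs.
    Returns {category: (weights, bias)}.
    """
    categories = sorted({c for _, cats in reports for c in cats})
    models = {}
    for cat in categories:
        weights, bias = defaultdict(float), 0.0
        for _ in range(epochs):
            for text, cats in reports:
                feats = unigrams(text)
                score = bias + sum(weights[w] * n for w, n in feats.items())
                # error is +1 (missed positive), -1 (false positive) or 0
                error = (1 if cat in cats else 0) - (1 if score > 0 else 0)
                if error:
                    for w, n in feats.items():
                        weights[w] += error * n
                    bias += error
        models[cat] = (dict(weights), bias)
    return models


def suggest_categories(models, text):
    """Return every category whose binary model scores the text above zero."""
    feats = unigrams(text)
    suggested = []
    for cat, (weights, bias) in sorted(models.items()):
        score = bias + sum(weights.get(w, 0.0) * n for w, n in feats.items())
        if score > 0:
            suggested.append(cat)
    return suggested


# Hypothetical toy data, not taken from the Ushahidi datasets.
TOY_REPORTS = [
    ("shooting reported near polling station", {"violence"}),
    ("ballot boxes missing at polling station", {"fraud"}),
    ("armed men shooting in market", {"violence"}),
    ("ballot stuffing reported by observers", {"fraud"}),
]

models = train_one_vs_all(TOY_REPORTS)
print(suggest_categories(models, "shooting in town"))
```

Because each category gets its own yes/no model, a report can receive several suggested categories or none, which suits a tool that proposes labels to a human annotator.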
2) Extract named entities
Currently:
• NLTK's Named Entity Recognizer
• Eval: pretty good
Future:
• Train location-recognizer on datasets
• Merge types for non-location NEs
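
The "merge types" idea can be sketched as a small post-processing step over (text, label) pairs such as those an entity recognizer produces. The set of labels treated as locations below is an assumption (based on the label names used by NLTK's default chunker), and the function names are hypothetical.

```python
# Assumption: these recognizer labels are treated as "place-like".
LOCATION_LABELS = {"GPE", "LOCATION", "FACILITY"}


def merge_entity_types(entities):
    """Collapse (text, label) pairs into just LOCATION vs. OTHER."""
    return [
        (text, "LOCATION" if label in LOCATION_LABELS else "OTHER")
        for text, label in entities
    ]


def locations(entities):
    """Keep only the entity strings that were merged into LOCATION."""
    return [text for text, label in merge_entity_types(entities) if label == "LOCATION"]


# Hypothetical recognizer output for a single report.
example = [
    ("Nairobi", "GPE"),
    ("Red Cross", "ORGANIZATION"),
    ("Kibera", "LOCATION"),
    ("John", "PERSON"),
]
print(locations(example))
```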
3) Detect Language
Currently:
• Existing packages (Bing, Python, ...)
Future:
• Evaluate quality
• Allow event-specific language bias
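
One possible reading of "event-specific language bias" is to reweight a detector's per-language scores by how likely each language is for the event at hand. The sketch below assumes the detector returns a score per language code; the function name, the `floor` value for languages missing from the prior, and all the numbers are illustrative assumptions.

```python
def apply_language_bias(detector_scores, event_prior, floor=0.01):
    """Reweight detector scores by an event-specific prior and renormalise.

    detector_scores: {language_code: score from a language detector}
    event_prior:     {language_code: expected share of reports for this event}
    floor:           assumed small prior for languages not listed for the event
    """
    weighted = {
        lang: score * event_prior.get(lang, floor)
        for lang, score in detector_scores.items()
    }
    total = sum(weighted.values())
    if total == 0:
        # Nothing to reweight; fall back to the raw detector scores.
        return dict(detector_scores)
    return {lang: w / total for lang, w in weighted.items()}


# Hypothetical scores for a short, ambiguous SMS in a Kenyan deployment.
raw = {"en": 0.40, "sw": 0.35, "so": 0.25}
prior = {"sw": 0.60, "en": 0.35}
biased = apply_language_bias(raw, prior)
print(max(raw, key=raw.get), max(biased, key=biased.get))
```

With these made-up numbers the raw detector prefers English, while the biased scores prefer Swahili, which is the kind of correction the bias is meant to allow on short or mixed-language messages.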
4) Near-Duplicate Detection
Currently:
• SimHash efficiently compares distances between message text hashes
Future:
• Evaluate quality more rigorously
• Explore other methods
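
For readers unfamiliar with SimHash, here is a minimal sketch of the general technique: each word votes on every bit of a fingerprint, and two messages are compared by the Hamming distance between fingerprints. The choice of MD5 as the per-word hash, the 64-bit width, and the `max_distance=3` threshold are assumptions for illustration, not the prototype's settings.

```python
import hashlib
import re


def simhash(text, bits=64):
    """Compute a SimHash fingerprint over lowercase word tokens."""
    counts = [0] * bits
    for token in re.findall(r"\w+", text.lower()):
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[: bits // 8], "big")
        for i in range(bits):
            # Each token votes +1 or -1 on every bit position.
            counts[i] += 1 if (h >> i) & 1 else -1
    # A fingerprint bit is set where the positive votes win.
    return sum(1 << i for i in range(bits) if counts[i] > 0)


def hamming_distance(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def is_near_duplicate(text_a, text_b, max_distance=3):
    """Treat two messages as near-duplicates if their fingerprints are close."""
    return hamming_distance(simhash(text_a), simhash(text_b)) <= max_distance


print(is_near_duplicate("Fire at the market", "FIRE at the  market!"))
```

Comparing fixed-size fingerprints is cheap, which is where the efficiency comes from: each message is hashed once, and every comparison is a single XOR and bit count.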


Editor's notes

  1. We're happy to give an update on our Ushahidi project. [Abe Gong]
  2. Citizens submit reports (via SMS, Twitter, and the web), which are reviewed by annotators. It's a slow manual process: annotators have to categorize, geolocate, strip private info, etc.
  3. We're building a data wizardry system to support the manual annotation process
  4. Notes on the Secondary Project Ideas:
     1. Since Ushahidi reports are mostly public, private info should be hidden. Examples: names, phone numbers, and addresses.
     4. Example: in the Haiti earthquake, we might observe unexpected robbery reports arising.
     5. This is mainly for a better workflow, because annotators can work better when they process similar reports together.
     6. To see which topics commonly occur in elections in general, and which topics occur only in the Kenyan election specifically.