This document summarizes a project that uses machine learning and natural language processing to develop Ushine, a plug-in that improves the human review process of crisis reports on the Ushahidi platform. The plug-in detects the report language, identifies private information, locations, and URLs, suggests categories for reports, and detects (near-)duplicate reports. It is intended to guide, not replace, human review. The evaluation of the plug-in and future work are discussed, and contact information is provided for collaborating on the open-source project.
Data Science for Social Good and Ushahidi - Final Presentation
1. Ushine Plug-In
Using machine learning and natural language processing
to improve the human review process of crisis reports
2. Topics
● Intro to project
● Project contents
● Data sets
● Evaluation
● Data ethics
● Future work
3. How to Follow Up...
● GitHub repository (open-source project code + wiki documentation):
http://github.com/dssg/ushine-learning
Collaborators welcome! (Both within and outside of Ushahidi.)
● DSSG team e-mail: dssg-ushahidi@googlegroups.com
● Main Ushahidi contacts: Emmanuel Kala + Heather Leson
● Data Science for Social Good fellowship: http://dssg.io
4. Thanks!
Thanks to our partners at Ushahidi and the many
individuals and organizations who generously gave us
their advice and feedback...
Alphabetically:
Chris Albon, Rob Baker, George Chamales, Jennifer Chan,
Crisis Mappers, Schuyler Erle, Sara-Jayne Farmer, Rayid
Ghani, Eric Goodwin, Catherine Graham, Neil Horning,
Humanity Road, Anahi Ayala Iacucci, Rob Mitchum,
Emmanuel Kala, David Kobia, Heather Leson, Rob Munro,
Chris Thompson, Syria Tracker, Juan-Pablo Velez.
5. Project Contents [August 20]
1) Detect language of report text
2) Identify private information in report text
3) Identify locations in report text
4) Identify URLs in report text
5) Suggest categories of report
6) Detect (near-)duplicate reports
9. Scope
● Ushine DOES:
○ Improve the human review process of reports
● Ushine DOESN’T:
○ Verify reports
○ “Really” understand the report
○ Achieve 100% accuracy in anything
10. 1) Detect report language
Useful for:
● In multi-lingual situations, automatically route reports to
speakers of that language
● Flag reports that need / don’t need translations
○ (if deployment specifies a certain set of acceptable
languages)
Caveats:
● Not 100% accurate
● Performs less well on “imperfect” writing
○ e.g. SMS-speak, mixed languages
11. 1) Detect report language
Technical details:
● Tested 4 plug-in language detectors on 850
reports, for agreement with human language
identification:
12. 2) Identify Private Info
Identify people’s names, organizations’ names, locations, e-mail
addresses, URLs, phone/ID numbers, Twitter usernames
Useful for:
● Flagging private info in report that reviewer might want to remove, to
protect sensitive people/situations
● As an extra check before exporting reports to others.
Technical details:
● Use NLTK’s pre-trained Named Entity Recognizer (NER) to identify
people’s names, organizations’ names, and locations.
● Use regular expressions to identify e-mail addresses, URLs, phone/ID
numbers, and Twitter usernames.
● Better to be overly careful: false negatives are more dangerous than false
positives
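The regular-expression step above could be sketched roughly as follows. The patterns and the `find_private_info` helper are illustrative assumptions, not the expressions used in the Ushine code. They are deliberately loose, in the spirit of preferring false positives over false negatives.

```python
import re

# Hypothetical, deliberately permissive patterns (assumptions for illustration;
# the plug-in's actual expressions are not shown in these slides).
PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "url": re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE),
    # The lookbehind avoids matching the "@" inside an e-mail address.
    "twitter": re.compile(r"(?<![\w@])@\w{1,15}"),
    # A run of at least 7 digits, optionally separated by spaces or punctuation.
    "phone_or_id": re.compile(r"\+?\d[\d\s().-]{5,}\d"),
}


def find_private_info(text):
    """Return (kind, matched_text, start, end) tuples sorted by position."""
    hits = []
    for kind, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((kind, match.group(), match.start(), match.end()))
    return sorted(hits, key=lambda hit: hit[2])


if __name__ == "__main__":
    sample = ("Call +254 700 123456 or mail jane.doe@example.org, "
              "see http://example.org/photo.jpg via @ushahidi")
    for hit in find_private_info(sample):
        print(hit)
```

Each hit carries character offsets so that a reviewer interface could highlight the flagged span for editing rather than rewording the sentence automatically.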
13. 2) Identify Private Info
Caveats:
● Not 100% accurate.
○ Use to support, not replace, humans. (Though humans are not 100%
accurate by themselves either!)
○ Always be aware of the responsibility to protect sensitive information.
○ Non-sensitive deployments (non-wars/disasters) may still have
sensitive information.
○ (More on data ethics @ end)
● Definition of “private” can be very subjective and nuanced.
● Does not re-word sentence; only identifies problematic words for editing.
● Currently only useful for English text (though extendable to other
languages given a suitable NER)
14. 3) Identify Locations
Useful for:
● Identifying text within report that may refer to a location
Caveats:
● Imperfect accuracy, especially on imperfect English
● Currently only useful for English text (though extendable to other
languages given a suitable NER)
● Does not geo-locate location for mapping, just makes it easier to figure out
what text to then geo-locate.
Technical details:
● Use NLTK’s pre-trained Named Entity Recognizer (NER)
15. 4) Identify URLs (links)
Useful for:
● Identifying text within report that refers to a URL (photo/video/article/etc.)
Technical details:
● Use regular expressions
16. A Detour on Data Sets
● So far none of the tasks have required
“training data” on past Ushahidi deployments
○ (NLTK’s named entity recognizer uses its own
training data, not from Ushahidi)
● Next task, category rankings, DOES require
Ushahidi training data
● Data cleanliness: Often lacking
○ We wrote scripts to automate cleaning
○ Useful for other Ushahidi work too!
17. Data Sets - Examples
● Additional datasets were unusable for various reasons
(e.g. overly formulaic language)
● Many additional CrowdMap datasets exist (not used by
Ushine because of time constraints)
● Sensitive data was removed before being shared with us
19. 5) Category Suggestions
For each category (e.g. “Bribery” or “Violence”),
give 0-100% rating of how likely the report is to belong
Useful for:
● Increasing speed and accuracy of the category assignment process
Caveats:
● Not 100% accurate
● “Cold start” problem
20. 5) Category Suggestions
● Global classifier:
○ Classifier trained on previous deployments (e.g.
previous Indian and Venezuelan election reports) then
used for a new deployment (e.g. new Kenyan
election)
● Local classifier:
○ Train a classifier on-the-fly on reports annotated in a
new deployment. Cold-start problem.
● Adaptive classifier:
○ Retrain global classifier on the current deployment
21. 5) Category Suggestions
● Learning Curve Plot from Mexico election
(Higher F1 score means better performance)
22. 5) Category Suggestions
Technical details:
● Binary classifier for each category.
● Local classifier: Bag-of-words unigram
frequency features (with frequency cut-off = 5)
○ In general, bigrams & TF-IDF normalization did not
help.
● Global classifier for election deployment
○ Trained using 7 election deployments
○ For each category label, cross-deployment validation
was used to select feature sets (unigram, tfidf,
bigram, and C parameter).
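The local classifier's bag-of-words features could be built along these lines. This is a minimal sketch that assumes the "frequency cut-off = 5" means dropping unigrams seen fewer than 5 times across the training reports. The tokenizer and the function names are hypothetical.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split into word tokens (a simplistic, assumed tokenizer)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def build_vocabulary(reports, min_count=5):
    """Keep unigrams whose total count across all reports is >= min_count."""
    totals = Counter()
    for report in reports:
        totals.update(tokenize(report))
    kept = sorted(token for token, n in totals.items() if n >= min_count)
    return {token: index for index, token in enumerate(kept)}


def featurize(report, vocabulary):
    """Map a report to a fixed-length vector of unigram counts."""
    vector = [0] * len(vocabulary)
    for token, n in Counter(tokenize(report)).items():
        if token in vocabulary:
            vector[vocabulary[token]] = n
    return vector


if __name__ == "__main__":
    reports = ["bribe at the polling station"] * 5 + ["violence near station"]
    vocab = build_vocabulary(reports)
    print(vocab)
    print(featurize("station station violence", vocab))
```

The resulting vectors would then be fed to one binary classifier per category, so that each category receives its own score for a given report.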
23. 5) Category Suggestions
Technical details:
● Adaptive Classifier
○ Interpolates between local classifier f and global
classifier g using
(1-α)*g(x) + α*f(x),
where x is a report.
○ α is tuned on-the-fly to maximize F1 score, based on
grid search.
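The interpolation and the grid search over α could look roughly like this. The sketch assumes that both classifiers output scores in [0, 1], that a report is assigned the category when the combined score reaches 0.5, and that α is searched over an evenly spaced grid on labelled reports from the current deployment. None of these details are given in the slides.

```python
def f1_score(y_true, y_pred):
    """F1 for binary labels; returns 0.0 when there are no true positives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def interpolate(alpha, global_score, local_score):
    """(1 - alpha) * g(x) + alpha * f(x), as on the slide."""
    return (1 - alpha) * global_score + alpha * local_score


def tune_alpha(global_scores, local_scores, labels, threshold=0.5, steps=11):
    """Grid-search alpha in [0, 1]; return (best_alpha, best_f1)."""
    best_alpha, best_f1 = 0.0, -1.0
    for i in range(steps):
        alpha = i / (steps - 1)
        preds = [interpolate(alpha, g, f) >= threshold
                 for g, f in zip(global_scores, local_scores)]
        score = f1_score(labels, preds)
        if score > best_f1:  # strict ">" keeps the smallest alpha on ties
            best_alpha, best_f1 = alpha, score
    return best_alpha, best_f1


if __name__ == "__main__":
    g = [0.9, 0.9, 0.1, 0.1]            # global classifier scores g(x)
    f = [0.1, 0.9, 0.9, 0.1]            # local classifier scores f(x)
    y = [False, True, True, False]      # reviewer-assigned labels
    print(tune_alpha(g, f, y))
```

With α = 0 the adaptive classifier reduces to the global classifier, and with α = 1 to the local one, so α can grow as more reports are annotated in the new deployment.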
24. 6) Detect (near-) duplicates
Has the report already been submitted, or retweeted?
Useful for:
● Identifying (near-)duplicate reports to prevent
copies and redundant work
Caveats:
● Not 100% accurate
● Not looking at “similar/related content”, but rather (near-)duplicates
Technical details:
● SimHash efficiently hashes each report text to a 64-bit representation.
● (Near-)duplicates have short Hamming distances between their hashes
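A minimal SimHash sketch is shown below. It assumes equally weighted word tokens as features, MD5 as the per-token hash, and a Hamming-distance threshold of 3 bits. These choices are illustrative and may differ from the Ushine implementation.

```python
import hashlib
import re

BITS = 64  # fingerprint width, matching the 64-bit representation on the slide


def simhash(text):
    """Compute a 64-bit SimHash fingerprint from the text's word tokens."""
    weights = [0] * BITS
    for token in re.findall(r"\w+", text.lower()):
        digest = hashlib.md5(token.encode("utf-8")).digest()
        token_hash = int.from_bytes(digest[:8], "big")  # first 64 bits of MD5
        for i in range(BITS):
            # Each token votes +1 or -1 on every bit position.
            weights[i] += 1 if (token_hash >> i) & 1 else -1
    fingerprint = 0
    for i, weight in enumerate(weights):
        if weight > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def is_near_duplicate(text_a, text_b, max_distance=3):
    """Assumed rule: texts are near-duplicates if the hashes differ in few bits."""
    return hamming_distance(simhash(text_a), simhash(text_b)) <= max_distance


if __name__ == "__main__":
    print(is_near_duplicate("Fire at the market!", "fire at the market"))
```

Because similar texts share most of their tokens, most bit votes agree and the fingerprints differ in only a few positions, while unrelated texts tend to differ in roughly half of the 64 bits.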
25. Evaluation
Currently analyzing the results of an evaluation experiment
that simulates an election crisis.
Assess the impact on users’ speed and accuracy of
● identifying private info, location, URLs
● choosing categories
3 comparison groups:
1) “Regular” process w/o computer suggestions
2) Our computer’s suggestions
3) “Perfect” suggestions
27. Ushahidi Plugin integration
● Configurable URL for the Ushine web
service
● Extract location names and other entities
from report text. These are displayed as
report metadata
● Detect and display the report language
● Suggest reports that are similar to the
current one
28. Data Ethics
This isn’t today’s focus, but very important as part of an on-going
Ushahidi discussion:
1) Private information tool especially should be used wisely -- not 100%
accurate and does not replace, but rather supports, thoughtful human decision-
making.
2) To improve category classification, need access to training data.
How to store data? Who has access?
Carelessness about sensitive data
can have real and bad consequences!
Non-sensitive deployments (non-wars/disasters)
may still have sensitive information.
29. Automated vs. Suggestions
● In theory, everything could be automated
○ Ex: Automatically select top-ranked categories
instead of giving humans the rankings
● Ushahidi reports need high quality data, so
we recommend using our package’s output
as suggestions to guide human decisions
● Especially important for sensitive tasks like
private information detection!
31. How to Follow Up...
● GitHub repository (project code + wiki documentation):
http://github.com/dssg/ushine-learning
Collaborators welcome! (Both within and outside of Ushahidi.)
● DSSG team e-mail: dssg-ushahidi@googlegroups.com
● Main Ushahidi contacts: Emmanuel Kala + Heather Leson
● Data Science for Social Good fellowship: http://dssg.io