Talk given at Delft University speaker series on "Crowd Computing & Human-Centered AI" (https://www.academicfringe.org/). November 23, 2020. Covers two 2020 works:
(1) Anubrata Das, Brandon Dang, and Matthew Lease. Fast, Accurate, and Healthier: Interactive Blurring Helps Moderators Reduce Exposure to Harmful Content. In Proceedings of the 8th AAAI Conference on Human Computation and Crowdsourcing (HCOMP), 2020.
(2) Alexander Braylan and Matthew Lease. Modeling and Aggregation of Complex Annotations via Annotation Distances. In Proceedings of the Web Conference, pages 1807--1818, 2020.
1. Matt Lease
School of Information
The University of Texas at Austin
Adventures in Crowdsourcing:
Toward Safer Content Moderation & Better
Supporting Complex Annotation Tasks
Lab: ir.ischool.utexas.edu
@mattlease
Slides: slideshare.net/mattlease
2. Roadmap
• Context: UT Good Systems & iSchool
• Two parts to today’s talk
– Content Moderation
– Aggregating Complex Annotations
3. Goal: Design a future of Artificial Intelligence (AI)
technologies to meet society’s needs and values.
http://goodsystems.utexas.edu
Good Systems: an 8-year, $10M
UT Austin Grand Challenge
4. What’s an Information School?
“The place where people & technology meet”
~ Wobbrock et al., 2009
“iSchools” now exist at over 100 universities around the world
5. Anubrata Das, Brandon Dang and Matthew Lease
School of Information
The University of Texas at Austin
Fast, Accurate, and Healthier:
Interactive Blurring Helps Moderators
Reduce Exposure to Harmful Content
Lab: ir.ischool.utexas.edu
@mattlease
Slides: slideshare.net/mattlease
6. Today’s Talk: Content Moderation
- Social media platforms are hubs of user generated content
- Some types of content are unacceptable or may cause harm
- pornography & nudity, depictions of violence, hate speech, mis/disinformation
- What is considered acceptable varies by platform and region
- Further issues of free speech & due process in content removal & remediation
- e.g., Moderate Globally, Impact Locally: The Global Impacts of Content Moderation (Yale, Nov. 2020)
Alon Halevy et al. "Preserving integrity in online social networks." arXiv preprint, September 25, 2020.
7. Scale of Content Moderation
Paul M. Barrett (2020). Who Moderates the Social Media Giants? A Call to End Outsourcing.
[Figure: scale of content moderation at Facebook and YouTube]
8. Can’t we just use AI?
• High cost of errors -> very high accuracy required
• Continually evolving content and moderation policies
– also regional variants, cultural issues, and adversarial attacks
• While AI systems are often advertised/perceived as fully-automated, in
practice, human labor is typically required and often hidden
– Gray and Suri (2019) “ghost work”, Ekbia and Nardi (2014) “heteromation”,
Irani and Silberman (2013) “invisible work”
• Human moderators today: Facebook ~15K, YouTube ~10K
• No free lunch: human annotators are still needed to create training data
9. Barr & Cabrera, ACM Queue 2006
“Software developers with innovative ideas for businesses
and technologies are constrained by the limits of artificial
intelligence… If software developers could programmatically
access and incorporate human intelligence into their
applications, a whole new class of innovative businesses
and applications would be possible. This is the goal of
Amazon Mechanical Turk… people are freer to innovate because
they can now imbue software with real human intelligence.”
11. Implications for Moderators
“The psychological effects of viewing harmful content is well
documented, with reports of moderators experiencing
posttraumatic stress disorder (PTSD) symptoms and other
mental health issues as a result of the disturbing content they
are exposed to.” (Cambridge Consultants, 2019)
“From my own interviews with more than 100 moderators… a
significant number [get PTSD]. And many other employees
develop long-lasting mental health symptoms that stop short
of full-blown PTSD, including depression, anxiety, and
insomnia.” (Casey Newton, 2020)
Volume quotas (akin to a call center) - “constant measurement
for accuracy is as pressurizing as a quota” (Dwoskin 2019)
Image Source: The Verge
12. The Great Irony
The sort of task we most want an algorithm to do
(emotionally disturbing) is what people are doing
because the algorithm isn’t good enough
13. BUT WHO PROTECTS THE
MODERATORS? (HCOMP 2018)
BRANDON DANG1, MARTIN J. RIEDL2, AND MATTHEW LEASE1
1School of Information & 2School of Journalism (both students contributed equally)
The University of Texas at Austin
AAAI HCOMP & ACM Collective Intelligence
July 2018, Zurich, Switzerland
14. Research Question
By revealing less of an image, can we reduce the emotional
labor of image moderation without compromising
moderator accuracy and efficiency?
15. Design and Demo
http://ir.ischool.utexas.edu/CM/demo/
Dang, Brandon, Martin J. Riedl, and Matthew Lease. "But who protects the moderators? The case of crowdsourced image
moderation." arXiv preprint arXiv:1804.10999 (2018).
Code: https://github.com/budang/content-moderation
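To make the blurring intervention concrete, here is a minimal sketch (not the study's actual implementation, which lives in the repo above) of generating an image at several blur strengths with Pillow, as a slider or hover UI might request; the radii here are arbitrary:

```python
from PIL import Image, ImageFilter  # pip install Pillow

def blur_levels(path, radii=(0, 4, 8, 16)):
    """Return one copy of the image per Gaussian blur radius.

    A slider UI can map slider position to radius; a hover UI can
    swap in the radius-0 original while the cursor is over the image.
    """
    original = Image.open(path)
    return {r: original.filter(ImageFilter.GaussianBlur(radius=r))
            for r in radii}

# Hypothetical usage: precompute variants for client-side swapping.
# variants = blur_levels("image_to_moderate.jpg")
# variants[16].save("image_blur16.jpg")
```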
16. Exposure and Control
“shielding moderators from harm begins with giving them
more control of what they’re seeing and how they’re seeing it,
so just the existence of ...preferences helps” (Sullivan 2019)
“Scientifically, do we know how much [exposure] is too much?
The answer is no, we don’t... If there’s something that were to
keep me up at night... it’s that question”
(Facebook psychologist Chris Harrison)
“Finding the right balance between content reviewer well-being
and resiliency, quality, and productivity is very
challenging at the scale we operate in. We are continually
working to get this balance right.” (Facebook’s Carolyn
Glanville)
Image Source: Fast Company
19. Exposure and Control
- Industry is moving toward establishing best practices for providing control & tools
- Such interventions include grayscaling, muting videos, and blurring
- It is not well understood how effective such practices are
- Google: Ramakrishnan and Karunakaran (HCOMP 2019) report that grayscaling of
images and videos reduces harm. They also study static blurring.
21. Survey: Well-being and Usability
01 Positive and Negative Experience (SPANE): 5-point Likert scale on how often
they experience the following emotions: positive, negative, good, bad,
pleasant, unpleasant, etc. (Diener et al., 2010)
02 Positive and Negative Affect (I-PANAS-SF): 7-point Likert scale on what
emotions they are currently feeling (Thompson 2007)
03 Emotional Exhaustion: slightly modified version of the emotional
exhaustion scale (Wharton 1993; Cates and Howe 2015)
04 Usefulness: perceived usefulness and perceived ease of use
(Davis 1989; Venkatesh and Davis 2000)
22. Experiment
- Random sample of 60 synthetic & real images
across categories: 180 total images
- Divided into groups of 9, balanced over classes
- 20 HITs, five workers per HIT
- Workers restricted to a single HIT
- Adult content qualification, >98% approval rate
with 300+ submitted HITs
- Workers paid $7.25/hour
23. Results
Performance:
- Accuracy
- Time taken
- Effort*
  - # clicks
  - # mouse movements
Well-being:
- Worker comfort
- Experience
- Affect
- Emotional exhaustion
- Usefulness
*Brandon Dang, Miles Hutson, & Matthew Lease. MmmTurkey: A Crowdsourcing Framework for Deploying Tasks
and Recording Worker Behavior on Amazon Mechanical Turk. HCOMP 2016. https://github.com/budang/turkey-lite
24. Speed and Accuracy Are Not Impacted by Interactive Blurring
[Figure: worker accuracy and time taken across blur conditions]
30. Increased mean positive affect with increasing level of blur
[Figure: positive and negative affect scores by blur level]
31. Summary: Hover is the Champion for Adoption
B: Baseline; **p < 0.05, ***p < 0.005
- Slider and hover are both top performers
- Hover shows significantly lower emotional exhaustion with comparatively high accuracy
- If the key goal is to keep accuracy intact & reduce emotional impact, we recommend the hover design
32.
01 Contribution: proposed and extensively evaluated an
intervention that improves moderator well-being
02 Conclusion: as opposed to static blurring, which
decreases accuracy, interactive blurring improves
well-being without sacrificing accuracy and speed
03 Future Work:
• Qualitative analysis
• Intelligent unblurring
• Early warning for severity
33. Alex Braylan1 and Matthew Lease2
1Dept. of Computer Science & 2School of Information
The University of Texas at Austin
Modeling and Aggregation of Complex
Annotations via Annotation Distances
ml@utexas.edu
@mattlease
Slides: slideshare.net/mattlease
Encore: Dec 11 talk @NeurIPS Crowd Science Workshop (https://research.yandex.com/workshops/crowd/neurips-2020)
Code & Data: https://github.com/Praznat/annotationmodeling
34. Simple annotation & aggregation
• Classification
– sentiment analysis
– image categorization
• Ordinal rating
– product & movie reviews
– search relevance
• Multiple choice selection
– quizzes
Aggregation
• Crowdsourcing: quality control
• Experts: wisdom of crowds
• Goal: select best label available
for each item (no label fusion)
38. When majority voting falls short
Caption this image: [image]
Example captions: “A cat is eating” / “The cat eats” / “A beautiful picture”
Problem: large label space; exact match doesn’t work!
39. What about complex annotations?
• Ranked lists
• Parse trees
• Image captions (A1: A cat is eating; A2: The cat eats; A3: A beautiful picture)
• Range sequences
41. Aggregating Simple Labels
• Hundreds of papers
• Multiple benchmarking studies
• Rich body of Bayesian modeling
– e.g., Dawid-Skene, MACE, Hierarchical Dawid-Skene,
Item Difficulty, Logistic Random Effects
(source: Paun et al. 2018, “Comparing Bayesian Models of Annotation”)
• General-purpose aggregation
models for simple labels don’t
support complex labels!
42. Task-specific models
• Pros:
– Task specialization
maximizes accuracy
• Cons:
– Need new model for
every task
– Complicated, difficult
to formulate
Nguyen et al 2017 (Sequences)
Lin, Mausam, and Weld 2012 (Math)
43. Task-specific workflows
• Pros:
– Empower workers
for complex tasks
• Cons:
– Need new workflow
for every task
– Complicated, difficult
to formulate
Noronha et al 2011
(image analysis)
Lasecki et al 2012
(transcription)
44. Our goals
• We want aggregation for complex data types
– Build on ideas from simple label aggregation models
• We want to generalize across many labeling tasks
– Can we reduce problem to common simpler state space?
46. Key Insight
• Partial credit matching via task-specific distance function
– Encapsulate task-specific label features into requester distance function
– Model annotation distances rather than annotations
– Distance functions already exist for most tasks because people need
evaluation functions to compare predicted labels vs gold
47. Distance functions
Properties of distance functions: non-negativity, symmetry, triangle inequality

Data                    Free Text     Rankings
Example evaluation fn   BLEU(x, y)    …
Example distance fn     …             …
Non-negativity          ✓             ✓
Symmetry                ✓             ✓
Triangle inequality     ✓             ✓
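As a quick illustration (a sketch, not from the paper), one can spot-check these three properties for any candidate distance function on sample annotations. Note that a distance derived from an evaluation score, e.g. d(x, y) = 1 - BLEU(x, y), is not guaranteed to satisfy all three exactly:

```python
import itertools

def check_metric_properties(dist, items, tol=1e-9):
    """Spot-check non-negativity, symmetry, and the triangle
    inequality for a distance function `dist` on sample `items`."""
    for x, y in itertools.product(items, repeat=2):
        assert dist(x, y) >= -tol                       # non-negativity
        assert abs(dist(x, y) - dist(y, x)) <= tol      # symmetry
    for x, y, z in itertools.product(items, repeat=3):
        assert dist(x, z) <= dist(x, y) + dist(y, z) + tol  # triangle ineq.
```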
48. Calculate distances
• Example task: text annotation
• Example distance function: string edit distance
Annotations: “a cat is eating”, “cat is eating”, “the cat eats”, “a beautiful picture”
50. Calculate distances
[Figure: pairwise string edit distances among the four captions. The three
“cat” captions are close to one another (0.05, 0.1, 0.1), while “a beautiful
picture” is far from all three (0.8, 0.82, 0.82)]
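A minimal sketch of this step: computing a normalized edit distance matrix over the four example captions. (The numbers on the slide are illustrative; this code produces different exact values.)

```python
def edit_distance(a, b):
    """Levenshtein distance via dynamic programming (two rows)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def normalized_edit_distance(a, b):
    """Scale into [0, 1] by the longer string's length."""
    return edit_distance(a, b) / max(len(a), len(b), 1)

captions = ["a cat is eating", "cat is eating",
            "the cat eats", "a beautiful picture"]
D = [[normalized_edit_distance(x, y) for y in captions] for x in captions]
# The three "cat" captions end up close together; "a beautiful picture"
# is far from all of them.
```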
51. All tasks reduce to matrices of annotation distances
[Figure: A1 “A cat is eating”, A2 “The cat eats”, A3 “A beautiful picture” as
nodes, with pairwise distances 0.1, 0.3, and 0.6 as edges]
52. How to aggregate given distances
• Local selection model
• Global selection model
• Combined
53. Local approach: Smallest Avg Distance
• For each item:
1. Compute average distance between
annotations for the item
2. Choose annotation with smallest
average distance
• Generalization of majority vote
• Independence between items
• Local approach does not model
annotator reliability
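A minimal sketch of this local rule, assuming a per-item distance matrix like the one computed above (zeros on the diagonal, and at least two annotations per item):

```python
import numpy as np

def smallest_avg_distance(D):
    """Return the index of the annotation whose average distance to
    the item's other annotations is smallest. With exact duplicates
    this reduces to majority vote: copies pull the average down."""
    D = np.asarray(D, dtype=float)
    n = D.shape[0]
    avg = D.sum(axis=1) / (n - 1)   # diagonal is 0, so self is excluded
    return int(np.argmin(avg))

# smallest_avg_distance(D)  # -> index of the "most central" caption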
54. Global approach: Best Available User
• For each annotator:
– Score by average distance over full dataset
• For each item:
– Choose label by best-scoring annotator
• Fixed annotator reliability
• Global approach does not model how
well annotators did on specific items
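A sketch of the global rule under an assumed data layout (one distance matrix plus a matching annotator-id list per item; the names and layout are hypothetical, not from the paper):

```python
from collections import defaultdict
import numpy as np

def best_available_user(item_distances, item_annotators):
    """Score each annotator by mean distance to co-annotators over the
    whole dataset (lower = more reliable), then for each item report
    the best-scoring annotator present; their label would be chosen.
    Assumes at least two annotations per item."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for D, who in zip(item_distances, item_annotators):
        D = np.asarray(D, dtype=float)
        avg = D.sum(axis=1) / (D.shape[0] - 1)
        for annotator, score in zip(who, avg):
            totals[annotator] += score
            counts[annotator] += 1
    reliability = {a: totals[a] / counts[a] for a in totals}
    return [min(who, key=reliability.get) for who in item_annotators]
```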
55. Can we get best of both worlds?
• Want a method that combines:
– Best available user (global)
– Smallest avg distance (local)
• Should build on rich history of work on Bayesian annotation modeling
• Need a principled framework for modeling annotation distance matrices
[Figure: weights + votes → weighted voting]
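One simple way to realize this combination (a sketch, not the MAS model itself): weight each annotation's "votes" by its annotator's inferred reliability when computing the per-item average distance:

```python
import numpy as np

def weighted_vote(D, weights):
    """Pick the annotation minimizing the reliability-weighted average
    distance to the item's other annotations. `weights[j]` is annotator
    j's reliability (higher = more trusted); the diagonal of D is zero,
    so each annotation's own weight is excluded from its score.
    Assumes at least two annotations and positive weights."""
    D = np.asarray(D, dtype=float)
    w = np.asarray(weights, dtype=float)
    scores = (D * w).sum(axis=1) / (w.sum() - w)
    return int(np.argmin(scores))
```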
56. Multidimensional Annotation Scaling (MAS)
• Based on Multidimensional
Scaling (Kruskal & Wish 1978)
• Probabilistic model of multi-item distance matrices
• “Hierarchical Bayesian”
– Additional learned parameters
represent crowd effects such as
worker reliability
[Figure: the three example captions embedded so that their distances are preserved]
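MAS itself is a hierarchical Bayesian model; as rough intuition only, classical (non-Bayesian) multidimensional scaling embeds the annotations so that embedding distances approximate the observed distance matrix:

```python
import numpy as np

def classical_mds(D, dim=2):
    """Plain classical MDS (Kruskal & Wish 1978): embed points so their
    Euclidean distances approximate D. This omits everything that makes
    MAS hierarchical Bayesian, e.g. learned per-worker reliability."""
    D = np.asarray(D, dtype=float)
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n      # centering matrix
    B = -0.5 * J @ (D ** 2) @ J              # double-centered Gram matrix
    vals, vecs = np.linalg.eigh(B)           # eigenvalues in ascending order
    top = np.argsort(vals)[::-1][:dim]       # keep the largest `dim`
    return vecs[:, top] * np.sqrt(np.maximum(vals[top], 0))
```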
65. Tasks & datasets
SYNTHETIC DATASETS
• Syntactic parse trees
– Distance function: evalb
• Ranked lists
– Distance function: Kendall’s tau
REAL DATASETS
• Biomedical text sequences (Nguyen et al. 2017)
– Distance function: Span F1
• Urdu-English translations (Zaidan and Callison-Burch 2011)
– Distance function: GLEU
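For the rankings task, a hedged sketch of a distance function: SciPy's Kendall's tau is a correlation in [-1, 1], and one common normalization (an assumption here, not necessarily the paper's exact formula) maps it to a distance in [0, 1]:

```python
from scipy.stats import kendalltau

def kendall_distance(rank_a, rank_b):
    """Map Kendall's tau (1 = identical order, -1 = reversed) to [0, 1]."""
    tau, _ = kendalltau(rank_a, rank_b)
    return (1.0 - tau) / 2.0

# kendall_distance([1, 2, 3, 4], [1, 3, 2, 4])  # small: orders mostly agree
```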
66. Methods
Baselines:
• Random User (RU): pick one label randomly
• ZenCrowd (ZC) (Demartini et al. 2012)
– Weighted voting based on exact match (rare!)
• Crowd Hidden Markov Model (CHMM) (Nguyen et al. 2017)
– Sequence annotation task only
Upper bound: Oracle (OR): always picks the best label
• Even if 5 workers answer, performance is limited by the best answer any of them gave
70. Results
Task          Metric  RU     ZC     CHMM   MAS    Oracle
Translations  GLEU    0.185  0.188  -      0.217  0.246
Sequences     F1      0.561  0.569  0.702  0.709  0.827
Parses        EVALB   0.812  0.819  -      0.932  0.939
Rankings              0.491  0.495  -      0.710  0.724
• Diverse complex label datasets
• MAS aggregation is the best way to get closer to ground truth with no
model alteration between datasets
71. Conclusion
• Goal: general-purpose probabilistic model to aggregate complex annotations
– Categorical-based methods insufficient
– Custom models difficult to design for new annotation types
• Solution: Model annotation distances via task-specific distance functions
– Transforms problem into general-purpose variable space
• Multi-dimensional Annotation Scaling (MAS)
– Allows unsupervised weighted voting with inferred annotator reliability
• Not covered in talk (see paper)
– Semi-supervised learning
– Partial credit
72. Ongoing work
• Generalization to more tasks (e.g., image bounding boxes & keypoints)
• Generalization to simple annotation tasks (“one ring to rule them all”)
• Support for multiple latent objects per item
• Merging annotations rather than selecting best one
– e.g. guessing weight of an ox
– MAS vs. non-embedding EM model, varying noise, fewer annotations, …
Code & Data: https://github.com/Praznat/annotationmodeling
73. Thank you!
Matt Lease (University of Texas at Austin)
Lab: ir.ischool.utexas.edu
@mattlease
Slides: slideshare.net/mattlease
We thank our many talented crowd workers for their contributions to our research!