Crowdsourcing systems are widely used to overcome challenges that require human intervention. Yet, while adoption of the crowdsourcing paradigm keeps increasing, there are no established guidelines or tangible recommendations for task design with respect to key parameters such as task length, monetary incentive, and the time required for task completion. In this paper, we propose tuning these parameters based on our findings from extensive experiments and analysis of categorization tasks. We delve into the behavior of workers who complete categorization tasks to determine measures that can make task design more effective.
1. Breaking Bad - Understanding Behavior of Crowd Workers in Categorization Microtasks
Ujwal Gadiraju, Ricardo Kawase, Patrick Siehndel and Besnik Fetahu
METU NCC, 2nd September 2015
3. What is the problem?
● Increase in the number of new task requesters on AMT (1000 per month) [Difallah et al., WWW'15].
○ Not all task requesters are familiar with task-specific settings.
○ No tangible guidelines for task design:
■ task length
■ monetary incentive
■ task completion time
4. Worker Behavior in Categorization Tasks
● Categorization tasks are one of the most common types of crowdsourced tasks. [Gadiraju et al., A Taxonomy of Microtasks on the Web, HT'14]
● Experimental setup (a sketch of the resulting design follows below):
○ 9 tasks deployed on CrowdFlower
○ Task length: 20, 30, 40 units
○ Monetary reward: 1, 2, 3 USD cents
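The nine tasks arise from fully crossing the two parameters above. A minimal sketch of that 3 × 3 design in Python, with illustrative variable names that are not taken from the original study's artifacts:

```python
from itertools import product

# A minimal sketch of the 3 x 3 design described above
# (task length x monetary reward); names are illustrative.
TASK_LENGTHS = [20, 30, 40]       # units per task
REWARDS_CENTS = [1, 2, 3]         # USD cents per task

configurations = [
    {"length": length, "reward_cents": reward}
    for length, reward in product(TASK_LENGTHS, REWARDS_CENTS)
]

assert len(configurations) == 9   # the 9 deployed tasks
for cfg in configurations:
    print(cfg)
```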
5. Task Design
● Clear instructions and help snippets.
● Workers have to select the most suitable category in each set (Set-1 through Set-5), with 10 different categories to choose from.
● Category options were manually tailored to avoid ambiguity.
● Set-1 was made compulsory; Set-2 through Set-5 were optional.
● Tasks were deployed non-concurrently, and the order of units was randomized within each task.
● Tasks were designed to facilitate 100% accuracy in responses (with the aim of studying worker behavior).
6. Data Collection
● Responses gathered from 100 workers in each task; 900 workers in total.
● We collected 27,000 unit judgments in total. In 88% of the cases, workers provided responses for all sets (incl. optional ones).
● Average task completion time:
○ Tasks with a length of 20 units: 11.3 mins
○ Tasks with a length of 30 units: 16.4 mins
○ Tasks with a length of 40 units: 18.6 mins
7. Definitions
● Tipping Point: the first point (unit index) at which a worker provides an unacceptable response after having provided at least one acceptable response. [Gadiraju et al., CHI'15]
● Beaver Workers: workers who exert additional effort by answering optional questions in order to help task requesters.
A sketch of the tipping-point computation follows below.
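The tipping point reduces to a single scan over a worker's responses. A minimal sketch, assuming each worker's responses are encoded as a hypothetical per-unit correctness sequence (True = acceptable), which is our assumption and not a format from the original study:

```python
from typing import Optional, Sequence

def tipping_point(responses: Sequence[bool]) -> Optional[int]:
    """First unit index (0-based) with an unacceptable response that
    follows at least one acceptable response; None if the worker
    never tips."""
    seen_acceptable = False
    for index, acceptable in enumerate(responses):
        if acceptable:
            seen_acceptable = True
        elif seen_acceptable:
            return index
    return None

# Two acceptable responses, then a slip at unit index 2.
assert tipping_point([True, True, False, True]) == 2
assert tipping_point([False, False]) is None  # never acceptable first
```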
8. Consistency of Units within Tasks
● Avg. accuracy of around 90% with little std. dev.
● We tolerate 10% incorrect responses from workers, owing to possible drifts in attention span / boredom.
● Bad Workers: workers who answer 10% or more of the units within a categorization task incorrectly.
● Poor Starters: workers whose first 2 responses within a categorization task are incorrect.
Both definitions are sketched in code below.
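Both categories reduce to simple predicates over the same hypothetical per-unit correctness sequence used above; a sketch:

```python
from typing import Sequence

def is_bad_worker(responses: Sequence[bool]) -> bool:
    # Bad Worker: 10% or more of the task's units answered incorrectly.
    incorrect = sum(1 for acceptable in responses if not acceptable)
    return incorrect / len(responses) >= 0.10

def is_poor_starter(responses: Sequence[bool]) -> bool:
    # Poor Starter: the first 2 responses within the task are incorrect.
    return len(responses) >= 2 and not responses[0] and not responses[1]

# 3 wrong out of 20 units (15%) exceeds the 10% tolerance.
assert is_bad_worker([False] * 3 + [True] * 17)
assert is_poor_starter([False, False] + [True] * 18)
```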
11. Worker Behavior Within a Task
Key Findings
● A worker's accuracy decreases through the course of a task (optional sets are not considered).
○ This is more prominent as the task length increases.
● Workers who exert additional effort exhibit higher accuracy within tasks.
● The additional effort that workers exert decreases through the course of a task.
○ This is more prominent as the task length increases.
12. Scrutiny of Additional Responses
● The % of correct additional responses gradually decreases from Set-1 to Set-5.
● On average, workers skip more optional sets as they proceed from Set-2 to Set-5.
A sketch for computing these per-set statistics follows below.
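Both measurements can be derived from per-worker records of the optional sets. A sketch, assuming a hypothetical structure (our assumption, not the study's data format) in which a skipped set is recorded as None:

```python
from typing import Mapping, Optional, Sequence

# Hypothetical input: one dict per worker, mapping optional set names
# to a per-unit correctness list, or None if the worker skipped the set.
WorkerRecord = Mapping[str, Optional[Sequence[bool]]]

def per_set_stats(workers: Sequence[WorkerRecord]) -> dict:
    """Fraction of correct additional responses and skip rate per set."""
    stats = {}
    for name in ("Set-2", "Set-3", "Set-4", "Set-5"):
        answered = [w[name] for w in workers if w.get(name) is not None]
        total = sum(len(resp) for resp in answered)
        correct = sum(sum(resp) for resp in answered)
        stats[name] = {
            "frac_correct": correct / total if total else None,
            "skip_rate": 1 - len(answered) / len(workers),
        }
    return stats

workers = [
    {"Set-2": [True, True], "Set-3": [True, False], "Set-4": None, "Set-5": None},
    {"Set-2": [True, False], "Set-3": None, "Set-4": None, "Set-5": None},
]
print(per_set_stats(workers))
```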
13. Workers Breaking Bad
Adjusted Tipping Point (ATP): workers who respond incorrectly to at least 10% of the units in a task consecutively are said to have an ATP. The index of the first unit at which this is observed is called the ATP of the worker. Such a worker is called a BREAKER. (A sketch of this rule follows below.)
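A minimal sketch of the ATP rule, assuming the same hypothetical per-unit correctness encoding as above; taking the starting unit of the offending run as the ATP is our reading of the definition:

```python
import math
from typing import Optional, Sequence

def adjusted_tipping_point(responses: Sequence[bool],
                           threshold: float = 0.10) -> Optional[int]:
    """0-based index at which a worker starts a run of consecutive
    incorrect responses covering at least `threshold` of the task's
    units (10% in the talk); None if the worker never breaks bad."""
    run_needed = math.ceil(len(responses) * threshold)
    run_start, run_len = None, 0
    for index, acceptable in enumerate(responses):
        if acceptable:
            run_start, run_len = None, 0
        else:
            if run_len == 0:
                run_start = index
            run_len += 1
            if run_len >= run_needed:
                return run_start
    return None

def is_breaker(responses: Sequence[bool]) -> bool:
    return adjusted_tipping_point(responses) is not None

# 20-unit task: 10% => a run of 2 consecutive errors marks an ATP.
resp = [True] * 5 + [False, False] + [True] * 13
assert adjusted_tipping_point(resp) == 5 and is_breaker(resp)
```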
14. Conclusions & Future Work
To achieve good quality in categorization tasks…
● It is better to err on the lower side of the monetary incentives offered.
● Use the minimum time required as a filter, but give ample time for task completion. It is better to err on the higher side of the maximum task completion time.
● It is better to err on the shorter side of task length.
● We can gauge worker intentions through the nature of their responses to optional questions.
● We plan to quantify the limits of these guidelines in the near future.
16. Removal of Ineligible Workers
Ineligible workers: workers who do not conform to the previously stated prerequisites.
● We found 9 ineligible workers who used browser-embedded translator tools in order to participate in the tasks.
● Ineligible workers were not considered in the further analysis.