4. CROWDSOURCING AND
THE SEMANTIC WEB
Developers of semantic applications use crowdsourcing to achieve their goals
The Semantic Web is a giant crowdsourcing project
5. THE DESIGN SPACE OF A CROWDSOURCING PROJECT
What: Goal
Who: Staffing
How: Process
Why: Incentives
7. Tasks based on human skills, not easily replicable by machines
Editing knowledge graphs
Adding semantic annotations to media
Adding multilingual labels to entities
Defining links between entities
…
8. Most effective when used at scale ('open call'), in combination with machine intelligence
10. Acosta, M., Zaveri, A., Simperl, E., Kontokostas, D., Flöck, F., &
Lehmann, J. (2016). Detecting Linked Data quality issues via
crowdsourcing: A DBpedia study. Semantic Web Journal, 1-34.
There is more to crowdsourcing than Mechanical Turk
Use the right crowd for the right task
11. BACKGROUND
Varying quality of Linked Data sources
Detecting and correcting errors may require manual inspection, e.g. this incorrect object value:
dbpedia:Dave_Dobbyn dbprop:dateOfBirth "3" .
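One way such suspect values can be surfaced for manual inspection is a datatype filter over the SPARQL endpoint; a minimal sketch (the query shape and endpoint usage are illustrative, not the paper's pipeline):

    # Hedged sketch: flag dbp:dateOfBirth values that are not typed as dates,
    # so that triples like the one above can be routed to human verification.
    from SPARQLWrapper import SPARQLWrapper, JSON

    sparql = SPARQLWrapper("https://dbpedia.org/sparql")
    sparql.setQuery("""
        PREFIX dbp: <http://dbpedia.org/property/>
        PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
        SELECT ?person ?dob WHERE {
            ?person dbp:dateOfBirth ?dob .
            FILTER (datatype(?dob) != xsd:date)
        } LIMIT 100
    """)
    sparql.setReturnFormat(JSON)
    for row in sparql.query().convert()["results"]["bindings"]:
        print(row["person"]["value"], row["dob"]["value"])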
12. APPROACH
A Find-Verify workflow:
Find (contest, TripleCheckMate): Linked Data experts, difficult task, final prize
Verify (microtasks, MTurk): workers, easy task, micropayments
Crowdsourced error types: incorrect object, incorrect data type, incorrect outlink

Results: precision
                          Object values   Data types   Interlinks
Linked Data experts       0.7151          0.8270       0.1525
MTurk (majority voting)   0.8977          0.4752       0.9412

Findings:
Use the right crowd for the right task
Experts detect a range of issues, but will not invest additional effort
Turkers can carry out the three tasks and are exceptionally good at data comparisons
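The MTurk figures above aggregate repeated worker judgements by majority voting; a minimal sketch of that aggregation step (the function and sample answers are illustrative):

    # Hedged sketch: aggregate several workers' judgements on one triple.
    from collections import Counter

    def majority_vote(judgements):
        """Return the most frequent answer and its share of the votes."""
        counts = Counter(judgements)
        answer, votes = counts.most_common(1)[0]
        return answer, votes / len(judgements)

    # Five workers assess the dateOfBirth triple shown earlier.
    answer, support = majority_vote(
        ["incorrect", "incorrect", "correct", "incorrect", "incorrect"])
    print(answer, support)  # incorrect 0.8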
13. Diverse crowds are better
Piscopo, A., Phethean, C., & Simperl, E. (2017). What Makes a
Good Collaborative Knowledge Graph: Group Composition and
Quality in Wikidata. International Conference on Social
Informatics, 305-322, Springer.
14. BACKGROUND
Items and statements in Wikidata are edited by teams of editors
Editors have varied tenure and interests
Group composition impacts outcomes
Diversity can have multiple effects
Moderate tenure diversity increases outcome quality
Interest diversity leads to increased group productivity
Chen, J., Ren, Y., & Riedl, J. (2010). The effects of diversity on group productivity and member withdrawal in online volunteer groups. In Proceedings of CHI '10, p. 821. ACM Press, New York.
15. STUDY
Analysed the edit history of items
Corpus of 5k items whose quality has been manually assessed (5 levels)*
Edit-history analysis focused on community make-up
Community is defined as the set of editors of an item
Considered features from the group diversity literature and Wikidata-specific aspects (one such feature is sketched below)
*https://www.wikidata.org/wiki/Wikidata:Item_quality
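A hedged sketch of one plausible feature of this kind: tenure diversity as the coefficient of variation of editors' tenure, a common operationalisation in the group-diversity literature (the paper's exact feature set may differ):

    # Hedged sketch: tenure diversity of an item's editor group.
    import statistics

    def tenure_diversity(tenures_in_days):
        """Coefficient of variation of member tenure: stdev / mean."""
        mean = statistics.mean(tenures_in_days)
        return statistics.stdev(tenures_in_days) / mean if mean else 0.0

    # Editors of one Wikidata item, tenure in days since registration.
    print(round(tenure_diversity([30, 400, 1200, 5]), 2))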
18. SUMMARY AND IMPLICATIONS
Findings:
01 The more is not always the merrier
02 Bot edits are key for quality, but bots and humans together are better
03 Registered editors have a positive impact
04 Diversity matters
Implications:
01 Encourage registration
02 Identify further areas for bot editing
03 Design effective human-bot workflows
04 Suggest items to edit based on tenure and interests
20. Bu, Q., Simperl, E., Zerr, S., & Li, Y. (2016). Using microtasks to
crowdsource DBpedia entity classification: A study in workflow
design. Semantic Web Journal, 1-18.
There are different ways to carry
out a task using crowdsourcing
They will produce different results
21. THREE WORKFLOWS TO CROWDSOURCE ENTITY TYPING
Different human data interfaces:
Free associations
Validating the machine
Exploring the DBpedia ontology
Findings:
Shortlists are easy & fast
Popular classes are not enough
Alternative ways to explore the taxonomy
Freedom comes with a price
Unclassified entities might be unclassifiable
Working at the basic level of abstraction achieves greatest precision
But when given the freedom to choose, users suggest more specific classes
(DBpedia describes 4.58M things)
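To make the second workflow concrete, a toy sketch of a 'validating the machine' microtask (the question template and class name are invented for illustration, not taken from the study):

    # Hedged sketch: turn a machine-suggested type into a yes/no microtask.
    def validation_task(entity_label, predicted_class):
        return (f"Is '{entity_label}' best described as a {predicted_class}? "
                "Answer yes or no.")

    print(validation_task("Dave Dobbyn", "MusicalArtist"))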
22. Kaffee, L. A., Elsahar, H., Vougiouklis, P., Gravier, C., Laforest, F.,
Hare, J., & Simperl, E. (2018). Mind the (Language) Gap:
Generation of Multilingual Wikipedia Summaries from Wikidata
for ArticlePlaceholders. In European Semantic Web Conference (pp.
319-334). Springer.
Crowds need human-readable
interfaces to KGs
23. BACKGROUND
Wikipedia is available in 287 languages, but content is unevenly distributed
Wikidata is cross-lingual
ArticlePlaceholders display Wikidata triples as stubs for articles in underserved Wikipedias
Currently deployed in 11 Wikipedias
24. STUDY
Enrich ArticlePlaceholders with textual summaries generated from Wikidata triples
Train a neural network to generate one-sentence summaries resembling the opening paragraph of a Wikipedia article
Test the approach on two languages, Esperanto and Arabic, with readers and editors of those Wikipedias
25. APPROACH
NEURAL NETWORK TRAINED ON WIKIDATA/WIKIPEDIA
Feed-forward architecture encodes triples from the ArticlePlaceholder into a vector of fixed dimensionality
RNN-based decoder generates text summaries, one token at a time
Optimisations for different entity verbalisations, rare entities, etc.
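A highly simplified sketch of that encoder-decoder shape (PyTorch; the dimensions, vocabulary, and GRU choice are illustrative, not the paper's architecture):

    # Hedged sketch: a feed-forward layer encodes embedded triples into one
    # fixed-size vector; an RNN decoder emits summary tokens one at a time.
    import torch
    import torch.nn as nn

    class TripleToText(nn.Module):
        def __init__(self, vocab_size, emb_dim=64, hid_dim=128, max_triples=10):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, emb_dim)
            # Concatenated triple embeddings -> fixed-dimensionality vector
            self.encoder = nn.Linear(max_triples * 3 * emb_dim, hid_dim)
            self.decoder = nn.GRU(emb_dim, hid_dim, batch_first=True)
            self.out = nn.Linear(hid_dim, vocab_size)

        def forward(self, triples, summary_tokens):
            # triples: (batch, max_triples, 3) ids; summary_tokens: (batch, seq)
            enc = torch.tanh(self.encoder(self.embed(triples).flatten(1)))
            dec_out, _ = self.decoder(self.embed(summary_tokens), enc.unsqueeze(0))
            return self.out(dec_out)  # next-token logits at each position

    model = TripleToText(vocab_size=1000)
    logits = model(torch.randint(0, 1000, (2, 10, 3)), torch.randint(0, 1000, (2, 15)))
    print(logits.shape)  # torch.Size([2, 15, 1000])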
26. AUTOMATIC EVALUATION
APPROACH OUTPERFORMS BASELINES
Trained on corpus of Wikipedia sentences and corresponding Wikidata triples (205k Arabic; 102k Esperanto)
Tested against three baselines: machine translation (MT) and template retrieval (TR, TRext)
Using standard metrics: BLEU, METEOR, ROUGE-L
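For instance, BLEU against a reference sentence can be computed as below (hedged: toy tokens, not the paper's corpora or tokenisation):

    # Hedged sketch: score a generated summary against one reference with BLEU.
    from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

    reference = ["dave", "dobbyn", "is", "a", "new", "zealand", "musician"]
    generated = ["dave", "dobbyn", "is", "a", "musician"]
    score = sentence_bleu([reference], generated,
                          smoothing_function=SmoothingFunction().method1)
    print(round(score, 3))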
27. USER STUDIES
SUMMARIES ARE USEFUL FOR THE COMMUNITY
Readers study: 15 days, mixed corpus of 60 articles
Editors study: 15 days, 30 summaries
31. EXPERIMENT 1
Make paid microtasks more cost-effective with gamification
Workers will perform better if tasks are more engaging:
Increased accuracy through higher inter-annotator agreement
Cost savings through reduced unit costs
Micro-targeting incentives when players attempt to quit improves retention
32. MICROTASK DESIGN
Image labelling tasks, published on a microtask platform
Free-text labels, varying numbers of labels per image, taboo words
Workers can skip images, play as much as they want
Baseline: 'standard' tasks with basic spam control
vs
Gamified: same requirements & rewards, but crowd asked to complete tasks in Wordsmith
vs
Gamified & furtherance incentives: additional rewards to stay (random, personalised)
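A small sketch of the taboo-words mechanic mentioned above (the rules and word lists are invented for illustration): free-text labels are rejected if they hit the image's taboo list.

    # Hedged sketch: ESP-style label validation with taboo words.
    def accept_label(label, taboo_words):
        label = label.strip().lower()
        return bool(label) and label not in {w.lower() for w in taboo_words}

    print(accept_label("guitar", ["music", "guitar"]))  # False: taboo word
    print(accept_label("stage", ["music", "guitar"]))   # True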
33. EVALUATION
ESP data set as gold standard
Evaluated #labels, agreement, mean & max #labels/worker
Three tasks:
Nano: 1 image
Micro: 11 images
Small: up to 2,000 images
Probabilistic reasoning to predict worker exit and personalise furtherance incentives
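A hedged sketch of that last step (the model in the paper differs; the features and training data here are invented): a simple classifier over session features decides when to trigger a furtherance incentive.

    # Hedged sketch: predict worker exit and trigger an incentive.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # Features per worker: [tasks_done, seconds_since_last_action, accuracy]
    X = np.array([[3, 5, 0.9], [40, 120, 0.7], [2, 300, 0.5], [25, 10, 0.8]])
    y = np.array([0, 1, 1, 0])  # 1 = worker quit shortly afterwards

    exit_model = LogisticRegression().fit(X, y)
    p_quit = exit_model.predict_proba([[10, 200, 0.6]])[0, 1]
    if p_quit > 0.5:
        print("offer a furtherance incentive")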
34. RESULTS (GAMIFICATION, 1 IMAGE)
BETTER, CHEAPER, BUT FEWER WORKERS
Metric               CrowdFlower   Wordsmith
Total workers        600           423
Total keywords       1,200         41,206
Unique keywords      111           5,708
Avg. agreement       5.72%         37.7%
Avg. images/person   1             32
Max images/person    1             200
35. RESULTS (GAMIFICATION, 11 IMAGES)
COMPARABLE QUALITY, HIGHER UNIT COSTS, FEWER DROPOUTS
Metric               CrowdFlower   Wordsmith
Total workers        600           514
Total keywords       13,200        35,890
Unique keywords      1,323         4,091
Avg. agreement       6.32%         10.9%
Avg. images/person   11            27
Max images/person    11            351
36. RESULTS (WITH FURTHERANCE INCENTIVES)
MORE ENGAGEMENT, TARGETING WORKS
Increased participation: people come back (20 times) and play longer (43 hours vs 3 hours without incentives)
Financial incentives play an important role
Targeted incentives work:
77% of players stayed vs. 27% in the randomised condition
19% more labels compared to the no-incentives condition
37. EXPERIMENT 2
Make paid microtasks more cost-effective with social incentives
Working in pairs is more effective than the baseline:
Higher inter-annotator agreement
Higher output
Social incentives improve retention past the payment threshold
38. MICROTASK DESIGN
Image labelling tasks published on a microtask platform
Free-text labels, varying numbers of labels per image, taboo words
Baseline: 'standard' tasks with basic spam control
vs
Pairs: Wordsmith-based, randomly formed pairs; people join and leave all the time, so partners switch over time (pairing sketched below)
vs
Pairs & social incentives: 'let's play' vs 'please stay', offered to a worker when we expect their partner to leave
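A toy sketch of that pairing mechanic (the policy is invented for illustration): joining workers are paired at random, and when someone leaves their partner returns to the pool, producing partner switches over time.

    # Hedged sketch: random pair formation with churn.
    import random

    pool = []    # workers waiting for a partner
    pairs = {}   # worker -> current partner

    def join(worker):
        if pool:
            partner = pool.pop(random.randrange(len(pool)))
            pairs[worker], pairs[partner] = partner, worker
        else:
            pool.append(worker)

    def leave(worker):
        partner = pairs.pop(worker, None)
        if partner is not None:
            del pairs[partner]
            join(partner)  # re-pair the partner, causing switches over time
        elif worker in pool:
            pool.remove(worker)

    for w in ["ann", "bob", "eve", "dan"]:
        join(w)
    leave("bob")
    print(pairs)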
40. EVALUATION
ESP data set as gold standard
Evaluated #labels, agreement, avg/max #labels/worker
Two tasks:
Low threshold: 1 image
High threshold: 11 images
Probabilistic reasoning to predict worker exit* and offer social incentives
* [Kobren et al., 2015] extended with utility features
43. SUMMARY OF FINDINGS
Social incentives generate more tags and improve retention
Social dynamics: workers respond differently depending on whether their partner has been paid
Paid workers are 76% more likely to stay after social pressure; unpaid workers are 95% more likely to stay
Paid workers who decide to stay annotate more than unpaid workers do
Social flow is more effective than social pressure at generating more tags: 99% of unpaid workers are likely to stay
Social pressure works more often overall
44. ONE DOES NOT SIMPLY
CROWDSOURCE THE SEMANTIC WEB
45. CONCLUSIONS
With AI and ML on the rise, crowdsourcing is a critical skill for any Semantic Web developer
Explore the what, who, how, why design space
Use the full range of approaches and techniques to scale to large datasets