This is the talk I gave at the Academia Sinica Institute of Information Science in Taiwan. It focuses on our Wikipedia and Amazon Mechanical Turk research.
2010-02-22 Wikipedia MTurk Research talk given at Taiwan's Academia Sinica
1. Ed H. Chi
Area Manager and Principal Scientist
Augmented Social Cognition Area
Palo Alto Research Center
2. Cognition: the ability to remember, think, and reason; the faculty of knowing.
Social Cognition: the ability of a group to remember, think, and reason; the construction of knowledge structures by a group.
– (not quite the same as the branch of psychology that studies the cognitive processes involved in social interaction, though that is included)
Augmented Social Cognition: the enhancement, supported by systems, of the ability of a group to remember, think, and reason; the system-supported construction of knowledge structures by a group.
Citation: Chi, IEEE Computer, Sept 2008
2010-02-22 Ed H. Chi ASC Overview 2
3. Research cycle: Characterization → Models → Prototypes → Evaluations
– Characterize activity on social systems with analytics
– Model social interaction and community dynamics and variables
– Prototype tools to increase benefits or reduce costs
– Evaluate prototypes via Living Laboratories with real users
4. Characterization and Modeling:
– Community Analytics and Wikipedia Dynamics
Prototyping:
– Social Transparency thru WikiDashboard
Evaluation:
– Evaluations using Amazon Mechanical Turk
7. Mediator Pattern - Terri Schiavo
[Editor-network diagram: anonymous users (vandals/spammers), editors sympathetic to the husband, editors sympathetic to the parents, and mediators between them.]
8. Measure of controversy
• “Controversial” tag
• Use # of revisions tagged controversial
9. Page metrics
• Possible metrics for identifying conflict in articles

Metric type                     Page Type
Revisions (#)                   Article, talk, article/talk
Page length                     Article, talk, article/talk
Unique editors                  Article, talk, article/talk
Unique editors / revisions      Article, talk
Links from other articles       Article, talk
Links to other articles         Article, talk
Anonymous edits (#, %)          Article, talk
Administrator edits (#, %)      Article, talk
Minor edits (#, %)              Article, talk
Reverts (#, by unique editors)  Article
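The metrics above can be computed directly from an article's revision history. A minimal illustrative sketch, assuming revisions arrive as simple dicts (field names like "user", "anon", and "comment" mirror the MediaWiki API, but the data and the comment-based revert heuristic here are invented):

```python
def conflict_metrics(revisions):
    # Count revisions, unique editors, anonymous edits, and reverts
    # (here detected naively by "revert" appearing in the edit comment).
    n = len(revisions)
    editors = {r["user"] for r in revisions}
    anon = sum(1 for r in revisions if r.get("anon"))
    reverts = sum(1 for r in revisions
                  if "revert" in r.get("comment", "").lower())
    return {
        "revisions": n,
        "unique_editors": len(editors),
        "editors_per_revision": len(editors) / n if n else 0.0,
        "anon_pct": 100.0 * anon / n if n else 0.0,
        "reverts": reverts,
    }

# Invented sample history for illustration
revs = [
    {"user": "Alice", "comment": "copyedit"},
    {"user": "10.0.0.1", "anon": True, "comment": "add claim"},
    {"user": "Bob", "comment": "Revert to last version by Alice"},
    {"user": "Alice", "comment": "expand refs"},
]
m = conflict_metrics(revs)
print(m["revisions"], m["unique_editors"], m["reverts"])  # 4 3 1
```

In practice the same counts would be aggregated separately for the article page, its talk page, and the article/talk pair, as in the table above.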
19. Edits beget edits
– the more previous edits, the more new edits
Growth rate depends on the current population N
– r = growth rate of the population

    N(t) = N0 * e^(r*t)

    dN/dt = r * N
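The exponential model above can be checked numerically. A small sketch (parameter values are illustrative, not Wikipedia data) integrating dN/dt = r*N with Euler steps and comparing against the closed form N(t) = N0*e^(r*t):

```python
import math

# Euler integration of dN/dt = r*N, checked against the closed form.
r, n0, t_end, dt = 0.5, 1000.0, 10.0, 1e-4

n = n0
for _ in range(int(t_end / dt)):
    n += r * n * dt

exact = n0 * math.exp(r * t_end)
print(round(n), round(exact))  # the two agree to within ~0.1%
```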
20. Ecological population growth model
– r, growth rate of the population
– K, carrying capacity (due to resource limitation)

    dN/dt = r * N * (1 - N/K)

[Plot: Wikipedia population vs. year (2000-2010), rising toward the carrying capacity K.]
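For completeness, the logistic equation above has a closed-form solution, sketched here with illustrative parameters (not fitted to Wikipedia data):

```python
import math

def logistic(t, r=1.0, k=3_500_000.0, n0=10_000.0):
    # Closed-form solution of dN/dt = r*N*(1 - N/K):
    # N(t) = K / (1 + (K/N0 - 1) * exp(-r*t))
    return k / (1.0 + (k / n0 - 1.0) * math.exp(-r * t))

# Early on growth looks exponential; eventually N saturates at K.
print(round(logistic(0)))   # starts at N0
print(round(logistic(50)))  # approaches K
```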
21. Follows a logistic growth curve
[Plot: new articles over time against the fitted logistic curve.]
http://en.wikipedia.org/wiki/Wikipedia:Modelling_Wikipedia's_growth
22. Carrying capacity as a function of time
[Plot: K(t) and population vs. year, 2000-2010.]
23. Biological system
– Competition increases as the population hits the limits of the ecology
– Advantage goes to members of the population that have competitive dominance over others
Analogy
– Limited opportunities to make novel contributions
– Increased patterns of conflict and dominance
25. Highly skewed contribution pattern
– Top 3% of users contribute 50%+ of edits
– Many single-edit users
Five Editor Classes
– Based on monthly edit count
– No bots; vandalism included in the analysis
– 1000+: editors who made more than 1000 edits in that month
– 100-999
– 10-99
– 2-9
– 1
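The five classes above amount to simple threshold bucketing on monthly edit counts; a minimal sketch (the sample counts are invented):

```python
def editor_class(monthly_edits):
    # Bucket an editor by number of edits made in a given month.
    if monthly_edits >= 1000:
        return "1000+"
    if monthly_edits >= 100:
        return "100-999"
    if monthly_edits >= 10:
        return "10-99"
    if monthly_edits >= 2:
        return "2-9"
    return "1"

counts = [1, 5, 42, 250, 1800]
labels = [editor_class(c) for c in counts]
print(labels)  # ['1', '2-9', '10-99', '100-999', '1000+']
```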
26. Monthly Edits by Editor Class (in thousands)
28. Monthly Ratio of Reverted Edits
29. Two interpretations:
– Overall increased resistance from the Wikipedia community to changing content
– Disparity in the treatment of edits
» Occasional editors have been reverted at a higher rate
An example of increased patterns of conflict and dominance
Photo: http://www.flickr.com/photos/efan78/3619921561/
33. “Wikipedia is the best thing ever. Anyone in the world can write
anything they want about any subject, so you know you’re getting the
best possible information.”
– Steve Carell, The Office
34. Content in Wikipedia can be added or changed by anyone
Because of this, WP has become one of the most important resources on the web
– Hundreds of thousands of contributors
– Over 2 million articles
– 5th most-used website (Alexa.com)
Also because of this, WP is viewed with skepticism by readers, the press, and researchers
37. “Wikipedia, just by its nature, is
impossible to trust completely. I don't
think this can necessarily be
changed.”
38. Risks with using Wikipedia
– Accuracy of content
– Motives of editors
– Expertise of editors
– Stability of article
– Coverage of topics
– Quality of cited information
Insufficient information to evaluate
trustworthiness
39. Transparency of social dynamics can reduce conflict and coordination issues
Attribution encourages contribution
– WikiDashboard: Social dashboard for wikis
– Prototype system: http://wikidashboard.parc.com
A visualization for every wiki page shows the edit-history timeline and top individual editors
Users can drill down into the activity history of specific editors and view edits side-by-side
Citation: Suh et al., CHI 2008 Proceedings
43. Surfacing information
• Numerous studies mine Wikipedia revision history to surface trust-relevant information
– Adler & de Alfaro, 2007; Dondio et al., 2006; Kittur et al., 2007; Viegas et al., 2004; Zeng et al., 2006
– Suh, Chi, Kittur, & Pendleton, CHI 2008
• But how much impact can this have on user perceptions in a system which is inherently mutable?
44. Hypotheses
1. The visualization will impact perceptions of trust
2. Compared to baseline, the visualization will impact trust both positively and negatively
3. The visualization should have the most impact when there is high uncertainty about an article:
• Low quality
• High controversy
45. Design
• 3 x 2 x 2 design: visualization (high-stability viz / low-stability viz / baseline, none) x quality (high / low) x controversy (controversial / uncontroversial)

                Controversial                   Uncontroversial
High quality    Abortion; George Bush           Volcano; Shark
Low quality     Pro-life feminism;              Disk defragmenter;
                Scientology and celebrities     Beeswax
53. Method
• Users recruited via Amazon’s Mechanical Turk
– 253 participants
– 673 ratings
– 7 cents per rating
– Kittur, Chi, & Suh, CHI 2008: Crowdsourcing user studies
• To ensure salience and valid answers, participants
answered:
– In what time period was this article the least stable?
– How stable has this article been for the last month?
– Who was the last editor?
– How trustworthy do you consider the above editor?
54. Results
Main effects of quality and controversy:
• high-quality articles > low-quality articles (F(1, 425) = 25.37, p < .001)
• uncontroversial articles > controversial articles (F(1, 425) = 4.69, p = .031)
55. Results
Interaction effect of quality and controversy:
• high-quality articles were rated equally trustworthy whether controversial or not, while
• low-quality articles were rated lower when they were controversial than when they were uncontroversial
56. Results
1. Significant effect of visualization
– High > low, p < .001
2. Viz has both positive and negative effects
– High > baseline, p < .001
– Low < baseline, p < .01
3. No interaction of visualization with either quality or controversy
– Robust across conditions
60. User studies
• Getting input from users is important in HCI
– surveys
– rapid prototyping
– usability tests
– cognitive walkthroughs
– performance measures
– quantitative ratings
61. User studies
• Getting input from users is expensive
– Time costs
– Monetary costs
• Often have to trade off costs with sample size
62. Online solutions
• Online user surveys
• Remote usability testing
• Online experiments
• But still have difficulties
– Rely on practitioner for recruiting participants
– Limited pool of participants
63. Crowdsourcing
• Make tasks available for anyone online to complete
• Quickly access a large user pool, collect data, and
compensate users
• Experiences at PARC:
– CSL UbiComp group
– ISL’s NLTT group
64. Crowdsourcing
• Make tasks available for anyone online to complete
• Quickly access a large user pool, collect data, and
compensate users
• Example: NASA Clickworkers
– 100k+ volunteers identified Mars craters from
space photographs
– Aggregate results “virtually indistinguishable” from
expert geologists
[Figure: crater identifications by experts vs. crowds]
http://clickworkers.arc.nasa.gov
65. Amazon’s Mechanical Turk
• Market for “human intelligence tasks”
• Typically short, objective tasks
– Tag an image
– Find a webpage
– Evaluate relevance of search results
• Users complete tasks for a few pennies each
67. Using Mechanical Turk for user studies

                    Traditional user studies        Mechanical Turk
Task complexity     Complex, long                   Simple, short
Task subjectivity   Subjective, opinions            Objective, verifiable
User information    Targeted demographics,          Unknown demographics,
                    high interactivity              limited interactivity

Can Mechanical Turk be usefully used for user studies?
68. Task
• Assess quality of Wikipedia articles
• Started with ratings from expert Wikipedians
– 14 articles (e.g., “Germany”, “Noam Chomsky”)
– 7-point scale
• Can we get matching ratings with Mechanical Turk?
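The comparison that follows boils down to a Pearson correlation between per-article mean Turker ratings and expert ratings. A self-contained sketch (all ratings invented for illustration):

```python
def pearson_r(xs, ys):
    # Pearson correlation coefficient between two equal-length lists.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

expert = [6, 5, 3, 2, 7, 4]              # Wikipedian ratings, 7-point scale
turker = [5.5, 4.8, 3.2, 2.9, 6.1, 4.4]  # mean Turker rating per article
r = pearson_r(expert, turker)
print(round(r, 2))
```

(A significance test on r, as reported in the results, would additionally require the t-distribution; the coefficient itself is the core comparison.)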
69. Experiment 1
• Rate articles on 7-point scales:
– Well written
– Factually accurate
– Overall quality
• Free-text input:
– What improvements does the article need?
• Paid $0.05 each
70. Experiment 1: Good news
• 58 users made 210 ratings (15 per article)
– $10.50 total
• Fast results
– 44% within a day, 100% within two days
– Many completed within minutes
71. Experiment 1: Bad news
• Correlation between Turkers and Wikipedians only marginally significant (r = .50, p = .07)
• Worse, 59% potentially invalid responses

                    Experiment 1
Invalid comments    49%
<1 min responses    31%

• Nearly 75% of these were done by only 8 users
72. Not a good start
• Summary so far:
– Only marginal correlation with experts
– Heavy gaming of the system by a minority
• Possible responses:
– Make sure these gamers are not rewarded
– Ban them from doing your HITs in the future
– Create a reputation system [Dolores Labs]
• Can we change how we collect user input?
73. Design changes
• Use verifiable questions to signal monitoring
– “How many sections does the article have?”
– “How many images does the article have?”
– “How many references does the article have?”
74. Design changes
• Use verifiable questions to signal monitoring
• Make malicious answers as high cost as
good-faith answers
– “Provide 4-6 keywords that would give someone a
good summary of the contents of the article”
75. Design changes
• Use verifiable questions to signal monitoring
• Make malicious answers as high cost as
good-faith answers
• Make verifiable answers useful for completing
task
– Used tasks similar to how Wikipedians described
evaluating quality (organization, presentation,
references)
76. Design changes
• Use verifiable questions to signal monitoring
• Make malicious answers as high cost as
good-faith answers
• Make verifiable answers useful for completing
task
• Put verifiable tasks before subjective
responses
– First do objective tasks and summarization
– Only then evaluate subjective quality
– Ecological validity?
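The screening strategy above can be sketched as a validity filter: check each response's answers to the verifiable questions against known ground truth, along with its time on task. Field names, thresholds, and sample data here are hypothetical:

```python
# Ground-truth answers to the verifiable questions for one article
# (counts invented for illustration).
GROUND_TRUTH = {"sections": 7, "images": 3, "references": 12}

def is_valid(response, min_seconds=60, max_errors=1):
    # A response is kept if it spent a plausible amount of time and
    # got (nearly) all verifiable questions right.
    errors = sum(1 for q, truth in GROUND_TRUTH.items()
                 if response["answers"].get(q) != truth)
    return response["seconds"] >= min_seconds and errors <= max_errors

responses = [
    {"answers": {"sections": 7, "images": 3, "references": 12}, "seconds": 240},
    {"answers": {"sections": 1, "images": 0, "references": 0}, "seconds": 35},
]
flags = [is_valid(r) for r in responses]
print(flags)  # [True, False]
```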
77. Experiment 2: Results
• 124 users provided 277 ratings (~20 per article)
• Significant positive correlation with Wikipedians (r = .66, p = .01)
• Smaller proportion of malicious responses
• Increased time on task

                    Experiment 1    Experiment 2
Invalid comments    49%             3%
<1 min responses    31%             7%
Median time         1:30            4:06
78. Generalizing to other user studies
• Combine objective and subjective questions
– Rapid prototyping: ask verifiable questions about
content/design of prototype before subjective
evaluation
– User surveys: ask common-knowledge questions
before asking for opinions
79. Limitations of mechanical turk
• No control of users’ environment
– Potential for different browsers, physical
distractions
– General problem with online experimentation
• Not designed for user studies
– Difficult to do between-subjects design
– Involves some programming
• Users
– Uncertainty about user demographics, expertise
80. Conclusion
• Mechanical Turk offers the practitioner a way to
access a large user pool and quickly collect data at
low cost
• Good results require careful task design
1. Use verifiable questions to signal monitoring
2. Make malicious answers as high cost as good-faith
answers
3. Make verifiable answers useful for completing task
4. Put verifiable tasks before subjective responses
81. Ed H. Chi (manager, PS)
Peter Pirolli (RF)
Lichan Hong
Bongwon Suh
Les Nelson
Rowan Nairn
Gregorio Convertino
Interns/Collaborators: Sanjay Kairam, Jilin Chen (UMinn), Michael Bernstein (MIT)
http://asc-parc.blogspot.com
83. Logistic model: r, growth rate; K, carrying capacity

    dN/dt = rN(1 - N/K)

– r dominates when N is small: (1 - N/K) ≈ 1
– K dominates when N ≈ K: (1 - N/K) ≈ 0

[Plot: population vs. year (2000-2010); r-dominated early growth, K-dominated plateau.]
84. r-Strategist
– Growth or exploitation
– Less-crowded niches / produce many offspring
K-Strategist
– Conservation
– Strong competitors in crowded niches / invest more heavily in
fewer offspring
Evolution cycle
– Resilience of an ecological system
– Gunderson & Holling 2001
85. Exponential growth model
– Growth rate depends on the current N

    dN/dt = r * N

Ecological population growth model
– r, growth rate of the population
– K, carrying capacity (due to resource limitation)

    dN/dt = r * N * (1 - N/K)
86. People-ware
– Growing resistance to changing content
– Coordination costs and bureaucracy
Knowledge-ware
– Availability of easy topics to write about
Tool-ware
– Quality of tools used by editors and admins
Photos:
http://www.aerostich.com/
http://www.mikestreetmedia.co.uk/blog/wp-content/uploads/2009/01/knowledge.jpg
http://youropenbook.agitprop.co.uk/growing.php?p=2