Cheap and Fast - But is it Good?
Evaluating Nonexpert Annotations for Natural Language Tasks

Rion Snow, Brendan O’Connor, Daniel Jurafsky, Andrew Y. Ng
The primacy of data
(Banko and Brill, 2001): Scaling to Very Very Large Corpora for Natural Language Disambiguation
Datasets drive research
• Penn Treebank: statistical parsing
• PropBank: semantic role labeling
• WordNet, SemCor: word sense disambiguation
• Switchboard: speech recognition
• Pascal RTE: textual entailment
• UN Parallel Text: statistical machine translation
The advent of human computation



• Open Mind Common Sense (Singh et al., 2002)
• Games with a Purpose (von Ahn and Dabbish, 2004)
• Online Word Games (Vickrey et al., 2008)
Amazon Mechanical Turk
   But what if your task isn’t “fun”?




            mturk.com
Using AMT for dataset creation
•   Su et al. (2007): name resolution, attribute extraction

•   Nakov (2008): paraphrasing noun compounds

•   Kaisser and Lowe (2008): sentence-level QA annotation

•   Kaisser et al. (2008): customizing QA summary length

•   Zaenen (2008): evaluating RTE agreement
Using AMT is cheap
 Paper                     Labels   Cents/Label
 Su et al. (2007)          10,500   1.5
 Nakov (2008)              19,018   unreported
 Kaisser and Lowe (2008)   24,321   2.0
 Kaisser et al. (2008)     45,300   3.7
 Zaenen (2008)             4,000    2.0
And it’s fast...




   blog.doloreslabs.com
But is it good?
• Objective: compare nonexpert annotation
  quality on NLP tasks with gold standard,
  expert-annotated data
• Method: pick 5 standard datasets, and
  relabel each point with 10 new annotations
• Compare Turk agreement to dataset with
  reported expert interannotator agreement
Tasks
• Affect recognition: fear(“Tropical storm forms in Atlantic”) > fear(“Goal delight for Sheva”)
 •   Strapparava and Mihalcea (2007)

• Word Similarity: sim(boy, lad) > sim(rooster, noon)
 •   Miller and Charles (1991)

• Textual Entailment: if “Microsoft was established in Italy in 1985”, then “Microsoft was established in 1985” ?
 •   Dagan et al. (2006)

• WSD: “a bass on the line” vs. “a funky bass line”
 •   Pradhan et al. (2007)

• Temporal Annotation: ran happens before fell in: “The horse ran past the barn fell.”
 •   Pustejovsky et al. (2003)
Tasks
 Task                  Expert Labelers   Unique Examples   Interannotator Agreement   Answer Type
 Affect Recognition    6                 700               0.603                      numeric
 Word Similarity       1                 30                0.958                      numeric
 Textual Entailment    1                 800               0.91                       binary
 Temporal Annotation   1                 462               Unknown                    binary
 WSD                   1                 177               Unknown                    ternary
Affect Recognition
Interannotator Agreement
•   6 total experts.
•   One expert’s ITA is calculated as the average of Pearson correlations from each annotator to the avg. of the other 5 annotators.

 Emotion    1-E ITA
 Anger      0.459
 Disgust    0.583
 Fear       0.711
 Joy        0.596
 Sadness    0.645
 Surprise   0.464
 Valence    0.844
 All        0.603
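
The leave-one-out agreement described on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names (`pearson`, `leave_one_out_ita`) and the toy ratings are assumptions made for the example, and only three made-up annotators are used instead of six.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)

def leave_one_out_ita(ratings):
    """ratings[a][i] is annotator a's score for item i.

    For each annotator, correlate their scores with the per-item average
    of all the other annotators, then average those correlations.
    """
    n_items = len(ratings[0])
    correlations = []
    for a, own in enumerate(ratings):
        others = [r for b, r in enumerate(ratings) if b != a]
        avg_others = [sum(r[i] for r in others) / len(others)
                      for i in range(n_items)]
        correlations.append(pearson(own, avg_others))
    return sum(correlations) / len(correlations)

# Hypothetical ratings: 3 annotators x 4 headlines (made-up numbers).
toy_ratings = [
    [10, 40, 80, 20],
    [15, 35, 90, 25],
    [5, 50, 70, 30],
]
ita = leave_one_out_ita(toy_ratings)
print(f"leave-one-out ITA: {ita:.3f}")
```

Holding each annotator out of the average they are compared against keeps an annotator from being correlated with their own scores.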
Nonexpert ITA
We average over k
annotations to create a
single “proto-labeler”.

We plot the ITA of this
proto-labeler for up to
10 annotations and
compare to the average
single expert ITA.
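
As a rough sketch of the proto-labeler idea, the snippet below averages the first k nonexpert scores per item and correlates the result with a reference labeling. Taking the first k annotations (rather than, say, averaging over many random subsets of size k) is a simplification made here, and the names `proto_labeler`, `ita_curve` and all data values are hypothetical.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)

def proto_labeler(annotations, k):
    """Average the first k nonexpert scores for every item.

    annotations[i] is the list of nonexpert scores collected for item i.
    """
    return [sum(scores[:k]) / k for scores in annotations]

def ita_curve(annotations, reference, max_k):
    """Correlation of the k-annotation proto-labeler with the reference, k = 1..max_k."""
    return [pearson(proto_labeler(annotations, k), reference)
            for k in range(1, max_k + 1)]

# Made-up data: 4 items, 3 nonexpert scores each, plus a reference score per item.
annotations = [[30, 0, 0], [20, 60, 70], [80, 100, 90], [40, 20, 15]]
reference = [0, 50, 100, 25]

curve = ita_curve(annotations, reference, max_k=3)
for k, r in enumerate(curve, start=1):
    print(f"k={k}: correlation {r:.3f}")
```

On this toy data the correlation rises as more annotations are averaged, which is the shape of curve the slide describes plotting against the single-expert ITA.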
Interannotator Agreement
[Figure: correlation vs. number of nonexpert annotators (up to 10), one panel each for anger, disgust, fear, joy, sadness, and surprise]

 Emotion    1-E ITA   10-N ITA
 Anger      0.459     0.675
 Disgust    0.583     0.746
 Fear       0.711     0.689
 Joy        0.596     0.632
 Sadness    0.645     0.776
 Surprise   0.464     0.496
 Valence    0.844     0.669
 All        0.603     0.694

Number of nonexpert annotators required to match expert ITA, on average: 4
Interannotator Agreement
[Figure: agreement vs. number of nonexpert annotators (up to 10); panels: word similarity (correlation), RTE (accuracy), before/after (accuracy), WSD (accuracy)]

 Task                  1-E ITA   10-N ITA
 Affect Recognition    0.603     0.694
 Word Similarity       0.958     0.952
 Textual Entailment    0.91      0.897
 Temporal Annotation   Unknown   0.940
 WSD                   Unknown   0.994
Error Analysis: WSD
only 1 “mistake” out of 177 labels:

“The Egyptian president said he would visit Libya today...”

SemEval Task 17 marks this as “executive officer of a firm” sense, while Turkers voted for “head of a country” sense.
Error Analysis: RTE
~10 disagreements out of 100:
•   Bob Carpenter: “Over half of the residual disagreements between the Turker annotations and the gold standard were of this highly suspect nature and some were just wrong.”

•   Bob Carpenter’s full analysis available at “Fool’s Gold Standard”, http://lingpipe-blog.com/

Close Examples

T: A car bomb that exploded outside a U.S. military base near Beiji, killed 11 Iraqis.
H: A car bomb exploded outside a U.S. base in the northern town of Beiji, killing 11 Iraqis.
Labeled “TRUE” in PASCAL RTE-1, Turkers vote 6-4 “FALSE”.

T: “Google files for its long awaited IPO.”
H: “Google goes public.”
Labeled “TRUE” in PASCAL RTE-1, Turkers vote 6-4 “FALSE”.
Weighting Annotators
 • There are a small number of very prolific, very noisy annotators. If we plot each annotator:

   [Figure: accuracy (y-axis) vs. number of annotations (x-axis) for each annotator. Task: RTE]

• We should be able to do better than majority voting.
Weighting Annotators
• To infer the true value x_i, we weight each response y_i from annotator w using a small gold standard training set:




•   We estimate annotator response from 5% of the gold
    standard test set, and evaluate with 20-fold CV.
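
The weighting formula itself did not survive in this transcript, so the sketch below shows one plausible scheme rather than the slide's equation: a naive-Bayes-style vote in which each annotator's response is weighted by a log-likelihood ratio estimated on a small gold set. The add-one smoothing, the uniform class prior, the function names and the toy data are all assumptions of this example, and the 20-fold cross-validation is not reproduced.

```python
from collections import defaultdict
from math import log

def fit_annotator_weights(gold, responses):
    """Estimate per-annotator log-likelihood ratios from gold-labeled items.

    gold: {item: bool}; responses: {item: {annotator: bool}}.
    Returns {annotator: (weight of a True vote, weight of a False vote)}.
    """
    # counts[w][truth][vote], started at 1 for add-one smoothing (assumption).
    counts = defaultdict(lambda: {True: {True: 1, False: 1},
                                  False: {True: 1, False: 1}})
    for item, truth in gold.items():
        for w, vote in responses.get(item, {}).items():
            counts[w][truth][vote] += 1
    weights = {}
    for w, c in counts.items():
        p_yes_if_true = c[True][True] / (c[True][True] + c[True][False])
        p_yes_if_false = c[False][True] / (c[False][True] + c[False][False])
        weights[w] = (log(p_yes_if_true / p_yes_if_false),
                      log((1 - p_yes_if_true) / (1 - p_yes_if_false)))
    return weights

def weighted_vote(votes, weights):
    """Sum log-likelihood ratios; annotators unseen in training get weight 0."""
    score = 0.0
    for w, vote in votes.items():
        yes_weight, no_weight = weights.get(w, (0.0, 0.0))
        score += yes_weight if vote else no_weight
    return score > 0  # uniform class prior assumed

def majority_vote(votes):
    """Naive voting baseline: True if more than half the votes are True."""
    return sum(votes.values()) * 2 > len(votes)

# Toy gold set: one reliable annotator and two who are consistently wrong.
gold = {"a": True, "b": True, "c": False, "d": False}
responses = {item: {"good": truth, "bad1": not truth, "bad2": not truth}
             for item, truth in gold.items()}
weights = fit_annotator_weights(gold, responses)

new_votes = {"good": True, "bad1": False, "bad2": False}
print("naive voting:   ", majority_vote(new_votes))
print("gold calibrated:", weighted_vote(new_votes, weights))
```

In this toy case the two unreliable annotators outvote the reliable one under naive voting, while the calibrated vote treats their consistent errors as evidence for the opposite label.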
Weighting Annotators
[Figure: accuracy vs. number of annotators for RTE and before/after (temporal), comparing gold calibrated voting with naive voting]

RTE: 4.0% avg. accuracy increase
Temporal: 3.4% avg. accuracy increase

• Several follow-up posts at http://lingpipe-blog.com
Cost Summary
 Task                  Total Labels   Cost in USD   Time in hours   Labels / USD   Labels / Hour
 Affect Recognition    7000           $2.00         5.93            3500           1180.4
 Word Similarity       300            $0.20         0.17            1500           1724.1
 Textual Entailment    8000           $8.00         89.3            1000           89.59
 Temporal Annotation   4620           $13.86        39.9            333.3          115.85
 WSD                   1770           $1.76         8.59            1005.7         206.1
 All                   21690          $25.82        143.9           840.0          150.7
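
The two derived columns are simple ratios of the first three, which the short check below recomputes from the printed values. The dictionary layout and variable names are assumptions of this sketch; small differences from the slide's last column are to be expected where the printed hours are rounded.

```python
# (total labels, cost in USD, time in hours), copied from the Cost Summary table.
rows = {
    "Affect Recognition": (7000, 2.00, 5.93),
    "Word Similarity": (300, 0.20, 0.17),
    "Textual Entailment": (8000, 8.00, 89.3),
    "Temporal Annotation": (4620, 13.86, 39.9),
    "WSD": (1770, 1.76, 8.59),
}

# Derived columns: labels per dollar and labels per hour.
per_usd = {task: labels / usd for task, (labels, usd, _) in rows.items()}
per_hour = {task: labels / hours for task, (labels, _, hours) in rows.items()}

# Totals for the "All" row.
total_labels = sum(labels for labels, _, _ in rows.values())
total_usd = sum(usd for _, usd, _ in rows.values())
total_hours = sum(hours for _, _, hours in rows.values())

for task in rows:
    print(f"{task:<20} {per_usd[task]:>8.1f} labels/USD {per_hour[task]:>8.1f} labels/hour")
print(f"{'All':<20} {total_labels / total_usd:>8.1f} labels/USD "
      f"{total_labels / total_hours:>8.1f} labels/hour")
```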
In Summary
   • All collected data and annotator
      instructions are available at:
      http://ai.stanford.edu/~rion/annotations




   • Summary blog post and comments on
      the Dolores Labs blog:
      http://blog.doloreslabs.com




nlp.stanford.edu    doloreslabs.com     ai.stanford.edu
Supplementary Slides
Training systems on nonexpert annotations
• A simple affect recognition classifier trained
  on the averaged nonexpert votes
  outperforms one trained on a single expert
  annotation
Where are Turkers?
          United States                       77.1%
              India                            5.3%
           Philippines                         2.8%
            Canada                             2.8%
               UK                              1.9%
            Germany                            0.8%
              Italy                            0.5%
          Netherlands                          0.5%
            Portugal                           0.5%
            Australia                          0.4%

          Remaining 7.3% divided among 78 countries / territories

                         Analysis by Dolores Labs
Who are Turkers?

[Charts: gender, age, education, and annual income]

“Mechanical Turk: The Demographics”, Panos Ipeirotis, NYU
behind-the-enemy-lines.blogspot.com
Why are Turkers?

A. To Kill Time
B. Fruitful way to spend free time
C. Income purposes
D. Pocket change/extra cash
E. For entertainment
F. Challenge, self-competition
G. Unemployed, no regular job, part-time job
H. To sharpen/ To keep mind sharp
I. Learn English




      “Why People Participate on Mechanical Turk, Now Tabulated”, Panos Ipeirotis, NYU
                   behind-the-enemy-lines.blogspot.com
How much does AMT pay?




      “How Much Turking Pays?”, Panos Ipeirotis, NYU
     behind-the-enemy-lines.blogspot.com
Annotation Guidelines: Affective Text
Annotation Guidelines: Word Similarity
Annotation Guidelines: Textual Entailment
Annotation Guidelines: Temporal Ordering
Annotation Guidelines: Word Sense Disambiguation
Affect Recognition

We label 100 headlines for each of 7 emotions
We pay 4 cents for 20 headlines (140 total labels)
Total Cost: $2.00
Time to complete: 5.94 hrs
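
The figures on this slide can be tied together with a little arithmetic. The sketch below assumes 10 nonexpert annotations per headline, taken from the method slide ("relabel each point with 10 new annotations"), and assumes each 4-cent payment covers one batch of 20 headlines; variable names are illustrative only.

```python
headlines = 100
emotions = 7
headlines_per_batch = 20
cents_per_batch = 4
annotations_per_headline = 10  # assumption: taken from the method slide

# One batch covers 20 headlines x 7 emotions.
labels_per_batch = headlines_per_batch * emotions

# Every headline is rated by 10 different annotators.
batches = (headlines // headlines_per_batch) * annotations_per_headline
total_labels = headlines * emotions * annotations_per_headline
total_cents = batches * cents_per_batch

print(f"labels per batch: {labels_per_batch}")
print(f"batches paid:     {batches}")
print(f"total labels:     {total_labels}")
print(f"total cost:       ${total_cents / 100:.2f}")
```

Under these assumptions the numbers agree with the 140 labels per payment and the $2.00 total on the slide, and with the 7000 affect labels in the cost summary.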
Example Task: Word Similarity

30 word pairs (Rubenstein and Goodenough, xxxx)

We pay 10 Turkers 2 cents apiece to score all 30 word pairs

Total cost: $0.20
Time to complete: 10.4 minutes
Word Similarity ITA
[Figure: correlation (0.84 to 0.96) vs. number of annotations (up to 10)]
• Comparison against multiple annotators
• (graphs)
• avg. number of nonexperts : expert = 4
Datasets lead the way
WSJ + syntactic annotation = Penn TreeBank enables Statistical parsing

Brown corpus + sense labeling = Semcor => WSD

TreeBank + role labels = PropBank => SRL

political speeches + translations = United Nations parallel corpora => statistical machine translation

more: RTE, Timebank, ACE/MUC, etc...
Datasets drive research
• Penn Treebank: statistical parsing
• PropBank: semantic role labeling
• WordNet, SemCor: word sense disambiguation
• Switchboard: speech recognition
• Enron E-mail Corpus: social network analysis
• UN Parallel Text: statistical MT
• Pascal RTE: textual entailment
  • 2. The primacy of data (Banko and Brill, 2001): Scaling to Very Very Large Corpora for Natural Language Disambiguation
  • 3. Datasets drive research
      Penn Treebank => statistical parsing
      PropBank => semantic role labeling
      WordNet, SemCor => word sense disambiguation
      Switchboard => speech recognition
      Pascal RTE => textual entailment
      UN Parallel Text => statistical machine translation
  • 4. The advent of human computation • Open Mind Common Sense (Singh et al., 2002) • Games with a Purpose (von Ahn and Dabbish, 2004) • Online Word Games (Vickrey et al., 2008)
  • 5. Amazon Mechanical Turk But what if your task isn’t “fun”? mturk.com
  • 6. Using AMT for dataset creation • Su et al. (2007): name resolution, attribute extraction • Nakov (2008): paraphrasing noun compounds • Kaisser and Lowe (2008): sentence-level QA annotation • Kaisser et al. (2008): customizing QA summary length • Zaenen (2008): evaluating RTE agreement
  • 7. Using AMT is cheap
      Paper                     Labels   Cents/Label
      Su et al. (2007)          10,500   1.5
      Nakov (2008)              19,018   unreported
      Kaisser and Lowe (2008)   24,321   2.0
      Kaisser et al. (2008)     45,300   3.7
      Zaenen (2008)              4,000   2.0
  • 8. And it’s fast... blog.doloreslabs.com
  • 9. But is it good? • Objective: compare nonexpert annotation quality on NLP tasks with gold standard, expert-annotated data • Method: pick 5 standard datasets, and relabel each point with 10 new annotations • Compare Turk agreement to dataset with reported expert interannotator agreement
  • 10. Tasks
      • Affect recognition (Strapparava and Mihalcea, 2007): fear(“Tropical storm forms in Atlantic”) > fear(“Goal delight for Sheva”)
      • Word similarity (Miller and Charles, 1991): sim(boy, lad) > sim(rooster, noon)
      • Textual entailment (Dagan et al., 2006): if “Microsoft was established in Italy in 1985”, then “Microsoft was established in 1985”?
      • WSD (Pradhan et al., 2007): “a bass on the line” vs. “a funky bass line”
      • Temporal annotation (Pustejovsky et al., 2003): ran happens before fell in “The horse ran past the barn fell.”
  • 11. Tasks
      Task                  Expert Labelers   Unique Examples   Interannotator Agreement   Answer Type
      Affect Recognition    6                 700               0.603                      numeric
      Word Similarity       1                 30                0.958                      numeric
      Textual Entailment    1                 800               0.91                       binary
      Temporal Annotation   1                 462               Unknown                    binary
      WSD                   1                 177               Unknown                    ternary
  • 13. Interannotator Agreement
      Emotion    1-E ITA
      Anger      0.459
      Disgust    0.583
      Fear       0.711
      Joy        0.596
      Sadness    0.645
      Surprise   0.464
      Valence    0.844
      All        0.603
      • 6 total experts.
      • One expert’s ITA is calculated as the average of Pearson correlations from each annotator to the average of the other 5 annotators.
  • 14. Nonexpert ITA We average over k annotations to create a single “proto-labeler”. We plot the ITA of this proto-labeler for up to 10 annotations and compare to the average single expert ITA.
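The two ITA definitions on slides 13 and 14 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names (`pearson`, `expert_ita`, `proto_labeler_ita`) are hypothetical, the proto-labeler takes the first k annotations per item rather than any particular sampling scheme, and it is scored against a gold reference list, which is one reading of the comparison the slides describe.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length score lists (both must vary)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def expert_ita(ratings):
    """ratings[a][i] is the score expert a gave item i.

    Each expert is correlated with the item-wise average of the other
    experts; the result is the mean of those correlations.
    """
    correlations = []
    for a, own in enumerate(ratings):
        others = [r for b, r in enumerate(ratings) if b != a]
        avg_others = [sum(col) / len(col) for col in zip(*others)]
        correlations.append(pearson(own, avg_others))
    return sum(correlations) / len(correlations)


def proto_labeler_ita(nonexpert, gold, k):
    """nonexpert[i] holds the nonexpert scores for item i; gold[i] is the reference.

    Averages the first k scores per item into one "proto-labeler" (assumption:
    first k, no resampling) and correlates it with the reference scores.
    """
    proto = [sum(scores[:k]) / k for scores in nonexpert]
    return pearson(proto, gold)
```

Calling `proto_labeler_ita` for k = 1 through 10 would give the kind of curve the slides plot against the single-expert ITA.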
  • 15. Interannotator Agreement
      Emotion    1-E ITA   10-N ITA
      Anger      0.459     0.675
      Disgust    0.583     0.746
      Fear       0.711     0.689
      Joy        0.596     0.632
      Sadness    0.645     0.776
      Surprise   0.464     0.496
      Valence    0.844     0.669
      All        0.603     0.694
      [Figure: correlation vs. number of annotators, one panel each for anger, disgust, fear, joy, sadness, surprise]
      Number of nonexpert annotators required to match expert ITA, on average: 4
  • 16. Interannotator Agreement
      Task                  1-E ITA   10-N ITA
      Affect Recognition    0.603     0.694
      Word Similarity       0.958     0.952
      Textual Entailment    0.91      0.897
      Temporal Annotation   -         0.940
      WSD                   -         0.994
      [Figure: agreement vs. number of annotators; panels: word similarity (correlation), RTE (accuracy), before/after (accuracy), WSD (accuracy)]
  • 17. Error Analysis: WSD only 1 “mistake” out of 177 labels: “The Egyptian president said he would visit Libya today...” Semeval Task 17 marks this as “executive officer of a firm” sense, while Turkers voted for “head of a country” sense.
  • 18. Error Analysis: RTE ~10 disagreements out of 100:
      • Bob Carpenter: “Over half of the residual disagreements between the Turker annotations and the gold standard were of this highly suspect nature and some were just wrong.”
      • Bob Carpenter’s full analysis available at “Fool’s Gold Standard”, http://lingpipe-blog.com/
      Close Examples
      • T: “A car bomb that exploded outside a U.S. military base near Beiji, killed 11 Iraqis.” H: “A car bomb exploded outside a U.S. base in the northern town of Beiji, killing 11 Iraqis.” Labeled “TRUE” in PASCAL RTE-1, Turkers vote 6-4 “FALSE”.
      • T: “Google files for its long awaited IPO.” H: “Google goes public.” Labeled “TRUE” in PASCAL RTE-1, Turkers vote 6-4 “FALSE”.
  • 19. Weighting Annotators • There are a small number of very prolific, very noisy annotators. If we plot each annotator: [Figure: per-annotator accuracy vs. number of annotations, task: RTE] • We should be able to do better than majority voting.
  • 20. Weighting Annotators • To infer the true value x_i, we weight each response y_i from annotator w using a small gold standard training set: [equation on slide] • We estimate annotator response from 5% of the gold standard test set, and evaluate with 20-fold CV.
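The weighting formula itself does not survive in this transcript, so the following is only a sketch of one common way to weight annotators against a small gold set: a naive-Bayes-style log-odds vote for binary labels. Whether it matches the slide's formula term for term is an assumption, and the names `fit_annotators` and `weighted_vote`, the add-one smoothing, and the uniform prior are all hypothetical choices made for illustration.

```python
from collections import defaultdict
from math import log


def fit_annotators(calibration, smoothing=1.0):
    """Estimate P(annotator says True | gold label) from gold-labeled items.

    calibration: iterable of (annotator, response, gold) with boolean labels.
    Add-one smoothing (an assumption) keeps every probability strictly
    between 0 and 1, so the logs below are always defined.
    """
    said_true = defaultdict(lambda: {True: 0.0, False: 0.0})
    seen = defaultdict(lambda: {True: 0.0, False: 0.0})
    for annotator, response, gold in calibration:
        seen[annotator][gold] += 1
        if response:
            said_true[annotator][gold] += 1
    return {
        a: {g: (said_true[a][g] + smoothing) / (seen[a][g] + 2 * smoothing)
            for g in (True, False)}
        for a in seen
    }


def weighted_vote(responses, model, prior_true=0.5):
    """responses: list of (annotator, response). Returns the inferred label.

    Each response adds its log-likelihood ratio; annotators missing from
    the calibration model are skipped. A tie (log-odds of 0) yields False.
    """
    log_odds = log(prior_true / (1 - prior_true))
    for annotator, response in responses:
        if annotator not in model:
            continue
        p_if_true = model[annotator][True]    # P(says True | gold True)
        p_if_false = model[annotator][False]  # P(says True | gold False)
        if response:
            log_odds += log(p_if_true / p_if_false)
        else:
            log_odds += log((1 - p_if_true) / (1 - p_if_false))
    return log_odds > 0
```

Under this scheme an annotator who answers True regardless of the gold label contributes a ratio of 1 (log-odds 0), so a single reliable annotator can outvote several such uninformative ones, which plain majority voting cannot do.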
  • 21. Weighting Annotators [Figure: accuracy vs. number of annotators for RTE and before/after, gold calibrated vs. naive voting] • RTE: 4.0% avg. accuracy increase • Temporal: 3.4% avg. accuracy increase • Several follow-up posts at http://lingpipe-blog.com
  • 22. Cost Summary
      Task                  Total Labels   Cost in USD   Time in hours   Labels / USD   Labels / Hour
      Affect Recognition    7000           $2.00         5.93            3500           1180.4
      Word Similarity       300            $0.20         0.17            1500           1724.1
      Textual Entailment    8000           $8.00         89.3            1000           89.59
      Temporal Annotation   4620           $13.86        39.9            333.3          115.85
      WSD                   1770           $1.76         8.59            1005.7         206.1
      All                   21690          $25.82        143.9           840.0          150.7
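The two derived columns in the cost summary are simple ratios, which the short sketch below recomputes from the slide's label counts, costs, and times. The `throughput` helper and the `COSTS` dictionary are hypothetical names introduced here. A couple of the slide's per-hour figures (Word Similarity, Temporal Annotation) differ slightly from this recomputation, presumably because the displayed hours are rounded.

```python
def throughput(labels, cost_usd, hours):
    """Return (labels per USD, labels per hour) for one task."""
    return labels / cost_usd, labels / hours


# (total labels, cost in USD, time in hours), copied from the slide.
COSTS = {
    "Affect Recognition": (7000, 2.00, 5.93),
    "Word Similarity": (300, 0.20, 0.17),
    "Textual Entailment": (8000, 8.00, 89.3),
    "Temporal Annotation": (4620, 13.86, 39.9),
    "WSD": (1770, 1.76, 8.59),
}

# Totals across all five tasks, used for the "All" row.
total_labels = sum(row[0] for row in COSTS.values())
total_cost = sum(row[1] for row in COSTS.values())
total_hours = sum(row[2] for row in COSTS.values())

if __name__ == "__main__":
    for task, row in COSTS.items():
        per_usd, per_hour = throughput(*row)
        print(f"{task:20s} {per_usd:8.1f} labels/USD {per_hour:8.1f} labels/hour")
    per_usd, per_hour = throughput(total_labels, total_cost, total_hours)
    print(f"{'All':20s} {per_usd:8.1f} labels/USD {per_hour:8.1f} labels/hour")
```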
  • 23. In Summary • All collected data and annotator instructions are available at: http://ai.stanford.edu/~rion/annotations • Summary blog post and comments on the Dolores Labs blog: http://blog.doloreslabs.com nlp.stanford.edu doloreslabs.com ai.stanford.edu
  • 25. Training systems on nonexpert annotations • A simple affect recognition classifier trained on the averaged nonexpert votes outperforms one trained on a single expert annotation
  • 26. Where are Turkers?
      United States   77.1%
      India            5.3%
      Philippines      2.8%
      Canada           2.8%
      UK               1.9%
      Germany          0.8%
      Italy            0.5%
      Netherlands      0.5%
      Portugal         0.5%
      Australia        0.4%
      Remaining 7.3% divided among 78 countries / territories
      Analysis by Dolores Labs
  • 27. Who are Turkers? Gender Age Education Annual income “Mechanical Turk: The Demographics”, Panos Ipeirotis, NYU behind-the-enemy-lines.blogspot.com
  • 28. Why are Turkers?
      A. To kill time
      B. Fruitful way to spend free time
      C. Income purposes
      D. Pocket change/extra cash
      E. For entertainment
      F. Challenge, self-competition
      G. Unemployed, no regular job, part-time job
      H. To sharpen/to keep mind sharp
      I. Learn English
      “Why People Participate on Mechanical Turk, Now Tabulated”, Panos Ipeirotis, NYU behind-the-enemy-lines.blogspot.com
  • 29. How much does AMT pay? “How Much Turking Pays?”, Panos Ipeirotis, NYU behind-the-enemy-lines.blogspot.com
  • 30. Annotation Guidelines: Affective Text
  • 31. Annotation Guidelines: Word Similarity
  • 35. Affect Recognition • We label 100 headlines for each of 7 emotions • We pay 4 cents for 20 headlines (140 total labels) • Total cost: $2.00 • Time to complete: 5.94 hrs
  • 36. Example Task: Word Similarity • 30 word pairs (Rubenstein and Goodenough, xxxx) • We pay 10 Turkers 2 cents apiece to score all 30 word pairs • Total cost: $0.20 • Time to complete: 10.4 minutes
  • 37. Word Similarity ITA [Figure: correlation vs. number of annotations]
  • 38. • Comparison against multiple annotators • (graphs) • avg. number of nonexperts : expert = 4
  • 39. Datasets lead the way
      • WSJ + syntactic annotation = Penn TreeBank => statistical parsing
      • Brown corpus + sense labeling = SemCor => WSD
      • TreeBank + role labels = PropBank => SRL
      • political speeches + translations = United Nations parallel corpora => statistical machine translation
      • more: RTE, TimeBank, ACE/MUC, etc.
  • 40. Datasets drive research
      Penn Treebank => statistical parsing
      PropBank => semantic role labeling
      WordNet, SemCor => word sense disambiguation
      Switchboard => speech recognition
      Enron E-mail Corpus => social network analysis
      UN Parallel Text => statistical MT
      Pascal RTE => textual entailment