Comparative ASR
  evaluation
             Dan Burnett
Director of Speech Technologies, Voxeo
         SpeechTek New York
             August 2010
Goals for today

• Learn about data selection
• Learn all the steps of doing an eval by
  actually doing them
• Leave with code that runs
Outline
• Overview of comparative ASR evaluation
• How to select an evaluation data set
• Why transcription is important and how to
  do it properly
• What and how to test
• Analyzing the results
Comparative ASR
      Evaluation

• How could you compare ASR accuracy?
• Can you test against any dataset?
• What settings should you use?
  The optimal ones, right?
Today’s approach
•   Choose representative evaluation data set

•   Determine human classification of each recording

•   For each ASR engine

    •   Determine machine classification of each
        recording at “optimal” setting

    •   Compare to human classification to determine
        accuracy

•   Intelligently compare results for the two engines
Evaluation data set
• Ideally at least 100 recordings per grammar
  path for good confidence in results (rising to
  a minimum of 10,000 for large grammars)
• Must be representative
 • Best to take from actual calls (why?)
 • Do you need all the calls? Consider
   • Time of day, day of week, holidays
   • Regional differences
 • Simplest is to use every nth call
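The "every nth call" rule is simple systematic sampling. A minimal sketch in Python (illustrative only; the lab tooling itself is PHP, and `recordings` here is a hypothetical list of call files):

```python
def every_nth(recordings, target_size):
    """Systematic sample: keep every nth recording so that roughly
    target_size items remain, preserving the time-of-day and
    day-of-week spread already present in call order."""
    if len(recordings) <= target_size:
        return list(recordings)
    n = len(recordings) // target_size
    return recordings[::n][:target_size]

calls = ["call_%04d.wav" % i for i in range(1607)]
sample = every_nth(calls, 100)   # 100 evenly spaced recordings
```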
Lab data set
• Stored in all-data
• In “original” format as recorded
• Only post-endpointed data for today
• 1607 recordings of answers to yes/no
  question
• Likely to contain yes/no, but not guaranteed
Transcription

• Why is it needed? Why not automatic?
• Stages
 • Classification
 • Transcription
Audio classification
  •   Motivation: applications may distinguish (i.e., possibly behave
      differently) among the following cases:

      Case                                           Possible behavior
      No speech in audio sample (nospeech)           Mention that you didn’t hear anything and ask for a repeat
      Speech, but not intelligible (unintelligible)  Ask for a repeat
      Intelligible speech, but not in app grammar    Encourage in-grammar speech
        (out-of-grammar speech)
      Intelligible speech, and within app grammar    Respond to what the person said
        (in-grammar speech)
Transcribing speech

• Words only, all lower case
• No digits
• Only punctuation allowed is apostrophe
Lab 1
• Copy yn_files.csv to yn_finaltrans.csv and edit
• For each file, append category of nospeech,
  unintelligible, or speech
  •   Example: all-data/.../utt01.wav,unintelligible
• Append transcription if speech
  •   Example: all-data/.../utt01.wav,speech,yes
• Transcription instructions in transcription.html
• How might you validate transcriptions?
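One way to validate the finished file is a mechanical pass over each row; a sketch (field layout assumed from the examples above, transcription rules from the previous slide):

```python
import csv

VALID = {"nospeech", "unintelligible", "speech"}

def check_transcriptions(path):
    """Yield (line_number, problem) for rows that break the
    file,category[,transcription] layout or the transcription rules
    (lowercase words only, no digits, apostrophe the only punctuation)."""
    allowed = set("abcdefghijklmnopqrstuvwxyz' ")
    with open(path, newline="") as f:
        for num, row in enumerate(csv.reader(f), start=1):
            if len(row) < 2 or row[1] not in VALID:
                yield num, "missing or unknown category"
            elif row[1] == "speech":
                if len(row) < 3 or not row[2]:
                    yield num, "speech row without transcription"
                elif not set(row[2]) <= allowed:
                    yield num, "illegal characters in transcription"
```

A second, independent transcription pass with disagreement counting is the usual way to validate content, not just format.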
What and how to test

• Understanding what to test/measure
• Preparing the data
• Building a test harness
• Running the test
What to test/measure

•   To measure accuracy, we need

    •   For each data file

        •   the human categorization and transcription,
            and

        •   the recognizer’s categorization, recognized
            string, and confidence score
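Gathered together, one per-file record might look like this (a sketch; the field names are illustrative, not taken from the lab files):

```python
from dataclasses import dataclass

@dataclass
class EvalRecord:
    """Everything needed to score one audio file."""
    audio_file: str
    human_category: str       # nospeech / unintelligible / out-of-grammar / in-grammar
    human_transcription: str  # empty unless the file contains speech
    asr_category: str         # nospeech / rejected / recognized
    asr_string: str           # recognized word string, if any
    confidence: int           # 0-100, compared against the rejection threshold

rec = EvalRecord("all-data/utt01.wav", "in-grammar", "yes", "recognized", "yes", 87)
```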
Preparing the data

• Recognizer needs a grammar (typically from
  your application)
• This grammar can be used to classify transcribed
  speech as In-grammar/Out-of-grammar
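For the yes/no grammar used here, the in-grammar check reduces to string matching. A sketch (illustrative; a real harness would parse the GRXML grammar rather than hard-code its coverage):

```python
GRAMMAR = {"yes", "no"}  # coverage of the yes/no grammar, hard-coded for illustration

def classify_transcription(category, transcription):
    """Map a human label plus transcription onto the four
    categories used in the outcome analysis."""
    if category != "speech":
        return category                    # nospeech or unintelligible
    if transcription.strip() in GRAMMAR:
        return "in_grammar"
    return "out_of_grammar"

classify_transcription("speech", "yes")    # in_grammar
classify_transcription("speech", "maybe")  # out_of_grammar
```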
Lab 2

• Fix the GRXML yes/no grammar yesno.grxml in
  the “a” directory
• Copy yn_finaltrans.csv to yn_igog.csv
• Edit yn_igog.csv and change every “yes” or
  “no” line to have a category of “in_grammar”
  (should be 756 yes, 159 no, for total of 915)
Building a test harness
• Why build a test harness? What about
  vendor batch reco tools?
• End-to-end vs. recognizer-only testing
• Harness should be
 • generic
 • customizable to different ASR engines
Lab 3

•   Complete the test harness harness.php
    (see harness_outline.txt)
    •   The harness must use the “a/scripts” scripts
    •   A list of “missing commands” is in
        harness_components.txt
    •   Please review (examine) these scripts
•   FYI, ASR engine is a/a.php -- treat as black box
Lab 4

• Now run the test harness:
 • php harness.php a/scripts <data file> <rundir>
 • Output will be in <rundir>/results.csv
 • Compare your output to “def_results.csv”
Analyzing results


• What are the possible outcomes and
  errors?
• How do we evaluate errors?
Possible ASR Engine
    Classifications
• Silence/nospeech (nospeech)
• Reject (rejected)
• Recognize (recognized)

• What about DTMF?
Possible outcomes

                         ASR result
   True category    nospeech                        rejected              recognized
   nospeech         Correct classification          Improperly rejected   Incorrect
   unintelligible   Improperly treated as silence   Correct behavior      Assume incorrect
   out-of-grammar   Improperly treated as silence   Correct behavior      Incorrect
   in-grammar       Improperly treated as silence   Improperly rejected   Either correct or incorrect
Three types of errors

• Missilences -- called silence, but wasn’t
• Misrejections -- rejected inappropriately
• Misrecognitions -- recognized
  inappropriately or incorrectly


     So how do we evaluate these?
Evaluating errors

• Run ASR Engine on data set
• Try every rejection threshold value
• Plot errors as function of threshold
• Find optimal value
Try every rejection
     threshold value
• Run the data files through the test harness with
  a rejection threshold of 0 (i.e., no rejection),
  recording each confidence score
• Now, for each possible rejection threshold
  from 0 to 100
  • Calculate number of misrecognitions,
    misrejections, and missilences
Semantic equivalence

• We call “yes” in-grammar, but what about
  “yes yes yes”?
• Application only cares about whether it does
  the right thing, so
• Our final results need to be semantic results
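Semantic scoring can be sketched as normalizing both strings through a synonym map before comparing (illustrative; the real mappings live in synonyms.txt, and this tiny map is hypothetical):

```python
# Hypothetical synonym map; the lab reads these from synonyms.txt.
SYNONYMS = {"yes yes yes": "yes", "yeah": "yes", "nope": "no"}

def canonical(utterance):
    """Collapse semantically equivalent phrasings to one form."""
    text = " ".join(utterance.lower().split())
    return SYNONYMS.get(text, text)

def semantically_equal(hypothesis, reference):
    return canonical(hypothesis) == canonical(reference)

semantically_equal("yes yes yes", "yes")  # raw strings differ, semantics match
```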
Lab 5

•   Look at synonyms.txt file
•   Analyze at a single threshold and look at the result
    •   php analyze_csv.php <csv file> 50 synonyms.txt
    •   Note the difference between raw and semantic results
•   Now evaluate at all thresholds and look at the (semantic)
    results
    •   php analyze_all_thresholds.php <csv file> <synonyms file>
[Chart: ASR Engine A errors. Misrecognitions, “misrejections”, and “missilences” (counts, 0 to 1000) plotted against rejection threshold (0 to 100).]
[Chart: ASR Engine A errors. Sum of the three error types vs. rejection threshold, marking the minimum total error.]
Lab 6
•   You now have engine B in “b” directory
•   Change harness and component scripts as necessary to
    run the same test
•   You need to know that
    •   The API for engine B is different. Run “php b/b.php” to
        find out what it is. It takes ABNF grammars instead of
        XML.
    •   Engine B stores its output in a different file.
    •   Possible outputs from engine B are
        •   <audiofilename>: [NOSPEECH, REJECTION,
            SPOKETOOSOON, MAXSPEECHTIMEOUT]
        •   <audiofilename>: ERROR processing file
[Chart: ASR Engine B errors. Misrecognitions, “misrejections”, and “missilences” (counts, 0 to 1000) plotted against rejection threshold (0 to 100).]
[Chart: ASR Engine B errors. Sum of the three error types vs. rejection threshold, marking the minimum total error.]
Comparing ASR
       accuracy

• Plot and compare
• Remember to compare optimal error rates
  of each (representing tuned accuracy)
[Chart: Total errors, A vs B. Total error curves for ASR Engine A and ASR Engine B vs. rejection threshold (0 to 100).]
Comparison conclusions

• Optimal error rates are very similar on this
  data set
• Engine A is much more sensitive to
  rejection threshold changes
[Chart: Natural Numbers task. Total error curves for ASR Engine A and ASR Engine B vs. rejection threshold; note that the optimal thresholds are different!]
Today we . . .
• Learned all the steps of doing an eval by
  actually doing them
  •   Collecting data
  •   Transcribing data
  •   Running a test
  •   Analyzing results

• Finished with code that runs
  (and some homework . . .)

Comparative ASR Evaluation - Voxeo - SpeechTEK NY 2010

  • 1. Comparative ASR evaluation Dan Burnett Director of Speech Technologies, Voxeo SpeechTek New York August 2010
  • 2. Goals for today • Learn about data selection • Learn all the steps of doing an eval by actually doing them • Leave with code that runs
  • 3. Outline • Overview of comparative ASR evaluation • How to select an evaluation data set • Why transcription is important and how to do it properly • What and how to test • Analyzing the results
  • 4. Comparative ASR Evaluation • How could you compare ASR accuracy? • Can you test against any dataset? • What settings should you use? The optimal ones, right?
  • 5. Today’s approach • Choose representative evaluation data set • Determine human classification of each recording • For each ASR engine • Determine machine classification of each recording at “optimal” setting • Compare to human classification to determine accuracy • Intelligently compare results for the two engines
  • 6. Evaluation data set • Ideally at least 100 recordings per grammar path for good confidence in results (up to 10000 minimum for large grammars) • Must be representative • Best to take from actual calls (why?) • Do you need all the calls? Consider • Time of day, day of week, holidays • Regional differences • Simplest is to use every nth call
  • 7. Lab data set • Stored in all-data • In “original” format as recorded • Only post-endpointed data for today • 1607 recordings of answers to yes/no question • Likely to contain yes/no, but not guaranteed
  • 8. Transcription • Why is it needed? Why not automatic? • Stages • Classification • Transcription
  • 9. Audio classification • Motivation: • Applications may distinguish (i.e. possibly behave differently) among the following cases: Case Possible behavior No speech in audio sample Mention that you didn’t hear (nospeech) anything and ask for repeat Speech, but not intelligible Ask for repeat (unintelligible) Intelligible speech, but not in app grammar Encourage in-grammar speech (out-of-grammar speech) Intelligible speech, and within app grammar (in-grammar speech) Respond to what person said
  • 10. Transcribing speech • Words only, all lower case • No digits • Only punctuation allowed is apostrophe
  • 11. Lab 1 • Copy yn_files.csv to yn_finaltrans.csv and edit • For each file, append category of nospeech, unintelligible, or speech • Example: all-data/.../utt01.wav,unintelligible • Append transcription if speech • Example: all-data/.../utt01.wav,speech,yes • Transcription instructions in transcription.html • How might you validate transcriptions?
What and how to test
• Understanding what to test/measure
• Preparing the data
• Building a test harness
• Running the test
What to test/measure
• To measure accuracy, we need
  • For each data file
    • the human categorization and transcription, and
    • the recognizer’s categorization, recognized string, and confidence score
Preparing the data
• Recognizer needs a grammar (typically from your application)
• This grammar can be used to classify transcribed speech as in-grammar/out-of-grammar
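For reference, a minimal SRGS (GRXML) yes/no grammar looks roughly like the fragment below. This is a generic sketch, not the contents of the lab's yesno.grxml (which is not shown here); the root rule accepts exactly "yes" or "no", and anything else is out-of-grammar:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<grammar xmlns="http://www.w3.org/2001/06/grammar"
         version="1.0" xml:lang="en-US" root="yesno">
  <rule id="yesno" scope="public">
    <one-of>
      <item>yes</item>
      <item>no</item>
    </one-of>
  </rule>
</grammar>
```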
Lab 2
• Fix the GRXML yes/no grammar yesno.grxml in the “a” directory
• Copy yn_finaltrans.csv to yn_igog.csv
• Edit yn_igog.csv and change every “yes” or “no” line to have a category of “in_grammar” (should be 756 yes, 159 no, for a total of 915)
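The relabeling step above can be mechanized; this is a hedged Python sketch (the lab does it by hand or with its own PHP scripts), assuming the row layout from Lab 1 and a yes/no grammar that accepts only the two bare words:

```python
IN_GRAMMAR = {"yes", "no"}  # the only strings the yes/no grammar accepts

def label_igog(rows):
    """Replace the 'speech' category with 'in_grammar' when the
    transcription is covered by the grammar; leave other rows alone."""
    out = []
    for row in rows:
        if len(row) >= 3 and row[1] == "speech" and row[2] in IN_GRAMMAR:
            row = [row[0], "in_grammar", row[2]]
        out.append(row)
    return out
```

Rows whose transcription is speech but not covered by the grammar (e.g. "yes please") keep their "speech" category, i.e. they are out-of-grammar.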
Building a test harness
• Why build a test harness? What about vendor batch reco tools?
• End-to-end vs. recognizer-only testing
• Harness should be
  • generic
  • customizable to different ASR engines
Lab 3
• Complete the test harness harness.php (see harness_outline.txt)
• The harness must use the “a/scripts” scripts
  • A list of “missing commands” is in harness_components.txt
  • Please review (examine) these scripts
• FYI, ASR engine is a/a.php -- treat as black box
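The harness loop itself is simple: for each audio file, ask the engine for its categorization, recognized string, and confidence, and write one results line combining the human and machine labels. The lab's harness is PHP and drives the a/scripts wrappers (not reproduced here); this is a language-neutral Python sketch in which `recognize` is a stand-in for the engine-specific wrapper call:

```python
import csv

def run_harness(data_rows, recognize, results_path):
    """data_rows: (audiofile, human_category[, transcription]) rows.
    `recognize` stands in for the engine wrapper (e.g. the a/scripts
    commands around a/a.php) and must return
    (asr_category, hypothesis, confidence) for an audio file."""
    with open(results_path, "w", newline="") as out:
        w = csv.writer(out)
        for row in data_rows:
            audio, human_cat = row[0], row[1]
            trans = row[2] if len(row) > 2 else ""
            asr_cat, hyp, conf = recognize(audio)
            w.writerow([audio, human_cat, trans, asr_cat, hyp, conf])
```

Keeping the engine call behind a single function (or, in the lab, a single component script) is what makes the harness generic and customizable to different ASR engines.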
Lab 4
• Now run the test harness:
  • php harness.php a/scripts <data file> <rundir>
• Output will be in <rundir>/results.csv
• Compare your output to “def_results.csv”
Analyzing results
• What are the possible outcomes and errors?
• How do we evaluate errors?
Possible ASR Engine Classifications
• Silence/nospeech (nospeech)
• Reject (rejected)
• Recognize (recognized)
• What about DTMF?
Possible outcomes

  Human label          ASR: nospeech                  ASR: rejected         ASR: recognized
  nospeech             Correct classification         Improperly rejected   Incorrect
  unintelligible       Improperly treated as silence  Correct behavior      Assume incorrect
  true out-of-grammar  Improperly treated as silence  Correct behavior      Incorrect
  in-grammar           Improperly treated as silence  Improperly rejected   Either correct or incorrect

(The next three slides repeat this matrix, highlighting in turn the cells that count as misrecognitions, “misrejections”, and “missilences”.)
Three types of errors
• Missilences -- called silence, but wasn’t
• Misrejections -- rejected inappropriately
• Misrecognitions -- recognized inappropriately or incorrectly

So how do we evaluate these?
Evaluating errors
• Run ASR Engine on data set
• Try every rejection threshold value
• Plot errors as function of threshold
• Find optimal value
Try every rejection threshold value
• Ran data files through test harness with rejection threshold of 0 (i.e., no rejection), but recorded confidence score
• Now, for each possible rejection threshold from 0 to 100
  • Calculate number of misrecognitions, misrejections, and missilences
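The sweep can be sketched as follows. This is a hedged reconstruction in Python (the lab's analyze_all_thresholds.php may score differently in detail): recognitions whose confidence falls below the candidate threshold are re-labelled as rejections, then each outcome is scored against the matrix from the "Possible outcomes" slide:

```python
def classify_error(human_cat, asr_cat, correct_string):
    """Map one (human label, thresholded ASR outcome) pair to an error type:
    missilence = called silence but wasn't, misrejection = rejected
    inappropriately, misrecognition = recognized inappropriately or
    incorrectly. Returns None for correct behavior."""
    if asr_cat == "nospeech":
        return None if human_cat == "nospeech" else "missilence"
    if asr_cat == "rejected":
        return "misrejection" if human_cat in ("nospeech", "in_grammar") else None
    # asr_cat == "recognized": only a correct in-grammar string is right
    if human_cat == "in_grammar" and correct_string:
        return None
    return "misrecognition"

def sweep(results, thresholds=range(101)):
    """results: (human_cat, asr_cat, correct_string, confidence) tuples from
    a run at rejection threshold 0. Returns {threshold: error tallies}."""
    counts = {}
    for t in thresholds:
        tally = {"missilence": 0, "misrejection": 0, "misrecognition": 0}
        for human_cat, asr_cat, correct, conf in results:
            if asr_cat == "recognized" and conf < t:
                asr_cat = "rejected"  # below threshold: treated as rejection
            err = classify_error(human_cat, asr_cat, correct)
            if err:
                tally[err] += 1
        counts[t] = tally
    return counts
```

Note that missilences are unaffected by the threshold; only the recognized/rejected boundary moves as the threshold rises.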
Semantic equivalence
• We call “yes” in-grammar, but what about “yes yes yes”?
• Application only cares about whether it does the right thing, so
• Our final results need to be semantic results
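Semantic scoring just means collapsing each recognized string to its semantic value before comparing. The real synonyms.txt format is not shown in the deck, so this sketch stands in a plain dict for it:

```python
def semantic_form(hypothesis, synonyms):
    """Collapse a raw recognition string to its semantic value, so that
    e.g. 'yes yes yes' (via a synonyms entry) counts as a correct 'yes'."""
    key = " ".join(hypothesis.lower().split())  # normalize case and spacing
    return synonyms.get(key, key)
```

With a mapping like {"yes yes yes": "yes", "yeah": "yes"}, the raw and semantic results can differ exactly as Lab 5 asks you to observe.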
Lab 5
• Look at synonyms.txt file
• Analyze at single threshold and look at the result
  • php analyze_csv.php <csv file> 50 synonyms.txt
• Note the difference between raw and semantic results
• Now evaluate at all thresholds and look at the (semantic) results
  • php analyze_all_thresholds.php <csv file> <synonyms file>
[Chart: ASR Engine A errors vs. rejection threshold (x-axis 0–100, y-axis 0–1000), with curves for misrecognitions, “misrejections”, and “missilences”]

[Chart: ASR Engine A total error sum vs. rejection threshold, marking the minimum total error sum]
Lab 6
• You now have engine B in “b” directory
• Change harness and component scripts as necessary to run the same test
• You need to know that
  • The API for engine B is different. Run “php b/b.php” to find out what it is. It takes ABNF grammars instead of XML.
  • Engine B stores its output in a different file.
  • Possible outputs from engine B are
    • <audiofilename>: [NOSPEECH, REJECTION, SPOKETOOSOON, MAXSPEECHTIMEOUT]
    • <audiofilename>: ERROR processing file
[Chart: ASR Engine B errors vs. rejection threshold (x-axis 0–100, y-axis 0–1000), with curves for misrecognitions, “misrejections”, and “missilences”]

[Chart: ASR Engine B total error sum vs. rejection threshold, marking the minimum total error sum]
Comparing ASR accuracy
• Plot and compare
• Remember to compare optimal error rates of each (representing tuned accuracy)
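Concretely, "compare tuned accuracy" means taking each engine at its own optimum, since the optimal thresholds generally differ. A small sketch (assuming per-threshold total error counts like those produced in Lab 5; the function and key names here are illustrative):

```python
def optimal_threshold(total_errors):
    """total_errors: {threshold: total error count}.
    Return the (threshold, errors) pair minimizing the total error sum."""
    t = min(total_errors, key=total_errors.get)
    return t, total_errors[t]

def compare_engines(errors_a, errors_b):
    """Compare tuned accuracy: each engine at its own optimal threshold."""
    ta, ea = optimal_threshold(errors_a)
    tb, eb = optimal_threshold(errors_b)
    return {"A": {"threshold": ta, "errors": ea},
            "B": {"threshold": tb, "errors": eb}}
```

Comparing both engines at a single shared threshold would unfairly penalize whichever engine's optimum lies elsewhere.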
[Chart: Total errors, A vs. B, vs. rejection threshold (x-axis 0–100, y-axis 0–1000), one curve per engine]
Comparison conclusions
• Optimal error rates are very similar on this data set
• Engine A is much more sensitive to rejection threshold changes
[Chart: Natural Numbers task, total errors vs. rejection threshold for ASR Engines A and B -- note that the optimal thresholds are different!]
Today we . . .
• Learned all the steps of doing an eval by actually doing them
  • Collecting data
  • Transcribing data
  • Running a test
  • Analyzing results
• Finished with code that runs (and some homework . . .)