Crowdsourcing Speech Intelligibility Judgements

This talk looks at the variation in participants that take part in speech intelligibility studies, and explores how that variability can be characterised and integrated into interpreting and discussing results.

Published in: Technology

  1. ASA 173, Boston
     Crowdsourcing Speech Intelligibility Judgements
     Maria K Wolters, University of Edinburgh
     Karl B Isaac, freelance researcher
     Contact: maria.wolters@ed.ac.uk, @mariawolters
     With many thanks to Steve Renals & the EPSRC MultiMemoHome team
  2. Key Questions
     ❖ What can we know about the context of the judgements people make?
     ❖ How might that context affect performance?
        ❖ could explain some of the increased variation in results
        ❖ could yield new hypotheses about real-world intelligibility
     ❖ How can we improve the experience?
  3. Data
     ❖ Series of 14 lab and Amazon Mechanical Turk experiments on speech synthesis intelligibility (Isaac, 2015, PhD thesis)
     ❖ Lab vs Mechanical Turk
     ❖ effect of type of test sentences
     ❖ effect of noise and reverberation
  4. Experiment Overview
     Study      complete   not complete   Aim
     amt        167        62             Semantically unpredictable sentences, AMT vs Lab, 4 systems
     matrix     61         40             testing matrix sentences
     newvoice   61         49             three new voices
     lowrev     68         NA             effects of low reverberation
     highrev    36         NA             effects of high reverberation
     noiserev   78         183            noise x reverberation
     Total      471        334
     No exclusions and filtering.
  5. Important aspects of context
     ❖ People’s hearing
     ❖ How they are listening
     ❖ Where they are listening
     ❖ Experience with speech tested
     ❖ Did they do what they were supposed to do?
  6. Hearing Issues
     ❖ Self-report does not correlate very well with actual hearing loss (Wolters, Isaac, Johnson 2011)
     ❖ Yet there are many instances of self-reported hearing difficulties that affect the ability to understand speech in noise, with no hearing loss (Bharadwaj et al., 2015)
  7. How people are listening
     ❖ Headphones versus no headphones
     ❖ Type of headphones (earbuds, on ear, full ear …)
     ❖ Features of headphones
     ❖ configuration of listening device (phone / computer; browser; volume)
  8. Where they are listening
     ❖ Room acoustics
     ❖ Public / private
     ❖ Interruptions
     ❖ background noise
        ❖ source
        ❖ loudness
        ❖ fluctuating / constant / bursty
  9. Experience with Speech Type
     ❖ Dialect
     ❖ Life history
     ❖ exposure to target speech
  10. Did They Do What They Were Supposed To Do?
     ❖ Manipulation checks, such as a very easy sentence
     ❖ A different task / item that stirs people out of "tickybox" mode
     ❖ Instructions at the start, then questions about aspects of the instructions at the end (people are surprisingly honest!)
  11. Effect on Performance
     ❖ Context Variables:
        ❖ self-reported hearing problems
        ❖ self-reported loudness of background noise
     ❖ Performance Variables:
        ❖ Word error rate (WER) mean for each within-participant condition
        ❖ self-reported performance
  12. Self-Reported Hearing (Hearing Handicap Inventory for Adults)
     Study      mean   median   IQR   Max   >=10
     amt        3      0        0     38    21 (13%)
     matrix     3      0        0     34    4 (7%)
     newvoice   3.5    0        4     36    10 (16%)
     lowrev     1      0        0     18    5 (7%)
     highrev    1.5    0        0     28    2 (6%)
     noiserev   1.5    0        0     20    6 (8%)
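The columns on the slide above (mean, median, IQR, maximum, and the number of people scoring 10 or more) could be derived from raw questionnaire scores roughly as follows. This is only a sketch: the scores are invented for illustration, `summarise_scores` is a hypothetical helper, and the quartile convention is an assumption, since the slides do not say which one was used.

```python
import statistics


def summarise_scores(scores, threshold=10):
    """Summarise questionnaire scores for one study (hypothetical helper)."""
    # Quartiles with linear interpolation between data points; other
    # conventions exist and give slightly different IQRs (assumption).
    q1, _, q3 = statistics.quantiles(scores, n=4, method="inclusive")
    over = sum(1 for s in scores if s >= threshold)
    return {
        "mean": statistics.mean(scores),
        "median": statistics.median(scores),
        "iqr": q3 - q1,
        "max": max(scores),
        "over_threshold": over,
        "over_threshold_share": over / len(scores),
    }


# Invented scores for eight participants, not data from the studies.
example_scores = [0, 0, 0, 0, 2, 4, 12, 38]
summary = summarise_scores(example_scores)
print(summary)
```

Running the same helper on every study keeps the summaries comparable, which is the point of collecting the same context questions each time.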
  13. Self-Reported Noise Loudness
     Study      1 (none)   2    3    4   5 (LOUD)   median   IQR
     matrix     25         29   4    3   0          2        1
     newvoice   29         20   7    4   1          2        1
     lowrev     36         16   11   4   0          1        1
     highrev    18         15   1    1   1          1.5      1
     noiserev   44         22   5    1   6          1        1
     Not captured in AMT study.
  14. Mean WER
     Study      min    mean   median   IQR    Max
     amt        0.06   0.20   0.18     0.8    1.00
     matrix     0      0.09   0.08     0.40   0.32
     newvoice   0      0.14   0.14     0.15   0.42
     lowrev     0      0.05   0.04     0.06   0.5
     highrev    0      0.15   0.08     0.22   0.92
     noiserev   0      0.50   0.48     0.88   1.16
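The slides do not say how WER was computed. A minimal sketch, assuming the usual definition (word-level edit distance divided by the number of words in the reference), shows how a listener's typed response would be scored and why a value above 1 is possible when the response contains many extra words. The function names and example sentences are hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference words processed
    # so far and the first j words of the hypothesis.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # word missing from response
                            curr[j - 1] + 1,      # extra word in response
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


def mean_wer(pairs):
    """Mean WER over (reference, response) pairs, e.g. one condition."""
    return sum(word_error_rate(r, h) for r, h in pairs) / len(pairs)


# Invented examples: one substitution, then four inserted words.
print(word_error_rate("the cat sat", "the hat sat"))
print(word_error_rate("the cat sat", "the cat sat on a mat today"))
```

Averaging `word_error_rate` over the sentences a participant heard in one condition gives the per-condition mean that slide 11 lists as the performance variable.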
  15. Self-Reported Intelligibility
     Study      usually all   usually most   worse      link with mean WER
     amt        7 (4%)        125 (75%)      35 (21%)   p<0.0001
     matrix     27 (44%)      33 (54%)       1 (2%)     p<0.005
     newvoice   10 (16%)      47 (77%)       4 (6.5%)   p<0.01
     lowrev     45 (66%)      21 (31%)       1 (1%)     p<0.001
     highrev    11 (31%)      22 (61%)       3 (8%)     p<0.05
     noiserev   7 (9%)        31 (40%)       40 (51%)   p<0.0001
     Link with mean WER assessed using Kruskal-Wallis test.
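To make the Kruskal-Wallis link concrete, here is a minimal pure-Python sketch for the three self-report categories. It is not the analysis code behind the slide: the function names and WER values are invented, and the p-value relies on the large-sample chi-square approximation, which for three groups (two degrees of freedom) reduces to exp(-H/2) and is rough when a category holds only a handful of people.

```python
import math
from collections import Counter


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def kruskal_wallis_three_groups(a, b, c):
    """Return (H, approximate p) for three groups of mean WER values."""
    groups = [a, b, c]
    if any(len(g) == 0 for g in groups):
        raise ValueError("every group needs at least one observation")
    pooled = [x for g in groups for x in g]
    n = len(pooled)
    ranks = average_ranks(pooled)
    total, start = 0.0, 0
    for g in groups:
        rank_sum = sum(ranks[start:start + len(g)])
        total += rank_sum ** 2 / len(g)
        start += len(g)
    h = 12 / (n * (n + 1)) * total - 3 * (n + 1)
    # Standard correction for tied observations.
    ties = sum(t ** 3 - t for t in Counter(pooled).values())
    correction = 1 - ties / (n ** 3 - n)
    if correction == 0:
        raise ValueError("all observations are identical")
    h /= correction
    # Chi-square survival function with 2 degrees of freedom.
    return h, math.exp(-h / 2)


# Invented mean WERs for "usually all", "usually most", and "worse".
h_stat, p_value = kruskal_wallis_three_groups(
    [0.05, 0.08, 0.10], [0.15, 0.18, 0.20], [0.40, 0.45, 0.50])
print(h_stat, p_value)
```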
  16. Checking for Correlations
     ❖ Spearman test as implemented in R package coin
     ❖ stratified by relevant experimental variables
     ❖ H0 is that mean WER and HHIA score / loudness are independent, given the experimental variable
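The idea of a stratified rank test can be sketched in a few lines of Python. This is not the coin implementation and will not reproduce its p-values exactly; it is an assumption-laden illustration in which ranks are computed inside each stratum (for example, each system or SNR level), the test statistic is the summed rank covariance, and the null distribution comes from shuffling one variable within strata. All names and data are hypothetical.

```python
import random
from collections import defaultdict


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def centred_ranks(values):
    ranks = average_ranks(values)
    mean = sum(ranks) / len(ranks)
    return [r - mean for r in ranks]


def stratified_spearman_permutation(x, y, strata, n_perm=2000, seed=0):
    """Return (statistic, two-sided Monte Carlo p) for x vs y given strata."""
    by_stratum = defaultdict(list)
    for xi, yi, s in zip(x, y, strata):
        by_stratum[s].append((xi, yi))
    blocks = [(centred_ranks([p[0] for p in pairs]),
               centred_ranks([p[1] for p in pairs]))
              for pairs in by_stratum.values()]
    observed = sum(sum(a * b for a, b in zip(rx, ry)) for rx, ry in blocks)
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_perm):
        total = 0.0
        for rx, ry in blocks:
            shuffled = ry[:]
            rng.shuffle(shuffled)  # permute only within the stratum
            total += sum(a * b for a, b in zip(rx, shuffled))
        if abs(total) >= abs(observed) - 1e-12:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)


# Invented data: questionnaire score vs mean WER in two strata.
scores = [1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5, 6]
wers = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7]
strata = ["a"] * 6 + ["b"] * 6
stat, p = stratified_spearman_permutation(scores, wers, strata)
print(stat, p)
```

Shuffling within strata keeps each participant's condition fixed, so a difference in difficulty between conditions cannot masquerade as a link between the context variable and WER.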
  17. HHIA vs Mean WER
     Study      by System   by Reverb   by SNR
     amt        p=0.55
     matrix     p=0.08
     newvoice   p<0.01
     lowrev     p=0.37      p=0.44
     highrev    p=0.88      p=0.85
     noiserev   p=0.11      p<0.01      p<0.005
     Self-reported hearing becomes relevant
     * in the most difficult study (noiserev)
     * in the study with the highest number of people over threshold
  18. Example: NoiseReverb
  19. Loudness vs WER
     Study      by System   by Reverb   by SNR
     matrix     p=0.08
     newvoice   p=0.30
     lowrev     p=0.11      p=0.17
     highrev    p=0.14      p<0.07
     noiserev   p<0.05      p=0.14      p=0.18
     No evidence for a strong influence.
  20. Loudness vs Self-Reported Understanding
     Study      by System   by Reverb   by SNR
     matrix     p<0.01
     newvoice   p<0.005
     lowrev     p<0.005     p<0.005
     highrev    p<0.005     p<0.005
     noiserev   p<0.001     p<0.001     p<0.001
     Self-reported loudness of environment noise relates to self-reported difficulty, not WER.
  21. Example: Noise x Reverb
  22. Effects of Context on Performance
     • can be subtle
     • may depend on whether performance is self-reported or measured
     • may depend on who shows up for your study: we need a better understanding of possible confounders!
     Suggestion: build up a library of context data across studies
  23. How Can We Make it Easier?
     ❖ Design between-subject rather than within-subject: 90 sentences in the final study was a killer
     ❖ Pay a living wage
     ❖ encourage free comments that can be mined for useful information (think canary in a coal mine)
     ❖ offer more info on the goal of the study, and an opt-in to receive a results summary
  24. Canaries in the Comment Coalmine
     ❖ issues with the software
     ❖ issues with their memory
     ❖ typing while listening
     ❖ issues with UK accent for US listeners
     ❖ how they adjusted the volume at their end
  25. Conclusion
     ❖ Use consistent brief questions regarding context to better characterise your samples across all your studies
     ❖ Use free comments to look for aspects you hadn’t considered before
     ❖ Be kind to your participants
     Questions?
     Contact: maria.wolters@ed.ac.uk, @mariawolters, http://mariawolters.net
     Dr Karl B Isaac
