2020 INTERSPEECH

  1. Speech to text adaptation: Towards an efficient cross-modal distillation. Won Ik Cho, Donghyun Kwak, Ji Won Yoon, Nam Soo Kim (Human Interface Laboratory). 2020. 10. 26, @Interspeech
  2. Contents • Motivation • Task and Dataset • Related Work • Method • Result and Discussion • Conclusion 1
  3. Motivation • Text and speech: Two main media of communication • But, text resources >> speech resources → Why? • Difficult to control the generation and storage of recordings 2 (Figure: difference in search results with ‘English’ in the ELRA catalog)
  4. Motivation • Pretrained language models → Mainly developed for text-based systems • ELMo, BERT, GPTs … → Based on huge amounts of raw corpora • Trained with simple but non-task-specific objectives • Pretrained speech models? → Recently suggested • SpeechBERT, Speech XLNet … → Why not prevalent? • Difficulties in problem setting – What is the correspondence of the tokens? • Requires far more resources than text data 3
  5. Motivation • How to leverage pretrained LMs (or the inference thereof) in speech processing? → Direct use? • Only if the ASR outputs are accurate → Training LMs with erroneous speech transcriptions? • Okay, but cannot cover all the possible cases, and requires scripts for various scenarios → Distillation? 4 (Hinton et al., 2015)
  6. Task and Dataset • Task: Spoken language understanding → Literally – Understanding spoken language? → In the literature – Intent identification and slot filling → Our hypothesis: • In either case, abstracted speech data will meet the abstracted representation of text in semantic pathways 5 (Lugosch et al., 2019; Hemphill et al., 1990; Allen, 1980)
  7. Task and Dataset • Freely available benchmark! → Fluent Speech Commands • 16 kHz, single-channel, 30,043 audio files • Each audio file labeled with three slots: action / object / location • 248 different phrases spoken by 97 speakers (77/10/10) • Multi-label classification problem → Why Fluent Speech Commands? (suggested in Lugosch et al., 2019) • Google Speech Commands – Only short keywords, thus not an SLU task • ATIS – Not publicly available • Grabo, Domotica, Patcor – Free, but only a small number of speakers and phrases • Snips audio – A variety of phrases, but less audio 6
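The three-slot labeling above can be treated as three parallel classification targets. The following is a minimal Python sketch of that encoding; the slot value lists are illustrative placeholders, not the full Fluent Speech Commands inventory.

```python
# Minimal sketch: encoding FSC-style (action, object, location) labels as three
# classification targets, one per slot. The value lists below are illustrative
# placeholders, not the full Fluent Speech Commands inventory.
from typing import Tuple

ACTIONS = ["activate", "deactivate", "increase", "decrease"]   # assumed subset
OBJECTS = ["lights", "music", "heat", "none"]                   # assumed subset
LOCATIONS = ["kitchen", "bedroom", "washroom", "none"]          # assumed subset

def encode_slots(action: str, obj: str, location: str) -> Tuple[int, int, int]:
    """Map one utterance's slot values to three class indices (one per head)."""
    return ACTIONS.index(action), OBJECTS.index(obj), LOCATIONS.index(location)

# e.g., "Turn up the heat in the bedroom"
print(encode_slots("increase", "heat", "bedroom"))   # -> (2, 2, 1)
```

With this encoding, the multi-label problem reduces to three per-slot classifications, and an utterance is typically counted as correct only when all three slots match.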
  8. Related Work • ASR-NLU pipelines → Conventional approaches → Best if an accurate ASR is guaranteed → Easier to interpret issues and enhance individual modules • End-to-end SLU → Less prone to ASR errors → Non-textual information might be preserved as well • Pretrained LMs → Take advantage of massive textual knowledge → High performance, freely available modules • Knowledge distillation → Adaptive to various training schemes → Cross-modal application is feasible 7
  10. Related Work • End-to-end SLU → Lugosch, Loren, et al. "Speech Model Pre-training for End-to-End Spoken Language Understanding." INTERSPEECH 2019. 9
  11. Related Work • End-to-end SLU → Wang, Pengwei, et al. "Large-Scale Unsupervised Pre-Training for End-to-End Spoken Language Understanding." ICASSP 2020. 10
  12. Related Work • Pretrained LMs → Transformer architectures 11
  13. Related Work • End-to-end speech processing + PLM → Chuang, Yung-Sung, et al. "SpeechBERT: Cross-Modal Pre-Trained Language Model for End-to-End Spoken Question Answering." INTERSPEECH 2020. 12
  14. Related Work • End-to-end speech processing + KD → Liu, Yuchen, et al. "End-to-End Speech Translation with Knowledge Distillation." INTERSPEECH 2019. 13
  15. Method • End-to-end SLU + PLM + Cross-modal KD 14
  16. Method • End-to-end SLU → Backbone: Lugosch et al. (2019) • Phoneme module (SincNet layer; Ravanelli and Bengio, 2018) • Word module – BiGRU-based, with dropout/pooling • Intent module – Consecutive prediction of the three slots – Also implemented with BiGRU 15 (From a previous version of Wang et al., 2020)
  17. Method • End-to-end SLU 16
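A minimal PyTorch sketch of the backbone described above (phoneme, word, and intent modules), under loose assumptions: a plain Conv1d front end stands in for the SincNet layer, and the hidden sizes, pooling, and slot cardinalities are illustrative rather than taken from Lugosch et al. (2019).

```python
# Sketch of the end-to-end SLU backbone: phoneme module -> word module -> intent
# module. The Conv1d front end is a stand-in for the SincNet layer (Ravanelli and
# Bengio, 2018); hidden sizes, pooling, and slot cardinalities are assumptions.
import torch
import torch.nn as nn

class E2ESLU(nn.Module):
    def __init__(self, n_slot_values=(6, 14, 4), hidden=128):
        super().__init__()
        # Phoneme module: raw waveform -> frame-level features.
        self.phoneme = nn.Sequential(
            nn.Conv1d(1, 64, kernel_size=401, stride=160),  # ~25 ms windows, 10 ms hop at 16 kHz
            nn.ReLU(),
            nn.Conv1d(64, hidden, kernel_size=5, stride=2),
            nn.ReLU(),
        )
        # Word module: BiGRU, with pooling and dropout applied to its output.
        self.word = nn.GRU(hidden, hidden, batch_first=True, bidirectional=True)
        self.pool = nn.MaxPool1d(2)            # downsample in time
        self.drop = nn.Dropout(0.3)
        # Intent module: BiGRU, then one linear head per slot (action/object/location).
        self.intent = nn.GRU(2 * hidden, hidden, batch_first=True, bidirectional=True)
        self.heads = nn.ModuleList([nn.Linear(2 * hidden, n) for n in n_slot_values])

    def forward(self, wav):                                # wav: (batch, samples)
        x = self.phoneme(wav.unsqueeze(1))                 # (batch, hidden, frames)
        x, _ = self.word(x.transpose(1, 2))                # (batch, frames, 2*hidden)
        x = self.pool(x.transpose(1, 2)).transpose(1, 2)   # pool over time
        x, _ = self.intent(self.drop(x))
        summary = x.mean(dim=1)                            # utterance-level vector
        return [head(summary) for head in self.heads]      # one logit tensor per slot

logits = E2ESLU()(torch.randn(2, 16000))                   # two 1-second 16 kHz clips
```

In Lugosch et al. (2019) the phoneme and word modules are additionally pretrained on ASR-style targets before SLU training; the sketch above skips that step.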
  18. Method • PLM → Fine-tuning the pretrained model • BERT-Base (Devlin et al., 2018) – Bidirectional Encoder Representations from Transformers (BERT) • Hugging Face PyTorch wrapper 17
  19. Method • PLM → Fine-tuning with FSC ground-truth scripts! 18
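A minimal sketch of the teacher side: fine-tuning BERT-Base on the FSC ground-truth transcripts through the Hugging Face wrapper, with one linear head per slot. The head layout, label ids, and hyperparameters are illustrative assumptions, not the authors' exact setup.

```python
# Sketch: fine-tuning BERT-Base on FSC ground-truth transcripts as the teacher.
# Three linear heads (action / object / location) on top of the [CLS] vector;
# slot cardinalities, label ids, and hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import BertModel, BertTokenizer

class BertTeacher(nn.Module):
    def __init__(self, n_slot_values=(6, 14, 4)):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        hidden = self.bert.config.hidden_size                # 768 for BERT-Base
        self.heads = nn.ModuleList([nn.Linear(hidden, n) for n in n_slot_values])

    def forward(self, input_ids, attention_mask):
        out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]                    # [CLS] representation
        return [head(cls) for head in self.heads]            # one logit tensor per slot

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
teacher = BertTeacher()
optim = torch.optim.AdamW(teacher.parameters(), lr=2e-5)

batch = tokenizer(["turn on the lights in the kitchen"],
                  return_tensors="pt", padding=True)
labels = torch.tensor([[0, 0, 0]])                           # (action, object, location) ids
logits = teacher(batch["input_ids"], batch["attention_mask"])
loss = sum(F.cross_entropy(l, labels[:, i]) for i, l in enumerate(logits))
loss.backward()
optim.step()
```

The logits this fine-tuned teacher produces on the training transcripts are what the speech student later mimics.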
  20. Method • Cross-modal KD → Distillation as teacher-student learning • Loss1 = f(answer, inference_student) • Loss2 = g(inference_student, inference_teacher) • Total Loss = Loss1 + Loss2 • Different input, same task? – e.g., speech translation (Liu et al., 2019) 19 (Figure label: distilled knowledge)
  21. Method • Cross-modal KD → What determines the loss? • Who teaches • How the loss is calculated – MAE, MSE • How much the guidance influences (scheduling) 20
  22. Method • Cross-modal KD 21
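Putting the previous slides together, a minimal sketch of the distillation objective: Loss1 is the usual cross-entropy against the ground-truth slots, Loss2 matches the student's slot logits to the fine-tuned BERT teacher's logits with MAE or MSE, and a schedule weight controls how much the guidance influences training. The function and parameter names, and the explicit weighting factor, are assumptions rather than the authors' code.

```python
# Sketch of the cross-modal distillation loss:
#   Loss1 = cross-entropy(student logits, ground-truth slots)
#   Loss2 = MAE or MSE between student and teacher (BERT) slot logits
#   total = Loss1 + lambda_t * Loss2
# Names and the explicit lambda_t weighting are assumptions, not the authors' code.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      lambda_t: float, distance: str = "mae"):
    """student_logits / teacher_logits: lists of (batch, n_values) tensors, one per
    slot; labels: (batch, n_slots) long tensor; lambda_t: schedule weight."""
    loss1 = sum(F.cross_entropy(s, labels[:, i])
                for i, s in enumerate(student_logits))
    dist = F.l1_loss if distance == "mae" else F.mse_loss
    loss2 = sum(dist(s, t.detach())                      # teacher stays frozen
                for s, t in zip(student_logits, teacher_logits))
    return loss1 + lambda_t * loss2
```

Calling `.detach()` on the teacher logits keeps gradients from flowing back into the fine-tuned BERT teacher, so only the speech student is updated.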
  23. Result and Discussion • Teacher performance → GT-based, high performance → Not encouraging for the ASR result • Why the ASR-NLU baseline is borrowed (Wang et al., 2019) • Comparison with the baseline → Distillation is successful with flexible teacher influence → Reaches high performance with only a simple distillation → The Professor model does not necessarily dominate, but the Hybrid model is effective with MAE as the loss function 22
  25. Result and Discussion • Comparison with the baseline (cont’d) → Better teacher performance does not guarantee high-quality distillation • In line with recent findings in image processing and ASR distillation – A tutor might be better than a professor? → MAE is overall better than MSE • Probable correspondence with SpeechBERT (Chuang et al., 2019) • Why? – Different nature of the inputs – MSE might amplify the gap and lead to collapse » Partly observed in data-shortage scenarios 24
  26. Result and Discussion • Data-shortage scenario → MSE collapse is more explicit → Scheduling also matters • That Exp. is better than Tri. and Err shows that – Warm-up and decay is powerful – Teacher influence does not necessarily have to last long • However, a less mechanical approach is still anticipated – e.g., entropy-based? → The overall result suggests that distillation from a fine-tuned LM helps the student learn some information regarding uncertainty that is difficult to obtain from a speech-only end-to-end system 25
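As a rough illustration of the warm-up-and-decay scheduling mentioned above, the distillation weight could be ramped up and then decayed as below; the exact functional form and constants are assumptions, not the schedule used in the paper.

```python
# Sketch of a warm-up-and-decay weight for the distillation term; the exact
# exponential form and constants are assumptions, not the paper's schedule.
import math

def kd_weight(step: int, warmup: int = 1000, decay: float = 3e-4) -> float:
    """Ramp the teacher influence up over `warmup` steps, then let it decay
    exponentially so the student relies less on the teacher later on."""
    if step < warmup:
        return step / warmup                       # linear warm-up to 1.0
    return math.exp(-decay * (step - warmup))      # exponential decay afterwards

# Example: the weight rises to 1.0, then fades as training proceeds.
for s in (0, 500, 1000, 5000, 20000):
    print(s, round(kd_weight(s), 3))
```

The printed values rise to 1.0 over the warm-up steps and then fade, consistent with the observation that the teacher's influence does not have to last long.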
  27. Result and Discussion • Discussion → Is this cross-modal or multi-modal? • Probably cross-modal; though the text (either ASR output or GT) comes from the speech, the formats differ (waveform vs. Unicode) → Is this knowledge sharing? • Also yes; though we exploit logit-level information, the different aspects of uncertainty derived from each modality might affect the distillation process, making it knowledge sharing rather than mere optimization → How to engage paralinguistic properties? • Further study; frame-level acoustic information can be residually connected to compensate for the loss, though this might not leverage much from the text-based LMs 26
  28. Conclusion • Cross-modal distillation works in SLU, even if the teacher's input modality is explicitly different from the student's • Simple distillation from a fine-tuned LM helps the student learn some uncertainty information that is not obtainable from speech-only training • MAE loss is effective in speech-to-text adaptation, possibly with warm-up and decay scheduling of the KD loss 27
  29. Reference (in order of appearance) • Hinton, Geoffrey, Oriol Vinyals, and Jeff Dean. "Distilling the knowledge in a neural network." arXiv preprint arXiv:1503.02531 (2015). • Allen, James F., and C. Raymond Perrault. "Analyzing intention in utterances." Artificial Intelligence 15.3 (1980): 143-178. • Hemphill, Charles T., John J. Godfrey, and George R. Doddington. "The ATIS spoken language systems pilot corpus." Speech and Natural Language: Proceedings of a Workshop Held at Hidden Valley, Pennsylvania, June 24-27, 1990. • Lugosch, Loren, et al. "Speech model pre-training for end-to-end spoken language understanding." arXiv preprint arXiv:1904.03670 (2019). • Wang, Pengwei, et al. "Large-scale unsupervised pre-training for end-to-end spoken language understanding." ICASSP 2020. IEEE, 2020. • Peters, Matthew E., et al. "Deep contextualized word representations." arXiv preprint arXiv:1802.05365 (2018). • Vaswani, Ashish, et al. "Attention is all you need." Advances in Neural Information Processing Systems (2017): 5998-6008. • Devlin, Jacob, et al. "BERT: Pre-training of deep bidirectional transformers for language understanding." arXiv preprint arXiv:1810.04805 (2018). • Chuang, Yung-Sung, Chi-Liang Liu, and Hung-Yi Lee. "SpeechBERT: Cross-modal pre-trained language model for end-to-end spoken question answering." arXiv preprint arXiv:1910.11559 (2019). • Liu, Yuchen, et al. "End-to-end speech translation with knowledge distillation." arXiv preprint arXiv:1904.08075 (2019). • Ravanelli, Mirco, and Yoshua Bengio. "Speaker recognition from raw waveform with SincNet." 2018 IEEE Spoken Language Technology Workshop (SLT). IEEE, 2018. • Wolf, Thomas, et al. "HuggingFace's Transformers: State-of-the-art natural language processing." arXiv preprint arXiv:1910.03771 (2019). 28
  30. Thank you!
