Human Interface Laboratory
Speech to text adaptation:
Towards an efficient cross-modal distillation
2020. 10. 26, @Interspeech
Won Ik Cho, Donghyun Kwak, Ji Won Yoon, Nam Soo Kim
Motivation
• Text and speech: the two main media of communication
• But text resources >> speech resources
Why?
• It is difficult to control the generation and storage of recordings
Difference in search results for ‘English’ in the ELRA catalog
Motivation
• Pretrained language models
Mainly developed for text-based systems
• ELMo, BERT, GPTs …
Based on huge amounts of raw corpora
• Trained with simple but non-task-specific objectives
• Pretrained speech models?
Recently suggested
• SpeechBERT, Speech XLNet …
Why not prevalent?
• Difficulties in problem setting
– What is the correspondence of the tokens?
• Require much higher resources than text data
Motivation
• How to leverage pretrained LMs (or the inference thereof) in
speech processing?
Direct use?
• Only if the ASR outputs are accurate
Training LMs with erroneous speech transcriptions?
• Possible, but it cannot cover all the possible cases and requires scripts for various scenarios
Distillation?
(Hinton et al., 2015)
Task and Dataset
• Task: Spoken language understanding
Literally – Understanding spoken language?
In the literature – Intent identification and slot filling
Our hypothesis:
• In either case, abstracted speech data will meet the abstracted representation of text in semantic pathways
Lugosch et al. (2019)
Hemphill et al. (1990)
Allen (1980)
Task and Dataset
• Freely available benchmark!
Fluent Speech Commands
• 30,043 single-channel 16 kHz audio files
• Each audio labeled with three slots: action / object / location
• 248 different phrases spoken by 97 speakers (77/10/10 speaker split)
• Multi-label classification problem (an illustrative example is sketched below)
Why Fluent Speech Commands? (as argued in Lugosch et al., 2019)
• Google Speech Commands:
– Only short keywords, thus not an SLU task
• ATIS
– Not publicly available
• Grabo, Domotica, Patcor
– Free, but only a small number of speakers and phrases
• Snips audio
– A variety of phrases, but less audio
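As an illustration of the three-slot labeling above (the file path, transcript, and slot values here are hypothetical placeholders in the dataset's spirit, not guaranteed verbatim entries), each audio file carries one value per slot and the model must predict all three:

```python
# A Fluent Speech Commands-style example: one utterance, three slot labels.
example = {
    "wav": "path/to/utterance.wav",   # hypothetical path to a 16 kHz recording
    "transcript": "Turn on the lights in the kitchen",
    "action": "activate",
    "object": "lights",
    "location": "kitchen",
}

# Multi-label target: the model predicts one class per slot for every utterance.
target = (example["action"], example["object"], example["location"])
print(target)  # ('activate', 'lights', 'kitchen')
```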
Related Work
• ASR-NLU pipelines
Conventional approaches
Best if an accurate ASR is guaranteed
Easier to interpret issues and improve individual modules
• End-to-end SLU
Less prone to ASR errors
Non-textual information might be preserved as well
• Pretrained LMs
Take advantage of massive textual knowledge
High performance, freely available modules
• Knowledge distillation
Adaptive to various training schemes
Cross-modal application is feasible
Related Work
• End-to-end SLU
Lugosch, Loren, et al. "Speech Model Pre-training for End-to-End Spoken
Language Understanding." INTERSPEECH 2019.
Related Work
• End-to-end SLU
Wang, Pengwei, et al. "Large-Scale Unsupervised Pre-Training for End-to-End Spoken Language Understanding." ICASSP 2020.
Related Work
• End-to-end speech processing + PLM
Chuang, Yung-Sung, et al. "SpeechBERT: Cross-Modal Pre-Trained Language Model for End-to-End Spoken Question Answering." INTERSPEECH 2020.
Related Work
• End-to-end speech processing + KD
Liu, Yuchen, et al. "End-to-End
Speech Translation with Knowledge
Distillation." INTERSPEECH 2019.
Method
• End-to-end SLU
Backbone: Lugosch et al. (2019) (a rough sketch follows below)
• Phoneme module (SincNet layer)
• Word module
– BiGRU-based, with dropout/pooling
• Intent module
– Consecutive prediction of the three slots
– Also implemented with BiGRU
(Ravanelli and Bengio, 2018)
Figure from a previous version of Wang et al. (2020)
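A rough PyTorch sketch of the student stack described above, assuming 16 kHz waveform input; a plain Conv1d stands in for the SincNet layer and all sizes are placeholders, so this follows the spirit of Lugosch et al. (2019) rather than reproducing their exact model:

```python
import torch
import torch.nn as nn

class SLUStudent(nn.Module):
    """Phoneme -> word -> intent modules, in the spirit of Lugosch et al. (2019)."""

    def __init__(self, n_slot_values=31, hidden=128):
        super().__init__()
        # Phoneme module: SincNet-style front end; a vanilla Conv1d is a stand-in here.
        self.phoneme = nn.Sequential(
            nn.Conv1d(1, 64, kernel_size=400, stride=160),  # ~25 ms window, 10 ms hop
            nn.ReLU(),
        )
        # Word module: BiGRU with dropout; temporal pooling is done by striding.
        self.word = nn.GRU(64, hidden, batch_first=True, bidirectional=True)
        self.word_drop = nn.Dropout(0.3)
        # Intent module: another BiGRU plus a linear head over the slot values.
        self.intent = nn.GRU(2 * hidden, hidden, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, n_slot_values)

    def forward(self, wav):                     # wav: (batch, samples)
        h = self.phoneme(wav.unsqueeze(1))      # (batch, 64, frames)
        h, _ = self.word(h.transpose(1, 2))     # (batch, frames, 2*hidden)
        h = self.word_drop(h[:, ::2])           # crude temporal pooling
        h, _ = self.intent(h)
        return self.head(h.mean(dim=1))         # logits over the slot values
```

In the actual backbone the three slots are predicted consecutively; a single joint head is used above only to keep the sketch short.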
Method
• Cross-modal KD
Distillation as teacher-student learning
• Loss_1 = f(answer, inference_s)
• Loss_2 = g(inference_s, inference_t)
• Total Loss = Loss_1 + Loss_2 (a sketch follows below)
• Different input, same task?
– e.g., speech translation
(Liu et al., 2019)
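A minimal sketch of the teacher-student step above, under the assumption that the teacher is a frozen, fine-tuned text classifier that returns intent logits for a batch of transcripts while the student consumes the raw waveform; f is taken as cross-entropy and g is left as a pluggable distance:

```python
import torch
import torch.nn.functional as F

def cross_modal_kd_loss(student, teacher, speech, transcripts, answer,
                        distance_fn=F.mse_loss):
    """Total Loss = Loss1 + Loss2, as on the slide.

    student:     speech model, waveform -> logits (e.g. the SLU student above)
    teacher:     frozen fine-tuned text model, transcripts -> logits (assumed)
    answer:      (batch,) ground-truth intent indices
    distance_fn: g(inference_s, inference_t); the MAE/MSE choice is discussed next.
    """
    inference_s = student(speech)                      # student logits
    with torch.no_grad():                              # teacher is not updated
        inference_t = teacher(transcripts)             # teacher logits
    loss1 = F.cross_entropy(inference_s, answer)       # f(answer, inference_s)
    loss2 = distance_fn(inference_s, inference_t)      # g(inference_s, inference_t)
    return loss1 + loss2
```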
Method
• Cross-modal KD
What determines the loss? (these choices are sketched below)
• WHO TEACHES
• HOW IS THE LOSS CALCULATED
– MAE, MSE
• HOW MUCH THE GUIDANCE INFLUENCES (SCHEDULING)
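To make the three knobs concrete (a hedged sketch only; the paper's exact settings are not specified on this slide): who teaches is whatever fine-tuned text model supplies `teacher_logits`, how the loss is calculated is the choice between MAE and MSE over the logits, and how much the guidance influences training is a weight on the distillation term:

```python
import torch.nn.functional as F

# HOW THE LOSS IS CALCULATED: g as MAE (L1) or MSE (L2) over the logits.
DISTANCES = {"mae": F.l1_loss, "mse": F.mse_loss}

def weighted_kd_loss(student_logits, teacher_logits, answer,
                     distance="mae", guidance_weight=1.0):
    """Loss1 + w * Loss2; w realizes the scheduling knob (schedules sketched later)."""
    loss1 = F.cross_entropy(student_logits, answer)
    loss2 = DISTANCES[distance](student_logits, teacher_logits)
    return loss1 + guidance_weight * loss2
```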
Result and Discussion
• Teacher performance
GT-based, high performance
Not encouraging for ASR results
• Why the ASR-NLU baseline is borrowed (Wang et al., 2019)
• Comparison with the baseline
Distillation is successful given a flexible teacher influence
Reaches high performance with only a simple distillation
The Professor model does not necessarily dominate, but the Hybrid model is effective with MAE as the loss function
Result and Discussion
• Comparison with the baseline (cont’d)
Better teacher performance does not guarantee high-quality distillation
• In correspondence with recent findings in image processing and ASR distillation
– A tutor might be better than a professor?
MAE is overall better than MSE
• Probable correspondence with SpeechBERT
• Why?
– Different nature of the inputs
– MSE might amplify the gap and lead to collapse (see the gradient sketch below)
» Partly observed in data-shortage scenarios
(Chuang et al., 2019)
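One way to see the amplification argument above: with a large teacher-student logit gap, the MSE gradient on the student grows with the gap, while the MAE gradient stays bounded. A tiny numerical sketch with made-up values:

```python
import torch
import torch.nn.functional as F

student = torch.tensor([0.5], requires_grad=True)   # student logit
teacher = torch.tensor([5.0])                        # teacher logit: a large cross-modal gap

for name, loss_fn in [("MAE", F.l1_loss), ("MSE", F.mse_loss)]:
    grad, = torch.autograd.grad(loss_fn(student, teacher), student)
    print(name, grad.item())   # MAE: -1.0 (bounded), MSE: -9.0 (= 2 * (0.5 - 5.0))
```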
Result and Discussion
• Data shortage scenario
The MSE collapse is more explicit
Scheduling also matters
• Exp. performing better than Tri. and Err. shows that
– Warm-up and decay is powerful (assumed schedules are sketched below)
– Teacher influence does not necessarily have to last long
• However, a less mechanical approach is still anticipated
– e.g., entropy-based?
The overall result suggests that distillation from a fine-tuned LM helps the student learn some information regarding uncertainty that is difficult to obtain from a speech-only end-to-end system
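The exact schedules are not spelled out on the slide, so the following is only an assumed illustration of what an exponential warm-up-and-decay weight (Exp.) and a triangular one (Tri.) for the distillation term might look like:

```python
import math

def exp_warmup_decay(step, total, peak=1.0, warmup_frac=0.1, sharpness=5.0):
    """Linear warm-up to `peak`, then exponential decay of the KD weight."""
    warmup = max(1, int(warmup_frac * total))
    if step < warmup:
        return peak * step / warmup
    return peak * math.exp(-sharpness * (step - warmup) / (total - warmup))

def triangular(step, total, peak=1.0):
    """Linear ramp up to the midpoint, then linear ramp back down."""
    half = total / 2
    return peak * (step / half if step <= half else (total - step) / half)

# The returned weight multiplies Loss2 (the teacher guidance) at each training step.
```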
Result and Discussion
• Discussion
Is this cross-modal or multi-modal?
• Probably cross-modal; though the text (either ASR output or GT) comes from the speech, the formats differ (waveform vs. Unicode text)
Is this knowledge sharing?
• Also yes; though we exploit logit-level information, the different aspects of uncertainty derived from each modality might affect the distillation process, making it knowledge sharing rather than mere optimization
Can paralinguistic properties be engaged?
• Further study; frame-level acoustic information could be residually connected to compensate for the loss, though this might not leverage much from the text-based LMs
Conclusion
• Cross-modal distillation works in SLU, even if the teacher's input modality is explicitly different from that of the student
• A simple distillation from a fine-tuned LM helps the student learn some uncertainty information that is hard to obtain from speech-only training
• The MAE loss is effective in speech-to-text adaptation, possibly with warm-up-and-decay scheduling of the KD loss
Reference (in order of appearance)
• Hinton, Geoffrey, Oriol Vinyals, and Jeff Dean. "Distilling the knowledge in a neural network." arXiv preprint
arXiv:1503.02531 (2015).
• Allen, James F., and C. Raymond Perrault. "Analyzing intention in utterances." Artificial intelligence 15.3 (1980): 143-178.
• Hemphill, Charles T., John J. Godfrey, and George R. Doddington. "The ATIS spoken language systems pilot corpus."
Speech and Natural Language: Proceedings of a Workshop Held at Hidden Valley, Pennsylvania, June 24-27, 1990. 1990.
• Lugosch, Loren, et al. "Speech Model Pre-training for End-to-End Spoken Language Understanding." arXiv preprint
arXiv:1904.03670 (2019).
• Wang, Pengwei, et al. "Large-Scale Unsupervised Pre-Training for End-to-End Spoken Language Understanding." ICASSP
2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020.
• Peters, Matthew E., et al. "Deep contextualized word representations." arXiv preprint arXiv:1802.05365 (2018).
• Vaswani, Ashish, et al. "Attention is all you need." Advances in Neural Information Processing Systems. 2017.
• Devlin, Jacob, et al. "BERT: Pre-training of deep bidirectional transformers for language understanding." arXiv preprint arXiv:1810.04805 (2018).
• Chuang, Yung-Sung, Chi-Liang Liu, and Hung-Yi Lee. "SpeechBERT: Cross-modal pre-trained language model for end-to-
end spoken question answering." arXiv preprint arXiv:1910.11559 (2019).
• Liu, Yuchen, et al. "End-to-end speech translation with knowledge distillation." arXiv preprint arXiv:1904.08075 (2019).
• Ravanelli, Mirco, and Yoshua Bengio. "Speaker recognition from raw waveform with sincnet." 2018 IEEE Spoken Language
Technology Workshop (SLT). IEEE, 2018.
• Wolf, Thomas, et al. "HuggingFace's Transformers: State-of-the-art Natural Language Processing." arXiv preprint arXiv:1910.03771 (2019).