12. Proposed Method: wav2vec 2.0 [Baevski+, NeurIPS2020]
• Improves vq-wav2vec so that it can be pre-trained in a single step
– As in vq-wav2vec, a Feature Encoder produces latent representations 𝒛𝒊
– The 𝒛𝒊 are randomly masked, and a Context Encoder (Transformer) then encodes the entire utterance
– Metric learning trains the context 𝒄𝒊 at each masked position to be close to the corresponding quantized 𝒒𝒊
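The steps above can be sketched end to end. This is a minimal NumPy toy, not the actual model: `feature_encoder` and `context_encoder` are hypothetical stand-ins for the CNN and Transformer, `np.round` stands in for the quantization module, and the loss is an InfoNCE-style contrastive loss over time steps.

```python
import numpy as np

rng = np.random.default_rng(0)

def feature_encoder(x, d=8):
    # Toy stand-in for the CNN Feature Encoder: strided linear projection.
    frames = x.reshape(-1, 4)                 # downsample: 4 raw samples per frame
    W = rng.standard_normal((4, d))
    return frames @ W                         # latent speech representations z_i

def context_encoder(z):
    # Toy stand-in for the Transformer Context Encoder: mixes information
    # across time (here just a global mean added to each frame).
    return z + z.mean(axis=0, keepdims=True)  # context representations c_i

def contrastive_loss(c, q, masked, temperature=0.1):
    # At each masked position i, c_i should be closer (in cosine similarity)
    # to its own quantized target q_i than to distractors from other steps.
    losses = []
    for i in masked:
        sims = (q @ c[i]) / (np.linalg.norm(q, axis=1) * np.linalg.norm(c[i]) + 1e-8)
        logits = sims / temperature
        log_probs = logits - np.log(np.exp(logits).sum())  # log-softmax
        losses.append(-log_probs[i])
    return float(np.mean(losses))

x = rng.standard_normal(64)        # toy raw waveform
z = feature_encoder(x)             # (16, 8) latent representations
q = np.round(z)                    # toy "quantization" of z
masked = [3, 7, 11]
z_masked = z.copy()
z_masked[masked] = 0.0             # mask out selected frames before the Transformer
c = context_encoder(z_masked)
print(contrastive_loss(c, q, masked))
```

The real model replaces masked frames with a learned vector and samples distractors from other masked positions, but the objective has the same shape.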
[Figure 1 of the paper: raw waveform X → CNN (Feature Encoder) → latent speech representations Z (𝒛𝒊); Z is quantized into Q (𝒒𝒊) and, after masking, fed to the Transformer (Context Encoder) to yield context representations C (𝒄𝒊); a contrastive loss compares 𝒄𝒊 with 𝒒𝒊. Caption: "Illustration of our framework which jointly learns contextualized speech representations and an inventory of discretized speech units."]
13. Proposed Method: wav2vec 2.0 [Baevski+, NeurIPS2020]
• (Supplement) Why do the Context Encoder input and the loss-computation input differ?
– Context Encoder input 𝒛𝒊: quantization module not applied
– Contrastive loss input 𝒒𝒊: quantization module applied
– In short: the paper's ablations found this combination gave the best accuracy
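The asymmetry can be made concrete with a small sketch. Here the codebook and nearest-codeword `quantize` are simplifications of the paper's Gumbel-softmax product quantization; all names and sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical codebook for the quantization module: V entries of dimension d.
V, d, T = 16, 8, 10
codebook = rng.standard_normal((V, d))

def quantize(z):
    # Nearest-codeword quantization (a simplification of the Gumbel-softmax
    # product quantization used in the actual paper).
    dists = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)  # (T, V)
    idx = dists.argmin(axis=1)
    return codebook[idx], idx

z = rng.standard_normal((T, d))  # continuous latents from the Feature Encoder

# Context Encoder input: the continuous z (quantization NOT applied),
# so the Transformer keeps the full continuous information.
context_input = z

# Contrastive-loss targets: the quantized q (quantization applied),
# so each target is snapped to one of V discrete codebook entries.
q, codes = quantize(z)

print(np.allclose(context_input, q))  # False: q is discretized, z is not
```

Feeding continuous 𝒛𝒊 to the Transformer avoids losing information at the input, while discrete targets 𝒒𝒊 make the prediction task a choice among a finite inventory of speech units.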
25. References
• Baevski, Alexei, et al. "wav2vec 2.0: A framework for self-supervised learning of speech representations." In NeurIPS, 2020.
• Schneider, Steffen, et al. "wav2vec: Unsupervised pre-training for speech recognition." arXiv preprint arXiv:1904.05862, 2019.
• Baevski, Alexei, Steffen Schneider, and Michael Auli. "vq-wav2vec: Self-supervised learning of discrete speech representations." In ICLR, 2020.
• Joshi, Mandar, et al. "SpanBERT: Improving pre-training by representing and predicting spans." In TACL, 2020.