2. To synthesize a speech waveform...
Sentence: "arayuru genjitsu wo subete jibun no hou e..." (every reality, all toward oneself...)
[Figure: the sentence is decomposed into a word sequence, then a phoneme sequence (a r a y u r u / sil / g e N j i ts u, where sil marks silence), and finally rendered as the speech waveform]
• The characteristics of the speech waveform must be captured well...
• How can long-term dependencies be captured?
• How can fluctuation components be captured?
A technique that solves these long-standing research problems was proposed in September 2016!
WaveNet [van den Oord; ’16b]!
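WaveNet's answer to the long-term dependency question is a stack of dilated causal convolutions: each layer doubles its dilation, so the receptive field grows exponentially with depth while each output sample still depends only on past samples. A minimal numpy sketch (function names are illustrative, not from any WaveNet implementation):

```python
import numpy as np

def causal_dilated_conv(x, w, dilation):
    """1-D causal convolution with the given dilation factor.
    Each output sample sees only the current and past input samples."""
    k = len(w)
    pad = (k - 1) * dilation
    # left-pad with zeros so output length equals input length
    # and no future sample leaks into the output
    xp = np.concatenate([np.zeros(pad), np.asarray(x, dtype=float)])
    return np.array([sum(w[i] * xp[t + pad - i * dilation] for i in range(k))
                     for t in range(len(x))])

def receptive_field(kernel_size, dilations):
    """Number of past samples visible to one output of the full stack."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)

# WaveNet-style schedule: kernel size 2, dilations 1, 2, 4, ..., 512,
# repeated over three blocks
dilations = [2 ** i for i in range(10)] * 3
print(receptive_field(2, dilations))  # → 3070
```

With 30 thin layers the stack already spans 3070 samples (~190 ms at 16 kHz), which is how a sample-level model can capture dependencies far beyond a single pitch period.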
41. [Arik; ’17] S. Arik, M. Chrzanowski, A. Coates, G. Diamos, A. Gibiansky, Y. Kang, X. Li, J. Miller, A. Ng, J.
Raiman, S. Sengupta, M. Shoeybi. Deep Voice: real-time neural text-to-speech. arXiv preprint,
arXiv:1702.07825, 2017.
[Gu; ’17] Y. Gu, Z. Ling. Waveform modeling using stacked dilated convolutional neural networks for speech
bandwidth extension. Proc. INTERSPEECH, pp. 1123–1127, 2017.
[Hayashi; ’17] T. Hayashi, A. Tamamori, K. Kobayashi, K. Takeda, T. Toda. An investigation of multi-speaker
training for WaveNet vocoder. Proc. IEEE ASRU, pp. 712–718, 2017.
[He; ’16] K. He, X. Zhang, S. Ren, J. Sun. Deep residual learning for image recognition. Proc. CVPR, pp. 770–
778, 2016.
[Itakura; ’68] F. Itakura, S. Saito. Analysis synthesis telephony based upon the maximum likelihood method.
Proc. ICA, C-5-5, pp. C17–20, 1968.
[Juvela; ’16] L. Juvela, B. Bollepalli, M. Airaksinen, P. Alku. High-pitched excitation generation for glottal
vocoding in statistical parametric speech synthesis using a deep neural network. Proc. IEEE ICASSP, pp.
5120–5124, 2016.
[Kawahara; ’99] H. Kawahara, I. Masuda-Katsuse, A. de Cheveigné. Restructuring speech representations
using a pitch-adaptive time-frequency smoothing and an instantaneous-frequency-based F0 extraction:
possible role of a repetitive structure in sounds. Speech Communication, Vol. 27, No. 3–4, pp. 187–207,
1999.
[Kingma; ’16] D.P. Kingma, T. Salimans, M. Welling. Improving variational inference with inverse
autoregressive flow. arXiv preprint, arXiv:1606.04934, 2016.
References
References: 1
42. [Kobayashi; ’17] K. Kobayashi, T. Hayashi, A. Tamamori, T. Toda. Statistical voice conversion with WaveNet-
based waveform generation. Proc. INTERSPEECH, pp. 1138–1142, 2017.
[Maia; ’13] R. Maia, M. Akamine, M. Gales. Complex cepstrum for statistical parametric speech synthesis.
Speech Communication, Vol. 55, No. 5, pp. 606–618, 2013.
[Morise; ’16] M. Morise, F. Yokomori, K. Ozawa. WORLD: a vocoder-based high-quality speech synthesis
system for real-time applications. IEICE Trans. Inf. & Syst., Vol. E99-D, No. 7, pp. 1877–1884, 2016.
[Niwa; ’17] J. Niwa, T. Yoshimura, K. Hashimoto, K. Oura, Y. Nankaku, K. Tokuda. WaveNet-based voice
conversion. Proc. Autumn Meeting of ASJ, 1-8-15, pp. 207–208, Sep. 2017 (in Japanese).
[Okamoto; ’17] T. Okamoto, K. Tachibana, T. Toda, Y. Shiga, H. Kawai. Subband WaveNet with overlapped
single-sideband filterbanks. Proc. IEEE ASRU, pp. 698–704, 2017.
[Pantazis; ’11] Y. Pantazis, O. Rosec, Y. Stylianou. Adaptive AM–FM signal decomposition with application to
speech analysis. IEEE Trans. on Audio, Speech, & Lang. Process., Vol. 19, No. 2, pp. 290–300, 2011.
[Qian; ’17] K. Qian, Y. Zhang, S. Chang, X. Yang, D. Florêncio, M. Hasegawa-Johnson. Speech enhancement
using Bayesian WaveNet. Proc. INTERSPEECH, pp. 2013–2017, 2017.
[Rethage; ’17] D. Rethage, J. Pons, X. Serra. A WaveNet for speech denoising. arXiv preprint,
arXiv:1706.07162, 2017.
[Salimans; ’17] T. Salimans, A. Karpathy, X. Chen, D.P. Kingma. PixelCNN++: improving the PixelCNN with
discretized logistic mixture likelihood and other modifications. arXiv preprint, arXiv:1701.05517, 2017.
[Shen; ’17] J. Shen, R. Pang, R.J. Weiss, M. Schuster, N. Jaitly, Z. Yang, Z. Chen, Y. Zhang, Y. Wang, R. Skerry-
Ryan, R.A. Saurous, Y. Agiomyrgiannakis, Y. Wu. Natural TTS synthesis by conditioning WaveNet on mel
spectrogram predictions. arXiv preprint, arXiv:1712.05884, 2017.
[Takamichi; ’16] S. Takamichi, T. Toda, A.W. Black, G. Neubig, S. Sakti, S. Nakamura. Post-filters to modify
the modulation spectrum for statistical parametric speech synthesis. IEEE/ACM Trans. Audio, Speech &
Lang. Process., Vol. 24, No. 4, pp. 755–767, 2016.
References: 2
43. [橘; ’17] K. Tachibana, T. Toda, Y. Shiga, H. Kawai. An evaluation of speech waveform quantization
methods for WaveNet. Proc. Spring Meeting of ASJ, 1-Q-28, pp. 291–294, Mar. 2017 (in Japanese).
[Tamamori; ’17] A. Tamamori, T. Hayashi, K. Kobayashi, K. Takeda, T. Toda. Speaker-dependent WaveNet
vocoder. Proc. INTERSPEECH, pp. 1118–1122, 2017.
[Toda; ’07] T. Toda, A.W. Black, K. Tokuda. Voice conversion based on maximum likelihood estimation of
spectral parameter trajectory. IEEE Trans. Audio, Speech & Lang. Process., Vol. 15, No. 8, pp. 2222–2235,
2007.
[Tokuda; ’15] K. Tokuda, H. Zen. Directly modeling speech waveforms by neural networks for statistical
parametric speech synthesis. Proc. IEEE ICASSP, pp. 4215–4219, 2015.
[徳田; ’92] K. Tokuda, T. Kobayashi, K. Chiba, S. Imai. Spectral estimation of speech based on mel-generalized
cepstral analysis. IEICE Trans. A, Vol. J75-A, No. 7, pp. 1124–1134, 1992 (in Japanese).
[van den Oord; ’16a] A. van den Oord, N. Kalchbrenner, O. Vinyals, L. Espeholt, A. Graves, K. Kavukcuoglu.
Conditional image generation with PixelCNN decoders. arXiv preprint, arXiv:1606.05328, 2016.
[van den Oord; ’16b] A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N.
Kalchbrenner, A. Senior, K. Kavukcuoglu. WaveNet: a generative model for raw audio. arXiv preprint,
arXiv:1609.03499, 2016.
[van den Oord; ’17a] A. van den Oord, O. Vinyals, K. Kavukcuoglu. Neural discrete representation learning.
arXiv preprint, arXiv:1711.00937, 2017.
[van den Oord; ’17b] A. van den Oord, Y. Li, I. Babuschkin, K. Simonyan, O. Vinyals, K. Kavukcuoglu, G. van
den Driessche, E. Lockhart, L.C. Cobo, F. Stimberg, N. Casagrande, D. Grewe, S. Noury, S. Dieleman, E. Elsen,
N. Kalchbrenner, H. Zen, A. Graves, H. King, T. Walters, D. Belov, D. Hassabis. Parallel WaveNet: fast high-
fidelity speech synthesis. arXiv preprint, arXiv:1711.10433, 2017.
[吉村; ’17] T. Yoshimura, K. Hashimoto, K. Oura, Y. Nankaku, K. Tokuda. Applying mel-cepstrum-based
noise shaping quantization to WaveNet. Proc. Autumn Meeting of ASJ, 1-8-8, pp. 193–194, Sep. 2017 (in Japanese).
References: 3