エンドツーエンド音声合成に向けたNIIにおけるソフトウェア群 ~ TacotronとWaveNetのチュートリアル (Part 1)~
(NII software toward end-to-end speech synthesis: a tutorial on Tacotron and WaveNet, Part 1)

Published January 27, 2019

Slides for the tutorial invited lecture "NII software toward end-to-end speech synthesis: a tutorial on Tacotron and WaveNet", given at the speech research meeting (SP) held in Kanazawa on Sunday, January 27, 2019.

Presenters: Xin Wang, Yusuke Yasuda

  1. NII. Tacotron, WaveNet. Contact: wangxin@nii.ac.jp (we welcome critical comments, suggestions, and discussion). Xin WANG, Yusuke YASUDA. National Institute of Informatics, Japan. 2019-01-27
  2. NII, Part 1: WaveNet. Neural waveform models. Contact: wangxin@nii.ac.jp (we welcome critical comments, suggestions, and discussion). Xin WANG. National Institute of Informatics, Japan. 2019-01-27
  3. SELF-INTRODUCTION. PhD (2015-2018); M.A. (2012-2015); http://tonywangx.github.io
  4. CONTENTS. Introduction: text-to-speech synthesis; neural waveform models; summary & software.
  5. INTRODUCTION. Text-to-speech synthesis (TTS) [1,2] and NII's TTS systems. Text -> text-to-speech (TTS) -> speech waveform. NII TTS systems over the years: 2013, 2016, 2018. [1] P. Taylor. Text-to-Speech Synthesis. Cambridge University Press, 2009. [2] T. Dutoit. An Introduction to Text-to-Speech Synthesis. Kluwer Academic Publishers, Norwell, MA, USA, 1997.
  6. INTRODUCTION. Text-to-speech synthesis (TTS): architectures. Pipeline: text -> text analyzer -> linguistic features -> acoustic model(s) -> acoustic features (spectral features, F0, band-aperiodicity, etc.) -> waveform model -> waveform.
  7. INTRODUCTION. TTS architectures. The acoustic model(s) and waveform model form the TTS backend behind the text analyzer; alternatively, an end-to-end TTS model maps the input directly to the waveform.
  8. INTRODUCTION. TTS architectures. Both the pipeline backend and end-to-end TTS contain a waveform module. Tutorial part 1: neural waveform modeling; tutorial part 2: end-to-end TTS.
  9. CONTENTS. Introduction: text-to-speech synthesis; neural waveform modeling (overview, autoregressive models, normalizing flow, STFT-based training criterion); summary.
  10. OVERVIEW. Task definition. Spectrogram.
  11. OVERVIEW. Task definition. Waveform values o_1, o_2, o_3, o_4, ..., o_T (T waveform sampling points) and spectrogram frames c_1, c_2, c_3, ..., c_N (N frames).
  12. OVERVIEW. Task definition. Neural waveform models: p(o_{1:T} | c_{1:N}; Θ), i.e., generate the waveform values o_{1:T} given the frame-level features c_{1:N}.
  13. OVERVIEW. Naive model: simple network + maximum likelihood. Feedforward network: up-sampling -> hidden layers -> distribution p(o_t | c_{1:N}; Θ) for each waveform value.
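The up-sampling step above bridges a rate mismatch: the conditioning features c_{1:N} are at frame level while the waveform o_{1:T} is at sample level. A minimal numpy sketch assuming simple repetition; the hop size of 80 is illustrative, and real models often learn this step instead (e.g., with transposed convolutions):

```python
import numpy as np

# Up-sample frame-level conditioning features to the waveform sample rate
# by repeating each frame `hop` times. Repetition is the simplest choice;
# learned up-sampling layers are common in practice.
def upsample(c, hop):
    return np.repeat(c, hop, axis=0)

c = np.arange(4, dtype=float).reshape(4, 1)  # N = 4 frames, 1 feature dim
u = upsample(c, hop=80)
print(u.shape)  # (320, 1): T = N * hop samples
```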
  14. OVERVIEW. Naive model: simple network + maximum likelihood. RNN: up-sampling -> hidden layers -> distribution over waveform values.
  15. OVERVIEW. Naive model: simple network + maximum likelihood. Dilated CNN [1,2]: up-sampling -> hidden layers -> distribution over waveform values. [1] F. Yu and V. Koltun. Multi-scale context aggregation by dilated convolutions. arXiv preprint arXiv:1511.07122, 2015. [2] A. Waibel, T. Hanazawa, G. Hinton, K. Shikano, and K. Lang. Phoneme recognition using time-delay neural networks. IEEE Transactions on Acoustics, Speech, and Signal Processing, 37(3):328-339, 1989.
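A quick sanity check on why dilated CNNs suit long waveforms: with kernel size 2 and the dilation doubled at every layer (the pattern WaveNet popularized), the receptive field grows exponentially with depth while the cost grows only linearly. A small sketch; the layer count is illustrative:

```python
# Receptive field of stacked dilated causal convolutions: a layer with
# dilation d and kernel size k adds (k - 1) * d samples of past context.
def receptive_field(num_layers, kernel_size=2):
    dilations = [2 ** i for i in range(num_layers)]
    return 1 + sum((kernel_size - 1) * d for d in dilations)

print(receptive_field(10))  # 1024: ten layers already see 1024 samples
```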
  16. OVERVIEW. Naive model: p(o_{1:T} | c_{1:N}; Θ) = p(o_1 | c_{1:N}; Θ) ··· p(o_T | c_{1:N}; Θ) = ∏_{t=1}^{T} p(o_t | c_{1:N}; Θ). Mismatch: a weak temporal model faces the strongly correlated data o_{1:T}.
  17. OVERVIEW. Towards better models. Starting from the naive model (model + criterion over o_{1:T}, c_{1:N}), there are three directions: a better model with an easier target; a transform between the waveform and the model; a better criterion. Their combination gives a better model with a knowledge-based transform.
  18. OVERVIEW. Model families. Autoregressive models: WaveNet, WaveRNN, SampleRNN, FFTNet, universal neural vocoder. Normalizing flow / feature transformation: WaveGlow, FloWaveNet, Parallel WaveNet, ClariNet. Training criterion in the frequency domain: neural source-filter model, RNN + STFT loss. Knowledge-based transform from speech/signal processing: LPCNet, LP-WaveNet, ExcitNet, GlotNet, Subband WaveNet. Please check the references in the appendix.
  19. CONTENTS. Introduction: text-to-speech synthesis; neural waveform models (overview, autoregressive models, normalizing flow, STFT-based training criterion); summary & software.
  20. PART I: AUTOREGRESSIVE MODELS. Core idea. Naive model: p(o_{1:T} | c_{1:N}; Θ) = ∏_{t=1}^{T} p(o_t | c_{1:N}; Θ).
  21. PART I: AUTOREGRESSIVE MODELS. Core idea. Autoregressive (AR) model: p(o_{1:T} | c_{1:N}; Θ) = ∏_{t=1}^{T} p(o_t | o_{1:t-1}, c_{1:N}; Θ).
  22. PART I: AUTOREGRESSIVE MODELS. Core idea. AR model training: use the natural waveform for feedback (teacher forcing [1]). [1] R. J. Williams and D. Zipser. A learning algorithm for continually running fully recurrent neural networks. Neural Computation, 1(2):270-280, 1989.
  23. PART I: AUTOREGRESSIVE MODELS. Core idea. AR model training: the natural waveform is fed back as the model input at each step (teacher forcing [1]). [1] Williams & Zipser, 1989.
  24. PART I: AUTOREGRESSIVE MODELS. Core idea. AR model generation: feed back the generated waveform: p(ô_{1:T} | c_{1:N}; Θ) = ∏_{t=1}^{T} p(ô_t | ô_{t-P:t-1}, c_{1:N}; Θ). [1] Williams & Zipser, 1989.
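The generation-time feedback loop above can be sketched in a few lines. This is a toy illustration, not any particular model: the `predict` function is a hypothetical linear stand-in for a real network such as WaveNet, and P, T, and the noise scale are made up for the demo. What it shows is the structural point: every sample depends on previously generated samples, so generation is strictly sequential.

```python
import numpy as np

# Toy autoregressive generation loop: each new sample is drawn from a
# distribution conditioned on the previous P generated samples, mirroring
# p(o_t | o_{t-P:t-1}, c_{1:N}) above. `predict` is a hypothetical
# stand-in for a trained network.
def generate(predict, T, P=2, noise_std=0.1, seed=0):
    rng = np.random.default_rng(seed)
    o = np.zeros(T + P)
    for t in range(P, T + P):      # strictly sequential: O(T) steps
        mu = predict(o[t - P:t])   # condition on the generated past
        o[t] = mu + noise_std * rng.standard_normal()
    return o[P:]

wav = generate(lambda past: 0.9 * past[-1], T=200)
print(wav.shape)  # (200,)
```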
  25. PART I: AUTOREGRESSIVE MODELS. WaveNet, within the family map: autoregressive models (WaveNet, WaveRNN, SampleRNN, FFTNet); normalizing flow / feature transformation (WaveGlow, FloWaveNet, Parallel WaveNet, ClariNet); training criterion in the frequency domain (neural source-filter model, RNN + STFT loss); knowledge-based transform from speech/signal processing (LPCNet, LP-WaveNet, ExcitNet, GlotNet, Subband WaveNet); universal neural vocoder.
  26. PART I: AUTOREGRESSIVE MODELS. WaveNet: network structure. Details: http://id.nii.ac.jp/1001/00185683/ and http://tonywangx.github.io/pdfs/wavenet.pdf. A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu. WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499, 2016.
  27. PART I: AUTOREGRESSIVE MODELS. Issues with WaveNet & SampleRNN: generation is very, very slow (please check SampleRNN in the appendix). Acceleration? Engineering-based: parallelize or simplify the computation. Knowledge-based: speech/signal processing theory.
  28. PART I: AUTOREGRESSIVE MODELS. WaveRNN. Linear-time AR dependency in WaveNet: 16 waveform values require 16 sequential generation steps.
  29. PART I: AUTOREGRESSIVE MODELS. WaveRNN. Subscale AR dependency + batch sampling: generate multiple samples at the same time, e.g., 16 samples in 10 steps instead of 16.
  30. PART I: AUTOREGRESSIVE MODELS. WaveRNN. Subscale AR dependency + batch sampling (continued): multiple samples per step, 16 samples in 10 steps.
  31. PART I: AUTOREGRESSIVE MODELS. AR models: generation is still slow, because p(o_{1:T} | c_{1:N}; Θ) = ∏_{t=1}^{T} p(o_t | o_{1:t-1}, c_{1:N}; Θ) must be evaluated sample by sample.
  32. PART I: AUTOREGRESSIVE MODELS. A non-autoregressive model?
  33. CONTENTS. Introduction: text-to-speech synthesis; neural waveform models (overview, autoregressive models, normalizing flow, STFT-based training criterion); summary & software.
  34. PART II: NORMALIZING FLOW-BASED MODELS. General idea: transform o, which has strong temporal correlation, into z with weak temporal correlation, using the principle of the change of random variables: p_o(o | c; Θ) = p_z(z = f^{-1}(o) | c; Θ) |det ∂f^{-1}(o)/∂o|. Training applies z = f^{-1}(o); generation applies ô = f(ẑ). See https://en.wikipedia.org/wiki/Probability_density_function#Dependent_variables_and_change_of_variables
  35. PART II: NORMALIZING FLOW-BASED MODELS. General idea: transform o into a Gaussian noise sequence z through multiple transformations (a normalizing flow): p_o(o | c; Θ_1, ..., Θ_M) = p_z(z = f_1^{-1} ∘ ··· ∘ f_{M-1}^{-1} ∘ f_M^{-1}(o) | c) ∏_{m=1}^{M} |det J(f_m^{-1})|. Training applies f_M^{-1}(·), ..., f_1^{-1}(·) to o; generation applies f_1(·), ..., f_M(·) to ẑ drawn from the standard Gaussian. D. Rezende and S. Mohamed. Variational inference with normalizing flows. In Proc. ICML, pages 1530-1538, 2015.
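The change-of-variable likelihood above can be checked numerically with a single affine step. A sketch under simplifying assumptions: μ and σ are fixed scalars here rather than network outputs, and only one transform is used instead of a stack of M:

```python
import numpy as np

# log p_o(o) = log N(z; 0, 1) + log |det df^{-1}/do| for the affine
# inverse transform z = (o - mu) / s, whose Jacobian is diag(1/s).
def flow_log_prob(o, mu, s):
    z = (o - mu) / s
    log_pz = -0.5 * (z ** 2 + np.log(2.0 * np.pi))  # standard normal
    log_det = -np.log(s)                            # |dz/do| = 1/s
    return float(np.sum(log_pz + log_det))

o = np.array([0.1, -0.2, 0.05])
# With mu = 0, s = 1 the transform is the identity and the log-det is 0,
# so the result equals the standard-normal log-likelihood of o itself.
print(flow_log_prob(o, mu=0.0, s=1.0))
```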
  36. PART II: NORMALIZING FLOW-BASED MODELS. p_o(o | c; Θ_1, ..., Θ_M) = p_z(z = f_1^{-1} ∘ ··· ∘ f_M^{-1}(o) | c) ∏_{m=1}^{M} |det J(f_m^{-1})|. Parallel WaveNet and ClariNet define f_m(·) in the linear time domain; WaveGlow and FloWaveNet define f_m(·) in the subscale time domain. The different time complexity of f_m(·) leads to different training strategies. A. van den Oord, Y. Li, I. Babuschkin, K. Simonyan, O. Vinyals, K. Kavukcuoglu, G. van den Driessche, E. Lockhart, L. C. Cobo, F. Stimberg, et al. Parallel WaveNet: Fast high-fidelity speech synthesis. arXiv preprint arXiv:1711.10433, 2017. W. Ping, K. Peng, and J. Chen. ClariNet: Parallel wave generation in end-to-end text-to-speech. arXiv preprint arXiv:1807.07281, 2018. S. Kim, S.-g. Lee, J. Song, and S. Yoon. FloWaveNet: A generative flow for raw audio. arXiv preprint arXiv:1811.02155, 2018. R. Prenger, R. Valle, and B. Catanzaro. WaveGlow: A flow-based generative network for speech synthesis. arXiv preprint arXiv:1811.00002, 2018.
  37. PART II: NORMALIZING FLOW-BASED MODELS. ClariNet & Parallel WaveNet: generation process. Draw noise ẑ^{(0)}_{1:T} from the standard Gaussian; a CNN predicts σ_{1:T} and μ_{1:T}; each flow step computes ẑ^{(1)}_t = ẑ^{(0)}_t · σ_t + μ_t for all t, and stacking f_1(·), ..., f_M(·) yields ô_{1:T}. Parallel computation, fast generation. (* The initial condition may be implemented differently.)
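The per-sample update ẑ^{(1)}_t = ẑ^{(0)}_t · σ_t + μ_t vectorizes over all t, which is exactly what makes generation parallel. A numpy sketch with stand-in μ and σ (a real model predicts them with a CNN from ẑ^{(0)} and the conditioning features; the tanh-based μ here is purely illustrative):

```python
import numpy as np

# One flow step at generation time: every output sample is an affine
# function of the input noise, so all T samples are computed in a single
# vectorized pass instead of T sequential ones.
rng = np.random.default_rng(1)
T = 16000
z0 = rng.standard_normal(T)        # input noise z^{(0)}_{1:T}
mu = 0.1 * np.tanh(z0)             # stand-ins for CNN-predicted outputs
sigma = np.full(T, 0.5)
z1 = z0 * sigma + mu               # z^{(1)}_{1:T}, all samples at once
print(z1.shape)
```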
  38. PART II: NORMALIZING FLOW-BASED MODELS. ClariNet & Parallel WaveNet: naive training process. Apply f_M^{-1}(·), ..., f_1^{-1}(·) to o_{1:T}; each inverse step computes z^{(0)}_t = (z^{(1)}_t - μ_t) / σ_t, where the CNN-predicted μ_t and σ_t depend on earlier samples, so training time is ~O(T).
  39. PART II: NORMALIZING FLOW-BASED MODELS. Training versus generation. Normalizing flow (inverse-AR [1]): training ~O(T), generation ~O(1). Original WaveNet: training ~O(1), generation ~O(T). [1] D. P. Kingma, T. Salimans, R. Jozefowicz, X. Chen, I. Sutskever, and M. Welling. Improved variational inference with inverse autoregressive flow. In Proc. NIPS, pages 4743-4751, 2016.
  40. PART II: NORMALIZING FLOW-BASED MODELS. ClariNet & Parallel WaveNet: fast training by knowledge distilling. The teacher (a pre-trained WaveNet) gives p(ô_t | ô_{1:t-1}, c_{1:T}; Θ_teacher); the student gives p(ô_t | z_{1:T}, c_{1:T}; Θ_student); the student learns from the teacher.
  41. PART II: NORMALIZING FLOW-BASED MODELS. Parallel WaveNet & ClariNet: no AR dependency in generation, but a pre-trained teacher WaveNet is required. Without a teacher WaveNet?
  42. PART II: NORMALIZING FLOW-BASED MODELS. WaveGlow. Why ~O(T)? Dependency in the linear time domain. Reduce T? Dependency in the subscale time domain. R. Prenger, R. Valle, and B. Catanzaro. WaveGlow: A flow-based generative network for speech synthesis. arXiv preprint arXiv:1811.00002, 2018.
  43. PART II: NORMALIZING FLOW-BASED MODELS. WaveGlow (continued): the same comparison of linear-time versus subscale dependency, for both training and generation.
  44. PART II: NORMALIZING FLOW-BASED MODELS. A simple model without normalizing flow? The neural source-filter model.
  45. CONTENTS. Introduction: text-to-speech synthesis; neural waveform models (overview, autoregressive models, normalizing flow, STFT-based training criterion); summary & software.
  46. PART III: STFT-BASED TRAINING CRITERION. Neural source-filter model: naive model and STFT-based criterion; generated waveforms. X. Wang, S. Takaki, and J. Yamagishi. Neural source-filter-based waveform model for statistical parametric speech synthesis. arXiv preprint arXiv:1810.11946, 2018.
  47. PART III: STFT-BASED TRAINING CRITERION. Neural source-filter model: compute the STFT of both the generated and the natural waveforms and minimize the spectral amplitude error. X. Wang, S. Takaki, and J. Yamagishi. arXiv preprint arXiv:1810.11946, 2018.
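The spectral amplitude error above can be sketched directly in numpy: window the waveform into frames, take the FFT magnitude of each frame, and compare generated against natural. The frame and hop sizes here are illustrative, and real implementations typically combine several STFT resolutions:

```python
import numpy as np

# Frame-wise STFT amplitude, then mean squared amplitude distance
# between a generated and a natural waveform: a minimal sketch of the
# STFT-based training criterion.
def stft_amplitude(x, frame=64, hop=32):
    win = np.hanning(frame)
    frames = [x[i:i + frame] * win
              for i in range(0, len(x) - frame + 1, hop)]
    return np.abs(np.fft.rfft(np.stack(frames), axis=-1))

def spectral_loss(generated, natural):
    A, B = stft_amplitude(generated), stft_amplitude(natural)
    return float(np.mean((A - B) ** 2))

t = np.arange(1024) / 16000.0
natural = np.sin(2 * np.pi * 440 * t)
print(spectral_loss(natural, natural))  # 0.0 for identical waveforms
```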
  48. PART III: STFT-BASED TRAINING CRITERION. Neural source-filter model: a sine + harmonics source signal derived from F0 drives the model, which is trained with the spectral amplitude error between the STFTs of the generated and natural waveforms.
  49. PART III: STFT-BASED TRAINING CRITERION. Neural source-filter model: examples. Only harmonics, no formants.
  50.-54. PART III: STFT-BASED TRAINING CRITERION. Neural source-filter model: examples.
  55. PART III: STFT-BASED TRAINING CRITERION. Neural source-filter model: results. At least 100 times faster than WaveNet; smaller than WaveNet; speech quality similar to WaveNet. [MOS quality chart (1-5), copy-synthesis and text-to-speech: natural, WORLD vocoder, WaveNet µ-law, WaveNet Gaussian, neural source-filter.] Samples: https://nii-yamagishilab.github.io/samples-nsf/index.html
  56. CONTENTS. Introduction: text-to-speech synthesis; neural waveform models; summary & software.
  57. SUMMARY. Autoregressive models: WaveNet, WaveRNN, SampleRNN, FFTNet. Normalizing flow / feature transformation: WaveGlow, FloWaveNet, Parallel WaveNet, ClariNet. Frequency-domain criterion: neural source-filter model, RNN + STFT loss. Knowledge-based transform: LPCNet, LP-WaveNet, ExcitNet, GlotNet, Subband WaveNet. Universal neural vocoder.
  58. SOFTWARE. NII neural network toolkit, shown against the model-family map above.
  59. SOFTWARE. NII neural network toolkit: neural waveform models. 1. Toolkit cores: https://github.com/nii-yamagishilab/project-CURRENNT-public.git 2. Toolkit scripts: https://github.com/nii-yamagishilab/project-CURRENNT-scripts
  60. SOFTWARE. NII neural network toolkit: useful slides. http://tonywangx.github.io/pdfs/CURRENNT_TUTORIAL.pdf; other related slides: http://tonywangx.github.io/slides.html
  61. REFERENCE
  WaveNet: A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu. WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499, 2016.
  SampleRNN: S. Mehri, K. Kumar, I. Gulrajani, R. Kumar, S. Jain, J. Sotelo, A. Courville, and Y. Bengio. SampleRNN: An unconditional end-to-end neural audio generation model. arXiv preprint arXiv:1612.07837, 2016.
  WaveRNN: N. Kalchbrenner, E. Elsen, K. Simonyan, et al. Efficient neural audio synthesis. In Proc. ICML, volume 80 of Proceedings of Machine Learning Research, pages 2410-2419, 10-15 Jul 2018.
  FFTNet: Z. Jin, A. Finkelstein, G. J. Mysore, and J. Lu. FFTNet: A real-time speaker-dependent neural vocoder. In Proc. ICASSP, pages 2251-2255. IEEE, 2018.
  Universal vocoder: J. Lorenzo-Trueba, T. Drugman, J. Latorre, T. Merritt, B. Putrycz, and R. Barra-Chicote. Robust universal neural vocoding. arXiv preprint arXiv:1811.06292, 2018.
  Subband WaveNet: T. Okamoto, K. Tachibana, T. Toda, Y. Shiga, and H. Kawai. An investigation of subband WaveNet vocoder covering entire audible frequency range with limited acoustic features. In Proc. ICASSP, pages 5654-5658, 2018.
  Parallel WaveNet: A. van den Oord, Y. Li, I. Babuschkin, et al. Parallel WaveNet: Fast high-fidelity speech synthesis. In Proc. ICML, pages 3918-3926, 2018.
  ClariNet: W. Ping, K. Peng, and J. Chen. ClariNet: Parallel wave generation in end-to-end text-to-speech. arXiv preprint arXiv:1807.07281, 2018.
  FloWaveNet: S. Kim, S.-g. Lee, J. Song, and S. Yoon. FloWaveNet: A generative flow for raw audio. arXiv preprint arXiv:1811.02155, 2018.
  WaveGlow: R. Prenger, R. Valle, and B. Catanzaro. WaveGlow: A flow-based generative network for speech synthesis. arXiv preprint arXiv:1811.00002, 2018.
  RNN + STFT: S. Takaki, T. Nakashika, X. Wang, and J. Yamagishi. STFT spectral loss for training a neural speech waveform model. In Proc. ICASSP (submitted), 2018.
  NSF: X. Wang, S. Takaki, and J. Yamagishi. Neural source-filter-based waveform model for statistical parametric speech synthesis. arXiv preprint arXiv:1810.11946, 2018.
  LP-WaveNet: M.-J. Hwang, F. Soong, F. Xie, X. Wang, and H.-G. Kang. LP-WaveNet: Linear prediction-based WaveNet speech synthesis. arXiv preprint arXiv:1811.11913, 2018.
  GlotNet: L. Juvela, V. Tsiaras, B. Bollepalli, M. Airaksinen, J. Yamagishi, and P. Alku. Speaker-independent raw waveform model for glottal excitation. arXiv preprint arXiv:1804.09593, 2018.
  ExcitNet: E. Song, K. Byun, and H.-G. Kang. ExcitNet vocoder: A neural excitation model for parametric speech synthesis systems. arXiv preprint arXiv:1811.04769, 2018.
  LPCNet: J.-M. Valin and J. Skoglund. LPCNet: Improving neural speech synthesis through linear prediction. arXiv preprint arXiv:1810.11846, 2018.
  62. End of Part 1. Codes, scripts, slides: http://nii-yamagishilab.github.io. This work was partially supported by JST CREST Grant Number JPMJCR18A6, Japan, and by MEXT KAKENHI Grant Numbers 16H06302, 16K16096, 17H04687, 18H04120, 18H04112, and 18KT0051, Japan.
  63. INTRODUCTION. Text-to-speech (TTS): speech samples from NII's TTS system (2013, 2016, 2018).
  64. PART I: AUTOREGRESSIVE MODELS. SampleRNN: idea. Hierarchical / multi-resolution dependency. S. Mehri, K. Kumar, I. Gulrajani, R. Kumar, S. Jain, J. Sotelo, A. Courville, and Y. Bengio. SampleRNN: An unconditional end-to-end neural audio generation model. arXiv preprint arXiv:1612.07837, 2016.
  65. PART I: AUTOREGRESSIVE MODELS. SampleRNN: example network structure with three tiers; R, the time resolution, increases by a factor of 2 from tier to tier.
  66. PART I: AUTOREGRESSIVE MODELS. SampleRNN: example network structure, training. Tier 1: R = 1; tier 2: R = 2; tier 3: R = 4 (time resolution increased by a factor of 2 per tier).
  67. PART I: AUTOREGRESSIVE MODELS. SampleRNN: example network structure, generation process. Tier 1: R = 1; tier 2: R = 2; tier 3: R = 4.
  68. PART I: AUTOREGRESSIVE MODELS. WaveNet variants. From µ-law discrete waveform to continuous-valued waveform: mixture of logistic distributions [1]; GMM / single Gaussian [2]. Quantization noise shaping [3] and a related noise-shaping method [4]. [1] T. Salimans, A. Karpathy, X. Chen, and D. P. Kingma. PixelCNN++: Improving the PixelCNN with discretized logistic mixture likelihood and other modifications. arXiv preprint arXiv:1701.05517, 2017. [2] W. Ping, K. Peng, and J. Chen. ClariNet: Parallel wave generation in end-to-end text-to-speech. arXiv preprint arXiv:1807.07281, 2018. [3] T. Yoshimura, K. Hashimoto, K. Oura, Y. Nankaku, and K. Tokuda. Mel-cepstrum-based quantization noise shaping applied to neural-network-based speech waveform synthesis. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 26(7):1173-1180, 2018. [4] K. Tachibana, T. Toda, Y. Shiga, and H. Kawai. An investigation of noise shaping with perceptual weighting for WaveNet-based speech generation. In Proc. ICASSP, pages 5664-5668. IEEE, 2018.
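The µ-law companding that the original WaveNet applies before its 256-way softmax can be sketched as an encode/decode pair; the continuous-output variants above exist precisely to avoid this coarse quantization. Quantization to integer levels is omitted here, so the round trip is exact:

```python
import numpy as np

# mu-law companding: compress amplitudes so that 256 quantization levels
# cover small amplitudes more densely. The quantization step itself is
# omitted, so encode followed by decode recovers the input exactly.
def mu_law_encode(x, mu=255):
    return np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)

def mu_law_decode(y, mu=255):
    return np.sign(y) * ((1.0 + mu) ** np.abs(y) - 1.0) / mu

x = np.linspace(-1.0, 1.0, 11)
err = np.max(np.abs(mu_law_decode(mu_law_encode(x)) - x))
print(err)  # ~0: lossless round trip without the quantization step
```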
  69. PART I: AUTOREGRESSIVE MODELS. WaveRNN: WaveNet is inefficient. Computation cost: 1. impractical for 16-bit PCM (a softmax of size 65536); 2. a very deep network (about 50 dilated CNN layers); ... Time latency: 1. generation time ~O(waveform_length).
  70. PART I: AUTOREGRESSIVE MODELS. WaveRNN strategies. Computation cost: a two-level softmax instead of a 65536-way softmax; recurrent + feedforward layers instead of a very deep network. Time latency: subscale dependency + batch sampling. N. Kalchbrenner, E. Elsen, K. Simonyan, S. Noury, N. Casagrande, E. Lockhart, F. Stimberg, A. van den Oord, S. Dieleman, and K. Kavukcuoglu. Efficient neural audio synthesis. In Proc. ICML, volume 80 of Proceedings of Machine Learning Research, pages 2410-2419, Stockholmsmässan, Stockholm, Sweden, 10-15 Jul 2018. PMLR.
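The two-level softmax rests on a simple identity: a 16-bit sample index splits into coarse and fine 8-bit halves, so two 256-way distributions replace one impractical 65536-way softmax. A sketch of the split and its inverse:

```python
# Split a 16-bit sample index (0..65535) into coarse/fine 8-bit parts and
# reassemble it; each part needs only a 256-way categorical distribution.
def split16(sample):
    return sample // 256, sample % 256   # coarse, fine

def join16(coarse, fine):
    return coarse * 256 + fine

c, f = split16(40000)
print(c, f)  # 156 64
```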
