SlideShare una empresa de Scribd logo
1 de 69
Descargar para leer sin conexión
Tutorial on end-to-end text-to-speech
synthesis
Part 1 – Neural waveform modeling
1contact:	wangxin@nii.ac.jp
we	welcome	critical	comments,	suggestions,	and	discussion
Xin	WANG
National	Institute	of	Informatics,	Japan
2019-01-27
l 国立情報学研究所
l 山岸研究室
l 特任研究員
l PhD	(2015-2018)			総研大・国立情報学研究所
l M.A	(2012-2015)			中国科学技術大学
l http://tonywangx.github.io
SELF-INTRODUCTION
2
ワン シン
王 鑫
l Introduction:	text-to-speech	synthesis	
l Neural	waveform	models
l Summary	&	software
CONTENTS
3
Text-to-speech	synthesis	(TTS)	[1,2]
NII’s	TTS	systems
4
INTRODUCTION
[1] P. Taylor. Text-to-Speech Synthesis. Cambridge University Press, 2009.
[2] T. Dutoit. An Introduction to Text-to-speech Synthesis. Kluwer Academic Publishers, Norwell, MA, USA, 1997.
Text
Text-to-speech
(TTS)
Speech
waveform
Text Waveform Text Waveform
NII	
TTS	system
2013 2016 2018 Year
5
INTRODUCTION
Text-to-speech	synthesis	(TTS)
q Architectures
text text-to-speech speechText
Text	
analyzer
Acoustic		
model(s)
Linguistic
features
Waveform
Waveform	
model
Acoustic
features
次は新金岡、
新金岡です。
次 は 新金岡 …
名詞
ツギ
助詞
ワ
名詞
シンカナオカ
*
| *
||
Linguistic	features Acoustic	features
Spectral	features,F0,	
Band-aperiodicity,	 etc.
6
INTRODUCTION
Text-to-speech	synthesis	(TTS)
q Architectures
TTS	backend	model
text text-to-speech speechText
Text	
analyzer
Acoustic		
model(s)
Linguistic
features
Waveform
Waveform	
model
Acoustic
features
Text
Text	
analyzer
Linguistic
features
Waveform
End-to-end	TTS	modelText Waveform
7
INTRODUCTION
Text-to-speech	synthesis	(TTS)
q Architectures
TTS	backend	model				sss
text text-to-speech speechText
Text	
analyzer
Acoustic		
model(s)
Linguistic
features
Waveform
Waveform	
model
Text
Text	
analyzer
Linguistic
features
Waveform
End-to-end	TTS	modelText Waveform
Waveform	
module
Waveform	
module
Tutorial	part	1:	neural	waveform	modeling
Tutorial	part	2:	end-to-end	TTS
Acoustic
features
l Introduction:	text-to-speech	synthesis	
l Neural	waveform	modeling
• Overview
• Autoregressive	models
• Normalizing	flow
• STFT-based	training	criterion
l Summary
CONTENTS
8
9
OVERVIEW
Task	definition
Spectrogram
10
OVERVIEW
Task	definition
…
c1 c2 cNc3
Waveform
values
Spectrogram
1 2 3 4 T…
o1 o2 o3 o4 oT
T waveform	sampling	points
N frames
Neural	waveform	models
11
OVERVIEW
Task	definition
1 2 3 4 T…
o1 o2 o3 o4 oT
c1
…
c2 cNc3
p(o1:T |c1:N ; ⇥)
Waveform
values
12
OVERVIEW
1 2 3 4 T
…
c1
…
c2 cNc3
Naïve	model
q Simple	network	+	maximum-likelihood
• Feedforward	network
Hidden	
layers
Distribution
p(o4|c1:N ; ⇥)
…Up-sampling
Waveform
values
13
OVERVIEW
1 2 3 4 T…
c1
…
c2 cNc3
Naïve	model
q Simple	network	+	maximum-likelihood
• RNN
…Up-sampling
Hidden	
layers
Distribution
Waveform
values
14
OVERVIEW
1 2 3 4 T…
c1
…
c2 cNc3
Naïve	model
q Simple	network	+	maximum-likelihood
• Dilated	CNN	[1,2]
…
Hidden	
layers
Distribution
Waveform
values
Up-sampling
[1]	F.	Yu	and	V.	Koltun.	Multi-scale	context	aggregation	by	dilated	convolutions.	arXiv preprint	arXiv:1511.07122,	2015.
[2]	A.	Waibel,	T.	Hanazawa,	G.	Hinton,	K.	Shikano,	and	K.	Lang.	Phoneme	recognition	using	time-delay	neural	networks.	IEEE	 Transactions	
on	Acoustics,	Speech,	and	Signal	Processing,	37(3):328–339,	1989.
15
OVERVIEW
1 2 3 4 T…
…
Naïve	model
…
c1 c2 cNc3
o1 o2 o3 o4 oT
p(o1:T |c1:N ; ⇥) = p(o1|c1:N ; ⇥)· · ·p(oT |c1:N ; ⇥) =
TY
t=1
p(ot|c1:N ; ⇥)
o1:TWeak	temporal	model	<--------------->	strongly	correlated	data
mismatch
Hidden	
layers
Distribution
Waveform
values
Up-sampling
16
OVERVIEW
Towards	better	models
Naïve	
model
o1:T
c1:N
Criterion
Better
model
o1:T
c1:N
Criterion
Better	model Easier	target
Model
o1:T
c1:N
Criterion
Transform
Model
o1:T
c1:N
Criterion
Better	criterion
Weak	temporal	model	<--------------->	strongly	correlated	data
mismatch
Combination
Better
Model
o1:T
c1:N
Criterion
Knowledge-
based
Transform
17
OVERVIEW
WaveNet	
WaveRNN
SampleRNN
FFTNet
WaveGlow
FloWaveNet Neural	source-
filter	model
Parallel	WaveNet
ClariNet RNN	+	STFT	loss
Model
o1:T
c1:N
Criterion
Model
o1:T
c1:N
Criterion
Transform
Model
o1:T
c1:N
Criterion
Autoregressive	models
Normlizing flow	/	feature	transformation Training	criterion	in	frequency	domain
Universal	neural	vocoder
Model
o1:T
c1:N
Criterion
Knowledge-based
Transform
Speech/signal	processing
☛ Please check reference in the appendix
LPCNet
LP-WaveNet
ExcitNet
GlotNet
Subband WaveNet
l Introduction:	text-to-speech	synthesis	
l Neural	waveform	models
• Overview
• Autoregressive	models
• Normalizing	flow
• STFT-based	training	criterion
l Summary	&	software
CONTENTS
18
19
PART I:	AUTOREGRESSIVE MODELS
Core	idea
q Naïve	model
…
…
c1 c2 cNc3
p(o1:T |c1:N ; ⇥) =
TY
t=1
p(ot|c1:N ; ⇥)
1 2 3 4 T…
o1 o2 o3 o4 oT
20
PART I:	AUTOREGRESSIVE MODELS
Core	idea
q Autoregressive	(AR)	model
1 2 3 4 T…
…
…
c1 c2 cNc3
o1 o2 o3 o4 oT
1 2 3
p(o1:T |c1:N ; ⇥) =
TY
t=1
p(ot|o1:t 1, c1:N ; ⇥)
21
PART I:	AUTOREGRESSIVE MODELS
Core	idea
q Autoregressive	(AR)	model
• Training:	use	natural	waveform	for	feedback	(teacher	forcing	[1])
1 2 3 4 T…
…
…
c1 c2 cNc3
o1 o2 o3 o4 oT
1 2 3
[1]	R.	J.	Williams	and	D.	Zipser.	A	learning	algorithm	for	continually	running	fully	recurrent	neural	networks.	Neural	computation,	1(2):270–280,	1989.
22
PART I:	AUTOREGRESSIVE MODELS
Core	idea
q Autoregressive	(AR)	model
• Training:	use	natural	waveform	for	feedback	(teacher	forcing	[1])
1 2 3 4 T…
…
…
c1 c2 cNc3
o1 o2 o3 o4 oT
1 2 3
[1]	R.	J.	Williams	and	D.	Zipser.	A	learning	algorithm	for	continually	running	fully	recurrent	neural	networks.	Neural	computation,	1(2):270–280,	1989.
Natural
waveform
23
PART I:	AUTOREGRESSIVE MODELS
Core	idea
q Autoregressive	(AR)	model
• Training:	use	natural	waveform	for	feedback	(teacher	forcing	[1])
• Generation:		
…
…
c1 c2 cNc3
1 2
1
bo1
2
bo2
3
bo3
4
3
bo4
T
…
boT
p(bo1:T |c1:N ; ⇥) =
QT
t=1 p(bot|bot P :t 1, c1:N ; ⇥)
[1]	R.	J.	Williams	and	D.	Zipser.	A	learning	algorithm	for	continually	running	fully	recurrent	neural	networks.	Neural	computation,	1(2):270–280,	1989.
Generated
waveform
LPCNet
LP-WaveNet
ExcitNet
GlotNet
Subband WaveNet
SampleRNN
24
WaveRNN FFTNet
WaveGlow
FloWaveNet Neural	source-
filter	model
Parallel	WaveNet
ClariNet RNN	+	STFT	loss
Model
o1:T
c1:N
Criterion
Model
o1:T
c1:N
Criterion
Transform
Model
o1:T
c1:N
Criterion
Normlizing flow	/	feature	transformation Training	criterion	in	frequency	domain
Universal	neural	vocoder
Model
o1:T
c1:N
Criterion
Knowledge-based
Transform
Speech/signal	processing
PART I:	AUTOREGRESSIVE MODELS
WaveNet	 Autoregressive	models
WaveNet
q Network	structure
• Details:	http://id.nii.ac.jp/1001/00185683/
http://tonywangx.github.io/pdfs/wavenet.pdf
25
PART I:	AUTOREGRESSIVE MODELS
1 2 3 4 T
1 2 3
c1 c2 cNc3
A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu.
WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499, 2016.
SampleRNN
26
WaveRNN FFTNet
LPCNet
LP-WaveNet
ExcitNet
GlotNet
WaveGlow
FloWaveNet Neural	source-
filter	model
Parallel	WaveNet
ClariNet RNN	+	STFT	loss
Model
o1:T
c1:N
Criterion
Model
o1:T
c1:N
Criterion
Transform
Model
o1:T
c1:N
Criterion
Normlizing flow	/	feature	transformation Training	criterion	in	frequency	domain
Subband WaveNet Universal	neural	vocoder
Model
o1:T
c1:N
Criterion
Knowledge-based
Transform
PART I:	AUTOREGRESSIVE MODELS
WaveNet	 Autoregressive	models
☛ Please check SampleRNN in the appendix
q Issues	with	WaveNet	&	Sample	RNN
! Generation	is	very	very … very	slow
q Acceleration?
• Engineering-based:	parallelize/simplify	computation
• Knowledge-based:		speech	/	signal	processing	theory
Speech/signal	processing
27
PART I:	AUTOREGRESSIVE MODELS
WaveRNN
q Linear	time	AR	dependency	in	WaveNet
q Subscale	AR	dependency	+	batch	sampling
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
Waveform
values
16	steps	to	generate
28
PART I:	AUTOREGRESSIVE MODELS
WaveRNN
q Linear	time	AR	dependency	in	WaveNet
q Subscale	AR	dependency	+	batch	sampling
• Generate	multiple	samples	at	the	same	time
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
16	steps	to	generate
10	steps	to	generate
step1
step2
step3
…
29
PART I:	AUTOREGRESSIVE MODELS
WaveRNN
q Linear	time	AR	dependency	in	WaveNet
q Subscale	AR	dependency	+	batch	sampling
• Generate	multiple	samples	at	the	same	time
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
16	steps	to	generate
10	steps	to	generate
step1
step2
step3
…
SampleRNN
30
WaveRNN FFTNet
WaveGlow
FloWaveNet Neural	source-
filter	model
Parallel	WaveNet
ClariNet RNN	+	STFT	loss
Model
o1:T
c1:N
Criterion
Model
o1:T
c1:N
Criterion
Transform
Model
o1:T
c1:N
Criterion
Normlizing flow	/	feature	transformation Training	criterion	in	frequency	domain
Universal	neural	vocoder
Model
o1:T
c1:N
Criterion
Knowledge-based
Transform
PART I:	AUTOREGRESSIVE MODELS
WaveNet	 Autoregressive	models
q AR	models
! Generation	is	still	slow
p(o1:T |c1:N ; ⇥) =
TY
t=1
p(ot|o1:t 1, c1:N ; ⇥)
Speech/signal	processing
LPCNet
LP-WaveNet
ExcitNet
GlotNet
Subband WaveNet
SampleRNN
31
WaveRNN FFTNet
Model
o1:T
c1:N
Criterion
Model
o1:T
c1:N
Criterion
Transform
Normlizing flow	/	feature	transformation
Universal	neural	vocoder
Model
o1:T
c1:N
Criterion
Knowledge-based
Transform
PART I:	AUTOREGRESSIVE MODELS
WaveNet	 Autoregressive	models
Non-autoregressive	model?	
Speech/signal	processing
LPCNet
LP-WaveNet
ExcitNet
GlotNet
Subband WaveNet
l Introduction:	text-to-speech	synthesis	
l Neural	waveform	models
• Overview
• Autoregressive	models
• Normalizing	flow
• STFT-based	training	criterion
l Summary	&	software
CONTENTS
32
33
PART II:	NORMALIZING FLOW-BASED MODELS
General	idea
• o	with	strong	temporal	correlation	--->	z	with	weak	temporal	correlation
• Principle	of	changing	random	variable
Model	s
Criterion
⇥
o
f 1
(o)
c
Model	s⇥
c
bo
po(o|c; ⇥, ) = pz(z = f 1
(o)|c; ⇥) det
@f 1
(o)
@o
bz
f (bz)
z
training generation
☛ https://en.wikipedia.org/wiki/Probability_density_function#Dependent_variables_and_change_of_variables
34
PART II:	NORMALIZING FLOW-BASED MODELS
General	idea
• o --->	Gaussian	noise	sequence	z
• Multiple	transformations:normalizing	flow
o
c
bo
…
f 1
M
(·)
N(o; I)
Criterion
c
…
N(o; I)
bz
z
po(o|c; 1, · · · , M ) = pz(z = f 1
1
· · · f 1
M 1
f 1
M
(o)|c)
MY
m=1
det J(f 1
m
)
f 1
1
(·)
f 1
M 1
(·)
f M
(·)
f M 1
(·)
f 1
(·)
training generation
D. Rezende and S. Mohamed. Variational inference with normalizing flows. In Proc. ICML, pages 1530–1538, 2015.
SampleRNN
35
WaveRNN FFTNetLPCNet
LP-WaveNet
ExcitNet
GlotNet
Model
o1:T
c1:N
Criterion
Subband WaveNet
Universal	neural	vocoder
Model
o1:T
c1:N
Criterion
Knowledge-based
Transform
Speech	production	 model
WaveNet	 Autoregressive	models
Model
o1:T
c1:N
Criterion
Transform
PART II:	NORMALIZING FLOW-BASED MODELS
po(o|c; 1, · · · , M ) = pz(z = f 1
1
· · · f 1
M 1
f 1
M
(o)|c)
MY
m=1
det J(f 1
m
)
WaveGlow
FloWaveNet
Parallel	WaveNet
ClariNet f m
(·) in	linear	time	domain
f m
(·) in	subscale	time	domain
f m
(·)• Different	time	complexity														,	different	training	strategies
A.	v.	d.	Oord,	Y.	Li,	I.	Babuschkin,	K.	Simonyan,	O.	Vinyals,	K.	Kavukcuoglu,	G.	v.	d.	Driessche,	E.	Lockhart,	L.	C.	Cobo,	F.	Stimberg,	et	al.	Parallel	WaveNet:	
Fast	high-fidelity	speech	synthesis.	arXiv preprint	arXiv:1711.10433,	2017.
W.	Ping,	K.	Peng,	and	J.	Chen.	Clarinet:	Parallel	 wave	generation	in	end-to-end	text-to-speech.	arXiv preprint	arXiv:1807.07281,	2018.
S.	Kim,	S.-g.	Lee,	J.	Song,	and	S.	Yoon.	Flowavenet:	A	generative	 flow	for	raw	audio.	arXiv preprint	arXiv:1811.02155,	2018.
R.	Prenger,	R.	Valle,	 and	B.	Catanzaro.	Waveglow:	A	flow-based	generative	network	for	speech	synthesis.	arXiv preprint	arXiv:1811.00002,	2018.
36
PART II:	NORMALIZING FLOW-BASED MODELS
ClariNet &	parallel	WaveNet
q Generation	process
…
N(o; I)
c1:N
bo1:T
f M
(·)
f M 1
(·)
f 1
(·) CNN 1
(·)
1 2 3 4 T
1 2 3 4 T
1:T
µ1:T
bz
(0)
1:T
1 2 3 4 T
ü Parallel	computation
ü Fast	generation
*	Initial	condition	may	be	implemented	differently
1 2 3 4 T
bz
(0)
1:T
bz
(1)
1:T
bz
(1)
t = bz
(0)
t · t + µt
bz
(1)
1:T
37
PART II:	NORMALIZING FLOW-BASED MODELS
ClariNet &	parallel	WaveNet
q Naïve	training	process
…
N(o; I)
c1:N
1 2 3 4 T
1
1
2
2
3
3
4
4
T
T
CNN 1
(·)
1 2 3 4
1:T
µ1:T
T
f 1
M
(·)
f 1
1
(·)
f 1
M 1
(·)
o1:T
z
(1)
1:T
z
(0)
1:T
! Training	time	~	O(T)
z
(0)
t = (z
(1)
t µt)/ t
z
(1)
1:T
z
(0)
1:T
38
PART II:	NORMALIZING FLOW-BASED MODELS
o
c
…
f 1
M
(·)
N(o; I)
Criterion
z
f 1
1
(·)
f 1
M 1
(·)
Training Generation
(Inverse-AR	[1])
Normalizing	
flow
bo
c
…
N (o; I)
bz
f M
(·)
f M 1
(·)
f 1
(·)
Original
WaveNet
1 2 3 4
1 2 3
1 2 3 4
1 2 3
! ~	O(T)
[1] D. P. Kingma, T. Salimans, R. Jozefowicz, X. Chen, I. Sutskever, and M. Welling. Improved variational inference with inverse autoregressive flow.
In Proc. NIPS, pages 4743–4751, 2016.
✓ ~	O(1)
✓ ~	O(1) ! ~	O(T)
39
PART II:	NORMALIZING FLOW-BASED MODELS
ClariNet &	parallel	WaveNet
q Fast	training	:	knowledge	distilling
• Teacher	gives
• Student	gives	
…
N(o; I)
f M
(·)
f M 1
(·)
f 1
(·)
c1:N
bo1:T
1 2 3
p(bot|z1:T , c1:T , student)
p(bot|bo1:t 1, c1:T , teacher)
Student	learns	from	teacher
Teacher	
Pre-trained	WaveNet
Student
SampleRNN
40
WaveRNN FFTNet
Model
o1:T
c1:N
Criterion
Model
o1:T
c1:N
Criterion
Transform
Normlizing flow	/	feature	transformation
Universal	neural	vocoder
Model
o1:T
c1:N
Criterion
Knowledge-based
Transform
PART II:	NORMALIZING FLOW-BASED MODELS
WaveNet	 Autoregressive	models
Parallel	WaveNet
ClariNet
q Parallel	WaveNet	&	ClariNet
ü No	AR	dependency	in	generation
! Need	pre-trained	WaveNet
Without	teacher	WaveNet	?
Speech/signal	processing
LPCNet
LP-WaveNet
ExcitNet
GlotNet
Subband WaveNet
WaveGlow
q Why														~	O(T)	? Dependency	in	linear	time	domain
q Reduce	T?	Dependency	in	subscale	time	domain
41
PART II:	NORMALIZING FLOW-BASED MODELS
f 1
1
(·)
Generation Training
1
2
3
4
5 9 13
1
2
3
4
5 9 13
1
2
3
4
5 9 13
1
2
3
4
5 9 13
1 2 3 4 T
1 2 3 4
Generation
1 2 3 4 T
1 2 3 4
Training
R.	Prenger,	R.	Valle,	 and	B.	Catanzaro.	Waveglow:	A	flow-based	generative	network	for	speech	synthesis.	arXiv preprint	arXiv:1811.00002,	2018.
WaveGlow
q Why														~	O(T)	? Dependency	in	linear	time	domain
q Reduce	T?	Dependency	in	subscale	time	domain
42
PART II:	NORMALIZING FLOW-BASED MODELS
f 1
1
(·)
Generation Training
1
2
3
4
5 9 13
1
2
3
4
5 9 13
1
2
3
4
5 9 13
1
2
3
4
5 9 13
1 2 3 4 T
1 2 3 4
Generation
1 2 3 4 T
1 2 3 4
Training
R.	Prenger,	R.	Valle,	 and	B.	Catanzaro.	Waveglow:	A	flow-based	generative	network	for	speech	synthesis.	arXiv preprint	arXiv:1811.00002,	2018.
1
2
3
4
5 9 13
1 2 3 4
SampleRNN
43
WaveRNN FFTNet
Model
o1:T
c1:N
Criterion
Model
o1:T
c1:N
Criterion
Transform
Normlizing flow	/	feature	transformation
Universal	neural	vocoder
Model
o1:T
c1:N
Criterion
Knowledge-based
Transform
PART II:	NORMALIZING FLOW-BASED MODELS
WaveNet	 Autoregressive	models
Parallel	WaveNet
ClariNet
WaveGlow
FloWaveNet
?Simple	model	without	normalizing	 flow
Neural	source-
filter	model
Speech/signal	processing
LPCNet
LP-WaveNet
ExcitNet
GlotNet
Subband WaveNet
l Introduction:	text-to-speech	synthesis	
l Neural	waveform	models
• Overview
• Autoregressive	models
• Normalizing	flow
• STFT-based	training	criterion
l Summary	&	software
CONTENTS
44
45
PART III:	STFT-BASED TRAINING CRITERION
1 2 3 4 T…
c1
…
c2 cNc3
Neural	source-filter	model
q Naïve	model	and	STFT-based	criterion
…
Generated
waveforms
X. Wang, S. Takaki, and J. Yamagishi. Neural source-filter-basedwaveform model for statistical parametric speech synthesis.
arXiv preprint arXiv:1810.11946, 2018.
46
PART III:	STFT-BASED TRAINING CRITERION
1 2 3 4 T…
c1
…
c2 cNc3
Neural	source-filter	model
q Naïve	model	and	STFT-based	criterion
…
Generated
waveforms
Natural	
waveforms
STFT
STFT
Spectral	amplitude	error
1 2 3 4 T
X. Wang, S. Takaki, and J. Yamagishi. Neural source-filter-basedwaveform model for statistical parametric speech synthesis.
arXiv preprint arXiv:1810.11946, 2018.
47
PART III:	STFT-BASED TRAINING CRITERION
1 2 3 4 T…
c1
…
c2 cNc3
Neural	source-filter	model
q Naïve	model	and	STFT-based	criterion
…
Generated
waveforms
1 2 3 4 T
Natural	
waveforms
STFT
STFT
Spectral	amplitude	error
1 2 3 4 T
Sine	&	
harmonics F0
48
PART III:	STFT-BASED TRAINING CRITERION
Neural	source-filter	model
q Examples
Only	harmonics
No	formants
49
PART III:	STFT-BASED TRAINING CRITERION
Neural	source-filter	model
q Examples
50
PART III:	STFT-BASED TRAINING CRITERION
Neural	source-filter	model
q Examples
51
PART III:	STFT-BASED TRAINING CRITERION
Neural	source-filter	model
q Examples
52
PART III:	STFT-BASED TRAINING CRITERION
Neural	source-filter	model
q Examples
53
PART III:	STFT-BASED TRAINING CRITERION
Neural	source-filter	model
q Examples
54
PART III:	STFT-BASED TRAINING CRITERION
Neural	source-filter	model
q Results
• Faster	than	WaveNet (at	least	100 times)
• Smaller	than	WaveNet
• Speech	quality	is	similar	to	WaveNet
Natural WOR WAD WAC NSF
1
2
3
4
5
Quality(MOS)
WORLD
vocoder
WaveNet	
u-law
WaveNet	
Gaussian
Neural	
source-filter
Copy-synthesis
Text-to-speech
https://nii-yamagishilab.github.io/samples-nsf/index.html
l Introduction:	text-to-speech	synthesis	
l Neural	waveform	models
l Summary	&	software
CONTENTS
55
56
WaveNet	
WaveRNN
SampleRNN
FFTNet
WaveGlow
FloWaveNet Neural	source-
filter	model
Parallel	WaveNet
ClariNet RNN	+	STFT	loss
Model
o1:T
c1:N
Criterion
Model
o1:T
c1:N
Criterion
Transform
Model
o1:T
c1:N
Criterion
Universal	neural	vocoder
Model
o1:T
c1:N
Criterion
Knowledge-based
Transform
SUMMARY
LPCNet
LP-WaveNet
ExcitNet
GlotNet
Subband WaveNet
57
WaveNet	
WaveRNN
SampleRNN
FFTNet
WaveGlow
FloWaveNet Neural	source-
filter	model
Parallel	WaveNet
ClariNet RNN	+	STFT	loss
Model
o1:T
c1:N
Criterion
Model
o1:T
c1:N
Criterion
Transform
Model
o1:T
c1:N
Criterion
Universal	neural	vocoder
Model
o1:T
c1:N
Criterion
Knowledge-based
Transform
SOFTWARE
NII	neural	network	toolkit
LPCNet
LP-WaveNet
ExcitNet
GlotNet
Subband WaveNet
NII	neural	network	toolkit
q Neural	waveform	models
1. Toolkit	cores:	https://github.com/nii-yamagishilab/project-CURRENNT-public.git
2. Toolkit	scripts:	https://github.com/nii-yamagishilab/project-CURRENNT-scripts
58
SOFTWARE
NII	neural	network	toolkit
q Useful	slides
• http://tonywangx.github.io/pdfs/CURRENNT_TUTORIAL.pdf
• Other	related	slides	http://tonywangx.github.io/slides.html
59
SOFTWARE
60
REFERENCE
WaveNet: A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K.
Kavukcuoglu. WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499, 2016.
SampleRNN: S. Mehri, K. Kumar, I. Gulrajani, R. Kumar, S. Jain, J. Sotelo, A. Courville, and Y. Bengio. Samplernn: An
unconditional end-to-end neural audio generation model. arXiv preprint arXiv:1612.07837, 2016.
WaveRNN: N. Kalchbrenner, E. Elsen, K. Simonyan, et.al. Efficient neural audio synthesis. In J. Dy and A. Krause, editors,
Proc. ICML, volume 80 of Proceedings of Machine Learning Research, pages 2410–2419, 10–15 Jul 2018.
FFTNet: Z. Jin, A. Finkelstein, G. J. Mysore, and J. Lu. FFTNet: A real-time speaker-dependent neural vocoder. In Proc.
ICASSP, pages 2251–2255.IEEE, 2018.
Universal vocoder: J. Lorenzo-Trueba, T. Drugman, J. Latorre, T. Merritt, B. Putrycz, and R. Barra-Chicote. Robust universal neural
vocoding. arXiv preprint arXiv:1811.06292, 2018.
Subband WaveNet: T. Okamoto, K. Tachibana, T. Toda, Y. Shiga, and H. Kawai. An investigation of subband wavenet vocoder
covering entire audible frequency range with limited acoustic features. In Proc. ICASSP, pages 5654–5658. 2018.
Parallel WaveNet: A. van den Oord, Y. Li, I. Babuschkin, et. al.. Parallel WaveNet: Fast high-fidelity speech synthesis. In Proc. ICML,
pages 3918–3926, 2018.
ClariNet: W. Ping, K. Peng, and J. Chen. Clarinet: Parallel wave generation in end-to-end text-to-speech. arXiv preprint
arXiv:1807.07281, 2018.
FlowWaveNet: S. Kim, S.-g. Lee, J. Song, and S. Yoon. Flowavenet: A generative flow for raw audio. arXiv preprint
arXiv:1811.02155, 2018.
WaveGlow: R. Prenger, R. Valle, and B. Catanzaro. Waveglow: A flow-based generative network for speech synthesis. arXiv
preprint arXiv:1811.00002,2018.
RNN+STFT: S. Takaki, T. Nakashika, X. Wang, and J. Yamagishi. STFT spectral loss for training a neural speech waveform
model. In Proc. ICASSP (submitted), 2018.
NSF: X. Wang, S. Takaki, and J. Yamagishi. Neural source-filter-based waveform model for statistical para- metric
speech synthesis. arXiv preprint arXiv:1810.11946, 2018.
LP-WavNet: M.-J. Hwang, F. Soong, F. Xie, X. Wang, and H.-G. Kang. Lp-wavenet: Linear prediction-based wavenet speech
synthesis. arXiv preprint arXiv:1811.11913, 2018.
GlotNet: L. Juvela, V. Tsiaras, B. Bollepalli, M. Airaksinen, J. Yamagishi, and P. Alku. Speaker-independent raw waveform
model for glottal excitation. arXiv preprint arXiv:1804.09593, 2018.
ExcitNet: E. Song, K. Byun, and H.-G. Kang. Excitnet vocoder: A neural excitation model for parametric speech synthesis
systems. arXiv preprint arXiv:1811.04769, 2018.
LPCNet: J.-M. Valin and J. Skoglund. Lpcnet: Improving neural speech synthesis through linear prediction. arXiv preprint
arXiv:1810.11846, 2018.
End	of	Part	1
61
Codes,	scripts,	slides
http://nii-yamagishilab.github.io
This	work	was	partially	supported	 by	JST	CREST	Grant	Number	
JPMJCR18A6,	Japan	and	by	MEXT	KAKENHI	Grant	Numbers	(16H06302,	
16K16096,	17H04687,	18H04120,	18H04112,	18KT0051),	Japan.
Text-to-speech	(TTS)
q Speech	samples	from	NII’s	TTS	system
62
INTRODUCTION
Text Waveform Text Waveform
NII	
TTS	system
2013 2016 2018 Year
63
PART I:	AUTOREGRESSIVE MODELS
SampleRNN
q Idea
• Hierarchical	/	multi-resolution	dependency
S. Mehri, K. Kumar, I. Gulrajani, R. Kumar, S. Jain, J. Sotelo, A. Courville, and Y. Bengio. Samplernn: An unconditional end-to-end
neural audio generation model. arXiv preprint arXiv:1612.07837, 2016.
64
PART I:	AUTOREGRESSIVE MODELS
SampleRNN
q Example	network	structure
• R:	time	resolution	increased	by	*	2
Tie	1	
Tie	2	
Tie	3
65
PART I:	AUTOREGRESSIVE MODELS
SampleRNN
q Example	network	structure
• Training
• R:	time	resolution	increased	by	*	2
1 2 3 4 5 6 7 8 9 T
Tie	1:
R	=	1
Tie	2:
R	=	2
Tie	3:
R	=	4
1 2 3 4 5 6 7 8 9
1 2 3 4 5 6 7 8
1 2 3 4 5 6 7 8 9
66
PART I:	AUTOREGRESSIVE MODELS
SampleRNN
q Example	network	structure
• Generation	process
1 2 3 4 5 6 7 8
1 2 3 4 5 6 7 8 9
Tie	1:
R	=	1
Tie	2:
R	=	2
Tie	3:
R	=	4
1 2
1
3
2
4
3
5
4
6
5
7
6
8
7
9
8 9
T
67
PART I:	AUTOREGRESSIVE MODELS
WaveNet
q Variants
• u-Law	discrete	waveform	--->	continuous-valued	waveform
§ Mixture	of	logistic	distribution	[1]
§ GMM	/	Single-Gaussian	[2]
• Quantization	noise	shaping	[3],	related	noise	shaping	method	[4]
[1]	T.	Salimans,	A.	Karpathy,	X.	Chen,	and	D.	P.	Kingma.	Pixelcnn++:	Improving	the	pixelcnn with	discretized	logistic	mixture	likelihood	and	other	modifications.	arXiv preprint	arXiv:1701.05517,	2017.
[2]	W.	Ping,	K.	Peng,	and	J.	Chen.	Clarinet:	Parallel	wave	generation	in	end-to-end	text-to-speech.	arXiv preprint	arXiv:1807.07281,	2018.
[3]	T.	Yoshimura,	K.	Hashimoto,	K.	Oura,	Y.	Nankaku,	and	K.	Tokuda.	Mel-cepstrum-based	quantization	noise	shaping	applied	to	neural-network-based	speech	waveform	synthesis.	IEEE/ACM	
Transactions	on	Audio,	Speech,	and	Language	Processing,	26(7):1173–1180,	2018.
[4]	K.	Tachibana,	T.	Toda,	Y.	Shiga,	and	H.	Kawai.	An	investigation	of	noise	shaping	with	perceptual	weighting	for	WaveNet-based	speech	generation.	In	Proc.	ICASSP,	pages	5664–5668.	IEEE,	2018.
1 2
1 2 3 4
3
softmax
1 2
1 2 3 4
3
GMM/Gaussian
68
PART I:	AUTOREGRESSIVE MODELS
WaveRNN
q WaveNet	is	inefficient
• Computation	cost
1. Impractical	for	16	bit	PCM	(softmax of	size	65536)
2. Very	deep	network	(50	dilated	CNN	…)
3. …
• Time	latency
1. Generation	time	~	O(waveform_length)
1 2 3 4 T
1 2 3
69
PART I:	AUTOREGRESSIVE MODELS
WaveRNN
q WaveRNN strategies
• Computation	cost
1. Impractical	for	16	bit	PCM	(softmax of	size	65536)
2. Very	deep	network	(50	dilated	CNN	…)
3. ...
• Time	latency
1. Generation	time	~	O(waveform_length)
1 2 3 4 T
1 2 3
Recurrent	+	feedfoward layers
N. Kalchbrenner, E. Elsen, K. Simonyan, S. Noury, N. Casagrande, E. Lockhart, F. Stimberg, A. van den Oord, S. Dieleman, and K.
Kavukcuoglu. Efficient neural audio synthesis. In J. Dy and A. Krause, editors, Proc. ICML, volume 80 of Proceedings of Machine Learning
Research, pages 2410–2419, Stock- holmsm ̈assan, Stockholm Sweden, 10–15 Jul 2018. PMLR.
Two-level	softmax
RNN	+	Feedforward
Subscale	dependency	+	batch

Más contenido relacionado

La actualidad más candente

WaveNetが音声合成研究に与える影響
WaveNetが音声合成研究に与える影響WaveNetが音声合成研究に与える影響
WaveNetが音声合成研究に与える影響NU_I_TODALAB
 
Introduction to text to speech
Introduction to text to speechIntroduction to text to speech
Introduction to text to speechBilgin Aksoy
 
Transformer in Computer Vision
Transformer in Computer VisionTransformer in Computer Vision
Transformer in Computer VisionDongmin Choi
 
Introduction to deep learning
Introduction to deep learningIntroduction to deep learning
Introduction to deep learningAmr Rashed
 
PR-409: Denoising Diffusion Probabilistic Models
PR-409: Denoising Diffusion Probabilistic ModelsPR-409: Denoising Diffusion Probabilistic Models
PR-409: Denoising Diffusion Probabilistic ModelsHyeongmin Lee
 
Generative adversarial network and its applications to speech signal and natu...
Generative adversarial network and its applications to speech signal and natu...Generative adversarial network and its applications to speech signal and natu...
Generative adversarial network and its applications to speech signal and natu...宏毅 李
 
(第3版)「知能の構成的解明の研究動向と今後の展望」についての個人的見解:Chain of thought promptingやpostdictionを中...
(第3版)「知能の構成的解明の研究動向と今後の展望」についての個人的見解:Chain of thought promptingやpostdictionを中...(第3版)「知能の構成的解明の研究動向と今後の展望」についての個人的見解:Chain of thought promptingやpostdictionを中...
(第3版)「知能の構成的解明の研究動向と今後の展望」についての個人的見解:Chain of thought promptingやpostdictionを中...KIT Cognitive Interaction Design
 
Rethinking Attention with Performers
Rethinking Attention with PerformersRethinking Attention with Performers
Rethinking Attention with PerformersJoonhyung Lee
 
Recurrent neural networks rnn
Recurrent neural networks   rnnRecurrent neural networks   rnn
Recurrent neural networks rnnKuppusamy P
 
Transformer Seq2Sqe Models: Concepts, Trends & Limitations (DLI)
Transformer Seq2Sqe Models: Concepts, Trends & Limitations (DLI)Transformer Seq2Sqe Models: Concepts, Trends & Limitations (DLI)
Transformer Seq2Sqe Models: Concepts, Trends & Limitations (DLI)Deep Learning Italia
 
MS COCO Dataset Introduction
MS COCO Dataset IntroductionMS COCO Dataset Introduction
MS COCO Dataset IntroductionShinagawa Seitaro
 
CS571: Sentiment Analysis
CS571: Sentiment AnalysisCS571: Sentiment Analysis
CS571: Sentiment AnalysisJinho Choi
 
【論文読み会】Deep Clustering for Unsupervised Learning of Visual Features
【論文読み会】Deep Clustering for Unsupervised Learning of Visual Features【論文読み会】Deep Clustering for Unsupervised Learning of Visual Features
【論文読み会】Deep Clustering for Unsupervised Learning of Visual FeaturesARISE analytics
 
Understanding RNN and LSTM
Understanding RNN and LSTMUnderstanding RNN and LSTM
Understanding RNN and LSTM健程 杨
 
transformer解説~Chat-GPTの源流~
transformer解説~Chat-GPTの源流~transformer解説~Chat-GPTの源流~
transformer解説~Chat-GPTの源流~MasayoshiTsutsui
 
Generative models (Geek hub 2021 lecture)
Generative models (Geek hub 2021 lecture)Generative models (Geek hub 2021 lecture)
Generative models (Geek hub 2021 lecture)Vitaly Bondar
 
Wandb Monthly Meetup August 2023.pdf
Wandb Monthly Meetup August 2023.pdfWandb Monthly Meetup August 2023.pdf
Wandb Monthly Meetup August 2023.pdfYuya Yamamoto
 
[DL輪読会]Parallel WaveNet: Fast High-Fidelity Speech Synthesis
[DL輪読会]Parallel WaveNet: Fast High-Fidelity Speech Synthesis[DL輪読会]Parallel WaveNet: Fast High-Fidelity Speech Synthesis
[DL輪読会]Parallel WaveNet: Fast High-Fidelity Speech SynthesisDeep Learning JP
 
Recurrent Neural Networks, LSTM and GRU
Recurrent Neural Networks, LSTM and GRURecurrent Neural Networks, LSTM and GRU
Recurrent Neural Networks, LSTM and GRUananth
 
Beyond the Symbols: A 30-minute Overview of NLP
Beyond the Symbols: A 30-minute Overview of NLPBeyond the Symbols: A 30-minute Overview of NLP
Beyond the Symbols: A 30-minute Overview of NLPMENGSAYLOEM1
 

La actualidad más candente (20)

WaveNetが音声合成研究に与える影響
WaveNetが音声合成研究に与える影響WaveNetが音声合成研究に与える影響
WaveNetが音声合成研究に与える影響
 
Introduction to text to speech
Introduction to text to speechIntroduction to text to speech
Introduction to text to speech
 
Transformer in Computer Vision
Transformer in Computer VisionTransformer in Computer Vision
Transformer in Computer Vision
 
Introduction to deep learning
Introduction to deep learningIntroduction to deep learning
Introduction to deep learning
 
PR-409: Denoising Diffusion Probabilistic Models
PR-409: Denoising Diffusion Probabilistic ModelsPR-409: Denoising Diffusion Probabilistic Models
PR-409: Denoising Diffusion Probabilistic Models
 
Generative adversarial network and its applications to speech signal and natu...
Generative adversarial network and its applications to speech signal and natu...Generative adversarial network and its applications to speech signal and natu...
Generative adversarial network and its applications to speech signal and natu...
 
(第3版)「知能の構成的解明の研究動向と今後の展望」についての個人的見解:Chain of thought promptingやpostdictionを中...
(第3版)「知能の構成的解明の研究動向と今後の展望」についての個人的見解:Chain of thought promptingやpostdictionを中...(第3版)「知能の構成的解明の研究動向と今後の展望」についての個人的見解:Chain of thought promptingやpostdictionを中...
(第3版)「知能の構成的解明の研究動向と今後の展望」についての個人的見解:Chain of thought promptingやpostdictionを中...
 
Rethinking Attention with Performers
Rethinking Attention with PerformersRethinking Attention with Performers
Rethinking Attention with Performers
 
Recurrent neural networks rnn
Recurrent neural networks   rnnRecurrent neural networks   rnn
Recurrent neural networks rnn
 
Transformer Seq2Sqe Models: Concepts, Trends & Limitations (DLI)
Transformer Seq2Sqe Models: Concepts, Trends & Limitations (DLI)Transformer Seq2Sqe Models: Concepts, Trends & Limitations (DLI)
Transformer Seq2Sqe Models: Concepts, Trends & Limitations (DLI)
 
MS COCO Dataset Introduction
MS COCO Dataset IntroductionMS COCO Dataset Introduction
MS COCO Dataset Introduction
 
CS571: Sentiment Analysis
CS571: Sentiment AnalysisCS571: Sentiment Analysis
CS571: Sentiment Analysis
 
【論文読み会】Deep Clustering for Unsupervised Learning of Visual Features
【論文読み会】Deep Clustering for Unsupervised Learning of Visual Features【論文読み会】Deep Clustering for Unsupervised Learning of Visual Features
【論文読み会】Deep Clustering for Unsupervised Learning of Visual Features
 
Understanding RNN and LSTM
Understanding RNN and LSTMUnderstanding RNN and LSTM
Understanding RNN and LSTM
 
transformer解説~Chat-GPTの源流~
transformer解説~Chat-GPTの源流~transformer解説~Chat-GPTの源流~
transformer解説~Chat-GPTの源流~
 
Generative models (Geek hub 2021 lecture)
Generative models (Geek hub 2021 lecture)Generative models (Geek hub 2021 lecture)
Generative models (Geek hub 2021 lecture)
 
Wandb Monthly Meetup August 2023.pdf
Wandb Monthly Meetup August 2023.pdfWandb Monthly Meetup August 2023.pdf
Wandb Monthly Meetup August 2023.pdf
 
[DL輪読会]Parallel WaveNet: Fast High-Fidelity Speech Synthesis
[DL輪読会]Parallel WaveNet: Fast High-Fidelity Speech Synthesis[DL輪読会]Parallel WaveNet: Fast High-Fidelity Speech Synthesis
[DL輪読会]Parallel WaveNet: Fast High-Fidelity Speech Synthesis
 
Recurrent Neural Networks, LSTM and GRU
Recurrent Neural Networks, LSTM and GRURecurrent Neural Networks, LSTM and GRU
Recurrent Neural Networks, LSTM and GRU
 
Beyond the Symbols: A 30-minute Overview of NLP
Beyond the Symbols: A 30-minute Overview of NLPBeyond the Symbols: A 30-minute Overview of NLP
Beyond the Symbols: A 30-minute Overview of NLP
 

Similar a Tutorial on end-to-end text-to-speech synthesis: Part 1 – Neural waveform modeling

エンドツーエンド音声合成に向けたNIIにおけるソフトウェア群 ~ TacotronとWaveNetのチュートリアル (Part 1)~
エンドツーエンド音声合成に向けたNIIにおけるソフトウェア群 ~ TacotronとWaveNetのチュートリアル (Part 1)~エンドツーエンド音声合成に向けたNIIにおけるソフトウェア群 ~ TacotronとWaveNetのチュートリアル (Part 1)~
エンドツーエンド音声合成に向けたNIIにおけるソフトウェア群 ~ TacotronとWaveNetのチュートリアル (Part 1)~Yamagishi Laboratory, National Institute of Informatics, Japan
 
Lecture14 xing fei-fei
Lecture14 xing fei-feiLecture14 xing fei-fei
Lecture14 xing fei-feiTianlu Wang
 
University timetable scheduling
University timetable schedulingUniversity timetable scheduling
University timetable schedulingAshish Mishra
 
[Paper Reading] Unsupervised Learning of Sentence Embeddings using Compositi...
[Paper Reading]  Unsupervised Learning of Sentence Embeddings using Compositi...[Paper Reading]  Unsupervised Learning of Sentence Embeddings using Compositi...
[Paper Reading] Unsupervised Learning of Sentence Embeddings using Compositi...Hiroki Shimanaka
 
A SVM Applied Text Categorization of Academia-Industry Collaborative Research...
A SVM Applied Text Categorization of Academia-Industry Collaborative Research...A SVM Applied Text Categorization of Academia-Industry Collaborative Research...
A SVM Applied Text Categorization of Academia-Industry Collaborative Research...National Institute of Informatics
 
A Simple Introduction to Neural Information Retrieval
A Simple Introduction to Neural Information RetrievalA Simple Introduction to Neural Information Retrieval
A Simple Introduction to Neural Information RetrievalBhaskar Mitra
 
Exploring temporal graph data with Python: 
a study on tensor decomposition o...
Exploring temporal graph data with Python: 
a study on tensor decomposition o...Exploring temporal graph data with Python: 
a study on tensor decomposition o...
Exploring temporal graph data with Python: 
a study on tensor decomposition o...André Panisson
 
Molecular autoencoder
Molecular autoencoderMolecular autoencoder
Molecular autoencoderDan Elton
 
Graph Spectra through Network Complexity Measures: Information Content of Eig...
Graph Spectra through Network Complexity Measures: Information Content of Eig...Graph Spectra through Network Complexity Measures: Information Content of Eig...
Graph Spectra through Network Complexity Measures: Information Content of Eig...Hector Zenil
 
A genetic algorithm for a university weekly courses timetabling problem
A genetic algorithm for a university weekly courses timetabling problemA genetic algorithm for a university weekly courses timetabling problem
A genetic algorithm for a university weekly courses timetabling problemMotasem Smadi
 
Reading papers - survey on Non-Convex Optimization
Reading papers - survey on Non-Convex OptimizationReading papers - survey on Non-Convex Optimization
Reading papers - survey on Non-Convex OptimizationX 37
 
The Pupil Has Become the Master: Teacher-Student Model-Based Word Embedding D...
The Pupil Has Become the Master: Teacher-Student Model-Based Word Embedding D...The Pupil Has Become the Master: Teacher-Student Model-Based Word Embedding D...
The Pupil Has Become the Master: Teacher-Student Model-Based Word Embedding D...Jinho Choi
 
Lecture artificial neural networks and pattern recognition
Lecture   artificial neural networks and pattern recognitionLecture   artificial neural networks and pattern recognition
Lecture artificial neural networks and pattern recognitionHưng Đặng
 
Lecture artificial neural networks and pattern recognition
Lecture   artificial neural networks and pattern recognitionLecture   artificial neural networks and pattern recognition
Lecture artificial neural networks and pattern recognitionHưng Đặng
 
Pratik aasigement
Pratik aasigementPratik aasigement
Pratik aasigementPratik Ojha
 
Take-Home Exam Questions on Brain and Computation'
Take-Home Exam Questions on Brain and Computation'Take-Home Exam Questions on Brain and Computation'
Take-Home Exam Questions on Brain and Computation'butest
 

Similar a Tutorial on end-to-end text-to-speech synthesis: Part 1 – Neural waveform modeling (20)

エンドツーエンド音声合成に向けたNIIにおけるソフトウェア群 ~ TacotronとWaveNetのチュートリアル (Part 1)~
エンドツーエンド音声合成に向けたNIIにおけるソフトウェア群 ~ TacotronとWaveNetのチュートリアル (Part 1)~エンドツーエンド音声合成に向けたNIIにおけるソフトウェア群 ~ TacotronとWaveNetのチュートリアル (Part 1)~
エンドツーエンド音声合成に向けたNIIにおけるソフトウェア群 ~ TacotronとWaveNetのチュートリアル (Part 1)~
 
Lecture14 xing fei-fei
Lecture14 xing fei-feiLecture14 xing fei-fei
Lecture14 xing fei-fei
 
University timetable scheduling
University timetable schedulingUniversity timetable scheduling
University timetable scheduling
 
Voice Cloning
Voice CloningVoice Cloning
Voice Cloning
 
[Paper Reading] Unsupervised Learning of Sentence Embeddings using Compositi...
[Paper Reading]  Unsupervised Learning of Sentence Embeddings using Compositi...[Paper Reading]  Unsupervised Learning of Sentence Embeddings using Compositi...
[Paper Reading] Unsupervised Learning of Sentence Embeddings using Compositi...
 
Lecture1.pptx
Lecture1.pptxLecture1.pptx
Lecture1.pptx
 
A SVM Applied Text Categorization of Academia-Industry Collaborative Research...
A SVM Applied Text Categorization of Academia-Industry Collaborative Research...A SVM Applied Text Categorization of Academia-Industry Collaborative Research...
A SVM Applied Text Categorization of Academia-Industry Collaborative Research...
 
A Simple Introduction to Neural Information Retrieval
A Simple Introduction to Neural Information RetrievalA Simple Introduction to Neural Information Retrieval
A Simple Introduction to Neural Information Retrieval
 
Exploring temporal graph data with Python: 
a study on tensor decomposition o...
Exploring temporal graph data with Python: 
a study on tensor decomposition o...Exploring temporal graph data with Python: 
a study on tensor decomposition o...
Exploring temporal graph data with Python: 
a study on tensor decomposition o...
 
Molecular autoencoder
Molecular autoencoderMolecular autoencoder
Molecular autoencoder
 
Graph Spectra through Network Complexity Measures: Information Content of Eig...
Graph Spectra through Network Complexity Measures: Information Content of Eig...Graph Spectra through Network Complexity Measures: Information Content of Eig...
Graph Spectra through Network Complexity Measures: Information Content of Eig...
 
A genetic algorithm for a university weekly courses timetabling problem
A genetic algorithm for a university weekly courses timetabling problemA genetic algorithm for a university weekly courses timetabling problem
A genetic algorithm for a university weekly courses timetabling problem
 
Reading papers - survey on Non-Convex Optimization
Reading papers - survey on Non-Convex OptimizationReading papers - survey on Non-Convex Optimization
Reading papers - survey on Non-Convex Optimization
 
The Pupil Has Become the Master: Teacher-Student Model-Based Word Embedding D...
The Pupil Has Become the Master: Teacher-Student Model-Based Word Embedding D...The Pupil Has Become the Master: Teacher-Student Model-Based Word Embedding D...
The Pupil Has Become the Master: Teacher-Student Model-Based Word Embedding D...
 
Computing finance
Computing financeComputing finance
Computing finance
 
Lecture artificial neural networks and pattern recognition
Lecture   artificial neural networks and pattern recognitionLecture   artificial neural networks and pattern recognition
Lecture artificial neural networks and pattern recognition
 
Lecture artificial neural networks and pattern recognition
Lecture   artificial neural networks and pattern recognitionLecture   artificial neural networks and pattern recognition
Lecture artificial neural networks and pattern recognition
 
Pratik aasigement
Pratik aasigementPratik aasigement
Pratik aasigement
 
Take-Home Exam Questions on Brain and Computation'
Take-Home Exam Questions on Brain and Computation'Take-Home Exam Questions on Brain and Computation'
Take-Home Exam Questions on Brain and Computation'
 
L3 v2
L3 v2L3 v2
L3 v2
 

Más de Yamagishi Laboratory, National Institute of Informatics, Japan

エンドツーエンド音声合成に向けたNIIにおけるソフトウェア群 ~ TacotronとWaveNetのチュートリアル (Part 2)~
エンドツーエンド音声合成に向けたNIIにおけるソフトウェア群 ~ TacotronとWaveNetのチュートリアル (Part 2)~エンドツーエンド音声合成に向けたNIIにおけるソフトウェア群 ~ TacotronとWaveNetのチュートリアル (Part 2)~
エンドツーエンド音声合成に向けたNIIにおけるソフトウェア群 ~ TacotronとWaveNetのチュートリアル (Part 2)~Yamagishi Laboratory, National Institute of Informatics, Japan
 

Más de Yamagishi Laboratory, National Institute of Informatics, Japan (16)

DDS: A new device-degraded speech dataset for speech enhancement
DDS: A new device-degraded speech dataset for speech enhancement DDS: A new device-degraded speech dataset for speech enhancement
DDS: A new device-degraded speech dataset for speech enhancement
 
The VoiceMOS Challenge 2022
The VoiceMOS Challenge 2022The VoiceMOS Challenge 2022
The VoiceMOS Challenge 2022
 
Analyzing Language-Independent Speaker Anonymization Framework under Unseen C...
Analyzing Language-Independent Speaker Anonymization Framework under Unseen C...Analyzing Language-Independent Speaker Anonymization Framework under Unseen C...
Analyzing Language-Independent Speaker Anonymization Framework under Unseen C...
 
Spoofing-aware Attention Back-end with Multiple Enrollment and Novel Trials S...
Spoofing-aware Attention Back-end with Multiple Enrollment and Novel Trials S...Spoofing-aware Attention Back-end with Multiple Enrollment and Novel Trials S...
Spoofing-aware Attention Back-end with Multiple Enrollment and Novel Trials S...
 
Odyssey 2022: Language-Independent Speaker Anonymization Approach using Self-...
Odyssey 2022: Language-Independent Speaker Anonymization Approach using Self-...Odyssey 2022: Language-Independent Speaker Anonymization Approach using Self-...
Odyssey 2022: Language-Independent Speaker Anonymization Approach using Self-...
 
Odyssey 2022: Investigating self-supervised front ends for speech spoofing co...
Odyssey 2022: Investigating self-supervised front ends for speech spoofing co...Odyssey 2022: Investigating self-supervised front ends for speech spoofing co...
Odyssey 2022: Investigating self-supervised front ends for speech spoofing co...
 
Generalization Ability of MOS Prediction Networks
Generalization Ability of MOS Prediction NetworksGeneralization Ability of MOS Prediction Networks
Generalization Ability of MOS Prediction Networks
 
Estimating the confidence of speech spoofing countermeasure
Estimating the confidence of speech spoofing countermeasureEstimating the confidence of speech spoofing countermeasure
Estimating the confidence of speech spoofing countermeasure
 
Attention Back-end for Automatic Speaker Verification with Multiple Enrollmen...
Attention Back-end for Automatic Speaker Verification with Multiple Enrollmen...Attention Back-end for Automatic Speaker Verification with Multiple Enrollmen...
Attention Back-end for Automatic Speaker Verification with Multiple Enrollmen...
 
How do Voices from Past Speech Synthesis Challenges Compare Today?
 How do Voices from Past Speech Synthesis Challenges Compare Today? How do Voices from Past Speech Synthesis Challenges Compare Today?
How do Voices from Past Speech Synthesis Challenges Compare Today?
 
Text-to-Speech Synthesis Techniques for MIDI-to-Audio Synthesis
Text-to-Speech Synthesis Techniques for MIDI-to-Audio SynthesisText-to-Speech Synthesis Techniques for MIDI-to-Audio Synthesis
Text-to-Speech Synthesis Techniques for MIDI-to-Audio Synthesis
 
Preliminary study on using vector quantization latent spaces for TTS/VC syste...
Preliminary study on using vector quantization latent spaces for TTS/VC syste...Preliminary study on using vector quantization latent spaces for TTS/VC syste...
Preliminary study on using vector quantization latent spaces for TTS/VC syste...
 
Neural Waveform Modeling
Neural Waveform Modeling Neural Waveform Modeling
Neural Waveform Modeling
 
Neural source-filter waveform model
Neural source-filter waveform modelNeural source-filter waveform model
Neural source-filter waveform model
 
Tutorial on end-to-end text-to-speech synthesis: Part 2 – Tactron and related...
Tutorial on end-to-end text-to-speech synthesis: Part 2 – Tactron and related...Tutorial on end-to-end text-to-speech synthesis: Part 2 – Tactron and related...
Tutorial on end-to-end text-to-speech synthesis: Part 2 – Tactron and related...
 
エンドツーエンド音声合成に向けたNIIにおけるソフトウェア群 ~ TacotronとWaveNetのチュートリアル (Part 2)~
エンドツーエンド音声合成に向けたNIIにおけるソフトウェア群 ~ TacotronとWaveNetのチュートリアル (Part 2)~エンドツーエンド音声合成に向けたNIIにおけるソフトウェア群 ~ TacotronとWaveNetのチュートリアル (Part 2)~
エンドツーエンド音声合成に向けたNIIにおけるソフトウェア群 ~ TacotronとWaveNetのチュートリアル (Part 2)~
 

Último

Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FMESafe Software
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...apidays
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingEdi Saputra
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century educationjfdjdjcjdnsjd
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyKhushali Kathiriya
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDropbox
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Victor Rentea
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Orbitshub
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...apidays
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businesspanagenda
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Jeffrey Haguewood
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FMESafe Software
 
Ransomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdfRansomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdfOverkill Security
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodJuan lago vázquez
 
Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024The Digital Insurer
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...apidays
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxRustici Software
 

Último (20)

Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Ransomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdfRansomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdf
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 

Tutorial on end-to-end text-to-speech synthesis: Part 1 – Neural waveform modeling