Author: Xu Tan · Venue: IJCAI 2020 · Affiliation: Microsoft
acoustic model
AR & LSTM model
Tacotron (location-sensitive attention)
![在这里插入图片描述](https://img-blog.csdnimg.cn/685abed285ca47ee9ef4c789b01d7049.png?x-oss-process=image/watermark,type_d3F5LXplbmhlaQ,shadow_50,text_Q1NETiBA5p6X5p6X5a6L,size_20,color_FFFFFF,t_70,g_se,x_16)
DurIAN
- Separate duration model, making duration explicitly controllable
![在这里插入图片描述](https://img-blog.csdnimg.cn/a6d066c53e064aa2b9eab777731b7c50.png?x-oss-process=image/watermark,type_d3F5LXplbmhlaQ,shadow_50,text_Q1NETiBA5p6X5p6X5a6L,size_20,color_FFFFFF,t_70,g_se,x_16)
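The explicit duration control above boils down to a length regulator: expand each phoneme's hidden state by its predicted frame count. A minimal NumPy sketch (function name and the `speed` knob are illustrative assumptions, not DurIAN's code):

```python
# Hypothetical length-regulator sketch: repeat per-phoneme hidden states
# by integer durations; scaling durations before rounding gives explicit
# global speed control.
import numpy as np

def length_regulate(hidden, durations, speed=1.0):
    """hidden: (num_phonemes, dim); durations: frames per phoneme."""
    frames = np.maximum(1, np.round(np.asarray(durations) / speed)).astype(int)
    # Each phoneme state is repeated for its number of frames.
    return np.repeat(hidden, frames, axis=0)

h = np.arange(6, dtype=float).reshape(3, 2)   # 3 phonemes, dim=2
out = length_regulate(h, [2, 1, 3])
print(out.shape)  # (6, 2)
```

Halving `speed` doubles every duration, so the same utterance is synthesized at half speed without retraining.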
NAR & CNN/Transformer model
DeepVoice 3
- Fully convolutional architecture, which speeds up inference; supports different acoustic-feature outputs (vocoders: WORLD, Griffin-Lim, WaveNet); supports multi-speaker synthesis (~2,000 speakers, each with a small amount of data)
![在这里插入图片描述](https://img-blog.csdnimg.cn/c3b694ee151e452f9ab4f67bcacba2fc.png?x-oss-process=image/watermark,type_d3F5LXplbmhlaQ,shadow_50,text_Q1NETiBA5p6X5p6X5a6L,size_20,color_FFFFFF,t_70,g_se,x_16)
TransformerTTS
- Similar in structure to Tacotron, but replaces the LSTMs in the encoder & decoder with Transformer blocks; training is parallelized, with quality on par with Tacotron 2; however, because the parallel attention has no recurrence or locality bias, attention robustness is weaker (word skipping/repeating)
![在这里插入图片描述](https://img-blog.csdnimg.cn/64a7143fa3b24d2c9c8a7bde0ce16d20.png?x-oss-process=image/watermark,type_d3F5LXplbmhlaQ,shadow_50,text_Q1NETiBA5p6X5p6X5a6L,size_17,color_FFFFFF,t_70,g_se,x_16)
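The robustness point can be seen from the attention itself. A NumPy sketch of plain dot-product attention (an illustrative simplification, not the paper's code): the weights depend only on query/key content, with no built-in bias toward the monotonic, local alignments that location-sensitive attention encourages.

```python
# Hypothetical sketch of content-only dot-product attention as used in
# Transformer TTS; nothing constrains the alignment to be monotonic.
import numpy as np

def dot_product_attention(q, k, v):
    """q: (tq, d); k, v: (tk, d). Returns contexts and weights."""
    scores = q @ k.T / np.sqrt(q.shape[-1])       # content-only scores
    scores -= scores.max(axis=-1, keepdims=True)  # numerically stable softmax
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v, w

rng = np.random.default_rng(0)
q = rng.normal(size=(4, 8))
k = rng.normal(size=(6, 8))
v = rng.normal(size=(6, 8))
ctx, weights = dot_product_attention(q, k, v)
```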
FastSpeech 2
![在这里插入图片描述](https://img-blog.csdnimg.cn/29409fcaf04f482b94389e38c55bbbde.png?x-oss-process=image/watermark,type_d3F5LXplbmhlaQ,shadow_50,text_Q1NETiBA5p6X5p6X5a6L,size_19,color_FFFFFF,t_70,g_se,x_16) ![在这里插入图片描述](https://img-blog.csdnimg.cn/1bae8a0500264db39692cab1fc44b8fd.png?x-oss-process=image/watermark,type_d3F5LXplbmhlaQ,shadow_50,text_Q1NETiBA5p6X5p6X5a6L,size_19,color_FFFFFF,t_70,g_se,x_16)
![在这里插入图片描述](https://img-blog.csdnimg.cn/e43fa8f3c26847aab0751a89453d1a1f.png?x-oss-process=image/watermark,type_d3F5LXplbmhlaQ,shadow_50,text_Q1NETiBA5p6X5p6X5a6L,size_20,color_FFFFFF,t_70,g_se,x_16)
- FastSpeech is trained with a teacher-student (knowledge-distillation) pipeline, and some information is lost during distillation;
- To ease the one-to-many mapping problem, FastSpeech 2 adds extra conditioning inputs (duration, pitch, energy); during training these features are extracted directly from the target speech, while at inference they come from predictors (trained jointly with the FastSpeech 2 model);
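The train/inference switch above can be sketched in a few lines. This is a hypothetical NumPy illustration of the variance-adaptor idea (the linear "predictor", bin count, and names are assumptions, not the paper's implementation): quantize a pitch value into an embedding and add it to the hidden sequence, using ground truth when available and the predictor otherwise.

```python
# Hypothetical variance-adaptor sketch: ground-truth pitch at training
# time, predicted pitch at inference time.
import numpy as np

def quantize_embed(values, table, lo=-1.0, hi=1.0):
    """Map continuous pitch/energy values to embedding rows via bins."""
    bins = np.linspace(lo, hi, table.shape[0] - 1)
    return table[np.digitize(values, bins)]

def variance_adaptor(hidden, pitch_table, w, target_pitch=None):
    predicted = hidden @ w            # stand-in for a learned predictor
    pitch = target_pitch if target_pitch is not None else predicted
    # Add the pitch embedding to the hidden sequence.
    return hidden + quantize_embed(pitch, pitch_table), predicted

rng = np.random.default_rng(0)
h = rng.normal(size=(5, 4))
table = rng.normal(size=(8, 4))
w = rng.normal(size=(4,))
train_out, pred = variance_adaptor(h, table, w, target_pitch=rng.uniform(-1, 1, 5))
infer_out, _ = variance_adaptor(h, table, w)   # falls back to the predictor
```

In the real model the predictor is trained with an MSE loss against the extracted pitch, so both branches stay consistent.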
VITS
![在这里插入图片描述](https://img-blog.csdnimg.cn/2fbc753c8375405eac88e461156fe88f.png?x-oss-process=image/watermark,type_d3F5LXplbmhlaQ,shadow_50,text_Q1NETiBA5p6X5p6X5a6L,size_20,color_FFFFFF,t_70,g_se,x_16)
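At VITS's core is a conditional VAE. A minimal NumPy sketch of the two VAE ingredients, assuming a standard-normal prior for simplicity (VITS actually uses a flow-transformed, text-conditioned prior, and adds adversarial training):

```python
# Hypothetical VAE-core sketch: reparameterized sampling from the
# posterior q(z|x), and its KL divergence to an N(0, I) prior.
import numpy as np

def reparameterize(mu, logvar, rng):
    eps = rng.normal(size=mu.shape)
    return mu + np.exp(0.5 * logvar) * eps   # z = mu + sigma * eps

def kl_to_standard_normal(mu, logvar):
    # KL(N(mu, sigma^2) || N(0, 1)), summed over latent dimensions.
    return 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar)

rng = np.random.default_rng(0)
mu, logvar = np.zeros(3), np.zeros(3)
z = reparameterize(mu, logvar, rng)
kl = kl_to_standard_normal(mu, logvar)   # 0 when posterior == prior
```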
vocoder
![在这里插入图片描述](https://img-blog.csdnimg.cn/c7d9af543b9f4392974b57e2856547e5.png?x-oss-process=image/watermark,type_d3F5LXplbmhlaQ,shadow_50,text_Q1NETiBA5p6X5p6X5a6L,size_20,color_FFFFFF,t_70,g_se,x_16)
LPCNet
HiFi-GAN
PWG (Parallel WaveGAN)
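LPCNet's efficiency comes from classical linear prediction: each sample is modeled as a weighted sum of past samples, so the network only has to generate the much simpler excitation (residual). A hypothetical NumPy sketch of that prediction step (not LPCNet's actual implementation):

```python
# Hypothetical linear-prediction sketch: predict sample t from the
# previous `order` samples, leaving only the residual to model.
import numpy as np

def lpc_predict(signal, coeffs):
    order = len(coeffs)
    pred = np.zeros_like(signal)
    for t in range(order, len(signal)):
        # Weighted sum of the most recent `order` samples.
        pred[t] = coeffs @ signal[t - order:t][::-1]
    return pred

# An AR(1) signal x[t] = 0.9 * x[t-1] + noise is predicted almost
# perfectly by a first-order LPC with coefficient 0.9.
rng = np.random.default_rng(0)
x = np.zeros(200)
noise = 0.01 * rng.normal(size=200)
for t in range(1, 200):
    x[t] = 0.9 * x[t - 1] + noise[t]
pred = lpc_predict(x, np.array([0.9]))
residual = x - pred   # what remains for the neural net to model
```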
Advanced topics in TTS
![在这里插入图片描述](https://img-blog.csdnimg.cn/fbd8bcd728c041128e1df74d181af065.png?x-oss-process=image/watermark,type_d3F5LXplbmhlaQ,shadow_50,text_Q1NETiBA5p6X5p6X5a6L,size_20,color_FFFFFF,t_70,g_se,x_16)
expressive
![在这里插入图片描述](https://img-blog.csdnimg.cn/5eef2b55cce94fea8fb118e2a2d0dde0.png?x-oss-process=image/watermark,type_d3F5LXplbmhlaQ,shadow_50,text_Q1NETiBA5p6X5p6X5a6L,size_20,color_FFFFFF,t_70,g_se,x_16)
Synthesize clean speech for noisy speakers
(unclear; see slide 120/434)
adaptive for everyone
- The base model needs strong generalization, since the target speaker's style may differ from the base corpus; otherwise quality degrades noticeably;
- With only a small amount of data, fine-tune just the relevant subset of parameters (split the model into a phoneme encoder, speaker encoder, etc., and update only the speaker encoder);
- AdaSpeech 2: adaptation with only a small amount of speaker data
- AdaSpeech 3: from reading style to spontaneous style
![在这里插入图片描述](https://img-blog.csdnimg.cn/7f440a7d038d438a8d2d07cf3f10c31d.png?x-oss-process=image/watermark,type_d3F5LXplbmhlaQ,shadow_50,text_Q1NETiBA5p6X5p6X5a6L,size_20,color_FFFFFF,t_70,g_se,x_16)
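Updating only the speaker encoder amounts to masking the optimizer step by module name. A hypothetical NumPy sketch (module names and the plain SGD step are illustrative assumptions):

```python
# Hypothetical parameter-efficient adaptation sketch: freeze everything
# except parameters whose name falls under the speaker encoder.
import numpy as np

params = {
    "phoneme_encoder.w": np.ones((4, 4)),
    "decoder.w": np.ones((4, 4)),
    "speaker_encoder.w": np.ones((4, 4)),
}

def sgd_step(params, grads, lr=0.1, trainable_prefix="speaker_encoder"):
    """Apply gradients only to parameters under the trainable module."""
    for name, g in grads.items():
        if name.startswith(trainable_prefix):
            params[name] = params[name] - lr * g
    return params

grads = {name: np.ones_like(p) for name, p in params.items()}
params = sgd_step(params, grads)   # only speaker_encoder.w moves
```

With few target-speaker utterances, this limits both overfitting and the per-speaker storage cost (only the small updated module is saved per speaker).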