Improving LPCNet-Based Text-to-Speech with Linear Prediction-Structured Mixture Density Network - LY Corporation R&D

CONFERENCE (INTERNATIONAL) Improving LPCNet-Based Text-to-Speech with Linear Prediction-Structured Mixture Density Network

Min-Jae Hwang (Search Solutions Inc), Eunwoo Song (NAVER), Ryuichi Yamamoto, Frank Soong (Microsoft Research Asia), Hong-Goo Kang (Yonsei University)

2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2020)

May 04, 2020

In this paper, we propose an improved LPCNet vocoder using a linear prediction (LP)-structured mixture density network (MDN). The recently proposed LPCNet vocoder has enabled high-quality, lightweight speech synthesis by combining a vocal tract LP filter with a WaveRNN-based vocal source (i.e., excitation) generator. However, the quality of the synthesized speech is often unstable because the vocal source component is insufficiently represented by the μ-law quantization method, and because the model is trained without considering the entire speech production mechanism. To address this problem, we first introduce LP-MDN, which enables the autoregressive neural vocoder to structurally represent the interactions between the vocal tract and vocal source components. Then, we propose to incorporate the LP-MDN into the LPCNet vocoder by replacing the conventional discretized output with a continuous density distribution. The experimental results verify that the proposed system provides high-quality synthetic speech, achieving a mean opinion score of 4.41 within a text-to-speech framework.
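The abstract does not spell out the LP-MDN formulation, so the following is only a minimal sketch of the two ideas it contrasts. It assumes that the network predicts mixture parameters for the excitation and that the speech distribution is obtained by shifting each component mean by the LP prediction; that reading is an assumption, not a detail confirmed on this page. All function names, LP coefficients, and mixture values are hypothetical and chosen for illustration.

```python
import math

def mu_law_encode(x, mu=255):
    """Compress x in [-1, 1] with mu-law, then quantize to an integer in [0, mu]."""
    y = math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)
    return int(round((y + 1) / 2 * mu))

def mu_law_decode(q, mu=255):
    """Map a quantized index in [0, mu] back to a sample in [-1, 1]."""
    y = 2 * q / mu - 1
    return math.copysign(math.expm1(abs(y) * math.log1p(mu)) / mu, y)

def lp_predict(history, coeffs):
    """Linear prediction p_t = sum_k a_k * x_{t-k}; history[-1] is x_{t-1}."""
    return sum(a * history[-k] for k, a in enumerate(coeffs, start=1))

def shift_mixture(weights, exc_means, scales, prediction):
    """Assumed LP structuring: keep weights and scales, shift means by p_t."""
    return weights, [m + prediction for m in exc_means], scales

def mixture_mean(weights, means):
    """Expected value of a mixture with the given weights and component means."""
    return sum(w * m for w, m in zip(weights, means))

# Discretized output: a sample survives only up to the quantization error.
sample = 0.3
quantization_error = abs(mu_law_decode(mu_law_encode(sample)) - sample)

# Continuous output: hypothetical excitation mixture, shifted by the LP prediction.
history = [0.10, 0.20, 0.30]      # x_{t-3}, x_{t-2}, x_{t-1}
coeffs = [1.5, -0.6]              # hypothetical LP coefficients a_1, a_2
prediction = lp_predict(history, coeffs)
weights, speech_means, scales = shift_mixture(
    [0.7, 0.3], [0.01, -0.02], [0.05, 0.10], prediction
)
print(quantization_error, prediction, mixture_mean(weights, speech_means))
```

In this sketch the 8-bit μ-law path always returns one of 256 levels, whereas the shifted mixture describes a continuous distribution over the speech sample whose mean equals the LP prediction plus the mean of the excitation mixture.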

Paper: Improving LPCNet-Based Text-to-Speech with Linear Prediction-Structured Mixture Density Network (external link)