Publications

CONFERENCE (INTERNATIONAL) Semi-Supervised Speaker Adaptation for End-to-End Speech Synthesis with Pretrained Models

Katsuki Inoue (Okayama University), Sunao Hara (Okayama University), Masanobu Abe (Okayama University), Tomoki Hayashi (Nagoya University), Ryuichi Yamamoto, Shinji Watanabe (Johns Hopkins University)

2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2020)

May 04, 2020

Recently, end-to-end text-to-speech (TTS) models have achieved remarkable performance; however, they require a large amount of paired text and speech data for training. On the other hand, dozens of minutes of unpaired speech recordings can easily be collected for a target speaker without corresponding text data. To make use of such accessible data, the proposed method leverages the recent success of state-of-the-art end-to-end automatic speech recognition (ASR) systems and obtains corresponding transcriptions from pretrained ASR models. Although these models provide only raw text output rather than intermediate linguistic features such as phonemes, end-to-end TTS can be trained well on such raw text directly. The proposed method thus greatly simplifies the speaker adaptation pipeline by consistently employing end-to-end ASR/TTS ecosystems. Experimental results show that the proposed method achieves performance comparable to a paired-data adaptation method in terms of subjective speaker similarity and objective cepstral distance measures.
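A minimal sketch of the adaptation pipeline described above: unpaired speech for the target speaker is pseudo-labeled with a pretrained end-to-end ASR model, and the resulting text/speech pairs are used to fine-tune a pretrained end-to-end TTS model. This is not the authors' code; `transcribe` and `fine_tune` are hypothetical placeholders standing in for whichever ASR/TTS toolkit is used.

```python
# Hypothetical sketch of the semi-supervised speaker adaptation pipeline
# described in the abstract; method names are illustrative placeholders,
# not the API of any specific toolkit.
from pathlib import Path

def adapt_tts_to_speaker(unpaired_wav_dir, pretrained_asr, pretrained_tts,
                         adaptation_steps=2000):
    """Pseudo-label unpaired speech with ASR, then fine-tune TTS on the pairs."""
    pseudo_pairs = []
    for wav_path in sorted(Path(unpaired_wav_dir).glob("*.wav")):
        # Step 1: obtain a raw-text transcription (no phoneme-level features)
        # from the pretrained end-to-end ASR model.
        text = pretrained_asr.transcribe(wav_path)
        pseudo_pairs.append((text, wav_path))

    # Step 2: fine-tune the pretrained end-to-end TTS model directly on the
    # pseudo text/speech pairs of the target speaker.
    adapted_tts = pretrained_tts.fine_tune(pseudo_pairs, steps=adaptation_steps)
    return adapted_tts
```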

Paper: Semi-Supervised Speaker Adaptation for End-to-End Speech Synthesis with Pretrained Models (external link)