Enhancing Multilingual TTS with Voice Conversion Based Data Augmentation and Posterior Embedding - LY Corporation R&D

CONFERENCE (INTERNATIONAL) Enhancing Multilingual TTS with Voice Conversion Based Data Augmentation and Posterior Embedding

Hyun-Wook Yoon (NAVER Cloud), Jin-Seob Kim (NAVER Cloud), Ryuichi Yamamoto, Ryo Terashima, Chan-Ho Song (NAVER Cloud), Jae-Min Kim (NAVER Cloud), Eunwoo Song (NAVER Cloud)

2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2024)

April 14, 2024

This paper proposes a multilingual, multi-speaker (MM) text-to-speech (TTS) system that uses voice conversion (VC)-based data augmentation. Creating an MM-TTS model is challenging because polyglot data is difficult to collect from multiple speakers. To address this problem, we adopt a cross-lingual, multi-speaker (CM) VC model trained on multiple speakers' monolingual databases. Because this model transfers acoustic attributes while retaining the content information, it can generate polyglot corpora for each speaker. We then design the MM-TTS model with variational autoencoder (VAE)-based posterior embeddings. Incorporating VC-augmented polyglot corpora into TTS training can degrade synthesis quality, since the corpora sometimes contain unwanted artifacts. To mitigate this issue, the VAE is trained to capture the acoustic dissimilarity between the recorded and VC-augmented datasets. By selecting posterior embeddings obtained from the original recordings in the training set, the proposed model generates acoustically clearer voices.
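The data flow described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `Utterance` record, the `plan_vc_augmentation` and `reference_embedding` helpers, and the choice to average the selected embeddings are all assumptions made for illustration, and the VC model and VAE themselves are replaced by placeholders. The sketch only shows the bookkeeping: each recorded utterance is paired with speakers of other languages to form a polyglot corpus, and at synthesis time posterior embeddings are taken only from original recordings.

```python
from dataclasses import dataclass
from statistics import fmean
from typing import List, Tuple


@dataclass(frozen=True)
class Utterance:
    # Hypothetical record type; the field names are illustrative only.
    speaker: str                      # voice identity heard in the audio
    language: str                     # language of the spoken content
    text: str
    is_recorded: bool                 # True: original recording, False: VC output
    posterior: Tuple[float, ...] = ()  # stand-in for a VAE posterior embedding


def plan_vc_augmentation(recorded: List[Utterance]) -> List[Utterance]:
    """Pair each recorded utterance with every speaker of a different language."""
    # Assumes each speaker is monolingual, as in the abstract's setting.
    native = {u.speaker: u.language for u in recorded}
    augmented = []
    for u in recorded:
        for target, lang in native.items():
            if lang != u.language:
                # A real system would run the VC model here; this sketch
                # only records which converted items would be produced.
                augmented.append(
                    Utterance(target, u.language, u.text, is_recorded=False)
                )
    return augmented


def reference_embedding(corpus: List[Utterance], speaker: str) -> Tuple[float, ...]:
    """Select one speaker's embeddings from recorded data only, then average.

    Averaging is an assumption of this sketch; the abstract states only that
    embeddings from the original recordings are selected.
    """
    chosen = [
        u.posterior
        for u in corpus
        if u.speaker == speaker and u.is_recorded and u.posterior
    ]
    if not chosen:
        raise ValueError(f"no recorded embeddings for speaker {speaker!r}")
    return tuple(fmean(dim) for dim in zip(*chosen))
```

Filtering on `is_recorded` is the point of the sketch: VC-augmented items still contribute to training coverage, but they are excluded when the reference embedding is chosen, so any artifacts they carry do not shape the conditioning signal.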

Paper: Enhancing Multilingual TTS with Voice Conversion Based Data Augmentation and Posterior Embedding