Publications

CONFERENCE (INTERNATIONAL)

Noise-Robust Voice Conversion by Conditional Denoising Training Using Latent Variables of Recording Quality and Environment

Takuto Igarashi (The University of Tokyo), Yuki Saito (The University of Tokyo), Kentaro Seki (The University of Tokyo), Shinnosuke Takamichi (The University of Tokyo), Ryuichi Yamamoto, Kentaro Tachibana, Hiroshi Saruwatari (The University of Tokyo)

The 25th Annual Conference of the International Speech Communication Association (INTERSPEECH 2024)

September 01, 2024

We propose a noise-robust voice conversion (VC) method that takes into account the recording quality and environment of noisy source speech. Conventional denoising training improves the noise robustness of a VC model by learning a noisy-to-clean VC process. However, the naturalness of the converted speech is limited when the noise of the source speech is unseen during training. To address this, our proposed training conditions a VC model on two latent variables representing the recording quality and environment of the source speech. These latent variables are derived from deep neural networks pre-trained on recording quality assessment and acoustic scene classification, and are calculated in an utterance-wise or frame-wise manner. As a result, the trained VC model can explicitly learn information about speech degradation during training. Objective and subjective evaluations show that our training improves the quality of the converted speech compared to conventional training.
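
The sketch below illustrates the general idea of conditioning a VC decoder on quality and environment latents during denoising training. It is a minimal toy example, not the authors' actual architecture: all module names, dimensions, and the use of utterance-level embeddings are assumptions for illustration, and the pre-trained quality assessor and scene classifier are stood in for by placeholder tensors.

import torch
import torch.nn as nn

class ConditionedVCDecoder(nn.Module):
    """Toy VC decoder conditioned on recording-quality and acoustic-scene latents (hypothetical)."""

    def __init__(self, content_dim=256, quality_dim=32, scene_dim=32, hidden_dim=256, mel_dim=80):
        super().__init__()
        # Project the concatenated condition (quality + scene latents) to the hidden size.
        self.cond_proj = nn.Linear(quality_dim + scene_dim, hidden_dim)
        self.content_proj = nn.Linear(content_dim, hidden_dim)
        self.decoder = nn.GRU(hidden_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, mel_dim)

    def forward(self, content, quality_emb, scene_emb):
        # content:     (B, T, content_dim)  frame-level content features of the noisy source speech
        # quality_emb: (B, quality_dim)     utterance-level latent from a pre-trained quality assessor
        # scene_emb:   (B, scene_dim)       utterance-level latent from a pre-trained scene classifier
        cond = self.cond_proj(torch.cat([quality_emb, scene_emb], dim=-1))  # (B, hidden_dim)
        h = self.content_proj(content) + cond.unsqueeze(1)                  # broadcast condition over frames
        h, _ = self.decoder(h)
        return self.out(h)                                                  # predicted clean mel-spectrogram

# Denoising-training step: noisy source features in, clean target mel-spectrogram as the target.
model = ConditionedVCDecoder()
content = torch.randn(4, 100, 256)    # placeholder content features
quality_emb = torch.randn(4, 32)      # placeholder recording-quality latent
scene_emb = torch.randn(4, 32)        # placeholder acoustic-scene latent
clean_mel = torch.randn(4, 100, 80)   # clean target mel-spectrogram
loss = nn.functional.l1_loss(model(content, quality_emb, scene_emb), clean_mel)
loss.backward()

Because the condition encodes how the source speech is degraded, the model does not have to infer this implicitly from the noisy input alone; the paper's frame-wise variant would supply time-varying latents instead of a single utterance-level vector.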

Paper : Noise-Robust Voice Conversion by Conditional Denoising Training Using Latent Variables of Recording Quality and Environment (external link)

PDF : Noise-Robust Voice Conversion by Conditional Denoising Training Using Latent Variables of Recording Quality and Environment