A Unified Accent Estimation Method Based on Multi-Task Learning for Japanese Text-to-Speech - LINEヤフー Research and Development

Publications

Conference (International): A Unified Accent Estimation Method Based on Multi-Task Learning for Japanese Text-to-Speech

Byeongseon Park, Ryuichi Yamamoto, Kentaro Tachibana

The 23rd Annual Conference of the International Speech Communication Association (INTERSPEECH 2022)

2022.9.18

We propose a unified accent estimation method for Japanese text-to-speech (TTS). Unlike conventional two-stage methods, which train separate models to predict accent phrase boundaries and accent nucleus positions, our method merges the two models and jointly optimizes the entire model in a multi-task learning framework. Furthermore, considering the hierarchical linguistic structure of intonation phrases (IPs), accent phrases, and accent nuclei, we generalize the proposed approach to model IP boundaries simultaneously with accent information. Objective evaluation results show that the proposed method achieves an accent estimation accuracy of 80.4%, which is 6.67% higher than the conventional two-stage method. When the proposed method is incorporated into a neural TTS framework, the system achieves a mean opinion score of 4.29 for prosody naturalness.
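The joint optimization described in the abstract can be illustrated with a generic multi-task objective: one loss per prediction task, combined into a single weighted sum. The following is a minimal sketch only, not the authors' implementation. The task names (`ip_boundary`, `ap_boundary`, `accent_nucleus`), the binary per-token labels, the equal task weights, and the helper functions are all assumptions made for illustration; the abstract does not specify the loss, the label scheme, or the weighting.

```python
import math
from typing import Dict, List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over one token's class scores."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(token_logits: List[List[float]], labels: List[int]) -> float:
    """Mean negative log-likelihood of the gold labels over all tokens."""
    losses = [-math.log(softmax(scores)[gold])
              for scores, gold in zip(token_logits, labels)]
    return sum(losses) / len(losses)


def multitask_loss(task_logits: Dict[str, List[List[float]]],
                   task_labels: Dict[str, List[int]],
                   task_weights: Dict[str, float]) -> float:
    """Weighted sum of per-task cross-entropy losses (hypothetical objective)."""
    return sum(task_weights[task] * cross_entropy(task_logits[task], task_labels[task])
               for task in task_logits)


# Toy input: 4 tokens, 2 classes per task (1 = boundary / nucleus, 0 = none).
# Hypothetical task names; uniform logits stand in for an untrained shared model.
uniform = [[0.0, 0.0] for _ in range(4)]
logits = {"ip_boundary": uniform, "ap_boundary": uniform, "accent_nucleus": uniform}
labels = {
    "ip_boundary": [0, 0, 0, 1],
    "ap_boundary": [0, 1, 0, 1],
    "accent_nucleus": [1, 0, 0, 0],
}
weights = {task: 1.0 for task in logits}  # assumed equal weighting

loss = multitask_loss(logits, labels, weights)
# With uniform logits each task contributes ln 2, so the total is 3 * ln 2.
print(loss)
```

In a real system the three sets of logits would come from task-specific output layers on top of a shared encoder, so that minimizing the single combined loss updates the shared parameters for all tasks at once. That sharing is what distinguishes a joint model from two separately trained ones.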

Paper: A Unified Accent Estimation Method Based on Multi-Task Learning for Japanese Text-to-Speech