Publications
CONFERENCE (INTERNATIONAL)
Online Register For Dual-Mode Self-Supervised Speech Models: Mitigating the Lack of Future Context
Keita Goto, Takashi Maekaku, Jin Sakuma, Jinchuan Tian (Carnegie Mellon University), Yusuke Shinohara, Shinji Watanabe (Carnegie Mellon University)
2026 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2026)
May 04, 2026
Self-supervised speech models (S3Ms) have become a key foundation for modern speech processing. However, models pre-trained under offline conditions often suffer from degraded performance in online scenarios due to the absence of future context. To address this challenge, we propose online registers, learnable tokens appended to each chunk in online mode. These tokens act as virtual placeholders for unseen future frames, enabling the model to compensate for missing context without introducing additional latency. Furthermore, we introduce a future prediction loss that explicitly guides the registers to capture predictive cues, thereby enriching their ability to retain future information. Experiments on LibriSpeech and out-of-domain benchmarks demonstrate that online registers consistently narrow the performance gap between offline and online modes, particularly in low-latency settings, achieving a 3.4% relative improvement on LibriSpeech with 160 ms chunks.
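The core idea of appending register tokens to each online chunk can be illustrated with a minimal NumPy sketch. This is a toy shape-level illustration only, not the paper's implementation: the chunk length, register count, and the `append_registers` helper are all hypothetical, and in the actual model the registers would be learned parameters inside a transformer, not random vectors.

```python
import numpy as np

rng = np.random.default_rng(0)

T, D = 12, 4          # toy sequence: 12 frames, 4-dim features
chunk_len = 4         # frames per online chunk (hypothetical)
num_registers = 2     # register tokens per chunk (hypothetical)

frames = rng.standard_normal((T, D))
# In the real model these would be learnable parameters, shared across chunks.
registers = rng.standard_normal((num_registers, D))

def append_registers(frames, registers, chunk_len):
    """Split frames into fixed-length chunks and append register tokens
    to each chunk as placeholders for the unseen future frames."""
    chunks = [frames[i:i + chunk_len] for i in range(0, len(frames), chunk_len)]
    return [np.concatenate([chunk, registers], axis=0) for chunk in chunks]

augmented = append_registers(frames, registers, chunk_len)
print(len(augmented), augmented[0].shape)  # 3 chunks, each (chunk_len + num_registers, D)
```

Because the registers are appended rather than waiting for real future frames, the chunk can be processed immediately, which is why the mechanism adds no extra latency.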
Paper: Online Register For Dual-Mode Self-Supervised Speech Models: Mitigating the Lack of Future Context (external link)
PDF: Online Register For Dual-Mode Self-Supervised Speech Models: Mitigating the Lack of Future Context