Bagpiper: Solving Open-Ended Audio Tasks via Rich Captions - LINE Yahoo Research and Development

Publications

Other (International) Bagpiper: Solving Open-Ended Audio Tasks via Rich Captions

Jinchuan Tian (Carnegie Mellon University), Haoran Wang (Carnegie Mellon University), Bo-Hao Su (Carnegie Mellon University), Chien-Yu Huang (Carnegie Mellon University), Qingzheng Wang (Carnegie Mellon University), Jiatong Shi (Carnegie Mellon University), William Chen (Carnegie Mellon University), Xun Gong (Carnegie Mellon University), Siddhant Arora (Carnegie Mellon University), Chin-Jou Li (Carnegie Mellon University), Masao Someki (Carnegie Mellon University), Takashi Maekaku, Keita Goto, Yusuke Shinohara, Jin Sakuma, Chao-Han Huck Yang (NVIDIA Research), Shinji Watanabe (Carnegie Mellon University)

arXiv.org (arXiv)

2026.2.6

Current audio foundation models typically rely on rigid, task-specific supervision, addressing isolated factors of audio rather than the whole. In contrast, human intelligence processes audio holistically, seamlessly bridging physical signals with abstract cognitive concepts to execute complex tasks. Grounded in this philosophy, we introduce Bagpiper, an 8B audio foundation model that interprets physical audio via rich captions, i.e., comprehensive natural language descriptions that encapsulate the critical cognitive concepts inherent in the signal (e.g., transcription, audio events). By pre-training on a massive corpus of 600B tokens, the model establishes a robust bidirectional mapping between raw audio and this high-level conceptual space. During fine-tuning, Bagpiper adopts a caption-then-process workflow, simulating an intermediate cognitive reasoning step to solve diverse tasks without task-specific priors. Experimentally, Bagpiper outperforms Qwen-2.5-Omni on MMAU and AIR-Bench for audio understanding and surpasses CosyVoice3 and TangoFlux in generation quality, and it can synthesize arbitrary compositions of speech, music, and sound effects. To the best of our knowledge, Bagpiper is among the first works to achieve unified understanding and generation for general audio. Model, data, and code are available at the Bagpiper home page.
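The caption-then-process workflow described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: `RichCaption`, `caption_then_process`, `toy_captioner`, and `toy_processor` are hypothetical names introduced here, and the two stages are plain stub functions standing in for model calls. The sketch only shows the control flow, in which audio is first mapped to a rich caption and the task is then solved from that caption plus the instruction.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class RichCaption:
    """Hypothetical container for a rich caption (fields are assumptions)."""
    transcription: str
    audio_events: List[str]

    def to_text(self) -> str:
        # Flatten the caption into plain text for the second stage.
        events = ", ".join(self.audio_events) or "none"
        return f"Transcription: {self.transcription}\nAudio events: {events}"


def caption_then_process(
    audio: bytes,
    instruction: str,
    captioner: Callable[[bytes], RichCaption],
    processor: Callable[[str, str], str],
) -> str:
    """Run the two stages in order: caption the audio, then process the task."""
    caption = captioner(audio)                          # stage 1: audio -> rich caption
    return processor(instruction, caption.to_text())    # stage 2: caption + task -> output


# Stub stages standing in for model calls (assumptions, not a real API).
def toy_captioner(audio: bytes) -> RichCaption:
    return RichCaption(transcription="hello world", audio_events=["door slam"])


def toy_processor(instruction: str, caption_text: str) -> str:
    return f"{instruction}\n{caption_text}"


result = caption_then_process(
    b"", "What sound follows the speech?", toy_captioner, toy_processor
)
print(result)
```

In this sketch the intermediate caption is an explicit value, so it can be inspected or logged before the second stage runs; whether the actual model exposes the caption in this way is not stated in the abstract.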

Speech Processing

Paper: Bagpiper: Solving Open-Ended Audio Tasks via Rich Captions