Cost-Minimized Label-Flipping Poisoning Attack to LLM Alignment - LINE Yahoo Research & Development

Publications

Conference (International) Cost-Minimized Label-Flipping Poisoning Attack to LLM Alignment

Shigeki Kusaka (University of Tsukuba), Keita Saito (University of Tsukuba), Mikoto Kudo (University of Tsukuba/RIKEN AIP), Takumi Tanabe, Akifumi Wachi, Youhei Akimoto (University of Tsukuba/RIKEN AIP/Institute of Science Tokyo)

The 40th Annual AAAI Conference on Artificial Intelligence (AAAI-26)

2026.1.24

Large language models (LLMs) are increasingly deployed in real-world systems, making it critical to understand their vulnerabilities. While data poisoning attacks during RLHF/DPO alignment have been studied empirically, their theoretical foundations remain unclear. We investigate the minimum-cost poisoning attack required to steer an LLM’s policy toward an attacker’s target by flipping preference labels during RLHF/DPO, without altering the compared outputs. We formulate this as a convex optimization problem with linear constraints, deriving lower and upper bounds on the minimum attack cost. As a byproduct of this theoretical analysis, we show that any existing label-flipping attack can be post-processed via our proposed method to reduce the number of label flips required while preserving the intended poisoning effect. Empirical results demonstrate that this cost-minimization post-processing can significantly reduce poisoning costs over baselines, particularly when the reward model’s feature dimension is small relative to the dataset size. These findings highlight fundamental vulnerabilities in RLHF/DPO pipelines and provide tools to evaluate their robustness against low-cost poisoning attacks.
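The post-processing idea in the abstract can be illustrated with a toy sketch. It is not the paper's method: it assumes a Bradley-Terry preference model with a reward that is linear in features, under which the labels enter the log-likelihood only through the sum of label-weighted feature differences. Any two flip sets that shift this sum by the same amount then lead to the same fitted reward model. The brute-force search below stands in for the convex program mentioned in the abstract, and all names (`flip_effect`, `min_cost_equivalent_flips`, the data `z`, `y`, `baseline`) are hypothetical.

```python
from itertools import combinations


def flip_effect(z, y, flips):
    """Shift in sum_i y_i * z_i caused by flipping the labels indexed by `flips`.

    z: list of feature-difference vectors, one per preference pair (assumed).
    y: list of binary labels in {0, 1}.
    Flipping label i replaces y_i with 1 - y_i, adding (1 - 2*y_i) * z_i.
    """
    d = len(z[0])
    total = [0] * d
    for i in flips:
        sign = 1 - 2 * y[i]
        for k in range(d):
            total[k] += sign * z[i][k]
    return tuple(total)


def min_cost_equivalent_flips(z, y, baseline_flips):
    """Smallest flip set with the same effect as `baseline_flips`.

    Exhaustive search in increasing set size; only feasible for tiny n.
    """
    target = flip_effect(z, y, baseline_flips)
    n = len(z)
    for size in range(len(baseline_flips) + 1):
        for cand in combinations(range(n), size):
            if flip_effect(z, y, cand) == target:
                return cand
    # Fallback: no smaller or equal-size match found, keep the baseline.
    return tuple(baseline_flips)


# Toy data: five preference pairs with 2-dimensional integer feature differences.
z = [(1, 0), (0, 1), (1, 1), (2, 1), (1, 2)]
y = [0, 0, 0, 1, 0]
baseline = [0, 1]  # a hypothetical existing attack that flips two labels

reduced = min_cost_equivalent_flips(z, y, baseline)
print(reduced)  # (2,)
```

In this toy case a single flip of pair 2 reproduces the effect of flipping pairs 0 and 1. With more pairs than feature dimensions there are many such equivalent flip sets, which is consistent with the abstract's observation that savings are largest when the feature dimension is small relative to the dataset size.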

Paper: Cost-Minimized Label-Flipping Poisoning Attack to LLM Alignment