Long-term Safe Reinforcement Learning with Binary Feedback - LINEヤフーの研究開発

Publications

カンファレンス (国際) Long-term Safe Reinforcement Learning with Binary Feedback

Akifumi Wachi, Wataru Hashimoto (Osaka University), Kazumune Hashimoto (Osaka University)

Thirty-Eighth AAAI Conference on Artificial Intelligence (AAAI-24)

2024.3.24

Safety is an indispensable requirement for applying reinforcement learning (RL) to real problems. Although there has been a surge of safe RL algorithms proposed in recent years, most existing work typically 1) relies on receiving nu- meric safety feedback; 2) does not guarantee safety during the learning process; 3) limits the problem to a priori known, deterministic transition dynamics; and/or 4) assume the existence of a known safe policy for any states. Addressing the issues mentioned above, we thus propose Long-term Binary- feedback Safe RL (LoBiSaRL), a safe RL algorithm for constrained Markov decision processes (CMDPs) with binary safety feedback and an unknown, stochastic state transition function. LoBiSaRL optimizes a policy to maximize rewards while guaranteeing a long-term safety that an agent executes only safe state-action pairs throughout each episode with high probability. Specifically, LoBiSaRL models the binary safety function via a generalized linear model (GLM) and conservatively takes only a safe action at every time step while inferring its effect on future safety under proper assumptions. Our theoretical results show that LoBiSaRL guarantees the long-term safety constraint, with high probability. Finally, our empirical results demonstrate that our algorithm is safer than existing methods without significantly compromising performance in terms of reward.

Paper : Long-term Safe Reinforcement Learning with Binary Feedback 新しいタブまたはウィンドウで開く（外部サイト）