ZTE Communications ›› 2023, Vol. 21 ›› Issue (3): 29-36. DOI: 10.12142/ZTECOM.202303005

• Special Topic •

Boundary Data Augmentation for Offline Reinforcement Learning

SHEN Jiahao1,2, JIANG Ke1,2, TAN Xiaoyang1,2()   

  1. College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing 211106, China
  2. MIIT Key Laboratory of Pattern Analysis and Machine Intelligence, Nanjing 211106, China
  • Received: 2023-07-05 Online: 2023-09-21 Published: 2023-03-22
  • About author: SHEN Jiahao is currently a postgraduate student at the College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, China. His research interests include reinforcement learning and generative models. | JIANG Ke is currently a PhD student at the College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, China. His research interest is reinforcement learning. | TAN Xiaoyang (x.tan@nuaa.edu.cn) is currently a faculty member of the Department of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, China. His current research interests include machine learning and pattern recognition, computer vision, and reinforcement learning.
  • Supported by:
    the National Key R&D Program of China (2021ZD0113203) and the National Science Foundation of China (61976115)

Abstract:

Offline reinforcement learning (ORL) aims to learn a rational agent purely from behavior data, without any online interaction. One of the major challenges in ORL is distribution shift, i.e., the mismatch between the knowledge of the learned policy and the reality of the underlying environment. Recent works usually handle this in an overly pessimistic manner, avoiding out-of-distribution (OOD) queries as much as possible, which can harm the robustness of the agent at unseen states. In this paper, we propose a simple but effective method to address this issue. The key idea is to enhance the robustness of the policy learned offline by weakening its confidence in highly uncertain regions. We locate such regions by simulating them with a modified Generative Adversarial Net (GAN), whose generated data follow the same distribution as the old experience yet are difficult for the behavior policy or some other reference policy to deal with. We then use this information to regularize the ORL algorithm, penalizing overconfident behavior in these regions. Extensive experiments on several publicly available offline RL benchmarks demonstrate the feasibility and effectiveness of the proposed method.
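To make the idea concrete, below is a minimal, hypothetical PyTorch sketch of the scheme described in the abstract. It is not the authors' implementation: the network architectures, loss weights, and the use of a Q-value under a reference policy as a "difficulty" proxy are all our assumptions. A modified GAN generates states that look in-distribution yet are hard for the reference policy, and those states are used to penalize overconfident Q-values in the offline learner.

import torch
import torch.nn as nn

# Hypothetical dimensions for a MuJoCo-style locomotion task.
state_dim, action_dim, noise_dim = 17, 6, 32

# Generator maps noise to synthetic "boundary" states; the discriminator keeps them in-distribution.
generator = nn.Sequential(nn.Linear(noise_dim, 256), nn.ReLU(), nn.Linear(256, state_dim))
discriminator = nn.Sequential(nn.Linear(state_dim, 256), nn.ReLU(), nn.Linear(256, 1))

# Q-network and a fixed reference policy (e.g., a behavior-cloned policy) from the base ORL learner.
q_net = nn.Sequential(nn.Linear(state_dim + action_dim, 256), nn.ReLU(), nn.Linear(256, 1))
ref_policy = nn.Sequential(nn.Linear(state_dim, 256), nn.ReLU(), nn.Linear(256, action_dim), nn.Tanh())

g_opt = torch.optim.Adam(generator.parameters(), lr=3e-4)
d_opt = torch.optim.Adam(discriminator.parameters(), lr=3e-4)
bce = nn.BCEWithLogitsLoss()


def gan_step(real_states, difficulty_weight=1.0):
    """One update of the modified GAN: generated states should look in-distribution
    (standard GAN losses) yet be difficult, here proxied by a low Q-value for the
    reference policy's action at those states (an assumed difficulty measure)."""
    batch = real_states.size(0)
    fake_states = generator(torch.randn(batch, noise_dim))

    # Discriminator update: real offline states vs. generated states.
    d_loss = bce(discriminator(real_states), torch.ones(batch, 1)) + \
             bce(discriminator(fake_states.detach()), torch.zeros(batch, 1))
    d_opt.zero_grad(); d_loss.backward(); d_opt.step()

    # Generator update: fool the discriminator AND minimize the value the reference
    # policy would obtain at the generated states, i.e., make them hard to deal with.
    # Note: gradients accumulated in q_net/ref_policy here must be zeroed before their own updates.
    difficulty = q_net(torch.cat([fake_states, ref_policy(fake_states)], dim=-1)).mean()
    g_loss = bce(discriminator(fake_states), torch.ones(batch, 1)) + difficulty_weight * difficulty
    g_opt.zero_grad(); g_loss.backward(); g_opt.step()
    return fake_states.detach()


def overconfidence_penalty(fake_states, policy, alpha=0.1):
    """Regularizer added to the base ORL objective: push down the Q-values of the
    learned policy's own actions at the generated boundary states."""
    return alpha * q_net(torch.cat([fake_states, policy(fake_states)], dim=-1)).mean()


# Example usage with a placeholder minibatch of offline states
# (ref_policy stands in for the learned policy for illustration).
real_states = torch.randn(256, state_dim)
boundary_states = gan_step(real_states)
penalty = overconfidence_penalty(boundary_states, ref_policy)  # added to the critic/actor loss

In a full training loop, gan_step would be interleaved with the base ORL updates, and overconfidence_penalty would simply be added to the critic (or actor) objective; the weight alpha trades off pessimism at the generated boundary states against fitting the offline data.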

Key words: offline reinforcement learning, out-of-distribution state, robustness, uncertainty