Boundary Data Augmentation for Offline Reinforcement Learning

doi:10.12142/ZTECOM.202303005

Abstract

Abstract:

Offline reinforcement learning (ORL) aims to learn a rational agent purely from behavior data without any online interaction. One of the major challenges encountered in ORL is the problem of distribution shift, i.e., the mismatch between the knowledge of the learned policy and the reality of the underlying environment. Recent works usually handle this in a too pessimistic manner to avoid out-of-distribution (OOD) queries as much as possible, but this can influence the robustness of the agents at unseen states. In this paper, we propose a simple but effective method to address this issue. The key idea of our method is to enhance the robustness of the new policy learned offline by weakening its confidence in highly uncertain regions, and we propose to find those regions by simulating them with modified Generative Adversarial Nets (GAN) such that the generated data not only follow the same distribution with the old experience but are very difficult to deal with by themselves, with regard to the behavior policy or some other reference policy. We then use this information to regularize the ORL algorithm to penalize the overconfidence behavior in these regions. Extensive experiments on several publicly available offline RL benchmarks demonstrate the feasibility and effectiveness of the proposed method.

Key words: offline reinforcement learning, out‐of‐distribution state, robustness, uncertainty

SHEN Jiahao, JIANG Ke, TAN Xiaoyang. Boundary Data Augmentation for Offline Reinforcement Learning[J]. ZTE Communications, 2023, 21(3): 29-36.

Figures/Tables 8

Figure 1 Overall architecture of the proposed Boundary Conservative Q Learning (Boundary-CQL, BCQL) method, where the left column illustrates the pipeline to generate boundary OOD data based on an adversarial generative model, while the right column performs Offline RL with the generated data

Table 1 Performance of BCQL and prior methods on MuJoCo tasks from D4RL, on the normalized return metric (the highest means are bolded)

Task Name	BC	BEAR	SAC	TD3+BC	CQL	BCQL
Halfcheetah-medium-v2	42.4±0.2	37.1±2.3	55.2±27.8	48.3±0.3	47.1±0.2	47.1±0.7
Hopper-medium-v2	53.5±2.0	30.8±0.9	0.8±0.0	59.3±4.2	64.9±4.1	66.1±5.2
Walker2d-medium-v2	63.2±18.8	56±8.5	-0.3±0.2	83.7±2.1	80.4±3.5	84.6±2.5
Halfcheetah-medium-replay-v2	35.7±2.7	36.2±5.6	0.8±1.0	44.6±0.5	45.2±0.6	46.1±1.5
Hopper-medium-replay-v2	29.8±2.4	31.1±7.2	7.4±0.5	60.9±18.8	87.7±14.4	93.9±11.8
Walker2d-medium-replay-v2	21.8±11.7	13.6±2.1	-0.4±0.3	81.8±5.5	79.3±4.9	82.5±7.3
Halfcheetah-medium-expert-v2	56.0±8.5	44.2±13.8	28.4±19.4	90.7±4.3	96±0.8	97.5±3.2
Hopper-medium-expert-v2	52.3±4.6	67.3±32.5	0.7±0.0	98.0±9.4	93.9±14.3	95.9±13.3
Walker2d-medium-expert-v2	99.0±18.5	43.8±6.0	1.9±3.9	110.1±0.5	109.7±0.5	110.2±1.0
Halfcheetah-expert-v2	91.8±1.5	100.2±1.8	-0.8±1.8	96.7±1.1	96.3±1.3	98.4±3.2
Hopper-expert-v2	107.7±0.7	108.3±3.5	0.7±0.0	107.8±7	109.5±14.3	111.7±8.3
Walker2d-expert-v2	106.7±0.2	106.1±6.0	0.7±0.3	110.2±0.3	108.5±0.5	109.7±1.0
Total average	63.3	56.2	7.9	82.7	83.8	87.0

Figure 2 Policy performance during training in different environments

Figure 3 Visualization of data generated in different environments

Figure 4 Distribution of generated states and real states in different environments

Figure 5 Confidence of actions on generated data under different ?βG in a walker2d-medium environment

Table 2 KL divergence between generated states and origin states under different βG in a walker2d-medium environment

$β G$	KL Divergence	$β G$	KL Divergence
0.2	0.08	1.0	0.41
0.4	0.21	1.5	0.57
0.5	0.34	2.0	0.76

Table 2 KL divergence between generated states and origin states under different βG in a walker2d-medium environment

$β G$	KL Divergence	$β G$	KL Divergence
0.2	0.08	1.0	0.41
0.4	0.21	1.5	0.57
0.5	0.34	2.0	0.76

Table 3 Performance of BCQL with different λ, on the normalized return metric (the highest means are bolded)

Task Name	$λ = 0 ? (C Q L)$	$λ = 0.5$	$λ = 1$	$λ = 1.5$
Halfcheetah-medium-v2	47.1±0.2	46.8±1.3	47.1±0.7	46.1±2.2
Hopper-medium-v2	64.9±4.1	64.8±7.6	64.3±4.2	66.1±7.2
Walker2d-medium-v2	80.4±3.5	84.6±2.5	81.8±2.1	79.3±4.5
Halfcheetah-medium-replay-v2	45.2±0.6	45.0±1.0	44.9±0.5	46.1±1.5
Hopper-medium-replay-v2	87.7±14.4	90.2±10.5	93.9±11.8	83.5±14.4
Walker2d-medium-replay-v2	79.3±4.9	82.5±7.3	80.7±5.5	77.7±8.9
Halfcheetah-medium-expert-v2	96±0.8	95.2±0.4	97.5±3.2	96.9±1.1
Hopper-medium-expert-v2	93.9±14.3	94.0±8.7	95.9±13.3	94.8±11.3
Walker2d-medium-expert-v2	109.7±0.5	110.2±1.0	110.1±0.5	109.3±0.3
Halfcheetah-expert-v2	96.3±1.3	98.4±3.2	97.4±2.3	98.2±1.3
Hopper-expert-v2	106.5±14.3	109.7±7.9	111.7±8.3	111.2±10.2
Walker2d-expert-v2	108.5±0.5	108.7±0.3	109.7±1.0	109.5±0.5

Table 3 Performance of BCQL with different λ, on the normalized return metric (the highest means are bolded)

Task Name	$λ = 0 ? (C Q L)$	$λ = 0.5$	$λ = 1$	$λ = 1.5$
Halfcheetah-medium-v2	47.1±0.2	46.8±1.3	47.1±0.7	46.1±2.2
Hopper-medium-v2	64.9±4.1	64.8±7.6	64.3±4.2	66.1±7.2
Walker2d-medium-v2	80.4±3.5	84.6±2.5	81.8±2.1	79.3±4.5
Halfcheetah-medium-replay-v2	45.2±0.6	45.0±1.0	44.9±0.5	46.1±1.5
Hopper-medium-replay-v2	87.7±14.4	90.2±10.5	93.9±11.8	83.5±14.4
Walker2d-medium-replay-v2	79.3±4.9	82.5±7.3	80.7±5.5	77.7±8.9
Halfcheetah-medium-expert-v2	96±0.8	95.2±0.4	97.5±3.2	96.9±1.1
Hopper-medium-expert-v2	93.9±14.3	94.0±8.7	95.9±13.3	94.8±11.3
Walker2d-medium-expert-v2	109.7±0.5	110.2±1.0	110.1±0.5	109.3±0.3
Halfcheetah-expert-v2	96.3±1.3	98.4±3.2	97.4±2.3	98.2±1.3
Hopper-expert-v2	106.5±14.3	109.7±7.9	111.7±8.3	111.2±10.2
Walker2d-expert-v2	108.5±0.5	108.7±0.3	109.7±1.0	109.5±0.5

References 37

1	CHEN D, CHEN K A, LI Z J, et al. PowerNet: multi-agent deep reinforcement learning for scalable powergrid control [J]. IEEE transactions on power systems, 2022, 37(2): 1007–1017. DOI: 10.1109/TPWRS.2021.3100898 DOI URL
2	CHEN X C, YAO L N, MCAULEY J, et al. A survey of deep reinforcement learning in recommender systems: a systematic review and future directions [EB/OL]. (2021-09-08)[2023-04-12].
3	OLIFF H, LIU Y, KUMAR M, et al. Reinforcement learning for facilitating human-robot-interaction in manufacturing [J]. Journal of manufacturing systems, 2020, 56: 326–340. DOI: 10.1016/j.jmsy.2020.06.018 DOI URL
4	YU C, LIU J M, NEMATI S, et al. Reinforcement learning in healthcare: a survey [J]. ACM computing surveys, 2023, 55(1): 1–36. DOI: 10.1145/3477600 DOI URL
5	GRIGORESCU S, TRASNEA B, COCIAS T, et al. A survey of deep learning techniques for autonomous driving [J]. Journal of field robotics, 2020, 37(3): 362–386. DOI: 10.1002/rob.21918 DOI URL
6	FUJIMOTO S, MEGER D, PRECUP D. Off-policy deep reinforcement learning without exploration [C]//International Conference on Machine Learning. PMLR, 2019: 2052–2062. DOI: 10.48550/arXiv.1812.02900 DOI URL
7	LEVINE S, KUMAR A, TUCKER G, et al. Offline reinforcement learning: tutorial, review and perspectives on open problems [EB/OL]. (2020-05-04)[2023-04-12].
8	KUMAR A, ZHOU A, TUCKER G, et al. Conservative Q-learning for offline reinforcement learning [EB/OL]. (2020-06-08)[2022-11-08].
9	COUPRIE C, FARABET C, NAJMAN L, et al. Indoor semantic segmentation using depth information [EB/OL]. (2013-01-16)[2023-04-02].
10	MCCRACKEN M W. Robust out-of-sample inference [J]. Journal of econometrics, 2000, 99(2): 195–223. DOI: 10.1016/S0304-4076(00)00022-1 DOI URL
11	YU T H, THOMAS G, YU L T, et al. MOPO: model-based offline policy optimization [EB/OL]. (2020-05-27)[2022-10-08].
12	GUO K Y, SHAO Y F, GENG Y H. Model-based offline reinforcement learning with pessimism-modulated dynamics belief [EB/OL]. (2022-10-13)[2023-04-02].
13	WU C Y, MANMATHA R, SMOLA A J, et al. Sampling matters in deep embedding learning [C]//IEEE International Conference on Computer Vision (ICCV). IEEE, 2017: 2859–2867. DOI: 10.1109/ICCV.2017.309 DOI URL
14	SHRIVASTAVA A, GUPTA A, GIRSHICK R. Training region-based object detectors with online hard example mining [C]//IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2016: 761–769. DOI: 10.1109/CVPR.2016.89 DOI URL
15	ROBINSON J, CHUANG C Y, SRA S, et al. Contrastive learning with hard negative samples [EB/OL]. (2020-10-09)[2023-03-23].
16	FUJIMOTO S, MEGER D, PRECUP D. Off-policy deep reinforcement learning without exploration [EB/OL]. (2018-12-7)[2022-10-2].
17	KUMAR A, FU J, TUCKER G, et al. Stabilizing off-policy Q-learning via bootstrapping error reduction [C]//33rd International Conference on Neural Information Processing Systems. Curran Associates Inc., 2019: 11784–11794. DOI: 10.48550/arXiv.1906.00949 DOI URL
18	WU Y F, TUCKER G, NACHUM O. Behavior regularized offline reinforcement learning [EB/OL]. (2019-11-26)[2022-09-11].
19	AGARWAL R, SCHUURMANS D, NOROUZI M. An optimistic perspective on offline reinforcement learning [EB/OL]. (2019-07-10)[2022-10-11].
20	LYU J F, MA X T, LI X, et al. Mildly conservative Q-learning for offline reinforcement learning [EB/OL]. (2022-07-09)[2023-04-12].
21	YANG J K, ZHOU K Y, LI Y X, et al. Generalized out-of-distribution detection: a survey [EB/OL]. (2021-10-21)[2022-03-01].
22	GOODFELLOW I, POUGET-ABADIE J, MIRZA M, et al. Generative adversarial networks [J]. Communications of the ACM, 2020, 63(11): 139–144. DOI: 10.1145/3422622 DOI URL
23	KINGMA D P and WELLING M. Auto-encoding variational bayes [EB/OL]. (2013-12-20)[2023-02-14].
24	PIDHORSKYI S, ALMOHSEN R, ADJEROH D A, et al. Generative probabilistic novelty detection with adversarial autoencoders [EB/OL]. (2018-07-06)[2023-04-10].
25	TIAN K, ZHOU S G, FAN J P, et al. Learning competitive and discriminative reconstructions for anomaly detection [C]//33th AAAI Conference on Artificial Intelligence. ACM, 2019: 5167–5174. DOI: 10.1609/aaai.v33i01.33015167 DOI URL
26	DEECKE L, VANDERMEULEN R, RUFF L, et al. Image anomaly detection with generative adversarial networks [C]//European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases. ECMLPKDD, 2018. DOI: 10.1007/978-3-030-10925-7_1 DOI URL
27	ZHOU K, XIAO Y T, YANG J L, et al. Encoding structure-texture relation with P-net for anomaly detection in retinal images [M]//Computer Vision–ECCV 2020. Cham, Switzerland: Springer international publishing, 2020
28	WEI H, YE D, LIU Z, et al. Boosting offline reinforcement learning with residual generative modeling [C]//Thirtieth International Joint Conference on Artificial Intelligence. IJCAI, 2021: 3574–3580. DOI: 10.24963/ijcai.2021/492 DOI URL
29	WANG Z, HUNT J J, ZHOU M. Diffusion policies as an expressive policy class for offline reinforcement learning [EB/OL]. (2022-08-12)[2023-05-15].
30	CHEN H, LU C, YING C, et al. Offline reinforcement learning via high-fidelity generative behavior modeling [EB/OL]. (2022-09-29)[2023-05-10].
31	MNIH V, KAVUKCUOGLU K, SILVER D, et al. Playing atari with deep reinforcement learning [EB/OL]. (2013-12-19)[2022-09-03].
32	SZITA I, LÖRINCZ A. Learning tetris using the noisy cross-entropy method [J]. Neural computation, 2006, 18(12): 2936–2941. DOI: 10.1162/neco.2006.18.12.2936 DOI URL
33	SOHN K, YAN X, LEE H, et al. Learning structured output representation using deep conditional generative models [C]//28th International Conference on Neural Information Processing Systems. NIPS, 2015: 3483–3491
34	FU J, KUMAR A, NACHUM O, et al. D4 RL: datasets for deep data-driven reinforcement learning [EB/OL]. (2020-04-15)[2022-08-15].
35	HAARNOJA T, ZHOU A, ABBEEL P, et al. Soft actor-critic: off-policy maximum entropy deep reinforcement learning with a stochastic actor [C]//International Conference on Machine Learning. Association for Computing Machinery, 2018. DOI: 10.48550/arXiv.1801.01290 DOI URL
36	FUJIMOTO S, GU S S. A minimalist approach to offline reinforcement learning [EB/OL]. (2021-06-12)[2022-09-13].
37	TARASOV D, NIKULIN A, AKIMOV D, et al. CORL: research-oriented deep offline reinforcement learning library [C]//3rd Offline Reinforcement Learning Workshop: Offline RL as a “Launchpad”. NeurIPS, 2022. DOI: 10.48550/arXiv.2210.07105 DOI URL

[1]	YU Junpeng, CHEN Yiyu. A Practical Reinforcement Learning Framework for Automatic Radar Detection [J]. ZTE Communications, 2023, 21(3): 22-28.
[2]	Biao He, Xiangyun Zhou, and Thushara D. Abhayapala. Wireless Physical Layer Security with Imperfect Channel State Information: A Survey [J]. ZTE Communications, 2013, 11(3): 11-19.