ZTE Communications ›› 2023, Vol. 21 ›› Issue (3): 29-36. DOI: 10.12142/ZTECOM.202303005
• Special Topic •
Boundary Data Augmentation for Offline Reinforcement Learning
SHEN Jiahao, JIANG Ke, TAN Xiaoyang
Received: 2023-07-05
Online: 2023-09-21
Published: 2023-03-22
About authors:
SHEN Jiahao is currently a postgraduate student in the College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, China. His research interests include reinforcement learning and generative models.
JIANG Ke is currently a PhD student in the College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, China. His research interest is reinforcement learning.
TAN Xiaoyang (Supported by:
SHEN Jiahao, JIANG Ke, TAN Xiaoyang. Boundary Data Augmentation for Offline Reinforcement Learning[J]. ZTE Communications, 2023, 21(3): 29-36.
URL: https://zte.magtechjournal.com/EN/10.12142/ZTECOM.202303005
Figure 1 Overall architecture of the proposed Boundary Conservative Q-Learning (BCQL) method: the left column illustrates the pipeline that generates boundary out-of-distribution (OOD) data with an adversarial generative model, while the right column performs offline RL with the generated data
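For intuition, the following is a minimal PyTorch-style sketch of the two-stage flow the figure depicts: an adversarially trained generator produces boundary OOD states from dataset states, and a CQL-style conservative penalty is then applied to Q-values at those generated states. The network sizes, the `BoundaryGenerator` class, and the coefficient `alpha` are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn

STATE_DIM, ACTION_DIM, LATENT_DIM = 17, 6, 8  # assumed sizes for a MuJoCo-like task

# Stage 1 (left column of Figure 1): a generator that maps dataset states to
# nearby out-of-distribution (boundary) states via an encode-decode bottleneck.
class BoundaryGenerator(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(), nn.Linear(64, LATENT_DIM))
        self.decoder = nn.Sequential(nn.Linear(LATENT_DIM, 64), nn.ReLU(), nn.Linear(64, STATE_DIM))

    def forward(self, s):
        return self.decoder(self.encoder(s))  # perturbed / reconstructed state

# Stage 2 (right column of Figure 1): a Q-network trained with a CQL-style
# conservative term that is also evaluated on the generated boundary states.
q_net = nn.Sequential(nn.Linear(STATE_DIM + ACTION_DIM, 256), nn.ReLU(), nn.Linear(256, 1))
gen = BoundaryGenerator()

def conservative_loss(states, actions, boundary_states, policy_actions, alpha=1.0):
    """Push Q down on (boundary state, policy action) pairs and up on dataset pairs."""
    q_data = q_net(torch.cat([states, actions], dim=-1))
    q_ood = q_net(torch.cat([boundary_states, policy_actions], dim=-1))
    return alpha * (q_ood.mean() - q_data.mean())

# Toy usage with random tensors standing in for a D4RL batch.
s = torch.randn(32, STATE_DIM)
a = torch.randn(32, ACTION_DIM)
s_boundary = gen(s).detach()        # generated boundary OOD states
a_pi = torch.randn(32, ACTION_DIM)  # actions from the current policy (placeholder)
conservative_loss(s, a, s_boundary, a_pi).backward()
```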
Task Name | BC | BEAR | SAC | TD3+BC | CQL | BCQL |
---|---|---|---|---|---|---|
Halfcheetah-medium-v2 | 42.4±0.2 | 37.1±2.3 | 55.2±27.8 | 48.3±0.3 | 47.1±0.2 | 47.1±0.7 |
Hopper-medium-v2 | 53.5±2.0 | 30.8±0.9 | 0.8±0.0 | 59.3±4.2 | 64.9±4.1 | 66.1±5.2 |
Walker2d-medium-v2 | 63.2±18.8 | 56±8.5 | -0.3±0.2 | 83.7±2.1 | 80.4±3.5 | 84.6±2.5 |
Halfcheetah-medium-replay-v2 | 35.7±2.7 | 36.2±5.6 | 0.8±1.0 | 44.6±0.5 | 45.2±0.6 | 46.1±1.5 |
Hopper-medium-replay-v2 | 29.8±2.4 | 31.1±7.2 | 7.4±0.5 | 60.9±18.8 | 87.7±14.4 | 93.9±11.8 |
Walker2d-medium-replay-v2 | 21.8±11.7 | 13.6±2.1 | -0.4±0.3 | 81.8±5.5 | 79.3±4.9 | 82.5±7.3 |
Halfcheetah-medium-expert-v2 | 56.0±8.5 | 44.2±13.8 | 28.4±19.4 | 90.7±4.3 | 96±0.8 | 97.5±3.2 |
Hopper-medium-expert-v2 | 52.3±4.6 | 67.3±32.5 | 0.7±0.0 | 98.0±9.4 | 93.9±14.3 | 95.9±13.3 |
Walker2d-medium-expert-v2 | 99.0±18.5 | 43.8±6.0 | 1.9±3.9 | 110.1±0.5 | 109.7±0.5 | 110.2±1.0 |
Halfcheetah-expert-v2 | 91.8±1.5 | 100.2±1.8 | -0.8±1.8 | 96.7±1.1 | 96.3±1.3 | 98.4±3.2 |
Hopper-expert-v2 | 107.7±0.7 | 108.3±3.5 | 0.7±0.0 | 107.8±7 | 109.5±14.3 | 111.7±8.3 |
Walker2d-expert-v2 | 106.7±0.2 | 106.1±6.0 | 0.7±0.3 | 110.2±0.3 | 108.5±0.5 | 109.7±1.0 |
Total average | 63.3 | 56.2 | 7.9 | 82.7 | 83.8 | 87.0 |
Table 1 Performance of BCQL and prior methods on MuJoCo tasks from D4RL, on the normalized return metric (the highest means are bolded)
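The normalized return reported in Table 1 follows the usual D4RL convention: 0 corresponds to a random policy and 100 to an expert policy on each task. A small sketch of that computation is given below; the reference returns in the example are placeholders, while in practice D4RL exposes the real values through each environment's `get_normalized_score` method.

```python
def normalized_score(episode_return: float, random_return: float, expert_return: float) -> float:
    """D4RL-style normalization: 0 = random policy, 100 = expert policy."""
    return 100.0 * (episode_return - random_return) / (expert_return - random_return)

# Hypothetical reference returns for illustration only; in practice use
# env.get_normalized_score(episode_return) on a D4RL environment.
print(normalized_score(episode_return=3200.0, random_return=-280.0, expert_return=4000.0))
```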
βG | KL Divergence | βG | KL Divergence |
---|---|---|---|
0.2 | 0.08 | 1.0 | 0.41 |
0.4 | 0.21 | 1.5 | 0.57 |
0.5 | 0.34 | 2.0 | 0.76 |
Table 2 KL divergence between the generated states and the original states under different βG values in the walker2d-medium environment
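The exact KL estimator behind Table 2 is not reproduced on this page; one simple way to obtain such a number is to fit a diagonal Gaussian to each set of states and evaluate the closed-form KL divergence, as in the sketch below (the Gaussian assumption and the toy data are ours).

```python
import numpy as np

def gaussian_kl(states_p: np.ndarray, states_q: np.ndarray, eps: float = 1e-6) -> float:
    """KL(P || Q) between diagonal Gaussians fitted to two sets of states (rows = samples)."""
    mu_p, var_p = states_p.mean(axis=0), states_p.var(axis=0) + eps
    mu_q, var_q = states_q.mean(axis=0), states_q.var(axis=0) + eps
    kl_per_dim = 0.5 * (np.log(var_q / var_p) + (var_p + (mu_p - mu_q) ** 2) / var_q - 1.0)
    return float(kl_per_dim.sum())

# Toy usage: generated states slightly shifted from the original ones.
rng = np.random.default_rng(0)
original = rng.normal(0.0, 1.0, size=(1000, 17))
generated = rng.normal(0.2, 1.1, size=(1000, 17))
print(gaussian_kl(generated, original))
```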
Task Name | ||||
---|---|---|---|---|
Halfcheetah-medium-v2 | 47.1±0.2 | 46.8±1.3 | 47.1±0.7 | 46.1±2.2 |
Hopper-medium-v2 | 64.9±4.1 | 64.8±7.6 | 64.3±4.2 | 66.1±7.2 |
Walker2d-medium-v2 | 80.4±3.5 | 84.6±2.5 | 81.8±2.1 | 79.3±4.5 |
Halfcheetah-medium-replay-v2 | 45.2±0.6 | 45.0±1.0 | 44.9±0.5 | 46.1±1.5 |
Hopper-medium-replay-v2 | 87.7±14.4 | 90.2±10.5 | 93.9±11.8 | 83.5±14.4 |
Walker2d-medium-replay-v2 | 79.3±4.9 | 82.5±7.3 | 80.7±5.5 | 77.7±8.9 |
Halfcheetah-medium-expert-v2 | 96±0.8 | 95.2±0.4 | 97.5±3.2 | 96.9±1.1 |
Hopper-medium-expert-v2 | 93.9±14.3 | 94.0±8.7 | 95.9±13.3 | 94.8±11.3 |
Walker2d-medium-expert-v2 | 109.7±0.5 | 110.2±1.0 | 110.1±0.5 | 109.3±0.3 |
Halfcheetah-expert-v2 | 96.3±1.3 | 98.4±3.2 | 97.4±2.3 | 98.2±1.3 |
Hopper-expert-v2 | 106.5±14.3 | 109.7±7.9 | 111.7±8.3 | 111.2±10.2 |
Walker2d-expert-v2 | 108.5±0.5 | 108.7±0.3 | 109.7±1.0 | 109.5±0.5 |
Table 3 Performance of BCQL with different λ, on the normalized return metric (the highest means are bolded)
[1] CHEN D, CHEN K A, LI Z J, et al. PowerNet: multi-agent deep reinforcement learning for scalable powergrid control [J]. IEEE transactions on power systems, 2022, 37(2): 1007–1017. DOI: 10.1109/TPWRS.2021.3100898
[2] CHEN X C, YAO L N, MCAULEY J, et al. A survey of deep reinforcement learning in recommender systems: a systematic review and future directions [EB/OL]. (2021-09-08)[2023-04-12].
[3] OLIFF H, LIU Y, KUMAR M, et al. Reinforcement learning for facilitating human-robot-interaction in manufacturing [J]. Journal of manufacturing systems, 2020, 56: 326–340. DOI: 10.1016/j.jmsy.2020.06.018
[4] YU C, LIU J M, NEMATI S, et al. Reinforcement learning in healthcare: a survey [J]. ACM computing surveys, 2023, 55(1): 1–36. DOI: 10.1145/3477600
[5] GRIGORESCU S, TRASNEA B, COCIAS T, et al. A survey of deep learning techniques for autonomous driving [J]. Journal of field robotics, 2020, 37(3): 362–386. DOI: 10.1002/rob.21918
[6] FUJIMOTO S, MEGER D, PRECUP D. Off-policy deep reinforcement learning without exploration [C]//International Conference on Machine Learning. PMLR, 2019: 2052–2062. DOI: 10.48550/arXiv.1812.02900
[7] LEVINE S, KUMAR A, TUCKER G, et al. Offline reinforcement learning: tutorial, review, and perspectives on open problems [EB/OL]. (2020-05-04)[2023-04-12].
[8] KUMAR A, ZHOU A, TUCKER G, et al. Conservative Q-learning for offline reinforcement learning [EB/OL]. (2020-06-08)[2022-11-08].
[9] COUPRIE C, FARABET C, NAJMAN L, et al. Indoor semantic segmentation using depth information [EB/OL]. (2013-01-16)[2023-04-02].
[10] MCCRACKEN M W. Robust out-of-sample inference [J]. Journal of econometrics, 2000, 99(2): 195–223. DOI: 10.1016/S0304-4076(00)00022-1
[11] YU T H, THOMAS G, YU L T, et al. MOPO: model-based offline policy optimization [EB/OL]. (2020-05-27)[2022-10-08].
[12] GUO K Y, SHAO Y F, GENG Y H. Model-based offline reinforcement learning with pessimism-modulated dynamics belief [EB/OL]. (2022-10-13)[2023-04-02].
[13] WU C Y, MANMATHA R, SMOLA A J, et al. Sampling matters in deep embedding learning [C]//IEEE International Conference on Computer Vision (ICCV). IEEE, 2017: 2859–2867. DOI: 10.1109/ICCV.2017.309
[14] SHRIVASTAVA A, GUPTA A, GIRSHICK R. Training region-based object detectors with online hard example mining [C]//IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2016: 761–769. DOI: 10.1109/CVPR.2016.89
[15] ROBINSON J, CHUANG C Y, SRA S, et al. Contrastive learning with hard negative samples [EB/OL]. (2020-10-09)[2023-03-23].
[16] FUJIMOTO S, MEGER D, PRECUP D. Off-policy deep reinforcement learning without exploration [EB/OL]. (2018-12-07)[2022-10-02].
[17] KUMAR A, FU J, TUCKER G, et al. Stabilizing off-policy Q-learning via bootstrapping error reduction [C]//33rd International Conference on Neural Information Processing Systems. Curran Associates Inc., 2019: 11784–11794. DOI: 10.48550/arXiv.1906.00949
[18] WU Y F, TUCKER G, NACHUM O. Behavior regularized offline reinforcement learning [EB/OL]. (2019-11-26)[2022-09-11].
[19] AGARWAL R, SCHUURMANS D, NOROUZI M. An optimistic perspective on offline reinforcement learning [EB/OL]. (2019-07-10)[2022-10-11].
[20] LYU J F, MA X T, LI X, et al. Mildly conservative Q-learning for offline reinforcement learning [EB/OL]. (2022-07-09)[2023-04-12].
[21] YANG J K, ZHOU K Y, LI Y X, et al. Generalized out-of-distribution detection: a survey [EB/OL]. (2021-10-21)[2022-03-01].
[22] GOODFELLOW I, POUGET-ABADIE J, MIRZA M, et al. Generative adversarial networks [J]. Communications of the ACM, 2020, 63(11): 139–144. DOI: 10.1145/3422622
[23] KINGMA D P, WELLING M. Auto-encoding variational Bayes [EB/OL]. (2013-12-20)[2023-02-14].
[24] PIDHORSKYI S, ALMOHSEN R, ADJEROH D A, et al. Generative probabilistic novelty detection with adversarial autoencoders [EB/OL]. (2018-07-06)[2023-04-10].
[25] TIAN K, ZHOU S G, FAN J P, et al. Learning competitive and discriminative reconstructions for anomaly detection [C]//33rd AAAI Conference on Artificial Intelligence. ACM, 2019: 5167–5174. DOI: 10.1609/aaai.v33i01.33015167
[26] DEECKE L, VANDERMEULEN R, RUFF L, et al. Image anomaly detection with generative adversarial networks [C]//European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases. ECML PKDD, 2018. DOI: 10.1007/978-3-030-10925-7_1
[27] ZHOU K, XIAO Y T, YANG J L, et al. Encoding structure-texture relation with P-Net for anomaly detection in retinal images [M]//Computer Vision – ECCV 2020. Cham, Switzerland: Springer International Publishing, 2020
[28] WEI H, YE D, LIU Z, et al. Boosting offline reinforcement learning with residual generative modeling [C]//Thirtieth International Joint Conference on Artificial Intelligence. IJCAI, 2021: 3574–3580. DOI: 10.24963/ijcai.2021/492
[29] WANG Z, HUNT J J, ZHOU M. Diffusion policies as an expressive policy class for offline reinforcement learning [EB/OL]. (2022-08-12)[2023-05-15].
[30] CHEN H, LU C, YING C, et al. Offline reinforcement learning via high-fidelity generative behavior modeling [EB/OL]. (2022-09-29)[2023-05-10].
[31] MNIH V, KAVUKCUOGLU K, SILVER D, et al. Playing Atari with deep reinforcement learning [EB/OL]. (2013-12-19)[2022-09-03].
[32] SZITA I, LÖRINCZ A. Learning Tetris using the noisy cross-entropy method [J]. Neural computation, 2006, 18(12): 2936–2941. DOI: 10.1162/neco.2006.18.12.2936
[33] SOHN K, YAN X, LEE H, et al. Learning structured output representation using deep conditional generative models [C]//28th International Conference on Neural Information Processing Systems. NIPS, 2015: 3483–3491
[34] FU J, KUMAR A, NACHUM O, et al. D4RL: datasets for deep data-driven reinforcement learning [EB/OL]. (2020-04-15)[2022-08-15].
[35] HAARNOJA T, ZHOU A, ABBEEL P, et al. Soft actor-critic: off-policy maximum entropy deep reinforcement learning with a stochastic actor [C]//International Conference on Machine Learning. Association for Computing Machinery, 2018. DOI: 10.48550/arXiv.1801.01290
[36] FUJIMOTO S, GU S S. A minimalist approach to offline reinforcement learning [EB/OL]. (2021-06-12)[2022-09-13].
[37] TARASOV D, NIKULIN A, AKIMOV D, et al. CORL: research-oriented deep offline reinforcement learning library [C]//3rd Offline Reinforcement Learning Workshop: Offline RL as a "Launchpad". NeurIPS, 2022. DOI: 10.48550/arXiv.2210.07105