ZTE Communications ›› 2023, Vol. 21 ›› Issue (3): 29-36. DOI: 10.12142/ZTECOM.202303005
• Special Topic •
Boundary Data Augmentation for Offline Reinforcement Learning
SHEN Jiahao1,2, JIANG Ke1,2, TAN Xiaoyang1,2
Received: 2023-07-05
Online: 2023-09-21
Published: 2023-03-22
About the authors: SHEN Jiahao is currently a postgraduate student in the College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, China. His research interests include reinforcement learning and generative models. JIANG Ke is currently a PhD student in the College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, China. His research interest is reinforcement learning. TAN Xiaoyang (…)
Cite this article: SHEN Jiahao, JIANG Ke, TAN Xiaoyang. Boundary Data Augmentation for Offline Reinforcement Learning [J]. ZTE Communications, 2023, 21(3): 29-36.
URL: https://zte.magtechjournal.com/EN/10.12142/ZTECOM.202303005
Figure 1 Overall architecture of the proposed Boundary Conservative Q-Learning (Boundary-CQL, BCQL) method, where the left column illustrates the pipeline for generating boundary OOD data with an adversarial generative model and the right column performs offline RL with the generated data
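As a reading aid, the sketch below mirrors the two-stage pipeline of Figure 1 at a toy scale: a simple perturbation routine stands in for the adversarial model that proposes boundary OOD states, and a conservative-style update keeps Q-values high on dataset states while pushing them down on the generated ones. The function names, the linear Q-function, and the weight `lam` are illustrative assumptions, not the authors' implementation; actions and Bellman backup terms are omitted for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy offline dataset: states drawn from a narrow behavior distribution.
dataset_states = rng.normal(loc=0.0, scale=1.0, size=(1024, 3))

def generate_boundary_states(states, step=0.5):
    """Stand-in for the adversarial generator: push dataset states outward
    along random unit directions toward the edge of the data support."""
    directions = rng.normal(size=states.shape)
    directions /= np.linalg.norm(directions, axis=1, keepdims=True)
    return states + step * directions

def q_values(states, theta):
    """Linear toy Q-function over states only (actions omitted)."""
    return states @ theta

theta = rng.normal(size=3)
lam = 1.0  # assumed weight on the penalty over generated boundary states

for _ in range(200):
    batch = dataset_states[rng.choice(len(dataset_states), size=64, replace=False)]
    boundary = generate_boundary_states(batch)
    # Conservative-style objective: minimize  -mean Q(data) + lam * mean Q(boundary)
    grad = -batch.mean(axis=0) + lam * boundary.mean(axis=0)
    theta -= 0.01 * grad

print("toy Q gap (data minus boundary):",
      q_values(batch, theta).mean() - q_values(boundary, theta).mean())
```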
| Task Name | BC | BEAR | SAC | TD3+BC | CQL | BCQL |
|---|---|---|---|---|---|---|
| Halfcheetah-medium-v2 | 42.4±0.2 | 37.1±2.3 | 55.2±27.8 | 48.3±0.3 | 47.1±0.2 | 47.1±0.7 |
| Hopper-medium-v2 | 53.5±2.0 | 30.8±0.9 | 0.8±0.0 | 59.3±4.2 | 64.9±4.1 | 66.1±5.2 |
| Walker2d-medium-v2 | 63.2±18.8 | 56±8.5 | -0.3±0.2 | 83.7±2.1 | 80.4±3.5 | 84.6±2.5 |
| Halfcheetah-medium-replay-v2 | 35.7±2.7 | 36.2±5.6 | 0.8±1.0 | 44.6±0.5 | 45.2±0.6 | 46.1±1.5 |
| Hopper-medium-replay-v2 | 29.8±2.4 | 31.1±7.2 | 7.4±0.5 | 60.9±18.8 | 87.7±14.4 | 93.9±11.8 |
| Walker2d-medium-replay-v2 | 21.8±11.7 | 13.6±2.1 | -0.4±0.3 | 81.8±5.5 | 79.3±4.9 | 82.5±7.3 |
| Halfcheetah-medium-expert-v2 | 56.0±8.5 | 44.2±13.8 | 28.4±19.4 | 90.7±4.3 | 96±0.8 | 97.5±3.2 |
| Hopper-medium-expert-v2 | 52.3±4.6 | 67.3±32.5 | 0.7±0.0 | 98.0±9.4 | 93.9±14.3 | 95.9±13.3 |
| Walker2d-medium-expert-v2 | 99.0±18.5 | 43.8±6.0 | 1.9±3.9 | 110.1±0.5 | 109.7±0.5 | 110.2±1.0 |
| Halfcheetah-expert-v2 | 91.8±1.5 | 100.2±1.8 | -0.8±1.8 | 96.7±1.1 | 96.3±1.3 | 98.4±3.2 |
| Hopper-expert-v2 | 107.7±0.7 | 108.3±3.5 | 0.7±0.0 | 107.8±7 | 109.5±14.3 | 111.7±8.3 |
| Walker2d-expert-v2 | 106.7±0.2 | 106.1±6.0 | 0.7±0.3 | 110.2±0.3 | 108.5±0.5 | 109.7±1.0 |
| Total average | 63.3 | 56.2 | 7.9 | 82.7 | 83.8 | 87.0 |
Table 1 Performance of BCQL and prior methods on MuJoCo tasks from D4RL, on the normalized return metric (the highest means are bolded)
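For context, the normalized return metric used throughout these tables follows the standard D4RL convention: a raw episode return is rescaled linearly so that a random policy scores about 0 and an expert policy about 100. A minimal helper is sketched below; the reference returns in the example are illustrative placeholders rather than the official D4RL constants.

```python
def d4rl_normalized_score(episode_return, random_return, expert_return):
    """Linearly rescale a raw return to the 0-100 D4RL normalized scale."""
    return 100.0 * (episode_return - random_return) / (expert_return - random_return)

# Example with illustrative reference returns (not the official D4RL constants):
print(d4rl_normalized_score(episode_return=3000.0,
                            random_return=-280.0,
                            expert_return=12135.0))  # ~26.4
```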
| βG | KL Divergence | βG | KL Divergence |
|---|---|---|---|
| 0.2 | 0.08 | 1.0 | 0.41 |
| 0.4 | 0.21 | 1.5 | 0.57 |
| 0.5 | 0.34 | 2.0 | 0.76 |
Table 2 KL divergence between the generated states and the original states under different βG in the walker2d-medium environment
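This excerpt does not specify the estimator behind the numbers in Table 2. One common and simple choice, sketched below purely as an assumption, is to fit a diagonal Gaussian to the generated states and to the original dataset states and evaluate the closed-form Gaussian KL divergence between the two fits.

```python
import numpy as np

def diag_gaussian_kl(states_p, states_q, eps=1e-6):
    """KL( N(mu_p, diag(var_p)) || N(mu_q, diag(var_q)) ), with the moments
    estimated from two sets of states."""
    mu_p, var_p = states_p.mean(axis=0), states_p.var(axis=0) + eps
    mu_q, var_q = states_q.mean(axis=0), states_q.var(axis=0) + eps
    return 0.5 * np.sum(np.log(var_q / var_p)
                        + (var_p + (mu_p - mu_q) ** 2) / var_q
                        - 1.0)

rng = np.random.default_rng(0)
original_states = rng.normal(0.0, 1.0, size=(5000, 17))   # walker2d-like state dimension
generated_states = rng.normal(0.3, 1.1, size=(5000, 17))  # slightly shifted, as boundary data would be
print(diag_gaussian_kl(generated_states, original_states))
```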
| Task Name | ||||
|---|---|---|---|---|
| Halfcheetah-medium-v2 | 47.1±0.2 | 46.8±1.3 | 47.1±0.7 | 46.1±2.2 |
| Hopper-medium-v2 | 64.9±4.1 | 64.8±7.6 | 64.3±4.2 | 66.1±7.2 |
| Walker2d-medium-v2 | 80.4±3.5 | 84.6±2.5 | 81.8±2.1 | 79.3±4.5 |
| Halfcheetah-medium-replay-v2 | 45.2±0.6 | 45.0±1.0 | 44.9±0.5 | 46.1±1.5 |
| Hopper-medium-replay-v2 | 87.7±14.4 | 90.2±10.5 | 93.9±11.8 | 83.5±14.4 |
| Walker2d-medium-replay-v2 | 79.3±4.9 | 82.5±7.3 | 80.7±5.5 | 77.7±8.9 |
| Halfcheetah-medium-expert-v2 | 96±0.8 | 95.2±0.4 | 97.5±3.2 | 96.9±1.1 |
| Hopper-medium-expert-v2 | 93.9±14.3 | 94.0±8.7 | 95.9±13.3 | 94.8±11.3 |
| Walker2d-medium-expert-v2 | 109.7±0.5 | 110.2±1.0 | 110.1±0.5 | 109.3±0.3 |
| Halfcheetah-expert-v2 | 96.3±1.3 | 98.4±3.2 | 97.4±2.3 | 98.2±1.3 |
| Hopper-expert-v2 | 106.5±14.3 | 109.7±7.9 | 111.7±8.3 | 111.2±10.2 |
| Walker2d-expert-v2 | 108.5±0.5 | 108.7±0.3 | 109.7±1.0 | 109.5±0.5 |
Table 3 Performance of BCQL with different λ, on the normalized return metric (the highest means are bolded)