ZTE Communications ›› 2019, Vol. 17 ›› Issue (3): 31-41. DOI: 10.12142/ZTECOM.201903006
DONG Shaokang, CHEN Jiarui, LIU Yong, BAO Tianyi, GAO Yang
Received:
2019-07-10
Online:
2019-09-29
Published:
2019-12-06
About the authors:
DONG Shaokang (shaokangdong@gmail.com) received his B.S. degree from the Advanced Class of Huazhong University of Science and Technology, China in 2018. He is currently a Ph.D. student in the Department of Computer Science and Technology, Nanjing University, China. His research interests include machine learning, reinforcement learning, and multi-armed bandits.

CHEN Jiarui received his B.S. degree from Dongbei University of Finance and Economics, China in 2018. He is currently a master's student in the Department of Computer Science and Technology, Nanjing University, China. His research interests include machine learning, multi-agent reinforcement learning, and games.

LIU Yong received his B.S. degree in communication engineering from China Agricultural University, China in 2017. He is currently a master's student in the Department of Computer Science and Technology, Nanjing University, China. His current research interests include reinforcement learning, multi-agent learning, and transfer learning.

BAO Tianyi is an undergraduate student at the University of Michigan, USA. She studies computer science and psychology and will receive her B.S. degree in 2020. Her current research interests include machine learning and human-computer interaction.

GAO Yang received his Ph.D. degree in computer software and theory from the Department of Computer Science and Technology, Nanjing University, China in 2000. He is a professor in the Department of Computer Science and Technology, Nanjing University. His current research interests include artificial intelligence and machine learning. He has published over 100 papers in top international conferences and journals.
DONG Shaokang, CHEN Jiarui, LIU Yong, BAO Tianyi, GAO Yang. Reinforcement Learning from Algorithm Model to Industry Innovation: A Foundation Stone of Future Artificial Intelligence[J]. ZTE Communications, 2019, 17(3): 31-41.
Table 1. Comparisons between reinforcement learning (RL) methods (DP: dynamic programming; MC: Monte Carlo)

| Category | DP | On-policy MC | Off-policy MC | SARSA | Q-learning |
|---|---|---|---|---|---|
| Model-based | √ | | | | |
| Model-free | | √ | √ | √ | √ |
| On-policy | | √ | | √ | |
| Off-policy | | | √ | | √ |
| Bootstrapping | √ | | | √ | √ |
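The on-policy/off-policy and bootstrapping distinctions in Table 1 come down to a one-line difference in the temporal-difference update target. The sketch below illustrates this in Python for the tabular case; the state/action counts, step size, and discount factor are illustrative assumptions rather than values from the paper.

```python
import numpy as np

# Hypothetical tabular task; sizes and hyperparameters are illustrative.
n_states, n_actions = 16, 4
alpha, gamma = 0.1, 0.99  # step size, discount factor
Q = np.zeros((n_states, n_actions))

def sarsa_update(s, a, r, s_next, a_next):
    # On-policy, bootstrapping: the target bootstraps from Q at the
    # action a_next that the behaviour policy actually takes in s_next.
    td_target = r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (td_target - Q[s, a])

def q_learning_update(s, a, r, s_next):
    # Off-policy, bootstrapping: the target bootstraps from the greedy
    # action in s_next, independent of the behaviour policy.
    td_target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])
```

A Monte Carlo method would instead replace the bootstrapped target with the complete sampled return of the episode, which is why the two MC columns carry no √ in the bootstrapping row of Table 1.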
Figure 5. The Gym games (from left to right and top to bottom: CarRacing, MountainCar, Ant, RoboschoolHumanoidFlagrunHarder, FetchPickAndPlace, and MontezumaRevenge).
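All of the environments in Figure 5 expose the same interaction loop, which is what makes Gym a convenient common benchmark. A minimal sketch follows, assuming the classic Gym API of the period (reset() returning an observation, step() returning observation, reward, done, info); the random policy is a placeholder, not a method from the paper.

```python
import gym

# Minimal episode loop shared by the Figure 5 environments.
env = gym.make("MountainCar-v0")
obs = env.reset()
done, episode_return = False, 0.0
while not done:
    action = env.action_space.sample()  # replace with a learned policy
    obs, reward, done, info = env.step(action)
    episode_return += reward
env.close()
print("episode return:", episode_return)
```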