ZTE Communications ›› 2019, Vol. 17 ›› Issue (3): 31-41. DOI: 10.12142/ZTECOM.201903006
DONG Shaokang, CHEN Jiarui, LIU Yong, BAO Tianyi, GAO Yang
Received:
2019-07-10
Online:
2019-09-29
Published:
2019-12-06
About the authors:
DONG Shaokang (shaokangdong@gmail.com) received his B.S. degree from the Advanced Class of Huazhong University of Science and Technology, China in 2018. He is currently a Ph.D. student in the Department of Computer Science and Technology, Nanjing University, China. His research interests include machine learning, reinforcement learning, and multi-armed bandits.

CHEN Jiarui received his B.S. degree from Dongbei University of Finance and Economics, China in 2018. He is currently a master's student in the Department of Computer Science and Technology, Nanjing University, China. His research interests include machine learning, multi-agent reinforcement learning, and games.

LIU Yong received his B.S. degree in communication engineering from China Agricultural University, China in 2017. He is currently a master's student in the Department of Computer Science and Technology, Nanjing University, China. His current research interests include reinforcement learning, multi-agent learning, and transfer learning.

BAO Tianyi is an undergraduate student at the University of Michigan, USA. She studies computer science and psychology and will receive her B.S. degree in 2020. Her current research interests include machine learning and human-computer interaction.

GAO Yang received his Ph.D. degree in computer software and theory from the Department of Computer Science and Technology, Nanjing University, China in 2000. He is a professor in the Department of Computer Science and Technology, Nanjing University. His current research interests include artificial intelligence and machine learning. He has published over 100 papers in top international conferences and journals.
DONG Shaokang, CHEN Jiarui, LIU Yong, BAO Tianyi, GAO Yang. Reinforcement Learning from Algorithm Model to Industry Innovation: A Foundation Stone of Future Artificial Intelligence[J]. ZTE Communications, 2019, 17(3): 31-41.
Table 1. Comparisons between reinforcement learning (RL) methods (DP: dynamic programming; MC: Monte Carlo)

| Category | DP | On-policy MC | Off-policy MC | SARSA | Q-learning |
|---|---|---|---|---|---|
| Model-based | √ | | | | |
| Model-free | | √ | √ | √ | √ |
| On-policy | | √ | | √ | |
| Off-policy | | | √ | | √ |
| Bootstrapping | √ | | | √ | √ |
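The on-policy/off-policy and bootstrapping distinctions in Table 1 come down to a one-line difference in the temporal-difference update target. The sketch below illustrates this in Python for the tabular case; the state/action counts, step size, and discount factor are illustrative assumptions rather than values from the paper.

```python
import numpy as np

# Hypothetical tabular task; sizes and hyperparameters are illustrative.
n_states, n_actions = 16, 4
alpha, gamma = 0.1, 0.99  # step size, discount factor
Q = np.zeros((n_states, n_actions))

def sarsa_update(s, a, r, s_next, a_next):
    # On-policy, bootstrapping: the target bootstraps from Q at the
    # action a_next that the behaviour policy actually takes in s_next.
    td_target = r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (td_target - Q[s, a])

def q_learning_update(s, a, r, s_next):
    # Off-policy, bootstrapping: the target bootstraps from the greedy
    # action in s_next, independent of the behaviour policy.
    td_target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])
```

A Monte Carlo method would instead replace the bootstrapped target with the complete sampled return of the episode, which is why the two MC columns carry no √ in the bootstrapping row of Table 1.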
Figure 5. The Gym games (from left to right and top to bottom: CarRacing, MountainCar, Ant, RoboschoolHumanoidFlagrunHarder, FetchPickAndPlace, and MontezumaRevenge).
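All of the environments in Figure 5 expose the same interaction loop, which is what makes Gym a convenient common benchmark. A minimal sketch follows, assuming the classic Gym API of the period (reset() returning an observation, step() returning observation, reward, done, info); the random policy is a placeholder, not a method from the paper.

```python
import gym

# Minimal episode loop shared by the Figure 5 environments.
env = gym.make("MountainCar-v0")
obs = env.reset()
done, episode_return = False, 0.0
while not done:
    action = env.action_space.sample()  # replace with a learned policy
    obs, reward, done, info = env.step(action)
    episode_return += reward
env.close()
print("episode return:", episode_return)
```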