Reinforcement Learning from Algorithm Model to Industry Innovation: A Foundation Stone of Future Artificial Intelligence

doi:10.12142/ZTECOM.201903006

ZTE Communications ›› 2019, Vol. 17 ›› Issue (3): 31-41.DOI: 10.12142/ZTECOM.201903006

• Review • Previous Articles Next Articles

Reinforcement Learning from Algorithm Model to Industry Innovation: A Foundation Stone of Future Artificial Intelligence

DONG Shaokang, CHEN Jiarui, LIU Yong, BAO Tianyi, GAO Yang

State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210008, China

Received:2019-07-10 Online:2019-09-29 Published:2019-12-06
About author:DONG Shaokang (shaokangdong@gmail.com) obtained his B.S. degree from the Advanced Class of Huazhong University of Science and Technology, China in 2018. He is currently a Ph.D. student in the Department of Computer Science and Technology, Nanjing University, China. His research interests include machine learning, reinforcement learning, and multi-armed bandits.|CHEN Jiarui obtained his B.S. degree from Dongbei University of Finance and Economics, China in 2018. He is currently a master student in the Department of Computer Science and Technology, Nanjing University, China. His research interests include machine learning, multi-agent reinforcement learning, and game.|LIU Yong received a B.S degree in communication engineering from China Agricultural University, China in 2017. He is currently a master student in the Department of Computer Science and Technology, Nanjing University, China. His current research interests include reinforcement learning, multi-agent learning, and transfer learning.|BAO Tianyi is an undergraduate student currently studying in the University of Michigan, USA. She studies computer science and psychology and will receive her B.S. degree in 2020. Her current research interests include the machine learning and human-computer interaction.|GAO Yang received the Ph.D. degree in computer software and theory from the Department of Computer Science and Technology, Nanjing University, China in 2000. He is a professor with the Department of Computer Science and Technology, Nanjing University. His current research interests include artificial intelligence and machine learning. He has published over 100 papers in top international conferences and journals.

Abstract

Abstract:

Reinforcement learning (RL) algorithm has been introduced for several decades, which becomes a paradigm in sequential decision-making and control. The development of reinforcement learning, especially in recent years, has enabled this algorithm to be applied in many industry fields, such as robotics, medical intelligence, and games. This paper first introduces the history and background of reinforcement learning, and then illustrates the industrial application and open source platforms. After that, the successful applications from AlphaGo to AlphaZero and future reinforcement learning technique are focused on. Finally, the artificial intelligence for complex interaction (e.g., stochastic environment, multiple players, selfish behavior, and distributes optimization) is considered and this paper concludes with the highlight and outlook of future general artificial intelligence.

Key words: artificial intelligence, decision-making and control problems, reinforcement learning

DONG Shaokang, CHEN Jiarui, LIU Yong, BAO Tianyi, GAO Yang. Reinforcement Learning from Algorithm Model to Industry Innovation: A Foundation Stone of Future Artificial Intelligence[J]. ZTE Communications, 2019, 17(3): 31-41.

Figures/Tables 9

Figure 1. The development trajectory of reinforcement learning technique.

Figure 2. The interaction between the agent and environment.

Figure 3. The classification of different reinforcement learning methods.

Table 1 Comparisons between reinforcement learning (RL) methods

Category	DP	On-policy MC	Off-policy MC	SARSA	Q-learning
Model-based	√
Model-free		√	√	√	√
On-policy		√		√
Off-policy			√		√
Bootstrapping	√			√	√

Figure 4. Neural network of the deep Q-learning [12].

Figure 5. The Gym games (from left to right and top to bottom: CarRacing, Mountainair, Ant, RoboschoolHumanoidFlagrunHarder, FetchPickAndPlace and MontezumaRevenge).

Figure 6. The different networks of AlphaGo [19].

Figure 7. The process of MCTS [19].

Figure 8. The network of AlphaZero [28].

References 40

[1]	THORNDIKE E L . Animal Intelligence: An Experimental Study of the Associate Processes in Animals[J]. American Psychologist, 1998,53(10):1125. DOI: 10.1037/0003-066X.53.10.1125
[2]	MINSKY M L. Theory of Neural-Analog Reinforcement Systems and Its Application to the Brain Model Problem [M]. New Jersey, America: Princeton University Press, 1954
[3]	BELLMAN R . On the Theory of Dynamic Programming[J]. Proceedings of the National Academy of Sciences of the United States of America, 1952,38(8):716. DOI: 10.1073/pnas.38.8.716
[4]	BELLMAN R . A Markovian Decision Process[J]. Journal of Mathematics and Mechanics, 1957: 679-684. DOI: 10.2307/2343663
[5]	WERBOS P . Advanced Forecasting Methods for Global Crisis Warning and Models of Intelligence[J]. General System Yearbook, 1977,22:25-38
[6]	SUTTON R . Learning to Predict by the Methods of Temporal Differences[J]. Machine Learning, 1988,3(1):9-44. DOI: 10.1007/BF00115009
[7]	WATKINS C J C H, DAYAN P . Q-Learning[J]. Machine Learning, 1992,8:279-292
[8]	RUMMERY G A, NIRANJAN M . On-Line Q-Learning Using Connectionist Systems[M]. Cambridge, England: the University of Cambridge, 1994
[9]	THRUN S . Monte Carlo POMDPS [C]//Advances in Neural Information Processing Systems. Cambridge, USA, 2000
[10]	SUTTON R S, MCALLESTER D A, SINGH S P , et al. Policy Gradient Methods for Reinforcement Learning with Function Approximation [C]//Advances in neural information processing systems. Cambridge, USA, 2000
[11]	SILVER D, LEVER G, HEESS N , et al. Deterministic Policy Gradient Algorithms [C]//31th International Conference on Machine Learning. Beijing, China, 2014
[12]	MNIH V, KAVUKCUOGLU K, SILVER D , et al. Human-Level Control Through Deep Reinforcement Learning[J]. Nature, 2015,518(7540):529. DOI: 10.1038/nature14236
[13]	VAN HASSELT H, GUEZ A, SILVER D . Deep Reinforcement Learning with Double Q-Learning [C]//13th AAAI Conference on Artificial Intelligence, Phoenix, USA, 2016
[14]	SCHAUL T, QUAN J, ANTONOGLOU I , et al. Prioritized Experience Replay [DB/OL]. ( 2015-11-18).
[15]	WANG Z, SCHAUL T, HESSEL M , et al. Dueling Network Architectures for Deep Reinforcement Learning [C]//International Conference on Machine Learning. New York, USA, 2016. DOI: 10.1155/2018/2129393
[16]	LILLICRAP T P, HUNT J J, PRITZEL A , et al. Continuous Control with Deep Reinforcement Learning [DB/OL]. (2015-09-09).
[17]	NAIR A, SRINIVASAN P, BLACKWELL S , et al. Massively Parallel Methods for Deep Reinforcement Learning [DB/OL]. (2015-07-15).
[18]	SCHULMAN J, LEVINE S, ABBEEL P , et al. Trust Region Policy Optimization [C]//31th International Conference on Machine Learning. Lille, France, 2015
[19]	SILVER D, HUANG A, MADDISON C J , et al. Mastering the Game of Go with Deep Neural Networks and Tree Search[J]. Nature, 2016,529(7587):484. DOI: 10.13140/RG.2.2.18893.74727
[20]	DUAN Y, CHEN X, HOUTHOOFT R , et al. Benchmarking Deep Reinforcement Learning for Continuous Control [C]//33rd International Conference on Machine Learning, New York, USA, 2016
[21]	BEATTIE C, LEIBO J Z, TEPLYASHIN D , et al. Deepmind Lab [DB/OL]. (2016-12-12).
[22]	WU B . Hierarchical Macro Strategy Model for Moba Game AI [C]//AAAI Conference on Artificial Intelligence. Honolulu, USA, 2019. DOI: 10.1609/aaai.v33i01.33011206
[23]	SCHULMAN J, WOLSKI F, DHARIWAL P , et al. Proximal Policy Optimization Algorithms [DB/OL]. ( 2017-07-20).
[24]	HEESS N, SRIRAM S, LEMMON J , et al. Emergence of Locomotion Behaviours in Rich Environments [DB/OL]. ( 2017-07-07).
[25]	CLARK K . Neural Coreference Resolution [EB/OL].
[26]	JI Y, TAN C, MARTSCHAT S , et al. Dynamic Entity Representations in Neural Language Models [C]//2017 Conference on Empirical Methods in Natural Language Processing. Copenhagen, Denmark, 2017. DOI: 10.18653/v1/D17-1195
[27]	BELLEMARE M G, NADDAF Y, VENESS J , et al. The Arcade Learning Environment: An Evaluation Platform for General Agents[J]. Journal of Artificial Intelligence Research, 2013,47:253-279. DOI: 10.1613/jair.3912
[28]	SILVER D, SCHRITTWIESER J, SIMONYAN K , et al. Mastering the Game of Go Without Human Knowledge[J]. Nature, 2017,550(7676):354. DOI: 10.1038/nature24270
[29]	FERNáNDEZ F, VELOSO M . Probabilistic Policy Reuse in a Reinforcement Learning Agent [C]//5th International Joint Conference on Autonomous Agents and Multiagent Systems. Hakodate, Japan, 2006. DOI: 10.1145/1160633.1160762
[30]	MCGOVERN A, BARTO A G . Automatic Discovery of Subgoals in Reinforcement Learning using Diverse Density [C]//8th International Conference on Machine Learning. Williams College, USA, 2001.
[31]	DIETTERICH T G . Hierarchical Reinforcement Learning with the MAXQ Value Function Decomposition[J]. Journal of Artificial Intelligence Research, 2000,13:227-303.
[32]	MAHADEVAN S . Proto-Value Functions: Developmental Reinforcement Learning [C]//22nd International Conference on Machine Learning. Bonn, Germany, 2005: 553-560. DOI: 10.1145/1102351.1102421
[33]	MADDEN M G, HOWLEY T . Transfer of Experience Between Reinforcement Learning Environments with Progressive Difficulty[J]. Artificial Intelligence Review, 2004,21(3-4):375-398. DOI: 10.1023/B:AIRE.0000036264.95672.64
[34]	DRIESSENS K, RAMON J, CROONENBORGHS T . Transfer Learning for Reinforcement Learning Through Goal and Policy Parametrization [C]//ICML Workshop on Structural Knowledge Transfer for Machine Learning. Pittsburgh, USA, 2006
[35]	LITTMAN M L . Markov Games as a Framework for Multi-Agent Reinforcement Learning[M]. Machine Learning Proceedings 1994. Burlington, USA: Morgan Kaufmann, 1994: 157-163. DOI: 10.1016/B978-1-55860-335-6.50027-1
[36]	GREENWALD A, HALL K, SERRANO R . Correlated Q-Learning [C]//20th International Conference on Machine Learning. Washington D. C., USA, 2003
[37]	HU J and WELLMAN M P . Multiagent Reinforcement Learning: Theoretical Framework and an Algorithm [C]//15th International Conference on Machine Learning, Madison, USA, 1998
[38]	SUNEHAG P, LEVER G, GRUSLYS A , et al. Value-Decomposition Networks for Cooperative Multi-Agent Learning [DB/OL]. ( 2017-06-16).
[39]	RASHID T, SAMVELYAN M, WITT C S , et al. QMIX: Monotonic Value Function Factorisation for Deep Multi-Agent Reinforcement Learning [C]//35th International Conference on Machine Learning. Stockholm, Sweden, 2018
[40]	LOWE R, WU Y, TAMAR A , et al. Multi-Agent Actor-Critic for Mixed Cooperative-Competitive Environments [C]//Conference on Neural Information Processing Systems (NIPS). Long Beach, USA, 2017: 6379-6390.

Reinforcement Learning from Algorithm Model to Industry Innovation: A Foundation Stone of Future Artificial Intelligence

RichHTML

PDF

Knowledge

Abstract

Cite this article

share this article

Figures/Tables 9

References 40

Related Articles 14

Recommended Articles 0

Metrics

[1]	WEI Zhiqing, ZHANG Yongji, JI Danna, LI Chenfei. Sensing and Communication Integrated Fast Neighbor Discovery for UAV Networks [J]. ZTE Communications, 2024, 22(3): 69-82.
[2]	REN Min, XU Renyu, ZHU Ting. Double Deep Q-Network Decoder Based on EEG Brain-Computer Interface [J]. ZTE Communications, 2023, 21(3): 3-10.
[3]	FENG Bingyi, FENG Mingxiao, WANG Minrui, ZHOU Wengang, LI Houqiang. Multi-Agent Hierarchical Graph Attention Reinforcement Learning for Grid-Aware Energy Management [J]. ZTE Communications, 2023, 21(3): 11-21.
[4]	YU Junpeng, CHEN Yiyu. A Practical Reinforcement Learning Framework for Automatic Radar Detection [J]. ZTE Communications, 2023, 21(3): 22-28.
[5]	SHEN Jiahao, JIANG Ke, TAN Xiaoyang. Boundary Data Augmentation for Offline Reinforcement Learning [J]. ZTE Communications, 2023, 21(3): 29-36.
[6]	YANG Bei, LIANG Xin, LIU Shengnan, JIANG Zheng, ZHU Jianchi, SHE Xiaoming. Intelligent 6G Wireless Network with Multi-Dimensional Information Perception [J]. ZTE Communications, 2023, 21(2): 3-10.
[7]	YOU Qian, XU Qian, YANG Xin, ZHANG Tao, CHEN Ming. RIS-Assisted UAV-D2D Communications Exploiting Deep Reinforcement Learning [J]. ZTE Communications, 2023, 21(2): 61-69.
[8]	JIA Haonan, HE Zhenqing, TAN Wanlong, RUI Hua, LIN Wei. Distributed Multi-Cell Multi-User MISO Downlink Beamforming via Deep Reinforcement Learning [J]. ZTE Communications, 2022, 20(4): 69-77.
[9]	LIU Weichen, SHEN Mengqi, ZHANG Anda, CHENG Yiting, ZHANG Wenqiang. Artificial Intelligence Rehabilitation Evaluation and Training System for Degeneration of Joint Disease [J]. ZTE Communications, 2021, 19(3): 46-55.
[10]	JI Hong, ZHANG Tianxiang, ZHANG Kai, WANG Wanyuan, WU Weiwei. Efficient Network Slicing with Dynamic Resource Allocation [J]. ZTE Communications, 2021, 19(1): 11-19.
[11]	Stephen ANOKYE, Mohammed SEID, SUN Guolin. A Survey on Machine Learning Based Proactive Caching [J]. ZTE Communications, 2019, 17(4): 46-55.
[12]	REN Fuji and SUN Xiao. Current Situation and Development of Intelligence Robots [J]. ZTE Communications, 2016, 14(S1): 25-34.
[13]	Fuji Ren, Yu Gu. Using Artificial Intelligence in the Internet of Things [J]. ZTE Communications, 2015, 13(2): 1-2.
[14]	Yixin Zhong. I²oT: Advanced Direction of the Internet of Things [J]. ZTE Communications, 2015, 13(2): 3-6.