ZTE Communications ›› 2017, Vol. 15 ›› Issue (S2): 23-29.DOI: 10.3969/j.issn.1673-5188.2017.S2.004

• Special Topic •

Multimodal Emotion Recognition with Transfer Learning of Deep Neural Network

HUANG Jian 1,2, LI Ya 1, TAO Jianhua 1,2,3, YI Jiangyan 1,2

  1. National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China
    2. School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing 100190, China
    3. CAS Center for Excellence in Brain Science and Intelligence Technology, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China
  • Received: 2017-08-15 Online: 2017-12-25 Published: 2020-04-16
  • About author:
    HUANG Jian (jian.huang@nlpr.ia.ac.cn) is a Ph.D. candidate at the National Laboratory of Pattern Recognition (NLPR), Institute of Automation, Chinese Academy of Sciences (CASIA), China. His research interests cover affective computing, deep learning, and multimodal emotion recognition.
    LI Ya (yli@nlpr.ia.ac.cn) received the B.E. degree from the University of Science and Technology of China (USTC), China in 2007, and the Ph.D. degree from the NLPR, CASIA in 2012. From November 2012 to December 2012, she was a visiting scholar at the University of Tokyo, Japan. From May 2014 to September 2014, she was a research fellow with Trinity College Dublin, Ireland. She is currently an associate professor with the NLPR, CASIA. She has won several best student paper awards at INTERSPEECH, NCMMSC, etc. Her general interests include speech recognition and synthesis, affective computing, human-computer interaction, and natural language processing.
    TAO Jianhua (jhtao@nlpr.ia.ac.cn) received his Ph.D. from Tsinghua University, China in 2001 and his M.S. from Nanjing University, China in 1996. He is currently a professor with the NLPR, CASIA. His current research interests include speech synthesis and coding methods, human-computer interaction, multimedia information processing, and pattern recognition. He has published more than eighty papers in major journals and proceedings, including IEEE Transactions on ASLP, and has received several awards from important conferences such as Eurospeech and NCMMSC. He serves as the chair or a program committee member for several major conferences, including ICPR, ACII, ICMI, ISCSLP, and NCMMSC. He also serves as a steering committee member for IEEE Transactions on Affective Computing, an associate editor for the Journal on Multimodal User Interfaces and the International Journal of Synthetic Emotions, and Deputy Editor-in-Chief of the Chinese Journal of Phonetics.
    YI Jiangyan (jiangyan.yi@nlpr.ia.ac.cn) is a Ph.D. candidate at the NLPR, CASIA, China. Her research interests cover deep learning, speech recognition, and transfer learning.
  • Supported by:
    This work is supported by the National Natural Science Foundation of China (NSFC) (61425017, 61773379), the National Key Research & Development Plan of China (2017YFB1002804), and the Major Program for the National Social Science Fund of China (13&ZD189).

Abstract:

Due to the lack of large-scale emotion databases, it is difficult for deep neural networks to achieve improvements in multimodal emotion recognition comparable to the progress deep learning has made in other areas. We use transfer learning to improve performance with models pre-trained on large-scale data. Audio is encoded with deep speech recognition networks trained on 500 hours of speech, and video is encoded with convolutional neural networks trained on over 110,000 images. The extracted audio and visual features are fed into Long Short-Term Memory (LSTM) networks to train a separate model for each modality. Logistic regression and an ensemble method are used for decision-level fusion. The experimental results indicate that 1) audio features extracted from deep speech recognition networks achieve better performance than handcrafted audio features; 2) visual emotion recognition obtains better performance than audio emotion recognition; 3) the ensemble method achieves better performance than logistic regression, and prior knowledge from the micro-F1 value further improves the performance and robustness, achieving accuracy of 67.00% for “happy”, 54.90% for “angry”, and 51.69% for “sad”.
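For illustration, the sketch below (in PyTorch, not the authors' code) shows how transferred audio and visual features might be fed into per-modality LSTM classifiers and then combined by decision-level fusion. The feature dimensions, fusion weights, and label order are assumptions made for the example; the paper's ensemble additionally uses prior knowledge from the micro-F1 value, which is omitted here.

    # Minimal sketch of the described pipeline, assuming the transferred audio and
    # visual features were already extracted by the pre-trained speech-recognition
    # and CNN encoders. Dimensions, weights, and labels below are illustrative.
    import torch
    import torch.nn as nn

    class UnimodalLSTM(nn.Module):
        # LSTM classifier over a sequence of transferred features (one per modality).
        def __init__(self, feat_dim, hidden_dim=128, num_classes=3):
            super().__init__()
            self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
            self.fc = nn.Linear(hidden_dim, num_classes)

        def forward(self, x):            # x: (batch, time, feat_dim)
            _, (h_n, _) = self.lstm(x)   # final hidden state summarizes the sequence
            return self.fc(h_n[-1])      # utterance-level emotion logits

    audio_model = UnimodalLSTM(feat_dim=512)    # dim of speech-recognition features (assumed)
    video_model = UnimodalLSTM(feat_dim=2048)   # dim of CNN image features (assumed)

    def decision_level_fusion(audio_logits, video_logits, w_audio=0.4, w_video=0.6):
        # Weighted average of per-modality posteriors: one simple ensemble variant.
        # The weights here are illustrative only.
        p_audio = torch.softmax(audio_logits, dim=-1)
        p_video = torch.softmax(video_logits, dim=-1)
        return w_audio * p_audio + w_video * p_video

    # Toy usage: fuse predictions for a batch of 8 utterances.
    audio_feats = torch.randn(8, 100, 512)    # 100 audio frames per utterance
    video_feats = torch.randn(8, 50, 2048)    # 50 video frames per utterance
    fused = decision_level_fusion(audio_model(audio_feats), video_model(video_feats))
    pred = fused.argmax(dim=-1)               # e.g. 0: happy, 1: angry, 2: sad (order assumed)

In this sketch each modality is modeled separately, mirroring the paper's setup, and only the per-class posteriors are combined at decision level rather than the feature vectors themselves.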

Key words: deep neural network, ensemble method, multimodal emotion recognition, transfer learning