ZTE Communications, 2021, Vol. 19, Issue (1): 61-71. DOI: 10.12142/ZTECOM.202101008

• Review

Next Generation Semantic and Spatial Joint Perception

ZHU Fang 1,2

  1. State Key Laboratory of Mobile Network and Mobile Multimedia Technology, Shenzhen 518057, China
  2. ZTE Corporation, Shenzhen 518057, China
  • Received: 2020-12-25 Online: 2021-03-25 Published: 2021-04-09
  • About author: ZHU Fang (zhu.fang@zte.com.cn) received the B.Eng. degree in electrical engineering and the M.Sc. degree in information and system from Xi’an Jiaotong University, China, and the Ph.D. degree in electronic engineering from Southeast University, China. He is currently the director of the technical committee in digital video and vision of ZTE Corporation and the deputy director in multimedia of the State Key Laboratory of Mobile Network and Mobile Multimedia Technology. He is a senior member of IEEE, focusing on circuits and systems for video technology and smart vision. His research interests include core technologies, cloud architectures, and acceleration chipsets for specific applications of XR and smart vision based on mobile computing.


Efficient perception of the real world is a long-standing goal of computer vision. Modern visual computing techniques have succeeded in attaching semantic labels to thousands of everyday objects and in reconstructing dense depth maps of complex scenes. However, simultaneous semantic and spatial joint perception, also called dense 3D semantic mapping, which estimates the 3D geometry of a scene and attaches semantic labels to that geometry, remains a challenging problem; solving it would make structured vision understanding and editing more widely accessible. Meanwhile, progress in computer vision and machine learning has motivated the pursuit of systems that can understand and digitally reconstruct the surrounding world. Neural metric-semantic understanding is a new and rapidly emerging field that combines differentiable machine learning techniques with physical knowledge from computer vision, for example by integrating visual-inertial simultaneous localization and mapping (SLAM), mesh reconstruction, and semantic understanding. In this paper, we summarize recent trends and applications of neural metric-semantic understanding. Starting with an overview of the underlying computer vision and machine learning concepts, we discuss critical aspects of such perception approaches, with an emphasis on fully leveraging joint semantic and 3D information. We then present important applications of this perception capability, such as novel view synthesis and semantic augmented reality (AR) content manipulation. Finally, we conclude with a discussion of the technical implications of the technology in a 5G edge computing scenario.

Key words: visual computing, semantic and spatial joint perception, dense 3D semantic mapping, neural metric-semantic understanding