Next Generation Semantic and Spatial Joint Perception

doi:10.12142/ZTECOM.202101008

Abstract

Efficient perception of the real world is a long-standing goal of computer vision. Modern visual computing techniques have succeeded in attaching semantic labels to thousands of everyday objects and in reconstructing dense depth maps of complex scenes. However, simultaneous semantic and spatial joint perception, also called dense 3D semantic mapping, which estimates the 3D geometry of a scene and attaches semantic labels to that geometry, remains a challenging problem that, if solved, would make structured visual understanding and editing more widely accessible. Concurrently, progress in computer vision and machine learning has motivated us to pursue the capability of understanding and digitally reconstructing the surrounding world. Neural metric-semantic understanding is a new and rapidly emerging field that combines differentiable machine learning techniques with physical knowledge from computer vision, e.g., the integration of visual-inertial simultaneous localization and mapping (SLAM), mesh reconstruction, and semantic understanding. In this paper, we summarize the recent trends and applications of neural metric-semantic understanding. Starting with an overview of the underlying computer vision and machine learning concepts, we discuss critical aspects of such perception approaches, with an emphasis on fully leveraging joint semantic and 3D information. We then present important applications of this perception capability, such as novel view synthesis and semantic augmented reality (AR) content manipulation. Finally, we conclude with a discussion of the technical implications of the technology under a 5G edge computing scenario.

Key words: visual computing, semantic and spatial joint perception, dense 3D semantic mapping, neural metric-semantic understanding

ZHU Fang. Next Generation Semantic and Spatial Joint Perception[J]. ZTE Communications, 2021, 19(1): 61-71.

Figures/Tables 6

Figure 1 Semantic and spatial joint perception of a variety of scenes[2–3]

Figure 2 Overview of the architecture for classical panoptic segmentation (pictures taken from Ref. [26])

Figure 3 Dense semantic 3D reconstruction[9]

Figure 4 Neural rerendering in the wild[18] reconstructs a proxy 3D model from a large-scale Internet photo collection

Figure 5 Scene representation networks[17] allow full 3D reconstruction from a single image (bottom row, surface normals and color render) by learning strong priors via a continuous, 3D-structure-aware neural scene representation

Figure 6 Illustration of semantic AR content manipulation: (a) retargetable AR; (b) framework that retargets the AR scene to various real scenes by comparing the AR scene graph with 3D scene graphs constructed in each of the scenes[19]

References 45

1	HERMANS A, FLOROS G, LEIBE B. Dense 3D semantic mapping of indoor scenes from RGB-D images [C]//2014 IEEE International Conference on Robotics and Automation (ICRA). Hong Kong, China: IEEE, 2014: 2631–2638. DOI: 10.1109/ICRA.2014.6907236
2	ROSINOL A, ABATE M, CHANG Y, et al. Kimera: an open-source library for real-time metric-semantic localization and mapping [C]//2020 IEEE International Conference on Robotics and Automation (ICRA). Paris, France: IEEE, 2020: 1689–1696. DOI: 10.1109/ICRA40945.2020.9196885
3	TULSIANI S, KAR A, CARREIRA J, et al. Learning category-specific deformable 3D models for object reconstruction [J]. IEEE transactions on pattern analysis and machine intelligence, 2017, 39(4): 719–731. DOI: 10.1109/TPAMI.2016.2574713
4	TATENO K, TOMBARI F, NAVAB N. Real-time and scalable incremental segmentation on dense SLAM [C]//2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). Hamburg, Germany: IEEE, 2015: 4465–4472. DOI: 10.1109/IROS.2015.7354011
5	MCCORMAC J, HANDA A, DAVISON A, et al. SemanticFusion: dense 3D semantic mapping with convolutional neural networks [C]//2017 IEEE International Conference on Robotics and Automation (ICRA). Singapore, Singapore: IEEE, 2017: 4628–4635. DOI: 10.1109/ICRA.2017.7989538
6	NAKAJIMA Y, TATENO K, TOMBARI F, et al. Fast and accurate semantic mapping through geometric-based incremental segmentation [C]//2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). Madrid, Spain: IEEE, 2018: 385–392. DOI: 10.1109/IROS.2018.8593993
7	NARITA G, SENO T, ISHIKAWA T, et al. PanopticFusion: online volumetric semantic mapping at the level of stuff and things [C]//2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). Macao, China: IEEE, 2019: 4205–4212. DOI: 10.1109/IROS40897.2019.8967890
8	PHAM Q H, HUA B S, NGUYEN T, et al. Real-time progressive 3D semantic segmentation for indoor scenes [C]//2019 IEEE Winter Conference on Applications of Computer Vision (WACV). Waikoloa Village, USA: IEEE, 2019: 1089–1098. DOI: 10.1109/WACV.2019.00121
9	HÄNE C, POLLEFEYS M. An overview of recent progress in volumetric semantic 3D reconstruction [C]//2016 23rd International Conference on Pattern Recognition (ICPR). Cancun, Mexico: IEEE, 2016: 3294–3307. DOI: 10.1109/ICPR.2016.7900143
10	LADICKÝ L, ZEISL B, POLLEFEYS M. Discriminatively trained dense surface normal estimation [C]//European Conference on Computer vision. Zurich, Switzerland: ECCV, 2014: 0906–0912. DOI: 10.1007/978-3-319-10602-1_31
11	GÜNEY F, GEIGER A. Displets: resolving stereo ambiguities using object knowledge [C]//2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Boston, USA: IEEE, 2015: 4165–4175. DOI: 10.1109/CVPR.2015.7299044
12	LI R H, GU D B, LIU Q, et al. Semantic scene mapping with spatio-temporal deep neural network for robotic applications [J]. Cognitive computation, 2018, 10(2): 260–271. DOI: 10.1007/s12559-017-9526-9
13	CHERABIER I, SCHÖNBERGER J L, OSWALD M R, et al. Learning priors for semantic 3D reconstruction [M]//European Conference on Computer vision. Munich, Germany: ECCV, 2018: 325–341. DOI: 10.1007/978-3-030-01258-8_20
14	LIANOS K N, SCHÖNBERGER J L, POLLEFEYS M, et al. VSO: visual semantic odometry [M]//Computer Vision – ECCV 2018. Cham, Switzerland: Springer International Publishing, 2018: 246–263. DOI: 10.1007/978-3-030-01225-0_15
15	HAN L, ZHENG T, ZHU Y H, et al. Live semantic 3D perception for immersive augmented reality [J]. IEEE transactions on visualization and computer graphics, 2020, 26(5): 2012–2022. DOI: 10.1109/TVCG.2020.2973477
16	NGUYEN-PHUOC T, LI C, THEIS L, et al. HoloGAN: unsupervised learning of 3D representations from natural images [C]//2019 IEEE/CVF International Conference on Computer Vision (ICCV). Seoul, South Korea: IEEE, 2019: 7587–7596. DOI: 10.1109/ICCV.2019.00768
17	SITZMANN V, ZOLLHÖFER M, WETZSTEIN G. Scene representation networks: continuous 3D-structure-aware neural scene representations [EB/OL]. [2021-01-05].
18	MESHRY M, GOLDMAN D B, KHAMIS S, et al. Neural rerendering in the wild [C]//2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Long Beach, CA, USA: IEEE, 2019: 6871–6880. DOI: 10.1109/CVPR.2019.00704
19	TAHARA T, SENO T, NARITA G, et al. Retargetable AR: context-aware augmented reality in indoor scenes based on 3D scene graph [EB/OL]. (2020-08-18) [2021-01-05].
20	SCHARSTEIN D, SZELISKI R, ZABIH R. A taxonomy and evaluation of dense two-frame stereo correspondence algorithms [C]//Proceedings IEEE Workshop on Stereo and Multi-Baseline Vision (SMBV 2001). Kauai, HI, USA: IEEE, 2001: 131–140. DOI: 10.1109/SMBV.2001.988771
21	NAIR R, RUHL K, LENZEN F, et al. A survey on time-of-flight stereo fusion [J]. Time-of-flight and depth imaging. sensors, algorithms, and applications, 2013, 8200:105–127. DOI: 10.1007/978-3-642-44964-2_6
22	DAI A, NIEßNER M, ZOLLHÖFER M, et al. BundleFusion [J]. ACM transactions on graphics, 2017, 36(4): 1. DOI: 10.1145/3072959.3126814
23	WHELAN T, SALAS-MORENO R F, GLOCKER B, et al. ElasticFusion: real-time dense SLAM and light source estimation [J]. The international journal of robotics research, 2016, 35(14): 1697–1716. DOI: 10.1177/0278364916669237
24	HAN L, FANG L. FlashFusion: real-time globally consistent dense 3D reconstruction using CPU computing [C]//Robotics: Science and Systems XIV. Robotics: Science and Systems Foundation, 2018. DOI: 10.15607/rss.2018.xiv.006
25	DE GEUS D, MELETIS P, DUBBELMAN G. Fast panoptic segmentation network [J]. IEEE robotics and automation letters, 2020, 5(2): 1742–1749. DOI: 10.1109/LRA.2020.2969919
26	MOHAN R, VALADA A. EfficientPS: efficient panoptic segmentation [EB/OL]. (2020-05-19) [2021-01-05]
27	ARMENI I, SENER O, ZAMIR A R, et al. 3D semantic parsing of large-scale indoor spaces [C]//2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Las Vegas, USA: IEEE, 2016: 1534–1543. DOI: 10.1109/CVPR.2016.170
28	GRAHAM B, ENGELCKE M, MAATEN L V D. 3D semantic segmentation with submanifold sparse convolutional networks [C]//2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, UT, USA: IEEE, 2018: 9224–9232. DOI: 10.1109/CVPR.2018.00961
29	DAI A, NIEßNER M. 3DMV: joint 3D-multi-view prediction for 3D semantic scene segmentation [C]//Computer vision. Munich, Germany: ECCV, 2018: 0908–0914. DOI: 10.1007/978-3-030-01249-6_28
30	ROZUMNYI D, CHERABIER I, POLLEFEYS M, et al. Learned semantic multi-sensor depth map fusion [C]//2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW). Seoul, South Korea: IEEE, 2019: 2089–2099. DOI: 10.1109/ICCVW.2019.00264
31	LUO X, HUANG J B, SZELISKI R, et al. Consistent video depth estimation [J]. ACM transactions on graphics, 2020, 39(4): 1–13. DOI: 10.1145/3386569.3392377
32	SHIM I, OH T H, KWEON I. High-fidelity depth upsampling using the self-learning framework [J]. Sensors, 2018, 19(1): 81. DOI: 10.3390/s19010081
33	YAN S, WU C L, WANG L Z, et al. DDRNet: depth map denoising and refinement for consumer depth cameras using cascaded CNNs [C]//European Conference on Computer vision. Munich, Germany: ECCV, 2018. DOI: 10.1007/978-3-030-01249-6_10
34	PHILIP J, GHARBI M, ZHOU T H, et al. Multi-view relighting using a geometry-aware network [J]. ACM transactions on graphics, 2019, 38(4): 1–14. DOI: 10.1145/3306346.3323013
35	COLLET A, CHUANG M, SWEENEY P, et al. High-quality streamable free-viewpoint video [J]. ACM transactions on graphics, 2015, 34(4): 1–13. DOI: 10.1145/2766945
36	DOU M, KHAMIS S, DEGTYAREV Y, et al. Fusion4D: real-time performance capture of challenging scenes [J]. ACM transactions on graphics, 2016, 35(4): 1–13. DOI: 10.1145/2897824.2925969
37	DOU M S, DAVIDSON P, FANELLO S R, et al. Motion2Fusion [J]. ACM transactions on graphics, 2017, 36(6): 1–16. DOI: 10.1145/3130800.3130801
38	XU Z X, BI S, SUNKAVALLI K, et al. Deep view synthesis from sparse photometric images [J]. ACM transactions on graphics, 2019, 38(4): 1–13. DOI: 10.1145/3306346.3323007
39	KIM K, BILLINGHURST M, BRUDER G, et al. Revisiting trends in augmented reality research: a review of the 2nd decade of ISMAR (2008–2017) [J]. IEEE transactions on visualization and computer graphics, 2018, 24(11): 2947–2962. DOI: 10.1109/TVCG.2018.2868591
40	NAQVI N Z, MOENS K, RAMAKRISHNAN A, et al. To cloud or not to cloud: a context-aware deployment perspective of augmented reality mobile applications [C]//Proceedings of the 30th Annual ACM Symposium on Applied Computing. Salamanca, Spain. New York, USA: ACM, 2015: 0413–0417. DOI: 10.1145/2695664.2695880
41	BARESI L, FILGUEIRA MENDONÇA D, GARRIGA M. Empowering low-latency applications through a serverless edge computing architecture [C]//Service-oriented and cloud computing. Oslo, Norway: ESOCC, 2017: 0927–0929. DOI: 10.1007/978-3-319-67262-5_15
42	CHATZIELEFTHERIOU L E, IOSIFIDIS G, KOUTSOPOULOS I, et al. Towards resource-efficient wireless edge analytics for mobile augmented reality applications [C]//2018 15th International Symposium on Wireless Communication Systems (ISWCS). Lisbon, Portugal: IEEE, 2018: 1–5. DOI: 10.1109/ISWCS.2018.8491206
43	BARESI L, FILGUEIRA MENDONÇA D. Towards a serverless platform for edge computing [C]//2019 IEEE International Conference on Fog Computing (ICFC). Prague, Czech Republic: IEEE, 2019: 1–10. DOI: 10.1109/ICFC.2019.00008
44	PITTALUGA F, KOPPAL S J, KANG S B, et al. Revealing scenes by inverting structure from motion reconstructions [C]//2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Long Beach, CA, USA: IEEE, 2019: 145–154. DOI: 10.1109/CVPR.2019.00023
45	GEPPERT M, LARSSON V, SPECIALE P, et al. Privacy preserving structure-from-motion [C]//16th European Conference Computer Vision. Glasgow, United Kingdom: ECCV, 2020:0823–0828. DOI: 10.1007/978-3-030-58452-8_20