ZTE Communications ›› 2023, Vol. 21 ›› Issue (1): 64-71.DOI: 10.12142/ZTECOM.202301008
• Research Paper •
Ultra-Lightweight Face Animation Method for Ultra-Low Bitrate Video Conferencing
LU Jianguo1,2, ZHENG Qingfang1,2
Received: 2022-08-25
Online: 2023-03-25
Published: 2024-03-15
About author:
LU Jianguo received his BS and MS degrees from Huazhong University of Science and Technology, China, in 2017 and 2020, respectively. Since graduation, he has been working at ZTE Corporation. His research interests include computer vision, artificial intelligence, and augmented reality.
Cite this article:
LU Jianguo, ZHENG Qingfang. Ultra-Lightweight Face Animation Method for Ultra-Low Bitrate Video Conferencing [J]. ZTE Communications, 2023, 21(1): 64-71.
URL: http://zte.magtechjournal.com/EN/10.12142/ZTECOM.202301008
Figure 1 The proposed video conferencing system consists of three parts: the sender on a mobile device, the video generator on servers, and the receiver on a mobile device. On the sender side, the motion encoder extracts keypoints from the driving images, and the feature-based image quality evaluation filters out unnatural frames. The decoder synthesizes images from the keypoints and reconstructs full-resolution images, which are encoded with H.264 or H.265 and sent to the receiver. The receiver decodes the video stream and displays it on the phone screen
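The data flow in Figure 1 is what enables the ultra-low bitrate: keypoints travel over the network instead of pixels. Below is a minimal Python sketch of the sender-side loop; `motion_encoder`, `quality_score`, `channel`, the 0.5 threshold, and the keypoint wire format are all illustrative assumptions, not the paper's actual interfaces.

```python
import struct

def send_frame(frame, motion_encoder, quality_score, channel, threshold=0.5):
    """Sender side of Figure 1: transmit keypoints, not pixels.

    motion_encoder, quality_score, channel and the threshold are
    hypothetical stand-ins for the components named in the caption.
    """
    keypoints = motion_encoder(frame)            # e.g. a short list of (x, y)
    if quality_score(frame, keypoints) < threshold:
        return False                             # drop frames likely to animate badly
    # A handful of floats per frame is orders of magnitude smaller than pixels.
    flat = [c for kp in keypoints for c in kp]
    channel.send(struct.pack(f"{len(flat)}f", *flat))
    return True
```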
Figure 2 Examples of face animation failure. The first row shows a failure caused by a large head pose: the face area becomes blurred and artifacts appear on the woman's hair. The second row shows a degraded result caused by weak temporal correlation, where the reconstructed image is severely distorted and unnatural
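One plausible way to catch both failure modes in Figure 2 is to threshold how far the driving keypoints drift from the source keypoints, since both large poses and weak temporal correlation show up as large drift. The sketch below is an assumption about how such a feature-based check could look, not the paper's actual criterion; the coordinate convention and threshold are illustrative.

```python
import numpy as np

def keypoint_drift(src_kp: np.ndarray, drv_kp: np.ndarray) -> float:
    """Mean Euclidean displacement between source and driving keypoints."""
    return float(np.linalg.norm(src_kp - drv_kp, axis=-1).mean())

def is_animatable(src_kp, drv_kp, max_drift=0.25):
    # Coordinates assumed normalized to [0, 1]; the threshold is illustrative.
    return keypoint_drift(np.asarray(src_kp), np.asarray(drv_kp)) <= max_drift
```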
Figure 3 Qualitative comparisons with state-of-the-art methods. The first three rows are images from the VoxCeleb dataset and the following four rows are images from our in-house dataset. Our method produces competitive results
Table 1 Efficiency comparison between our face animation method and FOMM

Module | Method | FLOPs | | |
---|---|---|---|---|---
Encoder | FOMM | 1 280 M | 14.21 | 55.54 | 57
 | Ours | 14.62 M | 0.16 | 0.60 |
Decoder | FOMM | 120.70 G | 45.56 | 299.10 | 20
 | Ours | 31.42 G | 16.16 | 81.77 |
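Runtime numbers like those in Table 1 are usually obtained by averaging many timed forward passes after a warm-up, e.g. on MNN[35] or TensorRT[36] runtimes. The framework-agnostic sketch below assumes a callable `model` and random inputs, and makes no claim about the paper's exact measurement protocol.

```python
import time
import numpy as np

def mean_latency_ms(model, input_shape, runs=100, warmup=10):
    """Average wall-clock time of one forward pass, in milliseconds."""
    x = np.random.rand(*input_shape).astype(np.float32)
    for _ in range(warmup):   # let caches, JITs and thread pools settle
        model(x)
    t0 = time.perf_counter()
    for _ in range(runs):
        model(x)
    return (time.perf_counter() - t0) * 1000.0 / runs
```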
Table 2 Visual quality comparison among different face animation methods on the VoxCeleb dataset

Method | L1 | AKD | AED
---|---|---|---
X2Face[22] | 0.078 | 7.69 | 0.405
Monkey-Net[23] | 0.049 | 1.89 | 0.199
FOMM[6] | 0.041 | 1.27 | 0.134
Ours | 0.043 | 1.37 | 0.147
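The metrics in Table 2 follow the evaluation protocol popularized by FOMM[6]: L1 is the mean absolute pixel error, AKD the average distance between facial landmarks detected on generated and real frames (e.g. with the detector of Ref. [33]), and AED the average distance between identity embeddings (e.g. from OpenFace[34]). A minimal NumPy sketch, assuming landmarks and embeddings are precomputed arrays:

```python
import numpy as np

def l1_error(pred, gt):
    """Mean absolute pixel difference between generated and real frames."""
    return float(np.abs(pred.astype(np.float32) - gt.astype(np.float32)).mean())

def akd(pred_lm, gt_lm):
    """Average keypoint distance between landmark sets of frame pairs."""
    return float(np.linalg.norm(pred_lm - gt_lm, axis=-1).mean())

def aed(pred_emb, gt_emb):
    """Average Euclidean distance between identity embeddings of frame pairs."""
    return float(np.linalg.norm(pred_emb - gt_emb, axis=-1).mean())
```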
Figure 4 Results of full-resolution image generation. The first row shows images generated by simply replacing the head region of the source image with the newly animated head region. The third row shows results produced by our method described in Section 3.4. In the second and fourth rows, the junctions between head and body regions are zoomed in for clearer comparison
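Naive replacement leaves the head/body seam visible in Figure 4's first row. A feathered alpha blend is one simple baseline for softening such a seam; the sketch below is not the paper's Section 3.4 method, and the box layout and feather width are assumptions.

```python
import numpy as np

def paste_head(source, head, box, feather=16):
    """Paste an animated head crop back into the full-resolution source frame,
    ramping the blend weight near the crop borders to soften the seam."""
    x0, y0, x1, y1 = box                        # head crop location in source
    h, w = y1 - y0, x1 - x0
    alpha = np.ones((h, w, 1), np.float32)
    for i in range(feather):                    # linear ramp on all four edges
        a = (i + 1) / feather
        alpha[i, :] = np.minimum(alpha[i, :], a)
        alpha[h - 1 - i, :] = np.minimum(alpha[h - 1 - i, :], a)
        alpha[:, i] = np.minimum(alpha[:, i], a)
        alpha[:, w - 1 - i] = np.minimum(alpha[:, w - 1 - i], a)
    out = source.astype(np.float32).copy()
    out[y0:y1, x0:x1] = alpha * head + (1 - alpha) * out[y0:y1, x0:x1]
    return out.astype(source.dtype)
```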
1 | SULLIVAN G J, OHM J R, HAN W J, et al. Overview of the high efficiency video coding (HEVC) standard [J]. IEEE transactions on circuits and systems for video technology, 2012, 22(12): 1649–1668. DOI: 10.1109/TCSVT.2012.2221191 |
2 | KONUKO G, VALENZISE G, LATHUILIÈRE S. Ultra-low bitrate video conferencing using deep image animation [C]//IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2021: 4210–4214. DOI: 10.1109/ICASSP39728.2021.9414731 |
3 | FENG D H, HUANG Y, ZHANG Y W, et al. A generative compression framework for low bandwidth video conference [C]//IEEE International Conference on Multimedia & Expo Workshops (ICMEW). IEEE, 2021: 1–6. DOI: 10.1109/ICMEW53276.2021.9455985 |
4 | OQUAB M, STOCK P, GAFNI O, et al. Low bandwidth video-chat compression using deep generative models [C]//IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW). IEEE, 2021: 2388–2397. DOI: 10.1109/CVPRW53098.2021.00271 |
5 | WANG T C, MALLYA A, LIU M Y. One-shot free-view neural talking-head synthesis for video conferencing [C]//IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2021: 10034–10044. DOI: 10.1109/CVPR46437.2021.00991 |
6 | SIAROHIN A, LATHUILIÈRE S, TULYAKOV S, et al. First order motion model for image animation [J]. Advances in neural information processing systems, 2019, 32: 7135–7145 |
7 | GOODFELLOW I, POUGET-ABADIE J, MIRZA M, et al. Generative adversarial nets [J]. Advances in neural information processing systems, 2014, 27: 2672–2680 |
8 | VLASIC D, BRAND M, PFISTER H, et al. Face transfer with multilinear models [J]. ACM transactions on graphics, 2005, 24(3): 426–433. DOI: 10.1145/1073204.1073209 |
9 | DALE K, SUNKAVALLI K, JOHNSON M K, et al. Video face replacement [J]. ACM transactions on graphics, 2011, 30(6): 1–10. DOI: 10.1145/2070781.2024164 |
10 | THIES J, ZOLLHÖFER M, STAMMINGER M, et al. Face2Face: real-time face capture and reenactment of RGB videos [C]//IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2016: 2387–2395. DOI: 10.1109/CVPR.2016.262 |
11 | NAGANO K, SEO J, XING J, et al. paGAN: real-time avatars using dynamic textures [J]. ACM transactions on graphics, 2018, 37(6): 1–12. DOI: 10.1145/3272127.3275075 |
12 | KIM H, GARRIDO P, TEWARI A, et al. Deep video portraits [J]. ACM transactions on graphics (TOG), 2018, 37(4): 1–14. DOI: 10.1145/3197517.3201283 |
13 | BLANZ V, VETTER T. A morphable model for the synthesis of 3D faces [C]//26th Annual Conference on Computer Graphics and Interactive Techniques. ACM, 1999: 187–194. DOI: 10.1145/311535.311556 |
14 | BURKOV E, PASECHNIK I, GRIGOREV A, et al. Neural head reenactment with latent pose descriptors [C]//IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2020: 13783–13792. DOI: 10.1109/CVPR42600.2020.01380 |
15 | OLSZEWSKI K, LI Z M, YANG C, et al. Realistic dynamic facial textures from a single image using GANs [C]//IEEE International Conference on Computer Vision (ICCV). IEEE, 2017: 5439–5448. DOI: 10.1109/ICCV.2017.580 |
16 | SONG Y, ZHU J W, LI D W, et al. Talking face generation by conditional recurrent adversarial network [C]//Twenty-Eighth International Joint Conference on Artificial Intelligence. IJCAI, 2019: 919–925. DOI: 10.24963/ijcai.2019/129 |
17 | YU J H, LIN Z, YANG J M, et al. Generative image inpainting with contextual attention [C]//IEEE/CVF Conference on Computer Vision and Pattern Recognition. IEEE, 2018: 5505–5514. DOI: 10.1109/CVPR.2018.00577 |
18 | ZAKHAROV E, SHYSHEYA A, BURKOV E, et al. Few-shot adversarial learning of realistic neural talking head models [C]//IEEE/CVF International Conference on Computer Vision (ICCV). IEEE, 2019: 9458–9467. DOI: 10.1109/ICCV.2019.00955 |
19 | ZAKHAROV E, IVAKHNENKO A, SHYSHEYA A, et al. Fast bi-layer neural synthesis of one-shot realistic head avatars [C]//European Conference on Computer Vision. Springer, 2020: 524–540. DOI: 10.1007/978-3-030-58610-2_31 |
20 | LIU J, CHEN P, LIANG T, et al. Li-Net: large-pose identity-preserving face reenactment network [C]//IEEE International Conference on Multimedia and Expo (ICME). IEEE, 2021: 1–6. DOI: 10.1109/ICME51207.2021.9428233 |
21 | ZHAO R Q, WU T Y, GUO G D. Sparse to dense motion transfer for face image animation [C]//IEEE/CVF International Conference on Computer Vision Workshops (ICCVW). IEEE, 2021: 1991–2000. DOI: 10.1109/ICCVW54120.2021.00226 |
22 | WILES O, KOEPKE A S, ZISSERMAN A. X2Face: a network for controlling face generation using images, audio, and pose codes [C]//European Conference on Computer Vision. Springer, 2018: 690–706. DOI: 10.1007/978-3-030-01261-8_41 |
23 | SIAROHIN A, LATHUILIÈRE S, TULYAKOV S, et al. Animating arbitrary objects via deep motion transfer [C]//IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2019: 2372–2381. DOI: 10.1109/CVPR.2019.00248 |
24 | SIAROHIN A, WOODFORD O J, REN J, et al. Motion representations for articulated animation [C]//IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2021: 13648–13657. DOI: 10.1109/CVPR46437.2021.01344 |
25 | ZHAO J, ZHANG H. Thin-plate spline motion model for image animation [C]//IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2022: 3647–3656. DOI: 10.1109/CVPR52688.2022.00364 |
26 | AGUSTSSON E, TSCHANNEN M, MENTZER F, et al. Generative adversarial networks for extreme learned image compression [C]//IEEE/CVF International Conference on Computer Vision (ICCV). IEEE, 2019: 221–231. DOI: 10.1109/ICCV.2019.00031 |
27 | KAPLANYAN A S, SOCHENOV A, LEIMKÜHLER T, et al. DeepFovea: neural reconstruction for foveated rendering and video compression using learned statistics of natural videos [J]. ACM transactions on graphics, 2019, 38(6): 1–13. DOI: 10.1145/3355089.3356557 |
28 | LU G, OUYANG W L, XU D, et al. Deep kalman filtering network for video compression artifact reduction [C]//European Conference on Computer Vision. Springer, 2018: 591–608. DOI: 10.1007/978-3-030-01264-9_35 |
29 | GUO Y H, ZHANG X, WU X L. Deep multi-modality soft-decoding of very low bit-rate face videos [C]//28th ACM International Conference on Multimedia. ACM, 2020: 3947–3955. DOI: 10.1145/3394171.3413709 |
30 | PRABHAKAR R, CHANDAK S, CHIU C, et al. Reducing latency and bandwidth for video streaming using keypoint extraction and digital puppetry [C]//Data Compression Conference (DCC). IEEE, 2021: 360. DOI: 10.1109/DCC50243.2021.00057 |
31 | SANDLER M, HOWARD A, ZHU M L, et al. MobileNetV2: inverted residuals and linear bottlenecks [C]//IEEE/CVF Conference on Computer Vision and Pattern Recognition. IEEE, 2018: 4510–4520. DOI: 10.1109/CVPR.2018.00474 |
32 | NAGRANI A, CHUNG J S, ZISSERMAN A. VoxCeleb: a large-scale speaker identification dataset [C]//18th Annual Conference of the International Speech Communication Association. ISCA, 2017: 2616–2620. DOI: 10.21437/interspeech.2017-950 |
33 | BULAT A, TZIMIROPOULOS G. How far are we from solving the 2D & 3D face alignment problem? (and a dataset of 230 000 3D facial landmarks) [C]//IEEE International Conference on Computer Vision (ICCV). IEEE, 2017: 1021–1030. DOI: 10.1109/ICCV.2017.116 |
34 | AMOS B, LUDWICZUK B, SATYANARAYANAN M. OpenFace: a general-purpose face recognition library with mobile applications: CMU-CS-16-118 [R]. USA: School of Computer Science, Carnegie Mellon University, 2016. DOI: 10.13140/RG.2.2.26719.07842 |
35 | JIANG X, WANG H, CHEN Y, et al. MNN: a universal and efficient inference engine [C]//Third Conference on Machine Learning and Systems. MLSys, 2020, 2: 1–13. DOI: 10.48550/arXiv.2002.12418 |
36 | NVIDIA. NVIDIA TensorRT [EB/OL]. [2022-02-22]. |