ZTE Communications ›› 2023, Vol. 21 ›› Issue (1): 64-71.DOI: 10.12142/ZTECOM.202301008
• Research Paper •
Ultra-Lightweight Face Animation Method for Ultra-Low Bitrate Video Conferencing
LU Jianguo1,2, ZHENG Qingfang1,2
Received: 2022-08-25
Online: 2023-03-25
Published: 2024-03-15
About author:
LU Jianguo received his BS and MS degrees from Huazhong University of Science and Technology, China, in 2017 and 2020, respectively. Since graduation, he has been working at ZTE Corporation. His research interests include computer vision, artificial intelligence, and augmented reality.
Cite this article:
LU Jianguo, ZHENG Qingfang. Ultra-Lightweight Face Animation Method for Ultra-Low Bitrate Video Conferencing [J]. ZTE Communications, 2023, 21(1): 64-71.
URL: http://zte.magtechjournal.com/EN/10.12142/ZTECOM.202301008
Figure 1 The proposed video conferencing system consists of three parts: the sender on a mobile device, the video generator on servers, and the receiver on a mobile device. On the sender side, the motion encoder extracts keypoints from the driving images, and the feature-based image quality evaluation filters out unnatural frames. The decoder synthesizes images from the keypoints and reconstructs full-resolution images, which are encoded with H.264 or H.265 and sent to the receiver. The receiver decodes the video stream and displays it on the phone screen
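The data flow in Figure 1 is what enables the ultra-low bitrate: keypoints travel over the network instead of pixels. Below is a minimal Python sketch of the sender-side loop; `motion_encoder`, `quality_score`, `channel`, the 0.5 threshold, and the keypoint wire format are all illustrative assumptions, not the paper's actual interfaces.

```python
import struct

def send_frame(frame, motion_encoder, quality_score, channel, threshold=0.5):
    """Sender side of Figure 1: transmit keypoints, not pixels.

    motion_encoder, quality_score, channel and the threshold are
    hypothetical stand-ins for the components named in the caption.
    """
    keypoints = motion_encoder(frame)            # e.g. a short list of (x, y)
    if quality_score(frame, keypoints) < threshold:
        return False                             # drop frames likely to animate badly
    # A handful of floats per frame is orders of magnitude smaller than pixels.
    flat = [c for kp in keypoints for c in kp]
    channel.send(struct.pack(f"{len(flat)}f", *flat))
    return True
```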
Figure 2 Examples of face animation failure. The first row shows a failure caused by a large head pose: the face area becomes blurred and artifacts appear on the woman's hair. The second row shows a degraded result caused by weak temporal correlation, where the reconstructed image is severely distorted and unnatural
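One plausible way to catch both failure modes in Figure 2 is to threshold how far the driving keypoints drift from the source keypoints, since both large poses and weak temporal correlation show up as large drift. The sketch below is an assumption about how such a feature-based check could look, not the paper's actual criterion; the coordinate convention and threshold are illustrative.

```python
import numpy as np

def keypoint_drift(src_kp: np.ndarray, drv_kp: np.ndarray) -> float:
    """Mean Euclidean displacement between source and driving keypoints."""
    return float(np.linalg.norm(src_kp - drv_kp, axis=-1).mean())

def is_animatable(src_kp, drv_kp, max_drift=0.25):
    # Coordinates assumed normalized to [0, 1]; the threshold is illustrative.
    return keypoint_drift(np.asarray(src_kp), np.asarray(drv_kp)) <= max_drift
```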
Figure 3 Qualitative comparisons with state-of-the-art methods. The first three rows are images from the VoxCeleb dataset and the following four rows are images from our in-house dataset. Our method produces competitive results
Table 1 Efficiency comparison between our face animation method and FOMM

Module | Method | FLOPs | | |
---|---|---|---|---|---
Encoder | FOMM | 1 280 M | 14.21 | 55.54 | 57
 | Ours | 14.62 M | 0.16 | 0.60 |
Decoder | FOMM | 120.70 G | 45.56 | 299.10 | 20
 | Ours | 31.42 G | 16.16 | 81.77 |
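Runtime numbers like those in Table 1 are usually obtained by averaging many timed forward passes after a warm-up, e.g. on MNN[35] or TensorRT[36] runtimes. The framework-agnostic sketch below assumes a callable `model` and random inputs, and makes no claim about the paper's exact measurement protocol.

```python
import time
import numpy as np

def mean_latency_ms(model, input_shape, runs=100, warmup=10):
    """Average wall-clock time of one forward pass, in milliseconds."""
    x = np.random.rand(*input_shape).astype(np.float32)
    for _ in range(warmup):   # let caches, JITs and thread pools settle
        model(x)
    t0 = time.perf_counter()
    for _ in range(runs):
        model(x)
    return (time.perf_counter() - t0) * 1000.0 / runs
```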
Table 2 Visual quality comparison among different face animation methods on the VoxCeleb dataset

Method | L1 | AKD | AED
---|---|---|---
X2Face[22] | 0.078 | 7.69 | 0.405
Monkey-Net[23] | 0.049 | 1.89 | 0.199
FOMM[6] | 0.041 | 1.27 | 0.134
Ours | 0.043 | 1.37 | 0.147
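The metrics in Table 2 follow the evaluation protocol popularized by FOMM[6]: L1 is the mean absolute pixel error, AKD the average distance between facial landmarks detected on generated and real frames (e.g. with the detector of Ref. [33]), and AED the average distance between identity embeddings (e.g. from OpenFace[34]). A minimal NumPy sketch, assuming landmarks and embeddings are precomputed arrays:

```python
import numpy as np

def l1_error(pred, gt):
    """Mean absolute pixel difference between generated and real frames."""
    return float(np.abs(pred.astype(np.float32) - gt.astype(np.float32)).mean())

def akd(pred_lm, gt_lm):
    """Average keypoint distance between landmark sets of frame pairs."""
    return float(np.linalg.norm(pred_lm - gt_lm, axis=-1).mean())

def aed(pred_emb, gt_emb):
    """Average Euclidean distance between identity embeddings of frame pairs."""
    return float(np.linalg.norm(pred_emb - gt_emb, axis=-1).mean())
```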
Figure 4 Results of full-resolution image generation. The first row shows images generated by simply replacing the head region of the source image with the newly animated head region. The third row shows results produced by our method described in Section 3.4. In the second and fourth rows, the junctions between head and body regions are zoomed in for clearer comparison
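Naive replacement leaves the head/body seam visible in Figure 4's first row. A feathered alpha blend is one simple baseline for softening such a seam; the sketch below is not the paper's Section 3.4 method, and the box layout and feather width are assumptions.

```python
import numpy as np

def paste_head(source, head, box, feather=16):
    """Paste an animated head crop back into the full-resolution source frame,
    ramping the blend weight near the crop borders to soften the seam."""
    x0, y0, x1, y1 = box                        # head crop location in source
    h, w = y1 - y0, x1 - x0
    alpha = np.ones((h, w, 1), np.float32)
    for i in range(feather):                    # linear ramp on all four edges
        a = (i + 1) / feather
        alpha[i, :] = np.minimum(alpha[i, :], a)
        alpha[h - 1 - i, :] = np.minimum(alpha[h - 1 - i, :], a)
        alpha[:, i] = np.minimum(alpha[:, i], a)
        alpha[:, w - 1 - i] = np.minimum(alpha[:, w - 1 - i], a)
    out = source.astype(np.float32).copy()
    out[y0:y1, x0:x1] = alpha * head + (1 - alpha) * out[y0:y1, x0:x1]
    return out.astype(source.dtype)
```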
1 | SULLIVAN G J, OHM J R, HAN W J, et al. Overview of the high efficiency video coding (HEVC) standard [J]. IEEE transactions on circuits and systems for video technology, 2012, 22(12): 1649–1668. DOI: 10.1109/TCSVT.2012.2221191 |
2 | KONUKO G, VALENZISE G, LATHUILIÈRE S. Ultra-low bitrate video conferencing using deep image animation [C]//IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2021: 4210–4214. DOI: 10.1109/ICASSP39728.2021.9414731 |
3 | FENG D H, HUANG Y, ZHANG Y W, et al. A generative compression framework for low bandwidth video conference [C]//IEEE International Conference on Multimedia & Expo Workshops (ICMEW). IEEE, 2021: 1–6. DOI: 10.1109/ICMEW53276.2021.9455985 |
4 | OQUAB M, STOCK P, GAFNI O, et al. Low bandwidth video-chat compression using deep generative models [C]//IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW). IEEE, 2021: 2388–2397. DOI: 10.1109/CVPRW53098.2021.00271 |
5 | WANG T C, MALLYA A, LIU M Y. One-shot free-view neural talking-head synthesis for video conferencing [C]//IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2021: 10034–10044. DOI: 10.1109/CVPR46437.2021.00991 |
6 | SIAROHIN A, LATHUILIÈRE S, TULYAKOV S, et al. First order motion model for image animation [J]. Advances in neural information processing systems, 2019, 32: 7135–7145 |
7 | GOODFELLOW I, POUGET-ABADIE J, MIRZA M, et al. Generative adversarial nets [J]. Advances in neural information processing systems, 2014, 27: 2672–2680 |
8 | VLASIC D, BRAND M, PFISTER H, et al. Face transfer with multilinear models [J]. ACM transactions on graphics, 2005, 24(3): 426–433. DOI: 10.1145/1073204.1073209 |
9 | DALE K, SUNKAVALLI K, JOHNSON M K, et al. Video face replacement [J]. ACM transactions on graphics, 2011, 30(6): 1–10. DOI: 10.1145/2070781.2024164 |
10 | THIES J, ZOLLHÖFER M, STAMMINGER M, et al. Face2Face: real-time face capture and reenactment of RGB videos [C]//IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2016: 2387–2395. DOI: 10.1109/CVPR.2016.262 |
11 | NAGANO K, SEO J, XING J, et al. paGAN: real-time avatars using dynamic textures [J]. ACM transactions on graphics, 2018, 37(6): 1–12. DOI: 10.1145/3272127.3275075 |
12 | KIM H, GARRIDO P, TEWARI A, et al. Deep video portraits [J]. ACM transactions on graphics (TOG), 2018, 37(4): 1–14. DOI: 10.1145/3197517.3201283 |
13 | BLANZ V, VETTER T. A morphable model for the synthesis of 3D faces [C]//26th Annual Conference on Computer Graphics and Interactive Techniques. ACM, 1999: 187–194. DOI: 10.1145/311535.311556 |
14 | BURKOV E, PASECHNIK I, GRIGOREV A, et al. Neural head reenactment with latent pose descriptors [C]//IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2020: 13783–13792. DOI: 10.1109/CVPR42600.2020.01380 |
15 | OLSZEWSKI K, LI Z M, YANG C, et al. Realistic dynamic facial textures from a single image using GANs [C]//IEEE International Conference on Computer Vision (ICCV). IEEE, 2017: 5439–5448. DOI: 10.1109/ICCV.2017.580 |
16 | SONG Y, ZHU J W, LI D W, et al. Talking face generation by conditional recurrent adversarial network [C]//Twenty-Eighth International Joint Conference on Artificial Intelligence. IJCAI, 2019: 919–925. DOI: 10.24963/ijcai.2019/129 |
17 | YU J H, LIN Z, YANG J M, et al. Generative image inpainting with contextual attention [C]//IEEE/CVF Conference on Computer Vision and Pattern Recognition. IEEE, 2018: 5505–5514. DOI: 10.1109/CVPR.2018.00577 |
18 | ZAKHAROV E, SHYSHEYA A, BURKOV E, et al. Few-shot adversarial learning of realistic neural talking head models [C]//IEEE/CVF International Conference on Computer Vision (ICCV). IEEE, 2019: 9458–9467. DOI: 10.1109/ICCV.2019.00955 |
19 | ZAKHAROV E, IVAKHNENKO A, SHYSHEYA A, et al. Fast bi-layer neural synthesis of one-shot realistic head avatars [C]//European Conference on Computer Vision. Springer, 2020: 524–540. DOI: 10.1007/978-3-030-58610-2_31 |
20 | LIU J, CHEN P, LIANG T, et al. Li-Net: large-pose identity-preserving face reenactment network [C]//IEEE International Conference on Multimedia and Expo (ICME). IEEE, 2021: 1–6. DOI: 10.1109/ICME51207.2021.9428233 |
21 | ZHAO R Q, WU T Y, GUO G D. Sparse to dense motion transfer for face image animation [C]//IEEE/CVF International Conference on Computer Vision Workshops (ICCVW). IEEE, 2021: 1991–2000. DOI: 10.1109/ICCVW54120.2021.00226 |
22 | WILES O, KOEPKE A S, ZISSERMAN A. X2Face: a network for controlling face generation using images, audio, and pose codes [C]//European Conference on Computer Vision. Springer, 2018: 690–706. DOI: 10.1007/978-3-030-01261-8_41 |
23 | SIAROHIN A, LATHUILIÈRE S, TULYAKOV S, et al. Animating arbitrary objects via deep motion transfer [C]//IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2019: 2372–2381. DOI: 10.1109/CVPR.2019.00248 |
24 | SIAROHIN A, WOODFORD O J, REN J, et al. Motion representations for articulated animation [C]//IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2021: 13648–13657. DOI: 10.1109/CVPR46437.2021.01344 |
25 | ZHAO J, ZHANG H. Thin-plate spline motion model for image animation [C]//IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2022: 3647–3656. DOI: 10.1109/CVPR52688.2022.00364 |
26 | AGUSTSSON E, TSCHANNEN M, MENTZER F, et al. Generative adversarial networks for extreme learned image compression [C]//IEEE/CVF International Conference on Computer Vision (ICCV). IEEE, 2019: 221–231. DOI: 10.1109/ICCV.2019.00031 |
27 | KAPLANYAN A S, SOCHENOV A, LEIMKÜHLER T, et al. DeepFovea: neural reconstruction for foveated rendering and video compression using learned statistics of natural videos [J]. ACM transactions on graphics, 2019, 38(6): 1–13. DOI: 10.1145/3355089.3356557 |
28 | LU G, OUYANG W L, XU D, et al. Deep kalman filtering network for video compression artifact reduction [C]//European Conference on Computer Vision. Springer, 2018: 591–608. DOI: 10.1007/978-3-030-01264-9_35 |
29 | GUO Y H, ZHANG X, WU X L. Deep multi-modality soft-decoding of very low bit-rate face videos [C]//28th ACM International Conference on Multimedia. ACM, 2020: 3947–3955. DOI: 10.1145/3394171.3413709 |
30 | PRABHAKAR R, CHANDAK S, CHIU C, et al. Reducing latency and bandwidth for video streaming using keypoint extraction and digital puppetry [C]//Data Compression Conference (DCC). IEEE, 2021: 360. DOI: 10.1109/DCC50243.2021.00057 |
31 | SANDLER M, HOWARD A, ZHU M L, et al. MobileNetV2: inverted residuals and linear bottlenecks [C]//IEEE/CVF Conference on Computer Vision and Pattern Recognition. IEEE, 2018: 4510–4520. DOI: 10.1109/CVPR.2018.00474 |
32 | NAGRANI A, CHUNG J S, ZISSERMAN A. VoxCeleb: a large-scale speaker identification dataset [C]//18th Annual Conference of the International Speech Communication Association. ISCA, 2017: 2616–2620. DOI: 10.21437/interspeech.2017-950 |
33 | BULAT A, TZIMIROPOULOS G. How far are we from solving the 2D & 3D face alignment problem? (and a dataset of 230 000 3D facial landmarks) [C]//IEEE International Conference on Computer Vision (ICCV). IEEE, 2017: 1021–1030. DOI: 10.1109/ICCV.2017.116 |
34 | AMOS B, LUDWICZUK B, SATYANARAYANAN M. OpenFace: a general-purpose face recognition library with mobile applications: CMU-CS-16-118 [R]. USA: School of Computer Science, Carnegie Mellon University, 2016. DOI: 10.13140/RG.2.2.26719.07842 |
35 | JIANG X, WANG H, CHEN Y, et al. MNN: a universal and efficient inference engine [C]//Third Conference on Machine Learning and Systems. MLSys, 2020, 2: 1–13. DOI: 10.48550/arXiv.2002.12418 |
36 | NVIDIA. NVIDIA TensorRT [EB/OL]. [2022-02-22]. |