Recent Advances in Video Coding for Machines Standard and Technologies

doi:10.12142/ZTECOM.202401008

Abstract

To improve the performance of video compression for machine vision analysis tasks, a video coding for machines (VCM) standard working group was established to advance standardization. This paper reviews recent progress on the VCM standard and gives a comprehensive introduction to its use cases, requirements, evaluation framework, and corresponding metrics. Existing methods are then presented: the proposals are introduced by category, together with the research progress reported at the latest VCM conference. Finally, conclusions are given.

Key words: video coding for machines, VCM, video compression

ZHANG Qiang, MEI Junjun, GUAN Tao, SUN Zhewen, ZHANG Zixiang, YU Li. Recent Advances in Video Coding for Machines Standard and Technologies[J]. ZTE Communications, 2024, 22(1): 62-76.

References

1	DUAN L Y, LOU Y H, BAI Y, et al. Compact descriptors for video analysis: the emerging MPEG standard [J]. IEEE multimedia, 2019, 26(2): 44–54. DOI: 10.1109/MMUL.2018.2873844
2	DUAN L Y, CHANDRASEKHAR V, CHEN J, et al. Overview of the MPEG-CDVS standard [J]. IEEE transactions on image processing, 2016, 25(1): 179–194. DOI: 10.1109/TIP.2015.2500034
3	LE N, ZHANG H L, CRICRI F, et al. Image coding for machines: an end-to-end learned approach [C]// IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2021: 1590–1594. DOI: 10.1109/ICASSP39728.2021.9414465
4	LE N, ZHANG H L, CRICRI F, et al. Learned image coding for machines: a content-adaptive approach [C]// IEEE International Conference on Multimedia and Expo (ICME). IEEE, 2021: 1–6. DOI: 10.1109/ICME51207.2021.9428224
5	TU H Y, LI L, ZHOU W G, et al. Semantic scalable image compression with cross-layer priors [C]// The 29th ACM International Conference on Multimedia. ACM, 2021: 4044–4052. DOI: 10.1145/3474085.3475533
6	CHOI H, BAJIĆ I V. Deep feature compression for collaborative object detection [C]// The 25th IEEE International Conference on Image Processing (ICIP). IEEE, 2018: 3743–3747. DOI: 10.1109/ICIP.2018.8451100
7	CHOI H, BAJIĆ I V. Latent-space scalability for multi-task collaborative intelligence [C]// IEEE International Conference on Image Processing (ICIP). IEEE, 2021: 3562–3566. DOI: 10.1109/ICIP42928.2021.9506712
8	DUAN L Y, LIU J Y, YANG W H, et al. Video coding for machines: a paradigm of collaborative compression and intelligent analytics [J]. IEEE transactions on image processing, 2020, 29: 8680–8695. DOI: 10.1109/TIP.2020.3016485
9	GAO W, LIU S, XU X Z, et al. Recent standard development activities on video coding for machines [EB/OL]. (2021-05-26) [2023-06-06].
10	BJONTEGAARD G. Calculation of average PSNR differences between RD-curves [J]. Computer science, 2001
11	REN S Q, HE K M, GIRSHICK R, et al. Faster R-CNN: towards real-time object detection with region proposal networks [J]. IEEE transactions on pattern analysis and machine intelligence, 2017, 39(6): 1137–1149. DOI: 10.1109/TPAMI.2016.2577031
12	LIN T Y, MAIRE M, BELONGIE S, et al. Microsoft COCO: common objects in context [M]. Computer Vision. Cham: Springer International Publishing, 2014: 740–755. DOI: 10.1007/978-3-319-10602-1_48
13	Open images dataset. Open images dataset v6 [EB/OL]. [2023-06-06].
14	TELEDYNE FLIR. Free FLIR thermal dataset for algorithm training [EB/OL]. [2023-06-06].
15	ROY A, GUINAUDEAU C, BREDIN H, et al. TVD: a reproducible and multiply aligned TV series dataset [C]// The 9th International Conference on Language Resources and Evaluation. ELRA, 2014: 418–425
16	BAJIC I, CHOI H, HOSSEINI E, et al. Sfu-hw-objects-v1 [EB/OL]. (2020-06-24) [2023-06-06].
17	WANG Z D, ZHENG L, LIU Y X, et al. Towards real-time multi-object tracking [M]. Computer Vision. Cham: Springer International Publishing, 2020: 107–122. DOI: 10.1007/978-3-030-58621-8_7
18	LIN W Y, LIU H B, LIU S Z, et al. Human in events: a large-scale benchmark for human-centric video analysis in complex events [EB/OL]. (2020-05-09) [2023-06-06].
19	FAN H, LI Y, XIONG B, et al. Pyslowfast [EB/OL]. [2023-07-06].
20	WANG J D, SUN K, CHENG T H, et al. Deep high-resolution representation learning for visual recognition [J]. IEEE transactions on pattern analysis and machine intelligence, 2021, 43(10): 3349–3364. DOI: 10.1109/TPAMI.2020.2983686
21	ANDRILUKA M, PISHCHULIN L, GEHLER P, et al. 2D human pose estimation: New benchmark and state of the art analysis [C]// IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2014: 3686–3693. DOI: 10.1109/CVPR.2014.471
22	VTM. VVC Test Model [EB/OL]. [2023-06-06].
23	TOMAR S. Converting video formats with FFmpeg [J]. Linux journal, 2006, 146: 93–94
24	MPEG ISO/IEC. The test results of compressing P-layer feature maps on the Mask R-CNN network: m59942 [S]. 2022
25	MPEG ISO/IEC. A feature map compression method by using generated feature frames for object segmentation: m57958 [S]. 2021
26	MPEG ISO/IEC. Investigation on deep feature compression framework for multi-task: m58772 [S]. 2022
27	MPEG ISO/IEC. Compression of FPN multi-scale features for object detection using VVC: m59562 [S]. 2022
28	MPEG ISO/IEC. Performance of the enhanced M with bottom-up MSFF: m60197 [S]. 2022
29	MPEG ISO/IEC. Response to CfE on video coding for machine from canon: m60821 [S]. 2022
30	MPEG ISO/IEC. Experimental results of feature compression using CompressAI: m56716 [S]. 2021
31	BÉGAINT J, RACAPÉ F, FELTMAN S, et al. CompressAI: a PyTorch library and evaluation platform for end-to-end compression research [EB/OL]. (2020-11-05) [2023-06-06].
32	BALLÉ J, MINNEN D, SINGH S, et al. Variational image compression with a scale hyperprior [EB/OL]. (2018-02-01) [2023-06-06].
33	ZHANG G Y, LIU Z, XU X, et al. Learning-based feature compression for instance segmentation: m60240 [S]. Geneva: ISO/IEC MPEG, 2022
34	CHENG Z X, SUN H M, TAKEUCHI M, et al. Learned image compression with discretized Gaussian mixture likelihoods and attention modules [C]// Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2020: 7936–7945. DOI: 10.1109/CVPR42600.2020.00796
35	MPEG ISO/IEC. An end-to-end video feature compressing method with feature fusion modules: m60803 [S]. 2022
36	MPEG ISO/IEC. A DCT based feature compression algorithm: m58000 [S]. 2020
37	HOWARD A G, ZHU M, CHEN B, et al. Mobilenets: efficient convolutional neural networks for mobile vision applications [EB/OL]. (2017-04-17) [2023-06-06].
38	MPEG ISO/IEC. A DWT based feature compression algorithm: m58001 [S]. 2020
39	MPEG ISO/IEC. DCT based feature compression: m60257 [S]. 2022
40	MPEG ISO/IEC. Advanced feature map compression based on optimal transformation with VVC and DeepCABAC: m58787 [S]. 2022
41	MPEG ISO/IEC. Response to CfE: a transformation-based feature map compression method: m60788 [S]. 2022
42	MPEG ISO/IEC. Improvement of a feature map compression using predicted p-layer feature map for object detection: m59512 [S]. 2022
43	MPEG ISO/IEC. End-to-end learning-based image reconstruction framework for hybrid machine and human vision: m58994 [S]. 2022
44	MPEG ISO/IEC. Response from Hanbat National University and ETRI to CfE on video coding for machines: m60761 [S]. 2022
45	MPEG ISO/IEC. Response to VCM CfE: multi-scale feature compression with -adaptive feature channel truncation: m60799 [S]. 2022
46	MPEG ISO/IEC. An end-to-end image feature compressing method with feature fusion module: m60802 [S]. 2022
47	MPEG ISO/IEC. Response to VCM call for evidence from Tencent and Wuhan University–a learning-based feature compression framework: m60925 [S]. 2022
48	MPEG ISO/IEC. Hybrid loss training based on reversed SIMO in feature compression for object detection and instance segmentation: m63174 [S]. 2023
49	MINNEN D, BALLÉ J, TODERICI G. Joint autoregressive and hierarchical priors for learned image compression [C]// The 32nd International Conference on Neural Information Processing Systems. ACM, 2018: 10794–10803. DOI: 10.5555/3327546.3327736
50	MPEG ISO/IEC. ZJU’s response to VCM CfE: deep learning-based compression for machine vision: m56445 [S]. 2021
51	MPEG ISO/IEC. End-to-end image compression towards machine vision for object detection: m57500 [S]. 2021
52	MPEG ISO/IEC. End-to-end learning-based compression for object detection: m58165 [S]. 2021
53	MPEG ISO/IEC. Object detection and instance segmentation results with recent NN codecs: m58050 [S]. 2021
54	MPEG ISO/IEC. Response to call for proposals from Ericsson: m60757 [S]. 2022
55	KALVA H, ADZIC V, KRAUSE B F A, et al. Response to VCM CfP from the Florida Atlantic University and OP solutions, LLC: m60743 [S]. Geneva: ISO/IEC MPEG, 2022
56	WANG C Y, BOCHKOVSKIY A, LIAO H. YOLOv7: trainable bag-of-freebies sets new state-of-the-art for real-time object detectors [C]// The IEEE/CVF Conference on Computer Vision and Pattern Recognition. IEEE, 2023: 7464–7475. DOI: 10.1109/cvpr52729.2023.00721
57	MPEG ISO/IEC. Response to VCM CfP: video coding with machine-attention: m60378 [S]. 2022
58	MPEG ISO/IEC. Video coding for machines CfP response from Alibaba and City University of Hong Kong: m60737 [S]. 2022
59	MPEG ISO/IEC. Response to VCM call for proposals–an EVC based solution: m60779 [S]. 2022
60	MPEG ISO/IEC. Response to the CfP on video coding for machine from Zhejiang University: m60741 [S]. 2022
61	MPEG ISO/IEC. Response to VCM call for proposals from Tencent–an end-to-end learning based solution: m60777 [S]. 2020
62	MPEG ISO/IEC. Response to the CfP of the VCM by Nokia: m60753 [S]. 2022

Tasks	Network Architecture	Training Dataset
Object detection	Faster R-CNN with ResNeXt-101 backbone [11]	COCO train2017 [12], OpenImageV6 [13], FLIR [14], TVD [15], SFU-HW-Object-v1 [16]
Instance segmentation	Faster R-CNN with ResNeXt-101 backbone	OpenImageV6, TVD
Object tracking	JDE-1088x608 [17]	HiEve [18], TVD
Action recognition*	SlowFast [19]	HiEve
Pose estimation*	HRNet [20]	COCO train2017, MPII Human Pose [21], HiEve

Instance Segmentation			Object Detection			Object Detection
Proposal	BD-rate over video	BD-rate over feature	Proposal	BD-rate over video	BD-rate over feature	Proposal	BD-rate over video	BD-rate over feature
Ref. [44]	-87.44%	-97.58%	Ref. [44]	-79.21%	-95.56%	Ref. [44]	-81.11%	-94.15%
Ref. [41]	63.69%	-74.43%	Ref. [41]	-47.46%	-89.48%	Ref. [41]	-54.51%	-85.06%
Ref. [45]	-80.18%	-97.09%	Ref. [45]	-93.04%	-98.60%	Ref. [45]	-94.46%	-98.34%
Ref. [35]	218.93%	-33.01%	Ref. [46]	-19.35%	-83.38%	Ref. [29]	-70.39%	-91.14%
Ref. [29]	-77.40%	-95.84%	Ref. [29]	-78.11%	-95.84%
Ref. [47]	-64.94%	-92.17%	Ref. [47]	-69.08%	-92.30%
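
The BD-rate values in the table follow the Bjontegaard metric [10]: a negative value means the proposal needs a lower bitrate than the anchor for the same quality, and a positive value means it needs more. The Python sketch below illustrates the idea under stated assumptions and is not the evaluation code used for the proposals. It assumes that task accuracy (for example mAP) replaces PSNR on the quality axis, and it substitutes piecewise-linear interpolation of log-rate for the cubic fit of the original metric. The names `bd_rate` and `_avg_log_rate` are hypothetical.

```python
import math


def _avg_log_rate(quality, rate, lo, hi):
    """Mean of log(rate) over quality in [lo, hi].

    Assumption: log(rate) is interpolated piecewise-linearly between the
    measured points, and all quality values on a curve are distinct.
    """
    pts = sorted(zip(quality, (math.log(r) for r in rate)))

    def interp(q):
        for (q0, y0), (q1, y1) in zip(pts, pts[1:]):
            if q0 <= q <= q1:
                return y0 + (y1 - y0) * (q - q0) / (q1 - q0)
        raise ValueError("quality outside the range of the curve")

    # Integrate segment by segment so the trapezoid rule is exact.
    knots = [lo] + [q for q, _ in pts if lo < q < hi] + [hi]
    area = 0.0
    for a, b in zip(knots, knots[1:]):
        area += 0.5 * (interp(a) + interp(b)) * (b - a)
    return area / (hi - lo)


def bd_rate(rate_anchor, quality_anchor, rate_test, quality_test):
    """Average bitrate difference (percent) of test versus anchor."""
    # Compare only over the quality range covered by both curves.
    lo = max(min(quality_anchor), min(quality_test))
    hi = min(max(quality_anchor), max(quality_test))
    if hi <= lo:
        raise ValueError("curves do not overlap in quality")
    avg_anchor = _avg_log_rate(quality_anchor, rate_anchor, lo, hi)
    avg_test = _avg_log_rate(quality_test, rate_test, lo, hi)
    return (math.exp(avg_test - avg_anchor) - 1.0) * 100.0


if __name__ == "__main__":
    # Illustrative numbers only, not taken from any proposal.
    quality = [0.60, 0.70, 0.76, 0.80]
    anchor_rate = [100.0, 200.0, 400.0, 800.0]
    test_rate = [r / 2 for r in anchor_rate]
    print(f"{bd_rate(anchor_rate, quality, test_rate, quality):.2f}%")
```

With these illustrative inputs the test curve reaches every quality level at half the anchor bitrate, so the result is a 50% bitrate saving, reported as a BD-rate of -50%.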
