ZTE Communications, 2024, Vol. 22, Issue (1): 62-76. DOI: 10.12142/ZTECOM.202401008
• Review •
ZHANG Qiang1,2, MEI Junjun1,2, GUAN Tao1,2, SUN Zhewen3, ZHANG Zixiang3, YU Li3
Received: 2023-07-02
Online: 2024-03-29
Published: 2024-03-28
About author:
ZHANG Qiang is the Director of the Big Video Committee of ZTE Corporation. His research interests include computer vision, audio and video codec, transmission, and network architecture.
ZHANG Qiang, MEI Junjun, GUAN Tao, SUN Zhewen, ZHANG Zixiang, YU Li. Recent Advances in Video Coding for Machines Standard and Technologies[J]. ZTE Communications, 2024, 22(1): 62-76.
URL: http://zte.magtechjournal.com/EN/10.12142/ZTECOM.202401008
Tasks | Network Architecture | Training Dataset |
---|---|---|
Object detection | Faster R-CNN with ResNeXt-101 backbone [11] | COCO train2017 [12], OpenImageV6 [13], FLIR [14], TVD [15], SFU-HW-Object-v1 [16] |
Instance segmentation | Faster R-CNN with ResNeXt-101 backbone | OpenImageV6, TVD |
Object tracking | JDE-1088x608 [17] | HiEve [18], TVD |
Action recognition* | SlowFast [19] | HiEve |
Pose estimation* | HRNet [20] | COCO train2017, MPII Human Pose [21], HiEve |
Table 1 Information about machine vision tasks
Instance Segmentation | Object Detection | Object Detection | ||||||
---|---|---|---|---|---|---|---|---|
Overall | BD-rate over video | BD-rate over feature | Overall | BD-rate over video | BD-rate over feature | Overall | BD-rate over video | BD-rate over feature |
Ref. [ | -87.44% | -97.58% | Ref. [ | -79.21% | -95.56% | Ref. [ | -81.11% | -94.15% |
Ref. [ | 63.69% | -74.43% | Ref. [ | -47.46% | -89.48% | Ref. [ | -54.51% | -85.06% |
Ref. [ | -80.18% | -97.09% | Ref. [ | -93.04% | -98.60% | Ref. [ | -94.46% | -98.34% |
Ref. [ | 218.93% | -33.01% | Ref. [ | -19.35% | -83.38% | Ref. [ | -70.39% | -91.14% |
Ref. [ | -77.40% | -95.84% | Ref. [ | -78.11% | -95.84% | |||
Ref. [ | -64.94% | -92.17% | Ref. [ | -69.08% | -92.30% |
Table 2 Proposal summary results on TVD-overall
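The percentages in Table 2 are BD-rates, which under the usual Bjøntegaard convention (Ref. 10) express the average bitrate change relative to an anchor at equal performance, so negative values are savings and positive values are overhead. The short Python sketch below only illustrates how to read such a value; the helper name `relative_bitrate` is hypothetical and is not part of any proposal or of the evaluation software.

```python
def relative_bitrate(bd_rate_percent: float) -> float:
    """Convert a BD-rate percentage into a multiple of the anchor bitrate.

    Assumption: BD-rate is the average bitrate difference versus the anchor
    at equal task performance, expressed in percent.
    """
    return 1.0 + bd_rate_percent / 100.0


# Three values copied from the first column group of Table 2.
for bd_rate in (-87.44, 63.69, 218.93):
    factor = relative_bitrate(bd_rate)
    print(f"BD-rate {bd_rate:+.2f}% -> {factor:.2f}x the anchor bitrate")
```

Read this way, -87.44% corresponds to roughly 0.13 times the anchor bitrate, while 218.93% corresponds to roughly 3.19 times the anchor bitrate.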
1 | DUAN L Y, LOU Y H, BAI Y, et al. Compact descriptors for video analysis: the emerging MPEG standard [J]. IEEE multimedia, 2019, 26(2): 44–54. DOI: 10.1109/MMUL.2018.2873844 |
2 | DUAN L Y, CHANDRASEKHAR V, CHEN J, et al. Overview of the MPEG-CDVS standard [J]. IEEE transactions on image processing, 2016, 25(1): 179–194. DOI: 10.1109/TIP.2015.2500034 |
3 | LE N, ZHANG H L, CRICRI F, et al. Image coding for machines: an end-to-end learned approach [C]// IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2021: 1590–1594. DOI: 10.1109/ICASSP39728.2021.9414465 |
4 | LE N, ZHANG H L, CRICRI F, et al. Learned image coding for machines: a content-adaptive approach [C]// IEEE International Conference on Multimedia and Expo (ICME). IEEE, 2021: 1–6. DOI: 10.1109/ICME51207.2021.9428224 |
5 | TU H Y, LI L, ZHOU W G, et al. Semantic scalable image compression with cross-layer priors [C]// The 29th ACM International Conference on Multimedia. ACM, 2021: 4044–4052. DOI: 10.1145/3474085.3475533 |
6 | CHOI H, BAJIĆ I V. Deep feature compression for collaborative object detection [C]// The 25th IEEE International Conference on Image Processing (ICIP). IEEE, 2018: 3743–3747. DOI: 10.1109/ICIP.2018.8451100 |
7 | CHOI H, BAJIĆ I V. Latent-space scalability for multi-task collaborative intelligence [C]// IEEE International Conference on Image Processing (ICIP). IEEE, 2021: 3562–3566. DOI: 10.1109/ICIP42928.2021.9506712 |
8 | DUAN L Y, LIU J Y, YANG W H, et al. Video coding for machines: a paradigm of collaborative compression and intelligent analytics [J]. IEEE transactions on image processing, 2020, 29: 8680–8695. DOI: 10.1109/TIP.2020.3016485 |
9 | GAO W, LIU S, XU X Z, et al. Recent standard development activities on video coding for machines [EB/OL]. (2021-05-26) [2023-06-06]. |
10 | BJONTEGAARD G. Calculation of average PSNR differences between RD-curves [J]. Computer science, 2001 |
11 | REN S Q, HE K M, GIRSHICK R, et al. Faster R-CNN: towards real-time object detection with region proposal networks [J]. IEEE transactions on pattern analysis and machine intelligence, 2017, 39(6): 1137–1149. DOI: 10.1109/TPAMI.2016.2577031 |
12 | LIN T Y, MAIRE M, BELONGIE S, et al. Microsoft COCO: common objects in context [M]. Computer Vision. Cham: Springer International Publishing, 2014: 740–755. DOI: 10.1007/978-3-319-10602-1_48 |
13 | Open images dataset. Open images dataset v6 [EB/OL]. [2023-06-06]. |
14 | TELEDYNE FLIR. Free FLIR thermal dataset for algorithm training [EB/OL]. [2023-06-06]. |
15 | ROY A, GUINAUDEAU C, BREDIN H, et al. TVD: a reproducible and multiply aligned TV series dataset [C]// The 9th International Conference on Language Resources and Evaluation. ELRA, 2014: 418–425 |
16 | BAJIC I, CHOI H, HOSSEINI E, et al. Sfu-hw-objects-v1 [EB/OL]. (2020-06-24) [2023-06-06]. |
17 | WANG Z D, ZHENG L, LIU Y X, et al. Towards real-time multi-object tracking [M]. Computer Vision. Cham: Springer International Publishing, 2020: 107–122. DOI: 10.1007/978-3-030-58621-8_7 |
18 | LIN W Y, LIU H B, LIU S Z, et al. Human in events: a large-scale benchmark for human-centric video analysis in complex events [EB/OL]. (2020-05-09) [2023-06-06]. |
19 | FAN H, LI Y, XIONG B, et al. Pyslowfast [EB/OL]. [2023-07-06]. |
20 | WANG J D, SUN K, CHENG T H, et al. Deep high-resolution representation learning for visual recognition [J]. IEEE transactions on pattern analysis and machine intelligence, 2021, 43(10): 3349–3364. DOI: 10.1109/TPAMI.2020.2983686 |
21 | ANDRILUKA M, PISHCHULIN L, GEHLER P, et al. 2D human pose estimation: New benchmark and state of the art analysis [C]// IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2014: 3686–3693. DOI: 10.1109/CVPR.2014.471 |
22 | VTM. VVC Test Model [EB/OL]. [2023-06-06]. |
23 | TOMAR S. Converting video formats with FFmpeg [J]. Linux journal, 2006, 146: 93–94 |
24 | MPEG ISO/IEC. The test results of compressing P-layer feature maps on the Mask R-CNN network: m59942 [S]. 2022 |
25 | MPEG ISO/IEC. A feature map compression method by using generated feature frames for object segmentation: m57958 [S]. 2021 |
26 | MPEG ISO/IEC. Investigation on deep feature compression framework for multi-task: m58772 [S]. 2022 |
27 | MPEG ISO/IEC. Compression of FPN multi-scale features for object detection using VVC: m59562 [S]. 2022 |
28 | MPEG ISO/IEC. Performance of the enhanced M with bottom-up MSFF: m60197 [S]. 2022 |
29 | MPEG ISO/IEC. Response to CfE on video coding for machine from canon: m60821 [S]. 2022 |
30 | MPEG ISO/IEC. Experimental results of feature compression using CompressAI: m56716 [S]. 2021 |
31 | BÉGAINT J, RACAPÉ F, FELTMAN S, et al. CompressAI: a PyTorch library and evaluation platform for end-to-end compression research [EB/OL]. (2020-11-05) [2023-06-06]. |
32 | BALLÉ J, MINNEN D, SINGH S, et al. Variational image compression with a scale hyperprior [EB/OL]. (2018-02-01) [2023-06-06]. |
33 | ZHANG G Y, LIU Z, XU X, et al. Learning-based feature compression for instance segmentation: m60240 [S]. Geneva: ISO/IEC MPEG, 2022 |
34 | CHENG Z X, SUN H M, TAKEUCHI M, et al. Learned image compression with discretized Gaussian mixture likelihoods and attention modules [C]// Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2020: 7936–7945. DOI: 10.1109/CVPR42600.2020.00796 |
35 | MPEG ISO/IEC. An end-to-end video feature compressing method with feature fusion modules: m60803 [S]. 2022 |
36 | MPEG ISO/IEC. A DCT based feature compression algorithm: m58000 [S]. 2020 |
37 | HOWARD A G, ZHU M, CHEN B, et al. Mobilenets: efficient convolutional neural networks for mobile vision applications [EB/OL]. (2017-04-17) [2023-06-06]. |
38 | MPEG ISO/IEC. A DWT based feature compression algorithm: m58001 [S]. 2020 |
39 | MPEG ISO/IEC. DCT based feature compression: m60257 [S]. 2022 |
40 | MPEG ISO/IEC. Advanced feature map compression based on optimal transformation with VVC and DeepCABAC: m58787 [S]. 2022 |
41 | MPEG ISO/IEC. Response to CfE: a transformation-based feature map compression method: m60788 [S]. 2022 |
42 | MPEG ISO/IEC. Improvement of a feature map compression using predicted p-layer feature map for object detection: m59512 [S]. 2022 |
43 | MPEG ISO/IEC. End-to-end learning-based image reconstruction framework for hybrid machine and human vision: m58994 [S]. 2022 |
44 | MPEG ISO/IEC. Response from Hanbat National University and ETRI to CfE on video coding for machines: m60761 [S]. 2022 |
45 | MPEG ISO/IEC. Response to VCM CfE: multi-scale feature compression with -adaptive feature channel truncation: m60799 [S]. 2022 |
46 | MPEG ISO/IEC. An end-to-end image feature compressing method with feature fusion module: m60802 [S]. 2022 |
47 | MPEG ISO/IEC. Response to VCM call for evidence from Tencent and Wuhan University–a learning-based feature compression framework: m60925 [S]. 2022 |
48 | MPEG ISO/IEC. Hybrid loss training based on reversed SIMO in feature compression for object detection and instance segmentation: m63174 [S]. 2023 |
49 | MINNEN D, BALLÉ J, TODERICI G. Joint autoregressive and hierarchical priors for learned image compression [C]// The 32nd International Conference on Neural Information Processing Systems. ACM, 2018: 10794–10803. DOI: 10.5555/3327546.3327736 |
50 | MPEG ISO/IEC. ZJU’s response to VCM CfE: deep learning-based compression for machine vision: m56445 [S]. 2021 |
51 | MPEG ISO/IEC. End-to-end image compression towards machine vision for object detection: m57500 [S]. 2021 |
52 | MPEG ISO/IEC. End-to-end learning-based compression for object detection: m58165 [S]. 2021 |
53 | MPEG ISO/IEC. Object detection and instance segmentation results with recent NN codecs: m58050 [S]. 2021 |
54 | MPEG ISO/IEC. Response to call for proposals from Ericsson: m60757 [S]. 2022 |
55 | KALVA H, ADZIC V, KRAUSE B F A, et al. Response to VCM CfP from the Florida Atlantic University and OP solutions, LLC: m60743 [S]. Geneva: ISO/IEC MPEG, 2022 |
56 | WANG C Y, BOCHKOVSKIY A, LIAO H. YOLOv7: trainable bag-of-freebies sets new state-of-the-art for real-time object detectors [C]// The IEEE/CVF Conference on Computer Vision and Pattern Recognition. IEEE, 2023: 7464–7475. DOI: 10.1109/cvpr52729.2023.00721 |
57 | MPEG ISO/IEC. Response to VCM CfP: video coding with machine-attention: m60378 [S]. 2022 |
58 | MPEG ISO/IEC. Video coding for machines CfP response from Alibaba and City University of Hong Kong: m60737 [S]. 2022 |
59 | MPEG ISO/IEC. Response to VCM call for proposals–an EVC based solution: m60779 [S]. 2022 |
60 | MPEG ISO/IEC. Response to the CfP on video coding for machine from Zhejiang University: m60741 [S]. 2022 |
61 | MPEG ISO/IEC. Response to VCM call for proposals from Tencent–an end-to-end learning based solution: m60777 [S]. 2020 |
62 | MPEG ISO/IEC. Response to the CfP of the VCM by Nokia: m60753 [S]. 2022 |