ZTE Communications, 2020, Vol. 18, Issue (2): 74-82. DOI: 10.12142/ZTECOM.202002009
• Research Paper •

Crowd Counting for Real Monitoring Scene
LI Yiming1, LI Weihua2, SHEN Zan3, NI Bingbing1
Received: 2019-01-17
Online: 2020-06-25
Published: 2020-08-07
About the authors:

LI Yiming received the B.S. degree in information engineering from Shanghai Jiao Tong University, China in 2018. Since 2018, he has been pursuing the M.S. degree at the Institute of Image Communications and Network Engineering of Shanghai Jiao Tong University. His research interests include crowd counting in dense scenes, as well as image enhancement, segmentation, and texture recognition in materials science.

LI Weihua received the B.S. degree in information engineering from Southwest University, China in 1996. He is currently responsible for VSS product planning at ZTE Corporation.

SHEN Zan received the B.S. and M.S. degrees in electronics and information engineering from Shanghai Jiao Tong University, China in 2016 and 2019, respectively. He interned at Tencent Youtu Lab in 2018. Since graduation, he has worked at Ping An Technology (Shenzhen) Co., Ltd. His research interests include, but are not limited to, deep learning, computer vision, and machine learning. He has published one technical paper at the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
LI Yiming, LI Weihua, SHEN Zan, NI Bingbing. Crowd Counting for Real Monitoring Scene [J]. ZTE Communications, 2020, 18(2): 74-82.
URL: https://zte.magtechjournal.com/EN/10.12142/ZTECOM.202002009
Figure 2 Architecture of the proposed Crowd Counting Network for Real Monitoring Scene (RMSN): the top row shows the structure of the generator G_large, the middle row shows the structure of the generator G_small, and the bottom row shows the discriminators D_large and D_small, which share the same structure.
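The exact layer configuration appears only in the figure itself, so the following is a minimal PyTorch sketch of the layout the caption describes: two encoder-decoder generators that regress density maps at different scales, and two identically structured discriminators that judge (image, density map) pairs. All channel widths, depths, and kernel sizes below are illustrative assumptions, not the paper's configuration.

```python
# Minimal sketch of the two-generator / two-discriminator layout in Figure 2.
import torch
import torch.nn as nn

class Generator(nn.Module):
    """Encoder-decoder generator that regresses a density map from an image."""
    def __init__(self, in_ch: int = 3):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(in_ch, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 1, 4, stride=2, padding=1), nn.ReLU(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.decoder(self.encoder(x))

class Discriminator(nn.Module):
    """Scores (image, density map) pairs; D_large and D_small share this structure."""
    def __init__(self, in_ch: int = 4):  # 3 image channels + 1 density channel
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(128, 1, 4, padding=1),  # patch-wise real/fake scores
        )

    def forward(self, img: torch.Tensor, dmap: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([img, dmap], dim=1))

# One generator/discriminator pair per scale, as in the figure.
G_large, G_small = Generator(), Generator()
D_large, D_small = Discriminator(), Discriminator()
```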
Table 1 Comparison of errors for training with different losses

| Objective | Part A MAE | Part A MSE | Part B MAE | Part B MSE | WorldExpo'10 AMAE |
|---|---|---|---|---|---|
|  | 95.8 | 149.4 | 24.1 | 36.4 | 9.95 |
|  | 83.2 | 131.3 | 18.4 | 28.8 | 8.48 |
|  | 75.7 | 102.7 | 17.2 | 27.4 | 7.5 |
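The metrics in Tables 1-5 are the standard counting errors: an image's estimated count is the sum over its predicted density map, MAE is the mean absolute counting error over the test set, and, by the usual crowd-counting convention, MSE denotes the root of the mean squared counting error. AMAE in Table 1 is presumably the MAE averaged over the five WorldExpo'10 test scenes (compare the Average column of Table 3). A minimal NumPy sketch under these assumptions, with variable names of our choosing:

```python
# NumPy sketch of the evaluation metrics reported in Tables 1-5.
import numpy as np

def counting_errors(pred_maps, gt_counts):
    """Return (MAE, MSE) over a test set.

    pred_maps: iterable of predicted density maps (2-D arrays); an image's
               estimated count is the sum over its density map.
    gt_counts: ground-truth head counts, one per image.
    Note: following the usual crowd-counting convention, "MSE" here is the
    root of the mean squared counting error.
    """
    pred = np.array([m.sum() for m in pred_maps], dtype=float)
    gt = np.asarray(gt_counts, dtype=float)
    mae = np.abs(pred - gt).mean()
    mse = np.sqrt(((pred - gt) ** 2).mean())
    return mae, mse
```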
Table 2 Comparison of RMSN with three other state-of-the-art CNN-based methods on the ShanghaiTech dataset

| Methods | Part A MAE | Part A MSE | Part B MAE | Part B MSE |
|---|---|---|---|---|
| The approach in Ref. [3] | 181.8 | 277.7 | 32.0 | 49.8 |
| MCNN [1] | 110.2 | 173.2 | 26.4 | 41.3 |
| Switch-CNN [2] | 90.4 | 135.0 | 21.6 | 33.4 |
| The proposed RMSN | 86.2 | 145.4 | 17.2 | 27.4 |
Figure 4 Two test images sampled from the ShanghaiTech Part A dataset (from left to right, the four columns show the test images, the ground-truth density maps, our estimated density maps, and the estimates of the multi-column convolutional neural network (MCNN) [1]).
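The paper does not restate its ground-truth recipe here, but a common construction for the ground-truth density maps shown in Figure 4 places a unit impulse at each annotated head position and blurs it with a Gaussian, so that the map integrates to the true count. The sketch below uses a fixed, assumed bandwidth; geometry-adaptive kernels (as in MCNN [1]) instead scale the bandwidth with the local head spacing.

```python
# Common ground-truth density map construction (fixed-sigma variant).
import numpy as np
from scipy.ndimage import gaussian_filter

def density_map(head_points, shape, sigma=4.0):
    """Build a ground-truth density map from (row, col) head annotations."""
    dmap = np.zeros(shape, dtype=np.float32)
    for r, c in head_points:
        # Clamp annotations to the image bounds before placing the impulse.
        dmap[min(int(r), shape[0] - 1), min(int(c), shape[1] - 1)] += 1.0
    # Gaussian blurring spreads each impulse while preserving the total count.
    return gaussian_filter(dmap, sigma)
```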
Table 3 Comparison of RMSN with four other state-of-the-art CNN-based methods on the WorldExpo'10 dataset (MAE per scene)

| Methods | Scene 1 | Scene 2 | Scene 3 | Scene 4 | Scene 5 | Average |
|---|---|---|---|---|---|---|
| The approach in Ref. [3] | 9.8 | 14.1 | 14.3 | 22.2 | 3.7 | 12.9 |
| MCNN [1] | 3.4 | 20.6 | 12.9 | 13.0 | 8.1 | 11.6 |
| Switch-CNN [2] | 4.4 | 15.7 | 10.0 | 11.0 | 5.9 | 9.4 |
| CP-CNN [21] | 2.9 | 14.7 | 10.5 | 10.4 | 5.8 | 8.9 |
| The proposed RMSN | 4.1 | 14.05 | 9.6 | 11.8 | 2.9 | 8.49 |
Table 4 Comparative results on the UCF_CC_50 dataset

| Methods | MAE | MSE |
|---|---|---|
| The approach in Ref. [28] | 419.5 | 541.6 |
| The approach in Ref. [3] | 467.0 | 498.5 |
| MCNN [1] | 377.6 | 509.1 |
| Switch-CNN [2] | 318.1 | 439.2 |
| CP-CNN [21] | 295.8 | 320.9 |
| The proposed RMSN | 291.0 | 404.6 |
Table 5 Comparative results on the UCSD dataset

| Methods | MAE | MSE |
|---|---|---|
| Kernel Ridge Regression [12] | 2.16 | 7.45 |
| Cumulative Attribute Regression | 2.07 | 6.86 |
| The approach in Ref. [3] | 1.60 | 3.31 |
| Switch-CNN [2] | 1.62 | 2.10 |
| The proposed RMSN | 1.47 | 1.98 |
Figure 5 A test video sampled from the UCSD dataset (from left to right and top to bottom, the four images show the real-time source frame, the density map, the velocity map, and the retention map).
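The monitoring outputs in Figure 5 all derive from the estimated density map. Because the density map integrates to the crowd count, the count inside any monitored region is just a masked sum; statistics of such counts across frames are presumably what the velocity and retention maps visualize. A minimal sketch with a hypothetical helper of our own naming, not an API from the paper:

```python
# Hypothetical helper: region-level counting from an estimated density map.
import numpy as np

def region_count(dmap: np.ndarray, mask: np.ndarray) -> float:
    """People count inside a region of interest, given a density map and a
    boolean mask of the same shape marking the monitored region."""
    return float(dmap[mask].sum())
```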
References

[1] ZHANG Y Y, ZHOU D S, CHEN S Q, et al. Single-image crowd counting via multi-column convolutional neural network [C]//IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Las Vegas, USA, 2016: 589–597. DOI: 10.1109/cvpr.2016.70
[2] SAM D B, SURYA S, BABU R V. Switching convolutional neural network for crowd counting [C]//IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Honolulu, USA, 2017. DOI: 10.1109/cvpr.2017.429
[3] ZHANG C, LI H, WANG X, et al. Cross-scene crowd counting via deep convolutional neural networks [C]//IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Boston, USA, 2015: 833–841. DOI: 10.1109/cvpr.2015.7298684
[4] GOODFELLOW I J, POUGET-ABADIE J, MIRZA M, et al. Generative adversarial networks [J]. Advances in neural information processing systems, 2014(3): 2672–2680
[5] MIRZA M, OSINDERO S. Conditional generative adversarial nets [EB/OL]. (2014-11-06) [2018-10-12]
[6] CHEN X, DUAN Y, HOUTHOOFT R, et al. InfoGAN: interpretable representation learning by information maximizing generative adversarial nets [C]//Conference and Workshop on Neural Information Processing Systems. Barcelona, Spain, 2016
[7] ISOLA P, ZHU J-Y, ZHOU T, et al. Image-to-image translation with conditional adversarial networks [EB/OL]. (2016-09-21) [2018-10-12]
[8] RONNEBERGER O, FISCHER P, BROX T. U-Net: convolutional networks for biomedical image segmentation [C]//International Conference on Medical Image Computing and Computer-Assisted Intervention. Munich, Germany, 2015: 234–241. DOI: 10.1007/978-3-319-24574-4_28
[9] LIN Z L, DAVIS L S. Shape-based human detection and segmentation via hierarchical part-template matching [J]. IEEE transactions on pattern analysis and machine intelligence, 2010, 32(4): 604–618. DOI: 10.1109/tpami.2009.204
[10] WANG M, WANG X. Automatic adaptation of a generic pedestrian detector to a specific traffic scene [C]//IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Colorado Springs, USA, 2011. DOI: 10.1109/cvpr.2011.5995698
[11] WU B, NEVATIA R. Detection of multiple, partially occluded humans in a single image by Bayesian combination of edgelet part detectors [C]//Tenth IEEE International Conference on Computer Vision (ICCV'05). Beijing, China, 2005: 90–97. DOI: 10.1109/iccv.2005.74
[12] AN S J, LIU W Q, VENKATESH S. Face recognition using kernel ridge regression [C]//IEEE Conference on Computer Vision and Pattern Recognition. Minneapolis, USA, 2007: 1110–1116. DOI: 10.1109/cvpr.2007.383105
[13] CHAN A B, LIANG Z-S J, VASCONCELOS N. Privacy preserving crowd monitoring: counting people without people models or tracking [C]//IEEE Conference on Computer Vision and Pattern Recognition. Anchorage, USA, 2008: 1–7. DOI: 10.1109/cvpr.2008.4587569
[14] CHEN K, LOY C C, GONG S, et al. Feature mining for localised crowd counting [C]//British Machine Vision Conference. Surrey, UK, 2012. DOI: 10.5244/c.26.21
[15] KONG D, GRAY D, TAO H. A viewpoint invariant approach for crowd counting [C]//International Conference on Pattern Recognition. Hong Kong, China, 2006. DOI: 10.1109/icpr.2006.197
[16] BANSAL A, VENKATESH K S. People counting in high density crowds from still images [EB/OL]. (2015-07-30) [2018-10-12]
[17] RABAUD V, BELONGIE S J. Counting crowded moving objects [C]//IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06). New York, USA, 2006: 705–711. DOI: 10.1109/cvpr.2006.92
[18] BROSTOW G J, CIPOLLA R. Unsupervised Bayesian detection of independent motion in crowds [C]//IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06). New York, USA, 2006: 594–601. DOI: 10.1109/cvpr.2006.320
[19] WANG C, ZHANG H, YANG L, et al. Deep people counting in extremely dense crowds [C]//ACM International Conference on Multimedia. Brisbane, Australia, 2015. DOI: 10.1145/2733373.2806337
[20] BOOMINATHAN L, KRUTHIVENTI S S, BABU R V. CrowdNet: a deep convolutional network for dense crowd counting [C]//ACM Conference on Multimedia. Vienna, Austria, 2016. DOI: 10.1145/2964284.2967300
[21] SINDAGI V A, PATEL V M. Generating high-quality crowd density maps using contextual pyramid CNNs [C]//IEEE International Conference on Computer Vision. Venice, Italy, 2017
[22] LI C, WAND M. Precomputed real-time texture synthesis with Markovian generative adversarial networks [C]//European Conference on Computer Vision. Amsterdam, The Netherlands, 2016: 702–716. DOI: 10.1007/978-3-319-46487-9_43
[23] PATHAK D, KRAHENBUHL P, DONAHUE J, et al. Context encoders: feature learning by inpainting [C]//IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, USA, 2016: 2536–2544. DOI: 10.1109/cvpr.2016.278
[24] JOHNSON J, ALAHI A, LI F F. Perceptual losses for real-time style transfer and super-resolution [C]//European Conference on Computer Vision. Amsterdam, The Netherlands, 2016: 694–711. DOI: 10.1007/978-3-319-46475-6_43
[25] HINTON G E, SALAKHUTDINOV R R. Reducing the dimensionality of data with neural networks [J]. Science, 2006, 313(5786): 504–507. DOI: 10.1126/science.1127647
[26] ZHANG H, SINDAGI V, PATEL V M. Image de-raining using a conditional generative adversarial network [EB/OL]. (2017-01-21) [2018-10-12]
[27] SHEN Z, XU Y, NI B B, et al. Crowd counting via adversarial cross scale consistency pursuit [C]//IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Salt Lake City, USA, 2018: 5245–5254. DOI: 10.1109/cvpr.2018.00550
[28] IDREES H, SALEEMI I, SEIBERT C, et al. Multi-source multi-scale counting in extremely dense crowd images [C]//IEEE Conference on Computer Vision and Pattern Recognition. Portland, USA, 2013: 2547–2554. DOI: 10.1109/cvpr.2013.329