Label Enhancement for Scene Text Detection

doi:10.12142/ZTECOM.202204011

Abstract

Segmentation-based scene text detection has drawn a great deal of attention because its pixel-level predictions can describe text instances of arbitrary shape. However, most segmentation-based methods rely on complex post-processing to separate text instances that lie close to each other, which consumes considerable time during inference. This paper proposes a label enhancement method that constructs two kinds of training labels for segmentation-based scene text detection. Label distribution learning (LDL) is used to overcome the problem caused by pure shrunk text labels, which might result in sub-optimal detection performance. Experimental results on three benchmarks demonstrate that the proposed method consistently improves performance without sacrificing inference speed.

Key words: label enhancement, scene text detection, semantic segmentation
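
As a rough illustration of the contrast the abstract draws between a pure shrunk text label and a softer, distribution-style label, the sketch below builds both for a single axis-aligned text box. Everything in it is an assumption made for illustration: the function names `box_distance_label` and `shrunk_label`, the distance-to-edge weighting, and the 0.5 threshold are hypothetical and are not taken from the paper, whose actual label construction is not described in this abstract.

```python
import numpy as np


def box_distance_label(height, width, box):
    """Hypothetical soft label for one axis-aligned text box.

    box = (x0, y0, x1, y1) in inclusive pixel coordinates.
    Pixels outside the box get 0; values grow from the box edge
    towards its centre and are normalised so the maximum is 1.
    """
    x0, y0, x1, y1 = box
    ys, xs = np.mgrid[0:height, 0:width]
    # Signed distance to the nearest of the four edges (negative outside).
    nearest = np.minimum.reduce([xs - x0, x1 - xs, ys - y0, y1 - ys])
    # +1 so that pixels lying on the box edge still get a positive value.
    dist = np.clip(nearest + 1, 0, None).astype(np.float32)
    peak = dist.max()
    return dist / peak if peak > 0 else dist


def shrunk_label(soft, threshold=0.5):
    """Hard 0/1 'shrunk' label: keep only pixels far enough from the edge."""
    return (soft >= threshold).astype(np.uint8)


# Toy 8x12 image with a single text box.
soft = box_distance_label(8, 12, (2, 1, 9, 6))
kernel = shrunk_label(soft, threshold=0.5)

print("text pixels (soft > 0):", int((soft > 0).sum()))
print("pixels kept by the shrunk label:", int(kernel.sum()))
```

The hard label discards the outer ring of the box entirely, while the soft label keeps those pixels with a lower weight. That difference is the kind of information a distribution-style label can carry and a binary shrunk label cannot.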

MEI Junjun, GUAN Tao, TONG Junwen. Label Enhancement for Scene Text Detection[J]. ZTE Communications, 2022, 20(4): 89-95.

References

1	LIAO M H, ZOU Z S, WAN Z Y, et al. Real-time scene text detection with differentiable binarization and adaptive scale fusion [J]. IEEE transactions on pattern analysis and machine intelligence, 2022, early access. DOI: 10.1109/TPAMI.2022.3155612
2	WANG W H, XIE E Z, LI X, et al. Shape robust text detection with progressive scale expansion network [C]//Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2019: 9336–9345
3	XU N, LIU Y P, GENG X. Label enhancement for label distribution learning [J]. IEEE transactions on knowledge and data engineering, 2021, 33(4): 1632–1643. DOI: 10.1109/TKDE.2019.294704020
4	GENG X, YIN C, ZHOU Z H. Facial age estimation by learning from label distributions [J]. IEEE transactions on pattern analysis and machine intelligence, 2013, 35(10): 2401–2412. DOI: 10.1109/TPAMI.2013.51
5	TIAN Z, HUANG W L, HE T, et al. Detecting text in natural image with connectionist text proposal network [C]//European Conference on Computer Vision. Springer, 2016: 56–72. DOI: 10.1007/978-3-319-46484-8_4
6	LIAO M, SHI B, BAI X. TextBoxes: a fast text detector with a single deep neural network [C]//Thirty-First AAAI Conference on Artificial Intelligence. AAAI, 2017. DOI: 10.1609/aaai.v31i1.11196
7	REN S Q, HE K M, GIRSHICK R, et al. Faster R-CNN: towards real-time object detection with region proposal networks [J]. IEEE transactions on pattern analysis and machine intelligence, 2017, 39(6): 1137–1149. DOI: 10.1109/TPAMI.2016.2577031
8	MA J Q, SHAO W Y, YE H, et al. Arbitrary-oriented scene text detection via rotation proposals [J]. IEEE transactions on multimedia, 2018, 20(11): 3111–3122. DOI: 10.1109/TMM.2018.2818020
9	LIU Y L, JIN L W. Deep matching prior network: toward tighter multi-oriented text detection [C]//IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2017: 3454–3461. DOI: 10.1109/CVPR.2017.368
10	LIAO M H, SHI B G, BAI X. TextBoxes++: a single-shot oriented scene text detector [J]. IEEE transactions on image processing, 2018, 27(8): 3676–3690. DOI: 10.1109/TIP.2018.2825107
11	ZHOU X, YAO C, WEN H, et al. EAST: an efficient and accurate scene text detector [C]//IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2017: 2642–2651. DOI: 10.1109/CVPR.2017.283
12	HE W H, ZHANG X Y, YIN F, et al. Deep direct regression for multi-oriented scene text detection [C]//IEEE International Conference on Computer Vision. IEEE, 2017: 745–753. DOI: 10.1109/ICCV.2017.87
13	LONG J, SHELHAMER E, DARRELL T. Fully convolutional networks for semantic segmentation [C]//IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2015: 3431–3440. DOI: 10.1109/CVPR.2015.7298965
14	ZHANG Z, ZHANG C, SHEN W, et al. Multi-oriented text detection with fully convolutional networks [C]//IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2016: 4159–4167. DOI: 10.1109/CVPR.2016.451
15	YAO C, BAI X, SANG N, et al. Scene text detection via holistic, multi-channel prediction [EB/OL]. (2016-07-05)[2021-06-01].
16	DENG D, LIU H F, LI X L, et al. PixelLink: detecting scene text via instance segmentation [C]//Thirty-Second AAAI Conference on Artificial Intelligence. AAAI, 2018. DOI: 10.1609/aaai.v32i1.12269
17	GAO B-B, XING C, XIE C-W, et al. Deep label distribution learning with label ambiguity [J]. IEEE transactions on image processing, 2017, 26(6): 2825–2838. DOI: 10.1109/TIP.2017.2689998
18	HE K M, ZHANG X Y, REN S Q, et al. Deep residual learning for image recognition [C]//IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2016: 770–778. DOI: 10.1109/CVPR.2016.90
19	NAYEF N, YIN F, BIZID I, et al. ICDAR2017 robust reading challenge on multi-lingual scene text detection and script identification-RRC-MLT [C]//14th IAPR International Conference on Document Analysis and Recognition (ICDAR). IAPR, 2017: 1454–1459. DOI: 10.1109/ICDAR.2017.237
20	CHNG C K, CHAN C S. Total-text: a comprehensive dataset for scene text detection and recognition [C]//14th IAPR International Conference on Document Analysis and Recognition (ICDAR). IAPR, 2017: 935–942. DOI: 10.1109/ICDAR.2017.157
21	YAO C, BAI X, LIU W Y, et al. Detecting texts of arbitrary orientations in natural images [C]//IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2012: 1083–1090. DOI: 10.1109/CVPR.2012.6247787
22	KARATZAS D, GOMEZ-BIGORDA L, NICOLAOU A, et al. ICDAR 2015 competition on robust reading [C]//13th International Conference on Document Analysis and Recognition (ICDAR). IEEE, 2015: 1156–1160. DOI: 10.1109/ICDAR.2015.7333942
23	LYU P Y, YAO C, WU W H, et al. Multi-oriented scene text detection via corner localization and region segmentation [C]//IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2018: 7553–7563. DOI: 10.1109/CVPR.2018.00788
24	LONG S B, RUAN J Q, ZHANG W J, et al. TextSnake: a flexible representation for detecting text of arbitrary shapes [C]//European Conference on Computer Vision. Springer, 2018: 20–36. DOI: 10.1007/978-3-030-01216-8_2
25	WANG X B, JIANG Y Y, LUO Z B, et al. Arbitrary shape scene text detection with adaptive text region representation [C]//IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2019: 6442–6451. DOI: 10.1109/CVPR.2019.00661
26	LYU P Y, LIAO M H, YAO C, et al. Mask TextSpotter: an end-to-end trainable neural network for spotting text with arbitrary shapes [C]//European Conference on Computer Vision. Springer, 2018: 67–83. DOI: 10.1007/978-3-030-01264-9_5
27	XU Y C, WANG Y K, ZHOU W, et al. TextField: learning a deep direction field for irregular scene text detection [J]. IEEE transactions on image processing, 2019, 28(11): 5566–5579. DOI: 10.1109/TIP.2019.2900589
28	ZHANG C Q, LIANG B R, HUANG Z M, et al. Look more than once: an accurate detector for text of arbitrary shapes [C]//IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2019: 10544–10553. DOI: 10.1109/CVPR.2019.01080
29	BAEK Y, LEE B, HAN D, et al. Character region awareness for text detection [C]//IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2019: 9357–9366. DOI: 10.1109/CVPR.2019.00959
30	LIU Z C, LIN G S, YANG S, et al. Towards robust curve text detection with conditional spatial expansion [C]//IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2019: 7261–7270. DOI: 10.1109/CVPR.2019.00744
31	YE J, CHEN Z, LIU J H, et al. TextFuseNet: scene text detection with richer fused features [C]//Twenty-Ninth International Joint Conference on Artificial Intelligence. IJCAI, 2020: 516–522. DOI: 10.24963/ijcai.2020/72
32	HE T, HUANG W L, QIAO Y, et al. Text-attentional convolutional neural network for scene text detection [J]. IEEE transactions on image processing, 2016, 25(6): 2529–2541. DOI: 10.1109/TIP.2016.2547588
33	LIAO M H, ZHU Z, SHI B G, et al. Rotation-sensitive regression for oriented scene text detection [C]//IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2018: 5909–5918. DOI: 10.1109/CVPR.2018.00619
34	LIU Z, LIN G, YANG S, et al. Learning Markov clustering networks for scene text detection [C]//IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2018. DOI: 10.1109/CVPR.2018.00725
35	XUE C H, LU S J, ZHAN F N. Accurate scene text detection through border semantics awareness and bootstrapping [C]//European Conference on Computer Vision. Springer, 2018: 355–372. DOI: 10.1007/978-3-030-01270-0_22
36	XUE C H, LU S J, ZHANG W. MSR: multi-scale shape regression for scene text detection [EB/OL]. (2019-01-09)[2021-06-01].
37	TIAN Z T, SHU M, LYU P Y, et al. Learning shape-aware embedding for scene text detection [C]//IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2019: 4229–4238. DOI: 10.1109/CVPR.2019.00436
38	LIU X, ZHOU G J, ZHANG R, et al. An accurate segmentation-based scene text detector with context attention and repulsive text border [C]//IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW). IEEE, 2020: 2344–2352. DOI: 10.1109/CVPRW50498.2020.00283
39	BAEK Y, SHIN S, BAEK J, et al. Character region attention for text spotting [C]//European Conference on Computer Vision. Springer, 2020: 504–521. DOI: 10.1007/978-3-030-58526-6_30
40	QIN S, BISSACCO A, RAPTIS M, et al. Towards unconstrained end-to-end text spotting [C]//IEEE International Conference on Computer Vision. IEEE, 2019: 4704–4714. DOI: 10.1109/ICCV.2019.00480

Method	Precision/%	Recall/%	F-measure/%
ResNet-18	84.7	77.0	80.6
ResNet-18 + Dis	86.5	80.6	83.5
ResNet-18 + Dis + Bor	88.1	79.9	83.8
ResNet-50	90.5	77.9	83.7
ResNet-50 + Dis	90.9	80.6	85.4
ResNet-50 + Dis + Bor	93.8	81.7	87.3
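
The F-measure column can be cross-checked against the precision and recall columns. The sketch below assumes the standard definition, the harmonic mean F = 2PR / (P + R); the table itself does not state which formula was used. The helper name `f_measure` and the `rows` dictionary are introduced here for illustration only, and the numbers are copied from the table above.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (both given in percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F-measure), copied from the table above.
rows = {
    "ResNet-18": (84.7, 77.0, 80.6),
    "ResNet-18 + Dis": (86.5, 80.6, 83.5),
    "ResNet-18 + Dis + Bor": (88.1, 79.9, 83.8),
    "ResNet-50": (90.5, 77.9, 83.7),
    "ResNet-50 + Dis": (90.9, 80.6, 85.4),
    "ResNet-50 + Dis + Bor": (93.8, 81.7, 87.3),
}

for name, (p, r, reported) in rows.items():
    computed = f_measure(p, r)
    print(f"{name}: computed {computed:.2f}, reported {reported}")
```

Under this assumption, every recomputed value lies within 0.1 of the reported one. Small differences are expected because the reported precision and recall are themselves rounded to one decimal place.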

Method	Precision/%	Recall/%	F-measure/%
TextSnake^[24]	82.7	74.5	78.4
ATRR^[25]	80.9	76.2	78.5
Mask TextSpotter^[26]	82.5	75.6	78.6
TextField^[27]	81.2	79.9	80.6
LOMO*^[28]	87.6	79.3	83.3
CRAFT^[29]	87.6	79.9	83.6
CSE^[30]	81.4	79.1	80.2
PSENet-1s^[2]	84.0	78.0	80.9
TextFuseNet-ResNet-50^[31]	83.2	87.5	85.3
DB-ResNet-50 (800)^[1]	87.1	82.5	84.7
Ours-ResNet-50 (800)	89.1	82.4	85.6

Method	Precision/%	Recall/%	F-measure/%
Text-CNN^[32]	71	61	69
DeepReg^[12]	77	70	74
RRPN^[8]	82	68	74
RRD^[33]	87	73	79
MCN^[34]	88	79	83
PixelLink^[16]	83.0	73.2	77.8
Corner^[23]	87.6	76.2	81.5
TextSnake^[24]	83.2	73.9	78.3
Scene text detection with bootstrapping and semantics-aware text border techniques^[35]	83.0	77.4	80.1
MSR^[36]	87.4	76.7	81.7
CRAFT^[29]	88.2	78.2	82.9
SAE^[37]	84.2	81.7	82.9
DB-ResNet-50 (736)^[1]	91.5	79.2	84.9
An accurate segmentation-based detector^[38]	88.8	83.5	86.1
Ours-ResNet-50 (736)	93.8	81.7	87.3