Abstract
Scene text detection attracts much attention in computer vision, because it can be widely used in many applications such as real-time text translation, automatic information entry, blind person assistance, robot sensing and so on. Though many methods have been proposed for horizontal and oriented texts, detecting irregular shape texts such as curved texts is still a challenging problem. To solve the problem, we propose a robust scene text detection method with adaptive text region representation. Given an input image, a text region proposal network is first used for extracting text proposals. Then, these proposals are verified and refined with a refinement network. Here, a recurrent neural network based adaptive text region representation is proposed for text region refinement, where a pair of boundary points is predicted at each time step until no new points are found. In this way, text regions of arbitrary shapes are detected and represented with an adaptive number of boundary points. This gives a more accurate description of text regions. Experimental results on five benchmarks, namely CTW1500, TotalText, ICDAR2013, ICDAR2015 and MSRA-TD500, show that the proposed method achieves state-of-the-art performance in scene text detection.
- First, an RPN extracts candidate text regions
- A refinement network then verifies and refines these candidates to make them more accurate
- Boundary points are found adaptively
1轧膘、介紹
Text is the most fundamental medium for communicating semantic information. It appears everywhere in daily life: on street nameplates, store signs, product packages, restaurant menus and so on. Such texts in natural environments are known as scene texts. Automatically detecting and recognizing scene texts can be very rewarding, with numerous applications such as real-time text translation, blind person assistance, shopping, robots, smart cars and education. An end-to-end text recognition system usually consists of two steps: text detection and text recognition. In text detection, text regions are detected and labeled with their bounding boxes. In text recognition, text information is retrieved from the detected text regions. Text detection is an important step for end-to-end text recognition, without which texts cannot be recognized from scene images. Therefore, scene text detection has attracted much attention in recent years.
While traditional optical character reader (OCR) techniques can only deal with texts on printed documents or business cards, scene text detection tries to detect various texts in complex scenes. Due to complex backgrounds and variations of font, size, color, language, illumination condition and orientation, scene text detection is a very challenging task, and its performance was poor when hand-designed features and traditional classifiers were used, before deep learning methods became popular. However, the performance has been much improved in recent years, benefitting significantly from the development of deep learning. Meanwhile, the research focus of text detection has shifted from horizontal scene texts [10] to multi-oriented scene texts [9] and more challenging curved or arbitrary shape scene texts [19]. Therefore, this paper focuses on arbitrary shape scene text detection.
In this paper, we propose an arbitrary shape scene text detection method using adaptive text region representation, as shown in Figure 1. Given an input image, a text region proposal network (Text-RPN) is first used for obtaining text proposals. The Convolutional Neural Network (CNN) feature maps of the input image are also obtained in this step. Then, text proposals are verified and refined with a refinement network, whose inputs are the text proposal features obtained by applying region of interest (ROI) pooling to the CNN feature maps. Here, the refinement network has three branches: text/non-text classification, bounding box refinement and recurrent neural network (RNN) based adaptive text region representation. In the RNN, a pair of boundary points is predicted at each time step until the stop label is predicted. In this way, arbitrary shape text regions can be represented with an adaptive number of boundary points. For performance evaluation, the proposed method is tested on five benchmarks, namely CTW1500, TotalText, ICDAR2013, ICDAR2015 and MSRA-TD500. Experimental results show that the proposed method can process not only multi-oriented scene texts but also arbitrary shape scene texts including curved texts. Moreover, it achieves state-of-the-art performance on the five datasets.
2撞羽、相關(guān)工作
Traditional sliding window based and Connected component (CC) based scene text detection methods had been widely used before deep learning became the most promising machine learning tool. Sliding window based methods [27, 32] move a multi-scale window over an image and classify the current patch as text or non-text. CC based methods, especially the Maximally Stable Extremal Regions (MSER) based methods [26, 30], get character candidates by extracting CCs. And then, these candidate CCs are classified as text or non-text. These methods usually adopt a bottom-up strategy and often need several steps to detect texts (e.g., character detection, text line construction and text line classification). As each step may lead to misclassification, the performances of these traditional text detection methods are poor.
Recently, deep learning based methods have become popular in scene text detection. These methods can be divided into three groups: bounding box regression based methods, segmentation based methods, and combined methods. Bounding box regression based methods [5, 8, 11, 12, 13, 16], which are inspired by general object detection methods such as SSD [14] and Faster R-CNN [23], treat text as a kind of object and directly estimate its bounding box as the detection result. Segmentation based methods [3, 19, 33] try to solve the problem by segmenting text regions from the background, and an additional step is needed to get the final bounding boxes. Combined methods [20] use a strategy similar to Mask R-CNN [4], in which both segmentation and bounding box regression are used for better performance. However, their processing time increases because more steps are needed than in previous methods. Among the three kinds of methods, bounding box regression based methods are the most popular in scene text detection, benefitting from the development of general object detection.
- Deep learning based scene text detection methods fall roughly into three groups:
- Bounding box regression methods, inspired by general object detectors (SSD, Faster R-CNN)
- Segmentation based methods
- Combined segmentation + bounding box regression methods, which perform best but take longer to process
For bounding box regression based methods, they can be divided into one-stage methods and two-stage methods. One-stage methods, including Deep Direct Regression [5], TextBoxes [12], TextBoxes++ [11], DMPNet [16], SegLink [24] and EAST [34], directly estimate bounding boxes of text regions in one step. Two-stage methods include R2CNN [8], RRD [13], RRPN [22], IncepText [28] and FEN [31]. They consist of a text proposal generation stage, in which candidate text regions are generated, and a bounding box refinement stage, in which candidate text regions are verified and refined to generate the final detection result. Two-stage methods usually achieve higher performance than one-stage methods. Therefore, the idea of two-stage detection is used in this paper.
While most proposed scene text detection methods can only deal with horizontal or oriented texts, detecting arbitrary shape texts such as curved text has attracted more attention recently. In CTD [17], a polygon of fixed 14 points is used to represent a text region. Meanwhile, recurrent transverse and longitudinal offset connection (TLOC) is proposed for accurate curved text detection. Though a polygon of fixed 14 points is enough for most text regions, it is not enough for some long curved text lines. Besides, 14 points are too many for most horizontal and oriented texts, for which 4 points are enough. In TextSnake [19], a text instance is described as a sequence of ordered, overlapping disks centered at the symmetric axis of the text region. Each disk is associated with a potentially variable radius and orientation, which are estimated via a Fully Convolutional Network (FCN) model. Moreover, Mask TextSpotter [20], which is inspired by Mask R-CNN, can handle text instances of irregular shapes via semantic segmentation. Though TextSnake and Mask TextSpotter can both deal with text of arbitrary shapes, pixel-wise predictions are needed in both, which require heavy computation.
Considering that a polygon with a fixed number of points is not suitable for representing text regions of different shapes, an adaptive text region representation using different numbers of points for texts of different shapes is proposed in this paper. Meanwhile, an RNN is employed to learn the adaptive representation of each text region, with which text regions can be directly labeled and pixel-wise segmentation is not needed.
3. Methodology
Figure 1 shows the flowchart of the proposed method for arbitrary shape text detection, which is a two-stage detection method. It consists of two steps: text proposal and proposal refinement. In text proposal, a Text-RPN is used to generate text proposals for an input image. Meanwhile, the CNN feature maps of the input image are obtained here, which are used in the following step. Then, text proposals are verified and refined through a refinement network. This step includes text/non-text classification, bounding box regression and RNN based adaptive text region representation. Finally, text regions labeled with polygons of an adaptive number of points are output as the detection result.
3.1. Adaptive text region representation
The existing scene text detection methods use polygons with a fixed number of points to represent text regions. For horizontal texts, 2 points (the top-left point and the bottom-right point) are used to represent the text regions. For multi-oriented texts, the 4 points of their bounding boxes are used to represent these regions. Moreover, for curved texts, 14 points are adopted in CTW1500 [17] for text region representation. However, for some very complex scene texts, such as long curved text, even 14 points may not be enough to represent them well. Meanwhile, for most scene texts such as horizontal texts and oriented texts, fewer than 14 points are enough, and using 14 points to represent these text regions is wasteful.
- Horizontal text regions: 2 points are enough
- Multi-oriented text regions: 4 points are enough
- Curved text regions: represented with 14 points
- Long curved text regions: 14 points may not be enough
- Using a fixed 14 points for every text region would clearly be wasteful for horizontal and multi-oriented texts, so this paper determines the number of boundary points adaptively according to the shape of each text region
Therefore, it is reasonable to consider using polygons with adaptive numbers of points to represent text regions. Naturally, we can imagine that corner points on the boundary of a text region can be used for region representation, as shown in Figure 2 (a), which is similar to the method for annotating general objects [1]. However, the points obtained in this way are not arranged along a consistent direction, and it may be difficult to learn such a representation. In the method for annotating general objects, human correction may be needed for accurate segmentation. Considering that text regions usually have approximately symmetric top and bottom boundaries, as shown in Figure 3, using pairwise points from the two boundaries for text region representation may be more suitable. It is much easier to learn the pairwise boundary points from one end of the text region to the other, as shown in Figure 2 (b). In this way, different scene text regions can be represented precisely by different numbers of points, as shown in Figure 3. Moreover, to our knowledge, we are the first to use adaptive numbers of pairwise points for text region representation.
- Adaptive annotation: the region is labeled with pairs of points on the top and bottom boundaries (see the data-structure sketch below)
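For concreteness, the sketch below shows one way such an adaptive pairwise-point representation could be stored; the class name, fields and coordinate values are illustrative, not taken from the paper.

```python
from dataclasses import dataclass
from typing import List, Tuple

Point = Tuple[float, float]

@dataclass
class TextRegion:
    """A text region as an adaptive-length sequence of pairwise boundary points.

    pairs[i] = (top_i, bottom_i): the i-th point on the top boundary and its
    counterpart on the bottom boundary, ordered from one end of the text to the other.
    """
    pairs: List[Tuple[Point, Point]]

    def polygon(self) -> List[Point]:
        # Top boundary from one end to the other, then bottom boundary back,
        # giving a closed polygon with 2 * len(pairs) vertices.
        top = [p for p, _ in self.pairs]
        bottom = [q for _, q in reversed(self.pairs)]
        return top + bottom

# A horizontal word needs only 2 point pairs (a quadrilateral) ...
horizontal = TextRegion(pairs=[((10, 20), (10, 40)), ((90, 20), (90, 40))])
# ... while a curved line can use as many pairs as its shape requires.
curved = TextRegion(pairs=[((10, 60), (12, 80)), ((40, 50), (42, 70)),
                           ((70, 55), (72, 75)), ((100, 65), (102, 85))])
print(len(horizontal.polygon()), len(curved.polygon()))  # 4 and 8 vertices
```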
3.2. Text proposal
When an input image is given, the first step of the proposed method is text proposal, in which text region candidates called text proposals are generated by the Text-RPN. The Text-RPN is similar to the RPN in Faster R-CNN [23] except for different backbone networks and anchor sizes. In the proposed method, the backbone network is SE-VGG16 as shown in Table 1, which is obtained by adding Squeeze-and-Excitation (SE) blocks [7] to VGG16 [25]. As shown in Figure 4, SE blocks adaptively recalibrate channel-wise feature responses by explicitly modelling interdependencies between channels, which can produce significant performance improvement. Here, FC means fully connected layer and ReLU means Rectified Linear Unit function. Moreover, because scene texts usually have different sizes, anchor sizes are set as {32, 64, 128, 256, 512} to cover more texts, while the aspect ratios {0.5, 1, 2} are kept.
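Below is a minimal PyTorch-style sketch of a Squeeze-and-Excitation block in the spirit of [7]; the reduction ratio and tensor sizes are illustrative assumptions, not values from the paper.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-Excitation: global average pooling ("squeeze") followed by a
    two-layer bottleneck with sigmoid gating ("excitation") that rescales channels."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)   # squeeze: B x C x H x W -> B x C x 1 x 1
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        w = self.pool(x).view(b, c)           # per-image channel descriptor
        w = self.fc(w).view(b, c, 1, 1)       # per-channel gates in (0, 1)
        return x * w                          # recalibrated feature maps

# Example: recalibrating a 512-channel conv feature map, as at the top of VGG16.
feat = torch.randn(1, 512, 40, 40)
print(SEBlock(512)(feat).shape)               # torch.Size([1, 512, 40, 40])
```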
3.3. Proposal refinement
After text proposal, text region candidates in the input image have been generated; they are verified and refined in this step. As shown in Figure 1, a refinement network is employed for proposal refinement, which consists of several branches: text/non-text classification, bounding box regression and RNN based adaptive text region representation. Here, text/non-text classification and bounding box regression are similar to those in other two-stage text detection methods, while the last branch is proposed for arbitrary shape text representation.
For the proposed branch, the inputs are the features of each text proposal, which are obtained by applying ROI pooling to the CNN feature maps generated with SE-VGG16. The output target of this branch is an adaptive number of boundary points for each text region. Because the output length changes for different text regions, it is reasonable to use an RNN to predict these points. Therefore, Long Short-Term Memory (LSTM) [6] is used here, which is a kind of RNN and popular for sequence learning problems, such as machine translation, speech recognition, image captioning and text recognition.
Though it is proposed that pairwise boundary points be used for text region representation, the point pairs themselves can be represented in different ways. Naturally, we can imagine using the coordinates of the two points, (x1, y1) and (x2, y2), to represent them. In this way, the coordinates of the pairwise points are used as the regression targets, as shown in Figure 5. However, a point pair can also be represented in a different way: by the coordinate of its center point, the distance from the center point to the two points, and their orientation angle. However, the angle target is not stable in some special situations: two orientations that are almost identical in space can have very different angle values near the ends of the angle range, which makes it hard for the network to learn the angle target well. Besides, the orientation can be represented by the sine and cosine of the angle, which can be predicted stably, but this requires more parameters. Therefore, the coordinates of the points are used as the regression targets in the proposed method.
- One option: the coordinates of the two points
- Another option: the center point, the distance from the center point to each point, and the orientation angle
However, the second option either leaves the angle ambiguous or needs extra parameters to make the angle unambiguous, so the first option is used (see the illustration below).
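A small numeric illustration of this point, not from the paper; it assumes the orientation is measured in (-180°, 180°]:

```python
import math

# Two almost identical directions that sit at opposite ends of the angle range:
# a raw angle regression target jumps discontinuously between them.
a, b = math.radians(179.0), math.radians(-179.0)
print(math.degrees(abs(a - b)))        # 358.0 -> a huge target difference for a ~2 degree change

# Encoding the orientation by its cosine and sine removes the discontinuity,
# at the cost of one extra parameter per angle.
va, vb = (math.cos(a), math.sin(a)), (math.cos(b), math.sin(b))
print(round(math.dist(va, vb), 3))     # 0.035 -> the two targets stay close
```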
The inputs of all time steps in the LSTM used here are the same: the ROI pooling features of the corresponding text proposal. The outputs of each time step are the coordinates of a pair of points on the text region boundary. Meanwhile, as adaptive numbers of points are used for different text regions, a stop label is needed to indicate when the prediction network stops. Because stop label prediction is a classification problem while coordinate prediction is a regression problem, it is not appropriate to put them in the same branch. Therefore, there are two branches at each time step of the LSTM: one for point coordinate regression and one for stop label prediction. At each time step, the coordinates of two pairwise boundary points of the text region and the stop/continue label are predicted. If the label is continue, the coordinates of another two points and a new label are predicted in the next time step. Otherwise, the prediction stops and the text region is represented with the points predicted so far. In this way, text regions in the input image can be detected and represented with different polygons made up of the predicted pairwise points.
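A minimal PyTorch-style sketch of this recurrent prediction loop; the feature dimension, hidden size, head layout and maximum step count are assumptions, not values from the paper.

```python
import torch
import torch.nn as nn

class BoundaryPointDecoder(nn.Module):
    """At every time step, predict one pair of boundary points (4 coordinates)
    and a stop/continue label from the same ROI-pooled proposal feature."""
    def __init__(self, feat_dim: int = 4096, hidden: int = 512, max_steps: int = 20):
        super().__init__()
        self.cell = nn.LSTMCell(feat_dim, hidden)
        self.coord_head = nn.Linear(hidden, 4)   # (x_top, y_top, x_bottom, y_bottom)
        self.stop_head = nn.Linear(hidden, 2)    # continue / stop logits
        self.max_steps = max_steps

    def forward(self, roi_feat: torch.Tensor) -> torch.Tensor:
        # roi_feat: (1, feat_dim) feature of a single text proposal.
        h = roi_feat.new_zeros(1, self.cell.hidden_size)
        c = roi_feat.new_zeros(1, self.cell.hidden_size)
        points = []
        for _ in range(self.max_steps):
            h, c = self.cell(roi_feat, (h, c))              # same input at every step
            points.append(self.coord_head(h))               # one point pair per step
            if self.stop_head(h).argmax(dim=1).item() == 1: # stop label predicted
                break
        return torch.cat(points, dim=0)                     # (num_pairs, 4)

decoder = BoundaryPointDecoder()
print(decoder(torch.randn(1, 4096)).shape)
```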
While Non-Maximum Suppression (NMS) is extensively used to post-process detection candidates in general object detection methods, it is also needed in the proposed method. As the detected text regions are represented with polygons, normal NMS, which is computed based on the area of horizontal bounding boxes, is not suitable here. Instead, a polygon NMS is used, which is computed based on the area of the text region polygons. After NMS, the remaining text regions are output as the detection result.
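A minimal sketch of such a polygon NMS, using Shapely to compute polygon IoU; the greedy scheme and the threshold value are assumptions, not details from the paper.

```python
from shapely.geometry import Polygon

def polygon_nms(polygons, scores, iou_thresh=0.3):
    """Greedy NMS where overlap is the IoU of the detected polygons themselves,
    not of their axis-aligned bounding boxes.

    polygons: list of vertex lists [(x1, y1), (x2, y2), ...]; scores: list of floats.
    Returns the indices of the kept detections."""
    order = sorted(range(len(polygons)), key=lambda i: scores[i], reverse=True)
    shapes = [Polygon(p) for p in polygons]
    keep = []
    while order:
        i = order.pop(0)
        keep.append(i)
        remaining = []
        for j in order:
            inter = shapes[i].intersection(shapes[j]).area
            union = shapes[i].union(shapes[j]).area
            if union == 0 or inter / union <= iou_thresh:
                remaining.append(j)   # keep candidates that do not overlap too much
        order = remaining
    return keep
```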
3.4. Training objective
As the Text-RPN in the proposed method is similar to the RPN in Faster R-CNN [23], the training loss of the Text-RPN is also computed in a similar way. Therefore, in this section, we only focus on the loss function of the refinement network in proposal refinement. The loss defined on each proposal is the sum of a text/non-text classification loss, a bounding box regression loss, a boundary points regression loss and a stop/continue label classification loss. The multi-task loss function on each proposal is defined as:
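A plausible form of this multi-task loss, written with assumed notation (a sketch reconstructed from the description, not the paper's exact equation):

```latex
L = L_{cls}(p, t)
  + \lambda_1 \, t \, L_{reg}(v, v^{*})
  + \lambda_2 \, t \, L_{ap}(u, u^{*})
  + \lambda_3 \, t \, L_{stop}(l, l^{*})
```

Here L_cls is the text/non-text classification loss, L_reg the bounding box regression loss, L_ap the boundary points regression loss and L_stop the stop/continue label classification loss; the factor t restricts the regression and stop terms to positive (text) proposals.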
λ1, λ2 and λ3 are balancing parameters that control the trade-off between these terms; they are all set to 1 in the proposed method.
For the text/non-text classification loss term, t is the indicator of the class label: text is labeled as 1 (t = 1), and background is labeled as 0 (t = 0). The parameter p is the probability over the text and background classes computed after softmax. Then, Lcls(p, t) = -log p_t is the log loss for the true class t.
For the bounding box regression loss term, v = (v_x, v_y, v_w, v_h) is a tuple of true bounding box regression targets, including the coordinates of the center point and its width and height, and v* = (v*_x, v*_y, v*_w, v*_h) is the predicted tuple for each text proposal. We use the parameterization for v and v* given in Faster R-CNN [23], in which v and v* specify a scale-invariant translation and a log-space height/width shift relative to an object proposal.
For the boundary points regression loss term, u = (u_x1, u_y1, ..., u_xn, u_yn) is a tuple of the true coordinates of the boundary points, and u* is the predicted tuple for the text label. To make the learned points suitable for texts of different scales, the learning targets are also processed to make them scale invariant. The parameters are processed as follows:

u_xi = (x_i - x_a) / w_a,  u_yi = (y_i - y_a) / h_a

where x_i and y_i denote the coordinates of the boundary points, x_a and y_a denote the coordinates of the center point of the corresponding text proposal, and w_a and h_a denote the width and height of this proposal.
Let w indicate v or u; Lreg(w, w*) is defined as the smooth L1 loss as in Faster R-CNN [23]:

smooth_L1(x) = 0.5 x^2 if |x| < 1, and |x| - 0.5 otherwise.
For the stop/continue label classification loss term, it is also a binary classification, and its loss takes the same form as the text/non-text classification loss.
4. Experiments
4.1. Datasets
Five benchmarks are used in this paper for performance evaluation, which are introduced in the following:
- CTW1500: The CTW1500 dataset [17] contains 500 test images and 1000 training images, which contain multi-oriented text, curved text and irregular shape text. Text regions in this dataset are labeled with 14 scene text boundary points at sentence level.
- TotalText: The TotalText dataset [2] consists of 300 test images and 1255 training images with more than 3 different text orientations: horizontal, multi-oriented, and curved. The texts in these images are labeled at word level with an adaptive number of corner points.
- ICDAR2013: The ICDAR2013 dataset [10] contains focused scene texts from the ICDAR Robust Reading Competition. It includes 233 test images and 229 training images. The scene texts are horizontal and labeled with horizontal bounding boxes made up of 2 points at word level.
- ICDAR2015: The ICDAR2015 dataset [9] focuses on incidental scene text in the ICDAR Robust Reading Competition. It includes 500 test images and 1000 training images. The scene texts have different orientations and are labeled with inclined boxes made up of 4 points at word level.
- MSRA-TD500: The MSRA-TD500 dataset [29] contains 200 test images and 300 training images, which contain arbitrarily-oriented texts in both Chinese and English. The texts are labeled with inclined boxes made up of 4 points at sentence level. Some long straight text lines exist in the dataset.
The evaluation for text detection follows the ICDAR evaluation protocol in terms of Recall, Precision and Hmean. Recall is the ratio of the number of correctly detected text regions to the total number of text regions in the dataset, while Precision is the ratio of the number of correctly detected text regions to the total number of detected text regions. Hmean is a single quality measure combining recall and precision (their harmonic mean, Hmean = 2 · Precision · Recall / (Precision + Recall)). A detected text region is considered correct if its overlap with the ground truth text region is larger than a given threshold. The computation of the three evaluation terms is usually different for different datasets. While the results on ICDAR2013 and ICDAR2015 can be evaluated through the ICDAR robust reading competition platform, the results on the other three datasets can be evaluated with the evaluation methods provided for them.
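A tiny self-contained illustration of these three measures (the counts in the example are made up):

```python
def detection_metrics(num_correct: int, num_detected: int, num_gt: int):
    """Precision, recall and their harmonic mean (Hmean), as in the ICDAR protocol."""
    precision = num_correct / num_detected if num_detected else 0.0
    recall = num_correct / num_gt if num_gt else 0.0
    denom = precision + recall
    hmean = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, hmean

# e.g. 90 correct detections out of 100 predictions, with 120 ground-truth regions
print(detection_metrics(90, 100, 120))  # (0.9, 0.75, 0.818...)
```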
4.2. Implementation details
Our scene text detection network is initialized with the pretrained VGG16 model for ImageNet classification. When the proposed method is tested on the five datasets, different models are used for them, which are trained using only the training images of each dataset with data augmentation. All models are trained for 10 × 10^4 iterations in total. Learning rates start from 10^-3, and are multiplied by 1/10 after 2 × 10^4, 6 × 10^4 and 8 × 10^4 iterations. We use 0.0005 weight decay and 0.9 momentum. We use multi-scale training, setting the short side of training images as {400, 600, 720, 1000, 1200}, while maintaining the long side at 2000.
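A concrete sketch of this multi-scale resizing rule; choosing the short side at random per image and capping the long side are assumptions about how the rule is applied.

```python
import random

TRAIN_SHORT_SIDES = [400, 600, 720, 1000, 1200]
MAX_LONG_SIDE = 2000

def training_scale(width: int, height: int) -> float:
    """Pick a random short-side target and return the resize factor,
    shrinking further if the long side would exceed MAX_LONG_SIDE."""
    short, long_side = min(width, height), max(width, height)
    scale = random.choice(TRAIN_SHORT_SIDES) / short
    if long_side * scale > MAX_LONG_SIDE:
        scale = MAX_LONG_SIDE / long_side
    return scale

w, h = 1280, 720
s = training_scale(w, h)
print(round(w * s), round(h * s))   # resized width and height for one training sample
```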
Because adaptive text region representation is used in the proposed method, it can be used directly on these datasets, whose text regions are labeled with different numbers of points. As ICDAR2013, ICDAR2015 and MSRA-TD500 are labeled with quadrilateral boxes, they are easy to transform into pairwise points. However, for the CTW1500 dataset and the TotalText dataset, some operations are needed to transform the ground truths into the form we need.
Text regions in CTW1500 are labeled with 14 points, which need to be transformed into an adaptive number of pairwise points. First, the 14 points are grouped into 7 point pairs. Then, we compute the intersection angle for each point, which is the angle between the two vectors from the current point to its two neighboring points. For each point pair, the angle is the smaller of its two points' angles. Next, point pairs are sorted by their angles in descending order and we try to remove each point pair in that order. If the ratio of the polygon area after the removal to the original area is larger than 0.93, the point pair is removed. Otherwise, the operation stops and the remaining points are used in training for text region representation.
- The CTW1500 conversion procedure is not entirely clear; revisit it when actually using it (a rough sketch is given below).
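Since the note above flags this conversion as unclear, here is a rough pure-Python sketch of the procedure as described; the vertex ordering (point i pairing with point 13 - i) and the exact angle computation are assumptions.

```python
import math

def polygon_area(pts):
    """Shoelace formula for the area of a simple polygon."""
    return 0.5 * abs(sum(x1 * y2 - x2 * y1
                         for (x1, y1), (x2, y2) in zip(pts, pts[1:] + pts[:1])))

def interior_angle(poly, i):
    """Angle at vertex i between the vectors to its two neighbours (pi = locally straight)."""
    p, a, b = poly[i], poly[i - 1], poly[(i + 1) % len(poly)]
    v1, v2 = (a[0] - p[0], a[1] - p[1]), (b[0] - p[0], b[1] - p[1])
    cos = (v1[0] * v2[0] + v1[1] * v2[1]) / ((math.hypot(*v1) * math.hypot(*v2)) or 1e-9)
    return math.acos(max(-1.0, min(1.0, cos)))

def simplify_ctw1500(points14, keep_ratio=0.93):
    """points14: 14 polygon vertices, assumed ordered so that vertex i on the top
    boundary pairs with vertex 13 - i on the bottom boundary (7 pairs in total)."""
    poly = list(points14)
    pairs = [(i, 13 - i) for i in range(7)]
    original_area = polygon_area(poly)
    # Flattest pairs first: sort by the smaller of the two vertex angles, descending.
    order = sorted(pairs, key=lambda pr: min(interior_angle(poly, pr[0]),
                                             interior_angle(poly, pr[1])), reverse=True)
    removed = set()
    for i, j in order:
        candidate = [p for k, p in enumerate(poly) if k not in removed | {i, j}]
        if len(candidate) >= 4 and polygon_area(candidate) / original_area > keep_ratio:
            removed |= {i, j}   # dropping this pair keeps more than 93% of the area
        else:
            break               # stop at the first pair whose removal changes the shape too much
    return [p for k, p in enumerate(poly) if k not in removed]
```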
Moreover, text regions in TotalText are labeled with an adaptive number of points, but these points are not pairwise. For text regions labeled with an even number of points, it is easy to group them into pairs. For text regions labeled with an odd number of points, the two start points and the two end points should be found first, and then the points corresponding to the remaining points are found based on their distances to the start points along the boundary.
The results of the proposed method are obtained on single-scale input images with one trained model. Because the test image scale has a large impact on the detection results (for example, FOTS [15] uses different scales for different datasets), we also use different test scales for different datasets for best performance. In our experiments, the scale for ICDAR2013 is 960 × 1400, the scale for ICDAR2015 is 1200 × 2000 and the scales for the other datasets are all 720 × 1280.
The proposed method is implemented in Caffe and the experiments are performed using an Nvidia P40 GPU.
4.3. Ablation study
In the proposed method the backbone network is SE-VGG16, while VGG16 is usually used by other state-of-the-art methods. To verify the effectiveness of the backbone network, we test the proposed method with different backbone networks (SE-VGG16 vs VGG16) on the CTW1500 dataset and the ICDAR2015 dataset, as shown in Table 2. The results show that SE-VGG16 is better than VGG16, achieving better performance on both datasets.
Meanwhile, an adaptive text region representation is proposed for texts of arbitrary shapes in this paper. To validate its effectiveness for scene text detection, we add an ablation study on text region representation on the CTW1500 dataset. For comparison, the fixed text region representation directly uses the fixed 14 points as the regression targets in the experiment. Table 3 shows the experimental results of the different text region representation methods on the CTW1500 dataset. The recall of the method with adaptive representation is much higher than with the fixed representation (80.2% vs 76.4%). This justifies that the adaptive text region representation is more suitable for texts of arbitrary shapes.
4.4. Comparison with State-of-the-arts
To show the performance of the proposed method for texts of different shapes, we test it on several benchmarks. We first compare its performance with the state-of-the-art on CTW1500 and TotalText, which both contain challenging multi-oriented and curved texts. Then we compare the methods on the two most widely used benchmarks: ICDAR2013 and ICDAR2015. Finally, we compare them on MSRA-TD500, which contains long straight text lines and multi-language texts (Chinese+English).
Table 4 and Table 5 compare the proposed method with state-of-the-art methods on CTW1500 and TotalText, respectively. The proposed method is much better than all other methods on CTW1500, including the methods designed for curved texts such as CTD, CTD+TLOC and TextSnake (Hmean: 80.1% vs 69.5%, 73.4% and 75.6%). Meanwhile, it also achieves better performance (Hmean: 78.5%) than all other methods on TotalText. The performances on these two datasets, which contain challenging multi-oriented and curved texts, show that the proposed method can detect scene texts of arbitrary shapes.
Table 6 shows the experimental results on the ICDAR2013 dataset. The proposed method achieves the best performance, tied with Mask TextSpotter, both with an Hmean of 91.7%. Because the proposed method is tested on single-scale input images with a single model, only the results generated in this setting are used here. The results show that the proposed method can also process horizontal text well.
Table 7 shows the experimental results on the ICDAR2015 dataset, where the proposed method achieves the second best performance, only a little lower than FOTS (Hmean: 87.6% vs 88.0%). While FOTS is trained end-to-end by combining text detection and recognition, the proposed method is only trained for text detection, which is much easier to train than FOTS. The results tested on single-scale input images with a single model are used here. The results show that the proposed method achieves comparable performance with the state-of-the-art, which means it can also process multi-oriented text well.
Table 8 shows the results on the MSRA-TD500 dataset, indicating that our detection method supports long straight text line detection and Chinese+English detection well. It achieves an Hmean of 83.6% and is better than all other methods.
4.5. Speed
The speed of the proposed method is compared with two other methods as shown in Table 9, all of which are able to deal with arbitrary shape scene text. From the results, we can see that the speed of the proposed method is much faster than the other two methods. While pixel-wise prediction is needed in Mask TextSpotter and TextSnake, it is not needed in the proposed method, which therefore requires less computation.
4.6. Qualitative results
Figure 6 illustrates qualitative results on CTW1500, TotalText, ICDAR2013, ICDAR2015 and MSRA-TD500. It shows that the proposed method can deal with various texts that are arbitrarily oriented or curved, in different languages, under non-uniform illumination, and of different lengths at word level or sentence level.
5. Conclusion
In this paper, we propose a robust arbitrary shape scene text detection method with adaptive text region representation. After text proposal using a Text-RPN, each text region is verified and refined using an RNN that predicts an adaptive number of boundary points. Experiments on five benchmarks show that the proposed method can not only detect horizontal and oriented scene texts but also work well for arbitrary shape scene texts. In particular, it outperforms existing methods significantly on CTW1500 and MSRA-TD500, which are typical of curved texts and multi-oriented texts, respectively. In the future, the proposed method can be improved in several aspects. First, arbitrary shape scene text detection may be improved by using corner point detection, which would require easier annotations for training images. Second, to fulfill the final goal of text recognition, end-to-end recognition for arbitrary shape scene texts will be considered.