Tesseract Source Code Analysis (2): Recognition and Error Correction

Main data structures in Tesseract 4.0

  1. Page analysis result: PAGE_RES (ccstruct/pageres.h).
  2. PAGE_RES holds its block results in a list field of type BLOCK_RES_LIST.
  3. Block analysis result: BLOCK_RES (ccstruct/pageres.h).
  4. BLOCK_RES holds its row results in a list field of type ROW_RES_LIST.
  5. Row analysis result: ROW_RES (ccstruct/pageres.h).
  6. ROW_RES holds its word results in a list field of type WERD_RES_LIST.
  7. Word analysis result: WERD_RES (ccstruct/pageres.h), a collection of publicly accessible members that gathers the information about one recognized word.
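
The nesting listed above can be pictured with a minimal sketch. The `*Sketch` structs and their fields below are hypothetical stand-ins, not the real classes from ccstruct/pageres.h, which carry many more members and use Tesseract's own list types rather than `std::vector`.

```cpp
#include <string>
#include <vector>

// Hypothetical, heavily simplified stand-ins for the result hierarchy.
struct WordResSketch {
  std::string best_choice;  // recognized text for this word
};

struct RowResSketch {
  std::vector<WordResSketch> words;  // plays the role of WERD_RES_LIST
};

struct BlockResSketch {
  std::vector<RowResSketch> rows;  // plays the role of ROW_RES_LIST
};

struct PageResSketch {
  std::vector<BlockResSketch> blocks;  // plays the role of BLOCK_RES_LIST
};

// Visit every word in reading order: block by block, row by row.
std::vector<std::string> CollectWords(const PageResSketch& page) {
  std::vector<std::string> out;
  for (const BlockResSketch& block : page.blocks) {
    for (const RowResSketch& row : block.rows) {
      for (const WordResSketch& word : row.words) {
        out.push_back(word.best_choice);
      }
    }
  }
  return out;
}
```

Walking page, block, row, and word in that order is the traversal that the recognition passes below repeat over the real structures.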

Source code analysis

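Tesseract's text recognition pipeline consists mainly of binarization, segmentation, recognition, and error correction. The previous article summarized binarization and segmentation; this article summarizes how the recognition and error-correction steps are processed.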

Character recognition

pass 1 recognition

Classify the blobs in the word and permute the results. Find the worst blob in the word and chop it up. Continue this process until a good answer has been found or all the blobs have been chopped up enough. The results are returned in the WERD_RES.

  • Call stack
    1. main [api/tesseractmain.cpp] ->
    2. TessBaseAPI::ProcessPages [api/baseapi.cpp] ->
    3. TessBaseAPI::ProcessPage [api/baseapi.cpp] ->
    4. TessBaseAPI::Recognize [api/baseapi.cpp] ->
    5. Tesseract::recog_all_words [ccmain/control.cpp] ->
    6. Tesseract::RecogAllWordsPassN [ccmain/control.cpp] ->
    7. Tesseract::classify_word_and_language [ccmain/control.cpp] ->
    8. Tesseract::classify_word_pass1 [ccmain/control.cpp] ->
    9. Tesseract::match_word_pass_n [ccmain/control.cpp] ->
    10. Tesseract::tess_segment_pass_n [ccmain/tessbox.cpp] ->
    11. **Wordrec::set_pass1() [wordrec/tface.cpp]** ->
    12. Tesseract::recog_word [ccmain/tfacepp.cpp] ->
    13. Tesseract::recog_word_recursive [ccmain/tfacepp.cpp] ->
    14. Wordrec::cc_recog [wordrec/tface.cpp] ->
    15. Wordrec::chop_word_main [wordrec/chopper.cpp]
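
The classify, find-worst, chop loop described for pass 1 can be sketched as follows. This is an illustration under loud assumptions, not the code in wordrec/chopper.cpp: `BlobSketch`, `RateBlob`, and `ChopUntilGood` are hypothetical names, the width-based rating stands in for the real classifier, and splitting a blob in half stands in for the real chopping step.

```cpp
#include <cstddef>
#include <vector>

struct BlobSketch {
  int width;  // blob width in pixels
};

// Hypothetical stand-in for the classifier: a lower rating is better, and
// blobs wider than a nominal character get a worse rating.
double RateBlob(const BlobSketch& blob, int nominal_width) {
  return blob.width <= nominal_width
             ? 0.0
             : static_cast<double>(blob.width - nominal_width);
}

// Repeatedly find the worst-rated blob and split it in two, until every blob
// is rated at or below max_rating or the worst blob is too narrow to split.
std::vector<BlobSketch> ChopUntilGood(std::vector<BlobSketch> blobs,
                                      int nominal_width, double max_rating) {
  while (!blobs.empty()) {
    std::size_t worst = 0;
    for (std::size_t i = 1; i < blobs.size(); ++i) {
      if (RateBlob(blobs[i], nominal_width) >
          RateBlob(blobs[worst], nominal_width)) {
        worst = i;
      }
    }
    if (RateBlob(blobs[worst], nominal_width) <= max_rating) break;  // good answer
    if (blobs[worst].width < 2) break;  // chopped up enough
    const int left = blobs[worst].width / 2;
    const int right = blobs[worst].width - left;
    blobs[worst].width = left;
    blobs.insert(blobs.begin() + static_cast<std::ptrdiff_t>(worst) + 1,
                 BlobSketch{right});
  }
  return blobs;
}
```

The two `break` statements mirror the two stopping conditions in the description: a good answer has been found, or the blobs have been chopped up enough.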

pass 2 recognition

Pass 1 and pass 2 follow the same call stack. The only difference is the step marked with ** in each list, where Wordrec::set_pass2() is called instead of Wordrec::set_pass1().

  • Call stack
    1. main [api/tesseractmain.cpp] ->
    2. TessBaseAPI::ProcessPages [api/baseapi.cpp] ->
    3. TessBaseAPI::ProcessPage [api/baseapi.cpp] ->
    4. TessBaseAPI::Recognize [api/baseapi.cpp] ->
    5. Tesseract::recog_all_words [ccmain/control.cpp] ->
    6. Tesseract::RecogAllWordsPassN [ccmain/control.cpp] ->
    7. Tesseract::classify_word_and_language [ccmain/control.cpp] ->
    8. Tesseract::classify_word_pass2 [ccmain/control.cpp] ->
    9. Tesseract::match_word_pass_n [ccmain/control.cpp] ->
    10. Tesseract::tess_segment_pass_n [ccmain/tessbox.cpp] ->
    11. **Wordrec::set_pass2() [wordrec/tface.cpp]** ->
    12. Tesseract::recog_word [ccmain/tfacepp.cpp] ->
    13. Tesseract::recog_word_recursive [ccmain/tfacepp.cpp] ->
    14. Wordrec::cc_recog [wordrec/tface.cpp] ->
    15. Wordrec::chop_word_main [wordrec/chopper.cpp]

LSTM recognition (part of pass 1 recognition)

  1. main [api/tesseractmain.cpp] ->
  2. TessBaseAPI::ProcessPages [api/baseapi.cpp] ->
  3. TessBaseAPI::ProcessPage [api/baseapi.cpp] ->
  4. TessBaseAPI::Recognize [api/baseapi.cpp] ->
  5. Tesseract::recog_all_words [ccmain/control.cpp] ->
  6. Tesseract::RecogAllWordsPassN [ccmain/control.cpp] ->
  7. Tesseract::classify_word_and_language [ccmain/control.cpp] ->
  8. Tesseract::classify_word_pass1 [ccmain/control.cpp] ->
  9. Tesseract::LSTMRecognizeWord [ccmain/linerec.cpp] ->
  10. LSTMRecognizer::RecognizeLine [lstm/lstmrecognizer.cpp] ->
  11. LSTMRecognizer::RecognizeLine [lstm/lstmrecognizer.cpp] ->
  12. Tesseract::SearchWords [ccmain/linerec.cpp]

The following passes are only required when running the Tess-only (legacy, non-LSTM) engine.

pass 3 recognition

Walk over the page finding sequences of words joined by fuzzy spaces. Extract them as a sublist, process the sublist to find the optimal arrangement of spaces then replace the sublist in the ROW_RES.

  1. main [api/tesseractmain.cpp] ->
  2. TessBaseAPI::ProcessPages [api/baseapi.cpp] ->
  3. TessBaseAPI::ProcessPage [api/baseapi.cpp] ->
  4. TessBaseAPI::Recognize [api/baseapi.cpp] ->
  5. Tesseract::recog_all_words [ccmain/control.cpp] ->
  6. Tesseract::fix_fuzzy_spaces [ccmain/fixspace.cpp] ->
  7. Tesseract::fix_sp_fp_word [ccmain/fixspace.cpp] ->
  8. Tesseract::fix_fuzzy_space_list [ccmain/fixspace.cpp]
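
The first half of pass 3, finding sequences of words joined by fuzzy spaces so they can be extracted as a sublist, can be sketched as below. `WordSketch`, its `fuzzy_gap_before` flag, and `FindFuzzyRuns` are hypothetical names for illustration. The sketch does not cover the second half, where the optimal arrangement of spaces is chosen.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

struct WordSketch {
  bool fuzzy_gap_before;  // hypothetical flag: the gap to the previous word is uncertain
};

// Return [begin, end) index ranges of maximal runs of two or more words that
// are joined by fuzzy gaps. Each range is a candidate sublist to re-examine.
std::vector<std::pair<std::size_t, std::size_t>> FindFuzzyRuns(
    const std::vector<WordSketch>& row) {
  std::vector<std::pair<std::size_t, std::size_t>> runs;
  std::size_t begin = 0;
  for (std::size_t i = 1; i <= row.size(); ++i) {
    // A run continues only while the next word is attached by a fuzzy gap.
    const bool joined = i < row.size() && row[i].fuzzy_gap_before;
    if (!joined) {
      if (i - begin >= 2) runs.emplace_back(begin, i);
      begin = i;
    }
  }
  return runs;
}
```

Words separated by a definite space never end up in the same range, so each returned range can be processed independently and then written back into its row.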

pass 4 recognition

dictionary_correction_pass

If a word has multiple alternates check if the best choice is in the dictionary. If not, replace it with an alternate that exists in the dictionary.

  1. main [api/tesseractmain.cpp] ->
  2. TessBaseAPI::ProcessPages [api/baseapi.cpp] ->
  3. TessBaseAPI::ProcessPage [api/baseapi.cpp] ->
  4. TessBaseAPI::Recognize [api/baseapi.cpp] ->
  5. Tesseract::recog_all_words [ccmain/control.cpp] ->
  6. Tesseract::dictionary_correction_pass [ccmain/control.cpp]

bigram_correction_pass

  1. main [api/tesseractmain.cpp] ->
  2. TessBaseAPI::ProcessPages [api/baseapi.cpp] ->
  3. TessBaseAPI::ProcessPage [api/baseapi.cpp] ->
  4. TessBaseAPI::Recognize [api/baseapi.cpp] ->
  5. Tesseract::recog_all_words [ccmain/control.cpp] ->
  6. Tesseract::bigram_correction_pass [ccmain/control.cpp]
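
The rule stated for dictionary_correction_pass in this section (keep the best choice if it is a dictionary word, otherwise switch to an alternate that is) can be sketched as follows. `PickDictionaryChoice` is a hypothetical helper, and the `std::set` of strings is an assumed stand-in for Tesseract's dictionary lookup.

```cpp
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// choices[0] is the current best choice; the rest are alternates in ranked
// order. Returns the index of the choice that should be kept.
std::size_t PickDictionaryChoice(const std::vector<std::string>& choices,
                                 const std::set<std::string>& dictionary) {
  if (choices.size() < 2) return 0;                // no alternates to consider
  if (dictionary.count(choices[0]) > 0) return 0;  // best choice is already a word
  for (std::size_t i = 1; i < choices.size(); ++i) {
    if (dictionary.count(choices[i]) > 0) return i;  // first alternate in the dictionary
  }
  return 0;  // nothing better found: keep the best choice
}
```

Taking the first matching alternate assumes the alternates are already ordered from most to least likely.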

pass 5 recognition

Gather statistics on rejects.

  1. main [api/tesseractmain.cpp] ->
  2. TessBaseAPI::ProcessPages [api/baseapi.cpp] ->
  3. TessBaseAPI::ProcessPage [api/baseapi.cpp] ->
  4. TessBaseAPI::Recognize [api/baseapi.cpp] ->
  5. Tesseract::recog_all_words [ccmain/control.cpp] ->
  6. Tesseract::rejection_passes [ccmain/control.cpp] ->
  7. REJMAP::rej_word_bad_quality [ccstruct/rejctmap.cpp]

pass 6 recognition

Perform the whole-document or whole-block rejection pass.

  1. main [api/tesseractmain.cpp] ->
  2. TessBaseAPI::ProcessPages [api/baseapi.cpp] ->
  3. TessBaseAPI::ProcessPage [api/baseapi.cpp] ->
  4. TessBaseAPI::Recognize [api/baseapi.cpp] ->
  5. Tesseract::recog_all_words [ccmain/control.cpp] ->
  6. Tesseract::rejection_passes [ccmain/control.cpp] ->
  7. Tesseract::quality_based_rejection [ccmain/docqual.cpp] ->
  8. Tesseract::doc_and_block_rejection [ccmain/docqual.cpp] ->
  9. reject_whole_page [ccmain/docqual.cpp] ->
  10. REJMAP::rej_word_block_rej [ccstruct/rejctmap.cpp]
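
One way to picture whole-document versus whole-block rejection is a percentage test on the reject statistics gathered in pass 5. The sketch below assumes that rule; `BlockStatsSketch`, `RejectPercent`, `RejectByQuality`, and both limit parameters are hypothetical, and the real logic in ccmain/docqual.cpp takes more conditions into account.

```cpp
#include <vector>

struct BlockStatsSketch {
  int char_count;       // characters recognized in the block
  int reject_count;     // characters already marked as rejected
  bool rejected_whole;  // output: the whole block is rejected
};

double RejectPercent(int rejects, int chars) {
  return chars > 0 ? 100.0 * rejects / chars : 0.0;
}

// Assumed rule: reject everything when the document-wide reject percentage
// exceeds doc_percent_limit; otherwise reject only the blocks whose own
// percentage exceeds block_percent_limit.
void RejectByQuality(std::vector<BlockStatsSketch>& blocks,
                     double doc_percent_limit, double block_percent_limit) {
  int doc_chars = 0;
  int doc_rejects = 0;
  for (const BlockStatsSketch& b : blocks) {
    doc_chars += b.char_count;
    doc_rejects += b.reject_count;
  }
  const bool reject_doc =
      RejectPercent(doc_rejects, doc_chars) > doc_percent_limit;
  for (BlockStatsSketch& b : blocks) {
    b.rejected_whole =
        reject_doc ||
        RejectPercent(b.reject_count, b.char_count) > block_percent_limit;
  }
}
```

The limits are passed in as arguments because the sketch makes no claim about the values Tesseract actually uses.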

The source code does not appear to contain a pass 7.

pass 8 recognition

Smooth the fonts for the document.

  1. main [api/tesseractmain.cpp] ->
  2. TessBaseAPI::ProcessPages [api/baseapi.cpp] ->
  3. TessBaseAPI::ProcessPage [api/baseapi.cpp] ->
  4. TessBaseAPI::Recognize [api/baseapi.cpp] ->
  5. Tesseract::recog_all_words [ccmain/control.cpp] ->
  6. Tesseract::font_recognition_pass [ccmain/control.cpp]

pass 9 recognition

Check the correctness of the final results.

  1. main [api/tesseractmain.cpp] ->
  2. TessBaseAPI::ProcessPages [api/baseapi.cpp] ->
  3. TessBaseAPI::ProcessPage [api/baseapi.cpp] ->
  4. TessBaseAPI::Recognize [api/baseapi.cpp] ->
  5. Tesseract::recog_all_words [ccmain/control.cpp] ->
  6. Tesseract::blamer_pass [ccmain/control.cpp] ->
  7. Tesseract::script_pos_pass [ccmain/control.cpp]

After all recognition passes, Tesseract removes empty words, because they would confuse the result iterators.
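
The cleanup step amounts to filtering a list in place. A minimal sketch follows, assuming that a word counts as empty when its text is empty; `RemoveEmptyWords` is a hypothetical helper and Tesseract's own test for an empty word may differ.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Drop empty entries while keeping the remaining words in their original order.
void RemoveEmptyWords(std::vector<std::string>& words) {
  words.erase(std::remove_if(words.begin(), words.end(),
                             [](const std::string& w) { return w.empty(); }),
              words.end());
}
```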

Paragraph detection

This is called after rows have been identified and words are recognized. Much of this could be implemented before word recognition, but text helps to identify bulleted lists and gives good signals for sentence boundaries.

pass 1 detection

Detect sequences of lines that all contain leader dots (.....). These are likely tables of contents. If three text lines in a row contain leader dots, it is fairly safe to treat the middle one as a paragraph of its own.

  1. main [api/tesseractmain.cpp] ->
  2. TessBaseAPI::ProcessPages [api/baseapi.cpp] ->
  3. TessBaseAPI::ProcessPage [api/baseapi.cpp] ->
  4. TessBaseAPI::Recognize [api/baseapi.cpp] ->
  5. TessBaseAPI::DetectParagraphs [api/baseapi.cpp] ->
  6. DetectParagraphs [ccmain/paragraphs.cpp] ->
  7. DetectParagraphs [ccmain/paragraphs.cpp] ->
  8. SeparateSimpleLeaderLines [ccmain/paragraphs.cpp] ->
  9. LeftoverSegments [ccmain/paragraphs.cpp]
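
The three-in-a-row rule can be sketched on plain text lines. Detecting leader dots as a run of at least `min_dots` periods is an assumption made for this example, and `HasLeaderDots` and `MarkLeaderParagraphs` are hypothetical helpers rather than functions from ccmain/paragraphs.cpp.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Assumed test for leader dots: the line contains min_dots consecutive periods.
bool HasLeaderDots(const std::string& line, std::size_t min_dots) {
  assert(min_dots > 0);
  return line.find(std::string(min_dots, '.')) != std::string::npos;
}

// One flag per line: true when the line and both of its neighbours contain
// leader dots, so the line can be treated as a paragraph of its own.
std::vector<bool> MarkLeaderParagraphs(const std::vector<std::string>& lines,
                                       std::size_t min_dots) {
  std::vector<bool> own_paragraph(lines.size(), false);
  for (std::size_t i = 1; i + 1 < lines.size(); ++i) {
    own_paragraph[i] = HasLeaderDots(lines[i - 1], min_dots) &&
                       HasLeaderDots(lines[i], min_dots) &&
                       HasLeaderDots(lines[i + 1], min_dots);
  }
  return own_paragraph;
}
```

The first and last lines are never marked, because they do not have a neighbour on both sides.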

pass 2a detection

Find any strongly evidenced start-of-paragraph lines. If they're followed by two lines that look like body lines, make a paragraph model for that and see if that model applies throughout the text (that is, "smear" it).

  1. main [api/tesseractmain.cpp] ->
  2. TessBaseAPI::ProcessPages [api/baseapi.cpp] ->
  3. TessBaseAPI::ProcessPage [api/baseapi.cpp] ->
  4. TessBaseAPI::Recognize [api/baseapi.cpp] ->
  5. TessBaseAPI::DetectParagraphs [api/baseapi.cpp] ->
  6. DetectParagraphs [ccmain/paragraphs.cpp] ->
  7. DetectParagraphs [ccmain/paragraphs.cpp] ->
  8. StrongEvidenceClassify [ccmain/paragraphs.cpp]

pass 2b detection

If we had any luck in pass 2a, we got part of the page and didn't know how to classify a few runs of rows. Take the segments that didn't find a model and reprocess them individually.

  1. main [api/tesseractmain.cpp] ->
  2. TessBaseAPI::ProcessPages [api/baseapi.cpp] ->
  3. TessBaseAPI::ProcessPage [api/baseapi.cpp] ->
  4. TessBaseAPI::Recognize [api/baseapi.cpp] ->
  5. TessBaseAPI::DetectParagraphs [api/baseapi.cpp] ->
  6. DetectParagraphs [ccmain/paragraphs.cpp] ->
  7. DetectParagraphs [ccmain/paragraphs.cpp] ->
  8. LeftoverSegments [ccmain/paragraphs.cpp] ->
  9. StrongEvidenceClassify [ccmain/paragraphs.cpp]
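
"Take the segments that didn't find a model" boils down to collecting runs of consecutive rows that are still unclassified. A minimal sketch follows, assuming one boolean per row that records whether a model was found; `UnmodeledSegments` is a hypothetical helper, and the real LeftoverSegments works on richer per-row state.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Return [begin, end) index ranges of consecutive rows that still have no
// paragraph model. Each range can then be reprocessed on its own.
std::vector<std::pair<std::size_t, std::size_t>> UnmodeledSegments(
    const std::vector<bool>& row_has_model) {
  std::vector<std::pair<std::size_t, std::size_t>> segments;
  std::size_t i = 0;
  while (i < row_has_model.size()) {
    if (row_has_model[i]) {
      ++i;  // already classified: skip
      continue;
    }
    std::size_t end = i;
    while (end < row_has_model.size() && !row_has_model[end]) ++end;
    segments.emplace_back(i, end);
    i = end;
  }
  return segments;
}
```

Passes 3 and 4 apply the same idea to whatever is still unresolved after this step, with a different classifier each time.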

pass 3 detection

These are the dregs for which we didn't have enough strong textual and geometric clues to form matching models for. Let's see if the geometric clues are simple enough that we could just use those.

  1. main [api/tesseractmain.cpp] ->
  2. TessBaseAPI::ProcessPages [api/baseapi.cpp] ->
  3. TessBaseAPI::ProcessPage [api/baseapi.cpp] ->
  4. TessBaseAPI::Recognize [api/baseapi.cpp] ->
  5. TessBaseAPI::DetectParagraphs [api/baseapi.cpp] ->
  6. DetectParagraphs [ccmain/paragraphs.cpp] ->
  7. DetectParagraphs [ccmain/paragraphs.cpp] ->
  8. LeftoverSegments [ccmain/paragraphs.cpp] ->
  9. GeometricClassify [ccmain/paragraphs.cpp] ->
  10. DowngradeWeakestToCrowns [ccmain/paragraphs.cpp]

pass 4 detection

Take everything that's still not marked up well and clear all markings.

  1. main [api/tesseractmain.cpp] ->
  2. TessBaseAPI::ProcessPages [api/baseapi.cpp] ->
  3. TessBaseAPI::ProcessPage [api/baseapi.cpp] ->
  4. TessBaseAPI::Recognize [api/baseapi.cpp] ->
  5. TessBaseAPI::DetectParagraphs [api/baseapi.cpp] ->
  6. DetectParagraphs [ccmain/paragraphs.cpp] ->
  7. DetectParagraphs [ccmain/paragraphs.cpp] ->
  8. LeftoverSegments [ccmain/paragraphs.cpp] ->
  9. SetUnknown [ccmain/paragraphs_internal.h]

Convert all of the unique hypothesis runs to PARAs.

ConvertHypothesizedModelRunsToParagraphs [ccmain/paragraphs.cpp]

Finally, clean up any dangling NULL row paragraph parents.

CanonicalizeDetectionResults [ccmain/paragraphs.cpp]

Error correction

dictionary error correction

Verify whether the recognized word is in the word_dic (unicharset).

  • Call stack
  1. main [api/tesseractmain.cpp] ->
  2. TessBaseAPI::ProcessPages [api/baseapi.cpp] ->
  3. TessBaseAPI::ProcessPage [api/baseapi.cpp] ->
  4. TessBaseAPI::Recognize [api/baseapi.cpp] ->
  5. Tesseract::recog_all_words [ccmain/control.cpp] ->
  6. Tesseract::RecogAllWordsPassN [ccmain/control.cpp] ->
  7. Tesseract::classify_word_and_language [ccmain/control.cpp] ->
  8. Tesseract::classify_word_pass1 [ccmain/control.cpp] ->
  9. Tesseract::tess_segment_pass_n [ccmain/tessbox.cpp] ->
  10. Tesseract::recog_word [ccmain/tfacepp.cpp] ->
  11. Wordrec::dict_word [wordrec/tface.cpp] ->
  12. Dict::valid_word [dict/dict.cpp]