1. First, install Tesseract-OCR (Windows)
    GitHub
    Wiki
2. Set the environment variables
    1) Add the Tesseract installation directory to PATH
    2) Add a new system variable TESSDATA_PREFIX='<installation directory>\tessdata'
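A quick way to confirm both settings from Node.js is sketched below. The helper name checkTesseractEnv and the sample paths are made up for illustration, and the "directory name contains tesseract" test is only a heuristic, not something Tesseract itself defines.

```javascript
// Hypothetical helper: reports whether the two settings from step 2 look right.
// `env` is a plain object such as process.env; `delimiter` is ';' on Windows.
function checkTesseractEnv(env, delimiter) {
  const pathEntries = (env.PATH || env.Path || '').split(delimiter).filter(Boolean);
  // Heuristic (an assumption): the install directory has "tesseract" in its name.
  const onPath = pathEntries.some(function (entry) {
    return /tesseract/i.test(entry);
  });
  const tessdata = env.TESSDATA_PREFIX || '';
  return {
    tesseractOnPath: onPath,
    tessdataPrefixSet: tessdata.length > 0
  };
}

// Example with a made-up environment (the paths are illustrative only).
const sampleEnv = {
  PATH: 'C:\\Windows;C:\\Program Files\\Tesseract-OCR',
  TESSDATA_PREFIX: 'C:\\Program Files\\Tesseract-OCR\\tessdata'
};
console.log(checkTesseractEnv(sampleEnv, ';'));
```

Against the real environment the call would be checkTesseractEnv(process.env, require('path').delimiter).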
3. Calling it from Node.js
    Install the wrapper: npm install node-tesseract
var tesseract = require('node-tesseract');

// Recognize text of any language in any format
tesseract.process(__dirname + '/test.png', function(err, text) {
    if (err) {
        console.error(err);
    } else {
        console.log(text);
    }
});
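If promises are preferred over callbacks, the tesseract.process call can be wrapped as in the following sketch. recognizeText is a hypothetical helper, not part of node-tesseract; the module object is passed in as `ocr` so the helper has no hard dependency on it.

```javascript
// Wraps the callback-style process(image, [options], callback) call in a Promise.
// `ocr` is whatever require('node-tesseract') returned.
function recognizeText(ocr, imagePath, options) {
  return new Promise(function (resolve, reject) {
    const callback = function (err, text) {
      if (err) {
        reject(err);
      } else {
        resolve(text);
      }
    };
    // Only pass options when the caller supplied them.
    if (options) {
      ocr.process(imagePath, options, callback);
    } else {
      ocr.process(imagePath, callback);
    }
  });
}

// Usage (assumes node-tesseract is installed and test.png exists):
// recognizeText(require('node-tesseract'), __dirname + '/test.png')
//   .then(console.log, console.error);
```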
4. Multiple languages
Use the -l option and list the languages one after another, joined with "+".
tesseract imagename outputbase [-l lang] [-psm pagesegmode] [configfile...]
var options = {
    l: 'chi_sim+eng',
    psm: 6
};

tesseract.process(__dirname + '/test.jpg', options, function(err, text) {
    if (err) {
        console.error(err);
    } else {
        console.log('----------------------------W');
        console.log(text);
    }
});
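To make the link between the options object and the command-line usage concrete, here is a small sketch that builds the argument list by hand. buildTesseractArgs is a hypothetical helper written for this note; it is not how node-tesseract is implemented internally.

```javascript
// Builds the argument list for the usage line
//   tesseract imagename outputbase [-l lang] [-psm pagesegmode]
// `languages` is an array such as ['chi_sim', 'eng'].
function buildTesseractArgs(image, outputBase, languages, psm) {
  const args = [image, outputBase];
  if (languages && languages.length > 0) {
    // Several languages are joined with "+", e.g. chi_sim+eng
    args.push('-l', languages.join('+'));
  }
  if (psm !== undefined && psm !== null) {
    args.push('-psm', String(psm));
  }
  return args;
}

console.log(buildTesseractArgs('test.jpg', 'out', ['chi_sim', 'eng'], 6).join(' '));
// prints "test.jpg out -l chi_sim+eng -psm 6"
```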
-psm values:
Member name                  Value  Description
PSM_OSD_ONLY                 0      Orientation and script detection only.
PSM_AUTO_OSD                 1      Automatic page segmentation with orientation and script detection. (OSD)
PSM_AUTO_ONLY                2      Automatic page segmentation, but no OSD, or OCR.
PSM_AUTO                     3      Fully automatic page segmentation, but no OSD.
PSM_SINGLE_COLUMN            4      Assume a single column of text of variable sizes.
PSM_SINGLE_BLOCK_VERT_TEXT   5      Assume a single uniform block of vertically aligned text.
PSM_SINGLE_BLOCK             6      Assume a single uniform block of text. (Default.)
PSM_SINGLE_LINE              7      Treat the image as a single text line.
PSM_SINGLE_WORD              8      Treat the image as a single word.
PSM_CIRCLE_WORD              9      Treat the image as a single word in a circle.
PSM_SINGLE_CHAR              10     Treat the image as a single character.
PSM_SPARSE_TEXT              11     Find as much text as possible in no particular order.
PSM_SPARSE_TEXT_OSD          12     Sparse text with orientation and script det.
PSM_RAW_LINE                 13     Treat the image as a single text line, bypassing hacks that are Tesseract-specific.
Note: "(Default.)" on value 6 refers to the library API; the tesseract command line defaults to 3.
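To keep the numeric values out of calling code, the table can be mirrored as a constants object. This is only a convenience sketch: the PSM object below is defined here and is not exported by node-tesseract.

```javascript
// Page segmentation modes, copied from the table above.
const PSM = Object.freeze({
  OSD_ONLY: 0,
  AUTO_OSD: 1,
  AUTO_ONLY: 2,
  AUTO: 3,
  SINGLE_COLUMN: 4,
  SINGLE_BLOCK_VERT_TEXT: 5,
  SINGLE_BLOCK: 6,
  SINGLE_LINE: 7,
  SINGLE_WORD: 8,
  CIRCLE_WORD: 9,
  SINGLE_CHAR: 10,
  SPARSE_TEXT: 11,
  SPARSE_TEXT_OSD: 12,
  RAW_LINE: 13
});

// Options for an image that contains exactly one line of English text.
const singleLineOptions = { l: 'eng', psm: PSM.SINGLE_LINE };
console.log(singleLineOptions.psm); // prints 7
```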
5. Example: recognizing CAPTCHAs together with GraphicsMagick
    Reference link: gm
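6. Localized OCR (recognizing a specific region)
Sometimes only a specific region of the image needs to be recognized. This can be done with a UZN file combined with the -psm 4 option.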
Tesseract can read in uzn files, and use them instead of doing its own segmentation, on two conditions:
The segmentation mode PSM_SINGLE_COLUMN must be used (Check manpage for details)
The uzn file must be named imageName.uzn, so for scan01.png the uzn file must be named scan01.uzn
In short, two prerequisites:
-psm 4
The uzn file has the same base name as the image file
https://github.com/charlesw/tesseract/issues/66
For example, consider an image test.png with the following content:
This is a text
            This is another test
      This is a last test
test.uzn:
50 65 100 15 Text
On the command line, run:
 "tesseract.exe test.png test -psm 4"
Output:
This is another test
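The uzn file can also be generated from Node.js before calling Tesseract. The sketch below assumes each line has the field order "left top width height label", which is my reading of the one-line example above and is not spelled out in the source; uznPathFor and formatUzn are hypothetical helper names.

```javascript
const path = require('path');

// Returns the .uzn path that sits next to the image: scan01.png -> scan01.uzn
function uznPathFor(imagePath) {
  const parsed = path.parse(imagePath);
  return path.join(parsed.dir, parsed.name + '.uzn');
}

// Formats zones as one line per zone.
// Assumed field order: left top width height label.
function formatUzn(zones) {
  return zones.map(function (z) {
    return [z.left, z.top, z.width, z.height, z.label].join(' ');
  }).join('\n') + '\n';
}

console.log(uznPathFor('test.png')); // prints "test.uzn"
console.log(formatUzn([{ left: 50, top: 65, width: 100, height: 15, label: 'Text' }]));
```

Writing the result with fs.writeFileSync(uznPathFor(image), formatUzn(zones)) and then running OCR with psm: 4 follows the two prerequisites listed earlier.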
7. Adding hOCR output to node-tesseract
There are two ways:
1) Edit \lib\tesseract.js under the module's installation directory and add an hocr property to options at line 22:
options: {
    'l': 'eng',
    'psm': 3,
    'config': null,
    'binary': 'tesseract',
    'hocr': null
},
Then add the following at line 70:
if (options.hocr !== null) {
    command.push('hocr');
}
To get hOCR output, pass the new option like this:
var options = {
    l: 'chi_sim+eng',
    psm: 4,
    hocr: 'hocr'
};

tesseract.process('/test.png', options, function(err, text) {
    if (err) {
        console.error(err);
    } else {
        console.log(text);
    }
});
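Once hOCR text comes back, the word boxes can be pulled out of it. The sketch below is a rough regex-based reader that assumes words appear as span elements with class ocrx_word and a title starting with "bbox x0 y0 x1 y1"; the sample markup is hand-written for illustration, and real output may differ between Tesseract versions, so a proper HTML parser is the safer choice for production use.

```javascript
// Extracts words and their bounding boxes from an hOCR string.
// parseHocrWords is a hypothetical helper, not part of node-tesseract.
function parseHocrWords(hocr) {
  const wordPattern = /<span[^>]*class=['"]ocrx_word['"][^>]*title=['"]bbox (\d+) (\d+) (\d+) (\d+)[^'"]*['"][^>]*>([\s\S]*?)<\/span>/g;
  const words = [];
  let match;
  while ((match = wordPattern.exec(hocr)) !== null) {
    words.push({
      // Drop any inline tags (such as <strong>) around the word text.
      text: match[5].replace(/<[^>]+>/g, '').trim(),
      bbox: [Number(match[1]), Number(match[2]), Number(match[3]), Number(match[4])]
    });
  }
  return words;
}

// Hand-written sample in the assumed shape (not real Tesseract output).
const sampleHocr =
  "<span class='ocrx_word' id='word_1_1' title='bbox 50 65 80 80; x_wconf 91'>This</span> " +
  "<span class='ocrx_word' id='word_1_2' title='bbox 85 65 95 80; x_wconf 93'>is</span>";
console.log(parseHocrWords(sampleHocr));
```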
2) See the Git pull request. It does not appear to have been merged, so the change has to be applied to your own tesseract.js file.