Why use word 1.2?
The latest release of the word segmenter is 1.3, but 1.3 has some bugs that trigger java.lang.OutOfMemoryError, so this article sticks with the more stable 1.2 release.
In Lucene 6.1.0, subclassing Analyzer (i.e., building your own Analyzer) requires implementing createComponents(String fieldName). word 1.2 does not implement this method; it targets Lucene 4.x, where the method also took a Reader argument. Running its ChineseWordAnalyzer therefore fails with:
Exception in thread "main" java.lang.AbstractMethodError: org.apache.lucene.analysis.Analyzer.createComponents(Ljava/lang/String;)Lorg/apache/lucene/analysis/Analyzer$TokenStreamComponents;
at org.apache.lucene.analysis.Analyzer.tokenStream(Analyzer.java:140)
So ChineseWordAnalyzer needs a few modifications.
Implementing the createComponents(String fieldName) method
Create an Analyzer subclass, MyWordAnalyzer, adapted from ChineseWordAnalyzer:
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Tokenizer;
import org.apdplat.word.segmentation.Segmentation;
import org.apdplat.word.segmentation.SegmentationAlgorithm;
import org.apdplat.word.segmentation.SegmentationFactory;

public class MyWordAnalyzer extends Analyzer {
    private Segmentation segmentation = null;

    public MyWordAnalyzer() {
        // Default to bidirectional maximum matching.
        segmentation = SegmentationFactory.getSegmentation(
                SegmentationAlgorithm.BidirectionalMaximumMatching);
    }

    public MyWordAnalyzer(Segmentation segmentation) {
        this.segmentation = segmentation;
    }

    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        // The single-argument overload that Lucene 6.x calls.
        Tokenizer tokenizer = new MyWordTokenizer(segmentation);
        return new TokenStreamComponents(tokenizer);
    }
}
The segmentation field selects the segmentation algorithm; the default is bidirectional maximum matching, and the second constructor lets you plug in any other algorithm word 1.2 provides, as sketched below.
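For example, a minimal sketch of switching to forward maximum matching (this assumes word 1.2's SegmentationAlgorithm enum includes a MaximumMatching constant, as its released versions do):

// Hypothetical usage: pick a non-default algorithm via the second constructor.
Analyzer analyzer = new MyWordAnalyzer(
        SegmentationFactory.getSegmentation(SegmentationAlgorithm.MaximumMatching));

Next comes MyWordTokenizer, also modeled on ChineseWordTokenizer: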
import java.io.BufferedReader;
import java.io.IOException;
import java.util.Queue;
import java.util.concurrent.LinkedTransferQueue;

import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
import org.apdplat.word.recognition.StopWord;
import org.apdplat.word.segmentation.Segmentation;
import org.apdplat.word.segmentation.SegmentationAlgorithm;
import org.apdplat.word.segmentation.SegmentationFactory;
import org.apdplat.word.segmentation.Word;

public class MyWordTokenizer extends Tokenizer {
    private final CharTermAttribute charTermAttribute
            = addAttribute(CharTermAttribute.class);
    private final OffsetAttribute offsetAttribute
            = addAttribute(OffsetAttribute.class);
    private final PositionIncrementAttribute positionIncrementAttribute
            = addAttribute(PositionIncrementAttribute.class);

    private Segmentation segmentation = null;
    private BufferedReader reader = null;
    private final Queue<Word> words = new LinkedTransferQueue<>();
    private int startOffset = 0;

    public MyWordTokenizer() {
        segmentation = SegmentationFactory.getSegmentation(
                SegmentationAlgorithm.BidirectionalMaximumMatching);
    }

    public MyWordTokenizer(Segmentation segmentation) {
        this.segmentation = segmentation;
    }

    private Word getWord() throws IOException {
        Word word = words.poll();
        if (word == null) {
            // Queue exhausted: segment the remaining input line by line.
            String line;
            while ((line = reader.readLine()) != null) {
                words.addAll(segmentation.seg(line));
            }
            startOffset = 0;
            word = words.poll();
        }
        return word;
    }

    @Override
    public final boolean incrementToken() throws IOException {
        // input is only usable after reset() has been called (see below),
        // so it is wrapped here rather than in the constructor.
        reader = new BufferedReader(input);
        Word word = getWord();
        if (word != null) {
            int positionIncrement = 1;
            // Skip stop words, widening the position increment accordingly.
            while (StopWord.is(word.getText())) {
                positionIncrement++;
                startOffset += word.getText().length();
                word = getWord();
                if (word == null) {
                    return false;
                }
            }
            charTermAttribute.setEmpty().append(word.getText());
            offsetAttribute.setOffset(startOffset,
                    startOffset + word.getText().length());
            positionIncrementAttribute.setPositionIncrement(positionIncrement);
            startOffset += word.getText().length();
            return true;
        }
        return false;
    }
}
incrementToken() is the method you must implement: returning true means more tokens follow, and returning false means parsing is finished. Its first line assigns input to reader. input is the Reader field of Tokenizer that holds the text to analyze; Tokenizer also has a second Reader field, inputPending. The relevant Tokenizer source is:
public abstract class Tokenizer extends TokenStream {
    /** The text source for this Tokenizer. */
    protected Reader input = ILLEGAL_STATE_READER;
    /** Pending reader: not actually assigned to input until reset() */
    private Reader inputPending = ILLEGAL_STATE_READER;
    // ...
}
input holds the text to be analyzed, but that text is first stored in inputPending and is only assigned to input once reset() has been called.
The reset() method is defined as follows:
@Override
public void reset() throws IOException {
    super.reset();
    input = inputPending;
    inputPending = ILLEGAL_STATE_READER;
}
Before reset() is called, input does not yet contain the text to analyze, which is why input must be assigned to reader (a BufferedReader) only after reset() has run.
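An equivalent and arguably cleaner variant (a sketch of my own, not taken from the word sources) is to override reset() in MyWordTokenizer and wrap input exactly once there, rather than re-wrapping it on every incrementToken() call:

@Override
public void reset() throws IOException {
    super.reset();                      // super.reset() moves inputPending into input
    reader = new BufferedReader(input); // now input is safe to wrap
}

Either way, the essential point is the same: touch input only after reset() has run.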
With the modifications above in place, word 1.2's algorithms can now be used for segmentation.
Test class MyWordAnalyzerTest:
import java.io.IOException;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;

public class MyWordAnalyzerTest {
    public static void main(String[] args) throws IOException {
        String text = "乒乓球拍賣完了";
        Analyzer analyzer = new MyWordAnalyzer();
        TokenStream tokenStream = analyzer.tokenStream("text", text);
        // Prepare the stream for consumption.
        tokenStream.reset();
        // Consume token by token.
        while (tokenStream.incrementToken()) {
            // The term itself.
            CharTermAttribute charTermAttribute
                    = tokenStream.getAttribute(CharTermAttribute.class);
            // The term's start and end offsets in the text.
            OffsetAttribute offsetAttribute
                    = tokenStream.getAttribute(OffsetAttribute.class);
            // The position increment relative to the previous term.
            PositionIncrementAttribute positionIncrementAttribute
                    = tokenStream.getAttribute(PositionIncrementAttribute.class);
            System.out.println(charTermAttribute.toString() + " "
                    + "(" + offsetAttribute.startOffset() + " - "
                    + offsetAttribute.endOffset() + ") "
                    + positionIncrementAttribute.getPositionIncrement());
        }
        // Done consuming.
        tokenStream.close();
    }
}
Running MyWordAnalyzerTest should print output along these lines:
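乒乓球拍 (0 - 4) 1
賣完 (4 - 6) 1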
Because incrementToken() drops stop words, "了" does not appear in the results. The output also shows that the word segmenter splits the sentence into 乒乓球拍 ("ping-pong paddle") and 賣完 ("sold out"), which is the natural reading. For comparison, run the same text through SmartChineseAnalyzer(), as sketched below.
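A minimal sketch of that comparison (assuming Lucene's lucene-analyzers-smartcn artifact is on the classpath; the consume loop is the same one used in MyWordAnalyzerTest):

import org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer;

// Run the same sentence through Lucene's built-in Chinese analyzer.
Analyzer smartcn = new SmartChineseAnalyzer();
TokenStream tokenStream = smartcn.tokenStream("text", "乒乓球拍賣完了");
tokenStream.reset();
while (tokenStream.incrementToken()) {
    // Print just the terms; offsets and increments are read as before.
    System.out.println(tokenStream.getAttribute(CharTermAttribute.class));
}
tokenStream.close();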
All in all, word's segmentation quality holds up well.