Analyzer
https://cwiki.apache.org/confluence/display/solr/Understanding+Analyzers,+Tokenizers,+and+Filters
A tokenizer splits a sentence into individual terms, and a filter then screens the tokenizer's output. In Chinese, for example, words such as "的" and "呀" that add little to the meaning of a sentence are removed; the English equivalents are words like "is" and "a".
An analyzer consists of two parts: a tokenizer and one or more filters (token filters), which are applied in the order they are listed. Separate chains can be declared for index time and query time. For example:
<fieldType name="text_ik_analysis" class="solr.TextField" sortMissingLast="true" omitNorms="true" autoGeneratePhraseQueries="false">
  <!-- analyzer chain applied when documents are indexed -->
  <analyzer type="index">
    <tokenizer class="org.wltea.analyzer.lucene.IKTokenizerFactory" useSmart="true"/>
    <!-- expand synonyms listed in synonyms.txt, case-insensitively -->
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <!-- keep only tokens between 2 and 20 characters long -->
    <filter class="solr.LengthFilterFactory" min="2" max="20"/>
    <!-- drop tokens that duplicate another token at the same position -->
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <!-- analyzer chain applied to query strings; here it mirrors the index chain -->
  <analyzer type="query">
    <tokenizer class="org.wltea.analyzer.lucene.IKTokenizerFactory" useSmart="true"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.LengthFilterFactory" min="2" max="20"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>
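Once the field type is defined, fields in the same schema can reference it by name. A minimal sketch (the field name "content" and its indexed/stored flags are illustrative, not part of the original schema):

<field name="content" type="text_ik_analysis" indexed="true" stored="true"/>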
Tokenizer
Common tokenizers (a minimal configuration sketch follows the list):
- KeywordTokenizerFactory: treats the entire input as one token, regardless of its content
- LetterTokenizerFactory: splits on letters and discards everything that is not a letter, e.g. "I can't" ==> "I", "can", "t"
- WhitespaceTokenizerFactory: splits on whitespace, e.g. "I do" ==> "I", "do"
- IKTokenizerFactory: the IK tokenizer for Chinese text
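A tokenizer is declared as the first element of an analyzer. The sketch below uses WhitespaceTokenizerFactory; the field type name "text_ws" is only an illustration:

<fieldType name="text_ws" class="solr.TextField">
  <analyzer>
    <!-- split the input on whitespace only, e.g. "I do" ==> "I", "do" -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  </analyzer>
</fieldType>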
Filter
Common filters (a minimal configuration sketch follows the list):
- LowerCaseFilterFactory: converts uppercase letters to lowercase; non-letter characters are left unchanged
- SynonymFilterFactory: injects synonyms from a synonym file
- LengthFilterFactory: keeps only tokens within a given character-length range
- RemoveDuplicatesTokenFilterFactory: removes tokens that duplicate another token at the same position
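Filters are chained after the tokenizer inside the same analyzer and are applied in the order listed. A minimal sketch (the field type name "text_lower" is only an illustration):

<fieldType name="text_lower" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- lowercase every token; non-letter characters pass through unchanged -->
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- keep only tokens between 2 and 20 characters long -->
    <filter class="solr.LengthFilterFactory" min="2" max="20"/>
  </analyzer>
</fieldType>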