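With Chinese word segmentation sorted out, the next step is pinyin tokenization. It comes in two forms: full pinyin and abbreviated (first-letter) pinyin.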
Links:
喵了个咪's blog: http://w-blog.cn
Solr official site: http://lucene.apache.org/solr/
PS: 8.0.0 has already been released; this article uses 7.7.1, which was the more stable release at the time of writing.
1. Full-pinyin tokenization
> wget http://files.cnblogs.com/files/wander1129/pinyin.zip
> unzip pinyin.zip
> mv pinyin4j-2.5.0.jar server/solr-webapp/webapp/WEB-INF/lib
> mv pinyinAnalyzer4.3.1.jar server/solr-webapp/webapp/WEB-INF/lib
> vim server/solr/new_core/conf/managed-schema
<fieldType name="text_pinyin" class="solr.TextField" positionIncrementGap="0">
  <analyzer type="index">
    <tokenizer class="org.wltea.analyzer.lucene.IKTokenizerFactory"/>
    <filter class="com.shentong.search.analyzers.PinyinTransformTokenFilterFactory" minTermLenght="2" />
    <filter class="com.shentong.search.analyzers.PinyinNGramTokenFilterFactory" minGram="1" maxGram="20" />
  </analyzer>
  <analyzer type="query">
    <tokenizer class="org.wltea.analyzer.lucene.IKTokenizerFactory"/>
    <filter class="com.shentong.search.analyzers.PinyinTransformTokenFilterFactory" minTermLenght="2" />
    <filter class="com.shentong.search.analyzers.PinyinNGramTokenFilterFactory" minGram="1" maxGram="20" />
  </analyzer>
</fieldType>
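Once the jars are in place and the field type is saved, the analyzer chain can be checked without indexing anything by sending sample text to Solr's field analysis handler. The following is a minimal sketch, not a definitive procedure. It assumes Solr listens on localhost:8983 (the default port, not stated in this article) and that the core is new_core as above. The function names analysis_url and analyze are made up for illustration. Solr most likely needs a restart first so the new jars are loaded.

```shell
#!/bin/sh
# Assumptions: Solr on localhost:8983, core named new_core (override via env).
SOLR_URL="${SOLR_URL:-http://localhost:8983/solr}"
CORE="${CORE:-new_core}"

# Build the field-analysis URL for a given field type (hypothetical helper).
analysis_url() {
  printf '%s/%s/analysis/field?analysis.fieldtype=%s&wt=json' "$SOLR_URL" "$CORE" "$1"
}

# Run sample text through the index-time analyzer of a field type.
# curl -G appends the URL-encoded value to the query string.
analyze() {
  curl -s -G "$(analysis_url "$1")" --data-urlencode "analysis.fieldvalue=$2"
}

# Example (needs a running Solr):
#   analyze text_pinyin '美团'
```

The JSON response lists the tokens produced after each tokenizer and filter stage, which makes it easy to see whether the pinyin filters are actually being applied.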
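Custom segmentation rules
The IK dictionary files go into the webapp's classes directory, webapps/solr/WEB-INF/classes/ (in this 7.7.1 install that is server/solr-webapp/webapp/WEB-INF/classes/, created below):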
> cd /usr/local/solr-7.7.1/server/solr-webapp/webapp/WEB-INF
> mkdir classes
> wget http://pic.w-blog.cn/ikanalyzer-solr5.zip
> unzip ikanalyzer-solr5.zip
> cd ikanalyzer-solr5/
> mv ext.dic ../classes/
> mv IKAnalyzer.cfg.xml ../classes/
> mv stopword.dic ../classes/
> vim ../classes/ext.dic
美团
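Because the files are moved by hand, a quick sanity check that all three ended up in the classes directory can save a confusing debugging session later. This is a small sketch. The function name check_ik_files is hypothetical, and the path in the usage comment assumes the /usr/local/solr-7.7.1 install location used above.

```shell
#!/bin/sh
# Report any IK Analyzer resource missing from the given classes directory.
# Returns 0 when all three files are present, 1 otherwise.
check_ik_files() {
  dir="$1"
  missing=0
  for f in ext.dic IKAnalyzer.cfg.xml stopword.dic; do
    if [ ! -f "$dir/$f" ]; then
      echo "missing: $dir/$f"
      missing=1
    fi
  done
  return "$missing"
}

# Example:
#   check_ik_files /usr/local/solr-7.7.1/server/solr-webapp/webapp/WEB-INF/classes \
#     && echo "dictionary files in place"
```

After editing ext.dic, Solr will probably need to be restarted before the new words take effect.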
Abbreviated-pinyin tokenization
> wget http://pic.w-blog.cn/pinyinTokenFilter-1.1.0-RELEASE.jar
> mv pinyinTokenFilter-1.1.0-RELEASE.jar server/solr-webapp/webapp/WEB-INF/lib
> vim server/solr/new_core/conf/managed-schema
<fieldType name="text_jian_pinyin" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="org.wltea.analyzer.lucene.IKTokenizerFactory" isMaxWordLength="false" useSmart="false" />
    <filter class="solr.LowerCaseFilterFactory" />
    <filter class="top.pinyin.index.solr.PinyinTokenFilterFactory" pinyin="true" isFirstChar="true" minTermLenght="2" />
    <filter class="com.shentong.search.analyzers.PinyinNGramTokenFilterFactory" minGram="2" maxGram="20" />
  </analyzer>
  <analyzer type="query">
    <tokenizer class="org.wltea.analyzer.lucene.IKTokenizerFactory" isMaxWordLength="false" useSmart="false" />
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true" />
    <filter class="solr.LowerCaseFilterFactory" />
  </analyzer>
</fieldType>
<field name="app_name" type="text_jian_pinyin" indexed="true" stored="true" />
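With the app_name field defined, a first-letter query can be tried against the select handler. The sketch below assumes Solr on localhost:8983 with the new_core core, and that at least one document with an app_name value (for example 美团) has already been indexed. The helper names select_url and search_app_name are invented for this example, and whether a given abbreviation matches depends on how IK segments the indexed text.

```shell
#!/bin/sh
# Assumptions: Solr on localhost:8983, core named new_core (override via env).
SOLR_URL="${SOLR_URL:-http://localhost:8983/solr}"
CORE="${CORE:-new_core}"

# URL of the core's standard select handler (hypothetical helper).
select_url() {
  printf '%s/%s/select' "$SOLR_URL" "$CORE"
}

# Search the app_name field for the given term, returning only that field.
search_app_name() {
  curl -s -G "$(select_url)" \
    --data-urlencode "q=app_name:$1" \
    --data-urlencode "fl=app_name" \
    --data-urlencode "wt=json"
}

# Example (needs a running Solr with indexed documents):
#   search_app_name mt
```

Note that only the index analyzer applies the pinyin filters here; the query analyzer just tokenizes and lowercases, so the typed abbreviation is matched directly against the first-letter n-grams stored at index time.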