7. Custom Analyzers and Chinese Word Segmentation (Lucene Notes)

1. Custom Analyzer

Here we write a custom stop-word analyzer, that is, an analyzer that filters out certain words while it tokenizes.
MyStopAnalyzer.java

package cn.itcast.util;
import java.io.Reader;
import java.util.Set;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LetterTokenizer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.StopAnalyzer;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.util.Version;

public class MyStopAnalyzer extends Analyzer {
    
    @SuppressWarnings("rawtypes")
    private Set stops; // the stop words to filter out
    
    public MyStopAnalyzer() {
        stops = StopAnalyzer.ENGLISH_STOP_WORDS_SET; // Lucene's default English stop words
    }
    
    // Builds the stop set from an array of words plus the default English stop words
    public MyStopAnalyzer(String[] sws) {
        //System.out.println(StopAnalyzer.ENGLISH_STOP_WORDS_SET);
        stops = StopFilter.makeStopSet(Version.LUCENE_35, sws, true); // last argument: ignore case
        stops.addAll(StopAnalyzer.ENGLISH_STOP_WORDS_SET);
    }

    @Override
    public TokenStream tokenStream(String fieldName, Reader reader) {
        // Note: analysis is a chain. The Tokenizer at the innermost end reads from the Reader,
        // and each filter wraps the stream before it: here StopFilter wraps LowerCaseFilter,
        // which wraps the LetterTokenizer. More filters can be added to the chain the same way.
        return new StopFilter(Version.LUCENE_35, new LowerCaseFilter(Version.LUCENE_35, 
                new LetterTokenizer(Version.LUCENE_35, reader)), stops);
    }
}

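Notes:

Here we define a Set to hold the stop words. In the no-argument constructor we assign the default stop set of StopAnalyzer to stops, so the analyzer filters the same words as the built-in stop analyzer. The second constructor takes the words we want to stop as a String array. A String array cannot be assigned to a Set directly, so StopFilter.makeStopSet converts the array into a stop set, and the default English stop words are then added to it.
A custom analyzer extends Analyzer (an abstract class, not an interface) and implements tokenStream, which takes the field name and a Reader. The three-argument call in the method body is the StopFilter constructor: the first argument is the version, the last is the stop set (stops here), and the second is the upstream token stream, because analysis is a chain made of a tokenizer followed by filters.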
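
To make the chain concrete, here is a dependency-free sketch in plain Java. It is not Lucene code: it only models the three stages of MyStopAnalyzer (split on non-letters, lowercase, drop stop words) as ordinary methods. The class StopChainSketch and its method names are made up for illustration, and DEFAULT_STOPS is written from memory of Lucene's default English stop list, so treat its exact contents as an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Plain-Java model of the chain LetterTokenizer -> LowerCaseFilter -> StopFilter.
class StopChainSketch {

    // Assumed contents of Lucene's default English stop set (written from memory).
    static final List<String> DEFAULT_STOPS = Arrays.asList(
            "a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if",
            "in", "into", "is", "it", "no", "not", "of", "on", "or", "such", "that",
            "the", "their", "then", "there", "these", "they", "this", "to", "was",
            "will", "with");

    // Stage 1: emit maximal runs of letters, the way a letter tokenizer does.
    static List<String> letterTokenize(String text) {
        List<String> tokens = new ArrayList<String>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetter(c)) {
                current.append(c);
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    // Stage 2: lowercase every token.
    static List<String> lowerCase(List<String> tokens) {
        List<String> out = new ArrayList<String>();
        for (String t : tokens) {
            out.add(t.toLowerCase(Locale.ROOT));
        }
        return out;
    }

    // Stage 3: drop every token that is in the stop set.
    static List<String> removeStops(List<String> tokens, Set<String> stops) {
        List<String> out = new ArrayList<String>();
        for (String t : tokens) {
            if (!stops.contains(t)) {
                out.add(t);
            }
        }
        return out;
    }

    // The whole chain: default stop words plus the caller's own, matched case-insensitively.
    static List<String> analyze(String text, String... extraStops) {
        Set<String> stops = new HashSet<String>(DEFAULT_STOPS);
        for (String s : extraStops) {
            stops.add(s.toLowerCase(Locale.ROOT));
        }
        return removeStops(lowerCase(letterTokenize(text)), stops);
    }

    public static void main(String[] args) {
        // prints [how, thank, hate]
        System.out.println(analyze("how are you thank you I hate you", "I", "you"));
    }
}
```

Each stage only consumes the output of the previous one, which is why more filters can be appended without touching the existing ones.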

Test:
TestAnalyzer.java

@Test
public void test04(){
    // English-only example; these analyzers are not suitable for Chinese text
    Analyzer analyzer = new MyStopAnalyzer(new String[]{"I","you"});
    Analyzer analyzer2 = new StopAnalyzer(Version.LUCENE_35); // built-in stop analyzer
    
    String text = "how are you thank you I hate you";
    System.out.println("************ custom analyzer ***************");
    AnalyzerUtils.displayAllTokenInfo(text, analyzer);
    System.out.println("************ stop analyzer ***************");
    AnalyzerUtils.displayAllTokenInfo(text, analyzer2);
}

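Notes: the output makes the difference between the two analyzers easy to see: the custom analyzer filters out the default English stop words plus the words we passed in ("I" and "you").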

2. Chinese Analyzer

Here we use the MMSEG Chinese analyzer, whose dictionary comes from the Sogou lexicon. We use version 1.8.5, which ships two usable jars:

mmseg4j-all-1.8.5.jar
mmseg4j-all-1.8.5-with-dic.jar

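The second jar additionally bundles the dictionary data, which makes segmentation more convenient. We can also use the first jar, but then the results are not much different from the default analyzers. Let's test it directly in a method: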

@Test
public void test02(){
    // The first four analyzers are not suitable for Chinese text
    Analyzer analyzer1 = new StandardAnalyzer(Version.LUCENE_35); // standard analyzer
    Analyzer analyzer2 = new StopAnalyzer(Version.LUCENE_35); // stop analyzer
    Analyzer analyzer3 = new SimpleAnalyzer(Version.LUCENE_35); // simple analyzer
    Analyzer analyzer4 = new WhitespaceAnalyzer(Version.LUCENE_35); // whitespace analyzer
    Analyzer analyzer5 = new MMSegAnalyzer(); // MMSEG Chinese analyzer
    
    
    String text = "西安市雁塔區"; // "Yanta District, Xi'an"
    AnalyzerUtils.displayToken(text, analyzer1);
    AnalyzerUtils.displayToken(text, analyzer2);
    AnalyzerUtils.displayToken(text, analyzer3);
    AnalyzerUtils.displayToken(text, analyzer4);
    AnalyzerUtils.displayToken(text, analyzer5);
}

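Notes: here we use the MMSEG Chinese analyzer directly, without pointing it at a dictionary directory. The test output is:

[Figure: test output with the default MMSegAnalyzer]

The result is not much different from the default analyzers. We can also specify the directory that holds the dictionary files: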

Analyzer analyzer5 = new MMSegAnalyzer(new File("E:/API/Lucene/mmseg/data"));

The test output is now:

[Figure: test output with the dictionary directory specified]

The directory E:/API/Lucene/mmseg/data contains four files:

chars.dic
units.dic
words.dic
words-my.dic

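These files hold the dictionary entries. To customize the vocabulary, edit the last file, words-my.dic, directly. mmseg4j treats files named words*.dic as word dictionaries, so the entries there are words to recognize rather than a stop-word list.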
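
To see why the dictionary matters, here is a small sketch of forward maximum matching, the simplest dictionary-based segmentation strategy. It is not the MMSEG algorithm itself (MMSEG builds on maximum matching and adds disambiguation rules) and it does not use mmseg4j. The class MaxMatchSketch, the tiny dictionary, and the maxWordLen parameter are all made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Forward maximum matching: at each position take the longest dictionary word,
// and fall back to a single character when nothing in the dictionary matches.
class MaxMatchSketch {

    static List<String> segment(String text, Set<String> dict, int maxWordLen) {
        List<String> out = new ArrayList<String>();
        int i = 0;
        while (i < text.length()) {
            int end = Math.min(text.length(), i + maxWordLen);
            // Try the longest candidate first and shrink until a dictionary word is found.
            while (end > i + 1 && !dict.contains(text.substring(i, end))) {
                end--;
            }
            out.add(text.substring(i, end)); // one character if no word matched
            i = end;
        }
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical two-word dictionary, standing in for words.dic.
        Set<String> dict = new HashSet<String>(Arrays.asList("西安市", "雁塔區"));
        System.out.println(segment("西安市雁塔區", dict, 4));
        // With an empty dictionary every character becomes its own token.
        System.out.println(segment("西安市雁塔區", new HashSet<String>(), 4));
    }
}
```

With an empty dictionary the text falls apart into single characters, which matches the observation above that an analyzer without dictionary data behaves much like the default ones.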

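3. Synonym Indexing (1)

3.1 Approach

Notes: first we segment the text with MMSEG. Our custom filter then looks up each token in a synonym store and emits the synonyms at the same position as the original token. As mentioned earlier, several tokens can share one position (a position increment of 0).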
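
The idea of several tokens sharing one position can be sketched without Lucene. The example below turns a list of (term, position increment) pairs into absolute positions by summing the increments. PositionSketch and its Token class are hypothetical names, and the token list is hand-written rather than produced by MMSEG.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Groups terms by absolute position, where position = running sum of increments.
class PositionSketch {

    static final class Token {
        final String term;
        final int positionIncrement;

        Token(String term, int positionIncrement) {
            this.term = term;
            this.positionIncrement = positionIncrement;
        }
    }

    static Map<Integer, List<String>> groupByPosition(List<Token> tokens) {
        Map<Integer, List<String>> byPosition = new TreeMap<Integer, List<String>>();
        int position = -1; // so that a first increment of 1 lands on position 0
        for (Token t : tokens) {
            position += t.positionIncrement;
            List<String> terms = byPosition.get(position);
            if (terms == null) {
                terms = new ArrayList<String>();
                byPosition.put(position, terms);
            }
            terms.add(t.term);
        }
        return byPosition;
    }

    public static void main(String[] args) {
        // Hand-written token list: an increment of 0 stacks a term on the previous position.
        List<Token> tokens = Arrays.asList(
                new Token("我", 1), new Token("咱", 0), new Token("俺", 0),
                new Token("來自", 1),
                new Token("中國", 1), new Token("天朝", 0), new Token("大陸", 0));
        System.out.println(groupByPosition(tokens));
    }
}
```

Terms that end up under the same key are what a term query sees as interchangeable at that position.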

3.2 Custom Analyzer

MySameAnalyzer.java

package cn.itcast.util;
import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import com.chenlb.mmseg4j.Dictionary;
import com.chenlb.mmseg4j.MaxWordSeg;
import com.chenlb.mmseg4j.analysis.MMSegTokenizer;

public class MySameAnalyzer extends Analyzer {

    @Override
    public TokenStream tokenStream(String fieldName, Reader reader) {
        
        Dictionary dic = Dictionary.getInstance("E:/API/Lucene/mmseg/data");
        
        // First segment the text into tokens with MMSEG, then pass them through the synonym filter
        return new MySameTokenFilter(new MMSegTokenizer(new MaxWordSeg(dic), reader));
    }
}

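Notes: as before, we extend the Analyzer class. Here we obtain the Dictionary object, which is a singleton and holds the dictionary data. As the code shows, the text first goes through the MMSEG tokenizer, which splits it into individual tokens, and the resulting stream is then passed to our filter.

Custom synonym filter: MySameTokenFilter.java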

package cn.itcast.util;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class MySameTokenFilter extends TokenFilter {
    
    private CharTermAttribute cta = null;

    protected MySameTokenFilter(TokenStream input) {
        super(input);
        cta = this.addAttribute(CharTermAttribute.class);
    }

    @Override
    public boolean incrementToken() throws IOException {
        if(!this.input.incrementToken()){ // the input stream has no more tokens
            return false;
        }
        // Otherwise look up synonyms for the current token
        String[] sws = getSameWords(cta.toString());
        if(sws != null){
            // Overwrite the term text with each synonym in turn
            for(String s : sws){
                cta.setEmpty();
                cta.append(s);
            }
        }
        return true;
    }
    
    private String[] getSameWords(String name){
        Map<String, String[]> maps = new HashMap<String, String[]>();
        maps.put("中國", new String[]{"天朝", "大陸"});
        maps.put("我", new String[]{"咱", "俺"});
        return maps.get(name);
    }
}

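Notes: here we need a CharTermAttribute field. As mentioned earlier, this attribute acts like a cursor on the token stream: after each call to incrementToken it holds the text of the current token.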
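
The "cursor" behaviour can be modelled in a few lines of plain Java. In this sketch, TermAttr stands in for CharTermAttribute and Stream for a TokenStream. Both are hypothetical simplifications and not Lucene classes. The point is that the consumer fetches the attribute object once and then reads new contents from the same object after every incrementToken call.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

class AttributeSketch {

    // Stand-in for CharTermAttribute: one mutable buffer reused for every token.
    static final class TermAttr {
        private final StringBuilder buf = new StringBuilder();

        void set(String s) {
            buf.setLength(0);
            buf.append(s);
        }

        @Override
        public String toString() {
            return buf.toString();
        }
    }

    // Stand-in for a TokenStream: incrementToken() advances and overwrites the shared attribute.
    static final class Stream {
        final TermAttr term = new TermAttr();
        private final Iterator<String> source;

        Stream(List<String> tokens) {
            this.source = tokens.iterator();
        }

        boolean incrementToken() {
            if (!source.hasNext()) {
                return false;
            }
            term.set(source.next());
            return true;
        }
    }

    // Consumer loop in the style of displayAllTokenInfo.
    static List<String> consume(List<String> tokens) {
        Stream stream = new Stream(tokens);
        TermAttr attr = stream.term; // fetched once, before the loop
        List<String> seen = new ArrayList<String>();
        while (stream.incrementToken()) {
            seen.add(attr.toString()); // same object, new contents on every iteration
        }
        return seen;
    }
}
```

Because a filter and its input share the same attribute object, a filter that edits the attribute changes what every later stage sees.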

Helper method in AnalyzerUtils.java

public static void displayAllTokenInfo(String str, Analyzer analyzer){
    try {
        TokenStream stream = analyzer.tokenStream("content", new StringReader(str));
        PositionIncrementAttribute pia = stream.addAttribute(PositionIncrementAttribute.class);
        OffsetAttribute oa = stream.addAttribute(OffsetAttribute.class);
        CharTermAttribute cta = stream.addAttribute(CharTermAttribute.class);
        TypeAttribute ta = stream.addAttribute(TypeAttribute.class);

        while (stream.incrementToken()) {
            System.out.print("position increment: " + pia.getPositionIncrement()); // distance from the previous token
            System.out.print(", term: " + cta + "[" + oa.startOffset() + "," + oa.endOffset() + "]");
            System.out.print(", type: " + ta.type());
            System.out.println();
        }
        
    } catch (IOException e) {
        e.printStackTrace();
    }
}

Test:

@Test
public void test05(){
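    // MySameAnalyzer is built on MMSEG, so Chinese text works here
    Analyzer analyzer = new MySameAnalyzer();
    
    String text = "我來自中國西安市雁塔區"; // "I come from Yanta District, Xi'an, China"
    System.out.println("************ custom analyzer ***************");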
    AnalyzerUtils.displayAllTokenInfo(text, analyzer);
}

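Notes: the overall flow is:

1. First we instantiate the custom analyzer MySameAnalyzer, which creates a MySameTokenFilter. The filter wraps the MMSEG tokenizer: the MySameTokenFilter constructor takes a token stream and registers a CharTermAttribute on it.
2. Each time displayAllTokenInfo calls incrementToken, the filter first advances the input stream and then calls getSameWords to check whether the current token has synonyms. If there are none it returns the token unchanged; otherwise it processes the synonyms.
3. The processing calls setEmpty to clear the original term and then appends the synonym. Because the loop overwrites the same term on every iteration, the original token is lost and only the last synonym in the array survives, which is clearly not what we want. The test output is:

[Figure: test output of the first MySameTokenFilter]

As the output shows, "我" was replaced by "俺" and "中國" by "大陸". In other words, the synonyms replaced the original words instead of being added alongside them.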
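
The replacement effect is easy to reproduce in isolation. The sketch below mirrors only the loop from the first MySameTokenFilter, with a StringBuilder standing in for the CharTermAttribute. OverwriteSketch and replaceWithSynonyms are hypothetical names.

```java
class OverwriteSketch {

    // Same shape as the loop in the first filter: clear the term, then append a synonym.
    static String replaceWithSynonyms(String original, String[] synonyms) {
        StringBuilder term = new StringBuilder(original);
        if (synonyms != null) {
            for (String s : synonyms) {
                term.setLength(0); // plays the role of cta.setEmpty()
                term.append(s);    // plays the role of cta.append(s)
            }
        }
        // One buffer, one returned token: only the last synonym is left.
        return term.toString();
    }

    public static void main(String[] args) {
        System.out.println(replaceWithSynonyms("我", new String[]{"咱", "俺"}));       // prints 俺
        System.out.println(replaceWithSynonyms("中國", new String[]{"天朝", "大陸"})); // prints 大陸
    }
}
```

A single call to incrementToken can only hand back one token, so emitting the original and all of its synonyms needs state that survives between calls.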

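Solution
As mentioned earlier, every token has a position, recorded through PositionIncrementAttribute. If two tokens share a position, that is, the position increment between them is 0, they are treated as synonyms. In the output above every token has a position increment of 1, so none of them are synonyms. We can fix the problem in the example as follows: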
MySameTokenFilter.java

package cn.itcast.util;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import java.util.Stack;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
import org.apache.lucene.util.AttributeSource;

public class MySameTokenFilter extends TokenFilter {
    
    private CharTermAttribute cta = null;
    private PositionIncrementAttribute pia = null;
    private AttributeSource.State current ;
    private Stack<String> sames = null;

    protected MySameTokenFilter(TokenStream input) {
        super(input);
        cta = this.addAttribute(CharTermAttribute.class);
        pia = this.addAttribute(PositionIncrementAttribute.class);
        sames = new Stack<String>();
    }

    @Override
    public boolean incrementToken() throws IOException {
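        if(sames.size() > 0){
            // Pop the next pending synonym off the stack
            String str = sames.pop();
            restoreState(current); // restore the attributes of the original token
            cta.setEmpty();
            cta.append(str);
            // Increment 0 puts the synonym at the same position as the original token
            pia.setPositionIncrement(0);
            return true;
        }
        
        if(!this.input.incrementToken()){ // the input stream has no more tokens
            return false;
        }
        if(getSameWords(cta.toString())){
            // The token has synonyms: capture its state so each synonym can start from it
            current = captureState();
        }
        return true;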
    }
    
    private boolean getSameWords(String name){
        Map<String, String[]> maps = new HashMap<String, String[]>();
        maps.put("中國", new String[]{"天朝", "大陸"});
        maps.put("我", new String[]{"咱", "俺"});
        String[] sws = maps.get(name);
        if(sws != null){
            for(String s : sws){
                sames.push(s);
            }
            return true;
        }
        return false;
    }
}

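Notes:

1. We added three fields: a PositionIncrementAttribute (the position increment), an AttributeSource.State (the captured state of the current token), and a Stack (the pending synonyms). The attributes and the stack are initialized in the constructor.
2. On the first call the stack is empty, so incrementToken calls input.incrementToken() to advance to the first token, "我". Since "我" has synonyms, getSameWords pushes them onto the stack and we capture the token's state in current; the original token is then returned as usual. On the following calls the stack is not empty, so we pop a synonym, restore the captured state, replace the term text with the synonym, and set its position increment to 0 so that it lands on the same position as the original token. This way both the original token and its synonyms end up in the token stream. Once the stack is empty, the filter moves on to the next token.
The test output is:

[Figure: test output of the stack-based MySameTokenFilter]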
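
The call-by-call behaviour described in the notes can be traced with a plain-Java model of the filter. This is a sketch under simplifying assumptions: the token list is hand-written instead of coming from MMSEG, the captured state is reduced to the term and its increment (the real filter also restores offsets and type through restoreState), and SynonymStackSketch is a hypothetical name.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

class SynonymStackSketch {

    // Pull-based model of MySameTokenFilter: one token per incrementToken() call.
    static final class Filter {
        private final Iterator<String> input;
        private final Map<String, String[]> synonyms;
        private final Deque<String> pending = new ArrayDeque<String>();
        String term;
        int positionIncrement;

        Filter(List<String> tokens, Map<String, String[]> synonyms) {
            this.input = tokens.iterator();
            this.synonyms = synonyms;
        }

        boolean incrementToken() {
            if (!pending.isEmpty()) {
                term = pending.pop();   // emit a queued synonym...
                positionIncrement = 0;  // ...on the same position as the original
                return true;
            }
            if (!input.hasNext()) {
                return false;
            }
            term = input.next();        // emit the original token first
            positionIncrement = 1;
            String[] sws = synonyms.get(term);
            if (sws != null) {
                for (String s : sws) {
                    pending.push(s);    // queue its synonyms for the next calls
                }
            }
            return true;
        }
    }

    // Drains the filter and records each token as "term:increment".
    static List<String> run(List<String> tokens, Map<String, String[]> synonyms) {
        Filter filter = new Filter(tokens, synonyms);
        List<String> out = new ArrayList<String>();
        while (filter.incrementToken()) {
            out.add(filter.term + ":" + filter.positionIncrement);
        }
        return out;
    }
}
```

Because the synonyms go through a stack, they come out in reverse order of the array ("俺" before "咱"). That does not matter for searching, since all of them share the position of the original token.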

Next we write a test method that searches by synonym:

@Test
public void test06() throws CorruptIndexException, LockObtainFailedException, IOException{
    // Index the text with the synonym-aware analyzer
    Analyzer analyzer = new MySameAnalyzer();
    
    String text = "我來自中國西安市雁塔區"; // "I come from Yanta District, Xi'an, China"
    Directory dir = new RAMDirectory();
    IndexWriter write = new IndexWriter(dir, new IndexWriterConfig(Version.LUCENE_35, analyzer));
    Document doc = new Document();
    doc.add(new Field("content", text, Field.Store.YES, Field.Index.ANALYZED));
    write.addDocument(doc);
    write.close();
    IndexSearcher searcher = new IndexSearcher(IndexReader.open(dir));
    //TopDocs tds = searcher.search(new TermQuery(new Term("content", "中國")), 10);
    TopDocs tds = searcher.search(new TermQuery(new Term("content", "大陸")), 10);
    Document d = searcher.doc(tds.scoreDocs[0].doc);
    System.out.println(d.get("content"));
    System.out.println("************ custom analyzer ***************");
    AnalyzerUtils.displayAllTokenInfo(text, analyzer);
}

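Notes: when searching, we can use "大陸", a synonym of "中國", and still find the document. This approach is still not good, though: the synonyms are hard-coded in the filter, which makes them hard to manage.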

4. Synonym Indexing (2)

(Project lucene_analyzer02)
Here we create a dedicated type to supply the synonyms:
SamewordContext.java

package cn.itcast.util;
public interface SamewordContext {
    public String[] getSamewords(String name);
}

Implementation: SimpleSamewordContext.java

package cn.itcast.util;
import java.util.HashMap;
import java.util.Map;

public class SimpleSamewordContext implements SamewordContext {
    
    private Map<String, String[]> maps = new HashMap<String, String[]>();
    
    public SimpleSamewordContext() {
        maps.put("中國", new String[]{"天朝", "大陸"});
        maps.put("我", new String[]{"咱", "俺"});
    }
    
    @Override
    public String[] getSamewords(String name) {
        return  maps.get(name);
    }
}

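Notes: this is just a simple implementation of the interface that wraps a few synonyms; from now on we obtain synonyms through this class. To use it, the related class has to be updated: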
MySameTokenFilter.java

package cn.itcast.util;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import java.util.Stack;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
import org.apache.lucene.util.AttributeSource;

public class MySameTokenFilter extends TokenFilter {
    
    private CharTermAttribute cta = null;
    private PositionIncrementAttribute pia = null;
    private AttributeSource.State current ;
    private Stack<String> sames = null;
    private SamewordContext samewordContext; // source of the synonyms

    protected MySameTokenFilter(TokenStream input, SamewordContext samewordContext) {
        super(input);
        cta = this.addAttribute(CharTermAttribute.class);
        pia = this.addAttribute(PositionIncrementAttribute.class);
        sames = new Stack<String>();
        this.samewordContext = samewordContext;
    }

    @Override
    public boolean incrementToken() throws IOException {
        
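        if(sames.size() > 0){
            // Pop the next pending synonym off the stack
            String str = sames.pop();
            restoreState(current); // restore the attributes of the original token
            cta.setEmpty();
            cta.append(str);
            // Increment 0 puts the synonym at the same position as the original token
            pia.setPositionIncrement(0);
            return true;
        }
        
        if(!this.input.incrementToken()){ // the input stream has no more tokens
            return false;
        }
        if(addSames(cta.toString())){
            // The token has synonyms: capture its state so each synonym can start from it
            current = captureState();
        }
        return true;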
    }
    
    private boolean addSames(String name){
        String[] sws = samewordContext.getSamewords(name);
        if(sws != null){
            for(String s : sws){
                sames.push(s);
            }
            return true;
        }
        return false;
    }
}

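Notes: we added a SamewordContext field to this class to supply the synonyms, and addSames uses it to look them up. Whoever creates a MySameTokenFilter must therefore pass a SamewordContext to the constructor. Note that this is programming to an interface: if we later want a different synonym store, we only need to write another implementation of the interface.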
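
As an illustration of swapping the synonym store, here is a sketch of a second implementation that parses synonyms from text instead of hard-coding them. TextSamewordContext and its "word=syn1,syn2" line format are assumptions made for this example and are not part of the original project. The interface is repeated so that the sketch compiles on its own.

```java
import java.util.HashMap;
import java.util.Map;

// Same interface as in the article, repeated here so the sketch is self-contained.
interface SamewordContext {
    String[] getSamewords(String name);
}

// Hypothetical alternative store: one "word=syn1,syn2" entry per line, '#' starts a comment line.
class TextSamewordContext implements SamewordContext {

    private final Map<String, String[]> maps = new HashMap<String, String[]>();

    TextSamewordContext(String text) {
        for (String raw : text.split("\\r?\\n")) {
            String line = raw.trim();
            int eq = line.indexOf('=');
            // Skip blank lines, comments, and lines without both a key and a value.
            if (line.isEmpty() || line.startsWith("#") || eq <= 0 || eq == line.length() - 1) {
                continue;
            }
            String key = line.substring(0, eq).trim();
            String[] values = line.substring(eq + 1).trim().split("\\s*,\\s*");
            maps.put(key, values);
        }
    }

    @Override
    public String[] getSamewords(String name) {
        return maps.get(name); // null when the word has no synonyms, as in SimpleSamewordContext
    }
}
```

Since MySameTokenFilter only depends on the interface, passing a TextSamewordContext instead of a SimpleSamewordContext would require no change to the filter itself. The text could come from a file, for example, so the synonyms can be edited without recompiling.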

Last edited: 2017.12.03 08:16:45

© Copyright belongs to the author. For reprints or content partnerships, please contact the author.
