IK分詞器源碼解析（一）：構(gòu)造字典樹

最近在搞ES堡牡，結(jié)合了IK分詞器午绳，偶然間看到IK的主詞典中有27萬的詞筒溃，加上其他的拓展詞庫差不多也有小一百萬了马篮，于是比較好奇IK是如何判斷用戶輸入的詞是否在詞庫中的，于是索性下載了IK的源碼讀一讀怜奖，接下來是分詞流程的解析浑测。
首先先看一下主類，是一個(gè)用來測試的類

public class IKSegmenterTest {
   static String text = "IK Analyzer是一個(gè)結(jié)合詞典分詞和文法分詞的中文分詞開源工具包歪玲。它使用了全新的正向迭代最細(xì)粒度切分算法迁央。";
   public static void main(String[] args) throws IOException {
       IKSegmenter segmenter = new IKSegmenter(new StringReader(text), false);
       Lexeme next;
       System.out.print("非智能分詞結(jié)果：");
       while((next=segmenter.next())!=null){
           System.out.print(next.getLexemeText()+" ");
       }
       System.out.println();
       System.out.println("----------------------------分割線------------------------------");
       IKSegmenter smartSegmenter = new IKSegmenter(new StringReader(text), true);
       System.out.print("智能分詞結(jié)果：");
       while((next=smartSegmenter.next())!=null) {
           System.out.print(next.getLexemeText() + " ");
       }
   }
}

然后從第一行的構(gòu)造器點(diǎn)擊去，這里解釋一下這個(gè)useSmart滥崩。IK分詞器有兩種分詞模式岖圈，一種是ik_max_word，這種模式下會對字符串進(jìn)行最細(xì)粒度的拆分钙皮，如會將“中華人民共和國人民大會堂”拆分為“中華人民共和國蜂科、中華人民、中華短条、華人导匣、人民共和國、人民茸时、共和國贡定、大會堂、大會可都、會堂等詞語缓待。而另一種ik_smart模式會做粗粒度的拆分比如拆分成“中華人民共和國、人民大會堂”汹粤。構(gòu)造器源碼如下

    /**
     * IK分詞器構(gòu)造函數(shù)
     * @param input 
     * @param useSmart 為true命斧，使用智能分詞策略
     * 
     * 非智能分詞：細(xì)粒度輸出所有可能的切分結(jié)果
     * 智能分詞： 合并數(shù)詞和量詞，對分詞結(jié)果進(jìn)行歧義判斷
     */
    public IKSegmenter(Reader input , boolean useSmart){
        this.input = input;
        this.cfg = DefaultConfig.getInstance();
        this.cfg.setUseSmart(useSmart);
        this.init();
    }

然后從getInstance點(diǎn)進(jìn)去看一下如何構(gòu)造的config

    /**
     * 返回單例
     * @return Configuration單例
     */
    public static Configuration getInstance(){
        return new DefaultConfig();
    }

在這里我們發(fā)現(xiàn)IK通過static搞了一個(gè)簡單的單例模式嘱兼，繼續(xù)追蹤new部分源碼

private DefaultConfig(){        
        props = new Properties();
        
        InputStream input = this.getClass().getClassLoader().getResourceAsStream(FILE_NAME);
        if(input != null){
            try {
                props.loadFromXML(input);
            } catch (InvalidPropertiesFormatException e) {
                e.printStackTrace();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }

可以看到這里讀取了Resource文件夾下的配置文件放到了props里国葬，也就是IKAnalyzer.cfg.xml，它的配置文件讀取過程不像Spring那么復(fù)雜芹壕，直接就通過類庫就完成了汇四。這個(gè)文件主要是用來配置用戶自己拓展的詞庫的，順手貼一下配置文件踢涌。

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">  
<properties>  
    <comment>IK Analyzer 擴(kuò)展配置</comment>
    <!--用戶可以在這里配置自己的擴(kuò)展字典 -->
    <entry key="ext_dict">ext.dic;</entry> 
    <!--用戶可以在這里配置自己的擴(kuò)展停止詞字典-->
    <entry key="ext_stopwords">stopword.dic;</entry> 
</properties>

到這里算是把config初始化完成了通孽，我們回到最上面來看init方法里面是如何初始化的

/**
    * 初始化
    */
   private void init(){
       //初始化詞典單例
       Dictionary.initial(this.cfg);
       //初始化分詞上下文
       this.context = new AnalyzeContext(this.cfg);
       //加載子分詞器
       this.segmenters = this.loadSegmenters();
       //加載歧義裁決器
       this.arbitrator = new IKArbitrator();
   }

這里的代碼也是比較簡單的，能看出來這里是通過剛才加載的配置文件來加載詞庫睁壁，加載分詞器背苦。但是上下文和裁決器還是不知道是什么互捌，我們先看一下詞典是如何初始化的

   /**
    * 詞典初始化
    * 由于IK Analyzer的詞典采用Dictionary類的靜態(tài)方法進(jìn)行詞典初始化
    * 只有當(dāng)Dictionary類被實(shí)際調(diào)用時(shí)，才會開始載入詞典行剂，
    * 這將延長首次分詞操作的時(shí)間
    * 該方法提供了一個(gè)在應(yīng)用加載階段就初始化字典的手段
    * @return Dictionary
    */
   public static Dictionary initial(Configuration cfg){
       if(singleton == null){
           synchronized(Dictionary.class){
               if(singleton == null){
                   singleton = new Dictionary(cfg);
                   return singleton;
               }
           }
       }
       return singleton;
   }

我們可以看到這里用了一個(gè)面試中經(jīng)常被問到但是我們從來沒用過（手動狗頭）的雙重檢查的方式實(shí)現(xiàn)了一個(gè)單例模式秕噪，然后點(diǎn)進(jìn)去看一下new方法部分

private Dictionary(Configuration cfg){
        this.cfg = cfg;
        this.loadMainDict();
        this.loadStopWordDict();
        this.loadQuantifierDict();
    }

我們可以看到這里load了三個(gè)詞典，分別是主詞典厚宰，停止詞詞典以及數(shù)量詞詞典腌巾，我們挑主詞典的加載看一下

    private void loadMainDict(){
        //建立一個(gè)主詞典實(shí)例
        _MainDict = new DictSegment((char)0);
        //讀取主詞典文件
        InputStream is = this.getClass().getClassLoader().getResourceAsStream(cfg.getMainDictionary());
        if(is == null){
            throw new RuntimeException("Main Dictionary not found!!!");
        }
        
        try {
            BufferedReader br = new BufferedReader(new InputStreamReader(is , "UTF-8"), 512);
            String theWord = null;
            do {
                theWord = br.readLine();
                if (theWord != null && !"".equals(theWord.trim())) {
                    _MainDict.fillSegment(theWord.trim().toLowerCase().toCharArray());
                }
            } while (theWord != null);
            
        } catch (IOException ioe) {
            System.err.println("Main Dictionary loading exception.");
            ioe.printStackTrace();
            
        }finally{
            try {
                if(is != null){
                    is.close();
                    is = null;
                }
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
        //加載擴(kuò)展詞典
        this.loadExtDict();
    }

首先它new了一個(gè)DictSegment對象，并傳了一個(gè)0進(jìn)去铲觉，我們點(diǎn)擊去看一下澈蝙。

/**
 * 詞典樹分段，表示詞典樹的一個(gè)分枝
 */
class DictSegment implements Comparable<DictSegment>{
    
    //公用字典表撵幽，存儲漢字
    private static final Map<Character , Character> charMap = new HashMap<Character , Character>(16 , 0.95f);
    //數(shù)組大小上限
    private static final int ARRAY_LENGTH_LIMIT = 3;

    
    //Map存儲結(jié)構(gòu)
    private Map<Character , DictSegment> childrenMap;
    //數(shù)組方式存儲結(jié)構(gòu)
    private DictSegment[] childrenArray;
    
    
    //當(dāng)前節(jié)點(diǎn)上存儲的字符
    private Character nodeChar;
    //當(dāng)前節(jié)點(diǎn)存儲的Segment數(shù)目
    //storeSize <=ARRAY_LENGTH_LIMIT 灯荧，使用數(shù)組存儲， storeSize >ARRAY_LENGTH_LIMIT ,則使用Map存儲
    private int storeSize = 0;
    //當(dāng)前DictSegment狀態(tài) ,默認(rèn) 0 , 1表示從根節(jié)點(diǎn)到當(dāng)前節(jié)點(diǎn)的路徑表示一個(gè)詞
    private int nodeState = 0;

通過查看DictSegment這個(gè)類并齐，我們可以確認(rèn)IK是使用的字典樹對用戶的輸入進(jìn)行的匹配漏麦，可以說是用空間換取了時(shí)間，這棵樹的根結(jié)點(diǎn)是0况褪，然后每個(gè)節(jié)點(diǎn)存儲一個(gè)字符撕贞。
然后根據(jù)config中的默認(rèn)配置把主詞典的文件加載到內(nèi)存中，這里是個(gè)循環(huán)测垛，一次加載一行捏膨。這里給大家看一下主詞典的文件，一行一個(gè)詞食侮，一共27萬行

image.png

然后我們往下看是如何使用加載到內(nèi)存中的詞進(jìn)行fillSegment操作的

    /**
     * 加載填充詞典片段
     * @param charArray
     */
    void fillSegment(char[] charArray){
        this.fillSegment(charArray, 0 , charArray.length , 1); 
    }

這里貌似沒什么好說的号涯，繼續(xù)往下看

    private synchronized void fillSegment(char[] charArray , int begin , int length , int enabled){
        //獲取字典表中的漢字對象
        Character beginChar = new Character(charArray[begin]);
        Character keyChar = charMap.get(beginChar);
        //字典中沒有該字，則將其添加入字典
        if(keyChar == null){
            charMap.put(beginChar, beginChar);
            keyChar = beginChar;
        }
        
        //搜索當(dāng)前節(jié)點(diǎn)的存儲锯七，查詢對應(yīng)keyChar的keyChar链快，如果沒有則創(chuàng)建
        DictSegment ds = lookforSegment(keyChar , enabled);
        if(ds != null){
            //處理keyChar對應(yīng)的segment
            if(length > 1){
                //詞元還沒有完全加入詞典樹
                ds.fillSegment(charArray, begin + 1, length - 1 , enabled);
            }else if (length == 1){
                //已經(jīng)是詞元的最后一個(gè)char,設(shè)置當(dāng)前節(jié)點(diǎn)狀態(tài)為enabled，
                //enabled=1表明一個(gè)完整的詞眉尸，enabled=0表示從詞典中屏蔽當(dāng)前詞
                ds.nodeState = enabled;
            }
        }

    }

首先這里把傳入的這一行字符串中的第一個(gè)詞拿了出來域蜗，放到一個(gè)用來存儲漢字的公共字典里，至于為什么不像主詞庫一樣放到文件里噪猾，可能是覺得維護(hù)兩個(gè)文件比較麻煩霉祸，也可能后面能看到原因，但是這里最讓我奇怪的是這個(gè)用來做公共字典的HashMap袱蜡，它的負(fù)載因子沒有用默認(rèn)的0.75丝蹭，而是用了0.95，這里是個(gè)需要之后思考的地方

    //公用字典表坪蚁，存儲漢字
    private static final Map<Character , Character> charMap = new HashMap<Character , Character>(16 , 0.95f);

然后走到lookforSegment方法奔穿，這里是將詞庫中的詞加入到字典樹的方法镜沽，也是一個(gè)比較重要的方法是，我們點(diǎn)進(jìn)去看一下

/**
     * 查找本節(jié)點(diǎn)下對應(yīng)的keyChar的segment  * 
     * @param keyChar
     * @param create  =1如果沒有找到巫橄，則創(chuàng)建新的segment ; =0如果沒有找到淘邻，不創(chuàng)建，返回null
     * @return
     */
    private DictSegment lookforSegment(Character keyChar ,  int create){
        
        DictSegment ds = null;

        if(this.storeSize <= ARRAY_LENGTH_LIMIT){
            //獲取數(shù)組容器湘换，如果數(shù)組未創(chuàng)建則創(chuàng)建數(shù)組
            DictSegment[] segmentArray = getChildrenArray();            
            //搜尋數(shù)組
            DictSegment keySegment = new DictSegment(keyChar);
            int position = Arrays.binarySearch(segmentArray, 0 , this.storeSize, keySegment);
            if(position >= 0){
                ds = segmentArray[position];
            }
        
            //遍歷數(shù)組后沒有找到對應(yīng)的segment
            if(ds == null && create == 1){
                ds = keySegment;
                if(this.storeSize < ARRAY_LENGTH_LIMIT){
                    //數(shù)組容量未滿，使用數(shù)組存儲
                    segmentArray[this.storeSize] = ds;
                    //segment數(shù)目+1
                    this.storeSize++;
                    Arrays.sort(segmentArray , 0 , this.storeSize);
                    
                }else{
                    //數(shù)組容量已滿统阿，切換Map存儲
                    //獲取Map容器彩倚，如果Map未創(chuàng)建,則創(chuàng)建Map
                    Map<Character , DictSegment> segmentMap = getChildrenMap();
                    //將數(shù)組中的segment遷移到Map中
                    migrate(segmentArray ,  segmentMap);
                    //存儲新的segment
                    segmentMap.put(keyChar, ds);
                    //segment數(shù)目+1 ，  必須在釋放數(shù)組前執(zhí)行storeSize++ 扶平， 確保極端情況下帆离，不會取到空的數(shù)組
                    this.storeSize++;
                    //釋放當(dāng)前的數(shù)組引用
                    this.childrenArray = null;
                }

            }           
            
        }else{
            //獲取Map容器，如果Map未創(chuàng)建,則創(chuàng)建Map
            Map<Character , DictSegment> segmentMap = getChildrenMap();
            //搜索Map
            ds = (DictSegment)segmentMap.get(keyChar);
            if(ds == null && create == 1){
                //構(gòu)造新的segment
                ds = new DictSegment(keyChar);
                segmentMap.put(keyChar , ds);
                //當(dāng)前節(jié)點(diǎn)存儲segment數(shù)目+1
                this.storeSize ++;
            }
        }

        return ds;
    }

我們可以看到结澄，IK的字典樹存儲規(guī)則是按是否超過ARRAY_LENGTH_LIMIT也就是3來決定的

如果沒超過3哥谷，則子節(jié)點(diǎn)用數(shù)組存儲。

判斷如果當(dāng)前節(jié)點(diǎn)的存儲個(gè)數(shù)在3及以下麻献，先在數(shù)組中進(jìn)行一個(gè)二分查找们妥，如果能夠找到就把值拿出來。
如果沒有找到且允許新建的話勉吻，再判斷是否容量超過3监婶，沒超過則直接使用數(shù)組存儲，加入到數(shù)組中并進(jìn)行sort齿桃，由于DictSegment重寫了compareTo方法惑惶，所以是按存儲字符的字典序進(jìn)行排序的。如果超過3短纵，則改為用map存儲带污，將之前數(shù)組中的內(nèi)容放到新建的map中并將數(shù)組的內(nèi)存釋放掉。

超過3則用map存儲香到。

搜索整個(gè)map鱼冀，若有則把值拿出來，沒有則put

之后我們遞歸執(zhí)行這個(gè)過程养渴，直到這一個(gè)字符串中的所有字符都加入到字典樹中雷绢，當(dāng)最后一個(gè)字符放入字典樹時(shí)，會設(shè)置一個(gè)標(biāo)記位理卑，也就是之前的enabled翘紊，來表示當(dāng)前是一個(gè)完整詞的結(jié)尾。當(dāng)主詞典所有詞都加入字典樹之后藐唠，主詞典加載完畢帆疟，然后加載擴(kuò)展詞典等其他詞典鹉究，過程完全一致