Knowledge Graphs: Knowledge Extraction in Practice with LSTM+CRF

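I. Introduction

The idea behind this article comes mainly from named entity recognition (NER) with LSTM+CRF. In NER, tagging schemes such as BIO or BIOES are used to recognize person names, place names, organization names, and other proper nouns. If we treat the three parts of a triple, the subject, predicate, and object (which can also be read as entity-relation-entity), as three kinds of proper nouns to be recognized, then the same approach can extract triples. Based on this idea, let's try it in practice and see how well it works.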

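II. Overview

1. Data source

This article focuses on extracting relationships between people from articles about history. The data comes from http://www.lishixinzhi.com and http://www.uuqgs.com/

2. Predicted classes (7)

Beginning of a subject: B-SUBJECT
Inside a subject (not the beginning): I-SUBJECT
Beginning of a predicate: B-PREDICATE
Inside a predicate (not the beginning): I-PREDICATE
Beginning of an object: B-OBJECT
Inside an object (not the beginning): I-OBJECT
Other: O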
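
With these seven tags, turning a tagged sentence back into triples is mostly bookkeeping. The following is a minimal sketch of that step, not the project's actual code: the helper names `extract_spans` and `spans_to_triples`, and the rule of pairing every subject with every predicate and every object in the same sentence, are assumptions made for illustration.

```python
def extract_spans(chars, tags):
    """Collect (role, text) spans from a BIO-tagged character sequence."""
    spans = []
    role, buf = None, []
    for ch, tag in zip(chars, tags):
        if tag.startswith("B-"):
            # A B- tag always opens a new span; close the previous one first
            if buf:
                spans.append((role, "".join(buf)))
            role, buf = tag[2:], [ch]
        elif tag.startswith("I-") and role == tag[2:]:
            buf.append(ch)
        else:
            # "O" or an I- tag that does not continue the open span
            if buf:
                spans.append((role, "".join(buf)))
            role, buf = None, []
    if buf:
        spans.append((role, "".join(buf)))
    return spans


def spans_to_triples(spans):
    """Pair every subject with every predicate and object (assumed rule)."""
    subjects = [text for role, text in spans if role == "SUBJECT"]
    predicates = [text for role, text in spans if role == "PREDICATE"]
    objects = [text for role, text in spans if role == "OBJECT"]
    return [(s, p, o) for s in subjects for p in predicates for o in objects]


chars = list("唐玄宗有兩個同母妹妹：金仙公主和玉真公主。")
tags = (["B-SUBJECT", "I-SUBJECT", "I-SUBJECT"] + ["O"] * 5
        + ["B-PREDICATE", "I-PREDICATE"] + ["O"]
        + ["B-OBJECT"] + ["I-OBJECT"] * 3 + ["O"]
        + ["B-OBJECT"] + ["I-OBJECT"] * 3 + ["O"])
triples = spans_to_triples(extract_spans(chars, tags))
print(triples)
```

The pairing rule is deliberately naive: it works for sentences with one subject and one predicate, and would need refining for sentences that mention several relations.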

3. Framework

keras

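4. Model structure

At its core, this extraction task is still a per-character classification problem built on an LSTM. The CRF layer is there purely to keep the output sequence well-formed: a CRF scores transitions between adjacent tags, so it strongly discourages invalid sequences. For example, I-PERSON should only appear directly after B-PERSON or I-PERSON.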
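
To make the kind of constraint concrete, here is a small checker for BIO sequences. It is only an illustration of the rule a CRF layer is expected to learn from data, assuming the usual BIO convention that an I- tag must continue a span of the same type; the function name `is_valid_bio` is made up for this sketch and is not part of the model.

```python
def is_valid_bio(tags):
    """Return True if every I-X tag directly follows B-X or I-X."""
    prev = "O"
    for tag in tags:
        if tag.startswith("I-"):
            role = tag[2:]
            if prev not in ("B-" + role, "I-" + role):
                return False
        prev = tag
    return True


print(is_valid_bio(["B-SUBJECT", "I-SUBJECT", "O", "B-OBJECT"]))
print(is_valid_bio(["O", "I-OBJECT"]))
```

A plain softmax over each position can produce the second kind of sequence; the transition scores in the CRF are what make it unlikely.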

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding_1 (Embedding)      (None, 91, 100)           60000     
_________________________________________________________________
bidirectional_1 (Bidirection (None, 91, 100)           60400     
_________________________________________________________________
time_distributed_1 (TimeDist (None, 91, 7)             707       
_________________________________________________________________
crf_1 (CRF)                  (None, 91, 7)             119       
=================================================================
Total params: 121,226
Trainable params: 61,226
Non-trainable params: 60,000
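
The parameter counts in this summary can be reproduced by hand, which is a useful sanity check on the layer sizes. The sketch below assumes 600 embedding rows (implied by 60000 / 100) and assumes the CRF layer holds an input kernel, a tag-transition matrix, a bias, and two boundary vectors; that layout is an assumption about the layer's internals that happens to reproduce the 119 shown above.

```python
vocab_rows = 600      # rows of the embedding matrix, implied by 60000 / 100
embedding_dim = 100
lstm_units = 50       # units per direction
num_tags = 7

embedding = vocab_rows * embedding_dim
# One LSTM direction has 4 gates, each with input weights, recurrent weights and a bias
one_direction = 4 * (lstm_units * (embedding_dim + lstm_units) + lstm_units)
bilstm = 2 * one_direction
# The dense layer sees the concatenated forward and backward outputs
dense = (2 * lstm_units) * num_tags + num_tags
# Assumed CRF layout: input kernel, transition matrix, bias, two boundary vectors
crf = num_tags * num_tags + num_tags * num_tags + 3 * num_tags
total = embedding + bilstm + dense + crf
print(embedding, bilstm, dense, crf, total)  # prints 60000 60400 707 119 121226
```

The 60,000 non-trainable parameters match the embedding layer exactly, which suggests the embedding was frozen in the run that produced this summary, although the model code in this article does not show that.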

5. Project workflow

    # Build the vocabulary mappings
    word2id, tag2id, id2word, id2tag = getWordAndTagId('train.txt')
    # Read the sentences and their tags
    sentences, tags = getSentencesAndTags('train.txt')
    # Convert the sentences and tags to ids
    sentencesIds, tagsIds = sentencesAndTags2id(sentences, tags, word2id, tag2id)
    # Pad the sentences and tags so that every input has the same length
    sentencesIds = pad_sequences(sentencesIds, padding='post')
    tagsIds = pad_sequences(tagsIds, padding='post')
    print(sentencesIds.shape)
    print(tagsIds.shape)
    # Build the model (this rebinds the name `model` from the function to the instance)
    model = model(len(word2id), 100, sentencesIds.shape[1], len(tag2id))
    # Train
    history = model.fit(sentencesIds, tagsIds.reshape([len(tagsIds), -1, 1]), epochs=500)

III. Data annotation

As for training data, I could not find a suitable annotated dataset, so I had to annotate it myself, as follows:

長 B-SUBJECT
孫 I-SUBJECT
無 I-SUBJECT
忌 I-SUBJECT
看 O
到 O
外 B-PREDICATE
甥 I-PREDICATE
承 B-OBJECT
乾 I-OBJECT
、 O
李 B-OBJECT
泰 I-OBJECT
都 O
完 O
了 O
。 O

唐 B-SUBJECT
玄 I-SUBJECT
宗 I-SUBJECT
有 O
兩 O
個 O
同 O
母 O
妹 B-PREDICATE
妹 I-PREDICATE
： O
金 B-OBJECT
仙 I-OBJECT
公 I-OBJECT
主 I-OBJECT
和 O
玉 B-OBJECT
真 I-OBJECT
公 I-OBJECT
主 I-OBJECT
。 O

... (many more omitted here)

李 B-SUBJECT
文 I-SUBJECT
有 O
兩 O
個 O
妹 B-PREDICATE
妹 I-PREDICATE
， O
一 O
個 O
叫 O
宇 B-OBJECT
宇 I-OBJECT
， O
一 O
個 O
叫 O
佳 B-OBJECT
佳 I-OBJECT
。 O
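
Typing one tag per character by hand is slow, so a small helper can produce lines in this format from a sentence plus a list of marked phrases. This is only a sketch of how such a helper might look, not the tool used to build the data above; the name `annotate` and the choice to tag every occurrence of a phrase are assumptions.

```python
def annotate(sentence, entities):
    """Return 'character tag' lines for a sentence.

    entities is a list of (text, role) pairs, e.g. ("李文", "SUBJECT").
    Every occurrence of each text in the sentence is tagged (assumed behaviour).
    """
    tags = ["O"] * len(sentence)
    for text, role in entities:
        if not text:
            continue  # an empty phrase would never advance the search
        start = sentence.find(text)
        while start != -1:
            tags[start] = "B-" + role
            for i in range(start + 1, start + len(text)):
                tags[i] = "I-" + role
            start = sentence.find(text, start + len(text))
    return [f"{ch} {tag}" for ch, tag in zip(sentence, tags)]


lines = annotate(
    "李文有兩個妹妹，一個叫宇宇，一個叫佳佳。",
    [("李文", "SUBJECT"), ("妹妹", "PREDICATE"), ("宇宇", "OBJECT"), ("佳佳", "OBJECT")],
)
print("\n".join(lines))
```

Writing the returned lines to a file, with a blank line between sentences, gives the layout shown in the samples above.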

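IV. Implementation

1. Data preprocessing

1.1 Dictionary mapping

This step covers filtering out low-frequency characters, mapping characters to ids (word2id), and mapping the predicted classes to ids (label2id, named tag2id in the code). Implementations vary, so I won't go into detail, but pay special attention to how out-of-vocabulary characters are handled: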

    word_size = len(words)
    word2id = {count[0]: index for index, count in enumerate(words, start=1)}
    id2word = {index: count[0] for index, count in enumerate(words, start=1)}
    tag2id = {count[0]: index for index, count in enumerate(tags)}
    id2tag = {index: count[0] for index, count in enumerate(tags)}
    # Padding token
    word2id['<PAD>'] = 0
    # Out-of-vocabulary token
    word2id['<UNK>'] = word_size + 1
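
The snippet above starts from a ready-made `words` list of (character, count) pairs. As a rough sketch of where such a list could come from, the example below counts characters with `collections.Counter` and drops rare ones. The function name `build_vocab` and the `min_count` threshold are hypothetical; the original `getWordAndTagId` is not shown in this article and may differ.

```python
from collections import Counter


def build_vocab(sentences, min_count=1):
    """Build word2id from tokenized sentences, dropping rare characters."""
    counts = Counter(ch for sentence in sentences for ch in sentence)
    # (character, count) pairs, most frequent first, rare characters removed
    words = [(ch, n) for ch, n in counts.most_common() if n >= min_count]
    word2id = {pair[0]: index for index, pair in enumerate(words, start=1)}
    word2id['<PAD>'] = 0               # padding token
    word2id['<UNK>'] = len(words) + 1  # out-of-vocabulary token
    return word2id


sentences = [list("李文有兩個妹妹"), list("唐玄宗有兩個同母妹妹")]
word2id = build_vocab(sentences, min_count=2)
# Characters that were filtered out fall back to the '<UNK>' id
ids = [word2id.get(ch, word2id['<UNK>']) for ch in "李文有"]
print(word2id)
print(ids)
```

With `min_count=2`, only the four characters that occur at least twice keep their own ids, and everything else is looked up as `<UNK>`.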

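1.2 Reading sentences and tags from the training file

def getSentencesAndTags(filePath):
    '''
    Read sentences and their tags from a file.
    :param filePath: file with one "character tag" pair per line and a blank line between sentences
    :return: (sentences, tags), two parallel lists of lists
    '''
    sentences = []
    tags = []
    sentence = []
    tag = []
    with open(filePath, encoding='utf-8') as file:
        for line in file:
            wordAndTag = line.split()
            if len(wordAndTag) == 2:
                sentence.append(wordAndTag[0])
                tag.append(wordAndTag[1])
            elif sentence:
                # Any other line (normally a blank one) ends the current sentence
                sentences.append(sentence)
                tags.append(tag)
                sentence = []
                tag = []
    # Keep the last sentence when the file does not end with a blank line
    if sentence:
        sentences.append(sentence)
        tags.append(tag)
    return sentences, tags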

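1.3 Converting input text to ids

Use the dictionary to convert the input text into a sequence of numbers:

def sentencesAndTags2id(sentences, tags, word2id, tag2id):
    '''
    Convert sentences and tags to ids.
    :param sentences: list of sentences, each a list of characters
    :param tags: list of tag sequences, parallel to sentences
    :param word2id: character-to-id mapping, including '<UNK>'
    :param tag2id: tag-to-id mapping
    :return: (sentencesIds, tagsIds)
    '''
    # Characters missing from the dictionary map to the '<UNK>' id
    unkId = word2id['<UNK>']
    sentencesIds = [[word2id.get(char, unkId) for char in sentence] for sentence in sentences]
    tagsIds = [[tag2id[label] for label in tag] for tag in tags]
    return sentencesIds, tagsIds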

1.4 Padding

To keep the data dimensions consistent, pad the sentences:

from keras_preprocessing.sequence import pad_sequences
sentencesIds = pad_sequences(sentencesIds, padding='post')
tagsIds = pad_sequences(tagsIds, padding='post')
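
To show what post-padding does to the id lists, here is a pure-Python stand-in. It is not the library function: `pad_post` is a made-up name, it returns plain lists rather than an array, and it only covers the case used above, where no maximum length is given and every sequence is padded with zeros at the end up to the length of the longest one.

```python
def pad_post(sequences, value=0):
    """Pad each sequence at the end so all have the length of the longest."""
    max_len = max(len(seq) for seq in sequences)
    return [list(seq) + [value] * (max_len - len(seq)) for seq in sequences]


padded = pad_post([[3, 1, 2], [4], [5, 6]])
print(padded)  # prints [[3, 1, 2], [4, 0, 0], [5, 6, 0]]
```

Because padded positions get id 0, and tag2id above also assigns 0 to its first tag, the model presumably relies on masking (mask_zero=True in the embedding layer) to keep those positions from being treated as real tags.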

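2. Building the model

# CRF is assumed to come from keras-contrib, whose CRF layer provides
# sparse_target, loss_function and accuracy
from keras.models import Sequential
from keras.layers import Embedding, Bidirectional, LSTM, TimeDistributed, Dense
from keras_contrib.layers import CRF

def model(vocabSize, embeddingDim, inputLength, tagSize):
    model = Sequential()
    # Id 0 is reserved for padding and is masked out
    model.add(Embedding(vocabSize + 1, embeddingDim, input_length=inputLength, mask_zero=True))
    # 50 units per direction, with one output vector per character
    model.add(Bidirectional(LSTM(50, return_sequences=True)))
    # Per-character scores for each tag
    model.add(TimeDistributed(Dense(tagSize)))
    # sparse_target=True: targets are integer tag ids rather than one-hot vectors
    crf_layer = CRF(tagSize, sparse_target=True)
    model.add(crf_layer)
    model.compile('adam', loss=crf_layer.loss_function, metrics=[crf_layer.accuracy])
    model.summary()
    return model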
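
Once the model is trained, its output for a sentence has to be mapped back to tag names before any triple can be read off. The sketch below assumes the prediction for one sentence is a list of per-tag score rows, one row per character position, matching the (None, 91, 7) output shape in the summary. The name `decode_tags` and the toy `id2tag` and `scores` values are invented for illustration.

```python
def decode_tags(scores, length, id2tag):
    """Pick the highest-scoring tag at each real (unpadded) position."""
    tags = []
    for row in scores[:length]:
        best = max(range(len(row)), key=lambda i: row[i])
        tags.append(id2tag[best])
    return tags


# Toy values: three tags and a three-character sentence padded to length four
id2tag = {0: "O", 1: "B-SUBJECT", 2: "I-SUBJECT"}
scores = [
    [0.1, 0.8, 0.1],
    [0.0, 0.1, 0.9],
    [0.7, 0.2, 0.1],
    [1.0, 0.0, 0.0],  # padding position, ignored via length
]
decoded = decode_tags(scores, 3, id2tag)
print(decoded)
```

Passing the original sentence length keeps the padded tail out of the result, so the decoded tags line up one-to-one with the input characters.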

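3. Testing

A simple test gives the following results:

[Image: test results]

On relatively simple sentences the model gives fairly good results, but because there is not enough training data, it still sometimes fails to extract anything or extracts the wrong result, for example: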