Preface
In the article NLP入门(四)命名实体识别(NER), I introduced two tools for named entity recognition: NLTK and Stanford NLP. In this article we will learn how to implement NER step by step with deep learning tools; if you stick with it to the end, you will certainly get something out of it.
OK, without further ado, let's get down to business.
Almost all NLP work relies on a solid corpus. The corpus this project uses for NER is shown below (the file is named train.txt and has 42,000 lines in total; only the first 15 lines are shown here, and the full corpus can be downloaded from the GitHub address at the end of this article):
played on Monday ( home team in CAPS ) :
VBD IN NNP ( NN NN IN NNP ) :
O O O O O O O O O O
American League
NNP NNP
B-MISC I-MISC
Cleveland 2 DETROIT 1
NNP CD NNP CD
B-ORG O B-ORG O
BALTIMORE 12 Oakland 11 ( 10 innings )
VB CD NNP CD ( CD NN )
B-ORG O B-ORG O O O O O
TORONTO 5 Minnesota 3
TO CD NNP CD
B-ORG O B-ORG O
......
A quick note on the corpus structure: it has 42,000 lines, grouped in threes. In each group, the first line is an English sentence, the second line gives each word's part-of-speech tag (for English POS tags, see the article NLP入门(三)词形还原(Lemmatization)), and the third line is the NER annotation, whose exact meaning is explained later.
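To make this three-line grouping concrete, here is a minimal sanity-check sketch; like load_data() below, it assumes the fields within a line are tab-separated, and it assumes train.txt sits in the current directory:
# Quick sanity check of the corpus structure: every three lines form one sentence.
with open('train.txt', 'r') as f:
    lines = [line.strip() for line in f.readlines()]

triples = []
for i in range(0, len(lines) - 2, 3):
    words = lines[i].split('\t')          # the English sentence
    pos_tags = lines[i + 1].split('\t')   # one POS tag per word
    ner_tags = lines[i + 2].split('\t')   # one NER tag per word
    triples.append((words, pos_tags, ner_tags))

print(len(triples))   # about 42000 / 3 = 14000 sentence groups
print(triples[1])     # (['American', 'League'], ['NNP', 'NNP'], ['B-MISC', 'I-MISC'])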
Our NER project is named DL_4_NER, and its structure is as follows:
The role of each file in the project is:
- utils.py: project configuration and data loading
- data_processing.py: data exploration
- Bi_LSTM_Model_training.py: model creation and training
- Bi_LSTM_Model_predict.py: NER prediction on new sentences
Next, I will walk through the project step by step alongside the code files. Once all the steps have been covered, the project is complete, and you will know how to implement named entity recognition (NER) with deep learning.
Let's begin!
Project configuration
The first step is project configuration and data loading, implemented in utils.py. The complete code is as follows:
# -*- coding: utf-8 -*-
import numpy as np
import pandas as pd
# basic settings for DL_4_NER Project
BASE_DIR = "F://NERSystem"
CORPUS_PATH = "%s/train.txt" % BASE_DIR
KERAS_MODEL_SAVE_PATH = '%s/Bi-LSTM-4-NER.h5' % BASE_DIR
WORD_DICTIONARY_PATH = '%s/word_dictionary.pk' % BASE_DIR
INVERSE_WORD_DICTIONARY_PATH = '%s/inverse_word_dictionary.pk' % BASE_DIR
LABEL_DICTIONARY_PATH = '%s/label_dictionary.pk' % BASE_DIR
OUTPUT_DICTIONARY_PATH = '%s/output_dictionary.pk' % BASE_DIR
CONSTANTS = [
    KERAS_MODEL_SAVE_PATH,
    INVERSE_WORD_DICTIONARY_PATH,
    WORD_DICTIONARY_PATH,
    LABEL_DICTIONARY_PATH,
    OUTPUT_DICTIONARY_PATH
]
# Load data from the corpus into a pandas DataFrame
def load_data():
    with open(CORPUS_PATH, 'r') as f:
        text_data = [text.strip() for text in f.readlines()]
    text_data = [text_data[k].split('\t') for k in range(0, len(text_data))]
    index = range(0, len(text_data), 3)
    # Transforming data to matrix format for neural network
    input_data = list()
    for i in range(1, len(index) - 1):
        rows = text_data[index[i - 1]:index[i]]
        sentence_no = np.array([i] * len(rows[0]), dtype=str)
        rows.append(sentence_no)
        rows = np.array(rows).T
        input_data.append(rows)
    input_data = pd.DataFrame(np.concatenate([item for item in input_data]),
                              columns=['word', 'pos', 'tag', 'sent_no'])
    return input_data
This code first sets the corpus path CORPUS_PATH, the Keras model save path KERAS_MODEL_SAVE_PATH, and the save paths (as pickle files) of the four dictionaries used throughout the project: WORD_DICTIONARY_PATH, INVERSE_WORD_DICTIONARY_PATH, LABEL_DICTIONARY_PATH and OUTPUT_DICTIONARY_PATH. Then comes the load_data() function, which loads the corpus into a pandas DataFrame. The first 30 rows of that DataFrame look like this:
word pos tag sent_no
0 played VBD O 1
1 on IN O 1
2 Monday NNP O 1
3 ( ( O 1
4 home NN O 1
5 team NN O 1
6 in IN O 1
7 CAPS NNP O 1
8 ) ) O 1
9 : : O 1
10 American NNP B-MISC 2
11 League NNP I-MISC 2
12 Cleveland NNP B-ORG 3
13 2 CD O 3
14 DETROIT NNP B-ORG 3
15 1 CD O 3
16 BALTIMORE VB B-ORG 4
17 12 CD O 4
18 Oakland NNP B-ORG 4
19 11 CD O 4
20 ( ( O 4
21 10 CD O 4
22 innings NN O 4
23 ) ) O 4
24 TORONTO TO B-ORG 5
25 5 CD O 5
26 Minnesota NNP B-ORG 5
27 3 CD O 5
28 Milwaukee NNP B-ORG 6
29 3 CD O 6
In this DataFrame, the word column holds the words from the corpus, the pos column holds each word's part of speech, the tag column holds the NER annotation, and the sent_no column indicates which sentence the word belongs to.
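As a quick usage check, the table above can be reproduced with something like the following sketch (it assumes utils.py is importable and that train.txt exists at CORPUS_PATH):
# Usage sketch: reproduce the first 30 rows shown above.
from utils import load_data

input_data = load_data()
print(input_data.head(30))
print(input_data.shape)   # (total number of tokens, 4)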
Data exploration
The second step is data exploration, i.e. reviewing the input data (input_data). The complete code (data_processing.py) is as follows:
# -*- coding: utf-8 -*-
import pickle
import numpy as np
from collections import Counter
from itertools import accumulate
from operator import itemgetter
import matplotlib.pyplot as plt
import matplotlib as mpl
from utils import BASE_DIR, CONSTANTS, load_data
# Font settings for matplotlib plots
mpl.rcParams['font.sans-serif'] = ['SimHei']
# Data review
def data_review():
    # Load the data
    input_data = load_data()

    # Basic review of the data
    sent_num = input_data['sent_no'].astype(int).max()
    print("There are %s sentences in total.\n" % sent_num)

    vocabulary = input_data['word'].unique()
    print("There are %d words in total." % len(vocabulary))
    print("The first 10 words: %s.\n" % vocabulary[:11])

    pos_arr = input_data['pos'].unique()
    print("POS tag list: %s.\n" % pos_arr)

    ner_tag_arr = input_data['tag'].unique()
    print("NER tag list: %s.\n" % ner_tag_arr)

    df = input_data[['word', 'sent_no']].groupby('sent_no').count()
    sent_len_list = df['word'].tolist()
    print("Sentence length / frequency dictionary:\n%s." % dict(Counter(sent_len_list)))

    # Plot sentence length against frequency
    sort_sent_len_dist = sorted(dict(Counter(sent_len_list)).items(), key=itemgetter(0))
    sent_no_data = [item[0] for item in sort_sent_len_dist]
    sent_count_data = [item[1] for item in sort_sent_len_dist]
    plt.bar(sent_no_data, sent_count_data)
    plt.title("Sentence length distribution")
    plt.xlabel("Sentence length")
    plt.ylabel("Frequency")
    plt.savefig("%s/sentence_length_distribution.png" % BASE_DIR)
    plt.close()

    # Plot the cumulative distribution function (CDF) of sentence length
    sent_percentage_list = [(count / sent_num) for count in accumulate(sent_count_data)]

    # Find the sentence length at the given quantile
    quantile = 0.9992
    # print(list(sent_percentage_list))
    for length, per in zip(sent_no_data, sent_percentage_list):
        if round(per, 4) == quantile:
            index = length
            break
    print("\nSentence length at quantile %s: %d." % (quantile, index))

    # Plot the CDF
    plt.plot(sent_no_data, sent_percentage_list)
    plt.hlines(quantile, 0, index, colors="c", linestyles="dashed")
    plt.vlines(index, 0, quantile, colors="c", linestyles="dashed")
    plt.text(0, quantile, str(quantile))
    plt.text(index, 0, str(index))
    plt.title("Cumulative distribution of sentence length")
    plt.xlabel("Sentence length")
    plt.ylabel("Cumulative frequency")
    plt.savefig("%s/sentence_length_cdf.png" % BASE_DIR)
    plt.close()
# Data processing
def data_processing():
    # Load the data
    input_data = load_data()

    # Label list and vocabulary
    labels, vocabulary = list(input_data['tag'].unique()), list(input_data['word'].unique())

    # Build the dictionaries
    word_dictionary = {word: i + 1 for i, word in enumerate(vocabulary)}
    inverse_word_dictionary = {i + 1: word for i, word in enumerate(vocabulary)}
    label_dictionary = {label: i + 1 for i, label in enumerate(labels)}
    output_dictionary = {i + 1: label for i, label in enumerate(labels)}
    dict_list = [word_dictionary, inverse_word_dictionary, label_dictionary, output_dictionary]

    # Save the dictionaries as pickle files
    for dict_item, path in zip(dict_list, CONSTANTS[1:]):
        with open(path, 'wb') as f:
            pickle.dump(dict_item, f)

# data_review()
Calling data_review() produces the following output:
There are 13998 sentences in total.
There are 24339 words in total.
The first 10 words: ['played' 'on' 'Monday' '(' 'home' 'team' 'in' 'CAPS' ')' ':' 'American'].
POS tag list: ['VBD' 'IN' 'NNP' '(' 'NN' ')' ':' 'CD' 'VB' 'TO' 'NNS' ',' 'VBP' 'VBZ'
'.' 'VBG' 'PRP$' 'JJ' 'CC' 'JJS' 'RB' 'DT' 'VBN' '"' 'PRP' 'WDT' 'WRB'
'MD' 'WP' 'POS' 'JJR' 'WP$' 'RP' 'NNPS' 'RBS' 'FW' '$' 'RBR' 'EX' "''"
'PDT' 'UH' 'SYM' 'LS' 'NN|SYM'].
NER tag list: ['O' 'B-MISC' 'I-MISC' 'B-ORG' 'I-ORG' 'B-PER' 'B-LOC' 'I-PER' 'I-LOC'
'sO'].
Sentence length / frequency dictionary:
{1: 177, 2: 1141, 3: 620, 4: 794, 5: 769, 6: 639, 7: 999, 8: 977, 9: 841, 10: 501, 11: 395, 12: 316, 13: 339, 14: 291, 15: 275, 16: 225, 17: 229, 18: 212, 19: 197, 20: 221, 21: 228, 22: 221, 23: 230, 24: 210, 25: 207, 26: 224, 27: 188, 28: 199, 29: 214, 30: 183, 31: 202, 32: 167, 33: 167, 34: 141, 35: 130, 36: 119, 37: 105, 38: 112, 39: 98, 40: 78, 41: 74, 42: 63, 43: 51, 44: 42, 45: 39, 46: 19, 47: 22, 48: 19, 49: 15, 50: 16, 51: 8, 52: 9, 53: 5, 54: 4, 55: 9, 56: 2, 57: 2, 58: 2, 59: 2, 60: 3, 62: 2, 66: 1, 67: 1, 69: 1, 71: 1, 72: 1, 78: 1, 80: 1, 113: 1, 124: 1}.
Sentence length at quantile 0.9992: 60.
This corpus contains 13998 sentences, two fewer than the expected 42000 / 3 = 14000 (load_data() drops the last two three-line groups). There are 24339 distinct words, which is quite a large vocabulary; note that the words receive no preprocessing here and are kept exactly as they appear in the corpus (something that could be optimized later). For the POS tags, see the article NLP入门(三)词形还原(Lemmatization). What matters most here is the NER tag list ['O', 'B-MISC', 'I-MISC', 'B-ORG', 'I-ORG', 'B-PER', 'B-LOC', 'I-PER', 'I-LOC', 'sO']. The project therefore distinguishes four entity types: PER (person), LOC (location), ORG (organization) and MISC, where B marks the beginning of an entity, I marks a token inside an entity, O marks a token that is outside any entity and is not counted for NER, and sO marks a special single token.
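To make the B/I/O convention concrete, here is an illustrative sketch (the helper group_entities is my own, not part of the project code) that merges a tag sequence into entity spans:
# Illustrative only: merge a BIO tag sequence into (entity_type, text) spans.
def group_entities(words, tags):
    entities, current_words, current_type = [], [], None
    for word, tag in zip(words, tags):
        if tag.startswith('B-'):                        # a new entity starts here
            if current_words:
                entities.append((current_type, ' '.join(current_words)))
            current_words, current_type = [word], tag[2:]
        elif tag.startswith('I-') and current_words:    # continuation of the current entity
            current_words.append(word)
        else:                                           # 'O' (or anything else) closes the entity
            if current_words:
                entities.append((current_type, ' '.join(current_words)))
            current_words, current_type = [], None
    if current_words:
        entities.append((current_type, ' '.join(current_words)))
    return entities

print(group_entities(['American', 'League'], ['B-MISC', 'I-MISC']))
# [('MISC', 'American League')]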
Next, let's look at sentence length, which will guide the padding length used later when building the model. The bar chart of sentence lengths and their frequencies is shown below:
As the chart (and the sentence length / frequency dictionary above) shows, sentence lengths are almost all below 60. Can we pick a principled padding length for the model from this? Yes: use a quantile of the cumulative distribution of sentence lengths. Here we choose the 0.9992 quantile, which corresponds to a sentence length of 60, as shown in the figure below:
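The same cutoff can also be computed directly from the sentence-length list as a cross-check; this is a small sketch that reuses sent_len_list from data_review() above:
# Sketch: derive the padding length from a high quantile of the sentence lengths.
import numpy as np

quantile = 0.9992
# sent_len_list holds the token count of each sentence, as computed in data_review().
max_len = int(np.percentile(sent_len_list, quantile * 100))
print(max_len)   # about 60 on this corpus, matching the CDF reading above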
Next comes the data-processing function data_processing(), whose main job is to build the word and label dictionaries and save them as pickle files so that they can be loaded directly later on.
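Once data_processing() has run, the pickled dictionaries can be loaded back at any point; a small usage sketch, using the same CONSTANTS indices as the training script below:
# Usage sketch: reload two of the saved dictionaries.
import pickle
from utils import CONSTANTS

with open(CONSTANTS[1], 'rb') as f:
    word_dictionary = pickle.load(f)    # word -> index
with open(CONSTANTS[3], 'rb') as f:
    label_dictionary = pickle.load(f)   # NER tag -> index

print(list(word_dictionary.items())[:5])   # e.g. [('played', 1), ('on', 2), ...]
print(label_dictionary)                    # e.g. {'O': 1, 'B-MISC': 2, ...}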
Modeling
In the third step we build a Bi-LSTM model and train it. The complete Python code (Bi_LSTM_Model_training.py) is as follows:
# -*- coding: utf-8 -*-
import pickle
import numpy as np
import pandas as pd
from utils import BASE_DIR, CONSTANTS, load_data
from data_processing import data_processing
from keras.utils import np_utils, plot_model
from keras.models import Sequential
from keras.preprocessing.sequence import pad_sequences
from keras.layers import Bidirectional, LSTM, Dense, Embedding, TimeDistributed
# Prepare the input data for the model
def input_data_for_model(input_shape):
    # Load the data
    input_data = load_data()
    # Build and save the dictionaries
    data_processing()

    # Load the dictionaries
    with open(CONSTANTS[1], 'rb') as f:
        word_dictionary = pickle.load(f)
    with open(CONSTANTS[2], 'rb') as f:
        inverse_word_dictionary = pickle.load(f)
    with open(CONSTANTS[3], 'rb') as f:
        label_dictionary = pickle.load(f)
    with open(CONSTANTS[4], 'rb') as f:
        output_dictionary = pickle.load(f)

    vocab_size = len(word_dictionary.keys())
    label_size = len(label_dictionary.keys())

    # Transform the input data: one list of (word, pos, tag) triples per sentence
    aggregate_function = lambda input: [(word, pos, label) for word, pos, label in
                                        zip(input['word'].values.tolist(),
                                            input['pos'].values.tolist(),
                                            input['tag'].values.tolist())]
    grouped_input_data = input_data.groupby('sent_no').apply(aggregate_function)
    sentences = [sentence for sentence in grouped_input_data]

    x = [[word_dictionary[word[0]] for word in sent] for sent in sentences]
    x = pad_sequences(maxlen=input_shape, sequences=x, padding='post', value=0)
    y = [[label_dictionary[word[2]] for word in sent] for sent in sentences]
    y = pad_sequences(maxlen=input_shape, sequences=y, padding='post', value=0)
    y = [np_utils.to_categorical(label, num_classes=label_size + 1) for label in y]

    return x, y, output_dictionary, vocab_size, label_size, inverse_word_dictionary
# Define the deep learning model: Bi-LSTM
def create_Bi_LSTM(vocab_size, label_size, input_shape, output_dim, n_units, out_act, activation):
    model = Sequential()
    model.add(Embedding(input_dim=vocab_size + 1, output_dim=output_dim,
                        input_length=input_shape, mask_zero=True))
    model.add(Bidirectional(LSTM(units=n_units, activation=activation,
                                 return_sequences=True)))
    model.add(TimeDistributed(Dense(label_size + 1, activation=out_act)))
    model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
    return model
# Model training
def model_train():
    # Split the dataset into a training set and a test set with a 9:1 ratio
    input_shape = 60
    x, y, output_dictionary, vocab_size, label_size, inverse_word_dictionary = input_data_for_model(input_shape)
    train_end = int(len(x) * 0.9)
    train_x, train_y = x[0:train_end], np.array(y[0:train_end])
    test_x, test_y = x[train_end:], np.array(y[train_end:])

    # Model hyperparameters
    activation = 'selu'
    out_act = 'softmax'
    n_units = 100
    batch_size = 32
    epochs = 10
    output_dim = 20

    # Train the model
    lstm_model = create_Bi_LSTM(vocab_size, label_size, input_shape, output_dim, n_units, out_act, activation)
    lstm_model.fit(train_x, train_y, epochs=epochs, batch_size=batch_size, verbose=1)

    # Save the model
    model_save_path = CONSTANTS[0]
    lstm_model.save(model_save_path)
    plot_model(lstm_model, to_file='%s/LSTM_model.png' % BASE_DIR)

    # Evaluate on the test set
    N = test_x.shape[0]  # number of test samples
    avg_accuracy = 0     # running sum of per-sample accuracy
    for start, end in zip(range(0, N, 1), range(1, N + 1, 1)):
        sentence = [inverse_word_dictionary[i] for i in test_x[start] if i != 0]
        y_predict = lstm_model.predict(test_x[start:end])
        input_sequences, output_sequences = [], []
        for i in range(0, len(y_predict[0])):
            output_sequences.append(np.argmax(y_predict[0][i]))
            input_sequences.append(np.argmax(test_y[start][i]))
        eval = lstm_model.evaluate(test_x[start:end], test_y[start:end])
        print('Test Accuracy: loss = %0.6f accuracy = %0.2f%%' % (eval[0], eval[1] * 100))
        avg_accuracy += eval[1]
        output_sequences = ' '.join([output_dictionary[key] for key in output_sequences if key != 0]).split()
        input_sequences = ' '.join([output_dictionary[key] for key in input_sequences if key != 0]).split()
        output_input_comparison = pd.DataFrame([sentence, output_sequences, input_sequences]).T
        print(output_input_comparison.dropna())
        print('#' * 80)

    avg_accuracy /= N
    print("Average prediction accuracy on the test samples: %.2f%%." % (avg_accuracy * 100))

model_train()
In the code above, input_data_for_model() first prepares the data fed into the model; its parameter input_shape is the length to which sentences are padded. Then create_Bi_LSTM() builds the Bi-LSTM model, whose architecture diagram is shown below:
Finally, the model is trained on the prepared data: the original data is split 9:1 into a training set and a test set, and training runs for 10 epochs.
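To see what the padded inputs and one-hot targets look like at this point, here is a toy sketch of the same pad_sequences / to_categorical transformation used in input_data_for_model(), with made-up indices and the padding length shortened to 5 for readability:
# Toy illustration of the x / y preprocessing (made-up index values).
from keras.preprocessing.sequence import pad_sequences
from keras.utils import np_utils

x = [[2, 7, 5]]                        # one sentence as word indices
x = pad_sequences(maxlen=5, sequences=x, padding='post', value=0)
print(x)                               # [[2 7 5 0 0]]

y = [[1, 1, 3]]                        # the corresponding label indices
y = pad_sequences(maxlen=5, sequences=y, padding='post', value=0)
y = np_utils.to_categorical(y[0], num_classes=11)   # label_size (here 10) + 1
print(y.shape)                         # (5, 11): one one-hot row per time step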
Model training
Running the training code above for 10 epochs takes roughly 500 s. The accuracy on the training set exceeds 99%, and the average accuracy on the test set exceeds 95%. Below are the predictions for the last few test samples:
...... (earlier output omitted)
Test Accuracy: loss = 0.000986 accuracy = 100.00%
0 1 2
0 Cardiff B-ORG B-ORG
1 1 O O
2 Brighton B-ORG B-ORG
3 0 O O
################################################################################
1/1 [==============================] - 0s 10ms/step
Test Accuracy: loss = 0.000274 accuracy = 100.00%
0 1 2
0 Carlisle B-ORG B-ORG
1 0 O O
2 Hull B-ORG B-ORG
3 0 O O
################################################################################
1/1 [==============================] - 0s 9ms/step
Test Accuracy: loss = 0.000479 accuracy = 100.00%
0 1 2
0 Chester B-ORG B-ORG
1 1 O O
2 Cambridge B-ORG B-ORG
3 1 O O
################################################################################
1/1 [==============================] - 0s 9ms/step
Test Accuracy: loss = 0.003092 accuracy = 100.00%
0 1 2
0 Darlington B-ORG B-ORG
1 4 O O
2 Swansea B-ORG B-ORG
3 1 O O
################################################################################
1/1 [==============================] - 0s 8ms/step
Test Accuracy: loss = 0.000705 accuracy = 100.00%
0 1 2
0 Exeter B-ORG B-ORG
1 2 O O
2 Scarborough B-ORG B-ORG
3 2 O O
################################################################################
Average prediction accuracy on the test samples: 95.55%.
The model's recognition performance on the original data is quite decent.
After training, the files in BASE_DIR are as follows:
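Judging from the paths used in the code, BASE_DIR should now contain roughly:
- train.txt (the corpus)
- Bi-LSTM-4-NER.h5 (the trained Keras model)
- word_dictionary.pk, inverse_word_dictionary.pk, label_dictionary.pk, output_dictionary.pk (the pickled dictionaries)
- LSTM_model.png (the model diagram saved by plot_model)
- the two sentence-length plots saved by data_review()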
Model prediction
Finally comes what is perhaps the most exciting part of the whole project: testing the model on new data. The complete Python code for predicting on new sentences (Bi_LSTM_Model_predict.py) is as follows:
# -*- coding: utf-8 -*-
# Named entity recognition for new data
# Import the necessary modules
import pickle
import numpy as np
from utils import CONSTANTS
from keras.preprocessing.sequence import pad_sequences
from keras.models import load_model
from nltk import word_tokenize
# Load the dictionaries
with open(CONSTANTS[1], 'rb') as f:
    word_dictionary = pickle.load(f)
with open(CONSTANTS[4], 'rb') as f:
    output_dictionary = pickle.load(f)

try:
    # Preprocess the input sentence
    input_shape = 60
    sent = 'New York is the biggest city in America.'
    new_sent = word_tokenize(sent)
    new_x = [[word_dictionary[word] for word in new_sent]]
    x = pad_sequences(maxlen=input_shape, sequences=new_x, padding='post', value=0)

    # Load the trained model
    model_save_path = CONSTANTS[0]
    lstm_model = load_model(model_save_path)

    # Predict with the model
    y_predict = lstm_model.predict(x)
    ner_tag = []
    for i in range(0, len(new_sent)):
        ner_tag.append(np.argmax(y_predict[0][i]))
    ner = [output_dictionary[i] for i in ner_tag]
    print(new_sent)
    print(ner)

    # Drop tokens whose NER tag is 'O'
    ner_reg_list = []
    for word, tag in zip(new_sent, ner):
        if tag != 'O':
            ner_reg_list.append((word, tag))

    # Print the model's NER results
    print("NER results:")
    if ner_reg_list:
        for i, item in enumerate(ner_reg_list):
            if item[1].startswith('B'):
                end = i + 1
                while end <= len(ner_reg_list) - 1 and ner_reg_list[end][1].startswith('I'):
                    end += 1
                ner_type = item[1].split('-')[1]
                ner_type_dict = {'PER': 'PERSON: ',
                                 'LOC': 'LOCATION: ',
                                 'ORG': 'ORGANIZATION: ',
                                 'MISC': 'MISC: '
                                 }
                print(ner_type_dict[ner_type],
                      ' '.join([item[0] for item in ner_reg_list[i:end]]))
    else:
        print("The model did not recognize any named entities.")
except KeyError as err:
    print("Your sentence contains a word that is not in the vocabulary; please try another sentence!")
    print("Word not in the vocabulary: %s." % err)
The output is:
['New', 'York', 'is', 'the', 'biggest', 'city', 'in', 'America', '.']
['B-LOC', 'I-LOC', 'O', 'O', 'O', 'O', 'O', 'B-LOC', 'O']
NER results:
LOCATION: New York
LOCATION: America
Next, let's test three more sentences that I came up with myself:
Input:
sent = 'James is a world famous actor, whose home is in London.'
Output:
['James', 'is', 'a', 'world', 'famous', 'actor', ',', 'whose', 'home', 'is', 'in', 'London', '.']
['B-PER', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-LOC', 'O']
NER results:
PERSON: James
LOCATION: London
Input:
sent = 'Oxford is in England, Jack is from here.'
Output:
['Oxford', 'is', 'in', 'England', ',', 'Jack', 'is', 'from', 'here', '.']
['B-PER', 'O', 'O', 'B-LOC', 'O', 'B-PER', 'O', 'O', 'O', 'O']
NER results:
PERSON: Oxford
LOCATION: England
PERSON: Jack
Input:
sent = 'I love Shanghai.'
Output:
['I', 'love', 'Shanghai', '.']
['O', 'O', 'B-LOC', 'O']
NER results:
LOCATION: Shanghai
In the examples above, only Oxford is handled poorly: the model tags it as PERSON, when it should really be ORGANIZATION.
Next are three sentences taken from CNN and Wikipedia:
Input:
sent = "the US runs the risk of a military defeat by China or Russia"
Output:
['the', 'US', 'runs', 'the', 'risk', 'of', 'a', 'military', 'defeat', 'by', 'China', 'or', 'Russia']
['O', 'B-LOC', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-LOC', 'O', 'B-LOC']
NER results:
LOCATION: US
LOCATION: China
LOCATION: Russia
Input:
sent = "Home to the headquarters of the United Nations, New York is an important center for international diplomacy."
Output:
['Home', 'to', 'the', 'headquarters', 'of', 'the', 'United', 'Nations', ',', 'New', 'York', 'is', 'an', 'important', 'center', 'for', 'international', 'diplomacy', '.']
['O', 'O', 'O', 'O', 'O', 'O', 'B-ORG', 'I-ORG', 'O', 'B-LOC', 'I-LOC', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
NER results:
ORGANIZATION: United Nations
LOCATION: New York
Input:
sent = "The United States is a founding member of the United Nations, World Bank, International Monetary Fund."
Output:
['The', 'United', 'States', 'is', 'a', 'founding', 'member', 'of', 'the', 'United', 'Nations', ',', 'World', 'Bank', ',', 'International', 'Monetary', 'Fund', '.']
['O', 'B-LOC', 'I-LOC', 'O', 'O', 'O', 'O', 'O', 'O', 'B-ORG', 'I-ORG', 'O', 'B-ORG', 'I-ORG', 'O', 'B-ORG', 'I-ORG', 'I-ORG', 'O']
NER results:
LOCATION: United States
ORGANIZATION: United Nations
ORGANIZATION: World Bank
ORGANIZATION: International Monetary Fund
All three of these examples are recognized correctly.
Summary
That is more or less the whole project, so it is worth summing it up.
First, the strengths. This project lets you implement NER step by step; apart from preparing the corpus, you now know the steps involved in building an NER system, and you have a more concrete understanding of deep learning models and how to apply them. The benefits are obvious. Of course, in real work, preparing and cleaning the corpus is what takes the most time, easily 90% of it or more, so you can only make progress once you have a good corpus.
Now the weaknesses. First, the corpus is not very large; about 14,000 sentences is workable, but the project does no text preprocessing on the sentences, so some inflected word forms may never make it into the vocabulary. Second, there is no handling of unknown words: as soon as a sentence contains a word the model has never seen, it cannot be processed, which is something to improve later (see the sketch below). Third, sentences are padded to a length of 60, so if an input sentence is longer than 60 tokens, the part beyond that cannot be recognized effectively.
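On the second point, one low-cost direction is to reserve an extra index for unknown words when building the dictionaries and fall back to it at prediction time. This is a hedged sketch, not part of the current project, and it assumes word_dictionary and new_sent as defined in Bi_LSTM_Model_predict.py:
# Sketch of one possible fix for unknown words (not implemented in this project).
UNK_INDEX = len(word_dictionary) + 1   # reserve one extra index for out-of-vocabulary words

new_x = [[word_dictionary.get(word, UNK_INDEX) for word in new_sent]]
# Note: the Embedding layer would then need input_dim=vocab_size + 2
# (0 for padding, 1..vocab_size for known words, vocab_size + 1 for UNK),
# and the model would have to be retrained with this convention.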
So there is still more work to do; building a Chinese NER system is also worth considering.
The project has been uploaded to GitHub at https://github.com/percent4/DL_4_NER . You are welcome to use it as a reference~
Note: I now run a WeChat official account, Python爬虫与算法 (WeChat ID: easy_web_scrape); you are welcome to follow it~~
References
- BOOK: Applied Natural Language Processing with Python, Taweh Beysolow II
- WEBSITE: https://github.com/Apress/applied-natural-language-processing-w-python
- WEBSITE: NLP入门(四)命名实体识别(NER): http://www.reibang.com/p/16e1f6a7aaef