Build a simple model that understands the semantics of certain words in a sentence (NER)
Load some packages
import glob
import pandas as pd
import tensorflow as tf
from keras import Sequential
from keras.utils import pad_sequences, to_categorical
from keras.preprocessing.text import Tokenizer
from keras.layers import Embedding, Bidirectional, LSTM, TimeDistributed, Dense
Load the tags and sentences
The ner folder contains a pile of raw data files; in each line, the sentence is followed by a tag for every word.
Example:
An Iraqi court has sentenced 11 men to death for the massive truck bombings in Baghdad last August that killed more than 100 people .,O B-gpe O O O O O O O O O O O O O B-geo O B-tim O O O O O O O,DT JJ NN VBZ VBN CD NNS TO NN IN DT JJ NN NNS IN NNP JJ NNP WDT VBD JJR IN CD NNS .
files = glob.glob('./ner/*.tags')
data_pd = pd.concat([pd.read_csv(f, header=None, names=['text', 'label', 'pos']) for f in files], ignore_index=True)
print(data_pd.info())
Serialize the text
First, tokenize both the text and the tags.
# This class allows to vectorize a text corpus, by turning each text into either a sequence of integers (each integer
# being the index of a token in a dictionary) or into a vector where the coefficient for each token could be binary,
# based on word count, based on tf-idf...
text_tok = Tokenizer(filters='[\\]^\t\n', lower=False, split=' ', oov_token='<OOV>')
pos_tok = Tokenizer(filters='\t\n', lower=False, split=' ', oov_token='<OOV>')
ner_tok = Tokenizer(filters='\t\n', lower=False, split=' ', oov_token='<OOV>')
text_tok.fit_on_texts(data_pd['text'])
pos_tok.fit_on_texts(data_pd['pos'])
ner_tok.fit_on_texts(data_pd['label'])
ner_config = ner_tok.get_config()
text_config = text_tok.get_config()
print(ner_config)
# print(text_config)
This prints the tag tokenizer's info; the more frequently a word appears, the smaller its index.
{'num_words': None, 'filters': '\t\n', 'lower': False, 'split': ' ', 'char_level': False, 'oov_token': '<OOV>', 'document_count': 62010, 'word_counts': '{"O": 1146068, "B-gpe": 20436, "B-geo": 48876, "B-tim": 26296, "I-tim": 8493, "B-org": 26195, "I-org": 21899, "B-per": 21984, "I-per": 22270, "I-geo": 9512, "B-art": 503, "B-nat": 238, "B-eve": 391, "I-eve": 318, "I-art": 364, "I-gpe": 244, "I-nat": 62}', 'word_docs': '{"B-geo": 31660, "B-gpe": 16565, "B-tim": 22345, "O": 61999, "B-org": 20478, "I-org": 11011, "I-tim": 5526, "B-per": 17499, "I-per": 13805, "I-geo": 7738, "B-art": 425, "B-nat": 211, "B-eve": 361, "I-eve": 201, "I-art": 207, "I-gpe": 224, "I-nat": 50}', 'index_docs': '{"3": 31660, "9": 16565, "4": 22345, "2": 61999, "5": 20478, "8": 11011, "11": 5526, "7": 17499, "6": 13805, "10": 7738, "12": 425, "17": 211, "13": 361, "15": 201, "14": 207, "16": 224, "18": 50}', 'index_word': '{"1": "<OOV>", "2": "O", "3": "B-geo", "4": "B-tim", "5": "B-org", "6": "I-per", "7": "B-per", "8": "I-org", "9": "B-gpe", "10": "I-geo", "11": "I-tim", "12": "B-art", "13": "B-eve", "14": "I-art", "15": "I-eve", "16": "I-gpe", "17": "B-nat", "18": "I-nat"}', 'word_index': '{"<OOV>": 1, "O": 2, "B-geo": 3, "B-tim": 4, "B-org": 5, "I-per": 6, "B-per": 7, "I-org": 8, "B-gpe": 9, "I-geo": 10, "I-tim": 11, "B-art": 12, "B-eve": 13, "I-art": 14, "I-eve": 15, "I-gpe": 16, "B-nat": 17, "I-nat": 18}'}
- Meaning of the tags
- geo = Geographical entity
- org = Organization
- per = Person
- gpe = Geopolitical entity
- tim = Time indicator
- art = Artifact
- eve = Event
- nat = Natural phenomenon
- The B-/I- prefix marks the beginning of an entity versus its continuation. For example, August 19 is tagged B-tim I-tim: both words belong to the same time entity, so we need markers for where it begins and where it continues; here the beginning is August => B-tim.
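A minimal illustration of the scheme (toy tokens, not taken from the dataset):
tokens = ['last', 'August', '19', 'attacks']
tags   = ['O',    'B-tim',  'I-tim', 'O']  # "August 19" is one time entity: B-tim opens it, I-tim continues it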
Once every word has been tokenized, the whole sentence is represented with token indices, because all of the computation is done on numbers.
# eval converts the string back to a dictionary
text_vocab = eval(text_config['index_word'])
print("Unique words in vocab:", len(text_vocab))
ner_vocab = eval(ner_config['index_word'])
print("Unique NER tags in vocab:", len(ner_vocab))
# Transforms each text in texts to a sequence of integers.
x_tok = text_tok.texts_to_sequences(data_pd['text'])
y_tok = ner_tok.texts_to_sequences(data_pd['label'])
Printing two examples here, you can see that both the NER tags and the sentence have been converted into integer sequences (they look like arrays).
# O B-gpe O O O O O O O O O O O O O B-geo O B-tim O O O O O O O
# [2, 9, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 2, 4, 2, 2, 2, 2, 2, 2, 2]
print(data_pd['label'][0], y_tok[0])
# An Iraqi court has sentenced 11 men to death for the massive truck bombings in Baghdad last August that killed more than 100 people .
# [316, 89, 233, 13, 1112, 494, 240, 7, 248, 12, 2, 913, 1485, 528, 5, 146, 61, 570, 16, 38, 50, 55, 671, 39, 3]
print(data_pd['text'][0], x_tok[0])
Pad all input and output data to the same length.
TensorFlow needs every input and output in a batch to have the same shape, so sequences longer than max_len are truncated (from the front by default; pass truncating='post' to cut the tail instead), and shorter ones get padding tokens appended to the end.
max_len = 50
# padding, String, "pre" or "post" (optional, defaults to "pre"): pad either before or after each sequence.
x_pad = pad_sequences(x_tok, padding='post', maxlen=max_len)
y_pad = pad_sequences(y_tok, padding='post', maxlen=max_len)
print(x_pad.shape, y_pad.shape)
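A quick toy check of that behavior (illustrative values only): padding='post' appends zeros, while the default truncating='pre' drops tokens from the front.
print(pad_sequences([[1, 2], [1, 2, 3, 4]], padding='post', maxlen=3))
# [[1 2 0]
#  [2 3 4]]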
Finally, one-hot encode the outputs.
The output is a class label: if we used the numbers 1, 2, 3 to describe class 1, class 2, and class 3, we would get the spurious relationship class 3 = class 1 + class 2.
# Since there are multiple labels, each label token needs to be one-hot encoded like so:
num_classes = len(ner_vocab) + 1
Y = to_categorical(y_pad, num_classes)
# (62010, 50, 19)
print(Y.shape)
Printing one example, you can see that B-gpe has been converted into a one-hot vector of length 19, so each sentence becomes a matrix of shape [50, 19].
# B-gpe => 9 => [0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
print('finally convert ', ner_vocab.get('9'), '=>', y_pad[0][1], '=>', Y[0][1])
Build the model
vocab_size = len(text_vocab) + 1
embedding_dim = 64
rnn_units = 100
BATCH_SIZE = 90
num_classes = len(ner_vocab) + 1
dropout = 0.2
# None means it is a dynamic shape. It can take any value depending on the batch size you choose.
# num_units in TensorFlow is the number of hidden states, Positive integer, dimensionality of the output space.
# TimeDistributed: this wrapper applies a layer to every temporal slice of an input.
# kernel_initializer Initializer for the kernel weights matrix, used for the linear transformation of the inputs
def build_model_bilstm(vocab_size, embedding_dim, rnn_units, batch_size, classes):
    return Sequential([
        Embedding(vocab_size, embedding_dim, mask_zero=True, batch_input_shape=[batch_size, None]),
        Bidirectional(LSTM(units=rnn_units, return_sequences=True, dropout=dropout, kernel_initializer=tf.keras.initializers.he_normal())),
        TimeDistributed(Dense(rnn_units, activation='relu')),
        Dense(classes, activation='softmax')
    ])
model = build_model_bilstm(vocab_size=vocab_size, embedding_dim=embedding_dim, rnn_units=rnn_units, batch_size=BATCH_SIZE, classes=num_classes)
print(model.summary())
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
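As a quick sanity check (a sketch reusing x_pad from above), one fixed-size batch of padded token ids should come out as per-token class probabilities:
demo_batch = x_pad[:BATCH_SIZE]  # shape (90, 50)
print(model(demo_batch).shape)   # (90, 50, 19): one distribution over the 19 classes per token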
Train and validate the model
X = x_pad
total_sentences = ner_config.get('document_count')
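# batch_input_shape fixed the batch size, so round the ~20% test split to an exact multiple of BATCH_SIZE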
test_size = round(total_sentences / BATCH_SIZE * 0.2)
test_size = BATCH_SIZE * test_size
X_train = X[test_size:]
Y_train = Y[test_size:]
X_test = X[0:test_size]
Y_test = Y[0:test_size]
model.fit(X_train, Y_train, batch_size=BATCH_SIZE, epochs=15)
model.evaluate(X_test, Y_test, batch_size=BATCH_SIZE)
The accuracy comes out at 96.34%.
138/138 [==============================] - 2s 4ms/step - loss: 0.0872 - accuracy: 0.9634
Predict sentences
y_pred = model.predict(X_test, batch_size=BATCH_SIZE)
# convert prediction one-hot encoding back to number
y_pred = tf.argmax(y_pred, -1)
y_pnp = y_pred.numpy()
# convert ground-truth one-hot encoding back to numbers
y_ground_true = tf.argmax(Y_test, -1)
y_ground_true_pnp = y_ground_true.numpy()
for i in range(10):
    x = 'sentence=> ' + text_tok.sequences_to_texts([X_test[i]])[0]
    ground_true = 'ground_true=> ' + ner_tok.sequences_to_texts([y_ground_true_pnp[i]])[0]
    prediction = 'prediction=> ' + ner_tok.sequences_to_texts([y_pnp[i]])[0]
    # build a fixed-width column template so the three rows line up
    template = '|'.join(['{' + str(index) + ': <15}' for index, _ in enumerate(x.split(' '))])
    print(template.format(*x.split(' ')))
    print(template.format(*ground_true.split(' ')))
    print(template.format(*prediction.split(' ')))
    print('\n')
Printing two of the examples, you can see the predictions are fairly accurate.
sentence=> |An |Iraqi |court |has |sentenced |11 |men |to |death |for |the |massive |truck |bombings |in |Baghdad |last |August |that |killed |more |than |100 |people |. |<OOV> |<OOV> |<OOV> |<OOV> |<OOV> |<OOV> |<OOV> |<OOV> |<OOV> |<OOV> |<OOV> |<OOV> |<OOV> |<OOV> |<OOV> |<OOV> |<OOV> |<OOV> |<OOV> |<OOV> |<OOV> |<OOV> |<OOV> |<OOV> |<OOV>
ground_true=> |O |B-gpe |O |O |O |O |O |O |O |O |O |O |O |O |O |B-geo |O |B-tim |O |O |O |O |O |O |O |<OOV> |<OOV> |<OOV> |<OOV> |<OOV> |<OOV> |<OOV> |<OOV> |<OOV> |<OOV> |<OOV> |<OOV> |<OOV> |<OOV> |<OOV> |<OOV> |<OOV> |<OOV> |<OOV> |<OOV> |<OOV> |<OOV> |<OOV> |<OOV> |<OOV>
prediction=> |O |B-gpe |O |O |O |O |O |O |O |O |O |O |O |O |O |B-geo |O |B-tim |O |O |O |O |O |O |O |O |O |O |O |O |O |O |O |O |O |O |O |O |O |O |O |O |O |O |O |O |O |O |O |O
sentence=> |The |court |convicted |the |men |of |planning |and |implementing |the |August |19 |attacks |on |the |Iraqi |Ministries |of |Finance |and |Foreign |Affairs |. |<OOV> |<OOV> |<OOV> |<OOV> |<OOV> |<OOV> |<OOV> |<OOV> |<OOV> |<OOV> |<OOV> |<OOV> |<OOV> |<OOV> |<OOV> |<OOV> |<OOV> |<OOV> |<OOV> |<OOV> |<OOV> |<OOV> |<OOV> |<OOV> |<OOV> |<OOV> |<OOV>
ground_true=> |O |O |O |O |O |O |O |O |O |O |B-tim |I-tim |O |O |O |B-gpe |O |O |B-org |I-org |I-org |I-org |O |<OOV> |<OOV> |<OOV> |<OOV> |<OOV> |<OOV> |<OOV> |<OOV> |<OOV> |<OOV> |<OOV> |<OOV> |<OOV> |<OOV> |<OOV> |<OOV> |<OOV> |<OOV> |<OOV> |<OOV> |<OOV> |<OOV> |<OOV> |<OOV> |<OOV> |<OOV> |<OOV>
prediction=> |O |O |O |O |O |O |O |O |O |O |B-tim |I-tim |O |O |O |B-gpe |B-org |I-org |I-org |I-org |I-org |I-org |O |O |O |O |O |O |O |O |O |O |O |O |O |O |O |O |O |O |O |O |O |O |O |O |O |O |O |O
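To wrap up, here is a sketch of tagging a brand-new sentence end to end (the sentence is made up; because batch_input_shape fixed the batch size at 90, the single example is repeated to fill a whole batch):
import numpy as np
new_text = ['Protests erupted in Baghdad last Monday .']  # hypothetical input
seq = pad_sequences(text_tok.texts_to_sequences(new_text), padding='post', maxlen=max_len)
batch = np.repeat(seq, BATCH_SIZE, axis=0)  # repeat the row to fill the fixed-size batch
pred = tf.argmax(model.predict(batch, batch_size=BATCH_SIZE), -1).numpy()
print(ner_tok.sequences_to_texts([pred[0]])[0])  # the predicted tag sequence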